现 O O
货 O O
LILY O O
2014 O O
连 B B
衣 M M
装 E E
新 O O
款 O O
代 O O
购 O O
Given data like the above, I want to use the third column to join each run of tokens that starts with a B tag and ends with an E tag into a single word, so I wrote the following shell script:
while read -r line
do
    var=`echo "$line" | awk '{print $3}'`     # column 3: the B/M/E/S/O tag
    word=`echo "$line" | awk '{print $1}'`    # column 1: the token
    if [ "$var" = 'B' ]; then                 # B: start a new word
        pro=$word
    elif [ "$var" = 'M' ]; then               # M: append to the current word
        pro=$pro$word
    elif [ "$var" = 'E' ]; then               # E: append, write out, reset
        pro=$pro$word
        echo "$pro" >> ./auction/product_res
        pro=""
    elif [ "$var" = 'S' ]; then               # S: a single-token word
        echo "$word" >> ./auction/product_res
    fi
done < ./auction/test_res
But the file has more than 30 million lines, and the script has already been running for over 10 hours without finishing. Could someone suggest a way to make this faster? Many thanks!
4 solutions
#1
The file is not as big as you think; my guess is around 500MB.
I'm no bash expert, but if I were you I would read each line into an array instead of piping echo into awk:
while IFS=$'\t' read -r -a line
Then you can access each column by index without spawning two extra processes (echo and awk) per line, which saves a lot of context switching and process creation/teardown time.
If your machine has more than 500MB of RAM, and I'm sure it does, you could also try reading all 30M lines into a bash array and parsing that inside your script. That would greatly cut down on the line-by-line reading the script has to do...
IFS=$'\t'
arr=($(<file))
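A rough sketch of the rewritten loop (my own illustration, assuming whitespace- or tab-separated columns and the same ./auction paths as the original script; the default IFS already splits on spaces and tabs):

#!/bin/bash
# Let read split the columns itself, so no echo/awk process is spawned per line.
pro=""
while read -r -a cols; do
    word=${cols[0]}                  # column 1: the token
    tag=${cols[2]}                   # column 3: the B/M/E/S/O tag
    case $tag in
        B) pro=$word ;;                                  # start a new word
        M) pro=$pro$word ;;                              # keep appending
        E) pro=$pro$word                                 # finish and emit the word
           echo "$pro" >> ./auction/product_res
           pro="" ;;
        S) echo "$word" >> ./auction/product_res ;;      # single-token word
    esac
done < ./auction/test_res

The >> still reopens the output file once per word; redirecting the whole loop instead (done < ./auction/test_res > ./auction/product_res, with a plain echo "$pro") avoids that, though the awk approach in #2 below is still likely the fastest option.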
#2
Create an awk script file, e.g. getdat.awk, with the following content:
BEGIN{
    PRO=""
    START=0
}
($3=="B"){
    START=1
    PRO=$1                       # start a new word with this token
    next
}
($3=="E"){
    START=0
    printf("%s%s\n",PRO,$1)      # append the final token and emit the word
    PRO=""
    next
}
{
    if (START==1) {
        PRO=PRO $1               # middle token: keep appending
    }
}
END{
    if (START==1 && PRO!="") {
        printf("%s\n",PRO)       # flush a word left unterminated at EOF
    }
}
Then run the following shell command:
awk -f getdat.awk ./auction/test_res > ./auction/product_res
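As a quick sanity check (my own addition), getdat.awk can be run against the 7-line sample at the top of the post:

printf '现 O O\n货 O O\nLILY O O\n2014 O O\n连 B B\n衣 M M\n装 E E\n' > sample
awk -f getdat.awk sample
# expected output: 连衣装

Note that, unlike the original shell script, getdat.awk does not handle S-tagged single-token words; if those occur in your data, an extra rule such as ($3=="S"){print $1; next} would be needed.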
#3
Processing this with awk alone will be fast, much faster than reading and handling every line in a shell loop!
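For example (my own sketch, equivalent in spirit to getdat.awk from #2 but also covering the S tag handled by the original shell script), the whole job fits in a single awk invocation:

awk '$3=="B"{p=$1;next} $3=="M"{p=p $1;next} $3=="E"{print p $1;p=""} $3=="S"{print $1}' \
    ./auction/test_res > ./auction/product_res

A single pass like this spawns one process in total instead of two per line, so 30M lines should take minutes rather than hours.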
#4
30 million rows, that really is a lot of data.
You could also import it into a database; operating on it there might be faster.
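If you do try the database route, here is a rough sketch with SQLite (my own illustration; the answer above does not name a specific database). It assumes tab-separated columns and an SQLite new enough for window functions and ORDER BY inside aggregates (3.44+); for this particular job the awk solution is probably simpler and faster.

sqlite3 words.db <<'EOF'
CREATE TABLE tokens(word TEXT, tag2 TEXT, tag3 TEXT);
.separator "\t"
.import ./auction/test_res tokens
-- number each B..E run by counting the B tags seen so far (rowid keeps file order)
WITH runs AS (
  SELECT rowid AS rid, word, tag3,
         SUM(CASE WHEN tag3='B' THEN 1 ELSE 0 END) OVER (ORDER BY rowid) AS grp
  FROM tokens
  WHERE tag3 IN ('B','M','E')
)
SELECT group_concat(word, '' ORDER BY rid) FROM runs GROUP BY grp ORDER BY grp;
EOF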