现 O O
货 O O
LILY O O
2014 O O
连 B B
衣 M M
装 E E
新 O O
款 O O
代 O O
购 O O
Given data like the above, I want to use the third column to join each run of tokens that starts with a B tag and ends with an E tag into a single word, so I wrote the following shell script:
while read -r line
do
    var=`echo "$line" | awk '{print $3}'`     # column 3: the B/M/E/S/O tag
    word=`echo "$line" | awk '{print $1}'`    # column 1: the token
    if [ "$var" = 'B' ]; then                 # B: start a new word
        pro=$word
    elif [ "$var" = 'M' ]; then               # M: append to the current word
        pro=$pro$word
    elif [ "$var" = 'E' ]; then               # E: append, write out, reset
        pro=$pro$word
        echo "$pro" >> ./auction/product_res
        pro=""
    elif [ "$var" = 'S' ]; then               # S: a single-token word
        echo "$word" >> ./auction/product_res
    fi
done < ./auction/test_res
But the file has more than 30 million lines, and the script has already been running for over 10 hours without finishing. Could someone suggest a way to make this faster? Many thanks!
4 solutions
#1
The file is not as big as you think; my guess is around 500MB.
I'm no bash expert, but if I were you I would read each line into an array instead of piping echo into awk:
while IFS=$'\t' read -r -a line
Then you can access each column by index without spawning two extra processes (echo and awk) per line, which saves a lot of context switching and process creation/teardown time.
If your machine has more than 500MB of RAM, and I'm sure it does, you could also try reading all 30M lines into a bash array and parsing that inside your script. That would greatly cut down on the line-by-line reading the script has to do...
IFS=$'\t'
arr=($(<file))
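A rough sketch of the rewritten loop (my own illustration, assuming whitespace- or tab-separated columns and the same ./auction paths as the original script; the default IFS already splits on spaces and tabs):

#!/bin/bash
# Let read split the columns itself, so no echo/awk process is spawned per line.
pro=""
while read -r -a cols; do
    word=${cols[0]}                  # column 1: the token
    tag=${cols[2]}                   # column 3: the B/M/E/S/O tag
    case $tag in
        B) pro=$word ;;                                  # start a new word
        M) pro=$pro$word ;;                              # keep appending
        E) pro=$pro$word                                 # finish and emit the word
           echo "$pro" >> ./auction/product_res
           pro="" ;;
        S) echo "$word" >> ./auction/product_res ;;      # single-token word
    esac
done < ./auction/test_res

The >> still reopens the output file once per word; redirecting the whole loop instead (done < ./auction/test_res > ./auction/product_res, with a plain echo "$pro") avoids that, though the awk approach in #2 below is still likely the fastest option.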
#2
Create an awk script file, e.g. getdat.awk, with the following content:
BEGIN{
    PRO=""
    START=0
}
($3=="B"){
    START=1
    PRO=$1                       # start a new word with this token
    next
}
($3=="E"){
    START=0
    printf("%s%s\n",PRO,$1)      # append the final token and emit the word
    PRO=""
    next
}
{
    if (START==1) {
        PRO=PRO $1               # middle token: keep appending
    }
}
END{
    if (START==1 && PRO!="") {
        printf("%s\n",PRO)       # flush a word left unterminated at EOF
    }
}
Then run the following shell command:
awk -f getdat.awk ./auction/test_res > ./auction/product_res
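As a quick sanity check (my own addition), getdat.awk can be run against the 7-line sample at the top of the post:

printf '现 O O\n货 O O\nLILY O O\n2014 O O\n连 B B\n衣 M M\n装 E E\n' > sample
awk -f getdat.awk sample
# expected output: 连衣装

Note that, unlike the original shell script, getdat.awk does not handle S-tagged single-token words; if those occur in your data, an extra rule such as ($3=="S"){print $1; next} would be needed.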
#3
Processing this with awk alone will be fast, much faster than reading and handling every line in a shell loop!
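For example (my own sketch, equivalent in spirit to getdat.awk from #2 but also covering the S tag handled by the original shell script), the whole job fits in a single awk invocation:

awk '$3=="B"{p=$1;next} $3=="M"{p=p $1;next} $3=="E"{print p $1;p=""} $3=="S"{print $1}' \
    ./auction/test_res > ./auction/product_res

A single pass like this spawns one process in total instead of two per line, so 30M lines should take minutes rather than hours.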
#4
30 million rows, that really is a lot of data.
You could also import it into a database; operating on it there might be faster.
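If you do try the database route, here is a rough sketch with SQLite (my own illustration; the answer above does not name a specific database). It assumes tab-separated columns and an SQLite new enough for window functions and ORDER BY inside aggregates (3.44+); for this particular job the awk solution is probably simpler and faster.

sqlite3 words.db <<'EOF'
CREATE TABLE tokens(word TEXT, tag2 TEXT, tag3 TEXT);
.separator "\t"
.import ./auction/test_res tokens
-- number each B..E run by counting the B tags seen so far (rowid keeps file order)
WITH runs AS (
  SELECT rowid AS rid, word, tag3,
         SUM(CASE WHEN tag3='B' THEN 1 ELSE 0 END) OVER (ORDER BY rowid) AS grp
  FROM tokens
  WHERE tag3 IN ('B','M','E')
)
SELECT group_concat(word, '' ORDER BY rid) FROM runs GROUP BY grp ORDER BY grp;
EOF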