Compare two files based on a column and append the common elements to the file

Time: 2021-08-07 07:15:02

Basically, I want to combine the power of grep -f with awk or bash commands. I have two files like this:

$file1
ENSG00000000003 TSPAN6  ensembl_havana  TSPAN6
ENSG00000000419 DPM1    ensembl_havana  DPM1
ENSG00000000457 SCYL3   ensembl_havana  SCYL3
ENSG00000000460 C1orf112    ensembl_havana  C1orf112
ENSG00000000971 CFH ensembl_havana  CFH
ENSG00000001036 FUCA2   ensembl_havana  FUCA2

$file2
ENSG00000000003.12  0.0730716237772557  -0.147970450702234
ENSG00000000419.5   0.156405616866614   -0.0398488625782745
ENSG00000000457.3   -0.110396121325736  -0.0147093758392248
ENSG00000000460.15  -0.0457144601264149 0.322340330477282
ENSG00000000971.12  0.0613967504891434  -0.0198254029339757
ENSG00000001036.4   0.00879628204710496 0.0560438506950908

And here is my desired output:

ENSG00000000003.12  TSPAN6  0.0730716237772557  -0.147970450702234
ENSG00000000419.5   DPM1    0.156405616866614   -0.0398488625782745 
ENSG00000000457.3   SCYL3   -0.110396121325736  -0.0147093758392248 
ENSG00000000460.15  C1orf112    -0.0457144601264149 0.322340330477282   
ENSG00000000971.12  CFH 0.0613967504891434  -0.0198254029339757 
ENSG00000001036.4   FUCA2   0.00879628204710496 0.0560438506950908  

This output would also be useful:

ENSG00000000003 TSPAN6  0.0730716237772557  -0.147970450702234
ENSG00000000419 DPM1    0.156405616866614   -0.0398488625782745 
ENSG00000000457 SCYL3   -0.110396121325736  -0.0147093758392248 
ENSG00000000460 C1orf112    -0.0457144601264149 0.322340330477282   
ENSG00000000971 CFH 0.0613967504891434  -0.0198254029339757 
ENSG00000001036 FUCA2   0.00879628204710496 0.0560438506950908

I have tried the command from "Obtain patterns from a file, compare to a column of another file, print matching lines, using awk":

awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){print; break}}} n' file2 file 

But obviously it does not give me the desired output.

Thanks

2 solutions

#1


With awk:

awk 'NR == FNR { a[$1] = $2; next } { split($1, b, "."); print $1, a[b[1]], $2, $3 }' file1 file2

This works as follows:

NR == FNR {                  # While processing the first file
  a[$1] = $2                 # just remember the second field by the first
  next
}
{                            # while processing the second file
  split($1, b, ".")          # split first field to isolate the key
  print $1, a[b[1]], $2, $3  # print relevant fields and the remembered
                             # bit from the first file.
}
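
The same idea can be adapted for the second output format from the question (ID without the version suffix). This is only a minimal sketch, assuming whitespace-separated input as in the samples. The temporary directory, the tab output separator, and the "NA" placeholder for IDs that are missing from file1 are my own choices, not something the question asks for.

```shell
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)

# First two sample rows of each file, copied from the question.
printf '%s\n' \
  'ENSG00000000003 TSPAN6  ensembl_havana  TSPAN6' \
  'ENSG00000000419 DPM1    ensembl_havana  DPM1' > "$tmp/file1"
printf '%s\n' \
  'ENSG00000000003.12  0.0730716237772557  -0.147970450702234' \
  'ENSG00000000419.5   0.156405616866614   -0.0398488625782745' > "$tmp/file2"

awk -v OFS='\t' '
  NR == FNR { name[$1] = $2; next }   # file1: remember the gene name by bare ID
  {
    split($1, part, ".")              # part[1] is the ID without the version
    # "NA" is an assumed placeholder for IDs that do not occur in file1
    gene = (part[1] in name) ? name[part[1]] : "NA"
    print part[1], gene, $2, $3
  }
' "$tmp/file1" "$tmp/file2" > "$tmp/merged.tsv"

cat "$tmp/merged.tsv"
```

Printing $1 instead of part[1] gives the first output format again, with the version suffix kept.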

#2


$ awk 'NR==FNR{m[$1]=$2;next} {sub(/[[:space:]]/," "m[$1])} 1' file1 FS='.' file2
ENSG00000000003.12 TSPAN6 0.0730716237772557  -0.147970450702234
ENSG00000000419.5 DPM1  0.156405616866614   -0.0398488625782745
ENSG00000000457.3 SCYL3  -0.110396121325736  -0.0147093758392248
ENSG00000000460.15 C1orf112 -0.0457144601264149 0.322340330477282
ENSG00000000971.12 CFH 0.0613967504891434  -0.0198254029339757
ENSG00000001036.4 FUCA2  0.00879628204710496 0.0560438506950908
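
For readers unfamiliar with the idioms in this one-liner, here is the same command spread out with comments. It is a sketch of how I read it, run on two sample rows from the question. The temporary directory and the `result` variable are only there to keep the example self-contained.

```shell
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)

# First two sample rows of each file, copied from the question.
printf '%s\n' \
  'ENSG00000000003 TSPAN6  ensembl_havana  TSPAN6' \
  'ENSG00000000419 DPM1    ensembl_havana  DPM1' > "$tmp/file1"
printf '%s\n' \
  'ENSG00000000003.12  0.0730716237772557  -0.147970450702234' \
  'ENSG00000000419.5   0.156405616866614   -0.0398488625782745' > "$tmp/file2"

result=$(awk '
  # file1 is read with the default whitespace separator: map ID -> gene name
  NR == FNR { m[$1] = $2; next }
  {
    # FS="." is assigned on the command line just before file2, so for
    # file2 the first field is the ID with the version suffix cut off.
    # sub() replaces the first whitespace character of the line with
    # a space followed by the gene name.
    sub(/[[:space:]]/, " " m[$1])
  }
  1   # always-true pattern: print the (modified) line
' "$tmp/file1" FS='.' "$tmp/file2")

printf '%s\n' "$result"
```

Because only the first whitespace character is replaced, the original spacing between the remaining columns is kept, which is why the columns in the output above are unevenly spaced. An ID that is absent from file1 would get just an extra space inserted.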
