awk or shell script to change the format of a tab-delimited file

Time: 2021-08-09 22:55:26

I need to convert my tab-delimited data from the input format to the output format shown below. Please help me write a script.

Input file:

BRANCH_CODE DEPT_CODE   ITEM_CODE   UNIT_CODE   01/04/2017  02/04/2017  03/04/2017  04/04/2017  05/04/2017  06/04/2017  07/04/2017  08/04/2017  09/04/2017  10/04/2017    
KI-01   DP-0001 10001   KG  31.5    45  72  84  67.5    39  57  22.5    22  56    
KI-01   DP-0001 10002   KG  22  0   62  18  49  13  75  17  0   72

Output format:

DOC_DATE    BRANCH_CODE DEPT_CODE   ITEM_CODE   UNIT_CODE   QTY     
01/04/2017  KI-01   DP-0001 10001   KG  31.5
01/04/2017  KI-01   DP-0001 10002   KG  22
02/04/2017  KI-01   DP-0001 10001   KG  45
02/04/2017  KI-01   DP-0001 10002   KG  0
03/04/2017  KI-01   DP-0001 10001   KG  72
03/04/2017  KI-01   DP-0001 10002   KG  62

and so on.

I wrote the following code in a .sh file:

#!/bin/bash
awk 'NR!=1{print $0}' input.tsv > temp_data_wo_header.tsv;
lc=$(wc -l < temp_data_wo_header.tsv);
for ((i=6; i<=15; i++))
do
    echo "Constructing date file "$i" and ...";
    (for (( c=1; c<=$lc; c++));
        do 
            awk 'NR==1{print $'$i'}' input.tsv;
        done
    ) > temp_date.tsv;
    echo "Adding date to data file...";
    paste <(awk '{print $1}' temp_date.tsv ) <(awk 'BEGIN { FS = "\t" } ; {print $1,$2,$3,$5,$'$i'}' temp_data_wo_header.tsv ) > "temp_day_"$i"_data.tsv";
    echo "Finished adding...";
done;

Is there a better way to do this?

2 solutions

#1


1  

Here's one for GNU awk, since it uses 2D arrays (arrays of arrays):

$ awk '
BEGIN {
    FS=OFS="\t" }                          # set the delimiters
{
    sub(/\r/,"",$NF)                       # in case of \r\n line endings
    a[NR][1]                               # define array element
    n=split($0,a[NR],FS)                   # split record to a[NR]
    a[NR][4]=$1 OFS $2 OFS $3 OFS $4       # gather constants to one element
    if(NR==1)
        a[NR][4]="DOC_DATE" OFS a[NR][4] OFS "QTY"
}
END {                                      # everything is in memory
    print a[1][4];                         # header print
    for(j=5;j<=n;j++)                      # loop all data fields
        for(i=2;i<=NR;i++)                 # loop all records
            print a[1][j],a[i][4],a[i][j]  # output
}' file
DOC_DATE        BRANCH_CODE     DEPT_CODE       ITEM_CODE       UNIT_CODE       QTY
01/04/2017      KI-01   DP-0001 10001   KG      31.5
01/04/2017      KI-01   DP-0001 10002   KG      22
02/04/2017      KI-01   DP-0001 10001   KG      45
02/04/2017      KI-01   DP-0001 10002   KG      0
03/04/2017      KI-01   DP-0001 10001   KG      72

#2


1  

It would be better to do the entire processing in awk. For example:

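BEGIN {
  FS = "\t"    # fields are tab-separated on input
  OFS = "\t"   # and on output
}
NR == 1 {
  header = $0  # keep the header line; it holds the dates
}
NR > 1 {
  # store the four key columns and the quantity for every date column
  for (i = 5; i <= NF; i++) {
    keys[NR, i] = $1 OFS $2 OFS $3 OFS $4
    data[NR, i] = $i
  }
}
END {
  $0 = header  # re-split the header so $5..$NF are the dates
  print "DOC_DATE", $1, $2, $3, $4, "QTY"
  for (i = 5; i <= NF; i++) {      # each date column
    for (j = 2; j <= NR; j++) {    # each data row
      print $i, keys[j, i], data[j, i]
    }
  }
}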

You could save this in a script, say script.awk, and then you could run it with:

awk -f script.awk input.tsv
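
If the file is too large to hold in awk arrays, one possible alternative (not taken from either answer, just a sketch) is to let awk emit one row per date column with two numeric sort keys in front, and let sort and cut do the reordering. The function name unpivot is made up for illustration, and the sketch assumes the first four columns are the keys and every remaining non-empty header column is a date:

```shell
#!/bin/sh
# Hypothetical helper (not from the answers above): unpivot a TSV whose
# first four columns are keys and whose remaining columns are dates.
# Reads the files given as arguments, or standard input if there are none.
unpivot() {
  # awk prefixes every row with (date column index, input row number);
  # sort orders by those two keys; cut then drops them again.
  awk '
    BEGIN { FS = OFS = "\t" }
    { sub(/\r$/, "") }                      # tolerate \r\n line endings
    NR == 1 {                               # header: remember the dates
      nf = NF
      for (i = 5; i <= nf; i++) date[i] = $i
      print 0, 0, "DOC_DATE", $1, $2, $3, $4, "QTY"
      next
    }
    {                                       # data: one output row per date column
      for (i = 5; i <= nf; i++)
        if (date[i] != "")                  # skip empty trailing header fields
          print i, NR, date[i], $1, $2, $3, $4, $i
    }
  ' "$@" |
    sort -t "$(printf '\t')" -k1,1n -k2,2n |
    cut -f3-
}
```

After defining the function you would call it as `unpivot input.tsv > output.tsv`. The header gets the keys 0 and 0 so it sorts first, and the output order (all rows for the first date, then all rows for the second, and so on) should match what the two answers produce.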
