Remove duplicate lines from a txt file

Date: 2022-09-25 16:38:56

I am processing large text files (~20MB) containing line-delimited data. Most data entries are duplicated, and I want to remove these duplicates, keeping only one copy of each.

Also, to make the problem slightly more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.

e.g. I need to go from this:

BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS

to this:

JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS

NB. the final order doesn't matter.

What is an efficient way to do this?

I can use awk, Python, or any standard Linux command-line tool.

Thanks.

8 solutions

#1 (12 votes)

How about the following (in Python):

prev = None
for line in sorted(open('file')):
    line = line.strip()
    # Sorting puts a line immediately before any longer line that extends it,
    # so print a line only when the next one does not start with it.
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)

If you find memory usage an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.

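A minimal sketch of that streaming variant, assuming the input was pre-sorted externally, e.g. with "sort file > file.sorted" (file names illustrative):

prev = None
with open('file.sorted') as f:
    for line in f:
        line = line.strip()
        # The input is pre-sorted, so the loop never holds the whole file in memory.
        if prev is not None and not line.startswith(prev):
            print(prev)
        prev = line
if prev is not None:
    print(prev)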

#2 (3 votes)

awk '{x[$1 " " $2 " " $3] = $0} END {for (y in x) print x[y]}'

If you need to specify the number of columns for different files:

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) {x[key] = $0}
  }
  END {for (y in x) print y "\t" x[y]}
'
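
For example (file names illustrative), save the program between the quotes as dedup.awk and run:

awk -v ncols=3 -f dedup.awk input.txt > deduped.txt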

#3 (2 votes)

This variation on glenn jackman's answer should work regardless of the position of lines with extra bits:

awk '{idx = $1 " " $2 " " $3; if (length($0) > length(x[idx])) x[idx] = $0} END {for (idx in x) print x[idx]}' inputfile

Or

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) x[key] = $0
  }
  END {for (y in x) print x[y]}
' inputfile

#4 (2 votes)

This or a slight variant should do:

from pprint import pprint

finalData = {}
with open('file') as f:          # file name illustrative
    for line in f:
        parts = line.split()
        # key = first three fields, extra = whatever follows
        key, extra = tuple(parts[0:3]), parts[3:]
        # Overwrite on first sighting, or when this line carries extra bits.
        if key not in finalData or extra:
            finalData[key] = extra

pprint(finalData)

outputs:

{('BOB', '123', '1DB'): ['EXTRA', 'BITS'],
 ('DAVE', '789', '1DB'): [],
 ('JIM', '456', '3DB'): ['AX']}
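
Note that this stores the key fields and the extra bits separately; a short sketch of writing full lines back out (assuming fields are simply re-joined with spaces; file name illustrative):

with open('newfile', 'w') as out:
    for key, extra in finalData.items():
        # key is a 3-tuple of fields, extra a possibly-empty list
        out.write(" ".join(key + tuple(extra)) + "\n")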

#5 (1 vote)

You'll have to define a function to split your line into the important bits and the extra bits (the version below assumes the first three fields form the key, per the sample data), then you can do:

def split_extra(s):
    """Return a pair: the important bits and the extra bits."""
    # Assumption for the sample data: the first three fields are the key.
    parts = s.strip().split()
    return " ".join(parts[:3]), " ".join(parts[3:])

data = {}
with open('file') as f:
    for line in f:
        impt, extra = split_extra(line)
        # setdefault keeps the first version; longer extras replace it after that.
        existing = data.setdefault(impt, extra)
        if len(extra) > len(existing):
            data[impt] = extra

with open('newfile', 'w') as out:
    for impt, extra in data.items():
        out.write((impt + " " + extra).rstrip() + "\n")

#6 (1 vote)

Since you need to keep the extra bits, the fastest way is to first build a set of unique entries (sort -u will do) and then compare the entries against each other, e.g.

if x.startswith(y) and not y.startswith(x)

then keep x and discard y.
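
A rough Python sketch of that idea, assuming the entries have been deduplicated first (file name illustrative). Sorting places an entry with extra bits immediately after its shorter prefix, so comparing adjacent entries is enough:

unique = sorted(set(line.rstrip('\n') for line in open('file')))
keep = []
for entry in unique:
    if keep and entry.startswith(keep[-1]):
        keep[-1] = entry   # this entry extends the previous one; replace it
    else:
        keep.append(entry)
print('\n'.join(keep))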

#7 (1 vote)

If you have perl and want only the last entry for each key (the first field here) to be preserved:

perl -ne '@F = split(/ /); $kw = shift(@F); $kws{$kw} = "@F"; END { foreach (sort keys %kws) { print "$_ $kws{$_}" } }' file.txt > file.new.txt

#8 (1 vote)

The function find_unique_lines will work for a file object or a list of strings.

def split_line(s):
    parts = s.strip().split(' ')
    # key = first three fields; data = any extra bits; keep the raw line too
    return " ".join(parts[:3]), parts[3:], s

def find_unique_lines(f):
    result = {}
    for key, data, line in map(split_line, f):
        # Overwrite when this line carries extra bits, or on first sighting.
        if data or key not in result:
            result[key] = line
    return result.values()

test = """BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS""".split('\n')

for line in find_unique_lines(test):
    print(line)

outputs:

BOB 123 1DB EXTRA BITS
JIM 456 3DB AX
DAVE 789 1DB
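
As noted above, the same function works on a real file object too (file name illustrative; file lines keep their trailing newlines, hence the rstrip):

with open('file') as f:
    for line in find_unique_lines(f):
        print(line.rstrip('\n'))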
