I have some code that reads a file of names and creates a list:
names_list = open("names", "r").read().splitlines()
Each name is separated by a newline, like so:
Allman
Atkinson
Behlendorf
I want to ignore any lines that contain only whitespace. I know I can do this by creating a loop, checking each line as I read it, and adding it to a list if it's not blank.
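Something like this, as a rough sketch of the explicit loop I have in mind (assuming the same "names" file as above):

names_list = []
with open("names") as f:
    for line in f:
        line = line.strip()
        if line:  # skip lines that are empty or whitespace-only
            names_list.append(line)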
I was just wondering if there was a more Pythonic way of doing it?
7 Answers
#1 (43 votes)
I would stack generator expressions:
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) # All lines including the blank ones
    lines = (line for line in lines if line) # Non-blank lines
Now, lines is all of the non-blank lines. This will save you from having to call strip on the line twice. If you want a list of lines, then you can just do:
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = list(line for line in lines if line) # Non-blank lines in a list
You can also do it in a one-liner (excluding the with statement), but it's no more efficient and it's harder to read:
with open(filename) as f_in:
    lines = list(line for line in (l.strip() for l in f_in) if line)
Update:
I agree that this is ugly because of the repetition of tokens. You could just write a generator if you prefer:
def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line
Then call it like:
with open(filename) as f_in:
    for line in nonblank_lines(f_in):
        pass # Stuff
Update 2:

with open(filename) as f_in:
    lines = filter(None, (line.rstrip() for line in f_in))
and on CPython (with deterministic reference counting)
lines = filter(None, (line.rstrip() for line in open(filename)))
In Python 2, use itertools.ifilter if you want a generator, and in Python 3, just pass the whole thing to list if you want a list.
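For illustration, a minimal sketch of both variants (the two halves target different interpreters and aren't meant to run together):

# Python 2: a lazy, generator-like result
import itertools
with open(filename) as f_in:
    lines = itertools.ifilter(None, (line.rstrip() for line in f_in))
    for line in lines:
        pass # Stuff

# Python 3: filter() is lazy, so materialise it into a list explicitly
with open(filename) as f_in:
    lines = list(filter(None, (line.rstrip() for line in f_in)))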
#2 (15 votes)
You could use a list comprehension:
with open("names", "r") as f: names_list = [line.strip() for line in f if line.strip()]
Updated: Removed unnecessary readlines().
To avoid calling line.strip() twice, you can use a generator:
names_list = [l for l in (line.strip() for line in f) if l]
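Put together with the file handling from above, a minimal runnable sketch (assuming the same "names" file):

with open("names", "r") as f:
    names_list = [l for l in (line.strip() for line in f) if l]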
#3 (7 votes)
If you want, you can just put what you had in a list comprehension:
names_list = [line for line in open("names.txt", "r").read().splitlines() if line]
or
all_lines = open("names.txt", "r").read().splitlines()
names_list = [name for name in all_lines if name]
splitlines() has already removed the line endings.
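A quick illustrative sketch of that point (expected output shown in the comments):

text = "Allman\n\nAtkinson\nBehlendorf\n"
print(text.splitlines())                     # ['Allman', '', 'Atkinson', 'Behlendorf']
print([n for n in text.splitlines() if n])   # ['Allman', 'Atkinson', 'Behlendorf']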
I don't think those are as clear as just looping explicitly though:
names_list = []
with open('names.txt', 'r') as _:
    for line in _:
        line = line.strip()
        if line:
            names_list.append(line)
Edit:
filter does look quite readable and concise, though:
names_list = filter(None, open("names.txt", "r").read().splitlines())
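One caveat worth adding: under Python 3, filter() returns a lazy iterator rather than a list, so to get the same list you would write, for example:

# Python 3: materialise the filter object into a list
names_list = list(filter(None, open("names.txt", "r").read().splitlines()))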
#4 (3 votes)
When text has to be processed just to extract data from it, I always think first of regexes, because:
- as far as I know, regexes were invented for exactly that
- iterating over lines seems clumsy to me: it essentially amounts to searching for the newlines and then searching each line for the data to extract; that makes two searches instead of a single direct one with a regex
- bringing regexes into play is easy; only writing the regex string to be compiled into a regex object is sometimes hard, but in that case a treatment based on iterating over lines would be complicated too
For the problem discussed here, a regex solution is fast and easy to write:
import re
names = re.findall(r'\S+', open(filename).read())
I compared the speeds of several solutions:
import re
from time import clock

A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3 = [],[],[]

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line

def short_nonblank_lines(f):
    for l in f:
        line = l[0:-1]
        if line:
            yield line

for essays in xrange(50):

    te = clock()
    with open('raa.txt') as f:
        names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
    A.append(clock()-te)

    te = clock()
    with open('raa.txt') as f:
        names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
    AA.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
    B1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
    B2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
    BS.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f:
        xreg = re.findall('\S+',f.read()) # eyquem
    reg.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
    C1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling with line[0:-1]
    C2.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        yD = [ line for line in nonblank_lines(f_in) ] # aaronasterling update
    D.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        yDsh = [ name for name in short_nonblank_lines(f_in) ] # nonblank_lines with line[0:-1]
    Dsh.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
    F1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
    F2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF3 = filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
    F3.append(clock()-te)

print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n is ',\
      names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n is ',\
      names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'

def displ((fr,it,what)):
    print fr + str( min(it) )[0:7] + ' ' + what

map(displ,(('* ', A, '[line.strip() for line in f if line.strip()] * Felix Kling\n'),
           (' ', B1, ' [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()'),
           ('* ', C1, 'list(line for line in (l.strip() for l in f_in) if line) * aaronasterling\n'),
           ('* ', reg, 're.findall("\S+",f.read()) * eyquem\n'),
           ('* ', D, '[ line for line in nonblank_lines(f_in) ] * aaronasterling update'),
           (' ', Dsh, '[ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]\n'),
           ('* ', F1 , 'filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2\n'),
           (' ', B2, ' [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]'),
           (' ', C2, 'list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]\n'),
           (' ', AA, '[line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]\n'),
           (' ', BS, '[name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()\n'),
           (' ', F2 , 'filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]'),
           (' ', F3 , 'filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()')) )
The solution with a regex is straightforward and neat. It isn't among the fastest ones, though. The filter() solution from aaronasterling is surprisingly fast for me (I wasn't aware of that particular speed of filter()), and the times of the optimized solutions go down to 27% of the slowest time. I wonder what makes the filter/splitlines combination work so well:
names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
 is  True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
 is  True

*  0.08266 [line.strip() for line in f if line.strip()] * Felix Kling
   0.07535  [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()
*  0.06912 list(line for line in (l.strip() for l in f_in) if line) * aaronasterling
*  0.06612 re.findall("\S+",f.read()) * eyquem
*  0.06486 [ line for line in nonblank_lines(f_in) ] * aaronasterling update
   0.05264 [ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]
*  0.05451 filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2
   0.04689  [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]
   0.04582 list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]
   0.04171 [line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]
   0.03265 [name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()
   0.03638 filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]
   0.02198 filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()
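For reference, that fastest combination written on its own looks like this (a sketch; under Python 2 filter() returns a list directly, under Python 3 it would need list() around it):

with open('raa.txt') as f_in:
    names = filter(None, f_in.read().splitlines())  # non-blank lines only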
But this problem is a particular one, the simplest of all: only one name per line. So the solutions are just games with lines, splits and [0:-1] cuts.
A regex, on the contrary, doesn't care about lines: it finds the desired data directly. I consider that a more natural way of solving the problem, one that applies from the simplest to the more complex cases, and hence is often the way to prefer when treating text.
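As a purely hypothetical sketch of a slightly messier case (line numbers and punctuation mixed in, invented here for illustration), a regex still extracts the names directly:

import re

messy = "1. Allman,\n\n 2. Atkinson;\n3. Behlendorf\n"  # hypothetical input
names = re.findall(r'[A-Za-z]+', messy)
print(names)  # ['Allman', 'Atkinson', 'Behlendorf']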
EDIT
I forgot to say that I use Python 2.7 and that I measured the above times with a file containing 500 repetitions of the following block of names (one name per line in the file):
SMITH JONES WILLIAMS TAYLOR BROWN DAVIES EVANS WILSON THOMAS JOHNSON
ROBERTS ROBINSON THOMPSON WRIGHT WALKER WHITE EDWARDS HUGHES GREEN HALL
LEWIS HARRIS CLARKE PATEL JACKSON WOOD TURNER MARTIN COOPER HILL
WARD MORRIS MOORE CLARK LEE KING BAKER HARRISON MORGAN ALLEN
JAMES SCOTT PHILLIPS WATSON DAVIS PARKER PRICE BENNETT YOUNG GRIFFITHS
MITCHELL KELLY COOK CARTER RICHARDSON BAILEY COLLINS BELL SHAW MURPHY
MILLER COX RICHARDS KHAN MARSHALL ANDERSON SIMPSON ELLIS ADAMS SINGH
BEGUM WILKINSON FOSTER CHAPMAN POWELL WEBB ROGERS GRAY MASON ALI
HUNT HUSSAIN CAMPBELL MATTHEWS OWEN PALMER HOLMES MILLS BARNES KNIGHT
LLOYD BUTLER RUSSELL BARKER FISHER STEVENS JENKINS MURRAY DIXON HARVEY
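For completeness, a hypothetical sketch of how a test file like raa.txt could be generated from that list (my reconstruction; the blank line between blocks is an assumption, not taken from the original setup):

# Hypothetical reconstruction of the test file: 500 repetitions of the
# name block, one name per line, with a blank line between blocks.
names = ["SMITH", "JONES", "WILLIAMS"]  # ... extend with the full list above
with open("raa.txt", "w") as out:
    for _ in range(500):
        out.write("\n".join(names) + "\n\n")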
#5 (0 votes)
@S.Lott
The following code processes the lines one at a time and produces a result that isn't memory hungry:
filename = 'english names.txt'
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = (line for line in lines if line)
    the_strange_sum = 0
    for l in lines:
        the_strange_sum += 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.find(l[0])
print the_strange_sum
So the generator (line.rstrip() for line in f_in) is just as acceptable as the nonblank_lines() function.
#6 (0 votes)
What about LineSentence from gensim? It will ignore such lines:
Bases: object
Simple format: one sentence = one line; words already preprocessed and separated by whitespace.
source can be either a string or a file object. Clip the file to the first limit lines (or not clipped if limit is None, the default).
from gensim.models.word2vec import LineSentence
text = LineSentence('text.txt')
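A small usage sketch for this particular problem (my assumption: each item yielded by LineSentence is the list of whitespace-separated tokens on a line, so an extra truthiness check guards against empty lines either way):

from gensim.models.word2vec import LineSentence

# Each yielded item is a list of tokens from one line of names.txt;
# keep the first token of every non-empty line.
names_list = [tokens[0] for tokens in LineSentence('names.txt') if tokens]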
#7 (0 votes)
I guess there is a simple solution, which I recently used after going through so many of the answers here.
with open(file_name) as f_in:
    for line in f_in:  # the file object is already iterable; no second open() needed
        if len(line.split()) == 0:
            continue
        # do something with the non-blank line here
This just does the same work, ignoring all empty lines.
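To actually build the list the question asks for, the same pattern could be extended like this (a sketch, assuming the names file from the question):

names_list = []
with open(file_name) as f_in:
    for line in f_in:
        if len(line.split()) == 0:  # blank or whitespace-only line
            continue
        names_list.append(line.strip())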