Python: reading part of a text file

Time: 2020-12-01 15:44:52

HI all

I'm new to Python and programming. I need to read in chunks of a large text file whose format looks like the following:

<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>

I need the form, lemma and postag information; e.g. for the line above I need hibernis, hibernus1 and n-p---nb-.

How do I tell Python to read until it reaches form, then read forward until it reaches the quote mark ", and then read the information between the quote marks (hibernis)? I'm really struggling with this.

My attempts so far have been to remove the punctuation, split the sentence and then pull the info I need from a list. I'm having trouble getting Python to iterate over the whole file, though; I can only get this working for one line. My code is below:

f=open('blank.txt','r')
quotes=f.read()
noquotes=quotes.replace('"','')
f.close()

rf=open('blank.txt','w')
rf.write(noquotes)
rf.close()   

f=open('blank.txt','r')
finished = False
postag=[]
while not finished:
   line=f.readline()
   words=line.split()
   postag.append(words[4])
   postag.append(words[6])
   postag.append(words[8])              
   finished=True

Would appreciate any feedback/criticisms

thanks

9 answers

#1


If it's XML, use ElementTree to parse it:

from xml.etree import ElementTree

line = '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>'

element = ElementTree.fromstring(line)

For each XML element you can easily extract the name and all the attributes:

>>> element.tag
'word'
>>> element.attrib
{'head': '7', 'form': 'hibernis', 'postag': 'n-p---nb-', 'lemma': 'hibernus1', 'relation': 'ADV', 'id': '8'}

So if you have a document with a bunch of word XML elements, something like this will extract the information you want from each one:

from xml.etree import ElementTree

XML = '''
<words>
    <word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>
</words>'''

root = ElementTree.fromstring(XML)

for element in root.findall('word'):
    form = element.attrib['form']
    lemma = element.attrib['lemma']
    postag = element.attrib['postag']

    print(form, lemma, postag)

Use parse() instead of fromstring() if you only have a filename.
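
A minimal sketch of the parse() variant, assuming the file is well-formed XML with a single root element. The file name words.xml, the <words> wrapper and the read_words() helper are made up for illustration, and the sample is written to a temporary directory only so the snippet runs on its own:

```python
import os
import tempfile
from xml.etree import ElementTree

SAMPLE = '''<words>
    <word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>
</words>'''

def read_words(path):
    """Return a list of (form, lemma, postag) tuples from an XML file."""
    root = ElementTree.parse(path).getroot()
    # iter() walks the whole tree, so <word> may sit at any depth.
    return [(w.get('form'), w.get('lemma'), w.get('postag'))
            for w in root.iter('word')]

# Write the sample to a temporary file (hypothetical name) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'words.xml')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(SAMPLE)
    result = read_words(path)

print(result)
```

For a file too large to hold in memory, ElementTree.iterparse() may be worth a look, since it hands you elements one at a time.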

#2


I'd suggest using the regular expression module: re

Something along these lines perhaps?

#!/usr/bin/python
import re

if __name__ == '__main__':
    # 'x' is the name of the input file
    with open('x') as f:
        data = f.read()
    # '.' does not match a newline, so each match stays within one line
    RE = re.compile(r'form="(.*?)" lemma="(.*?)" postag="(.*?)"')
    for m in RE.findall(data):
        print(m)

This assumes that each <word ...> element is on a single line, that the attributes appear in exactly that order, and that you don't need full XML parsing.
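
If the attribute order cannot be relied on, one possible variation (a sketch, not part of the answer above) is to collect every name="value" pair into a dict and look the names up afterwards. The ATTR pattern and the attributes() helper are names invented here, and the sketch still assumes there are no escaped quotes inside the values:

```python
import re

# Matches any name="value" pair; values must not contain a double quote.
ATTR = re.compile(r'(\w+)="([^"]*)"')

def attributes(line):
    """Return a dict of every name="value" pair found in the line."""
    return dict(ATTR.findall(line))

# Same sample line, with lemma and form swapped to show order doesn't matter.
line = '<word id="8" lemma="hibernus1" form="hibernis" postag="n-p---nb-" head="7" relation="ADV"/>'
attrs = attributes(line)
print(attrs['form'], attrs['lemma'], attrs['postag'])
```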

#3


Is your file proper XML? If so, try a SAX parser:

import xml.sax

class Handler(xml.sax.ContentHandler):
    def startElement(self, tag, attrs):
        # called once for every opening tag in the document
        if tag == 'word':
            print('form=', attrs['form'])
            print('lemma=', attrs['lemma'])
            print('postag=', attrs['postag'])

ch = Handler()
with open('myfile', 'rb') as f:
    xml.sax.parse(f, ch)

(this is rough .. it may not be entirely correct).

#4


In addition to the usual RegEx answer, since this appears to be a form of XML, you might try something like BeautifulSoup ( http://www.crummy.com/software/BeautifulSoup/ )

It's very easy to use, and it finds tags/attributes in things like HTML/XML even if they're not "well formed". Might be worth a look.

#5


Parsing XML by hand is usually the wrong approach. For one thing, your code will break if there's an escaped quote in any of the attributes. Getting the attributes from an XML parser is probably cleaner and less error-prone.
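
To illustrate the escaped-quote point, here is a small sketch comparing a regex with ElementTree. The sample line is hypothetical (it is not from the question's data) and was chosen only because its form value contains &quot;:

```python
import re
from xml.etree import ElementTree

# Hypothetical line whose form value contains an escaped double quote.
line = '<word id="9" form="a&quot;b" lemma="ab1" postag="n-s---mn-"/>'

# The regex returns the raw text, entity and all.
by_regex = re.search(r'form="([^"]*)"', line).group(1)

# The XML parser decodes the entity for you.
by_parser = ElementTree.fromstring(line).get('form')

print(by_regex)
print(by_parser)
```

The regex leaves the entity undecoded, while the parser returns the value with a real quote character in it.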

An approach like this can also run into problems parsing the entire file if some lines don't match the format. You can deal with this by creating a parse function that handles the failure, something like:

def parse(line):
    try:
        # extract and return the parsed values here
        ...
    except ValueError:
        # the line doesn't match the format; skip it
        return None

You can also simplify this with filter and map functions:

lines = filter(parseable, f)    # parseable(line) returns True for lines you can parse
values = map(parse, lines)
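
A runnable sketch of the filter/map idea, with the caveat that the answer leaves parseable() and parse() undefined. The regex-based versions below are one assumed way to fill them in, and the <sentence> lines are made-up examples of lines that don't match:

```python
import re

# Assumed pattern: form, lemma and postag appear in this order on one line.
PATTERN = re.compile(r'form="([^"]*)".*lemma="([^"]*)".*postag="([^"]*)"')

def parseable(line):
    """True if the line contains the three attributes we need."""
    return PATTERN.search(line) is not None

def parse(line):
    """Return (form, lemma, postag) for a line that parseable() accepted."""
    return PATTERN.search(line).groups()

lines = [
    '<sentence id="1">',
    '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>',
    '</sentence>',
]

# filter() drops the lines that don't match; map() parses the rest.
values = list(map(parse, filter(parseable, lines)))
print(values)
```

In Python 3 both filter() and map() are lazy, so wrapping the result in list() is what actually runs the parsing.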

#6


Just to highlight your problem:

finished = False
counter = 0
while not finished:
    counter += 1
    finished = True   # set on the first pass, so the loop never repeats
print(counter)        # prints 1
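
The loop above runs exactly once because finished becomes True at the end of the first pass. One way to visit every line, sketched here on the assumption that the attributes always appear in the same order, is to iterate over the file object directly. io.StringIO stands in for open('blank.txt'), and the second word line is made up:

```python
import io

# Stand-in for open('blank.txt'); the second word line is hypothetical.
f = io.StringIO(
    '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>\n'
    '\n'
    '<word id="9" form="castris" lemma="castra1" postag="n-p---nb-" head="8" relation="ADV"/>\n'
)

postag = []
for line in f:                      # runs once per line until end of file
    parts = line.split('"')         # odd indexes hold the quoted values
    if len(parts) < 8:              # skip lines without enough attributes
        continue
    postag.append((parts[3], parts[5], parts[7]))

print(postag)
```

There is no need to strip the quotes and rewrite the file first; splitting on the quote character leaves form, lemma and postag at indexes 3, 5 and 7.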

#7


With regular expressions, this is the gist (you can do the file.readline() part):

import re
line = '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>'
r = re.compile(r'form="([^"]*)".*lemma="([^"]*)".*postag="([^"]*)"')
match = r.search(line)
print(match.groups())
# prints ('hibernis', 'hibernus1', 'n-p---nb-')

#8


First, don't spend a lot of time rewriting your file. It's generally a waste of time. Cleaning up and parsing the tags is so fast that you'll be perfectly happy working from the source file every time.

with open("blank.txt", "r") as source:
    for line in source:
        # line has a tag-line structure
        # <word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>
        # Assumption -- no spaces in the quoted strings.
        # Drop the trailing "/>" so the last value splits cleanly.
        parts = line.strip().rstrip("/>").split()
        # parts is [ '<word', 'id="8"', 'form="hibernis"', ... ]
        assert parts[0] == "<word"
        name_value_list = [part.partition('=') for part in parts[1:]]
        # name_value_list is [ ('id','=','"8"'), ('form','=','"hibernis"'), ... ]
        # strip('"') removes the surrounding quotes without resorting to eval()
        attrs = dict((n, v.strip('"')) for n, _, v in name_value_list)
        # attrs is { 'id':'8', 'form':'hibernis', ... }
        print(attrs['form'], attrs['lemma'], attrs['postag'])

#9


Wow, you guys are fast :) If you want all the attribute values as a list (and the ordering is known), then you can use something like this:

import re
print(re.findall(r'"(.+?)"', INPUT))

INPUT is a line like:

<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>

and the printed list is:

['8', 'hibernis', 'hibernus1', 'n-p---nb-', '7', 'ADV']
