I'm trying to parse a bunch of log files (up to 4 GiB) inside a tar.gz archive. The source files come from RedHat 5.8 Server systems and SunOS 5.10; the processing has to be done on Windows XP.
I iterate through the tar.gz files, read the files, decode the file contents from UTF-8, and parse them with regular expressions before further processing.
When I write out the processed data along with the raw data that was read from the tar.gz, I get the following error:
Traceback (most recent call last):
File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 375, in <module>
p.analyze_longtails()
File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 196, in analyze_longtails
oFile.write(entries[key]['source'] + '\n')
File "C:\Python\3.2\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24835-24836: character maps to <undefined>
Here's the part where I read and parse the log files:
def getSalesSoaplogEntries(perfid=None):
    for tfile in parser.salestarfiles:
        path = os.path.join(parser.logpath, tfile)
        if os.path.isfile(path):
            if tarfile.is_tarfile(path):
                tar = tarfile.open(path, 'r:gz')
                for tarMember in tar.getmembers():
                    if 'salescomponent-soap.log' in tarMember.name:
                        tarMemberFile = tar.extractfile(tarMember)
                        content = tarMemberFile.read().decode('UTF-8', 'surrogateescape')
                        for m in parser.soaplogregex.finditer(content):
                            entry = {}
                            entry['time'] = datetime(datetime.now().year,
                                                     int(m.group('month')), int(m.group('day')),
                                                     int(m.group('hour')), int(m.group('minute')),
                                                     int(m.group('second')), int(m.group('millis')) * 1000)
                            entry['perfid'] = m.group('perfid')
                            entry['direction'] = m.group('direction')
                            entry['payload'] = m.group('payload')
                            entry['file'] = tarMember.name
                            entry['source'] = m.group(0)
                            sm = parser.soaplogmethodregex.match(entry['payload'])
                            if sm:
                                entry['method'] = sm.group('method')
                            if entry['time'] >= parser.starttime and entry['time'] <= parser.endtime:
                                if perfid and entry['perfid'] == perfid:
                                    yield entry
                tar.members = []
And here's the part where I write out the processed information along with the raw data (it's an aggregation of all log entries for one specific process):
if len(entries) > 0:
    time = perfentry['time']
    filename = time.isoformat('-').replace(':', '').replace('-', '') + 'longtail_' + perfentry['perfid'] + '.txt'
    oFile = open(os.path.join(parser.logpath, filename), 'w')
    oFile.write(perfentry['source'] + '\n')
    oFile.write('------\n')
    for key in sorted(entries.keys()):
        oFile.write('------\n')
        oFile.write(entries[key]['source'] + '\n')  # <-- here it is failing
What I don't get is why, if it is apparently correct to read the files as UTF-8, it is not possible to just write them back out as UTF-8. What am I doing wrong?
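A minimal sketch of what is likely going on (the stray byte value is hypothetical): decoding with 'surrogateescape' never fails, because any undecodable byte is preserved as a lone surrogate code point, but such a surrogate cannot be encoded by cp1252, which is the default encoding open() picks up on Windows, as the traceback shows.

    raw = b'log line with a stray non-UTF-8 byte: \xff\n'  # hypothetical input
    text = raw.decode('UTF-8', 'surrogateescape')          # never raises; \xff becomes the surrogate '\udcff'
    text.encode('utf-8', 'surrogateescape')                # round-trips back to the original bytes
    text.encode('cp1252')                                  # raises UnicodeEncodeError, as in the traceback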
1 Answer
#1
Your output file is using the default encoding for your OS, which is not UTF-8. Use codecs.open instead of open and specify encoding='utf-8'.
oFile = codecs.open(os.path.join(parser.logpath,filename), 'w', encoding='utf-8')
See http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
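Note that since the traceback shows Python 3.2, the built-in open() also accepts an encoding argument, so codecs.open isn't strictly required. And because the content was decoded with errors='surrogateescape', passing the same error handler when writing lets any escaped, non-UTF-8 bytes round-trip instead of raising. A sketch of that variant, reusing the path and filename from the question:

    # errors='surrogateescape' is only needed if the decoded text may still
    # contain bytes that were not valid UTF-8 in the source logs.
    oFile = open(os.path.join(parser.logpath, filename), 'w',
                 encoding='utf-8', errors='surrogateescape')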