Following is sample code; the aim is just to merge text files from a given folder and its subfolders. I am getting a traceback occasionally and am not sure where to look. I also need some help enhancing the code to prevent blank lines from being merged and to display the number of lines in the merged/master file. It's probably a good idea to perform some cleanup before merging the files, or just to ignore blank lines during the merging process.
Each text file in the folder is no more than 1000 lines, but the aggregate master file could easily exceed 10,000 lines.
import os

root = 'C:\\Dropbox\\ans7i\\'
files = [(path,f) for path,_,file_list in os.walk(root) for f in file_list]
out_file = open('C:\\Dropbox\\Python\\master.txt','w')
for path,f_name in files:
    in_file = open('%s/%s'%(path,f_name), 'r')
    # write out root/path/to/file (space) file_contents
    for line in in_file:
        out_file.write('%s/%s %s'%(path,f_name,line))
    in_file.close()
    # enter new line after each file
    out_file.write('\n')

with open('master.txt', 'r') as f:
    lines = f.readlines()
with open('master.txt', 'w') as f:
    f.write("".join(L for L in lines if L.strip()))
Traceback (most recent call last):
  File "C:\Dropbox\Python\master.py", line 9, in <module>
    for line in in_file:
  File "C:\PYTHON32\LIB\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 972: character maps to <undefined>
1 Answer
The error is thrown because Python 3 opens your files in text mode with a default encoding (cp1252 here, according to the traceback) that doesn't match the contents.
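To make the failure concrete, here is a minimal sketch that reproduces the same kind of error without any files. The sample bytes are made up for illustration; the only thing taken from the traceback is the offending byte 0x81, which has no character assigned in cp1252.

```python
# Illustrative data only: three ASCII letters followed by the byte 0x81.
data = b"caf\x81"

try:
    data.decode("cp1252")
    failed = False
except UnicodeDecodeError as exc:
    # Same exception type as in the traceback above.
    failed = True
    print("decode failed:", exc.reason)

# latin-1 assigns a code point to every byte, so decoding never raises,
# although the resulting characters may not be the ones the author meant.
text = data.decode("latin-1")
print(len(text))
```

This is also why the error only shows up occasionally: it depends on whether a given file happens to contain a byte that the default codec cannot map.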
If all you are doing is copying file contents, you'd be better off using the shutil.copyfileobj() function and opening the files in binary mode. That way you avoid encoding issues altogether (as long as all your source files use the same encoding, of course, so you don't end up with a target file with mixed encodings):
import shutil
import os.path

# `files` is the list of (path, f_name) pairs built with os.walk() in the question
with open('C:\\Dropbox\\Python\\master.txt', 'wb') as output:
    for path, f_name in files:
        with open(os.path.join(path, f_name), 'rb') as input:
            shutil.copyfileobj(input, output)
        output.write(b'\n')  # insert extra newline between files
I've cleaned up the code a little to use context managers (so your files get closed automatically when done) and to use os.path.join() to create the full path for your files.
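If you also want to drop blank lines while staying in binary mode, one possible variation is to copy line by line instead of calling shutil.copyfileobj(). The sketch below is an assumption-laden illustration, not part of the code above: the function name merge_binary and its parameters are hypothetical, and it only works for ASCII-compatible encodings (UTF-8, cp1252, latin-1), where a line break is the single byte b'\n'.

```python
import os


def merge_binary(root, out_path):
    """Concatenate every file under root into out_path, dropping blank lines.

    Hypothetical helper: no decoding happens, so no UnicodeDecodeError can occur.
    """
    out_abs = os.path.abspath(out_path)
    with open(out_path, "wb") as output:
        for path, _, file_list in os.walk(root):
            for f_name in sorted(file_list):
                full = os.path.join(path, f_name)
                if os.path.abspath(full) == out_abs:
                    continue  # don't merge the master file into itself
                with open(full, "rb") as source:
                    for line in source:
                        if not line.strip():
                            continue  # skip empty or whitespace-only lines
                        # Ensure a file's last line ends with a newline so
                        # the next file starts on a fresh line.
                        if not line.endswith(b"\n"):
                            line += b"\n"
                        output.write(line)
```

Because blank lines are filtered as they are read, no second pass over the master file is needed.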
If you do need to process your input line by line, you'll need to tell Python what encoding to expect, so it can decode the file contents to Python string objects:
open(path, mode, encoding='UTF8')
Note that this requires you to know up front what encoding the files use.
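Putting the pieces together, a text-mode merge with an explicit encoding might look like the sketch below. The function name merge_text, the default of UTF-8, and the use of errors="replace" are all assumptions made for illustration; substitute whatever encoding your files really use. It prefixes each line with its source path as in the question, skips blank lines, and returns the number of lines written.

```python
import os


def merge_text(root, out_path, encoding="utf-8"):
    """Merge files under root line by line; return the number of lines written.

    Hypothetical helper; the encoding default is an assumption.
    """
    written = 0
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", encoding=encoding) as output:
        for path, _, file_list in os.walk(root):
            for f_name in sorted(file_list):
                full = os.path.join(path, f_name)
                if os.path.abspath(full) == out_abs:
                    continue  # don't merge the master file into itself
                # errors="replace" substitutes U+FFFD for undecodable bytes
                # instead of raising UnicodeDecodeError.
                with open(full, "r", encoding=encoding, errors="replace") as source:
                    for line in source:
                        if line.strip():  # ignore blank lines
                            output.write("%s %s\n" % (full, line.rstrip("\r\n")))
                            written += 1
    return written
```

Calling print(merge_text(root, 'C:\\Dropbox\\Python\\master.txt')) would then report the line count of the master file. Note that errors="replace" hides decoding problems rather than fixing them, so drop it if you would rather be told about a wrong encoding.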
Read up on the Python Unicode HOWTO if you have further questions about Python 3, files and encodings.