I have a text file that contains entries like:
70154::308933::3
UserId::ProductId::Score
I wrote this program to read it:
def generateSyntheticData(fileName):
    dataDict = {}
    # rowDict = []
    innerDict = {}
    try:
        # for key in range(5):
        # count = 0
        myFile = open(fileName)
        c = 0
        #del innerDict[0:len(innerDict)]
        for line in myFile:
            c += 1
            #line = str(line)
            n = len(line)
            #print 'n: ',n
            if n is not 1:
                # if c%100 ==0: print "%d: "%c, " entries read so far"
                # words = line.replace(' ','_')
                words = line.replace('::',' ')
                words = words.strip().split()
                #print 'userid: ', words[0]
                userId = int( words[0]) # i get error here
                movieId = int (words[1])
                rating =float( words[2])
                print "userId: ", userId, " productId: ", movieId," :rating: ", rating
                #print words
                #words = words.replace('_', ' ')
                innerDict = dataDict.setdefault(userId,{})
                innerDict[movieId] = rating
                dataDict[userId] = (innerDict)
                innerDict = {}
    except IOError as (errno,strerror):
        print "I/O error({0}) :{1} ".format(errno,strerror)
    finally:
        myFile.close()
        print "total ratings read from file",fileName," :%d " %c
    return dataDict
But I get the error:
ValueError: invalid literal for int() with base 10: ''
The funny thing is that the same code works fine when reading data in the same format from another file. While posting this question I noticed something weird: in the entry 70154::308933::3, every character is followed by a space, like 7 space 0 space 1 space 5 space 4 space :: space 3... But the text file itself looks fine; the spaces only show up when I copy and paste from it. Any clue what's going on? Thanks.
2 Answers
#1
3
The "spaces" that you are seeing appear to be NULs ("\x00"). There is a 99.9% chance that your file is encoded in UTF-16, UTF-16LE, or UTF-16BE. If this is a one-off file, just open it with Notepad and save it as "ANSI", not "Unicode" and not "Unicode big endian". If, however, you need to process it as is, you'll need to know or detect what the encoding is. To find out which, do this:
print repr(open("yourfile.txt", "rb").read(20))
and compare the start of the output with the following:
>>> ucode = u"70154:"
>>> for sfx in ["", "LE", "BE"]:
...     enc = "UTF-16" + sfx
...     print enc, repr(ucode.encode(enc))
...
UTF-16 '\xff\xfe7\x000\x001\x005\x004\x00:\x00'
UTF-16LE '7\x000\x001\x005\x004\x00:\x00'
UTF-16BE '\x007\x000\x001\x005\x004\x00:'
>>>
You can make a detector that's good enough for your purposes by inspecting the first 2 bytes:
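def detect_encoding(f2b):  # f2b: the first 2 bytes of the file
    if f2b in (b"\xff\xfe", b"\xfe\xff"):
        return "UTF-16"  # a byte order mark is present
    elif f2b[1:2] == b"\x00":
        return "UTF-16LE"
    elif f2b[0:1] == b"\x00":
        return "UTF-16BE"
    # otherwise cp1252 or UTF-8 or whatever else is prevalent in your neck of the woods
    return "cp1252"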
You could avoid hard-coding the fallback encoding:
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
Your line-reading code will look like this:
rawbytes = open(fileName, "rb").read()
enc = detect_encoding(rawbytes[:2])
for line in rawbytes.decode(enc).splitlines():
    pass  # whatever: parse the line here
Oh, and the lines will be unicode objects ... if that gives you a problem, ask another question.
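Putting the pieces above together, here is a minimal end-to-end sketch. It is written in Python 3 syntax, whereas the code in the question is Python 2. The names `detect_encoding` and `load_ratings` are illustrative only, and the cp1252 fallback is an assumption; substitute whatever is common for your data. The two-byte check is a heuristic that works for files starting with an ASCII character, not a general-purpose detector.

```python
def detect_encoding(first_two_bytes, fallback="cp1252"):
    """Guess the encoding from the first two bytes of a file (heuristic)."""
    if first_two_bytes in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"      # BOM present; the codec strips it while decoding
    if first_two_bytes[1:2] == b"\x00":
        return "utf-16-le"   # ASCII character stored low byte first
    if first_two_bytes[0:1] == b"\x00":
        return "utf-16-be"   # ASCII character stored high byte first
    return fallback          # assumption: choose what is common for your files


def load_ratings(rawbytes):
    """Parse 'UserId::ProductId::Score' lines into {user: {product: score}}."""
    text = rawbytes.decode(detect_encoding(rawbytes[:2]))
    ratings = {}
    for line in text.splitlines():
        if not line.strip():
            continue         # skip blank lines
        user_id, product_id, score = line.strip().split("::")
        ratings.setdefault(int(user_id), {})[int(product_id)] = float(score)
    return ratings


# Stand-in for open("yourfile.txt", "rb").read()
sample = "70154::308933::3\n70154::1::4.5\n".encode("utf-16")
print(load_ratings(sample))
```

Reading the whole file into memory keeps the sketch short; for very large files you would want to decode incrementally instead.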
#2
2
Debugging 101: simply change the line:
words = words.strip().split()
to:
words = words.strip().split()
print words
and see what comes out.
I will mention a couple of things. If you have the literal UserId::... in the file and you try to process it, int() won't take kindly to converting that to an integer.
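One way to guard against such a header line is to catch the ValueError and skip the line. This is a minimal sketch in Python 3 syntax; the function name `parse_rating_line` is hypothetical, and returning None for anything that is not a data row is an assumption about what you want to happen.

```python
def parse_rating_line(line):
    """Return (user_id, product_id, score), or None if the line is not data."""
    fields = line.strip().split("::")
    if len(fields) != 3:
        return None          # blank or malformed line
    try:
        return int(fields[0]), int(fields[1]), float(fields[2])
    except ValueError:
        return None          # e.g. a header such as UserId::ProductId::Score


for raw in ["UserId::ProductId::Score\n", "70154::308933::3\n", "\n"]:
    print(repr(raw), "->", parse_rating_line(raw))
```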
And the ... unusual line:
if n is not 1:
I would probably write it as:
if n != 1:
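The reason to prefer != is that "is not" compares object identity rather than value; it only appears to work here because CPython happens to cache small integers. A short sketch in Python 3 syntax (the helper names are made up for illustration) showing the value comparison, plus a variant that also treats "\r\n" and whitespace-only lines as empty:

```python
def has_content(line):
    # Compare the length by value; this is what `if n != 1` expresses.
    return len(line) != 1


def has_content_robust(line):
    # Stripping first also handles Windows line endings and stray spaces.
    return line.strip() != ""


for raw in ["70154::308933::3\n", "\n", "\r\n"]:
    print(repr(raw), has_content(raw), has_content_robust(raw))
```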
If, as you indicate in your comment, you end up seeing:
['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3']
then I'd be checking your input file for binary (non-textual) data. You should never end up with that binary information if you're just reading text and trimming/splitting.
And because you state that the digits seem to have spaces between them, you should do a hex dump of the file to find out what's really in there. It may be a UTF-16 Unicode string, for example.
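If no hex dump tool is at hand, a few lines of Python will do. This is a rough sketch in Python 3 syntax; `hex_dump` is a made-up helper, and the sample bytes stand in for what you would get from reading the start of your own file in binary mode.

```python
def hex_dump(data, width=16):
    """Return an offset / hex bytes / printable-text dump of a bytes object."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("{:02x}".format(b) for b in chunk)
        # Show printable ASCII as-is and everything else (including NUL) as "."
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("{:08x}  {}  {}".format(
            offset, hex_part.ljust(width * 3 - 1), text_part))
    return "\n".join(lines)


# Stand-in for open("yourfile.txt", "rb").read(32)
sample = "70154::308933::3".encode("utf-16-le")
print(hex_dump(sample))
```

In a UTF-16 file, every ASCII digit shows up next to a 00 byte, which is exactly the "space" seen when copying and pasting.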