Consider a text file called "new.txt" containing the following elements:
μm
∂r
∆λ
In Python 2.7, I can read the file by typing:
>>> import codecs
>>> f = codecs.open('new.txt', encoding='utf-8')
>>> lines = [line.strip() for line in f.readlines()]
>>> lines
[u'\u03bcm', u'\u2202r', u'\u2206\u03bb']
>>> print lines[0]
μm
So far so good. I can easily convert this list to a numpy array via:
>>> import numpy as np
>>> arr = np.array(lines)
>>> arr
array([u'\u03bcm', u'\u2202r', u'\u2206\u03bb'],
      dtype='<U2')
The issue is, I can't read this file directly via numpy's loadtxt function:
>>> np.loadtxt('new.txt', dtype=np.unicode_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/numpy/lib/npyio.py", line 805, in loadtxt
    X = np.array(X, dtype)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
What is the correct way to read this file into numpy directly?
Thanks.
2 solutions
#1
8
In memory, unicode strings are represented as UCS-2 or UCS-4, depending on how your Python interpreter was compiled. Your file is encoded in UTF-8, so you need to recode it before you can map it to the NumPy array. loadtxt() can't do the recoding for you -- after all, NumPy is mainly targeted at numerical arrays.
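To make the size mismatch concrete, here is a small sketch. It is written for Python 3 rather than the Python 2.7 of the question, so treat it as an illustration only. It compares the UTF-8 bytes of the first line with a fixed-width UTF-32 form; the choice of `utf-32-le` is my assumption, picked to match a little-endian `<U` dtype.

```python
# Python 3 sketch: variable-width UTF-8 versus fixed-width UTF-32.
text = u"\u03bcm"  # the first line of new.txt

utf8_bytes = text.encode("utf-8")        # the bytes stored in the file
utf32_bytes = text.encode("utf-32-le")   # 4 bytes per code point, no BOM

print(len(utf8_bytes))     # 3: two bytes for U+03BC, one for "m"
print(len(utf32_bytes))    # 8: four bytes for each of the two characters
print(hex(utf8_bytes[0]))  # 0xce, the byte named in the traceback
```

The leading byte 0xce is the same one the 'ascii' codec complains about in the question.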
Assuming every line has the same number of characters, you could also use the more efficient variant
import codecs
import numpy
s = codecs.open("new.txt", encoding="utf-8").read()
arr = numpy.frombuffer(s, dtype="<U3")
This will include the newline characters in the strings. To not include them, use
arr = numpy.frombuffer(s.replace("\n", ""), dtype="<U2")
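For anyone on Python 3, where `str` does not expose its internal buffer this way, a rough equivalent might look like the sketch below. It assumes a little-endian `<U2` layout and encodes the text to `utf-32-le` explicitly; the string `s` is a stand-in for the decoded file contents, not something read from disk.

```python
import numpy as np

# Stand-in for the decoded contents of new.txt (three 2-character lines).
s = u"\u03bcm\n\u2202r\n\u2206\u03bb\n"

# Drop the newlines, then encode to 4-byte little-endian code units,
# which is the memory layout of a "<U2" element.
raw = s.replace("\n", "").encode("utf-32-le")

# frombuffer reinterprets the bytes; the result is a read-only view.
arr = np.frombuffer(raw, dtype="<U2")
print(arr.tolist())
```

As with the original, this only works when every line has exactly the same number of characters.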
Edit: If the lines of your file have different lengths and you would like to avoid the intermediate list, you can use
arr = numpy.fromiter(codecs.open("new.txt", encoding="utf-8"), dtype="<U2")
I'm not sure if this will internally create some temporary list, though.
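A Python 3 sketch of the same fromiter idea, with the line endings stripped explicitly, could look like this. The `io.StringIO` object is only a stand-in for the opened file, and the fixed `<U2` dtype is carried over from the example above, so it assumes no line is longer than two characters.

```python
import io
import numpy as np

# io.StringIO stands in for io.open("new.txt", encoding="utf-8").
f = io.StringIO(u"\u03bcm\n\u2202r\n\u2206\u03bb\n")

# Strip the line endings lazily, one line at a time.
stripped = (line.rstrip("\n") for line in f)

# fromiter needs a fixed-width dtype; "<U2" holds up to 2 characters.
arr = np.fromiter(stripped, dtype="<U2")
print(arr.tolist())
```

The generator feeds fromiter one line at a time, so no intermediate list is built on the Python side.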
#2
2
If you want to use loadtxt, you can either first load the raw byte array and then decode:
data = np.loadtxt('foo.txt', dtype='S8')
unicode_data = data.view(np.chararray).decode('utf-8')
or specify a converter for decoding:
data = np.loadtxt('foo.txt', converters={0: lambda x: unicode(x, 'utf-8')}, dtype='U2')
However, using fromiter as in Sven's answer is probably going to be more effective than loadtxt.
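For readers on a newer stack: assuming Python 3 and NumPy 1.14 or later, loadtxt accepts an `encoding` argument, so the decoding step can be handed to it directly. This is a sketch for that setup, not something that applies to the Python 2.7 install in the question. The temporary file is a throwaway copy of the sample data.

```python
import io
import os
import tempfile
import numpy as np

# Write a throwaway UTF-8 copy of the sample file.
path = os.path.join(tempfile.mkdtemp(), "new.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\u03bcm\n\u2202r\n\u2206\u03bb\n")

# encoding= tells loadtxt how to decode the file;
# dtype=str asks for a unicode string array.
arr = np.loadtxt(path, dtype=str, encoding="utf-8")
print(arr.tolist())
```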