This question already has an answer here:
- How to make separator in read_csv more flexible wrt whitespace? 4 answers
I used to read my data with numpy.loadtxt(). However, I recently learned on SO that pandas.read_csv() is much faster.
To read the data I use:
pd.read_csv(filename, sep=' ',header=None)
The problem I am running into is that in my case the separator varies: it can be a single space, several spaces, or even a tab.
Here is what my data can look like:
56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 100.74 52.10 100.74 56.000000 100.740000 3
56.00 102.96 52.40 102.96 56.000000 102.960000 4
56.00 100.74 55.40 100.74 56.000000 100.740000 5
That leads to results like:
0 1 2 3 4 5 6 7 8
0 56 NaN NaN 101.85 52.4 101.85 56 101.85 1
1 56 100.74 50.6 100.74 56.0 100.74 2 NaN NaN
2 56 100.74 52.1 100.74 56.0 100.74 3 NaN NaN
3 56 102.96 52.4 102.96 56.0 102.96 4 NaN NaN
4 56 100.74 55.4 100.74 56.0 100.74 5 NaN NaN
I should mention that my data files are larger than 100 MB, so I cannot preprocess or clean them first. Any ideas on how to fix this?
1 solution
#1
13
Your original line:
pd.read_csv(filename, sep=' ',header=None)
specified the separator as a single space. Because your files can contain spaces or tabs, you can pass a regular expression to the sep parameter, like so:
pd.read_csv(filename, sep=r'\s+', header=None)
This defines the separator as one or more whitespace characters (spaces or tabs). There is a handy cheatsheet that lists regular expressions.
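To see the difference without touching the real file, here is a minimal sketch that parses a small in-memory sample both ways. The sample string is made up for illustration (the whitespace in the question's data was lost when it was posted), and the variable names sample, naive and df are my own.

```python
import io

import pandas as pd

# Hypothetical sample: columns separated by one space, several spaces, or a tab.
sample = (
    "56.00   101.85 52.40\t101.85 56.000000 101.850000 1\n"
    "56.00 100.74 50.60 100.74 56.000000 100.740000 2\n"
    "56.00 100.74  52.10 100.74\t56.000000 100.740000 3\n"
)

# sep=' ' splits on every single space, so a run of spaces yields empty
# (NaN) fields, and a tab is not treated as a separator at all.
naive = pd.read_csv(io.StringIO(sample), sep=' ', header=None)

# sep=r'\s+' treats any run of whitespace (spaces and/or tabs) as one separator.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None)

print(naive.shape)  # more columns than the data really has
print(df.shape)     # one column per value
print(df)
```

As far as I know from the read_csv documentation, '\s+' is handled specially and still uses the fast C parser, whereas other multi-character regular expressions fall back to the slower Python engine, so this should remain practical for files over 100 MB.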