Selecting specific CSV columns (filtering) - Python/pandas

Time: 2021-08-17 20:30:07

I have a very large CSV file with 100 columns. To illustrate my problem I will use a very basic example.

Let's suppose that we have a CSV file.

in  value   d     f
0    975   f01    5
1    976   F      4
2    977   d4     1
3    978   B6     0
4    979   2C     0

I want to select specific columns.

import pandas
data = pandas.read_csv("ThisFile.csv")

To select the first 2 columns I used:

data.ix[:,:2]

What should I do to select different columns, such as the 2nd and the 4th?

Another way to solve this problem would be to rewrite the CSV file, but it is a huge file, so I am avoiding that approach.

2 solutions

#1


12  

This selects the second and fourth columns (since Python uses 0-based indexing):

In [272]: df.iloc[:, [1, 3]]
Out[272]: 
   value  f
0    975  5
1    976  4
2    977  1
3    978  0
4    979  0

[5 rows x 2 columns]

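df.ix can select by location or label, whereas df.iloc always selects by location. When indexing by location, use df.iloc to signal your intention more explicitly. It is also a bit faster, since pandas does not have to check whether your index uses labels. (Note that df.ix was deprecated in pandas 0.20 and removed in pandas 1.0, so on current versions use df.iloc for positions or df.loc for labels.)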
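
For anyone who wants to try this without the original file, here is a minimal sketch of the positional selection. It assumes the sample table from the question can be written as the comma-separated text below, with "in" as an ordinary column; the names csv_text and by_position are made up for the illustration.

```python
import io

import pandas as pd

# Sample data reconstructed from the question (assumed to be comma-separated).
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Positions are 0-based: 1 is the second column, 3 is the fourth.
# A list of positions is used as the column indexer.
by_position = df.iloc[:, [1, 3]]

print(by_position.columns.tolist())  # ['value', 'f']
```

Note that iloc works on an already loaded DataFrame, so the whole file is still read into memory first.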


Another possibility is to use the usecols parameter:

data = pandas.read_csv("ThisFile.csv", usecols=[1,3])

This will load only the second and fourth columns into the data DataFrame.
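
As a rough sketch of how usecols behaves, the snippet below reads the same sample data twice, once with column positions and once with column names (usecols accepts either). The in-memory csv_text stands in for ThisFile.csv and is an assumption about its layout, not the real file.

```python
import io

import pandas as pd

# Stand-in for ThisFile.csv, reconstructed from the question (assumed comma-separated).
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""

# Select by 0-based position: second and fourth columns.
by_index = pd.read_csv(io.StringIO(csv_text), usecols=[1, 3])

# Select the same columns by header name.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["value", "f"])

print(by_index.columns.tolist())  # ['value', 'f']
print(by_index.equals(by_name))   # True
```

Because the unwanted columns are skipped while parsing, this is usually the better fit for a huge file with 100 columns.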

#2


5  

If you would rather select columns by name, you can use:

data[['value','f']]

   value  f
0    975  5
1    976  4
2    977  1
3    978  0
4    979  0
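
A small sketch of the name-based selection, again assuming the sample data from the question in comma-separated form. It also shows the difference between passing a list of names (which gives a DataFrame) and a single name (which gives a Series), plus the equivalent explicit form with loc.

```python
import io

import pandas as pd

# Sample data reconstructed from the question (assumed to be comma-separated).
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""

data = pd.read_csv(io.StringIO(csv_text))

subset = data[["value", "f"]]        # list of names -> DataFrame
single = data["value"]               # single name -> Series
same = data.loc[:, ["value", "f"]]   # explicit label-based equivalent

print(type(subset).__name__, type(single).__name__)  # DataFrame Series
print(subset.equals(same))                           # True
```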
