I have a large .csv file that is constantly being updated in real time. It contains several thousand lines in the following format:
time1,stockA,bid,1
time2,stockA,ask,1.1
time3,stockB,ask,2.1
time4,stockB,bid,2.0
time5,stockA,bid,1.1
time6,stockA,ask,1.2
What is the fastest way to read this into a dataframe that looks like this:
time   stock   bid  ask
time1  stockA  1
time2  stockA       1.1
time3  stockB       2.1
time4  stockB  2.0
time5  stockA  1.1
time6  stockA       1.2
Any help is appreciated
2 solutions
#1
You can use read_csv, specify header=None, and pass the column names as a list:
In [124]:
import io
import pandas as pd

t="""time1,stockA,bid,1
time2,stockA,ask,1.1
time3,stockB,ask,2.1
time4,stockB,bid,2.0"""
df = pd.read_csv(io.StringIO(t), header=None, names=['time', 'stock', 'bid', 'ask'])
df
Out[124]:
    time   stock  bid  ask
0  time1  stockA  bid  1.0
1  time2  stockA  ask  1.1
2  time3  stockB  ask  2.1
3  time4  stockB  bid  2.0
You can then re-encode the bid column as 1 (bid) or 2 (ask):
In [126]:
df['bid'] = df['bid'].replace('bid', 1)
df['bid'] = df['bid'].replace('ask', 2)
df
Out[126]:
    time   stock  bid  ask
0  time1  stockA    1  1.0
1  time2  stockA    2  1.1
2  time3  stockB    2  2.1
3  time4  stockB    1  2.0
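As a side note, the same re-encoding can be written as a single lookup with Series.map. This is only a sketch: the column names `side`, `price` and `side_code` are my own labels and do not come from the file.

```python
import io
import pandas as pd

raw = """time1,stockA,bid,1
time2,stockA,ask,1.1
time3,stockB,ask,2.1
time4,stockB,bid,2.0"""

# "side" and "price" are illustrative names; the file itself has no header.
df = pd.read_csv(io.StringIO(raw), header=None,
                 names=["time", "stock", "side", "price"])

# Look each label up in a dict: bid -> 1, ask -> 2.
# Labels missing from the dict would become NaN.
df["side_code"] = df["side"].map({"bid": 1, "ask": 2})
print(df)
```

This keeps the original text labels in `side` and puts the numeric codes in a separate column, which makes it easier to check the mapping.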
EDIT
Based on your updated sample data and desired output, the following works:
In [29]:
t="""time1,stockA,bid,1
time2,stockA,ask,1.1
time3,stockB,ask,2.1
time4,stockB,bid,2.0
time5,stockA,bid,1.1
time6,stockA,ask,1.2"""
df = pd.read_csv(io.StringIO(t), header=None, names=['time', 'stock', 'bid', 'ask'])
df
Out[29]:
    time   stock  bid  ask
0  time1  stockA  bid  1.0
1  time2  stockA  ask  1.1
2  time3  stockB  ask  2.1
3  time4  stockB  bid  2.0
4  time5  stockA  bid  1.1
5  time6  stockA  ask  1.2
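In [30]:
# For bid rows, move the price into the 'bid' column
df.loc[df['bid'] == 'bid', 'bid'] = df['ask']
# Rows not labelled 'ask' are bid rows: blank out their 'ask' value
df.loc[df['bid'] != 'ask', 'ask'] = ''
# For ask rows, blank out the 'ask' label left in the 'bid' column
df.loc[df['bid'] == 'ask','bid'] = ''
df
Out[30]:
    time   stock  bid  ask
0  time1  stockA    1
1  time2  stockA       1.1
2  time3  stockB       2.1
3  time4  stockB    2
4  time5  stockA  1.1
5  time6  stockA       1.2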
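If you would rather keep the columns numeric instead of filling them with empty strings, one possible variation is to build bid and ask with Series.where, which leaves NaN where the condition is false. This is a sketch under my own naming: `type` and `price` are labels I chose for the third and fourth fields.

```python
import io
import pandas as pd

raw = """time1,stockA,bid,1
time2,stockA,ask,1.1
time3,stockB,ask,2.1
time4,stockB,bid,2.0
time5,stockA,bid,1.1
time6,stockA,ask,1.2"""

# "type" and "price" are illustrative names; the file itself has no header.
df = pd.read_csv(io.StringIO(raw), header=None,
                 names=["time", "stock", "type", "price"])

# Keep the price only where the row has the matching type; NaN elsewhere.
df["bid"] = df["price"].where(df["type"] == "bid")
df["ask"] = df["price"].where(df["type"] == "ask")

out = df[["time", "stock", "bid", "ask"]]
print(out)
```

Both columns stay float64, so later arithmetic such as a spread calculation works without converting strings back to numbers.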
#2
Here is what I think is a more concise way.
df = pd.read_csv('prices.csv', header=None, names=['time', 'stock', 'type', 'prices'],
                 index_col=['time', 'stock', 'type'])
In [1062]:
df
Out[1062]:
                   prices
time  stock  type
time1 stockA bid      1.0
time2 stockA ask      1.1
time3 stockB ask      2.1
time4 stockB bid      2.0
time5 stockA bid      1.1
time6 stockA ask      1.2
time7 stockA high     1.5
time8 stockA low      0.5
I think that's what the DataFrame should look like. Then do
In [1064]:
df.unstack()
Out[1064]:
             prices
type            ask  bid  high  low
time  stock
time1 stockA    NaN  1.0   NaN  NaN
time2 stockA    1.1  NaN   NaN  NaN
time3 stockB    2.1  NaN   NaN  NaN
time4 stockB    NaN  2.0   NaN  NaN
time5 stockA    NaN  1.1   NaN  NaN
time6 stockA    1.2  NaN   NaN  NaN
time7 stockA    NaN  NaN   1.5  NaN
time8 stockA    NaN  NaN   NaN  0.5
You can fill the NaNs with whatever you prefer using df.fillna. Generally speaking, converting a column's values into column headers is called pivoting. .unstack pivots a level of a MultiIndex. You can check .pivot as well.
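For comparison, here is roughly what the .pivot route could look like on the sample data from the question. Treat it as a sketch: the names `type` and `price` are my own labels, and passing a list as `index` assumes pandas 1.1 or newer.

```python
import io
import pandas as pd

raw = """time1,stockA,bid,1
time2,stockA,ask,1.1
time3,stockB,ask,2.1
time4,stockB,bid,2.0
time5,stockA,bid,1.1
time6,stockA,ask,1.2"""

# "type" and "price" are illustrative names; the file itself has no header.
long_df = pd.read_csv(io.StringIO(raw), header=None,
                      names=["time", "stock", "type", "price"])

# One row per (time, stock); each distinct "type" value becomes a column.
wide = long_df.pivot(index=["time", "stock"], columns="type", values="price")

# Turn the index levels back into ordinary columns and drop the "type" label.
wide = wide.reset_index()
wide.columns.name = None
print(wide)
```

Like .unstack, .pivot raises an error if the same (time, stock, type) combination appears more than once, so duplicated ticks would need to be aggregated or dropped first.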