Pandas: convert a DataFrame to an array of tuples

Posted: 2022-09-15 15:49:11

I have manipulated some data using pandas and now I want to carry out a batch save back to the database. This requires me to convert the dataframe into an array of tuples, with each tuple corresponding to a "row" of the dataframe.


My DataFrame looks something like:


In [182]: data_set
Out[182]: 
  index data_date   data_1  data_2
0  14303 2012-02-17  24.75   25.03 
1  12009 2012-02-16  25.00   25.07 
2  11830 2012-02-15  24.99   25.15 
3  6274  2012-02-14  24.68   25.05 
4  2302  2012-02-13  24.62   24.77 
5  14085 2012-02-10  24.38   24.61 

I want to convert it to an array of tuples like:


[(datetime.date(2012,2,17),24.75,25.03),
(datetime.date(2012,2,16),25.00,25.07),
...etc. ]

Any suggestion on how I can efficiently do this?

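For context, here is a minimal sketch of the batch save such tuples would feed into, assuming a DB-API 2.0 driver (sqlite3 is used as a stand-in, and the quotes table name is made up):

import sqlite3
import datetime

# Example tuples in the shape asked for above; in practice they would come
# from the DataFrame via one of the answers below.
tuples = [(datetime.date(2012, 2, 17), 24.75, 25.03),
          (datetime.date(2012, 2, 16), 25.00, 25.07)]

conn = sqlite3.connect(':memory:')  # stand-in for the real database connection
conn.execute('CREATE TABLE quotes (data_date TEXT, data_1 REAL, data_2 REAL)')

# executemany() takes a sequence of parameter tuples, one per row
conn.executemany('INSERT INTO quotes VALUES (?, ?, ?)',
                 [(d.isoformat(), x, y) for d, x, y in tuples])
conn.commit()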

7 Answers

#1 (112 votes)

How about:


subset = data_set[['data_date', 'data_1', 'data_2']]
tuples = [tuple(x) for x in subset.values]
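On newer pandas versions the same idea can be written with to_numpy(), which is the documented replacement for the .values attribute (a small variation, assuming pandas 0.24 or later):

subset = data_set[['data_date', 'data_1', 'data_2']]
# to_numpy() is the recommended accessor for the underlying array in modern pandas
tuples = [tuple(x) for x in subset.to_numpy()]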

#2 (43 votes)

list(data_set.itertuples(index=False))

As of pandas 0.17.1, the above will return a list of namedtuples - see the docs.

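If plain tuples are needed rather than namedtuples (for example, for a DB driver's executemany()), itertuples() also accepts a name parameter; passing name=None yields ordinary tuples (a variation on the above, not part of the original answer):

# Select only the wanted columns, then iterate as plain tuples (name=None)
cols = data_set[['data_date', 'data_1', 'data_2']]
tuples = list(cols.itertuples(index=False, name=None))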

#3 (37 votes)

A generic way:


[tuple(x) for x in data_set.to_records(index=False)]

#4 (13 votes)

Motivation
Many data sets are large enough that we need to concern ourselves with speed/efficiency. So I offer this solution in that spirit. It happens to also be succinct.


For the sake of comparison, let's drop the index column:

# drop(columns=...) avoids the positional axis argument, which newer pandas versions no longer accept
df = data_set.drop(columns='index')

Solution
I'll propose the use of zip and a comprehension


list(zip(*[df[c].values.tolist() for c in df]))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]

It happens to also be flexible if we wanted to deal with a specific subset of columns. We'll assume the columns we've already displayed are the subset we want.


list(zip(*[df[c].values.tolist() for c in ['data_date', 'data_1', 'data_2']]))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]

All of the following produce the same results:

  • [tuple(x) for x in df.values]
  • df.to_records(index=False).tolist()
  • list(map(tuple, df.values))
  • list(map(tuple, df.itertuples(index=False)))

Which is quicker?
zip with a comprehension is faster by a large margin:

%timeit [tuple(x) for x in df.values]
%timeit list(map(tuple, df.itertuples(index=False)))
%timeit df.to_records(index=False).tolist()
%timeit list(map(tuple,df.values))
%timeit list(zip(*[df[c].values.tolist() for c in df]))

small data


10000 loops, best of 3: 55.7 µs per loop
1000 loops, best of 3: 596 µs per loop
10000 loops, best of 3: 38.2 µs per loop
10000 loops, best of 3: 54.3 µs per loop
100000 loops, best of 3: 12.9 µs per loop

large data


10 loops, best of 3: 58.8 ms per loop
10 loops, best of 3: 43.9 ms per loop
10 loops, best of 3: 29.3 ms per loop
10 loops, best of 3: 53.7 ms per loop
100 loops, best of 3: 6.09 ms per loop
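For reference, a rough sketch of how such a timing run could be reproduced in IPython; the exact size of the "large" frame above is not stated, so the replication factor here is an assumption:

import pandas as pd

# Build a larger frame by repeating the six-row example many times
large = pd.concat([df] * 10000, ignore_index=True)

%timeit list(zip(*[large[c].values.tolist() for c in large]))
%timeit [tuple(x) for x in large.values]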

#5 (7 votes)

Here's a vectorized approach (assuming the dataframe data_set is instead named df) that returns a list of tuples as shown:

>>> df.set_index(['data_date'])[['data_1', 'data_2']].to_records().tolist()

produces:


[(datetime.datetime(2012, 2, 17, 0, 0), 24.75, 25.03),
 (datetime.datetime(2012, 2, 16, 0, 0), 25.0, 25.07),
 (datetime.datetime(2012, 2, 15, 0, 0), 24.99, 25.15),
 (datetime.datetime(2012, 2, 14, 0, 0), 24.68, 25.05),
 (datetime.datetime(2012, 2, 13, 0, 0), 24.62, 24.77),
 (datetime.datetime(2012, 2, 10, 0, 0), 24.38, 24.61)]

The idea of setting the datetime column as the index axis is to aid in converting the Timestamp values to their corresponding datetime.datetime equivalents, by making use of the convert_datetime64 argument in DataFrame.to_records, which does this for a DataFrame with a DatetimeIndex.

This returns a recarray, which can then be converted to a list using .tolist().


A more generalized solution, depending on the use case, would be:

df.to_records().tolist()                              # Supply index=False to exclude index
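Since the question asks for datetime.date values rather than full timestamps, one possible tweak (assuming data_date is, or can be parsed as, a datetime column) is to go through the .dt.date accessor before zipping; this is an addition for illustration, not part of the original answer:

import pandas as pd

# Convert the timestamp column to datetime.date objects, then zip the columns together
dates = pd.to_datetime(data_set['data_date']).dt.date
tuples = list(zip(dates, data_set['data_1'], data_set['data_2']))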

#6 (2 votes)

A more pythonic way:

df = data_set[['data_date', 'data_1', 'data_2']]
# In Python 3, map() returns an iterator; wrap it in list() to materialize the tuples
list(map(tuple, df.values))

#7 (0 votes)

# try this one:

tuples = list(zip(data_set["data_date"], data_set["data_1"], data_set["data_2"]))
print(tuples)
