I have a dataset in which, in one of the columns, each element is a list. I would like to flatten it, such that every list element gets a row of its own.
I managed to solve it with iterrows, dict, and append (see below), but it is too slow on my real DF, which is large. Is there a way to make things faster?
I could consider replacing the list-per-element column with another format (maybe a hierarchical df?) if that would make more sense.
EDIT: I have many columns, and some might change in the future. The only thing I know for sure is that I have the fields column. That's why I used dict in my solution.
A minimal example, creating a df to play with:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("""
id|name|fields
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]
"""), sep='|')
# strip the surrounding brackets, then split on commas
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
print(df)
resulting df:
id name fields
0 1 abc [qq, ww, rr]
1 2 efg [zz, xx, rr]
my (slow) solution:
new_df = pd.DataFrame(index=[], columns=df.columns)
for _, i in df.iterrows():
    flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
    new_df = new_df.append(flattened_d)
Resulting in:
id name fields
0 1.0 abc qq
1 1.0 abc ww
2 1.0 abc rr
0 2.0 efg zz
1 2.0 efg xx
2 2.0 efg rr
3 Answers
#1
1
You can break the lists in the fields column into multiple columns by applying pandas.Series to fields and then joining the result back to id and name, like so:
cols = df.columns[df.columns != 'fields'].tolist()  # adapted from @jezrael
df = df[cols].join(df.fields.apply(pd.Series))
Then you can melt the resulting new columns using set_index and stack, and then reset the index:
df = df.set_index(cols).stack().reset_index()
Finally, drop the redundant column generated by reset_index and rename the generated column to "field":
df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})
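Put together, a minimal sketch of the whole pipeline (assuming the df from the question, where id and name are the only other columns, so the stacked level comes back named level_2):

cols = df.columns[df.columns != 'fields'].tolist()
df = (df[cols].join(df.fields.apply(pd.Series))
              .set_index(cols)
              .stack()
              .reset_index()
              .drop('level_2', axis=1)  # the stacked column level
              .rename(columns={0: 'field'}))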
#2
4
You can use numpy for better performance:
Both solutions mainly use numpy.repeat.
from itertools import chain
import numpy as np

# the length of each list tells numpy.repeat how many copies of each row to make
vals = df.fields.str.len()
df1 = pd.DataFrame({
    "id": np.repeat(df.id.values, vals),
    "name": np.repeat(df.name.values, vals),
    "fields": list(chain.from_iterable(df.fields))})
df1 = df1.reindex(columns=df.columns)  # reindex_axis is deprecated in newer pandas
print(df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
Another solution:
df[['id','name']].values converts the columns to a numpy array; numpy.repeat duplicates each row by the length of its list, numpy.hstack flattens the lists of values, and numpy.column_stack glues the two together.
df1 = pd.DataFrame(np.column_stack((
    df[['id','name']].values.repeat(list(map(len, df.fields)), axis=0),
    np.hstack(df.fields))),
    columns=df.columns)
print(df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
A more general solution filters out the fields column first and then adds it back in the DataFrame constructor, since it is always the last column:
cols = df.columns[df.columns != 'fields'].tolist()
print(cols)
['id', 'name']
df1 = pd.DataFrame(np.column_stack((
    df[cols].values.repeat(list(map(len, df.fields)), axis=0),
    np.hstack(df.fields))),
    columns=cols + ['fields'])
print(df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
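As a side note: on pandas 0.25 or newer, DataFrame.explode does this flattening in a single call and keeps every other column automatically, which also covers the "many columns" case from the question:

# requires pandas >= 0.25; no need to name the non-list columns
df1 = df.explode('fields').reset_index(drop=True)
print(df1)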
#3
2
If your CSV is many thousands of lines long, then using_string_methods (below) may be faster than using_iterrows or using_repeat:
With
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
In [210]: %timeit using_string_methods(csv)
10 loops, best of 3: 100 ms per loop
In [211]: %timeit using_itertuples(csv)
10 loops, best of 3: 119 ms per loop
In [212]: %timeit using_repeat(csv)
10 loops, best of 3: 126 ms per loop
In [213]: %timeit using_iterrows(csv)
1 loop, best of 3: 1min 7s per loop
So for a 10000-line CSV, using_string_methods is over 600x faster than using_iterrows, and marginally faster than using_repeat.
import numpy as np  # needed by using_repeat
import pandas as pd
try: from cStringIO import StringIO  # for Python2
except ImportError: from io import StringIO  # for Python3
def using_string_methods(csv):
    df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
    other_columns = df.columns.difference(['fields']).tolist()
    fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
                          .str.split(r',', expand=True))
    df = pd.concat([df.drop('fields', axis=1), fields], axis=1)
    result = (pd.melt(df, id_vars=other_columns, value_name='field')
                .drop('variable', axis=1))
    result = result.dropna(subset=['field'])
    return result
def using_iterrows(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    new_df = pd.DataFrame(index=[], columns=df.columns)
    for _, i in df.iterrows():
        flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
        new_df = new_df.append(flattened_d)
    return new_df
def using_repeat(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    cols = df.columns[df.columns != 'fields'].tolist()
    df1 = pd.DataFrame(np.column_stack(
        (df[cols].values.repeat(list(map(len, df.fields)), axis=0),
         np.hstack(df.fields))), columns=cols + ['fields'])
    return df1
def using_itertuples(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    other_columns = df.columns.difference(['fields']).tolist()
    data = []
    for tup in df.itertuples():
        data.extend([[getattr(tup, col) for col in other_columns] + [field]
                     for field in tup.fields])
    return pd.DataFrame(data, columns=other_columns + ['field'])
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
Generally, fast NumPy/Pandas operations are possible only when the data is in a native NumPy dtype (such as int64 or float64, or strings). Once you place lists (a non-native NumPy dtype) in a DataFrame, the jig is up -- you are forced to use Python-speed loops to process the lists.
So to improve performance, you need to avoid placing lists in a DataFrame.
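You can see this on the question's df: the column of lists is stored with dtype object, so pandas cannot vectorize operations on it:

print(df['fields'].dtype)  # object -- each cell holds a Python list
print(df['id'].dtype)      # int64  -- a native NumPy dtype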
using_string_methods loads the fields data as strings:
df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
and avoids using the apply method (which is generally as slow as a plain Python loop):
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
Instead, it uses faster vectorized string methods to break the strings up into separate columns:
fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
.str.split(r',', expand=True))
Once you have the fields in separate columns, you can use pd.melt to reshape the DataFrame into the desired format.
pd.melt(df, id_vars=['id', 'name'], value_name='field')
By the way, you might be interested to see that, with a slight modification, using_iterrows can be just as fast as using_repeat. I show the changes in using_itertuples. df.itertuples tends to be slightly faster than df.iterrows, but the difference is minor. The majority of the speed gain comes from avoiding calls to df.append in a for-loop, since that leads to quadratic copying.
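A minimal sketch of that difference, using hypothetical toy rows (DataFrame.append is deprecated in recent pandas and is shown here only to illustrate the cost):

rows = [{'id': i, 'name': 'x', 'fields': 'f'} for i in range(1000)]

# quadratic: every append copies all rows accumulated so far
slow = pd.DataFrame(columns=['id', 'name', 'fields'])
for r in rows:
    slow = slow.append(r, ignore_index=True)

# linear: accumulate plain Python objects, build the DataFrame once
fast = pd.DataFrame(rows, columns=['id', 'name', 'fields'])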