Background
I have a data frame with integers. Each integer encodes a series of features that are either present or not present for that row.
I want these features to be named columns in my data frame.
Problem
My current solution blows up in memory and is extremely slow. How can I make it more memory-efficient?
import pandas as pd
df = pd.DataFrame({'some_int':range(5)})
df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).apply(pd.Series).rename(columns=dict(zip(range(4), ["f1", "f2", "f3", "f4"])))
   f1  f2  f3  f4
0   0   0   0   0
1   0   0   0   1
2   0   0   1   0
3   0   0   1   1
4   0   1   0   0
It seems to be the .apply(pd.Series) step that is slowing this down. Everything else is quite fast until I add it.
I cannot skip it, because a simple list will not make a data frame.
3 Solutions
#1
4
Here's a vectorized NumPy approach:
import numpy as np

def num2bin(nums, width):
    return ((nums[:, None] & (1 << np.arange(width-1, -1, -1))) != 0).astype(int)
Sample run:
In [70]: df
Out[70]:
   some_int
0         1
1         5
2         3
3         8
4         4
In [71]: pd.DataFrame(num2bin(df.some_int.values, 4),
    ...:              columns=["f1", "f2", "f3", "f4"])
Out[71]:
   f1  f2  f3  f4
0   0   0   0   1
1   0   1   0   1
2   0   0   1   1
3   1   0   0   0
4   0   1   0   0
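Putting the pieces together, here is a minimal self-contained sketch of how the bit columns could be attached to the original data frame. The feature names f1 to f4 come from the question; the use of pd.concat and the variable names names, features and result are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def num2bin(nums, width):
    # Test every number against the powers of two, most significant bit first.
    return ((nums[:, None] & (1 << np.arange(width - 1, -1, -1))) != 0).astype(int)

df = pd.DataFrame({'some_int': [1, 5, 3, 8, 4]})
names = ["f1", "f2", "f3", "f4"]

# Build all feature columns in one vectorized call and reuse the original
# index so the rows line up when the two frames are concatenated.
features = pd.DataFrame(num2bin(df['some_int'].values, len(names)),
                        columns=names, index=df.index)
result = pd.concat([df, features], axis=1)
print(result)
```

Passing the original index explicitly matters only when df does not have the default 0..n-1 index, but it keeps the concatenation safe in either case.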
Explanation
1) Inputs:
In [98]: nums = np.array([1,5,3,8,4])
In [99]: width = 4
2) Get the powers of two for each bit position, most significant bit first:
In [100]: (1 << np.arange(width-1,-1,-1))
Out[100]: array([8, 4, 2, 1])
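As a small check of this step, the sketch below compares the left-shift form with explicit powers of two. The names shifted and powered are made up for the example. Note that for widths close to the bit width of the integer dtype the shift can overflow, so this is only meant for modest widths.

```python
import numpy as np

width = 4
shifted = 1 << np.arange(width - 1, -1, -1)   # 1 shifted left by 3, 2, 1, 0
powered = 2 ** np.arange(width - 1, -1, -1)   # 2**3, 2**2, 2**1, 2**0

print(shifted)
print(np.array_equal(shifted, powered))
```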
3) Convert nums to a 2D column array, because we next want an element-wise bitwise AND between it and the powers of two, done in a vectorized manner following the rules of broadcasting:
In [101]: nums[:,None]
Out[101]:
array([[1],
[5],
[3],
[8],
[4]])
In [102]: nums[:,None] & (1 << np.arange(width-1,-1,-1))
Out[102]:
array([[0, 0, 0, 1],
[0, 4, 0, 1],
[0, 0, 2, 1],
[8, 0, 0, 0],
[0, 4, 0, 0]])
To understand the bitwise AND, consider the number 5 from nums and its bitwise AND against each of the powers of two [8,4,2,1]:
In [103]: 5 & 8 # 0101 & 1000
Out[103]: 0
In [104]: 5 & 4 # 0101 & 0100
Out[104]: 4
In [105]: 5 & 2 # 0101 & 0010
Out[105]: 0
In [106]: 5 & 1 # 0101 & 0001
Out[106]: 1
Thus, 5 shares no set bits with 8 and 2, whereas for 4 and 1 the result is non-zero.
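To make the broadcasting in this step concrete, the following sketch prints the shapes involved and checks one row against the scalar operations. The names column, powers and matches are illustrative only.

```python
import numpy as np

nums = np.array([1, 5, 3, 8, 4])
powers = 1 << np.arange(3, -1, -1)   # the four powers of two: 8, 4, 2, 1

column = nums[:, None]               # shape (5, 1): one row per number
matches = column & powers            # broadcast against shape (4,) to (5, 4)
print(column.shape, powers.shape, matches.shape)

# The row for the number 5 reproduces the scalar checks 5 & 8, 5 & 4, 5 & 2, 5 & 1.
print(matches[1].tolist())
```

Each row of the result holds one input number tested against every power of two, which is why no Python-level loop is needed.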
4) In the final stage, look for matches (non-zeros) and convert those to 1s and the rest to 0s by comparing against 0, which gives a boolean array, and then converting to int dtype:
In [107]: matches = nums[:,None] & (1 << np.arange(width-1,-1,-1))
In [108]: matches!=0
Out[108]:
array([[False, False, False, True],
[False, True, False, True],
[False, False, True, True],
[ True, False, False, False],
[False, True, False, False]], dtype=bool)
In [109]: (matches!=0).astype(int)
Out[109]:
array([[0, 0, 0, 1],
[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
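Since the question is about memory, one possible variation is to store the flags in a one-byte dtype instead of the default int. This is an assumption on top of the answer, not part of it, and num2bin_small is a hypothetical name. The intermediate AND result is still a full-size integer array, so only the final result shrinks.

```python
import numpy as np

def num2bin_small(nums, width, dtype=np.uint8):
    # Same bit test as num2bin, but each 0/1 flag is stored in a single byte.
    powers = 1 << np.arange(width - 1, -1, -1)
    return ((nums[:, None] & powers) != 0).astype(dtype)

nums = np.arange(100000)
as_int = ((nums[:, None] & (1 << np.arange(19, -1, -1))) != 0).astype(int)
as_u8 = num2bin_small(nums, 20)

# Compare the number of bytes held by each result array.
print(as_int.nbytes, as_u8.nbytes)
```

Passing dtype=bool instead would keep the boolean flags as they are, which also takes one byte per value in NumPy.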
Runtime test
In [58]: df = pd.DataFrame({'some_int':range(100000)})
# @jezrael's soln-1
In [59]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).values.tolist())
1 loops, best of 3: 198 ms per loop
# @jezrael's soln-2
In [60]: %timeit pd.DataFrame([list('{:20b}'.format(x)) for x in df['some_int'].values])
10 loops, best of 3: 154 ms per loop
# @jezrael's soln-3
In [61]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:20b}'.format(x))).values.tolist())
10 loops, best of 3: 132 ms per loop
# @MaxU's soln-1
In [62]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loops, best of 3: 193 ms per loop
# @MaxU's soln-2
In [64]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loops, best of 3: 11.8 s per loop
# Proposed in this post
In [65]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 5.64 ms per loop
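The %timeit lines above need IPython. For a plain Python script, a rough equivalent could look like the sketch below, which uses the standard timeit module. The helper names via_strings and via_numpy are made up for this example, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

def num2bin(nums, width):
    return ((nums[:, None] & (1 << np.arange(width - 1, -1, -1))) != 0).astype(int)

df = pd.DataFrame({'some_int': range(100000)})

def via_strings():
    # String-based approach: format each number and split it into characters.
    return pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])

def via_numpy():
    # Vectorized approach: one broadcasted bitwise AND for all rows.
    return pd.DataFrame(num2bin(df['some_int'].values, 20))

for func in (via_strings, via_numpy):
    # Best of 3 repeats with 1 call each, similar in spirit to %timeit.
    best = min(timeit.repeat(func, number=1, repeat=3))
    print('{}: {:.1f} ms'.format(func.__name__, best * 1000))
```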
#2
5
You can use the numpy.binary_repr function:
In [336]: df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=4)))) \
.add_prefix('f')
Out[336]:
  f0 f1 f2 f3
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0
or
In [346]: pd.DataFrame([list(np.binary_repr(x, width=4)) for x in df.some_int.values],
...: columns=np.arange(1,5)) \
...: .add_prefix('f')
...:
Out[346]:
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0
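One thing to keep in mind with the string-based approaches is that the cells hold the characters '0' and '1' rather than numbers. If numeric flags are needed, a conversion step along the lines of the sketch below could be added. The names as_text and as_numbers are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'some_int': range(5)})

as_text = pd.DataFrame([list(np.binary_repr(x, width=4)) for x in df['some_int'].values],
                       columns=["f1", "f2", "f3", "f4"])
print(as_text.dtypes)              # the cells are strings, not integers

as_numbers = as_text.astype(int)   # convert '0'/'1' to 0/1
print(as_numbers.dtypes)
```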
#3
3
I think you need:
a = pd.DataFrame(df['some_int'].astype(int)
.apply(bin)
.str[2:]
.str.zfill(4)
.apply(list).values.tolist(), columns=["f1","f2","f3","f4"])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0
Another solution, thanks Jon Clements and ayhan:
a = pd.DataFrame(df['some_int'].apply(lambda x: list('{:04b}'.format(x))).values.tolist(),
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0
A slightly changed version, using a list comprehension:
a = pd.DataFrame([list('{:04b}'.format(x)) for x in df['some_int'].values],
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0
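The leading zero in the format spec is what produces zero padding. Without it, as in the '{:20b}' spec used in the runtime test of the first answer, the string is padded with spaces instead. The short sketch below shows the difference, and also one way the width could be passed in rather than hard-coded. The variable names are illustrative.

```python
width = 4

zero_padded = '{:04b}'.format(5)    # padded on the left with zeros
space_padded = '{:4b}'.format(5)    # padded on the left with spaces
print(repr(zero_padded), repr(space_padded))

# The width can be supplied through a nested replacement field.
generic = '{:0{w}b}'.format(5, w=width)
print(repr(generic))
```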
Timings:
df = pd.DataFrame({'some_int':range(100000)})
In [80]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(20).apply(list).values.tolist())
1 loop, best of 3: 231 ms per loop
In [81]: %timeit pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])
1 loop, best of 3: 232 ms per loop
In [82]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:020b}'.format(x))).values.tolist())
1 loop, best of 3: 222 ms per loop
In [83]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loop, best of 3: 343 ms per loop
In [84]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loop, best of 3: 16.4 s per loop
In [87]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 11.4 ms per loop
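Before comparing speeds it can be worth confirming that the approaches agree. The following is a minimal sketch of such a check on a smaller input, assuming the num2bin function from the first answer. The names from_strings, from_numpy and same are made up for the example.

```python
import numpy as np
import pandas as pd

def num2bin(nums, width):
    return ((nums[:, None] & (1 << np.arange(width - 1, -1, -1))) != 0).astype(int)

df = pd.DataFrame({'some_int': range(1000)})
width = 20

# String-based result, converted from '0'/'1' characters to integers.
from_strings = pd.DataFrame(
    [list('{:0{w}b}'.format(x, w=width)) for x in df['some_int'].values]
).astype(int)

# Vectorized NumPy result.
from_numpy = pd.DataFrame(num2bin(df['some_int'].values, width))

same = np.array_equal(from_strings.values, from_numpy.values)
print(same)
```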