To understand my question, I should first point out that R datatables aren't just R dataframes with syntactic sugar; there are important behavioral differences: column assignment/modification by reference in datatables avoids copying the whole object in memory (see the example in this Quora answer), which is what happens with dataframes.
I've found on multiple occasions that the speed and memory differences that arise from data.table's behavior are a crucial element that makes it possible to work with some big datasets that would be unworkable with data.frame's behavior.
Therefore, what I'm wondering is: in Python, how do Pandas dataframes behave in this regard?
Bonus question: if Pandas dataframes are closer to R dataframes than to R datatables, and have the same downside (a full copy of the object when assigning/modifying a column), is there a Python equivalent to R's data.table package?
EDIT (per comment request): code examples:
R dataframes:
# renaming a column
colnames(mydataframe)[1] <- "new_column_name"
R datatables:
# renaming a column
library(data.table)
setnames(mydatatable, 'old_column_name', 'new_column_name')
In Pandas:
mydataframe.rename(columns = {'old_column_name': 'new_column_name'}, inplace=True)
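For reference, plain column assignment/modification in Pandas looks like the following (just an illustration of the kind of operation I mean, with made-up names; whether it copies the rest of the object under the hood is precisely what I'm asking):

import pandas as pd

# illustrative data; adding/overwriting a column mutates the existing object,
# but whether the other columns get copied internally is the open question
mydataframe = pd.DataFrame({'old_column_name': [1, 2, 3]})
mydataframe['new_column'] = mydataframe['old_column_name'] * 2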
1 Answer
#1
Pandas operates more like data.frame in this regard. You can check this using the memory_profiler package; here's an example of its use in the Jupyter notebook:
First define a program that will test this:
%%file df_memprofile.py
import numpy as np
import pandas as pd

def foo():
    x = np.random.rand(1000000, 5)
    y = pd.DataFrame(x, columns=list('abcde'))
    y.rename(columns = {'e': 'f'}, inplace=True)
    return y
Then load the memory profiler and run + profile the function
%load_ext memory_profiler
from df_memprofile import foo
%mprun -f foo foo()
I get the following output:
Filename: /Users/jakevdp/df_memprofile.py
Line # Mem usage Increment Line Contents
================================================
4 66.1 MiB 66.1 MiB def foo():
5 104.2 MiB 38.2 MiB x = np.random.rand(1000000, 5)
6 104.4 MiB 0.2 MiB y = pd.DataFrame(x, columns=list('abcde'))
7 142.6 MiB 38.2 MiB y.rename(columns = {'e': 'f'}, inplace=True)
8 142.6 MiB 0.0 MiB return y
You can see a couple things:
- When y is created, it is just a light wrapper around the original array: i.e. no data is copied.
- When the column in y is renamed, it results in duplication of the entire data array in memory (it's the same 38MB increment as when x is created in the first place).
So, unless I'm missing something, it appears that Pandas operates more like R's dataframes than R's data tables.
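As a quick cross-check without the profiler, you can ask NumPy whether a frame still shares its buffer with the source array. This is only a sketch, and the exact result may depend on your pandas version (for instance with copy-on-write enabled), but it points the same way:

import numpy as np
import pandas as pd

x = np.random.rand(1000000, 5)
y = pd.DataFrame(x, columns=list('abcde'))

# the freshly constructed frame is expected to wrap x without copying
print(np.shares_memory(x, y.values))   # True

# a rename with default settings is expected to duplicate the data block,
# matching the ~38 MiB jump in the profiler output above
z = y.rename(columns={'e': 'f'})
print(np.shares_memory(x, z.values))   # False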
Edit: Note that rename() has an argument copy that controls this behavior, and defaults to True. For example, using this:
y.rename(columns = {'e': 'f'}, inplace=True, copy=False)
... results in an inplace operation without copying data.
Alternatively, you can modify the columns attribute directly:
y.columns = ['a', 'b', 'c', 'd', 'f']
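Running the same shares-memory check after that assignment (reusing x and y from the sketch above) suggests the buffer is left untouched:

# renaming via the columns attribute is expected to leave the data in place
print(np.shares_memory(x, y.values))   # True: still the original array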