Are Pandas dataframes (Python) closer to R's dataframes or data.tables?

Date: 2021-11-19 22:55:34

To understand my question, I should first point out that R data.tables aren't just R dataframes with syntactic sugar; there are important behavioral differences: column assignment/modification by reference in data.tables avoids copying the whole object in memory (see the example in this Quora answer), as happens with dataframes.

I've found on multiple occasions that the speed and memory differences that arise from data.table's behavior are a crucial element: they make it possible to work with some big datasets that would be out of reach with data.frame's behavior.

Therefore, what I'm wondering is: in Python, how do Pandas' dataframes behave in this regard?

Bonus question: if Pandas' dataframes are closer to R dataframes than to R data.tables, and have the same downside (a full copy of the object when assigning/modifying a column), is there a Python equivalent to R's data.table package?


EDIT (per comment request): code examples:

R dataframes:

# renaming a column
colnames(mydataframe)[1] <- "new_column_name"

R data.tables:

# renaming a column
library(data.table)
setnames(mydatatable, 'old_column_name', 'new_column_name')

In Pandas:

mydataframe.rename(columns={'old_column_name': 'new_column_name'}, inplace=True)
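As a side note, a minimal sketch (variable names are mine, not from the question): without inplace=True, rename() returns a new DataFrame and leaves the original's labels untouched:

```python
import pandas as pd

# Sketch: rename() without inplace=True returns a new DataFrame
# and does not change the original's column labels.
mydataframe = pd.DataFrame({'old_column_name': [1, 2, 3]})
renamed = mydataframe.rename(columns={'old_column_name': 'new_column_name'})

print(list(mydataframe.columns))  # ['old_column_name']
print(list(renamed.columns))      # ['new_column_name']
```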

1 Answer

#1


Pandas operates more like data.frame in this regard. You can check this using the memory_profiler package; here's an example of its use in the Jupyter notebook:

First define a program that will test this:

%%file df_memprofile.py
import numpy as np
import pandas as pd

def foo():
    x = np.random.rand(1000000, 5)
    y = pd.DataFrame(x, columns=list('abcde'))
    y.rename(columns = {'e': 'f'}, inplace=True)
    return y

Then load the memory profiler, and run and profile the function:

%load_ext memory_profiler
from df_memprofile import foo
%mprun -f foo foo()

I get the following output:

Filename: /Users/jakevdp/df_memprofile.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.1 MiB     66.1 MiB   def foo():
     5    104.2 MiB     38.2 MiB       x = np.random.rand(1000000, 5)
     6    104.4 MiB      0.2 MiB       y = pd.DataFrame(x, columns=list('abcde'))
     7    142.6 MiB     38.2 MiB       y.rename(columns = {'e': 'f'}, inplace=True)
     8    142.6 MiB      0.0 MiB       return y

You can see a couple things:

  1. When y is created, it is just a light wrapper around the original array: i.e. no data is copied.

  2. When the column in y is renamed, it results in duplication of the entire data array in memory (it's the same 38MB increment as when x is created in the first place).

So, unless I'm missing something, it appears that Pandas operates more like R's dataframes than R's data tables.
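A lighter-weight way to check the same point, without memory_profiler, is NumPy's np.shares_memory (this sketch and its variable names are mine; copy=False is passed explicitly because newer pandas versions with Copy-on-Write enabled may otherwise copy the array at construction):

```python
import numpy as np
import pandas as pd

x = np.random.rand(1000, 5)
# copy=False asks pandas to wrap the array rather than copy it.
y = pd.DataFrame(x, columns=list('abcde'), copy=False)

# The DataFrame's values share memory with the original ndarray:
print(np.shares_memory(x, y.values))  # True

# rename() with default copy semantics returns a frame whose data
# may live in a fresh buffer, matching the 38 MiB jump seen above
# (whether it actually copies is version-dependent).
z = y.rename(columns={'e': 'f'})
print(list(z.columns))  # ['a', 'b', 'c', 'd', 'f']
```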


Edit: Note that rename() has an argument copy that controls this behavior, and defaults to True. For example, using this:

y.rename(columns={'e': 'f'}, inplace=True, copy=False)

... results in an inplace operation without copying data.

Alternatively, you can modify the columns attribute directly:

y.columns = ['a', 'b', 'c', 'd', 'f']
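Assigning to the columns attribute replaces only the label index and leaves the data buffer alone, which can be checked the same way (a sketch under the assumption that the frame was built without copying, via copy=False):

```python
import numpy as np
import pandas as pd

x = np.random.rand(1000, 5)
y = pd.DataFrame(x, columns=list('abcde'), copy=False)

# Replacing the labels touches only the column index, not the data:
y.columns = ['a', 'b', 'c', 'd', 'f']

print(np.shares_memory(x, y.values))  # still True: same buffer
```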
