Why is reading rows faster than reading columns?

Date: 2022-01-21 21:29:01

I am analysing a dataset with 200 rows and 1200 columns, stored in a .CSV file. To process it, I read the file using R's read.csv() function.

R takes ≈ 600 seconds to read this dataset. Later I had an idea: I transposed the data inside the .CSV file and tried to read it again using read.csv(). I was amazed to see that it only took ≈ 20 seconds. As you can see, that is ≈ 30 times faster.

I verified this over the following runs:

Reading 200 rows and 1200 columns (Not transposed)

> system.time(dat <- read.csv(file = "data.csv", sep = ",", header = F))

   user  system elapsed 
 610.98    6.54  618.42 # 1st iteration
 568.27    5.83  574.47 # 2nd iteration
 521.13    4.73  525.97 # 3rd iteration
 618.31    3.11  621.98 # 4th iteration
 603.85    3.29  607.50 # 5th iteration

Reading 1200 rows and 200 columns (Transposed)

> system.time(dat <- read.csv(file = "data_transposed.csv",
      sep = ",", header = F))

   user  system elapsed 
  17.23    0.73   17.97 # 1st iteration
  17.11    0.69   17.79 # 2nd iteration
  20.70    0.89   21.61 # 3rd iteration
  18.28    0.82   19.11 # 4th iteration
  18.37    1.61   20.01 # 5th iteration

In any dataset, rows hold the observations and columns hold the variables to be observed. Transposing changes this structure of the data. Is it good practice to transpose the data for processing, even though it makes the data look weird?

I am wondering what makes R read the dataset fast when I transpose the data. I am sure it is because the dimensions were originally 200 * 1200 and became 1200 * 200 after the transpose operation. Why does R read the data faster when I transpose it?
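
For what it's worth, one way to keep the usual rows-as-observations layout would be to read the transposed file and then flip it back in memory. This is only a minimal sketch, using the file names above:

## read the transposed (long) file, then flip it back to 200 x 1200 in memory
dat <- read.csv(file = "data_transposed.csv", sep = ",", header = F)
dat <- as.data.frame(t(dat))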


Update: Research & experiments


I initially asked this question because RStudio was taking a long time to read and compute on a high-dimensional dataset (many columns compared to rows: 200 rows, 1200 columns), using the built-in R function read.csv(). Following the suggestions in the comments below, I later experimented with read.csv2() and fread(); they all work well, but they are also slow on my original dataset [200 rows * 1200 columns] and read the transposed dataset faster.
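
A minimal sketch of the kind of comparison I ran (assuming the data.table package is installed and the same two files as above; read.csv2() is omitted here because it expects semicolon-separated files):

## compare readers on the original (wide) and transposed (long) files
library(data.table)

system.time(read.csv("data.csv", header = F))             # 200 rows * 1200 columns
system.time(read.csv("data_transposed.csv", header = F))  # 1200 rows * 200 columns

system.time(fread("data.csv", header = F))
system.time(fread("data_transposed.csv", header = F))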

I observed that this also holds for MS Excel and LibreOffice Calc. I even tried opening the files in the Sublime Text editor, and even for this text editor the transposed data was easy (fast) to read. I am still unable to figure out why all these applications behave this way. All these apps get into trouble when your data has many more columns than rows.

So, to wrap up the whole story, I have only 3 questions.

  1. What kind of issue is it? Is it related to the operating system, or is it an application-level problem?
  2. Is it good practice to transpose the data for processing?
  3. Why do R and/or other apps read my data faster when I transpose it?

My experiments perhaps only helped me rediscover some 'already known' wisdom, but I couldn't find anything relevant on the internet. Kindly share such good programming/data-analysis practices.

2 solutions

#1


7  

Your question is basically: is reading a long dataset much faster than reading a wide dataset?

What I give here is not going to be the final answer, but a new starting point.

For any performance-related issue, it is always better to profile than to guess. system.time is good, but it only tells you the total run time, not how that time is split internally. If you take a quick glance at the source code of read.table (read.csv is merely a wrapper around read.table), you will see that it has three stages:

  1. call scan to read in 5 rows of your data. I am not entirely sure about the purpose of this part;
  2. call scan to read in your complete data. Basically this reads your data column by column into a list of character strings, where each column is a "record";
  3. type conversion, either implicitly by type.convert, or explicitly (if you have specified column classes) by, say, as.numeric, as.Date, etc.

The first two stages are done at C level, while the final stage is done at R level, with a for loop through all records.
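
As a rough illustration only (this is a sketch, not the actual implementation), stages 2 and 3 correspond to something like the following, here using the question's 1200-column file:

## stage 2: scan() reads every column as a character "record"
raw <- scan("data.csv", what = rep(list(""), 1200), sep = ",", quiet = TRUE)

## stage 3: type conversion at R level, looping over the records
dat <- as.data.frame(lapply(raw, type.convert, as.is = TRUE))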

A basic profiling tool is Rprof, together with summaryRprof. The following is a very, very simple example.

## configure size
m <- 10000
n <- 100

## a very very simple example, where all data are numeric
x <- runif(m * n)

## long and wide .csv
write.csv(matrix(x, m, n), file = "long.csv", row.names = FALSE, quote = FALSE)
write.csv(matrix(x, n, m), file = "wide.csv", row.names = FALSE, quote = FALSE)

## profiling (sample stage)
Rprof("long.out")
long <- read.csv("long.csv")
Rprof(NULL)

Rprof("wide.out")
wide <- read.csv("wide.csv")
Rprof(NULL)

## profiling (report stage)
summaryRprof("long.out")[c(2, 4)]
summaryRprof("wide.out")[c(2, 4)]

The c(2, 4) extracts the "by.total" time for all R-level functions with enough samples, and the "total CPU time" (which may be lower than wall-clock time). The following is what I get on my Intel i5-2557M @ 1.1 GHz (turbo boost disabled), Sandy Bridge 2011.

## "long.csv"
#$by.total
#               total.time total.pct self.time self.pct
#"read.csv"            7.0       100       0.0        0
#"read.table"          7.0       100       0.0        0
#"scan"                6.3        90       6.3       90
#".External2"          0.7        10       0.7       10
#"type.convert"        0.7        10       0.0        0
#
#$sampling.time
#[1] 7

## "wide.csv"
#$by.total
#               total.time total.pct self.time self.pct
#"read.table"        25.86    100.00      0.06     0.23
#"read.csv"          25.86    100.00      0.00     0.00
#"scan"              23.22     89.79     23.22    89.79
#"type.convert"       2.22      8.58      0.38     1.47
#"match.arg"          1.20      4.64      0.46     1.78
#"eval"               0.66      2.55      0.12     0.46
#".External2"         0.64      2.47      0.64     2.47
#"parent.frame"       0.50      1.93      0.50     1.93
#".External"          0.30      1.16      0.30     1.16
#"formals"            0.08      0.31      0.04     0.15
#"make.names"         0.04      0.15      0.04     0.15
#"sys.function"       0.04      0.15      0.02     0.08
#"as.character"       0.02      0.08      0.02     0.08
#"c"                  0.02      0.08      0.02     0.08
#"lapply"             0.02      0.08      0.02     0.08
#"sys.parent"         0.02      0.08      0.02     0.08
#"sapply"             0.02      0.08      0.00     0.00
#
#$sampling.time
#[1] 25.86

So reading a long dataset takes 7s CPU time, while reading a wide dataset takes 25.86s CPU time.

It might be confusing at first glance that more functions are reported for the wide case. In fact, both the long and the wide case execute the same set of functions, but the long case is faster, so many functions take less time than the sampling interval (0.02 s) and hence cannot be measured.

But anyway, the run time is dominated by scan and type.convert (implicit type conversion). For this example, we see that

  • type conversion is not too costly even though it is done at R level; for both the long and the wide case it accounts for no more than 10% of the time;
  • scan is basically all that read.csv is doing, but unfortunately we are unable to further divide this time into stage 1 and stage 2. Don't take it for granted that stage 1 will be very fast just because it only reads in 5 rows; in debugging mode I actually found that stage 1 can take quite a long time.

So what should we do next?

  • It would be great if we could find a way to measure the time spent in the stage-1 and stage-2 scan calls (a rough sketch of one way to approximate this follows this list);
  • You might want to profile more general cases, where your dataset has a mix of data classes.
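
A rough sketch of one way to approximate this, by timing the two scan calls directly on the "wide.csv" file generated above (this is not exactly what read.table does internally):

## approximate stage 1: scan the first 5 data lines as character
system.time(
  head5 <- scan("wide.csv", what = "", sep = ",", nlines = 5, skip = 1, quiet = TRUE)
)

## approximate stage 2: scan the full file, one character "record" per column
n_col <- length(strsplit(readLines("wide.csv", n = 1), ",")[[1]])
system.time(
  full <- scan("wide.csv", what = rep(list(""), n_col), sep = ",", skip = 1, quiet = TRUE)
)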

#2


1  

Wide datasets are typically slower to read into memory than long datasets (i.e. the transposed version). This affects many programs that read data, such as R, Python, Excel, etc., though this description is more pertinent to R:

  • R needs to allocate memory for each cell, even if it is NA. This means that every column has at least as many cells as the number of rows in the csv file, whereas in a long dataset you can potentially drop the NA values and save some space;
  • R has to guess the data type for each value and make sure it is consistent with the data type of the column, which also introduces overhead.

Since your dataset doesn't appear to contain any NA values, my hunch is that you're seeing the speed improvement because of the second point. You can test this theory by passing colClasses = rep('numeric', 1200) to read.csv or fread for the original 1200-column dataset, or rep('numeric', 200) for the transposed 200-column one, which should decrease the overhead of guessing data types.
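
A sketch of that test, using the file names from the question (the column count passed to colClasses must match the file being read):

## declare every column numeric up front to skip type guessing
dat <- read.csv("data.csv", header = F, colClasses = rep('numeric', 1200))
dat_t <- read.csv("data_transposed.csv", header = F, colClasses = rep('numeric', 200))

## fread accepts the same argument
library(data.table)
dat_dt <- fread("data.csv", header = F, colClasses = rep('numeric', 1200))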
