I have a large matrix that I would like to transpose without having to bring it into memory. There are three ways I can think of to accomplish this:
- Write the original matrix to a .txt file column by column. Later, read it into memory row by row with readLines(...), and sequentially write these rows to a new file. The problem with this approach is that I do not know how to append to a .txt file by column rather than by row.
- Read the matrix from the .txt file column by column, then write the columns to a new file by row. I have tried this with scan(pipe("cut -f1 filename.txt")), but this opens a separate connection at each iteration and therefore takes far too long to complete, because of the overhead of opening and closing these connections.
- Use some unknown R function to complete the task.
Is there something I am missing here? Do I need to do this with a separate program? Thanks in advance for the help!
3 solutions
#1
3
There are a lot of languages that are way better at this kind of thing. If you really want to use R, you will have to read the file in one row at a time, take one element from the column you want, store it in a vector, and then write that vector as a row. And do that for every column.
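Columns <- 1e9   # number of columns in the input (placeholder, set to your data)
Rows <- 1e6      # number of rows in the input (placeholder, set to your data)
FileName <- "YourFile.csv"
NewFile <- "NewFileName"
for (i in seq_len(Columns))
{
  # One input column has one entry per input row
  ColumnToBeRow <- numeric(Rows)
  for (j in seq_len(Rows))
  {
    # Read only line j of the file and keep its i-th field
    ColumnToBeRow[j] <- read.csv(FileName, nrows = 1, skip = j - 1, header = FALSE)[[i]]
  }
  # write.csv() ignores append=, so append the new row with cat() instead
  cat(paste(ColumnToBeRow, collapse = ","), "\n", sep = "", file = NewFile, append = TRUE)
}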
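The loop above reopens the file once per cell, which is very slow. A less costly variant of the same idea, sketched here under the assumption that one block of columns (all rows) fits in memory, makes one pass over the file per block of columns. The function name transpose_by_column_blocks and its arguments are hypothetical, introduced only for illustration; the skipping of unwanted columns relies on read.table's colClasses = "NULL".

```r
# Transpose a delimited numeric file with one pass over the input per block
# of columns. transpose_by_column_blocks() is a hypothetical helper, not a
# base R function.
transpose_by_column_blocks <- function(infile, outfile, ncols,
                                       block = 100L, sep = ",") {
  out <- file(outfile, open = "w")
  on.exit(close(out), add = TRUE)
  for (start in seq(1L, ncols, by = block)) {
    cols <- start:min(start + block - 1L, ncols)
    # "NULL" in colClasses tells read.table to skip that column entirely
    classes <- rep("NULL", ncols)
    classes[cols] <- "numeric"
    part <- read.table(infile, sep = sep, header = FALSE, colClasses = classes)
    # Each input column kept in this block becomes one output row
    write.table(t(as.matrix(part)), out, sep = sep,
                row.names = FALSE, col.names = FALSE)
  }
  invisible(outfile)
}
```

A larger block means fewer passes over the input but more memory per pass, so block is the knob to tune.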
#2
1
This post to the R-help mailing list includes my naive (pseudo?) code to split the input file into n transposed output files, then tile across chunks of the n output files (in a checkerboard fashion) to stitch the transposed columns back together. It's efficient to do this in chunks of rows in both the transpose and stitch phases. It's worth asking what you're hoping to do after transposing the matrix, since that generates a file that still won't fit in memory. There is also a scholarly literature on efficient out-of-memory matrix transposition (e.g.).
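The split-and-stitch idea can be sketched roughly as follows. This is not the code from the mailing-list post. It is a simplified illustration that assumes a whitespace-delimited numeric file with a known number of columns, and it stitches one line at a time rather than in larger tiles. The name transpose_split_stitch and its arguments are hypothetical.

```r
# Phase 1: read the input in chunks of rows and write each chunk transposed.
# Phase 2: line k of every part file, pasted side by side, is output row k.
# transpose_split_stitch() is a hypothetical helper written for illustration.
transpose_split_stitch <- function(infile, outfile, ncols, chunk_rows = 1000L) {
  con <- file(infile, open = "r")
  parts <- character(0)
  repeat {
    # scan() continues from the current position of an open connection
    x <- scan(con, what = numeric(), nlines = chunk_rows, quiet = TRUE)
    if (length(x) == 0L) break
    # x holds the chunk row by row, so filling a matrix with ncols rows
    # column by column yields the transposed chunk directly
    part <- tempfile(fileext = ".txt")
    write.table(matrix(x, nrow = ncols), part,
                row.names = FALSE, col.names = FALSE)
    parts <- c(parts, part)
  }
  close(con)

  ins <- lapply(parts, file, open = "r")
  out <- file(outfile, open = "w")
  for (k in seq_len(ncols)) {
    # Take the next line from every part file and join them into one row
    pieces <- vapply(ins, function(p) readLines(p, n = 1L), character(1))
    writeLines(paste(pieces, collapse = " "), out)
  }
  close(out)
  for (p in ins) close(p)
  unlink(parts)
  invisible(outfile)
}
```

Only one chunk of rows is held in memory in the first phase and one line per part file in the second. R caps the number of simultaneously open connections, so chunk_rows should be large enough to keep the number of part files small.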
#3
0
scan can read it in as a stream; all you need to add is the number of rows. Since your original matrix has a dimension attribute, you just need to save its number of columns and use it as the number of rows when reading the file back in.
MASS::write.matrix(matrix(1:30, 6), file="test.txt")
matrix( scan("test.txt"), 5)
#-------------
Read 30 items
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24
[5,]   25   26   27   28   29   30
I suspect that your own code to write the rows of a matrix as lines will not be as fast as what Ripley's MASS package achieves, but if I'm wrong, you should offer the improvement to Prof Ripley.
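To see why this works, here is a small base-R check that does not need MASS. It is only a sketch on a toy matrix that fits in memory. The file is written row by row, scan returns the values in file order, and filling a matrix column by column with nrow set to the original column count gives the transpose. The variable names are illustrative.

```r
m <- matrix(1:30, nrow = 6)          # original 6 x 5 matrix
f <- tempfile(fileext = ".txt")

# write() emits values in column-major order, so passing t(m) with
# ncolumns = ncol(m) puts one row of m on each line of the file
write(t(m), file = f, ncolumns = ncol(m))

# Reading the stream back with nrow = ncol(m) yields the transpose
tm <- matrix(scan(f, quiet = TRUE), nrow = ncol(m))

stopifnot(identical(dim(tm), c(5L, 6L)), all(tm == t(m)))
```

Note that the transposed result still has to fit in memory for the final matrix() call.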