Creating a sparse matrix from a data frame

Time: 2021-05-17 08:57:56

I'm doing an assignment in which I am trying to build a collaborative filtering model for the Netflix Prize data. The data is in a CSV file, which I easily imported into a data frame. I now need to create a sparse matrix with users as rows and movies as columns, where each cell holds the corresponding rating. To fill in the values I currently loop over every row of the data frame, which takes a long time in R. Can anyone suggest a better approach? Here is the sample code and data:

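library(Matrix)

buildUserMovieMatrix <- function(trainingData)
{
  # Start from an all-zero sparse matrix: one row per user, one column per movie
  UIMatrix <- Matrix(0, nrow = max(trainingData$UserID), ncol = max(trainingData$MovieID), sparse = TRUE)
  # Fill in one rating per iteration (this loop is the slow part)
  for (i in seq_len(nrow(trainingData)))
  {
    UIMatrix[trainingData$UserID[i], trainingData$MovieID[i]] <- trainingData$Rating[i]
  }
  return(UIMatrix)
}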

Sample of data in the dataframe from which the sparse matrix is being created:

    MovieID UserID  Rating
1       1      2       3
2       2      3       3
3       2      4       4
4       2      6       3
5       2      7       3

So in the end I want something like this, where the columns are the movie IDs and the rows are the user IDs:

    1   2   3   4   5   6   7
1   0   0   0   0   0   0   0
2   3   0   0   0   0   0   0
3   0   3   0   0   0   0   0
4   0   4   0   0   0   0   0
5   0   0   0   0   0   0   0
6   0   3   0   0   0   0   0
7   0   3   0   0   0   0   0

So the interpretation is: user 2 rated movie 1 with 3 stars, user 3 rated movie 2 with 3 stars, and so on for the other users and movies. My data frame has about 8,500,000 rows, and my code takes about 30-45 minutes to create this user-item matrix. I would appreciate any suggestions.

2 Answers

#1


13  

The Matrix package has a constructor made especially for your type of data:

library(Matrix)
UIMatrix <- sparseMatrix(i = trainingData$UserID,
                         j = trainingData$MovieID,
                         x = trainingData$Rating)
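
As a minimal sketch, here is the constructor applied to the five sample rows from the question. The dims argument is optional; it is included here only on the assumption that the full ID ranges are known in advance (7 users and 7 movies, as in the desired output). Without it, the dimensions are taken from the largest indices present.

```r
library(Matrix)

# The five sample rows from the question
trainingData <- data.frame(
  MovieID = c(1, 2, 2, 2, 2),
  UserID  = c(2, 3, 4, 6, 7),
  Rating  = c(3, 3, 4, 3, 3)
)

# Users as rows (i), movies as columns (j), ratings as values (x).
# dims = c(7, 7) is an assumption made to match the 7 x 7 output above.
UIMatrix <- sparseMatrix(i = trainingData$UserID,
                         j = trainingData$MovieID,
                         x = trainingData$Rating,
                         dims = c(7, 7))

UIMatrix[2, 1]  # rating that user 2 gave movie 1
UIMatrix[4, 2]  # rating that user 4 gave movie 2
```

Note that if the same (UserID, MovieID) pair appears more than once, sparseMatrix adds the corresponding values together by default, so it is worth checking for duplicated rows beforehand.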

Otherwise, you might like to know about a handy feature of the [ function known as matrix indexing. You could have tried:

buildUserMovieMatrix <- function(trainingData) {
  UIMatrix <- Matrix(0, nrow = max(trainingData$UserID),
                        ncol = max(trainingData$MovieID), sparse = TRUE);
  UIMatrix[cbind(trainingData$UserID,
                 trainingData$MovieID)] <- trainingData$Rating;
  return(UIMatrix);
}

(but I would definitely recommend the sparseMatrix approach over this.)
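
To illustrate what matrix indexing means, here is a small sketch on an ordinary base R matrix (the names m and idx are only illustrative). A two-column matrix passed to [ is read as one (row, column) pair per row, so all cells are assigned in a single vectorised call.

```r
m <- matrix(0, nrow = 3, ncol = 3)

# Each row of idx is one (row, column) pair
idx <- cbind(c(1, 2, 3),
             c(3, 1, 2))

# Assign three cells in one vectorised call, no loop needed
m[idx] <- c(10, 20, 30)

m[1, 3]  # first pair:  row 1, column 3
m[2, 1]  # second pair: row 2, column 1
m[3, 2]  # third pair:  row 3, column 2
```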

#2


9  

This will probably be faster than a loop.

library(reshape2)
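# value.var names the rating column explicitly instead of letting dcast guess it
m <- dcast(df, UserID ~ MovieID, value.var = "Rating", fill = 0)[-1]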
m
#   1 2
# 1 3 0
# 2 0 3
# 3 0 4
# 4 0 3
# 5 0 3

If you use a data.table, it will be a lot faster:

library(data.table)
DT <- as.data.table(df)
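# On a data.table, [-1] would drop the first row, so drop the ID column by name
m  <- dcast(DT, UserID ~ MovieID, value.var = "Rating", fill = 0)[, !"UserID"]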

And as I'm sure someone will point out, you can use this instead:

setDT(df)
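# On a data.table, [-1] would drop the first row, so drop the ID column by name
m  <- dcast(df, UserID ~ MovieID, value.var = "Rating", fill = 0)[, !"UserID"]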

This converts df to a data.table in place (without making a copy). If your data set is enormous, that can make a difference.
