Read/write data in libsvm format.

Time: 2022-10-16 22:54:29

How do I read/write libsvm data into/from R?

The libsvm format is sparse data like

<class/target>[ <attribute number>:<attribute value>]*

(cf. Compressed Row Storage (CRS)) e.g.,

1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:03
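
For illustration, the first line above encodes a target of 1 and a feature vector that is zero everywhere except at attributes 10, 123, and 34567; a dense equivalent built by hand in R would be:

target = 1
v = numeric(34567)                         # all zeros
v[c(10, 123, 34567)] = c(3.4, 0.5, 0.231)  # the three stored attributes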

I am sure I can whip up something myself, but I would much rather use something off the shelf. However, the R library foreign does not seem to provide the necessary functionality.

6 Answers

#1


12  

e1071 is off the shelf:

install.packages("e1071")
library(e1071)
read.matrix.csr(...)
write.matrix.csr(...)
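
A quick usage sketch (the file names here are made up; for a file that includes labels, read.matrix.csr should return a list whose x component is a SparseM matrix.csr and whose y component holds the labels):

library(e1071)
dat = read.matrix.csr("train.libsvm")                     # list with $x and $y
write.matrix.csr(dat$x, file = "copy.libsvm", y = dat$y)  # round-trip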

Note: it is implemented in R, not in C, so it is dog-slow.

It even has a special vignette, Support Vector Machines—the Interface to libsvm in package e1071.

r.vw is bundled with vowpal_wabbit.

Note: it is implemented in R, not in C, so it is dog-slow.

#2


9  

I have been running a job using the zygmuntz solution on a dataset with 25k observations (rows) for almost 5 hrs now. It has done 3k-ish rows. It was taking so long that I coded this up in the meantime (based on zygmuntz's code):

require(Matrix)

read.libsvm = function(filename) {
  content   = readLines(filename)
  num_lines = length(content)
  # one (row, column, value) triplet per line for the label;
  # the label is read as the first character of the line and tagged with column -1
  tomakemat = cbind(1:num_lines, -1, substr(content, 1, 1))

  # loop over lines: split on spaces, drop the label, then split each
  # "index:value" pair into an (index, value) row tagged with the line number
  makemat = rbind(
    tomakemat,
    do.call(rbind, lapply(1:num_lines, function(i) {
      line = as.vector(strsplit(content[i], ' ')[[1]])
      cbind(i, t(simplify2array(strsplit(line[-1], ':'))))
    })))
  class(makemat) = "numeric"

  # build the sparse matrix; shifting j by 2 puts the label in column 1
  # and attribute k in column k + 2
  yx = sparseMatrix(i = makemat[, 1],
                    j = makemat[, 2] + 2,
                    x = makemat[, 3])
  return(yx)
}
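
A quick usage sketch based on the example lines from the question (labels changed to integers, since this reader keeps only the first character of each label):

tmp = tempfile(fileext = ".libsvm")
writeLines(c("1 10:3.4 123:0.5 34567:0.231",
             "0 22:1 456:3"), tmp)
m = read.libsvm(tmp)
dim(m)     # 2 x 34569: column 1 holds the label, attribute k lands in column k + 2
m[1, 1]    # 1   (the label of the first line)
m[1, 12]   # 3.4 (attribute 10 of the first line)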

This ran in minutes on the same machine (there may have been memory issues with zygmuntz solution too, not sure). Hope this helps anyone with the same problem.

Remember, if you need to do big computations in R, VECTORIZE!

EDIT: fixed an indexing error I found this morning.

#3


3  

I came up with my own ad hoc solution leveraging some data.table utilities.

It ran in almost no time on the test data set I found (Boston Housing data).

Converting that to a data.table (orthogonal to solution, but adding here for easy reproducibility):

library(data.table)
x = fread("/media/data_drive/housing.data.fw",
          sep = "\n", header = FALSE)
#usually fixed-width conversion is harder, but everything here is numeric
columns =  c("CRIM", "ZN", "INDUS", "CHAS",
             "NOX", "RM", "AGE", "DIS", "RAD", 
             "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
DT = with(x, fread(paste(gsub("\\s+", "\t", V1), collapse = "\n"),
                   header = FALSE, sep = "\t",
                   col.names = columns))

Here it is:

DT[ , fwrite(as.data.table(paste0(
  MEDV, " | ", sapply(transpose(lapply(
    names(.SD), function(jj)
      paste0(jj, ":", get(jj)))),
    paste, collapse = " "))), 
  "/path/to/output", col.names = FALSE, quote = FALSE),
  .SDcols = !"MEDV"]
#what gets sent to as.data.table:
#[1] "24 | CRIM:0.00632 ZN:18 INDUS:2.31 CHAS:0 NOX:0.538 RM:6.575 
#  AGE:65.2 DIS:4.09 RAD:1 TAX:296 PTRATIO:15.3 B:396.9 LSTAT:4.98 MEDV:24"      
#[2] "21.6 | CRIM:0.02731 ZN:0 INDUS:7.07 CHAS:0 NOX:0.469 RM:6.421 
#  AGE:78.9 DIS:4.9671 RAD:2 TAX:242 PTRATIO:17.8 B:396.9 LSTAT:9.14 MEDV:21.6"
# ...

There may be a better way to get this understood by fwrite than as.data.table, but I can't think of one (until setDT works on vectors).

I replicated this to test its performance on a bigger data set (just blow up the current data set):

DT2 = rbindlist(replicate(1000, DT, simplify = FALSE))

The operation was pretty fast compared to some of the times reported here (I haven't bothered comparing directly yet):

system.time(.)
#    user  system elapsed 
#   8.392   0.000   8.385 

I also tested using writeLines instead of fwrite, but the latter was better.


Looking at this again, I see it might take a while to figure out what's going on. Maybe the magrittr-piped version will be easier to follow:

library(magrittr)  # provides the %>% pipe used below
DT[ , 
    #1) prepend each column's values with the column name
    lapply(names(.SD), function(jj)
      paste0(jj, ":", get(jj))) %>%
      #2) transpose this list (using data.table's fast tool)
      #   (was column-wise, now row-wise)
      #3) concatenate columns, separated by " "
      transpose %>% sapply(paste, collapse = " ") %>%
      #4) prepend each row with the target value
      #   (with Vowpal Wabbit in mind, separate with a pipe)
      paste0(MEDV, " | ", .) %>%
      #5) convert this to a data.table to use fwrite
      as.data.table %>%
      #6) fwrite it; exclude nonsense column name,
      #   and force quotes off
      fwrite("/path/to/data", 
             col.names = FALSE, quote = FALSE),
  .SDcols = !"MEDV"]

Reading in such files is much easier**

#quickly read data; don't split within lines
x = fread("/path/to/data", sep = "\n", header = FALSE)

#tstrsplit is transpose(strsplit(.))
dt1 = x[ , tstrsplit(V1, split = "[| :]+")]

#even columns have variable names
nms = c("target_name", 
        unlist(dt1[1L, seq(2L, ncol(dt1), by = 2L), 
                   with = FALSE]))

#odd columns have values
DT = dt1[ , seq(1L, ncol(dt1), by = 2L), with = FALSE]
#add meaningful names
setnames(DT, nms)

**this will not work with "ragged"/sparse input data. I don't think there's a way to extend this to work in such cases.

#4


2  

Try these functions and examples:

https://github.com/zygmuntz/r-libsvm-format-read-write

#5


2  

Based on some comments, I am adding this as an answer so it's easier for others to use. This writes data in libsvm format.

A function to write a data.frame to SVMlight format. I've added a train = {TRUE, FALSE} argument in case the data doesn't have labels; in that case, the class index is ignored.

write.libsvm = function(data, filename= "out.dat", class = 1, train=TRUE) {
  out = file(filename)
  if(train){
    writeLines(apply(data, 1, function(X) {
      paste(X[class], 
            apply(cbind(which(X!=0)[-class], 
                        X[which(X!=0)[-class]]), 
                  1, paste, collapse=":"), 
            collapse=" ") 
      }), out)
  } else {
    # leaves 1 as default for the new data without predictions. 
    writeLines(apply(data, 1, function(X) {
      paste('1',
            apply(cbind(which(X!=0), X[which(X!=0)]), 1, paste, collapse=":"), 
            collapse=" ") 
      }), out)
  }
  close(out) 
}
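
A minimal usage sketch (the data frame here is made up; labels should be nonzero for the class column to be excluded correctly, and the attribute indices written are column positions within data, so the label column counts as column 1):

df = data.frame(y = c(1, 2), x1 = c(3.4, 0), x2 = c(0, 1.7))
write.libsvm(df, filename = "example.dat", class = 1, train = TRUE)
# example.dat:
# 1 2:3.4
# 2 3:1.7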

** EDIT **

Another option, in case you already have the data in a data.table object:

libfm and SVMlight have the same format, so this function should work.

library(data.table)

data.table.fm <- function (data = X, fileName = "../out.fm", target = "y_train", 
    train = TRUE) {
    if (train) {
        if (is.logical(data[[target]]) | sum(levels(factor(data[[target]])) == 
            levels(factor(c(0, 1)))) == 2) {
            data[[target]][data[[target]] == TRUE] = 1
            data[[target]][data[[target]] == FALSE] = -1
        }
    }
    specChar = "\\(|\\)|\\||\\:"
    specCharSpace = "\\(|\\)|\\||\\:| "
    parsingNames <- function(x) {
        ret = c()
        for (el in x) ret = append(ret, gsub(specCharSpace, "_", 
            el))
        ret
    }
    parsingVar <- function(x, keepSpace, hard_parse) {
        if (!keepSpace) 
            spch = specCharSpace
        else spch = specChar
        if (hard_parse) 
            gsub("(^_( *|_*)+)|(^_$)|(( *|_*)+_$)|( +_+ +)", 
                " ", gsub(specChar, "_", gsub("(^ +)|( +$)", 
                  "", x)))
        else gsub(spch, "_", x)
    }
    setnames(data, names(data), parsingNames(names(data)))
    target = parsingNames(target)
    format_vw <- function(column, formater) {
        ifelse(as.logical(column), sprintf(formater, j, column), 
            "")
    }
    all_vars = names(data)[!names(data) %in% target]
    cat("Reordering data.table if class isn't first\n")
    target_inx = which(names(data) %in% target)
    rest_inx = which(!names(data) %in% target)
    cat("Adding Variable names to data.table\n")
    for (j in rest_inx) {
        column = data[[j]]
        formater = "%s:%f"
        set(data, i = NULL, j = j, value = format_vw(column, 
            formater))
        cat(sprintf("Fixing %s\n", j))
    }
    data = data[, c(target_inx, rest_inx), with = FALSE]
    drop_extra_space <- function(x) {
        gsub(" {1,}", " ", x)
    }
    cat("Pasting data - Removing extra spaces\n")
    data = apply(data, 1, function(x) drop_extra_space(paste(x, 
        collapse = " ")))
    cat("Writing to disk\n")
    write.table(data, file = fileName, sep = " ", row.names = FALSE, 
        col.names = FALSE, quote = FALSE)
}
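
A hedged usage sketch (the data here is made up; a logical target is recoded to 1/-1, zero entries are dropped, and each kept value is written as "<original column position>:<value>"):

DT = data.table(y_train = c(TRUE, FALSE), f1 = c(0.5, 0), f2 = c(0, 2))
data.table.fm(DT, fileName = "train.fm", target = "y_train", train = TRUE)
# train.fm should contain, roughly:
# 1 2:0.500000
# -1 3:2.000000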

#6


0  

I went with a two-hop solution, converting the R data to another format first and then to LIBSVM:

  1. Used the R package foreign to convert (and write out) the data frame to ARFF format (modified write.arff, changing write.table to use na="0.0" instead of na="?", since otherwise step 2 fails); see the sketch after this list.
  2. Used https://github.com/dat/svm-tools/blob/master/arff2svm.py to convert the ARFF format to LIBSVM.
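
A minimal sketch of step 1 (assuming a data frame named df; the output file name is made up, and the na = "0.0" tweak from the answer would need a locally modified copy of write.arff rather than the stock one):

library(foreign)
# stock write.arff; the answer's version swaps na = "?" for na = "0.0"
write.arff(df, file = "data.arff")
# then run arff2svm.py on data.arff to produce the LIBSVM file
# (see that script's usage for its exact command-line arguments)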

My data set is 200K x 500 and this only took 3-5 minutes.
