dplyr包:如何使用'%xyz ' SQL语法查询大型数据帧?

时间:2022-08-21 22:34:02

dplyr is the only package that can handle my 843k data.frame and query it in a fast way. I can filter fine using some math and equal criteria, however I need to implement a search for a concept.

dplyr是唯一可以处理我的843k数据。frame并快速查询的软件包。我可以使用一些数学和等条件进行筛选,但是我需要实现对概念的搜索。

I need something like this sqldf query

我需要类似于sqldf查询的东西

library(sqldf)
head(iris)
sqldf("select * from iris where lower(Species) like '%nica%'")

In dplyr help I was not able to find how I could do it. something like:

在dplyr的帮助下,我无法找到我能做的。喜欢的东西:

filter(iris,Species like '%something%')

The starting and ending % is very important. Also, note that the data frame has 800+k rows so traditional R functions may run slow. It has to bee a dplyr based solution.

起始和结束%非常重要。另外,请注意,数据帧有800+k行,因此传统的R函数可能运行缓慢。它必须是基于dplyr的解决方案。

2 个解决方案

#1


5  

What about this -

这- - - - - -

library(dplyr)
data(iris)
filter(iris, grepl("nica",Species))

EDIT: Another option - the function %like% in data.table()

编辑:另一个选项-函数%,如data.table()中的%

library(dplyr)
data(iris)
##
Iris <- iris[
  rep(seq_len(nrow(iris)),each=5000),
  ]
dim(Iris)
[1] 750000      5
##
library(microbenchmark)
library(data.table)
##
Dt <- data.table(Iris)
setkeyv(Dt,cols="Species")
##
foo <- function(){
  subI <- filter(Iris, grepl("nica",Species))
}
##
foo2 <- function(){
  subI <- Dt[Species %like% "nica"]
}
##
foo3 <- function(){
  subI <- filter(Iris, Species %like% "nica")
}
Res <- microbenchmark(
  foo(),foo2(),foo3(),
  times=100L)
##
> Res
Unit: milliseconds
   expr       min        lq    median        uq      max neval
  foo() 114.31080 122.12303 131.15523 136.33254 214.0405   100
 foo2()  23.00508  30.33685  39.77843  41.49121 129.9125   100
 foo3()  18.84933  22.47958  29.39228  35.96649 114.4389   100

#2


2  

full code would be (including the lowercase)

完整的代码(包括小写)

require(data.table)
iris %>% filter(tolower(Species) %like% 'nica')

#1


5  

What about this -

这- - - - - -

library(dplyr)
data(iris)
filter(iris, grepl("nica",Species))

EDIT: Another option - the function %like% in data.table()

编辑:另一个选项-函数%,如data.table()中的%

library(dplyr)
data(iris)
##
Iris <- iris[
  rep(seq_len(nrow(iris)),each=5000),
  ]
dim(Iris)
[1] 750000      5
##
library(microbenchmark)
library(data.table)
##
Dt <- data.table(Iris)
setkeyv(Dt,cols="Species")
##
foo <- function(){
  subI <- filter(Iris, grepl("nica",Species))
}
##
foo2 <- function(){
  subI <- Dt[Species %like% "nica"]
}
##
foo3 <- function(){
  subI <- filter(Iris, Species %like% "nica")
}
Res <- microbenchmark(
  foo(),foo2(),foo3(),
  times=100L)
##
> Res
Unit: milliseconds
   expr       min        lq    median        uq      max neval
  foo() 114.31080 122.12303 131.15523 136.33254 214.0405   100
 foo2()  23.00508  30.33685  39.77843  41.49121 129.9125   100
 foo3()  18.84933  22.47958  29.39228  35.96649 114.4389   100

#2


2  

full code would be (including the lowercase)

完整的代码(包括小写)

require(data.table)
iris %>% filter(tolower(Species) %like% 'nica')