dplyr is the only package that can handle my 843k data.frame and query it in a fast way. I can filter fine using some math and equal criteria, however I need to implement a search for a concept.
dplyr是唯一可以处理我的843k数据。frame并快速查询的软件包。我可以使用一些数学和等条件进行筛选,但是我需要实现对概念的搜索。
I need something like this sqldf query
我需要类似于sqldf查询的东西
library(sqldf)
head(iris)
sqldf("select * from iris where lower(Species) like '%nica%'")
In dplyr help I was not able to find how I could do it. something like:
在dplyr的帮助下,我无法找到我能做的。喜欢的东西:
filter(iris,Species like '%something%')
The starting and ending % is very important. Also, note that the data frame has 800+k rows so traditional R functions may run slow. It has to bee a dplyr based solution.
起始和结束%非常重要。另外,请注意,数据帧有800+k行,因此传统的R函数可能运行缓慢。它必须是基于dplyr的解决方案。
2 个解决方案
#1
5
What about this -
这- - - - - -
library(dplyr)
data(iris)
filter(iris, grepl("nica",Species))
EDIT: Another option - the function %like%
in data.table()
编辑:另一个选项-函数%,如data.table()中的%
library(dplyr)
data(iris)
##
Iris <- iris[
rep(seq_len(nrow(iris)),each=5000),
]
dim(Iris)
[1] 750000 5
##
library(microbenchmark)
library(data.table)
##
Dt <- data.table(Iris)
setkeyv(Dt,cols="Species")
##
foo <- function(){
subI <- filter(Iris, grepl("nica",Species))
}
##
foo2 <- function(){
subI <- Dt[Species %like% "nica"]
}
##
foo3 <- function(){
subI <- filter(Iris, Species %like% "nica")
}
Res <- microbenchmark(
foo(),foo2(),foo3(),
times=100L)
##
> Res
Unit: milliseconds
expr min lq median uq max neval
foo() 114.31080 122.12303 131.15523 136.33254 214.0405 100
foo2() 23.00508 30.33685 39.77843 41.49121 129.9125 100
foo3() 18.84933 22.47958 29.39228 35.96649 114.4389 100
#2
2
full code would be (including the lowercase)
完整的代码(包括小写)
require(data.table)
iris %>% filter(tolower(Species) %like% 'nica')
#1
5
What about this -
这- - - - - -
library(dplyr)
data(iris)
filter(iris, grepl("nica",Species))
EDIT: Another option - the function %like%
in data.table()
编辑:另一个选项-函数%,如data.table()中的%
library(dplyr)
data(iris)
##
Iris <- iris[
rep(seq_len(nrow(iris)),each=5000),
]
dim(Iris)
[1] 750000 5
##
library(microbenchmark)
library(data.table)
##
Dt <- data.table(Iris)
setkeyv(Dt,cols="Species")
##
foo <- function(){
subI <- filter(Iris, grepl("nica",Species))
}
##
foo2 <- function(){
subI <- Dt[Species %like% "nica"]
}
##
foo3 <- function(){
subI <- filter(Iris, Species %like% "nica")
}
Res <- microbenchmark(
foo(),foo2(),foo3(),
times=100L)
##
> Res
Unit: milliseconds
expr min lq median uq max neval
foo() 114.31080 122.12303 131.15523 136.33254 214.0405 100
foo2() 23.00508 30.33685 39.77843 41.49121 129.9125 100
foo3() 18.84933 22.47958 29.39228 35.96649 114.4389 100
#2
2
full code would be (including the lowercase)
完整的代码(包括小写)
require(data.table)
iris %>% filter(tolower(Species) %like% 'nica')