I am working with HCUP data in which a single column holds ranges of values that need to be expanded into multiple rows. Below is the HCUP data frame for reference:
code label
61000-61003 excision of CNS
0169T-0169T ventricular shunt
The desired output is:
code label
61000 excision of CNS
61001 excision of CNS
61002 excision of CNS
61003 excision of CNS
0169T ventricular shunt
My approach uses the splitstackshape package with this code:
library(data.table)
library(splitstackshape)
cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]
This approach leads to memory issues. Is there a better approach to this problem?
Some comments:
- The data contains many letters besides "T".
- The letter can be at the very front or the very end of a code, but never between two digits.
- The letter never changes within a single range (for example, from "T" to "U").
5 solutions
#1
7
Here's a solution using dplyr and all.is.numeric from Hmisc:
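library(dplyr)
library(Hmisc)   # all.is.numeric()
library(tidyr)   # separate(), unnest(), extract_numeric()
library(stringr) # str_pad(), used in the edited version below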
dat %>% separate(code, into=c("code1", "code2")) %>%
  rowwise %>%
  mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                        list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                        list(code1))) %>%
  unnest(lists) %>%
  select(code = lists, label)
Source: local data frame [5 x 2]
code label
(chr) (fctr)
1 61000 excision of CNS
2 61001 excision of CNS
3 61002 excision of CNS
4 61003 excision of CNS
5 0169T ventricular shunt
An edit to handle ranges whose codes contain letters. It makes the solution a little less simple:
dff %>% mutate(row = row_number()) %>%
separate(code, into=c("code1", "code2")) %>%
group_by(row) %>%
summarise(lists = if(all.is.numeric(c(code1, code2)))
{list(str_pad(as.character(
seq(from = as.numeric(code1), to = as.numeric(code2))),
nchar(code1), pad="0"))}
else if(grepl("^[0-9]", code1))
{list(str_pad(paste0(as.character(
seq(from = extract_numeric(code1), to = extract_numeric(code2))),
strsplit(code1, "[0-9]+")[[1]][2]),
nchar(code1), pad = "0"))}
else
{list(paste0(
strsplit(code1, "[0-9]+")[[1]],
str_pad(as.character(
seq(from = extract_numeric(code1), to = extract_numeric(code2))),
nchar(gsub("[^0-9]", "", code1)), pad="0")))},
label = first(label)) %>%
unnest(lists) %>%
select(-row)
Source: local data frame [15 x 2]
label lists
(chr) (chr)
1 excision of CNS 61000
2 excision of CNS 61001
3 excision of CNS 61002
4 ventricular shunt 0169T
5 ventricular shunt 0170T
6 ventricular shunt 0171T
7 excision of CNS 01000
8 excision of CNS 01001
9 excision of CNS 01002
10 some procedure A2543
11 some procedure A2544
12 some procedure A2545
13 some procedure A0543
14 some procedure A0544
15 some procedure A0545
data:
dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002",
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "some procedure", "some procedure")), .Names = c("code",
"label"), row.names = c(NA, 5L), class = "data.frame")
#2
6
Original Answer: See below for update.
First, I made your example data a little more challenging by appending a copy of the first row to the bottom.
dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003"
), label = c("excision of CNS", "ventricular shunt", "excision of CNS"
)), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")
dff
# code label
# 1 61000-61003 excision of CNS
# 2 0169T-0169T ventricular shunt
# 3 61000-61003 excision of CNS
We can use the sequence operator : to expand the code column, wrapping it in tryCatch() so that values that cannot be sequenced are kept as-is instead of raising an error. First we split the values on the dash (-) with strsplit(), then run the pieces through lapply().
xx <- lapply(
strsplit(dff$code, "-", fixed = TRUE),
function(x) tryCatch(x[1]:x[2], warning = function(w) x)
)
data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx)))
# code label
# 1 61000 excision of CNS
# 2 61001 excision of CNS
# 3 61002 excision of CNS
# 4 61003 excision of CNS
# 5 0169T ventricular shunt
# 6 0169T ventricular shunt
# 7 61000 excision of CNS
# 8 61001 excision of CNS
# 9 61002 excision of CNS
# 10 61003 excision of CNS
We apply the sequence operator : to each element returned by strsplit(). If x[1]:x[2] cannot be evaluated, the handler returns the original values for that element; otherwise we get the sequence x[1]:x[2]. Then we replicate the values of the label column according to the lengths of the elements of xx to get the new label column.
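One detail in the output above is worth noting: when a range cannot be sequenced, the handler returns both endpoints, so 0169T-0169T yields two identical rows (rows 5 and 6). Here is a minimal sketch of one way to avoid that, assuming a non-sequenceable range should collapse to its distinct endpoints. The unique() and as.character() calls are my additions for illustration, not part of the original answer.

```r
# Example data: one numeric range and one range that ":" cannot sequence
dff <- data.frame(
  code  = c("61000-61003", "0169T-0169T"),
  label = c("excision of CNS", "ventricular shunt"),
  stringsAsFactors = FALSE
)

xx <- lapply(
  strsplit(dff$code, "-", fixed = TRUE),
  function(x) tryCatch(
    as.character(x[1]:x[2]),         # numeric-looking endpoints expand normally
    warning = function(w) unique(x)  # coercion warning: keep the distinct endpoints
  )
)

# Repeat each label once per expanded code
out <- data.frame(
  code  = unlist(xx),
  label = rep(dff$label, lengths(xx)),
  stringsAsFactors = FALSE
)
out
```

This variant still does not expand ranges that contain letters; it only removes the duplicated row for a range whose two endpoints are the same.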
Update: Here is what I've come up with in response to your edit. Replace xx above with:
xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
  s <- stringi::stri_locate_first_regex(x, "[A-Z]")
  nc <- nchar(x)[1L]
  fmt <- function(n) paste0("%0", n, "d")
  if(!all(is.na(s))) {
    ss <- s[1,1]
    fmt <- fmt(nc-1)
    if(ss == 1L) {
      xx <- substr(x, 2, nc)
      paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
    } else {
      xx <- substr(x, 1, ss-1)
      paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
    }
  } else {
    sprintf(fmt(nc), x[1]:x[2])
  }
})
Yep, it's complicated. Now if we take the following data frame df2 as a test case
df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003",
"T0169-T0174"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "ventricular shunt")), .Names = c("code",
"label"), row.names = c(NA, 4L), class = "data.frame")
and run the xx code from above on it (with df2 in place of dff), we get the following result.
data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))
# code label
# 1 61000 excision of CNS
# 2 61001 excision of CNS
# 3 61002 excision of CNS
# 4 61003 excision of CNS
# 5 0169T ventricular shunt
# 6 0170T ventricular shunt
# 7 0171T ventricular shunt
# 8 0172T ventricular shunt
# 9 0173T ventricular shunt
# 10 0174T ventricular shunt
# 11 61000 excision of CNS
# 12 61001 excision of CNS
# 13 61002 excision of CNS
# 14 61003 excision of CNS
# 15 T0169 ventricular shunt
# 16 T0170 ventricular shunt
# 17 T0171 ventricular shunt
# 18 T0172 ventricular shunt
# 19 T0173 ventricular shunt
# 20 T0174 ventricular shunt
#3
3
Create a sequencing rule for such codes:
seq_code <- function(from,to){
  ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x)
  pre = unique(sapply(list(from,to), ext, part = 1 ))
  suf = unique(sapply(list(from,to), ext, part = 3 ))
  if (length(pre) > 1 | length(suf) > 1){
    return("NO!")
  }
  num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2))))
  len = nchar(from)-nchar(pre)-nchar(suf)
  paste0(pre, sprintf(paste0("%0",len,"d"), num), suf)
}
With @jeremycg's example:
library(data.table) # setDT(), tstrsplit()
setDT(dff)[,.(
  label = label[1],
  code = do.call(seq_code, tstrsplit(code,'-'))
), by=.(row=seq(nrow(dff)))]
which gives
row label code
1: 1 excision of CNS 61000
2: 1 excision of CNS 61001
3: 1 excision of CNS 61002
4: 2 ventricular shunt 0169T
5: 2 ventricular shunt 0170T
6: 2 ventricular shunt 0171T
7: 3 excision of CNS 01000
8: 3 excision of CNS 01001
9: 3 excision of CNS 01002
10: 4 some procedure A2543
11: 4 some procedure A2544
12: 4 some procedure A2545
13: 5 some procedure A0543
14: 5 some procedure A0544
15: 5 some procedure A0545
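The same rule (split each endpoint into an optional letter prefix, a run of digits, and an optional letter suffix, then sequence the digits and re-pad them) can also be sketched in base R with regexec(). This is only an illustration under the constraints stated in the question; split_code() and expand_range() are hypothetical names, not functions from the answer or from any package.

```r
# Hypothetical helper: split one code into prefix, digits and suffix
split_code <- function(x) {
  m <- regmatches(x, regexec("^([^0-9]*)([0-9]+)([^0-9]*)$", x))[[1]]
  list(prefix = m[2], digits = m[3], suffix = m[4])
}

# Hypothetical helper: expand "from".."to" while keeping letters and zero padding
expand_range <- function(from, to) {
  a <- split_code(from)
  b <- split_code(to)
  # Both endpoints must share the same letters, as the question guarantees
  stopifnot(identical(a$prefix, b$prefix), identical(a$suffix, b$suffix))
  num <- seq(as.integer(a$digits), as.integer(b$digits))
  # Re-pad to the digit width of the first endpoint
  padded <- formatC(num, width = nchar(a$digits), format = "d", flag = "0")
  paste0(a$prefix, padded, a$suffix)
}

expand_range("0169T", "0171T")
expand_range("A0543", "A0545")
```

Unlike returning a sentinel string, this sketch stops with an error when the two endpoints carry different letters; which behavior is preferable depends on how dirty the real data is.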
Data copied from @jeremycg's answer:
dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002",
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "some procedure", "some procedure")), .Names = c("code",
"label"), row.names = c(NA, 5L), class = "data.frame")
#4
3
If you're patient enough, you'd probably parse the strings into separate pieces instead of using the eval/parse trick. Alas, I'm not, so:
fancy.seq = function(x) eval(parse(text=sub(', \\)', ')', sub('\\(, ', '(',
sub('.*?([0-9]+)(.*)-(.*?)([1-9][0-9]*).*',
'paste0("\\3",
formatC(\\1:\\4, width=log10(\\4)+1, format="d", flag="0"),
"\\2")',
x)))))
# using the example data from jeremycg's answer (dff), converted to a data.table
library(data.table)
dt <- as.data.table(dff)
dt[, .(fancy.seq(code), label), by = 1:nrow(dt)]
# nrow V1 label
# 1: 1 61000 excision of CNS
# 2: 1 61001 excision of CNS
# 3: 1 61002 excision of CNS
# 4: 2 0169T ventricular shunt
# 5: 2 0170T ventricular shunt
# 6: 2 0171T ventricular shunt
# 7: 3 01000 excision of CNS
# 8: 3 01001 excision of CNS
# 9: 3 01002 excision of CNS
#10: 4 A2543 some procedure
#11: 4 A2544 some procedure
#12: 4 A2545 some procedure
#13: 5 A0543 some procedure
#14: 5 A0544 some procedure
#15: 5 A0545 some procedure
If it is unclear what the above is doing, just run the sub commands one by one on one of the "code" strings.
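As a sketch of that step-by-step inspection, the snippet below runs only the innermost sub() on a single code string, prints the R expression it generates, and then evaluates it. The two outer sub() calls are left out because they do not alter this particular string; the variable names expr and codes are mine.

```r
x <- "0169T-0171T"

# Innermost sub(): rewrite the range into the text of an R expression
expr <- sub('.*?([0-9]+)(.*)-(.*?)([1-9][0-9]*).*',
            'paste0("\\3", formatC(\\1:\\4, width=log10(\\4)+1, format="d", flag="0"), "\\2")',
            x)
print(expr)  # inspect the generated code before evaluating it

# Evaluating that text is what produces the expanded codes
codes <- eval(parse(text = expr))
print(codes)
```

Printing expr first makes it easier to see which part of the code string each capture group picked up, which is the main thing to check before trusting eval(parse()) on real data.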
#5
1
A less elegant way to do it:
# the data
hcup <- data.frame(code=c("61000-61003", "0169T-0169T"),
label=c("excision of CNS", "ventricular shunt"), stringsAsFactors = F)
hcup
> code label
>1 61000-61003 excision of CNS
>2 0169T-0169T ventricular shunt
# reshaping
# split the code ranges into separate columns
seq.ends <- cbind(do.call(rbind.data.frame, strsplit(hcup$code, "-")), hcup$label)
# create a list with a data.frame for each original line
new.list <- apply(seq.ends, 1, FUN=function(x){data.frame(code=if(grepl("\\d{5}", x[1])){
z<-x[1]:x[2]}else{z<-x[1]}, label=rep(x[3], length(z)),
stringsAsFactors = F)})
# collapse the list into a df
new.df <- do.call(rbind, lapply(new.list, data.frame, stringsAsFactors=F))
new.df
> code label
>1.1 61000 excision of CNS
>1.2 61001 excision of CNS
>1.3 61002 excision of CNS
>1.4 61003 excision of CNS
>2 0169T ventricular shunt