I have a dataframe that includes a column of numbers like this:
360010001001002
360010001001004
360010001001005
360010001001006
I'd like to break each number into chunks of 2, 3, 5, 1, and 4 digits:
36 001 00010 0 1002
36 001 00010 0 1004
36 001 00010 0 1005
36 001 00010 0 1006
That seems like it should be straightforward, but after reading the strsplit documentation I can't work out how to split by lengths.
5 Solutions
#1
4
Assuming this data:
x <- c("360010001001002", "360010001001004", "360010001001005", "360010001001006")
try this:
read.fwf(textConnection(x), widths = c(2, 3, 5, 1, 4))
If x is numeric, replace x with as.character(x) in this statement.
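One thing worth noting: by default read.fwf converts each field to a number, so a chunk such as "001" comes back as 1. If the leading zeros matter, a possible variant (a sketch, relying on read.fwf forwarding extra arguments such as colClasses to read.table; the name parts is just illustrative) is to read every field as character:

```r
x <- c("360010001001002", "360010001001004", "360010001001005", "360010001001006")

# colClasses is passed through to read.table, so each fixed-width
# field stays a string and keeps its leading zeros.
parts <- read.fwf(textConnection(x), widths = c(2, 3, 5, 1, 4),
                  colClasses = "character")
parts
```

The result is a data frame with columns V1 to V5, all of type character.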
#2
8
You can use substring (assuming the length of the string/number is fixed):
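xx <- c(360010001001002, 360010001001004, 360010001001005, 360010001001006)
# substring() coerces each number to a string; as.numeric() then drops
# leading zeros from the chunks (e.g. "001" becomes 1)
out <- do.call(rbind, lapply(xx, function(x) {
  as.numeric(substring(x, c(1, 3, 6, 11, 12), c(2, 5, 10, 11, 15)))
}))
out <- as.data.frame(out)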
#3
4
A functional version:
split.fixed.len <- function(x, lengths) {
  cum.len <- c(0, cumsum(lengths))
  start   <- head(cum.len, -1) + 1
  stop    <- tail(cum.len, -1)
  mapply(substring, list(x), start, stop)
}

a <- c(360010001001002,
       360010001001004,
       360010001001005,
       360010001001006)

split.fixed.len(a, c(2, 3, 5, 1, 4))
#      [,1] [,2]  [,3]    [,4] [,5]
# [1,] "36" "001" "00010" "0"  "1002"
# [2,] "36" "001" "00010" "0"  "1004"
# [3,] "36" "001" "00010" "0"  "1005"
# [4,] "36" "001" "00010" "0"  "1006"
#4
0
(Wow, this task is incredibly clunky and painful compared to Python. Anyhoo...)
PS: I see now that your main intent was to convert a vector of substring lengths into pairs of indices. You could use cumsum(), then sort the indices all together:
ll <- c(2,3,5,1,4)
sort( c(1, cumsum(ll), (cumsum(ll)+1)[1:(length(ll)-1)]) )
# now extract these as pairs.
But it's quite painful. flodel's answer for that is better.
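For completeness, one way the "extract these as pairs" step could look. This is only a sketch: the names idx, pairs and chunks are illustrative, and it assumes the sorted vector alternates start, end, start, end, which holds for these lengths.

```r
ll <- c(2, 3, 5, 1, 4)
idx <- sort(c(1, cumsum(ll), (cumsum(ll) + 1)[1:(length(ll) - 1)]))

# Fill a two-column matrix row by row: column 1 holds the start
# positions, column 2 the matching end positions.
pairs <- matrix(idx, ncol = 2, byrow = TRUE)

# substring() is vectorised over first/last, so one call returns all chunks.
chunks <- substring("360010001001002", pairs[, 1], pairs[, 2])
chunks
```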
As for the actual task of splitting into data frame columns, and doing that efficiently:
stringr::str_sub() combines elegantly with plyr::ddply() / ldply:
require(plyr)
require(stringr)

df <- data.frame(value = c(360010001001002, 360010001001004, 360010001001005, 360010001001006))
df$valc <- as.character(df$value)
df <- ddply(df, .(value), mutate,
            chk1  = str_sub(valc, 1, 2),
            chk3  = str_sub(valc, 3, 5),
            chk6  = str_sub(valc, 6, 10),
            chk11 = str_sub(valc, 11, 11),
            chk14 = str_sub(valc, 12, 15))
#             value            valc chk1 chk3  chk6 chk11 chk14
# 1 360010001001002 360010001001002   36  001 00010     0  1002
# 2 360010001001004 360010001001004   36  001 00010     0  1004
# 3 360010001001005 360010001001005   36  001 00010     0  1005
# 4 360010001001006 360010001001006   36  001 00010     0  1006
#5
0
You can use stri_sub() from the stringi package:
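library(stringi)

# cumulative end positions of the chunks: 2 5 10 11 15
splitpoints <- cumsum(c(2, 3, 5, 1, 4))
# each chunk starts one character after the previous chunk ends
stri_sub("360010001001002", c(1, splitpoints[-length(splitpoints)] + 1), splitpoints)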
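The call above splits a single string. To apply the same split points to a whole column, one option is to loop over the strings and bind the results into a matrix. This is a rough sketch, assuming stringi is installed; x, starts and chunks are illustrative names.

```r
library(stringi)

x <- c("360010001001002", "360010001001004", "360010001001005", "360010001001006")

splitpoints <- cumsum(c(2, 3, 5, 1, 4))        # chunk end positions
starts <- c(1, head(splitpoints, -1) + 1)      # chunk start positions

# One stri_sub() call per string returns its five chunks;
# rbind stacks them into a 4 x 5 character matrix.
chunks <- do.call(rbind, lapply(x, stri_sub, from = starts, to = splitpoints))
chunks
```

Wrapping the result in as.data.frame(chunks) would give one column per chunk, with the leading zeros preserved because the values stay as strings.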