Given a character vector vecA:

vecA <- c("Population 1222",
          "Population 90over",
          "population under78",
          "population 99101",
          "Population 1254", 
          "Population 78 92")

Problem

I would like to arrive at vecB, which looks like this:

vecB <- c("Population 12 - 22",
          "Population 90 over",
          "population under 78",
          "population 99 - 101",
          "Population 12 - 54", 
          "Population 78 - 92")

Key characteristics

vecB has the following characteristics:

After the first two digits, a space, a dash, and a space are inserted ( - ): 1222 becomes 12 - 22
If a space already separates the two numbers, only the dash and one more space are added: 78 92 becomes 78 - 92
For combinations like underDigitDigit (and, as vecB shows, DigitDigitover), only a space is inserted: under DigitDigit

Attempts

I was thinking of making use of capture groups in gsub, along the lines of:

gsub("^([[:alpha:]]*[[:blank:]])(\\d{2})(.*)$", "\\2", vecA)

but that does not work for all the cases:

> t(t(gsub("^([[:alpha:]]*[[:blank:]])(\\d{2})(.*)$", "\\2", vecA)))
     [,1]                
[1,] "12"                
[2,] "90"                
[3,] "population under78"
[4,] "99"                
[5,] "12"                
[6,] "78"

t() is applied for presentational purposes only; regex101 link.
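To make the failure easier to see, here is a minimal diagnostic sketch (not part of the original attempt; the names pattern, matched and groups are illustrative). It re-emits all three capture groups in brackets instead of only "\\2". Two things become visible: the replacement "\\2" discards everything except the first two digits, and "population under78" never matches because the pattern requires two digits immediately after the first word, so gsub returns it unchanged.

```r
vecA <- c("Population 1222",
          "Population 90over",
          "population under78",
          "population 99101",
          "Population 1254",
          "Population 78 92")

# Same pattern as in the attempt above
pattern <- "^([[:alpha:]]*[[:blank:]])(\\d{2})(.*)$"

# Which elements does the pattern match at all?
matched <- grepl(pattern, vecA)
print(matched)  # only the third element ("population under78") is FALSE

# Show what each group captured; non-matching elements pass through untouched
groups <- gsub(pattern, "[\\1][\\2][\\3]", vecA)
print(groups)
```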

1 Solution

#1

Here is my suggestion. Do it in two steps: 1) add the hyphen between the numbers first, and then 2) add the space between the words "over"/"under" and the number:

vecA <- c("Population 1222",
          "Population 90over",
          "population under78",
          "population 99101",
          "Population 1254",
          "Population 78 92")
# Step 1: insert " - " after the first two digits when another digit follows
v <- gsub("^([[:alpha:]]+[[:blank:]]+)([[:digit:]]{2})\\s*([[:digit:]])", "\\1\\2 - \\3", vecA)
# Step 2: insert a space between "over"/"under" and the adjacent number
gsub("^([[:alpha:]]+[[:blank:]]+)(?|(over|under)(\\d+)|(\\d+)(over|under))", "\\1\\2 \\3", v, perl=TRUE)

Output of a code demo:

[1] "Population 12 - 22"  "Population 90 over"  "population under 78"
[4] "population 99 - 101" "Population 12 - 54"  "Population 78 - 92"

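The second regex contains a branch reset group (?|...|...), which keeps the same group numbers in each alternative, so \\2 and \\3 refer to whichever branch matched. Branch reset is a PCRE feature that R's default regex engine does not support, so the call needs the perl argument set to TRUE.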
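If you would rather avoid the branch reset group, a possible alternative is to split the second step into two plain gsub calls, one per word order. This is only a sketch under the assumption that "over"/"under" always sit directly next to the number, as in vecA; the names step1, step2 and result are illustrative, and vecB is the expected output from the question.

```r
vecA <- c("Population 1222",
          "Population 90over",
          "population under78",
          "population 99101",
          "Population 1254",
          "Population 78 92")

# Expected result, copied from the question
vecB <- c("Population 12 - 22",
          "Population 90 over",
          "population under 78",
          "population 99 - 101",
          "Population 12 - 54",
          "Population 78 - 92")

# Step 1: " - " after the first two digits when another digit follows
step1 <- gsub("^([[:alpha:]]+[[:blank:]]+)([[:digit:]]{2})[[:blank:]]*([[:digit:]])",
              "\\1\\2 - \\3", vecA)

# Step 2a: word followed by number, e.g. "under78" -> "under 78"
step2 <- gsub("(over|under)([[:digit:]]+)", "\\1 \\2", step1)

# Step 2b: number followed by word, e.g. "90over" -> "90 over"
result <- gsub("([[:digit:]]+)(over|under)", "\\1 \\2", step2)

print(identical(result, vecB))
```

This trades one extra pass over the vector for patterns that work with the default engine; the branch reset version does the same job in a single call.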
