Cleaning strings in R: adding punctuation w/o overwriting the last character

Time: 2022-01-27 01:13:03

I'm new to R and unable to find other threads with a similar issue.

I'm cleaning data that requires punctuation at the end of each line. I am unable to add, say, a period without overwriting the final character of the line preceding the carriage return + line feed.

Sample code:

Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Data2 <- gsub("[^[:punct:]]\r\n\\*", ".\r\n\\*", Data1)

The contents of Data2:

[1] "%trn: dads shee.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

Notice the "p" of sheep was overwritten with the period. Any thoughts on how I could avoid this?

2 answers

#1


2  

Capturing group:

Use a capturing group around your character class and reference the group inside of your replacement.

gsub('([^[:punct:]])\\r\\n\\*', '\\1.\r\n*', Data1)
      ^            ^             ^^^
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
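
To make the cause visible, here is a minimal sketch that reuses Data1 from the question and prints what the original pattern actually matches. The variable names m and fixed are only illustrative. The character class consumes one character, so that character is part of the text that gsub replaces.

```r
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# Extract everything the original (ungrouped) pattern matches.
# The match includes the character in front of the line break.
m <- regmatches(Data1, gregexpr("[^[:punct:]]\r\n\\*", Data1))[[1]]
print(m)

# Capturing that character and writing it back with \\1 preserves it.
fixed <- gsub("([^[:punct:]])\r\n\\*", "\\1.\r\n*", Data1)
print(fixed)
```

Only the line ending in "sheep" changes, because it is the only place where a non-punctuation character is directly followed by CR LF and an asterisk.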

Lookarounds:

You can switch on PCRE by passing perl=TRUE and use lookarounds to achieve this.

gsub('[^\\pP]\\K(?=\\r\\n\\*)', '.', Data1, perl=TRUE)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

The negated character class [^\pP] uses the Unicode property \pP and matches any character that is not a punctuation character.

Instead of using a capturing group, I used \K here. This escape sequence resets the starting point of the reported match, so any previously matched characters are not included in the final matched sequence. I also used a positive lookahead to assert that a carriage return + line feed sequence and a literal asterisk follow.
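
As a rough illustration of the point above, the sketch below swaps the period for a visible marker to show where the empty match sits, then compares \K with an equivalent capture-group form. It assumes Data1 from the question. The marker string and the variable names are made up for the example.

```r
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# \K drops the matched character from the reported match and the lookahead
# consumes nothing, so the replacement is inserted rather than substituted.
marked <- sub("[^\\pP]\\K(?=\\r\\n\\*)", "<HERE>", Data1, perl = TRUE)
print(marked)

# The same result without \K: capture the character and write it back.
with_group <- gsub("([^\\pP])(?=\\r\\n\\*)", "\\1.", Data1, perl = TRUE)
with_k     <- gsub("[^\\pP]\\K(?=\\r\\n\\*)", ".", Data1, perl = TRUE)
print(identical(with_group, with_k))
```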

#2


1  

There are several ways to do it:

Capture group: gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)

Positive lookbehind (a zero-width assertion, so the preceding character is not part of the match): gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl=TRUE)

EDIT: fixed the backslashes and removed the uncertainty about R support for these.
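
A quick sanity check, sketched on the sample string from the question, that the capture-group, lookbehind, and \K variants from this thread produce the same output. The variable names are illustrative. perl = TRUE is spelled out because T is an ordinary variable that can be reassigned.

```r
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# Default regex engine: capture the character and put it back.
capture <- gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)

# PCRE: assert the preceding character without consuming it.
lookbehind <- gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl = TRUE)

# PCRE: consume the character, then discard it from the match with \K.
keep_out <- gsub("[^\\pP]\\K(?=\\r\\n\\*)", ".", Data1, perl = TRUE)

# Stops with an error if any of the three results differ.
stopifnot(identical(capture, lookbehind), identical(capture, keep_out))
print(capture)
```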