Extracting part of a string in R

Time: 2022-06-01 12:46:07

I have a string of the form

stamp = "section_d1_2010-07-01_08_00.txt"

and would like to be able to extract parts of it. I have been able to do this by applying str_extract repeatedly to narrow down to the section I want, e.g. to grab the month:

month = str_extract(stamp,"2010.+")
month = str_extract(month,"-..")
month = str_extract(month,"..$")

However, this is terribly inefficient and there has to be a better way. For this particular example I can use

month = substr(stamp,17,18)

However, I am looking for something more versatile (in case the number of digits changes).

I think I need a regular expression that grabs what comes AFTER certain flags (the _ or -, or the 3rd _, etc.). I have tried using sub as well, but had the same problem: I needed several calls to home in on what I actually wanted.

An example of how to get, say, the month (07 here) and the hour (08 here) would be appreciated.

2 answers

#1


4  

You can simply use strsplit with the regex [-_] to get all the parts (the perl = TRUE option is not needed for this pattern).

stamp <- "section_d1_2010-07-01_08_00.txt"
strsplit(stamp, '[-_]')[[1]]
# [1] "section" "d1"      "2010"    "07"      "01"      "08"      "00.txt" 

See demo.

https://regex101.com/r/cK4iV0/8
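
To go from the split result to the month and hour asked about in the question, one option is to index the pieces by position. This is only a minimal sketch: it assumes every file name has the same field layout as the example, and the names parts, month and hour are illustrative.

```r
stamp <- "section_d1_2010-07-01_08_00.txt"

# Split on every "-" or "_" and take the character vector for this one string
parts <- strsplit(stamp, "[-_]")[[1]]

# Assumed layout: section, id, year, month, day, hour, minute.ext
month <- parts[4]
hour  <- parts[6]

month
# [1] "07"
hour
# [1] "08"
```

Because the fields are selected by position rather than by character offset, this keeps working if the number of digits in a field changes, as long as the separators stay the same.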

#2


2  

For the month, you can try

gsub('^.*_\\d+-|-\\d+_.*$', '', stamp)
#[1] "07"

For the hour

library(stringr)
str_extract(stamp, '(?<=\\d_)\\d+(?=_\\d)')
#[1] "08"

Extracting both

str_extract_all(stamp, '(?<=\\d{4}[^0-9])\\d{2}|\\d{2}(?=[^0-9]\\d{2}\\.)')[[1]]
#[1] "07" "08"
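
If the lookarounds are hard to read, capture groups are another way to pull out both values in one pass. Note that this is a different technique from the one in the answer above, sketched with base R's regexec and regmatches. It assumes the stamp always contains a year-month-day_hour_minute block, and the pattern and variable names are illustrative.

```r
stamp <- "section_d1_2010-07-01_08_00.txt"

# One capture group per field: year, month, day, hour, minute
pattern <- "([0-9]{4})-([0-9]+)-([0-9]+)_([0-9]+)_([0-9]+)"

# regexec locates the full match and each group; regmatches extracts the text
m <- regmatches(stamp, regexec(pattern, stamp))[[1]]

# m[1] is the full match, so the groups start at index 2
month <- m[3]
hour  <- m[5]

month
# [1] "07"
hour
# [1] "08"
```

Only the year is pinned to four digits here; the other groups use + so they tolerate a change in the number of digits.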
