I have a huge text file from EDGAR, and I want to extract only a portion of the text from the business risk section.
For example, if the text is:
Bshehebvegegeveghdhebejejrjbfbfk
I want the extract to start at the second instance of 'he' and end at the second instance of 'ge'.
So my output would be: hebvegege
I would like a solution in R. I am especially interested in the business risk section.
1 solution
#1
One option is to use gregexpr to find the index of the starting character of each match for the patterns 'he' and 'ge', and then pass those indices to substr as the start and stop positions of the substring to extract. Because gregexpr reports where a match begins, 1 is added to the 'ge' index so that the trailing 'e' is included.
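# start index of the second "he"
i1 <- gregexpr("he", str1)[[1]][2]
# start index of the second "ge", plus 1 to include its trailing "e"
i2 <- gregexpr("ge", str1)[[1]][2] + 1
substr(str1, i1, i2)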
#[1] "hebvegege"
Or in a single step, using a lookbehind so that the second pattern matches the 'e' that directly follows a 'g':
do.call(substr, c(str1, lapply(c("he", "(?<=g)e"),
    function(pat) gregexpr(pat, str1, perl = TRUE)[[1]][2])))
#[1] "hebvegege"
data
str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
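The same idea can be wrapped in a small reusable function. The sketch below is not part of the original answer: the name extract_between and its arguments are hypothetical, and it assumes the markers should be matched literally (fixed = TRUE) rather than as regular expressions. It returns NA when either marker does not occur often enough.

```r
# Hypothetical helper (an illustration, not from the original answer):
# return the text from the n_start-th occurrence of start_pat up to and
# including the n_end-th occurrence of end_pat. Both are matched literally.
extract_between <- function(x, start_pat, end_pat, n_start = 1, n_end = 1) {
  starts <- gregexpr(start_pat, x, fixed = TRUE)[[1]]
  ends   <- gregexpr(end_pat,   x, fixed = TRUE)[[1]]

  # gregexpr returns -1 when the pattern is not found at all
  if (starts[1] == -1 || ends[1] == -1 ||
      length(starts) < n_start || length(ends) < n_end) {
    return(NA_character_)
  }

  from <- starts[n_start]
  # move from the first character of the end marker to its last character
  to <- ends[n_end] + nchar(end_pat) - 1
  if (to < from) return(NA_character_)

  substr(x, from, to)
}

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
extract_between(str1, "he", "ge", n_start = 2, n_end = 2)
#[1] "hebvegege"
```

For a real filing, the function could be called with whatever heading text brackets the section of interest. Which marker strings to use, and which occurrence of each, depends on the file and has to be checked by hand, since a heading may also appear in a table of contents.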