Partial extraction of unstructured data at the second instance

Time: 2021-06-07 18:31:32

I have a huge text file from EDGAR. I want to extract only a portion of the text, from the business risk section.

For example, if the text is:

Bshehebvegegeveghdhebejejrjbfbfk

I want to extract the substring that starts at the second instance of 'he' and ends at the second instance of 'ge'.

So my output would be: hebvegege

I would like code in R. I am especially interested in the business risk section.

1 solution

#1



One option is to use gregexpr to find the starting index of every match of the patterns 'he' and 'ge', take the second match of each, and pass those to substr as the start and stop positions. Because gregexpr reports where a match begins, 1 is added to the 'ge' index so that the extracted string includes the final 'e'.

# start of the second "he"
i1 <- gregexpr("he", str1)[[1]][2]
# start of the second "ge", plus 1 to reach its final "e"
i2 <- gregexpr("ge", str1)[[1]][2] + 1
substr(str1, i1, i2)
#[1] "hebvegege"

Or in a single step, using a lookbehind so that the second pattern matches the 'e' that directly follows a 'g':

do.call(substr, c(str1, lapply(c("he", "(?<=g)e"), 
     function(pat) gregexpr(pat, str1, perl=TRUE)[[1]][2]) ))
#[1] "hebvegege"
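
Both snippets above assume that each pattern occurs at least twice. A minimal sketch of a reusable wrapper follows, on the assumption that a missing occurrence should give NA rather than an error. The function name extract_between_nth and its arguments are hypothetical, not part of base R.

```r
# Hypothetical helper: substring from the start of the n-th match of
# `start_pat` through the end of the n-th match of `end_pat`.
# Extra arguments (e.g. perl = TRUE, fixed = TRUE) are passed to gregexpr.
extract_between_nth <- function(x, start_pat, end_pat, n = 2, ...) {
  s <- gregexpr(start_pat, x, ...)[[1]]
  e <- gregexpr(end_pat, x, ...)[[1]]
  # gregexpr returns -1 when there is no match at all
  if (s[1] == -1 || e[1] == -1 || length(s) < n || length(e) < n) {
    return(NA_character_)
  }
  first <- s[n]
  # match.length gives the width of each match, so this is its last character
  last <- e[n] + attr(e, "match.length")[n] - 1
  if (last < first) return(NA_character_)
  substr(x, first, last)
}

extract_between_nth("Bshehebvegegeveghdhebejejrjbfbfk", "he", "ge")
#[1] "hebvegege"
```

Using match.length instead of a hard-coded + 1 keeps the end position right when the end pattern is longer than two characters.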

data

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
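
For the filing itself, the same idea should carry over once the file is collapsed into a single string. The sketch below rests on several assumptions: it uses a small made-up vector of lines in place of the real file (with a real file, readLines() on a path such as the hypothetical "filing.txt" would supply them), and it assumes the risk section is delimited by the headings "Item 1A" and "Item 1B", with the first instance of each appearing in a table of contents. Adjust the markers to whatever the actual filing uses.

```r
# Stand-in for readLines("filing.txt"); the content is invented for illustration.
lines <- c(
  "Table of Contents",
  "Item 1A. Risk Factors",
  "Item 1B. Unresolved Staff Comments",
  "Item 1A. Risk Factors",
  "Our business depends on a small number of customers.",
  "Item 1B. Unresolved Staff Comments",
  "None."
)

# Collapse to one string so match positions are offsets into the whole text
txt <- paste(lines, collapse = "\n")

# fixed = TRUE treats the markers as literal text, not regular expressions
i1 <- gregexpr("Item 1A", txt, fixed = TRUE)[[1]][2]
i2 <- gregexpr("Item 1B", txt, fixed = TRUE)[[1]][2] - 1

# Everything from the second "Item 1A" up to just before the second "Item 1B"
risk <- substr(txt, i1, i2)
cat(risk)
```

Here the stop position is the character before the second end marker, so the end heading itself is left out of the result.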
