Split a long string by a vector of words

I'm looking to split some television scripts into a data frame with two variables: (1) spoken dialogue and (2) speaker.

Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html

Loaded to R via:

library(rvest)

# Download and parse the transcript page
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
page <- read_html(url)

# Extract all of the page text as a single string
all <- page %>% html_text()

[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript\n\nBuffy Episode #127: \"Selfless\" \n  Transcript\nWritten by Drew Goddard\n  Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. \n\n \n   \n        NB: The content of this transcript, including the characters \n          and the story, belongs to Mutant Enemy. This transcript was created \n          based on the broadcast episode.\n      \n       \n      \n             \n            BUFFYWORLD.COM \n              prefers that you direct link to this transcript rather than post \n              it on your site, but you can post it on your site if you really \n              want, as long as you keep everything intact, this includes the link \n              to buffyworld.com and this writing. Please also keep the disclaimers \n              intact.\n            \n            Originally transcribed for: http://www.buffyworld.com/.\n\t  \n    TEASER (RECAP SEGMENT):\n  GILES (V.O.)\n\n  Previousl... <truncated>

What I'm trying now is to split at each character's name (I have a full list), for example 'GILES' above. This works, except that the character's name is lost when I split on it. Here's a simplified example.

to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)

This gives me the splits I want, but doesn't retain the character name.

Finite question: Is there a way to retain the character name with the approach I'm using? Infinite question: Are there other approaches I should be trying?

Thanks in advance!

1 Solution

#1

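I think you can use Perl-compatible regular expressions with strsplit (perl = TRUE) and split on a positive look-behind, which matches the position just after a name without consuming the name itself. For explanatory purposes, I used a shorter sample string, but it should work the same: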

string <- "text BUFFY more text WILLOW other text"

to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
strsplit(string, paste0("(?<=", to_parse, ")"), perl = TRUE)

#[[1]]
#[1] "text BUFFY"        " more text WILLOW" " other text"

As suggested by @Lamia, if you instead had the name before the text you could do a positive look-ahead. I edited the suggestion slightly so that the split string includes the delimiter.

strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)

#[[1]]
#[1] "text "             "BUFFY more text "  "WILLOW other text"
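
Since the stated goal is a data frame with a speaker column and a dialogue column, here is a minimal sketch of one way to get there. It is not the strsplit approach above: it swaps in base R's gregexpr and regmatches to pull out the names and the text between them separately. The column names speaker and dialogue, and the choice to discard any text before the first name, are assumptions made for illustration.

```r
# Same short sample string as above
string <- "text BUFFY more text WILLOW other text"

# Build an alternation pattern from the list of character names
speakers <- c("BUFFY", "WILLOW")
to_parse <- paste(speakers, collapse = "|")

# Locate every occurrence of a character name
m <- gregexpr(to_parse, string)

# The matched names, in order of appearance
found <- regmatches(string, m)[[1]]

# The text between the matches; invert = TRUE returns the non-matched pieces,
# one more than the number of matches
pieces <- regmatches(string, m, invert = TRUE)[[1]]

# Assumption: the piece before the first name has no speaker, so drop it
dialogue <- trimws(pieces[-1])

# Column names are hypothetical, chosen to mirror the question
script <- data.frame(speaker = found,
                     dialogue = dialogue,
                     stringsAsFactors = FALSE)
```

On the real transcript the name list would need to cover every speaker, and a name that also appears inside spoken dialogue would be treated as a new speaker, so the result is worth checking by eye.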