Removing HTML tags from a string in R

Time: 2022-08-27 17:08:56

I'm trying to read web page source into R and process it as strings. I want to extract the paragraphs and remove the HTML tags from the paragraph text. I'm running into the following problem:

I tried implementing a function to remove the html tags:

cleanFun=function(fullStr)
{
 #find location of tags and citations
 tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);

 #create storage for tag strings
 tagStrings=list()

 #extract and store tag strings
 for(i in 1:dim(tagLoc)[1])
 {
   tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
 }

 #remove tag strings from paragraph
 newStr=fullStr
 for(i in 1:length(tagStrings))
 {
   newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
 }
 return(newStr)
};

This works for some tags but not all of them; an example where it fails is the following string:

test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

The goal would be to obtain:

cleanFun(test)="junk junk junk junk"

However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.

7 Answers

#1 (36 votes)

This can be achieved simply through regular expressions and the grep family:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This will also work with multiple html tags in the same string!

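For example (a quick check, assuming cleanFun is defined as above), applied to the string from the question and to a made-up string with several tags:

> test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> cleanFun(test)
[1] "junk junk junk junk"
> cleanFun("<p>junk <b>junk</b> junk</p>")
[1] "junk junk junk"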

#2 (8 votes)

Another approach, using tm.plugin.webmining, which uses XML internally.

> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

#3 (6 votes)

You can also do this with two functions in the rvest package:

library(rvest)

strip_html <- function(s) {
    html_text(read_html(s))
}
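
A quick check against the string from the question (a sketch; read_html accepts a literal HTML string here because the input contains markup rather than a file path or URL):

> test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> strip_html(test)
[1] "junk junk junk junk"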

Note that you should not use regexes to parse HTML.

#4 (5 votes)

An approach using the qdap package:

library(qdap)
bracketX(test, "angle")

## > bracketX(test, "angle")
## [1] "junk junk junk junk"

#5 (3 votes)

It is best not to parse HTML using regular expressions; see RegEx match open tags except XHTML self-contained tags.

Use a package like XML: read the HTML in, parse it with, for example, htmlParse, and use XPath expressions to extract the quantities relevant to you.

UPDATE:

To answer the OP's question:

require(XML)
xData <- htmlParse('yourfile.html')
xpathSApply(xData, 'appropriate xpath', xmlValue)
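
Applied to the string from the question, a minimal sketch might look like this (asText = TRUE tells htmlParse to treat the input as HTML text rather than a file name, and the XPath //text() simply collects every text node):

> require(XML)
> test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> doc <- htmlParse(test, asText = TRUE)
> paste(xpathSApply(doc, "//text()", xmlValue), collapse = "")
[1] "junk junk junk junk"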

#6 (2 votes)

First, your subject line is misleading; there are no backslashes in the string you posted. You've fallen victim to one of the classic blunders: not as bad as getting involved in a land war in Asia, but notable all the same. You're mistaking R's use of \ to denote escaped characters for literal backslashes. In this case, \" means the double quote mark, not the two literal characters \ and ". You can use cat to see what the string would actually look like if escaped characters were treated literally.

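For example, assuming test holds the string from the question:

> cat(test)
junk junk<a href="/wiki/abstraction_(mathematics)" title="abstraction (mathematics)"> junk junk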

Second, you're using regular expressions to parse HTML. (They don't appear in your code, but they are used under the hood in str_locate_all and str_replace_all.) This is another of the classic blunders; see here for more exposition.
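
To make this concrete, here is a minimal sketch (assuming test holds the string from the question): the tag text contains parentheses, which are regex metacharacters, so passed as a pattern it does not match the literal tag, while stringr's fixed() wrapper does.

> library(stringr)
> tag <- "<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\">"
> str_detect(test, tag)         # "(mathematics)" is read as a regex group
[1] FALSE
> str_detect(test, fixed(tag))  # the tag is treated as a literal string
[1] TRUE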

Third, you should have mentioned in your post that you're using the stringr package, but this is only a minor blunder by comparison.

#7 (2 votes)

It may be easier with sub or gsub?

> test  <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> gsub(pattern = "<.*>", replacement = "", x = test)
[1] "junk junk junk junk"
