My question is: what gsub command removes a whole word that starts with specific letters? My main goal is to remove all URLs from a given text.
For example, I have a text: "refer http://www.google.com for further details". What I need to do is transform the text to "refer for further details". For this, essentially I need to write a gsub command something like below:
text <- "refer http://www.google.com for further details"
gsub("http", "", text)
However, this removes only the part 'http' from the text. I need to remove the complete word starting with 'http'.
Some other commands that I tried:
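gsub('http..', "", text)   # removes 'http' plus the next two characters (each dot matches one character)
gsub('^http', "", text)    # no match here: '^' anchors the pattern to the start of the string
gsub('/http', "", text)    # no match here: the text contains no literal '/http'
gsub('\\\http', "", text)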
None of these gave the result I need.
Any help in this regard will be greatly appreciated.
1 solution
#1
This is only a halfway answer:
gsub("https?://.*?\\s", "", text)
# [1] "refer for further details"
Why is it a "halfway answer"? It only handles a limited set of scenarios: those where the URL is followed by a space. The pattern matches from 'http://' or 'https://' up to and including the next whitespace character. A URL followed immediately by a punctuation mark would have that punctuation removed along with it, and a URL at the very end of the text, with no whitespace after it, would not be matched at all.
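To make the limitation concrete, here is a small sketch that runs the same pattern on two inputs. Both input strings are made up for illustration and are not from the question.

```r
# Same pattern as in the answer above
pattern <- "https?://.*?\\s"

# Hypothetical inputs, chosen to show where the pattern falls short
end_case   <- "details at http://www.google.com"
punct_case <- "see http://www.google.com, then continue"

# No whitespace follows the URL, so nothing matches and the text is unchanged
end_result <- gsub(pattern, "", end_case)
# [1] "details at http://www.google.com"

# The match runs up to the next space, so the comma is removed with the URL
punct_result <- gsub(pattern, "", punct_case)
# [1] "see then continue"
```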
Detecting URLs is a fairly common task. You should be able to find more detailed patterns by searching for something like "regex identify URL". Most likely, though, you'd need to modify it somewhat to work with R.
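As a starting point, one slightly broader pattern is sketched below. It is not a full URL detector. The helper name `remove_urls` is hypothetical, and the pattern rests on two assumptions: a URL starts with 'http://' or 'https://', and it ends at the last character before the next whitespace that is neither whitespace nor punctuation.

```r
# Hypothetical helper: removes http/https URLs plus any whitespace before them.
# The final bracket expression forces the match to end on a character that is
# neither whitespace nor punctuation, so trailing punctuation stays in the text.
remove_urls <- function(x) {
  gsub("\\s*https?://[^[:space:]]*[^[:space:][:punct:]]", "", x)
}

remove_urls("refer http://www.google.com for further details")
# [1] "refer for further details"

remove_urls("see http://www.google.com, then continue")
# [1] "see, then continue"

remove_urls("details at http://www.google.com")
# [1] "details at"
```

One trade-off of this choice: a URL that legitimately ends in punctuation, such as a trailing '/', would leave that character behind, so the pattern may need adjusting for real data.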