How can I remove hyperlinks, email IDs, etc. from text documents using regular expressions?

Time: 2021-09-03 02:46:57

I have some text documents which contain:

  • Different types of email addresses: public domains such as gmail, yahoo, etc., as well as private addresses such as abc@mycompany.org...
  • Different hyperlinks such as abc.com, http://abc.com, www.abc.org, ...

I would like to know whether a single regex can remove all such entries from my documents before further processing; if so, please share links, documentation, or anything else useful. I want to remove every kind of email ID or hyperlink, and I'll be implementing the regex in R. I'm new to this area, so a detailed explanation would be highly appreciated.

So, if I give this input:

"abc@mycompany.org aasd234bc.com to be retained http://abc.com www.abc.org org com .com comm in sahgo234@flkja23.in"

Then I should get this output:

"to be retained org com comm in"

2 solutions

#1


2  

You can try something like this:

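x <- c("abc@mycompany.org", "abc.com", "http://abc.com", "www.abc.org")
# Strips everything from "@" onward, a trailing dot plus up to three characters,
# an "http://" at the start of the string, and any "www." from each element.
gsub("(@.+$|\\..{1,3}$|(^http://)?(w{3}\\.)?)", "", x, perl = TRUE)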

If I now understand your question better, and it is only the first email address (the one at the start of the string) that you need to remove:

gsub("(^\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)", "", x, perl = TRUE)

Otherwise, to remove email addresses anywhere in the string:

gsub("(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)", "", x, perl = TRUE)
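
For what it's worth, here is a minimal sketch of a single-pattern version that removes whole tokens rather than parts of them. The helper name `remove_links_and_emails` and the pattern are my own assumptions, not something from the question: any whitespace-delimited token containing `@`, `://`, `www.`, or a dot followed by two or more letters is treated as an email address or link. That is deliberately crude, so it would also drop things like file names (`report.pdf`).

```r
# Hypothetical helper (not from the original thread): delete every
# whitespace-delimited token that looks like an email address or a link.
remove_links_and_emails <- function(text) {
  # \S* on both sides extends the match to the whole token around a trigger:
  # "@", "://", "www." or a dot followed by at least two letters.
  pattern <- "\\S*(?:@|://|www\\.|\\.[A-Za-z]{2,})\\S*"
  cleaned <- gsub(pattern, "", text, perl = TRUE)
  # Collapse the whitespace left behind and trim both ends.
  trimws(gsub("\\s+", " ", cleaned))
}

input <- "abc@mycompany.org aasd234bc.com to be retained http://abc.com www.abc.org org com .com comm in sahgo234@flkja23.in"
remove_links_and_emails(input)
## [1] "to be retained org com comm in"
```

Because `gsub` is vectorised, the same call should also work unchanged on a character vector with one document per element.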

HTH

#2


1  

I wouldn't call this a true regex solution, and it's likely slower, but...

x <- "abc@mycompany.org aasd234bc.com to be retained abc.com www.abc.org org com .com comm in sahgo234@flkja23.in"

# Split on whitespace, then keep only the tokens that contain none of
# "@", ".com", ".org", "www." or ".in"
y <- unlist(strsplit(x, "\\s+"))
paste(y[!grepl("@|\\.com|\\.org|www\\.|\\.in", y)], collapse = " ")

## [1] "to be retained org com comm in"

EDIT: For a multi-row vector, wrap it up as a function and lapply it over the elements...

x <- c("abc@mycompany.org aasd234bc.com to be retained abc.com www.abc.org org com .com comm in sahgo234@flkja23.in",
    "abc@mycompany.org aasd234bc.com to be retained abc.com www.abc.org org com .com comm in sahgo234@flkja23.in")

# Same token filter as above, wrapped so it can be applied to each element
FUN <- function(x) {
    y <- unlist(strsplit(x, "\\s+"))
    paste(y[!grepl("@|\\.com|\\.org|www\\.|\\.in", y)], collapse = " ")
}
unlist(lapply(x, FUN))

## > unlist(lapply(x, FUN))
## [1] "to be retained org com comm in" "to be retained org com comm in"
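
The hard-coded list above only catches `.com`, `.org` and `.in`, so a token like `example.net` would slip through. A possible variation, sketched here under my own assumptions (the name `drop_link_tokens` and the default pattern are hypothetical, not from the answer), swaps the fixed list for a generic "dot plus two or more letters at the end of the token" test and uses `vapply` so the result is always a character vector:

```r
# Hypothetical helper: for each element of x, drop whitespace-separated
# tokens that look like email addresses or links.
drop_link_tokens <- function(x, pattern = "@|://|^www\\.|\\.[A-Za-z]{2,}$") {
  vapply(
    strsplit(x, "\\s+"),
    # Keep only the tokens that do not match the pattern, then re-join them.
    function(tokens) paste(tokens[!grepl(pattern, tokens)], collapse = " "),
    character(1)
  )
}

docs <- c("mail foo@bar.net or see example.net today",
          "abc@mycompany.org aasd234bc.com to be retained abc.com www.abc.org org com .com comm in sahgo234@flkja23.in")
drop_link_tokens(docs)
# returns c("mail or see today", "to be retained org com comm in")
```

One caveat with this sketch: a link followed directly by punctuation (for example `example.net,`) no longer ends in letters, so the `$`-anchored part of the pattern would miss it.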
