I have a list of URLs that I would like to parse and normalize.
I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as being from the same website.
6 Answers
#1
10
Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and create a single regular expression replacement in order to build a sweet and fancy gsub call.
Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname, and port components), and a remainder which we happily strip away. Let's assume at first that there is no username, password, or port.
- ^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); we strip this away if we find it.
- A potential www. prefix is also stripped away, but not captured: (?:www\\.)?
- Anything up to the subsequent slash will be our fully qualified host name, which we capture: ([^/]+)
- The rest we ignore: .*$
Now we plug together the regexes above, and the extraction of the hostname becomes:
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
Change host name regex to include (but not capture) the port:
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:
> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
"http://test.com/?ex"))
[1] "test.server.com" "google.com" "test.com"
#2
10
You can use the parse_url() function from the R package httr:
parse_url(url)
> parse_url("http://google.com/")
You can get more details here: http://cran.r-project.org/web/packages/httr/httr.pdf
#3
5
There's also the urltools package now, which is much faster:
urltools::url_parse(c("www.google.com/test/index.asp",
"google.com/somethingelse"))
##   scheme         domain port           path parameter fragment
## 1        www.google.com      test/index.asp
## 2            google.com       somethingelse
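Since the question only needs "www.google.com" and "google.com" to compare equal, one follow-up sketch is to strip a leading "www." from the domain column of the parsed result (plain base R on top of the data frame shown above):

parsed <- urltools::url_parse(c("www.google.com/test/index.asp",
                                "google.com/somethingelse"))
sub("^www\\.", "", parsed$domain)
## [1] "google.com" "google.com"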
#4
4
I'd forgo a package and use regex for this.
EDIT: reformulated after the robot attack from Dason...
x <- c("talkstats.com", "www.google.com/test/index.asp",
"google.com/somethingelse", "www.*.com",
"http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk=")
# strip the protocol, keep the text before the first "/", then drop "www."
parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
parser(x)
lst <- lapply(unique(parser(x)), function(var) x[parser(x) %in% var])
names(lst) <- unique(parser(x))
lst
## $talkstats.com
## [1] "talkstats.com"
##
## $google.com
## [1] "www.google.com/test/index.asp" "google.com/somethingelse"
##
## $`*.com`
## [1] "www.*.com"
##
## $bing.com
## [1] "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk="
This may need to be extended depending on the structure of the data.
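For instance, the parser above only strips "http://". A hedged tweak (not from the original answer) that also handles "https://" and protocol-relative "//host/..." forms, and anchors the "www." removal to the start of the host:

# also accept https:// and leading "//", and only drop "www." at the start
parser2 <- function(x) {
    x <- gsub("^(https?:)?//", "", x)
    gsub("^www\\.", "", sapply(strsplit(x, "/"), "[[", 1))
}
parser2("https://www.google.com/test/index.asp")
## [1] "google.com"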
#5
3
Building upon R_Newbie's answer, here's a function that will extract the server name from a (vector of) URLs, stripping away a www. prefix if it exists, and gracefully ignoring a missing protocol prefix.
domain.name <- function(urls) {
require(httr)
require(plyr)
paths <- laply(urls, function(u) with(parse_url(u),
paste0(hostname, "/", path)))
gsub("^/?(?:www\\.)?([^/]+).*$", "\\1", paths)
}
The parse_url function is used to extract the hostname and path components, which are further processed by gsub. The /? and (?:www\\.)? parts of the regular expression match an optional leading slash followed by an optional www., and the ([^/]+) matches everything after that but before the first slash; this is captured and referenced (as \\1) in the replacement text of the gsub call.
> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
"http://test.com/?ex"))
[1] "test.server.com" "google.com" "test.com"
#6
2
If you like tldextract, one option would be to use the version on appengine:
require(RJSONIO)
test <- c("test.server.com/test", "www.google.com/test/index.asp", "http://test.com/?ex")
lapply(paste0("http://tldextract.appspot.com/api/extract?url=", test), fromJSON)
[[1]]
  domain subdomain   tld
"server"    "test" "com"

[[2]]
  domain subdomain   tld
"google"     "www" "com"

[[3]]
  domain subdomain   tld
  "test"        "" "com"