How can I use R (the RCurl / XML packages?!) to scrape this webpage?

Date: 2021-02-06 20:28:23

I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (at whatever level you feel like sharing). Here goes:

I would like to go through all of the "species pages" listed at this link:

http://gtrnadb.ucsc.edu/

So for each of them I will go to:

  1. The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
  2. And then the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html) (see the short sketch after this list)
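
To make the URL pattern in those two example links explicit, here is a minimal illustrative sketch in R (the species code "Aero_pern" is simply the example from the links above):

base_url    <- "http://gtrnadb.ucsc.edu/"
species     <- "Aero_pern"                                    # example species code from above
species_url <- paste0(base_url, species, "/")                 # the species page
structs_url <- paste0(species_url, species, "-structs.html")  # the "Secondary Structures" page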

Inside that link I wish to scrape the data on the page so that I will have a long list containing this data (for example):

chr.trna3 (1-77)    Length: 77 bp
Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Where each line will have its own list (inside the list for each "trna", inside the list for each animal).

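Just to illustrate the nesting described above, here is a rough sketch of what one record might look like in R (the field names are only placeholders; the values are taken from the example above, truncated for readability):

trna_record <- list(
  id        = "chr.trna3 (1-77)",
  length    = "77 bp",
  type      = "Ala",
  anticodon = "CGC at 35-37 (35-37)",
  score     = 93.45,
  seq       = "GGGCCGGTAGCTCAGCC...",   # the Seq line from the example above (truncated here)
  str       = ">>>>>>>..>>>>...."       # the Str line from the example above (truncated here)
)
species_trnas <- list(trna_record)               # one entry per tRNA in a species
all_species   <- list(Aero_pern = species_trnas) # one entry per animal/genome
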
I remember coming across the packages RCurl and XML (in R) that can allow for such a task. But I don't know how to use them. So what I would love to have is: 1. Some suggestion on how to build such code. 2. A recommendation for how to learn the knowledge needed for performing such a task.

Thanks for any help,

Tal

3 Answers

#1 (17 votes)

Tal,

You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.

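For reference, typical readHTMLTable() usage looks roughly like the sketch below; the URL is a placeholder for a page with well-formed <table> markup, which (as noted above) the gtRNAdb pages are not:

library(XML)
# Placeholder URL: readHTMLTable() returns one data.frame per <table> element found on the page
tables <- readHTMLTable("http://example.org/page-with-tables.html")
str(tables)
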
Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:

  1. Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
  2. Then take that list of URLs and scrape the data you are looking for, and stick it into a data.frame.

So, here goes...

library(RCurl)

### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]

get_data_url<-function(s) {
    u_split1<-strsplit(s,"/")[[1]][1]           # keep the text up to the first "/"
    u_split2<-strsplit(u_split1,'\\"')[[1]][2]  # drop the opening quote, leaving the directory name
    # Keep only capitalized genome directories and skip in-page "#" anchors
    ifelse(grep("[[:upper:]]",u_split2)==1 & length(strsplit(u_split2,"#")[[1]])<2,return(u_split2),return(NA))
}

# Extract only those elements that are relevant
genomes<-unlist(lapply(links,get_data_url))
genomes<-genomes[which(is.na(genomes)==FALSE)]

### 2) Now, scrape the genome data from all of those URLS ###

# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data into a multi-dimensional list
parse_genomes<-function(g) {
    g_split1<-strsplit(g,"\n")[[1]]
    g_split1<-g_split1[2:5]
    # Pull all of the data and stick it in a list
    g_split2<-strsplit(g_split1[1],"\t")[[1]]
    ID<-g_split2[1]                             # Sequence ID
    LEN<-strsplit(g_split2[2],": ")[[1]][2]     # Length
    g_split3<-strsplit(g_split1[2],"\t")[[1]]
    TYPE<-strsplit(g_split3[1],": ")[[1]][2]    # Type
    AC<-strsplit(g_split3[2],": ")[[1]][2]      # Anticodon
    SEQ<-strsplit(g_split1[3],": ")[[1]][2]     # Sequence
    STR<-strsplit(g_split1[4],": ")[[1]][2]     # String
    return(c(ID,LEN,TYPE,AC,SEQ,STR))
}

# This will be a high-dimensional list with all of the data, which you can then manipulate as you like
get_structs<-function(u) {
    struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
    raw_data<-getURLContent(struct_url)
    s_split1<-strsplit(raw_data,"<PRE>")[[1]]
    all_data<-s_split1[seq(3,length(s_split1))]
    data_list<-lapply(all_data,parse_genomes)
    for (d in 1:length(data_list)) {data_list[[d]]<-append(data_list[[d]],u)}
    return(data_list)
}

# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp); a full scrape will take a LONG time
genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work; now we can just do a straightforward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>")   # Some malformed web pages produce bad rows, but we can remove them

head(genome_data)

The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, string, and name. The name column contains the base genome, which was my best guess for data organization. Here is what it looks like:

head(genome_data)
                                   ID   LEN TYPE                           AC                                                                       SEQ
1     Scaffold17302.trna1 (1426-1498) 73 bp  Ala     AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2   Scaffold20851.trna5 (43038-43110) 73 bp  Ala   AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3   Scaffold20851.trna8 (45975-46047) 73 bp  Ala   AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4     Scaffold17302.trna2 (2514-2586) 73 bp  Ala     AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp  Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6     Scaffold17302.trna4 (6027-6099) 73 bp  Ala     AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
                                                                        STR  NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp

I hope this helps, and thanks for the fun little Sunday afternoon R challenge!

#2 (1 vote)

I just tried this using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job, Drew.

#3 (0 votes)

Interesting problem, and I agree that R is cool, but somehow I find R to be a bit cumbersome in this respect. I prefer to get the data in an intermediate plain-text form first, so that I can verify the data is correct at every step... If the data is already in its final form, or for uploading your data somewhere, RCurl is very useful.

Simplest, in my opinion, would be to just mirror the entire http://gtrnadb.ucsc.edu/ site (on Linux/Unix/Mac, or in Cygwin) using wget, take the files named */-structs.html, sed or awk the data you would like, and format it for reading into R.

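If you would rather stay inside R, a rough sketch of the same "plain text first" idea might look like this (the Aero_pern URL is just the example genome from the question, and the grep pattern is only a placeholder for whatever fields you want to check):

u   <- "http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html"
tmp <- tempfile(fileext = ".html")
download.file(u, tmp, quiet = TRUE)       # fetch to a plain file you can inspect by hand
raw <- readLines(tmp, warn = FALSE)       # the intermediate plain-text form, easy to verify
head(grep("Type: ", raw, value = TRUE))   # pull out only the lines you care about
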
I'm sure there would be lots of other ways also.
