A method for batch-downloading conference papers from IEEE Xplore

Tags (space-separated): IEEE Xplore, bash

Test environment: Ubuntu 15.04, Sun Yat-sen University

First, start with a single paper: download any paper from IEEE Xplore and copy its download link, for example:

http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877226.pdf?tp=&arnumber=6877226&isnumber=6877223

Keep only the part before the question mark:

http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877226.pdf

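Next, on Linux, the wget command can quickly download a file from a given URL, so running wget on the truncated link above fetches the paper (the same command is used later for the batch download).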
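As a minimal sketch of this step, the query string can also be stripped in the shell with parameter expansion before the link is handed to wget. The variable names full_url and pdf_url are only illustrative, and the download itself is left commented out because it assumes a network with IEEE Xplore access.

```bash
#!/bin/bash
# Full link as copied from the browser (the example from above).
full_url='http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877226.pdf?tp=&arnumber=6877226&isnumber=6877223'

# ${var%%\?*} removes the longest suffix that starts with a literal "?".
pdf_url="${full_url%%\?*}"
echo "$pdf_url"

# Download it (assumes this machine can reach IEEE Xplore):
# wget "$pdf_url"
```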


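That downloads one paper. To download in batch, we need the download URL of every paper. Downloading a few more papers and comparing their links reveals the pattern: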

http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877326.pdf?tp=&arnumber=6877326&isnumber=6877223
http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877325.pdf?tp=&arnumber=6877325&isnumber=6877223
http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877324.pdf?tp=&arnumber=6877324&isnumber=6877223

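The download link has the following format (the first two numbers, 6875427 and 6877223, are the same for every paper in a given conference, so they only need to be looked up once):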

http://ieeexplore.ieee.org/ielx7/6875427/6877223/0{arnumber}.pdf

So the download link can be split into two parts. Note the extra 0 in front of the arnumber:

http://ieeexplore.ieee.org/ielx7/6875427/6877223/ and 0{arnumber}.pdf
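The split can be sketched as a small shell function that joins the two parts back together. The name build_url is made up for illustration. It simply prepends the 0 as described above, and it drops one trailing slash from the base so the result never contains a double slash.

```bash
#!/bin/bash
# Hypothetical helper: join the per-conference base URL and one arnumber.
build_url() {
  local base="${1%/}"   # remove a single trailing slash, if present
  local arnumber="$2"
  printf '%s/0%s.pdf\n' "$base" "$arnumber"
}

build_url "http://ieeexplore.ieee.org/ielx7/6875427/6877223/" 6877326
```

With the base and arnumber taken from the links above, this prints the same address as the second example link without its query string.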

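The problem now becomes how to get the arnumber of every paper. There are two ways. One is to write a crawler that parses the web pages, but that takes more code. This post uses the other: IEEE Xplore provides a Download Citations feature.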


Download the citations and save them to a file. Each entry looks like this:

Thangavel, M.; Chandrasekaran, M.; Madheswaran, M., "Analysis of B-mode transverse ultrasound common carotid artery images using contour tracking by particle filtering technique," in Devices, Circuits and Systems (ICDCS), 2012 International Conference on , vol., no., pp.470-473, 15-16 March 2012
doi: 10.1109/ICDCSyst.2012.6188759
keywords: {biodiffusion;biomedical ultrasonics;blood vessels;cardiovascular system;diseases;filters;image denoising;image segmentation;medical image processing;particle filtering (numerical methods);speckle;ultrasonic imaging;B-mode transverse ultrasound common carotid artery images;atherosclerosis;cardiovascular diseases;contour tracking;edge preserving anisotropic diffusion filter;image segmentation;medical image analysis;particle filtering technique;speckle noises;speckle reduction;Fitting;Image segmentation;Image Segmentation;Medical imaging;Particle filtering;Ultrasound image},
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6188759&isnumber=6188639

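The file contains information for every paper, including the title (which can be used to name the file), keywords, and URL. This URL cannot be passed to wget to download the paper directly, but it does contain the arnumber we need. The next step is to extract the arnumber and the paper title from this information.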

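Looking at the downloaded citations, every paper title is enclosed in double quotes, like "title", and the arnumber appears as arnumber=6188759. Regular expressions can match both. Here is the command: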

grep -o -e 'arnumber=[0-9]*' -e '"[^"]*"' "{citations file}" > "{save file}"

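This extracts the paper titles and arnumbers from the citations file just downloaded and saves them to another file. The two regular expressions match the arnumber and the paper title, respectively. The result looks like this: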

"Algorithm Engineering for Scalable Parallel External Sorting,"
arnumber=6012805
"Power-Aware Replica Placement and Update Strategies in Tree Networks,"
arnumber=6012820
"Minimum Cost Resource Allocation for Meeting Job Requirements,"
arnumber=6012821
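To see the extraction on a small input, the sketch below runs the same two patterns over a shortened copy of the sample citation shown earlier (the keywords line is left out for brevity). The temporary file and the variable names citations and extracted are only for illustration.

```bash
#!/bin/bash
# Write a one-entry citations file to a temporary location.
citations="$(mktemp)"
cat > "$citations" <<'EOF'
Thangavel, M.; Chandrasekaran, M.; Madheswaran, M., "Analysis of B-mode transverse ultrasound common carotid artery images using contour tracking by particle filtering technique," in Devices, Circuits and Systems (ICDCS), 2012 International Conference on , vol., no., pp.470-473, 15-16 March 2012
doi: 10.1109/ICDCSyst.2012.6188759
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6188759&isnumber=6188639
EOF

# -o prints only the matched text, one match per line; each -e adds a pattern.
extracted="$(grep -o -e 'arnumber=[0-9]*' -e '"[^"]*"' "$citations")"
echo "$extracted"

rm -f "$citations"
```

The title line comes out first and the arnumber line second, because that is the order in which they appear in each entry. The pairing relies on every entry having exactly one quoted string and one arnumber, so a title that itself contains double quotes would throw the pairs off.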

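Every two lines hold the title and arnumber of one paper. From here it is straightforward: a shell script reads these lines in a loop, downloads each paper by its arnumber, and saves it under the paper title. So how do we read them?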

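#!/bin/bash
base="http://ieeexplore.ieee.org/ielx7/6875427/6877223"   # no trailing slash; one is added below
file="filename.txt"   # the file of titles and arnumbers extracted above
while read -r title && read -r arnumber
do
  # Get the title: take the text inside the quotes, drop the trailing comma, remove every slash
  title=$(echo "$title" | cut -d '"' -f 2 | sed -e 's/,$//' -e 's/\///g')
  arnumber=$(echo "$arnumber" | cut -d "=" -f 2)   # get the arnumber
  # Download, then save under the title only if the download succeeded
  wget "$base/0$arnumber.pdf" && mv "0$arnumber.pdf" "$title.pdf"
done < "$file"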

Save it as download.sh and make it executable:

chmod +x download.sh

Then run it with ./download.sh and wait for it to finish.

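The script also uses two other commands: cut extracts part of a string, and sed removes slashes from the title, because a slash cannot appear in a file name. Their detailed usage is not covered here.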
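Two details are worth care when cleaning a title: a sed substitution without the g flag only affects the first match on a line, and cutting at the first comma would also truncate titles that contain commas. The sketch below is one way to handle both, assuming the goal is to keep the whole title. The function name clean_title and the sample title are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: turn one extracted title line into a safe file name.
clean_title() {
  # cut keeps the text between the first pair of double quotes;
  # sed then drops the trailing comma and removes every slash (g flag).
  printf '%s\n' "$1" |
    cut -d '"' -f 2 |
    sed -e 's/,$//' -e 's/\///g'
}

# Made-up title containing both a comma and a slash.
clean_title '"Client/Server Scheduling, Revisited,"'
```

For the made-up title this prints ClientServer Scheduling, Revisited, with the inner comma kept and the slash gone.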

Tested and working for ICDCS 2012 and IPDPS 2012-2015.


Here is my complete script:

#!/bin/bash

base=
file=
tempfile1="downlist.txt" # temporary file, deleted when done
tempfile2="urls.txt" # temporary file, deleted when done

# Remove leftovers from a previous run (-f ignores missing files)
rm -f "$tempfile1" "$tempfile2"

usage()
{
    echo "Usage: $(basename "$0") -b url_base_string -f input_file [-h help]"
    exit 1
}

while getopts "b:f:h" arg # a colon after an option letter means the option takes an argument
do
    case $arg in
        b)
            base=$OPTARG
            ;;
        f)
            file=$OPTARG
            ;;
        h)
            usage
            ;;
        ?)  # arg is ? when an unrecognized option is given
            echo "unknown argument"
            exit 1
            ;;
    esac
done

if [ -z "$base" ]; then   # the -b option is required
    echo "You must specify base with -b option"
    exit 1
fi

if [ -z "$file" ]; then   # the -f option is required
    echo "You must specify file with -f option"
    exit 1
fi

grep -o -e 'arnumber=[0-9]*' -e '"[^"]*"' "$file" > "$tempfile1"

while read -r title && read -r arnumber # read the title and arnumber in a loop (the title is not needed yet)
do
  arnumber=$(echo "$arnumber" | cut -d "=" -f 2)
  echo "${base%/}/0$arnumber.pdf" >> "$tempfile2" # generate every download link first and save it to the temporary file
done < "$tempfile1"

wget -i "$tempfile2" # download all papers in batch

echo $? # print wget's exit status (0 means no problems occurred)

while read -r title && read -r arnumber # rename
do
  title=$(echo "$title" | cut -d '"' -f 2 | sed -e 's/,$//' -e 's/\///g')
  arnumber=$(echo "$arnumber" | cut -d "=" -f 2)
  [ -f "0$arnumber.pdf" ] && mv "0$arnumber.pdf" "$title.pdf" # skip papers that failed to download
done < "$tempfile1"

rm -f "$tempfile1" "$tempfile2" # delete the temporary files

Usage: ./download.sh -b {base URL, which you need to look up yourself} -f {Citations file downloaded from IEEE Xplore}

./download.sh -b http://ieeexplore.ieee.org/ielx5/6180033/6188639 -f downloadCitations.txt
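As an alternative design, the download and the rename can be folded into one pass by letting wget write straight to the final file name with its -O option. The sketch below is not the script above, only a variation on it. The function names plan_downloads and run_plan are made up, and the dry run only prints what would be fetched, since the real download assumes access to IEEE Xplore.

```bash
#!/bin/bash
# Hypothetical helper: read "title / arnumber" line pairs on stdin and print
# one "URL<TAB>file name" line per paper.
plan_downloads() {
  local base="${1%/}" title arnumber
  while IFS= read -r title && IFS= read -r arnumber; do
    title="$(printf '%s\n' "$title" | cut -d '"' -f 2 | sed -e 's/,$//' -e 's/\///g')"
    printf '%s/0%s.pdf\t%s.pdf\n' "$base" "${arnumber#arnumber=}" "$title"
  done
}

# Hypothetical helper: fetch every planned entry. wget -O saves directly
# under the final name, so no separate rename pass is needed.
run_plan() {
  local url name
  while IFS=$'\t' read -r url name; do
    wget -O "$name" "$url" || echo "failed: $url" >&2
  done
}

# Dry run on the sample citation shown earlier, with the base URL from the
# usage line above.
plan="$(printf '%s\n' \
  '"Analysis of B-mode transverse ultrasound common carotid artery images using contour tracking by particle filtering technique,"' \
  'arnumber=6188759' |
  plan_downloads 'http://ieeexplore.ieee.org/ielx5/6180033/6188639')"
echo "$plan"

# To actually download: echo "$plan" | run_plan
```

One trade-off of this variation is that wget -O creates the output file before the transfer finishes, so a failed download can leave an empty or partial PDF behind that needs to be cleaned up.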

Feel free to ask me if you need help. @maxuan
