I'm trying to develop an automated script that downloads the data file below to a utility server and then runs ETL-related processing on it. Looking for pythonic suggestions. I'm not familiar with the current best option for this type of process among urllib, urllib2, Beautiful Soup, requests, mechanize, Selenium, etc.
"Full Replacement Monthly NPI File"
(the monthly data file)
The file name (and hence the URL) changes monthly.
Here is my approach thus far:
from bs4 import BeautifulSoup
import urllib
import urllib2

# Parse the NPI files page and collect every hyperlink on it.
soup = BeautifulSoup(urllib2.urlopen('http://nppes.viva-it.com/NPI_Files.html').read())
download_links = []
for link in soup.findAll(href=True):
    download_links.append(link.get('href', '/'))

# Assume the monthly full-replacement file is always the 3rd link.
target_url = download_links[2]
urllib.urlretrieve(target_url, "NPI.zip")
I am not anticipating the content on this clunky govt. site to change, so I thought just selecting the 3rd element of the scraped URL list would be good enough. Of course, if my entire approach is wrongheaded, I welcome correction (data analytics is my personal forte). Also, if I am using outdated libraries, unpythonic practices, or low-performance options, I definitely welcome the newer and better!
1 Answer
#1
In general, requests is the easiest way to get web pages.
If the name of the data files follows the pattern NPPES_Data_Dissemination_<Month>_<year>.zip, which seems logical, you can request it directly:
import requests
url = "http://nppes.viva-it.com/NPPES_Data_Dissemination_{}_{}.zip"
r = requests.get(url.format("March", 2015))
The data is then in r.content. (For a binary file like this zip, use r.content, the raw bytes; r.text would try to decode it as text.)
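Since the job is meant to run unattended each month, here is a minimal sketch that derives the month name from the current date and streams the multi-gigabyte zip to disk rather than holding it in memory. It assumes the <Month>_<year> pattern holds; the output name NPI.zip and the chunk size are arbitrary choices:

import requests
from datetime import date

# Build this month's file name, e.g. "NPPES_Data_Dissemination_March_2015.zip".
url = "http://nppes.viva-it.com/NPPES_Data_Dissemination_{}.zip"
target = url.format(date.today().strftime("%B_%Y"))

# Stream the response so the multi-GB zip never has to fit in memory.
r = requests.get(target, stream=True)
r.raise_for_status()  # fail loudly if the guessed name 404s
with open("NPI.zip", "wb") as f:
    for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
        f.write(chunk)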
If the data-file name is less certain, you can get the webpage and use a regular expression to search for links to zip files:
In [1]: import requests
In [2]: r = requests.get('http://nppes.viva-it.com/NPI_Files.html')
In [3]: import re
In [4]: re.findall(r'http.*NPPES.*\.zip', r.text)
Out[4]:
['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip',
'http://nppes.viva-it.com/NPPES_Deactivated_NPI_Report_031015.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_030915_031515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_031615_032215_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_032315_032915_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_033015_040515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_100614_101214_Weekly.zip']
The regular expression in In [4] basically says to find strings that start with "http", contain "NPPES", and end with ".zip". This isn't specific enough. Let's change the regular expression as shown below:
In [5]: re.findall(r'http.*NPPES_Data_Dissemination.*\.zip', r.text)
Out[5]:
['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_030915_031515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_031615_032215_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_032315_032915_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_033015_040515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_100614_101214_Weekly.zip']
This gives us the URL of the file we want, but also the weekly files.
In [6]: fileURLS = re.findall(r'http.*NPPES_Data_Dissemination.*\.zip', r.text)
Let's filter out the weekly files:
In [7]: [f for f in fileURLS if 'Weekly' not in f]
Out[7]: ['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip']
This is the URL you seek. The whole scheme does depend on how regular the names are, though. You can pass re.IGNORECASE to the regular-expression searches to ignore the case of the letters, which would make them accept more variants.
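For example:

In [8]: re.findall(r'http.*NPPES_Data_Dissemination.*\.zip', r.text, flags=re.IGNORECASE)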
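Putting the pieces together, a sketch of the whole monthly fetch might look like the following (the output name NPI.zip is just an example, and the Weekly filter mirrors In [7] above):

import re
import requests

# Scrape the listing page and pick out the monthly full-replacement file.
page = requests.get('http://nppes.viva-it.com/NPI_Files.html')
page.raise_for_status()
candidates = re.findall(r'http.*NPPES_Data_Dissemination.*\.zip', page.text, flags=re.IGNORECASE)
monthly = [u for u in candidates if 'weekly' not in u.lower()]

# Stream the large zip to disk for the ETL step to pick up.
r = requests.get(monthly[0], stream=True)
r.raise_for_status()
with open("NPI.zip", "wb") as f:
    for chunk in r.iter_content(chunk_size=1 << 20):
        f.write(chunk)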