Download image files from an HTML page source using Python?

Time: 2021-10-28 08:59:27

I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All the images are part of the HTML page.

6 solutions

#1


74  

Here is some code to download all the images from the supplied URL and save them in the specified output folder. You can modify it to suit your own needs.

"""
dumpimages.py
    Downloads all the images on the supplied URL, and saves them to the
    specified output file ("/test/" by default)

Usage:
    python dumpimages.py http://example.com/ [output]
"""
from bs4 import BeautifulSoup as bs
from urllib.request import (
    urlopen, urlparse, urlunparse, urlretrieve)
import os
import sys

def main(url, out_folder="/test/"):
    """Downloads all the images at 'url' to /test/"""
    soup = bs(urlopen(url))
    parsed = list(urlparse(url))

    for image in soup.findAll("img"):
        print("Image: %(src)s" % image)
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlunparse(parsed), outpath)

def _usage():
    print("usage: python dumpimages.py http://example.com [outpath]")

if __name__ == "__main__":
    url = sys.argv[-1]
    out_folder = "/test/"
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)

Edit: You can specify the output folder now.
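
For example, assuming the script is saved as dumpimages.py (the URL and the output path below are placeholders):

python dumpimages.py http://example.com/ /tmp/images/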

#2


12  

Ryan's solution is good, but it fails if the image source URLs are absolute URLs, or anything else that doesn't give a good result when simply concatenated to the main page URL. urljoin recognizes absolute vs. relative URLs, so replace the loop in the middle with:

from urllib.parse import urljoin  # replaces the urlparse/urlunparse handling

for image in soup.find_all("img"):
    print("Image: %s" % image["src"])
    # urljoin resolves both absolute and relative image URLs correctly
    image_url = urljoin(url, image["src"])
    filename = image["src"].split("/")[-1]
    outpath = os.path.join(out_folder, filename)
    urlretrieve(image_url, outpath)

#3


8  

You have to download the page and parse the HTML document, find your image with a regex, and download it. You can use urllib2 for downloading and Beautiful Soup for parsing the HTML file.
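
A minimal sketch of that approach is below. urllib2 exists only in Python 2, so this uses its Python 3 counterpart urllib.request, and it locates the images with Beautiful Soup rather than a regex; the page URL is a placeholder.

from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "http://example.com/"  # placeholder URL
soup = BeautifulSoup(urlopen(page_url), "html.parser")
for i, img in enumerate(soup.find_all("img", src=True)):
    # Resolve each src against the page URL and save under a numbered name
    # (numbered names avoid collisions between identically named files)
    urlretrieve(urljoin(page_url, img["src"]), "image_%d" % i)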

#4


8  

And this is a function for downloading one image:

import urllib.request  # urllib.urlopen is Python 2; use urllib.request in Python 3

def download_photo(self, img_url, filename):
    # DOWNLOADED_IMAGE_PATH is assumed to be defined elsewhere
    file_path = "%s%s" % (DOWNLOADED_IMAGE_PATH, filename)
    downloaded_image = open(file_path, "wb")  # file() is Python 2 only

    image_on_web = urllib.request.urlopen(img_url)
    # Read in 64 KB chunks so large images don't have to fit in memory at once
    while True:
        buf = image_on_web.read(65536)
        if len(buf) == 0:
            break
        downloaded_image.write(buf)
    downloaded_image.close()
    image_on_web.close()

    return file_path

#5


2  

Use htmllib to extract all img tags (override do_img), then use urllib2 to download all the images.
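
htmllib and urllib2 exist only in Python 2 and were removed in Python 3; here is a rough sketch of the same idea using the standard library's html.parser and urllib.request instead (the page URL is a placeholder):

from html.parser import HTMLParser
from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin

class ImgParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

page_url = "http://example.com/"  # placeholder URL
parser = ImgParser()
parser.feed(urlopen(page_url).read().decode("utf-8", errors="replace"))
for i, src in enumerate(parser.srcs):
    urlretrieve(urljoin(page_url, src), "img_%d" % i)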

#6


1  

If the request needs authorization, refer to this one:

import requests

# img_url, username, and password are assumed to be defined elsewhere
r_img = requests.get(img_url, auth=(username, password))
with open('000000.jpg', 'wb') as f:
    f.write(r_img.content)
