How do I scrape images from an aspx page?

Time: 2022-12-21 16:19:18

I am trying to scrape images from an aspx page. I have this code that scrapes images from a normal webpage, but it can't scrape the aspx page, because I need to send HTTP POST requests to the aspx page, and I can't figure out how to do that even after reading a few threads. This is the original code:

from bs4 import BeautifulSoup as bs
import urlparse
import urllib2
from urllib import urlretrieve
import os
import sys
import subprocess
import re


def thefunc(url, out_folder):

    c = False

I have already defined headers for the aspx page and an if statement that distinguishes between a normal page and an aspx page:

    select =  raw_input('Is this a .net  aspx page ? y/n : ')
    if select.lower().startswith('y'):
        usin = raw_input('Specify origin of .net page : ')
        usaspx = raw_input('Specify aspx page url : ')

The header for the aspx page:

        headdic = {
            'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Origin': usin,
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': usaspx,
            'Accept-Encoding': 'gzip,deflate,sdch',
            'Accept-Language': 'en-US,en;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
        }
        c = True

    if c:
        req = urllib2.Request(url, headers=headdic)
    else:
        req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
    resp = urllib2.urlopen(req)

    soup = bs(resp, 'lxml')

    parsed = list(urlparse.urlparse(url))

    print '\n',len(soup.findAll('img')), 'images are about to be downloaded'

    for image in soup.findAll("img"):

        print "Image: %(src)s" % image

        filename = image["src"].split("/")[-1]

        parsed[2] = image["src"]

        outpath = os.path.join(out_folder, filename)

        try:

            if image["src"].lower().startswith("http"):
                urlretrieve(image["src"], outpath)
            else:
                urlretrieve(urlparse.urlunparse(parsed), outpath)
        except:
            print 'OOPS missed one for some reason !!'
            pass


try:
    put =  raw_input('Please enter the page url : ')
    reg1 = re.compile('^http*',re.IGNORECASE)
    reg1.match(put)
except:
    print('Type the url carefully !!')
    sys.exit()
fol = raw_input('Enter the foldername to save the images : ')
if os.path.isdir(fol):
    thefunc(put, fol)
else:
    subprocess.call(['mkdir', fol])
    thefunc(put, fol)

I have made a few modifications for the aspx detection and for creating the header for the aspx page, but I am stuck on what to modify next.

***Here is the aspx page link:*** http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx

Sorry if I am not being clear; as you can see, I am new to programming. The question I am asking is: how can I get the images that appear on the aspx page when I click the next-page button in the browser? Right now I can only scrape one page, because the URL does not change unless I somehow send an HTTP POST to tell the page to show the next page with new pictures. Since the URL stays the same, I hope that makes my problem clear.

2 Answers

#1

You can do it using requests by posting to the URL with the correct data, which you can parse from the initial page:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
from itertools import chain

url = "http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx"


def validate(soup):
    return {"__VIEWSTATE": soup.select_one("#__VIEWSTATE")["value"],
            "__VIEWSTATEGENERATOR": soup.select_one("#__VIEWSTATEGENERATOR")["value"],
            "__EVENTVALIDATION": soup.select_one("#__EVENTVALIDATION")["value"]}


def parse(base, url):
    data = {"__ASYNCPOST": "true"
            }
    h = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17'}
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    data.update(validate(soup))
    # gets links for < 1,2,3,4,5,6>
    pages = [a["id"] for a in soup.select("a[id^=ctl01_ctl00_pbsc1_pbPagerBottom_btnP]")][2:]
    # get images from initial page
    yield [img["src"] for img in soup.select("img")]
    # add token for post 
    data.update(validate(soup))
    for p in pages:
        # we need $ in place of _ for the form data
        data["__EVENTTARGET"] = p.replace("_", "$")
        data["RadScriptManager1"] = "ctl01$ctl00$pbsc1$ctl01$ctl00$pbsc1$ajaxPanel1Panel|{}".format(p.replace("_", "$"))
        r = requests.post(url, data=data, headers=h).text
        soup = BeautifulSoup(r, "lxml")
        yield [urljoin(base, img["src"]) for img in soup.select("img")]


for url in chain.from_iterable(parse("http://www.foxrun.com.au/", url)):
    print(url)

That will give you the links; you just have to download the content and write it to file. Normally we could create a Session and go from one page to the next, but in this case what is posted is ctl01$ctl00$pbsc1$pbPagerBottom$btnNext, which would work fine going from the initial page to the second, but there is no concept of going from the second to the third and so on, as we have no page number in the form data.
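
To actually save the files, a small download loop like the sketch below (untested; it reuses the parse() generator and url from the snippet above, and the "images" folder name is just an example) can fetch each link with requests and write the bytes to disk:

import os
import requests
from urlparse import urljoin
from itertools import chain

out_folder = "images"  # example folder name, change as needed
if not os.path.isdir(out_folder):
    os.makedirs(out_folder)

for i, src in enumerate(chain.from_iterable(parse("http://www.foxrun.com.au/", url))):
    link = urljoin("http://www.foxrun.com.au/", src)  # the first page yields relative srcs
    r = requests.get(link)
    if r.status_code == 200:
        ext = os.path.splitext(link)[1] or ".jpg"  # naive extension guess from the url
        with open(os.path.join(out_folder, "image{}{}".format(i, ext)), "wb") as f:
            f.write(r.content)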

#2

My internet is really bad where I am, so I cannot guarantee 100% that this works just as it is, but what you want to execute is between these lines.

This works for any type of page. If I interpreted anything wrong, don't hold back on commenting.

import urllib2
from urlparse import urljoin
from urllib import urlretrieve
from bs4 import BeautifulSoup

url = "http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("img")
image=0
for img in imgs:
    link=urljoin(url,img['src']) #Join relative paths
    urlretrieve(link, "image"+str(image)) #saves image in the folder you execute this
    image+=1 #increments name

This will create

image1 image2 ... imageN

Change the target path as you wish

EDIT:

This has nothing to do with aspx.

The page links are JavaScript-generated, therefore you can't extract a URL from them. urllib doesn't handle dynamically generated content, so in this case you will have to use a browser emulator, something like Selenium + Firefox/PhantomJS, or you can use Splash. There is also CasperJS + PhantomJS. The possibilities are endless, but I'd go with Selenium :)

With these tools you can interact with the page as if you were in a browser (click, scroll, input text to boxes, etc)
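
If you do go the Selenium route, a minimal sketch could look like the following (untested: it assumes Firefox is installed, that there are six pager buttons as in the first answer, and that the "next" button's id is ctl01_ctl00_pbsc1_pbPagerBottom_btnNext, which is only inferred from the form field named above, so verify it in the page source):

from selenium import webdriver
from urlparse import urljoin
import time

url = "http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx"
driver = webdriver.Firefox()
driver.get(url)

links = set()
for _ in range(6):  # assumed page count; adjust to the real pager
    # collect every image src currently rendered
    for img in driver.find_elements_by_tag_name("img"):
        src = img.get_attribute("src")
        if src:
            links.add(urljoin(url, src))
    try:
        # assumed id of the pager's "next" button
        driver.find_element_by_id("ctl01_ctl00_pbsc1_pbPagerBottom_btnNext").click()
        time.sleep(2)  # crude wait for the AJAX partial postback to finish
    except Exception:
        break  # no clickable next button, stop paging

driver.quit()
for link in links:
    print(link)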
