Scraping Web Page Content with BeautifulSoup

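BeautifulSoup makes it easy to scrape content from a web page. The package parses a page's HTML into a DOM tree that can then be searched and traversed.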

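BeautifulSoup has to be installed before it can be used, either from the command line (for example, pip install beautifulsoup4) or through the package manager built into a Python IDE.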

Basic operations:

①

Before using it, import the class from the bs4 package: from bs4 import BeautifulSoup

②

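Creating the parsed object: soup = BeautifulSoup(res.text, 'html.parser')

In the first argument, res is the response for the fetched page (for example, the return value of requests.get) and res.text is that page's HTML as a string. The second argument, 'html.parser', selects Python's built-in HTML parser.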
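
As a minimal sketch of what the constructor produces, the example below parses a small inline HTML string instead of a real response. The html string and its contents are made up for illustration and stand in for res.text.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used in place of res.text (an assumption for this sketch)
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><h1 id='title'>Hello</h1><p>First paragraph</p></body></html>"
)

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# The result can be navigated as a tree of tags
page_title = soup.title.string      # text inside <title>
heading_text = soup.body.h1.text    # text inside the first <h1> under <body>

print(page_title)
print(heading_text)
```

Because the parser works on any HTML string, the same calls should apply unchanged once html is replaced by the text of a real response.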

③

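The select function finds every HTML element with a given tag. For example, soup.select('h1') finds all h1 elements.

It returns a list containing every matching h1 element; if nothing matches, the list is empty.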

Code:

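# res is the response for the fetched page (see step ②)
soup = BeautifulSoup(res.text, 'html.parser')
h1 = soup.select('h1')
for h in h1:
    print(h)
# Equivalent index-based loop (avoid naming a variable "len", which would shadow the built-in):
# count = len(h1)
# for i in range(count):
#     print(h1[i])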
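
The snippet above depends on a res object from a real request. As a self-contained sketch of the same idea, the following version runs select on an inline HTML string; the markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two h1 elements (an assumption for this sketch)
html = "<h1>First heading</h1><p>text</p><h1>Second heading</h1>"
soup = BeautifulSoup(html, "html.parser")

# select returns a list-like result with one entry per matching element
headings = soup.select("h1")
heading_texts = [h.text for h in headings]

# A selector that matches nothing yields an empty result rather than an error
no_match = soup.select("h2")

print(len(headings))
print(heading_texts)
print(len(no_match))
```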

④

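The select function also accepts CSS selectors, so elements can be found by id or by class. For example:

soup.select('#title') finds all elements whose id is title (the format is "#" followed by the id name)

soup.select('.link') finds all elements whose class is link (the format is "." followed by the class name)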
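
A short sketch of both selector forms, run against invented markup. The id title and the class link are sample names chosen to match the text, not values taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one element with id "title", two with class "link"
html = """
<div>
  <h1 id="title">Main title</h1>
  <a class="link" href="/a">First</a>
  <a class="link" href="/b">Second</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#title")    # "#" prefix selects by id
by_class = soup.select(".link")  # "." prefix selects by class

id_text = by_id[0].text
class_hrefs = [a["href"] for a in by_class]

print(id_text)
print(class_hrefs)
```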

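Every element returned by select is a Tag object, so the value of any of its attributes can be read by indexing the tag with the attribute name:

Code: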

from bs4 import BeautifulSoup

a = '<a href = "#" abc = 456 def = 123> i am a link </a>'
soup = BeautifulSoup(a, 'html.parser')
link = soup.select('a')[0]  # first (and only) a element
print(link['href'])  # prints: #
print(link['abc'])   # prints: 456
print(link['def'])   # prints: 123
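
Indexing a tag with an attribute it does not have raises a KeyError. As a hedged sketch, the variant below uses the tag's get method and attrs dictionary, which may be more convenient when attributes are optional. The markup is a trimmed version of the sample above, and the fallback value "n/a" is arbitrary.

```python
from bs4 import BeautifulSoup

# Sample link with two attributes; "def" is deliberately left out
a = '<a href="#" abc="456"> i am a link </a>'
soup = BeautifulSoup(a, "html.parser")
link = soup.select("a")[0]

all_attrs = link.attrs               # dictionary of every attribute on the tag
missing = link.get("def")            # None, because the attribute is absent
fallback = link.get("def", "n/a")    # caller-supplied default instead of None
link_text = link.text.strip()        # text content without surrounding spaces

print(all_attrs)
print(missing, fallback)
print(link_text)
```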

Hands-on example (scraping Sina News):

# Import the packages
import requests
from bs4 import BeautifulSoup

# Fetch the target page
res = requests.get("https://news.sina.com.cn/china/")
# Set the encoding used to decode the response text
res.encoding = 'utf-8'
# Parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# print(soup)

# Iterate over every element whose class is "news-1"
for news in soup.select('.news-1'):
    # Collect all li elements inside this block into a list
    li = news.select('li')
    # An empty list simply produces no output
    for item in li:
        print(item.text)
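
The extraction step can also be separated from the network request so that it can be tried without fetching a live page. The sketch below is one possible arrangement, not the only one: extract_list_items is a hypothetical helper name, and sample_html is invented markup that only imitates a list with class news-1. The real page's structure may differ or change over time.

```python
from bs4 import BeautifulSoup


def extract_list_items(html, container_selector=".news-1"):
    """Return the text of every li found inside elements matching container_selector.

    Hypothetical helper for illustration; the default selector mirrors the
    class name used in the example above.
    """
    soup = BeautifulSoup(html, "html.parser")
    # "container li" is a descendant selector: li elements nested in the container
    return [li.get_text(strip=True) for li in soup.select(f"{container_selector} li")]


# Invented markup standing in for a downloaded page (an assumption for this sketch)
sample_html = """
<ul class="news-1">
  <li><a href="#">Headline one</a></li>
  <li><a href="#">Headline two</a></li>
</ul>
<ul class="other"><li>Not news</li></ul>
"""

items = extract_list_items(sample_html)
print(items)
```

With a live response, the same helper would presumably be called as extract_list_items(res.text) after setting res.encoding.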
