This article shares the specific Python 3 code for scraping data into MySQL, for your reference. The details are as follows.
Let's go straight to the code:
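(Besides the standard library, the script depends on the third-party packages pymysql and beautifulsoup4, which can be installed with pip.)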
#!/usr/local/bin/python3.5
# -*- coding:UTF-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random
import pymysql

# Connect to the `scraping` database. Note that the unix_socket option
# only applies to local connections, so it is ignored for a remote host.
connect = pymysql.connect(host='192.168.10.142', unix_socket='/tmp/mysql.sock',
                          user='root', passwd='1234', db='scraping', charset='utf8')
cursor = connect.cursor()
cursor.execute('USE scraping')
random.seed()  # seed from system time/entropy


def store(title, content):
    # Only insert the page if no row with the same title exists yet;
    # cursor.execute() returns the number of rows matched by the SELECT.
    matched = cursor.execute('SELECT * FROM pages WHERE `title` = %s', (title,))
    if matched <= 0:
        cursor.execute('INSERT INTO pages (`title`, `content`) VALUES (%s, %s)',
                       (title, content))
        connect.commit()
    else:
        print('This content already exists.')


def get_links(article_url):
    # Fetch the article, store its title and first paragraph,
    # then return every in-wiki link found on the page.
    html = urlopen('http://en.wikipedia.org' + article_url)
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.h1.get_text()
    content = soup.find('div', {'id': 'mw-content-text'}).find('p').get_text()
    store(title, content)
    return soup.find('div', {'id': 'bodyContent'}).find_all(
        'a', href=re.compile('^(/wiki/)(.)*$'))


# Random walk: start from the main page and keep following
# a randomly chosen /wiki/ link from each article.
links = get_links('')
try:
    while len(links) > 0:
        new_article = links[random.randint(0, len(links) - 1)].attrs['href']
        links = get_links(new_article)
        print(links)
finally:
    cursor.close()
    connect.close()
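The script assumes a `pages` table already exists in the `scraping` database. The original post does not show the schema, but the INSERT statement above fixes the column names; a minimal sketch of a matching table (the id column, types, and sizes are assumptions) could be created once like this:

import pymysql

# One-time setup: create the `pages` table the scraper writes to.
# The column names (`title`, `content`) come from the INSERT in the
# script above; the id column, types, and sizes are assumptions.
conn = pymysql.connect(host='192.168.10.142', user='root', passwd='1234',
                       db='scraping', charset='utf8')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(200) NOT NULL,
            content TEXT
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()

A design note: the de-duplication in store() relies on the row count returned by the SELECT; an alternative would be a UNIQUE index on `title` combined with INSERT IGNORE.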
That's all for this article. I hope it helps with your studies, and I hope you will continue to support 服务器之家.
Original article: https://blog.csdn.net/ASAS1314/article/details/52594232