Python Web Scraping: Downloading Protein Crystal Structures and Small-Molecule Structures

Date: 2024-10-22 17:01:33

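Table of Contents

  • Preface
  • I. Environment Setup
  • II. Scraping Protein Structures from the PDB
    • 1. Downloading a given number of random PDB entries
    • 2. Downloading PDB entries for a given target
  • II. Scraping small-molecule mol2 structures from ZINC
    • 1. Downloading a given number of random molecules
    • 2. Downloading specific molecules
  • III. Scraping small-molecule information from ChEMBL
    • 1. Downloading SMILES for a given ID (test failed: the site has become read-only)
  • IV. Crawler Notes
    • 1. Finding the corresponding XPath
    • 2. XPaths for multiple objects of the same kind on one page
  • Summary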


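Preface

Downloading data by hand had become tedious, so I decided to practice web scraping to fetch the data I need. What I learned is summarized below.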


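I. Environment Setup

Install Google Chrome: https://www.google.cn/intl/zh-CN/chrome/next-steps.html?statcb=1&installdataindex=empty&defaultbrowser=0
Check the version of the installed Chrome browser (for example, on the chrome://settings/help page).

Install the driver (ChromeDriver) that matches that version: https://googlechromelabs.github.io/chrome-for-testing/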
Install the libraries the crawler needs:

pip install lxml
pip install selenium
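
Because the driver has to match the installed browser, it can help to compare the two version numbers before running a crawler. The following is a minimal sketch, not part of the original setup: the helper names major_version, same_major_version and installed_versions are hypothetical, and it assumes the Chrome session capabilities expose 'browserVersion' and 'chrome' -> 'chromedriverVersion', which is how ChromeDriver normally reports them.

```python
import re


def major_version(version: str) -> int:
    """Return the leading major number of a version string such as '130.0.6723.59'."""
    match = re.match(r"\s*(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1))


def same_major_version(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


def installed_versions(driver_path: str = "chromedriver.exe"):
    """Start headless Chrome once and return (browser_version, driver_version).

    Assumption: the capabilities dictionary contains 'browserVersion' and
    'chrome' -> 'chromedriverVersion'. Selenium is imported lazily so the
    pure helpers above can be used without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(service=Service(driver_path), options=options)
    try:
        caps = driver.capabilities
        return caps["browserVersion"], caps["chrome"]["chromedriverVersion"]
    finally:
        # Always close the browser, even if reading the capabilities fails
        driver.quit()
```

Calling same_major_version(*installed_versions()) should then report whether the pair matches. If the versions are far apart, starting the session itself will usually fail with an error, which is also a sign that a different driver build is needed.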

II. Scraping Protein Structures from the PDB

1. Downloading a given number of random PDB entries
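
The script in this subsection builds each PDB ID at random: one digit from 1 to 9 followed by three characters drawn from 0-9 and A-Z, appended to the https://files.rcsb.org/download/ URL. A minimal sketch of just that step is shown here. The helper names random_pdb_id and download_url are hypothetical and used only for illustration, and not every generated ID corresponds to a deposited entry, so a download attempt can still fail.

```python
import random
import re
import string

# One digit 1-9 followed by three characters from 0-9 or A-Z
PDB_ID_PATTERN = re.compile(r"[1-9][0-9A-Z]{3}")
ALPHANUMERIC = string.digits + string.ascii_uppercase


def random_pdb_id(rng=random):
    """Return a random four-character ID in the scheme described above."""
    first = rng.choice("123456789")
    rest = "".join(rng.choice(ALPHANUMERIC) for _ in range(3))
    return first + rest


def download_url(pdb_id, fmt="pdb"):
    """Build the file download URL for an ID, rejecting malformed IDs."""
    if PDB_ID_PATTERN.fullmatch(pdb_id) is None:
        raise ValueError("not a four-character PDB ID: " + repr(pdb_id))
    return "https://files.rcsb.org/download/" + pdb_id + "." + fmt


if __name__ == "__main__":
    pdb_id = random_pdb_id()
    print(pdb_id, download_url(pdb_id))
```

Keeping ID generation separate from the download makes it easy to test the scheme without touching the network.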

import urllib.request
import urllib.error
import os
import time
import random
import datetime

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


# Start time of the run
t1 = datetime.datetime.now()

# Run Chrome headless (no visible browser window)
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# User-Agent string of a desktop Chrome browser
headers = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.6723.59 Safari/537.36'

# Directory for the downloaded structures
output = 'protein/'
os.makedirs(output, exist_ok=True)

# chromedriver.exe is expected in the working directory
s = Service('chromedriver.exe')
driver = webdriver.Chrome(service=s, options=chrome_options)

class Spider:
  def __init__(self,numbers,fmt):
    self.numbers = numbers  # number of structures to download
    self.fmt = fmt          # file extension, e.g. 'pdb'

  # Download self.numbers structures using randomly generated PDB IDs
  def download_protein(self):
    # ID scheme: one digit 1-9 followed by three characters from 0-9 and A-Z
    num_list = list('123456789')
    letter_list = list('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')

    num = 0       # successful downloads so far
    attempts = 0  # IDs tried so far
    max_attempts = self.numbers * 20  # stop eventually if most IDs miss
    while num < self.numbers and attempts < max_attempts:
      attempts += 1
      A = random.choice(num_list)
      B = random.choice(letter_list)
      C = random.choice(letter_list)
      D = random.choice(letter_list)
      PDBID = A + B + C + D
      url = 'https://files.rcsb.org/download/' + PDBID + '.' + self.fmt

      try:
        # Send the browser User-Agent string defined above with the request
        request = urllib.request.Request(url=url, headers={'User-Agent': headers})
        with urllib.request.urlopen(request) as response:
          data = response.read()
      except urllib.error.URLError:
        # HTTPError is a subclass of URLError; a random ID often has no entry, so try another
        continue

      with open(os.path.join(output, PDBID + '.' + self.fmt),'wb') as f:
        f.write(data)
      num = num + 1
      print('Downloaded protein {0} in {1} format: {2}'.format(num,self.fmt,PDBID))
      time.sleep(random.randint(1,3))

spider = Spider(numbers=10,fmt='pdb')
spider.download_protein()
driver