精易论坛

Title: Python - Web Crawler and Sqlite3

Author: 大司命    Time: 2020-12-14 14:51
Title: Python - Web Crawler and Sqlite3
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    conn = sqlite3.connect('Python.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS Python (
        Url VARCHAR,
        Title VARCHAR,
        Author VARCHAR
    )''')
    conn.commit()

    # --------------------Split Line--------------------
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36"
    }

    for i in range(1, 1046):
        url = "http://xxx/index_%s.html" % str(i)
        req = requests.get(url=url, headers=headers)
        req.encoding = "utf-8"
        html = BeautifulSoup(req.text, "lxml")

        # --------------------Split Line--------------------
        for div in html.find_all('div', class_='loop'):
            content_body = div.select('h2 > a')[0]
            content_infor = div.select('.content_infor > span:nth-child(3)')[0]
            link = "http://xxx" + content_body.get('href')

            # ----- Skip URLs already stored; the "?" placeholder lets sqlite3
            # ----- handle quoting, avoiding SQL injection via crafted URLs
            c.execute("SELECT COUNT(*) FROM Python WHERE Url = ?", (link,))
            if c.fetchone()[0] > 0:
                continue

            # --------------------Split Line--------------------
            # ----- Parameterized INSERT: no manual quote escaping needed
            c.execute('INSERT INTO Python (Url, Title, Author) VALUES (?, ?, ?)', (
                link,
                content_body.get('title'),
                content_infor.text.replace('xxx: ', '')))

        conn.commit()
        print("Page %s" % str(i))

    # --------------------Split Line--------------------
    conn.close()
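The CSS-selector logic in the inner loop can be exercised offline against a static HTML snippet. The `loop` and `content_infor` class names come from the post's code; the sample markup below is hypothetical, shaped only to match those selectors:

```python
from bs4 import BeautifulSoup

# Hypothetical sample of one list item, matching the selectors used in the post
sample = """
<div class="loop">
  <h2><a href="/post/1.html" title="Hello World">Hello World</a></h2>
  <div class="content_infor">
    <span>date</span><span>views</span><span>xxx: Alice</span>
  </div>
</div>
"""

html = BeautifulSoup(sample, "html.parser")  # "lxml" also works if installed
for div in html.find_all('div', class_='loop'):
    a = div.select('h2 > a')[0]                                   # title link
    author_span = div.select('.content_infor > span:nth-child(3)')[0]  # 3rd span
    print(a.get('href'), a.get('title'), author_span.text.replace('xxx: ', ''))
    # → /post/1.html Hello World Alice
```

Testing the selectors against saved HTML first makes it easy to confirm `nth-child(3)` really lands on the author field before crawling 1000+ pages.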

Python - Web Crawler and Sqlite3
https://bbs.266.la/forum.php?mod=viewthread&tid=540
(Source: 派生社区)
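A leaner alternative to the SELECT-then-INSERT duplicate check is to declare `Url` as `UNIQUE` and let SQLite skip repeats with `INSERT OR IGNORE`. A minimal sketch, using an in-memory database and made-up rows for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory DB for the sketch
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS Python (
    Url VARCHAR UNIQUE,
    Title VARCHAR,
    Author VARCHAR
)''')

rows = [
    ("http://xxx/post/1.html", "First", "Alice"),
    ("http://xxx/post/1.html", "First (dup)", "Alice"),  # same Url: silently skipped
    ("http://xxx/post/2.html", "Second", "Bob"),
]
for row in rows:
    c.execute("INSERT OR IGNORE INTO Python (Url, Title, Author) VALUES (?, ?, ?)", row)
conn.commit()

c.execute("SELECT COUNT(*) FROM Python")
count = c.fetchone()[0]  # 2: the duplicate Url was ignored
conn.close()
```

This removes one round-trip per article and keeps deduplication enforced by the database itself rather than by application logic.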

Welcome to 精易论坛 (https://125.confly.eu.org/) Powered by Discuz! X3.4