精易论坛

Title: Python-xpath

Author: 大司命    Time: 2020-12-23 16:55
Title: Python-xpath
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3

import requests
from lxml import etree


def write_sql(conn, host, text):
    """Store the new articles found on one list page; return True if any row was added."""
    c = conn.cursor()
    html = etree.HTML(text)
    # Titles
    titles = html.xpath('//ul[@class="news"]//a[@target="_blank"]/p/text()')
    # Links
    hrefs = html.xpath('//ul[@class="news"]//a[@target="_blank"]/@href')
    # Dates
    ems = html.xpath('//ul[@class="news"]//a[@target="_blank"]/em/text()')

    number = 0
    for title, href, em in zip(titles, hrefs, ems):
        href = host + href
        # Parameterized query: sqlite3 handles the quoting, so no manual escaping is needed
        c.execute("SELECT COUNT(*) FROM Python WHERE Url = ?", (href,))
        # Skip URLs that are already stored
        if c.fetchone()[0] > 0:
            continue

        # Note: the Author column receives the date text taken from <em>
        c.execute("INSERT INTO Python (Url, Title, Author) VALUES (?, ?, ?)",
                  (href, title, em))
        number += 1
        print(title, href, em)

    conn.commit()
    return number > 0


if __name__ == '__main__':

    conn = sqlite3.connect("Python-xxx.db")
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS Python (
        Url VARCHAR,
        Title VARCHAR,
        Author VARCHAR
    )''')
    conn.commit()

    host = "https://xxx"
    url = host + "/xxx"
    req = requests.get(url)
    req.encoding = 'utf-8'
    # print(req.text)

    html = etree.HTML(req.text)
    # Category names and links from the navigation bar
    clearfixs = html.xpath('//*[@class="nav clearfix"]//a[starts-with(@href, "/cate/")]/text()')
    hrefs = html.xpath('//*[@class="nav clearfix"]//a[starts-with(@href, "/cate/")]/@href')
    # print(clearfixs, hrefs)

    for clearfix, href in zip(clearfixs, hrefs):
        print(clearfix, host + href)

        page = 1
        while True:
            url = host + href + "/list_%s.html" % page
            req = requests.get(url)
            req.encoding = 'utf-8'

            # Stop at the first page that adds no new rows
            if not write_sql(conn, host, req.text):
                break

            print("Page %s" % page)
            page += 1

    conn.close()
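The script checks for duplicates with a SELECT COUNT(*) before every insert. One possible alternative, sketched here and not taken from the original post, is to let SQLite enforce uniqueness itself: declare Url as UNIQUE and use INSERT OR IGNORE. The helper name insert_if_new, the in-memory database, and the example.invalid URL are assumptions made for illustration only.

```python
import sqlite3


def insert_if_new(conn, url, title, date):
    """Try to insert one row; return True only if the URL was not stored yet."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO Python (Url, Title, Author) VALUES (?, ?, ?)",
        (url, title, date),
    )
    # rowcount is 1 when the row was inserted, 0 when the UNIQUE constraint skipped it.
    return cur.rowcount == 1


# Throwaway in-memory database (assumption: the real script uses a file on disk).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS Python (
    Url VARCHAR UNIQUE,
    Title VARCHAR,
    Author VARCHAR
)""")

# Hypothetical row; the title contains double quotes on purpose.
row = ("https://example.invalid/a/1.html", 'A "quoted" title', "2020-12-23")
first = insert_if_new(conn, *row)
second = insert_if_new(conn, *row)
conn.commit()
stored = conn.execute("SELECT Url, Title, Author FROM Python").fetchall()
```

Because the values travel as "?" parameters, quotes inside a title need no escaping. Note that CREATE TABLE IF NOT EXISTS does not alter a table that already exists, so a database file created earlier without the UNIQUE constraint would still need the explicit check.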
XPath is just so pleasant to use~
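One thing to watch with the three separate XPath queries: zip() pairs results by position, so if a single link has no em date, every later date shifts onto the wrong title. A minimal sketch of a per-link variant follows, assuming markup shaped like the sample below. The SAMPLE string and the extract_items name are made up for illustration, and the real page structure may differ.

```python
from lxml import etree

# Hypothetical sample markup; the second link deliberately has no <em> date.
SAMPLE = """
<ul class="news">
  <li><a target="_blank" href="/a/1.html"><p>First title</p><em>2020-12-01</em></a></li>
  <li><a target="_blank" href="/a/2.html"><p>Second title</p></a></li>
  <li><a target="_blank" href="/a/3.html"><p>Third title</p><em>2020-12-03</em></a></li>
</ul>
"""


def extract_items(text):
    """Return one (title, href, date) tuple per link, with "" for a missing part."""
    html = etree.HTML(text)
    items = []
    # Select each link once, then query relative to that link.
    for a in html.xpath('//ul[@class="news"]//a[@target="_blank"]'):
        # string(...) evaluates to "" when the child element is absent.
        title = a.xpath('string(p)').strip()
        date = a.xpath('string(em)').strip()
        href = a.get("href", "")
        items.append((title, href, date))
    return items


items = extract_items(SAMPLE)
for item in items:
    print(item)
```

With this sample, zipping three independent result lists would yield only two rows and attach the third link's date to the second title; querying relative to each a element keeps every row aligned.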


Python-xpath
https://bbs.266.la/forum.php?mod=viewthread&tid=959
(Source: 派生社区)


Author: hanlang    Time: 2022-4-10 19:44
Learned something, thanks.
Author: yjtnihaoma1    Time: 2024-7-18 20:00
Digging up an old thread in 2024, ahhhhhhhhhh
Author: zhyl8888    Time: 2025-5-29 13:04





Welcome to 精易论坛 (https://125.confly.eu.org/) Powered by Discuz! X3.4