[python] Scraping web pages with Selenium and chromedriver

2022-07-09

1 Scraping pages that do not use JavaScript

For example, a page such as https://quotes.toscrape.com/, which the code below scrapes:

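1.1 Install the BeautifulSoup library

Install it with pip: pip install beautifulsoup4 (the example below also needs requests).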

1.2 Code example

app.py:

import requests
from pages.quotes_page import QuotesPage

page_content = requests.get("https://quotes.toscrape.com/").content

page = QuotesPage(page_content)

for quote in page.quotes:
    print(quote)

Using BeautifulSoup (pages/quotes_page.py):

from bs4 import BeautifulSoup

from locators.quotes_page_locators import QuotesPageLocators
from parsers.quote import QuoteParser

class QuotesPage:
    def __init__(self, page):
        self.soup = BeautifulSoup(page, "html.parser")

    @property
    def quotes(self):
        locator = QuotesPageLocators.QUOTE
        quote_tags = self.soup.select(locator)
        return [QuoteParser(e) for e in quote_tags]

quote.py, the parser, which parses the HTML of a single quote:

from locators.quote_locators import QuoteLocators

class QuoteParser:
    """
    Given one of the specific quote divs, find out the data
    about the quote (quote content, author, tags).
    """
    def __init__(self, parent):
        self.parent = parent

    def __repr__(self):
        return f"<Quote {self.content} by {self.author}>"

    @property
    def content(self):
        locator = QuoteLocators.CONTENT
        return self.parent.select_one(locator).string

    @property
    def author(self):
        locator = QuoteLocators.AUTHOR
        return self.parent.select_one(locator).string

    @property
    def tags(self):
        locator = QuoteLocators.TAGS

        # select all available individual tags
        return [e.string for e in self.parent.select(locator)]
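The two locator modules imported above are not shown in the article. A minimal sketch of what they might contain follows. The CSS selectors are assumptions based on the markup of https://quotes.toscrape.com/ and should be checked against the live page in the browser's developer tools.

```python
# locators/quotes_page_locators.py (hypothetical contents)
class QuotesPageLocators:
    # Assumption: each quote on the page sits in <div class="quote">.
    QUOTE = "div.quote"


# locators/quote_locators.py (hypothetical contents)
class QuoteLocators:
    # Assumed selectors, relative to a single quote <div>.
    CONTENT = "span.text"
    AUTHOR = "small.author"
    TAGS = "div.tags a.tag"
```

Keeping the selectors in one place means that a change in the page markup only requires editing these constants, not the page or parser classes.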

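2 Scraping pages that use JavaScript

These pages must execute JavaScript to generate the content you need. On a page such as https://quotes.toscrape.com/search.aspx, getting a quote takes three steps: first select an author, then select a tag, and finally click the search button. Only then is the matching quote displayed.

With Selenium and chromedriver, code can perform these otherwise manual steps and then scrape the resulting page data, which automates the browser.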

2.1 Download chromedriver

https://chromedriver.chromium.org/downloads Before downloading, check which version of Chrome you are running. Chrome 103 and Chrome 104, for example, each require a different chromedriver, so pick the matching version.
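The version check described above can be sketched as a small helper. This is only an illustration of comparing major versions, as the text describes; the function names and the sample version strings are assumptions, not part of any chromedriver tooling.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and chromedriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Sample version strings, for illustration only.
print(versions_match("103.0.5060.134", "103.0.5060.53"))  # True
print(versions_match("104.0.5112.79", "103.0.5060.53"))   # False
```

The Chrome version can be read from chrome://version, and the driver version from running chromedriver with --version.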

2.2 Unzip the chromedriver archive

Then place chromedriver.exe somewhere convenient; its path is needed later.

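2.3 Install Selenium

The latest version at the time of writing is 4.3.0, and that is the version installed here (pip install selenium==4.3.0). The API differs between versions.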

2.4 Code example app.py:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

from pages.quotes_page import QuotesPage

chrome = webdriver.Chrome(service=Service("chromedriver.exe"))
chrome.get("https://quotes.toscrape.com/search.aspx")
page = QuotesPage(chrome)

author = input("Enter the author you'd like quotes from: ")
page.select_author(author)

tags = page.get_available_tags()
print("Select one of these tags: [{}]".format(" | ".join(tags)))
selected_tag = input("Enter your tag: ")

page.select_tag(selected_tag)

page.search_button.click()
print(page.quotes)

Test run:

Whether you scrape with BeautifulSoup or Selenium, the process is the same: analyze the HTML, use CSS selectors to locate the data you need in the markup, and then call the library's functions to read it.
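That workflow can be tried without any network access. The sketch below runs BeautifulSoup on a hand-written HTML fragment; the fragment and its class names are made up for illustration and only loosely modeled on a quote block.

```python
from bs4 import BeautifulSoup

# Hand-written HTML fragment, for illustration only.
html = """
<div class="quote">
  <span class="text">Sample quote text.</span>
  <small class="author">Sample Author</small>
  <div class="tags"><a class="tag">first</a><a class="tag">second</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: locate the container with a CSS selector.
quote = soup.select_one("div.quote")

# Step 2: read single values relative to the container.
content = quote.select_one("span.text").string
author = quote.select_one("small.author").string

# Step 3: read a list of values with select().
tags = [a.string for a in quote.select("div.tags a.tag")]

print(content, author, tags)
```

select_one returns the first match or None, so real scraping code may want to check for None before reading .string.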

Source: IT技术分享网
Share URL: http://www.5ityx.cn/cate107/69582.html