Python 3 Programming 05 -- Web Scraping in Practice: Scraping News Site Information 1

2022-09-27

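Scraping information from a news site

This post scrapes the following from a single news article: the title, the publication time, the body text, the responsible editor, the comment count, and the news identifier.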

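Preparation:

Install Python 3.

Install the required packages: jupyter, requests, beautifulsoup4 (install each with pip install xxx). datetime is part of the Python standard library and does not need to be installed.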

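Choose the news site to scrape:

First open Sina News at https://news.sina.com.cn/china/

Find an article that has comments.

Copy its URL.

Right-click an empty area of the page and choose Inspect to open the browser's developer tools.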

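Download the page data

# Download the page data
import requests

res = requests.get('https://news.sina.com.cn/c/2018-11-15/doc-ihnvukff4194550.shtml')
# Set the encoding explicitly so the Chinese text decodes correctly
res.encoding = 'utf-8'
print(res.text)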

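Extract the news title

Inspecting the title in the developer tools shows that it sits in an element with class=main-title. The class or id of every other piece of information can be found the same way.

# Extract the news title, class=main-title
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')
# select() returns a list of matching elements; take the first one
title = soup.select('.main-title')[0].text
print(title)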
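Indexing with [0] raises an IndexError when the selector matches nothing, for example if the page layout changes. A minimal sketch of a more forgiving lookup follows, assuming a made-up HTML fragment in place of the real page; select_text is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<h1 class="main-title">Sample headline</h1>
<span class="date">2018年11月15日 19:47</span>
</body></html>
"""


def select_text(soup, css, default=None):
    """Return the stripped text of the first element matching css, or default."""
    node = soup.select_one(css)  # select_one() returns None when nothing matches
    return node.get_text(strip=True) if node is not None else default


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_title = select_text(sample_soup, '.main-title')
missing = select_text(sample_soup, '.no-such-class', default='')
print(sample_title)
print(repr(missing))
```

The same helper can be pointed at the real soup object with any of the selectors used in this post.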

Extract the news time

# Extract the news time
from datetime import datetime
timesource = soup.select('.date')[0].text
# Convert the string to a datetime object
dt = datetime.strptime(timesource, '%Y年%m月%d日 %H:%M')
print(dt)
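The parsing step can be tried without downloading anything. The sketch below assumes the page shows the time in the form "2018年11月15日 19:47"; the sample value is made up, and parse_news_time is a hypothetical helper name.

```python
from datetime import datetime

# Format assumed to match the text inside the .date element.
PAGE_FORMAT = '%Y年%m月%d日 %H:%M'


def parse_news_time(text):
    """Parse the page's date string, ignoring surrounding whitespace."""
    return datetime.strptime(text.strip(), PAGE_FORMAT)


sample_dt = parse_news_time(' 2018年11月15日 19:47 ')
print(sample_dt.isoformat())           # machine-friendly form
print(sample_dt.strftime('%Y-%m-%d'))  # date only
```

Once the value is a datetime, strftime can render it in whatever format the rest of the program needs.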

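Extract the news content

# Get the content, id=article; [:-1] drops the last paragraph
source = soup.select('#article p')[:-1]
print(source)

# Collect the paragraphs into one list
article = []
for p in source:
    article.append(p.text.strip())
# Join the paragraphs with "***"
'***'.join(article)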

A shorter way to get the content

# Same result as above, written as a single list comprehension
'***'.join([p.text.strip() for p in soup.select('#article p')[:-1]])

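Extract the responsible editor

# Extract the responsible editor, class=show_author
# replace() removes the exact "责任编辑：" label; lstrip() would strip any leading characters from that set
soup.select('.show_author')[0].text.strip().replace('责任编辑：', '', 1)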
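One detail worth knowing here: str.lstrip treats its argument as a set of characters, not as a prefix, so it can eat into a name that begins with one of the label's characters. The sketch below shows the difference on a made-up editor name; strip_prefix is a hypothetical helper.

```python
def strip_prefix(text, prefix):
    """Remove prefix from the start of text only if it is really there."""
    text = text.strip()
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


LABEL = '责任编辑：'
sample = '责任编辑：任小明'  # hypothetical editor name starting with 任

by_lstrip = sample.lstrip(LABEL)         # also strips the leading 任 of the name
by_prefix = strip_prefix(sample, LABEL)  # removes only the label
print(by_lstrip)
print(by_prefix)
```

On Python 3.9 and later, str.removeprefix does the same job as the helper.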

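Extract the comment count (the hard part)

Wrong approach:

# Extract the comment count from the element with class=num
soup.select('.num')[0]

The article we opened clearly has comments, yet the count scraped this way is 0, so this approach does not work.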

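Correct approach: the comment count is not in the HTML we downloaded. It is loaded separately by JavaScript. In the developer tools, look through the JS requests the page makes until you find a response containing a number that matches the comment count shown on the page.

Comparing the comments shown on the page with the contents of that response shows that they match, which confirms that the comments come from the "info?version=1..." request.

Copy the URL of that request: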

# Download the comment data
comments = requests.get('https://comment.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-hnvukff4194550&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1&callback=jsonp_1542297510153&_=1542297510153')
comments.encoding = 'utf-8'
print(comments.text)

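The output looks like JSON, but it starts with an extra prefix, "jsonp_1542297510153(".

Look at the comment URL: its tail contains exactly the same string as that prefix.

So let's try removing that trailing part of the URL and fetching the data again. This time the output is plain JSON.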

# Download the comment data with "&callback=jsonp_1542297510153&_=1542297510153" removed from the end of the URL
comments = requests.get('https://comment.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-hnvukff4194550&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
comments.encoding = 'utf-8'
print(comments.text)
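If the callback parameter cannot be dropped, the wrapper can also be removed from the response text before parsing. This is only a sketch: it assumes the response has the shape name(...) seen above, the sample payload is made up, and strip_jsonp is a hypothetical helper.

```python
import json
import re

# Matches "callbackName( ... )" with an optional trailing semicolon.
_JSONP = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text):
    """Return the JSON inside a JSONP wrapper, or text unchanged if not wrapped."""
    match = _JSONP.match(text)
    return match.group(1) if match else text


# Made-up payload imitating the prefix seen in the real response.
wrapped = 'jsonp_1542297510153({"result": {"count": {"total": 123}}})'
unwrapped = strip_jsonp(wrapped)
print(unwrapped)
print(json.loads(unwrapped)['result']['count']['total'])
```

Text that is already plain JSON passes through unchanged, so the helper is safe to apply to either form of the response.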

Parse the JSON data to get the comment count

# Load the JSON data
import json
json.loads(comments.text)

# Walk the parsed JSON down to the comment count
jd = json.loads(comments.text)['result']['count']['total']
print(jd)
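Chained indexing raises a KeyError as soon as one level is missing, for instance when an article has no comment data. A minimal sketch of a more defensive lookup follows, assuming the result, count, total nesting shown above; the sample payloads are made up and comment_total is a hypothetical helper.

```python
import json


def comment_total(payload_text):
    """Return result -> count -> total from JSON text, or None if a level is missing."""
    node = json.loads(payload_text)
    for key in ('result', 'count', 'total'):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up payloads: one complete, one with the count section missing.
complete = '{"result": {"count": {"total": 123}}}'
incomplete = '{"result": {}}'
print(comment_total(complete))
print(comment_total(incomplete))
```

Returning None lets the caller decide whether a missing count should be treated as zero or as an error.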

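Extract the news identifier

# Extract the news identifier, method 1: string slicing
newurl = 'https://news.sina.com.cn/c/2018-11-15/doc-ihnvukff4194550.shtml'
filename = newurl.split('/')[-1]
# Slice off the 'doc-i' prefix and the '.shtml' suffix (lstrip/rstrip strip character sets, not substrings)
filename[len('doc-i'):-len('.shtml')]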

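# Extract the news identifier, method 2: regular expression
import re
newurl = 'https://news.sina.com.cn/c/2018-11-15/doc-ihnvukff4194550.shtml'
# Raw string with the dot escaped so it only matches a literal '.'
match = re.search(r'doc-i(.+)\.shtml', newurl)
match.group(1)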
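The identifier is the same string that appears after "comos-" in the newsid parameter of the comment URL, so the two steps can be linked. The sketch below rebuilds the comment URL from a news URL. It assumes the "comos-" prefix and the other query parameters (such as channel=gn) carry over to other articles, which this post has only confirmed for one. news_id and comment_url are hypothetical helper names.

```python
import re
from urllib.parse import urlencode

COMMENT_API = 'https://comment.sina.com.cn/page/info'


def news_id(news_url):
    """Return the identifier between 'doc-i' and '.shtml', or None if absent."""
    match = re.search(r'doc-i(.+)\.shtml', news_url)
    return match.group(1) if match else None


def comment_url(news_url):
    """Build the comment API URL for a news URL (parameters copied from this post)."""
    nid = news_id(news_url)
    if nid is None:
        raise ValueError('no news identifier found in ' + news_url)
    params = {
        'version': 1, 'format': 'json', 'channel': 'gn',
        'newsid': 'comos-' + nid,  # assumption: same prefix for every article
        'group': 'undefined', 'compress': 0, 'ie': 'utf-8', 'oe': 'utf-8',
        'page': 1, 'page_size': 3, 't_size': 3, 'h_size': 3, 'thread': 1,
    }
    return COMMENT_API + '?' + urlencode(params)


sample_url = 'https://news.sina.com.cn/c/2018-11-15/doc-ihnvukff4194550.shtml'
print(news_id(sample_url))
print(comment_url(sample_url))
```

For the article used in this post, the generated URL is the same as the callback-free comment URL fetched earlier.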

That's all for now, enjoy it!

Source: IT技术分享网
Shared at: http://www.5ityx.cn/cate100/128501.html
