Scraping Xinwen Lianbo (text version) with Python
Environment setup
First install Python 3 and pip3, then install the following libraries:
- pip install beautifulsoup4
- pip install requests
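If you want to double-check that both libraries installed correctly, a quick optional sanity check (not part of the original script) is:

# Optional sanity check: both imports should succeed and print version numbers.
import requests
import bs4
print(requests.__version__)
print(bs4.__version__)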
Writing the script
Running the code below will scrape the text version of Xinwen Lianbo and save it to news.txt:
import datetime
import requests
from bs4 import BeautifulSoup

# Pretend to be a regular browser so the site does not reject the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
}

# List page that links to the latest Xinwen Lianbo transcripts.
url = "http://www.sdpp.com.cn/list/list_98.html"
s = requests.get(url, headers=headers)
s.encoding = "utf-8"
print(s.status_code)

# Grab the link of the newest item in the news list.
bs = BeautifulSoup(s.text, "html.parser")
news = bs.find("aside", class_="news_list").find("a")["href"]

# Dates for today, yesterday, and the day before in YYYYMMDD form,
# used to check whether the newest item is recent.
today = (datetime.date.today() + datetime.timedelta(days=0)).strftime("%Y%m%d")
preday = (datetime.date.today() + datetime.timedelta(days=-1)).strftime("%Y%m%d")
pre2day = (datetime.date.today() + datetime.timedelta(days=-2)).strftime("%Y%m%d")

if pre2day in news or preday in news or today in news:
    print("have new news")
    r = requests.get(news, headers=headers)
    r.encoding = "utf-8"
    bs = BeautifulSoup(r.text, "html.parser")

    # Total number of pages for this broadcast, plus its title.
    allpagecount = int(bs.find("span", {"id": "allpagecount"}).get_text())
    title = bs.find("div", {"class": "keys3"}).get_text()
    temp = title + "\n"

    # Follow the "next page" link and collect the text of every page.
    for i in range(1, allpagecount):
        link = bs.find("a", {"id": "nextpageurl"})["href"]
        r = requests.get(link, headers=headers)
        r.encoding = "utf-8"
        bs = BeautifulSoup(r.text, "html.parser")
        maintext = bs.find("div", {"class": "textCon"}).get_text()
        temp = temp + maintext + "\n"
        print(maintext)

    # Save everything to news.txt.
    with open("news.txt", "w", encoding="utf8") as f:
        f.write(temp)

    # email_data = {
    #     "title": title,
    #     "body": temp,
    #     "sender": "your_email@xx.com",
    #     "password": "password",
    #     "receiver": "your_email@xx.com",
    #     "smtpserver": "smtp.163.com",
    #     "is_send_email": True
    # }
    # send_email(**email_data)
Sending the email
See my earlier post on sending email with Python; a rough sketch of such a function is also given after the steps below.
- Import the send_email function
- In the commented-out block above, change sender, password, and receiver to your own email address, password, and recipient address
- Uncomment that block
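For reference, a minimal send_email built on the standard-library smtplib could look like the sketch below. This is only illustrative and assumes the real function from the referenced post takes the same keyword arguments as the email_data dictionary above:

import smtplib
from email.mime.text import MIMEText
from email.header import Header

def send_email(title, body, sender, password, receiver, smtpserver, is_send_email=True):
    # Hypothetical sketch: the signature is inferred from the email_data
    # dictionary above, not taken from the original post.
    if not is_send_email:
        return
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = Header(title, "utf-8")
    msg["From"] = sender
    msg["To"] = receiver
    # Port 465 (SSL) is the usual setting for smtp.163.com; adjust if your
    # provider expects a different port or STARTTLS instead.
    with smtplib.SMTP_SSL(smtpserver, 465) as server:
        server.login(sender, password)
        server.sendmail(sender, [receiver], msg.as_string())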
