Anti-crawling: the robots protocol and how to handle it
The Robots Protocol
Robots: the robots protocol tells search engines which pages may be crawled and which may not. Location: in the site root, at <site URL>/robots.txt. For example: https://www.baidu.com/robots.txt and https://www.douban.com/robots.txt. The douban.com file returns the following:

```
User-agent: *
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /forum/
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Disallow: /link2/
Disallow: /recommend/
Disallow: /doubanapp/card
Disallow: /update/topic/
Disallow: /share/
Allow: /ads.txt
Sitemap: https://www.douban.com/sitemap_index.xml
Sitemap: https://www.douban.com/sitemap_updated_index.xml

User-agent: Wandoujia Spider
Disallow: /

User-agent: Mediapartners-Google
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /j/
```
robots.txt syntax: `User-agent` names the crawler a block of rules applies to (`*` means any crawler), `Disallow` lists path prefixes that crawler must not fetch, `Allow` lists exceptions to those rules, and `Sitemap` points to the site's sitemap index.
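To make the directives concrete, here is a small, purely illustrative robots.txt; the domain and paths are made up and are not taken from any real site:

```
# Illustrative robots.txt (hypothetical site and paths)
User-agent: *                              # rules below apply to every crawler
Disallow: /search                          # nothing under /search may be crawled
Allow: /ads.txt                            # exception: this file may be fetched
Sitemap: https://example.com/sitemap.xml   # where the sitemap index lives

User-agent: BadBot                         # a block aimed at one specific crawler
Disallow: /                                # that crawler may not fetch anything
```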
How to handle it
In practice, when crawling a website you should respect the Robots protocol. There are two ways to do this:
- The first is the manual approach shown above: download the robots.txt file and analyze it. Among the allowed resources is the line Sitemap: https://www.douban.com/sitemap_index.xml; opening that URL returns a list of sitemap files. Opening the first URL in that list downloads a gz file; decompress it and drag it into PyCharm, and you get something like this:
```xml
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.douban.com/</loc>
    <priority>1.0</priority>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://www.douban.com/explore/</loc>
    <priority>0.9</priority>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://www.douban.com/online/</loc>
    <priority>0.9</priority>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://www.douban.com/group/</loc>
    <priority>0.9</priority>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://www.douban.com/group/all</loc>
    <priority>0.9</priority>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://www.douban.com/group/category/1/</loc>
    <priority>0.9</priority>
    <changefreq>weekly</changefreq>
  </url>
```
This is only an excerpt; every page listed in these sitemap files may be crawled. For example, opening https://www.douban.com/explore gives a page whose content can all be crawled. A small script that automates these download-and-parse steps is sketched after this item.
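The manual steps above can also be scripted. The following is a minimal sketch, assuming the sitemap index follows the standard sitemaps.org format shown earlier and that the individual sitemap files are gzip-compressed; the exact file names on douban.com, and whether it accepts a plain browser-like User-Agent, are assumptions here:

```python
import gzip
import xml.etree.ElementTree as ET
from urllib import request

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # browser-like UA; the site may reject urllib's default

def fetch(url):
    """Download a URL and return the raw response bytes."""
    req = request.Request(url, headers=HEADERS)
    with request.urlopen(req) as resp:
        return resp.read()

# 1. Download the sitemap index and collect the sitemap file URLs it lists.
index_root = ET.fromstring(fetch("https://www.douban.com/sitemap_index.xml"))
sitemap_files = [loc.text for loc in index_root.iter(SITEMAP_NS + "loc")]

# 2. Fetch the first sitemap file and decompress it in memory if it is gzipped.
data = fetch(sitemap_files[0])
if sitemap_files[0].endswith(".gz"):
    data = gzip.decompress(data)

# 3. Parse the sitemap and print every crawlable page URL.
for loc in ET.fromstring(data).iter(SITEMAP_NS + "loc"):
    print(loc.text)
```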
- The second is to use the parser that ships with the standard library.
```python
from urllib import robotparser  # this is the module we need
from urllib import request
from fake_useragent import UserAgent

gen = UserAgent()
hd = {"User-Agent": gen.random}          # random browser User-Agent
url = "https://www.douban.com/robots.txt"

# Fetch and print the raw robots.txt
url1 = request.Request(url, headers=hd)
req = request.urlopen(url1)
con = req.read().decode("utf-8")
print(con)

# Parse it and ask whether the root may be crawled
robot = robotparser.RobotFileParser()
robot.set_url(url)
robot.read()
print(robot.can_fetch("*", "/"))
```
This prints True. Here "*" stands for the user-agent (any crawler) and "/" is the root directory; can_fetch returns a bool.
```python
from urllib import robotparser

url = "https://www.baidu.com/robots.txt"
robot = robotparser.RobotFileParser()
robot.set_url(url)
robot.read()
print(robot.can_fetch("Baiduspider", "/"))
print(robot.can_fetch("Baiduspider", "/baidu"))
```
The first path is allowed to be crawled, the second is not. A small wrapper that applies this check before every request is sketched below.
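Putting the two ideas together, a crawler can consult the parsed robots.txt before every request. This is only a sketch: load_robots, polite_get, and the MyCrawler/0.1 User-Agent are made-up names for illustration, not part of any library.

```python
from urllib import request, robotparser
from urllib.parse import urlparse

USER_AGENT = "MyCrawler/0.1"  # hypothetical crawler name; substitute your own

def load_robots(site_root):
    """Download and parse robots.txt for a site, e.g. 'https://www.douban.com'."""
    rp = robotparser.RobotFileParser()
    rp.set_url(site_root.rstrip("/") + "/robots.txt")
    rp.read()
    return rp

def polite_get(rp, url):
    """Fetch url only if robots.txt allows it for our User-Agent."""
    if not rp.can_fetch(USER_AGENT, urlparse(url).path or "/"):
        raise PermissionError("robots.txt disallows " + url)
    req = request.Request(url, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as resp:
        return resp.read()

rp = load_robots("https://www.douban.com")
print(rp.can_fetch(USER_AGENT, "/explore/"))   # not in the Disallow list above, so True
print(rp.can_fetch(USER_AGENT, "/search"))     # explicitly disallowed, so False
# html = polite_get(rp, "https://www.douban.com/explore/")
```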