The HTML a Scrapy Spider gets from a URL is different from the HTML the browser gets from the same URL!? I'm dumbfounded

import scrapy

class ZhipinSpider(scrapy.Spider):
    name = 'zhipin'
   
    start_urls = ['https://www.zhipin.com/job_detail/?query=python&city=101010100']

    def parse(self, response):
        # First <div> child of <body> in the raw response, serialized to strings
        nodes = response.xpath('//body/div[1]').getall()

        print('list:', nodes)

After running it, the child tags the spider finds under body are different from the HTML I see in Chrome DevTools. I'm baffled.

For example, in the browser I see body/div[@id="wrap"], and from there I can keep parsing downward.

But what Scrapy's Spider finds under body is:

 ['<div class="data-tips">\n            <div class="tip-inner">\n                <div class="boss-loading">\n                    <span class="component-b">B</span><span class="component-o">O</span><span class="component-s1">S</span><span class="component-s2">S</span>\n                    <p class="gray">正在加载中...</p>\n                </div>\n            </div>\n        </div>']

What? I have never seen this **div class="data-tips"** in the browser...
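A quick way to check what the raw response really contains is to list the direct children of `<body>` and compare them with what DevTools shows. The following is only a minimal sketch using the standard library: the `BodyChildren` class is a hypothetical helper, and the sample HTML is an abbreviated, made-up version of the output above.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BodyChildren(HTMLParser):
    """Collect (tag, id, class) for each direct child of <body>."""

    def __init__(self):
        super().__init__()
        self.depth = None  # None while we are outside <body>
        self.children = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.depth = 0
            return
        if self.depth is None:
            return
        if self.depth == 0:
            attr_map = dict(attrs)
            self.children.append((tag, attr_map.get("id"), attr_map.get("class")))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.depth = None
        elif self.depth and tag not in VOID_TAGS:
            self.depth -= 1


# Assumed sample: a shortened stand-in for the HTML the spider received.
raw_html = """<html><body>
<div class="data-tips"><div class="tip-inner"><p class="gray">loading</p></div></div>
</body></html>"""

parser = BodyChildren()
parser.feed(raw_html)
print(parser.children)
```

Inside `parse`, feeding `response.text` to the parser instead of the sample string would show whether a `div` with `id="wrap"` is present in the downloaded source at all.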

I searched online and found an answer someone had posted:

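For the same URL, a browser also runs the page's JS, so the source a browser ends up with and the source a Python crawler library such as requests downloads from that URL are bound to differ. If you copy an XPath from Chrome and use it on the HTML from requests, it will break most of the time. The usual causes are a tbody node that does not exist in the raw source, different tag attributes (id, class), a different tag order, and so on.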
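The tbody case can be reproduced in a few lines. This is only an illustrative sketch: it uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, and the table snippet is made up rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet standing in for the raw HTML a crawler downloads.
raw_source = "<html><body><table><tr><td>Python</td></tr></table></body></html>"
root = ET.fromstring(raw_source)

# Path as it would be copied from DevTools, where the browser's DOM has a <tbody>.
copied_from_browser = root.findall("./body/table/tbody/tr/td")

# Path that matches the raw source, in which no <tbody> was ever written.
matches_raw_source = root.findall("./body/table/tr/td")

print(len(copied_from_browser))
print([td.text for td in matches_raw_source])
```

The first path finds nothing because the `tbody` element exists only in the browser's DOM, while the second path matches the markup as it was actually sent.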

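How can I make the Spider fetch the same result the browser gets? I'm guessing this is an anti-scraping measure, but could anyone help me figure out how to deal with it? :(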

9 replies • 2020-01-29 01:25:13 +08:00