
Help with an error when crawling with scrapy

    yifengs · 2019-11-25 13:38:08 +08:00 · 4729 views

    Crawling the 阳光政务 (Sunshine Government Affairs) site with scrapy produces errors, although the data still comes out. How can I resolve these two errors? The output is as follows:

    [scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
    Traceback (most recent call last):
      File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
        defer.returnValue((yield download_func(request=request, spider=spider)))
      File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
        raise _DefGen_Return(val)
    twisted.internet.defer._DefGen_Return: <200 http://www.sun0769.com/error/404.htm>

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/scrapy/robotstxt.py", line 15, in decode_robotstxt
        robotstxt_body = robotstxt_body.decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 327: invalid start byte
    {'content': '东莞南城周溪东径北街 6 号天台严重违建,现在还出租了,没有跟进后续情况', 'content_img': [], 'href': 'http://wz.sun0769.com/html/question/201911/436799.shtml', 'publish_date': '2019-11-25 11:58:44', 'title': '东莞南城周溪东径北街 6 号天台严重违建现在还出租了,相关部门没有跟进后续情况'}

    The dict on the last line is the scraped data.

    3 replies · 2019-11-25 14:13:02 +08:00
    #1 zdnyp · 2019-11-25 14:04:09 +08:00
    Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
    You can set the robots option (ROBOTSTXT_OBEY) to False in settings.
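
    The Scrapy setting that controls robots.txt handling is `ROBOTSTXT_OBEY`. A minimal sketch of the suggested change, assuming the project uses the usual generated `settings.py` (the file layout is an assumption, not something shown in the thread):

```python
# settings.py of the Scrapy project (excerpt; file layout is assumed)

# When True, Scrapy downloads and parses robots.txt before crawling
# and skips URLs the file disallows. With False, robots.txt is not
# fetched at all, so the parsing warning can no longer be triggered.
ROBOTSTXT_OBEY = False
```

    Note that this silences the warning by skipping robots.txt entirely, so the spider also stops honoring whatever crawl rules the site publishes.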
    #2 yifengs (OP) · 2019-11-25 14:08:44 +08:00
    Thanks, the error is gone. Was my scrapy not installed properly? Why would parsing robots.txt fail?
    #3 yifengs (OP) · 2019-11-25 14:13:02 +08:00
    Oh, I see now: the robots protocol disallows it. Thanks!
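
    On the question of why parsing failed: the traceback shows scrapy calling `robotstxt_body.decode('utf-8')` on the response body, and the log suggests that response was the site's 404 page (`<200 http://www.sun0769.com/error/404.htm>`), which apparently was not UTF-8. The sketch below reproduces the same kind of `UnicodeDecodeError` outside scrapy. The helper `decode_body` and the choice of GBK as the fallback encoding are illustrative assumptions, not part of scrapy.

```python
def decode_body(body: bytes, fallback: str = "gbk"):
    """Decode bytes as UTF-8, falling back to another encoding.

    Returns a (text, encoding_used) tuple. The fallback encoding is an
    assumption; the thread does not say what the page actually used.
    """
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" keeps the fallback from raising on stray bytes.
        return body.decode(fallback, errors="replace"), fallback


# 0xb6 is a UTF-8 continuation byte, so it can never start a character.
# This is the same byte value reported in the traceback.
sample = b"\xb6\xd4"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decoding failed:", exc.reason)

text, used = decode_body(sample)
print("decoded with:", used)
```

    Scrapy itself does not fall back like this; as its warning says, it treats an undecodable robots.txt as an empty file, which is why the crawl continued and the data still came out.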