Scraping a web page: the data can't be decoded, and it doesn't look like a wrong encoding - V2EX

This topic was created 2863 days ago; the information in it may have since evolved or changed.

I'm a beginner... Here is my code:

from html.parser import HTMLParser
import urllib.request
import chardet

pars = HTMLParser()
home_url = "https://wallstreetcn.com/"
response = urllib.request.urlopen(home_url)
content = response.read()
encoding = chardet.detect(content)
pars.feed(content.decode(encoding["encoding"],errors="ignore"))

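In Chrome, the page's metadata shows that the charset is utf-8. On my side, whether I pass 'utf-8' directly or use the detected encoding, the content never decodes correctly. It feels as if the response isn't returning the right data at all. Any advice?

7 replies • 2017-12-27 15:59:04 +08:00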

1

n329291362

2017-12-17 23:46:54 +08:00

1

The data starts with 1f8b... that's gzip compression, the simplest case:
import gzip
gzip.decompress(content)
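
To make the 1f8b check above concrete, here is a minimal sketch. The helper name `maybe_gunzip` and the sample page are made up for illustration; the only thing taken from the thread is that a gzip stream starts with the bytes 1f 8b and that `gzip.decompress` undoes it.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(raw: bytes) -> bytes:
    """Decompress raw if it looks like gzip; otherwise return it unchanged.

    Hypothetical helper, not part of any library.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Demonstration on locally built data, so no network access is needed.
page = "<html><title>demo</title></html>".encode("utf-8")
compressed = gzip.compress(page)

print(compressed[:2].hex())                      # 1f8b
print(maybe_gunzip(compressed).decode("utf-8"))  # the original HTML
```

In the original script, this would mean calling `maybe_gunzip(content)` before `chardet.detect` and `decode`, since both were being run on compressed bytes.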

2

swordspoet

2017-12-18 00:48:49 +08:00 via iPhone

Switch to a different HTML parser. html.parser isn't very fault-tolerant; try lxml.

3

swordspoet

2017-12-18 00:51:39 +08:00 via iPhone

from bs4 import BeautifulSoup

standard_html = BeautifulSoup(content, 'lxml')

Give this a try.

4

free9fw

2017-12-18 09:29:26 +08:00

Accept-Encoding:gzip, deflate, br

5

hukangha

OP

2017-12-18 22:06:47 +08:00

@n329291362
Sure enough... But what if it's some other kind of compression... Is the only option to rely on lots of experience...
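
One way to avoid guessing is to read the response's Content-Encoding header and dispatch on it. The following is a rough sketch under that assumption: `decompress_body` is a hypothetical helper, it covers only gzip and deflate from the standard library, and it deliberately raises for anything else (for example br, which needs a third-party package).

```python
import gzip
import zlib
from typing import Optional


def decompress_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the compression named by a Content-Encoding header value.

    Hypothetical helper; handles only identity, gzip and deflate.
    """
    encoding = (content_encoding or "identity").strip().lower()
    if encoding in ("", "identity"):
        return raw  # body was sent uncompressed
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


# Local demonstration, no network access needed.
body = b"<html>hello</html>"
print(decompress_body(gzip.compress(body), "gzip") == body)     # True
print(decompress_body(zlib.compress(body), "deflate") == body)  # True
```

With `urllib.request`, the header value would come from `response.headers.get("Content-Encoding")`, passed in together with `response.read()`.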

6

n329291362

2017-12-19 01:50:08 +08:00

@hukangha
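You can also set the Accept-Encoding request header so that the server returns uncompressed data.
For web pages it's gzip compression at most; it doesn't get much more complicated than that.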
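
The Accept-Encoding suggestion could look roughly like this with `urllib.request`. The function names are invented for the example, and whether a given server actually honors `identity` is an assumption; a server is free to ignore the header, so checking Content-Encoding on the response is still worthwhile.

```python
import urllib.request


def build_uncompressed_request(url: str) -> urllib.request.Request:
    """Build a request that asks the server not to compress the body.

    Hypothetical helper; "identity" means "no content encoding".
    """
    return urllib.request.Request(url, headers={"Accept-Encoding": "identity"})


def fetch_text(url: str, charset: str = "utf-8") -> str:
    """Fetch url and decode the body, assuming it arrives uncompressed."""
    with urllib.request.urlopen(build_uncompressed_request(url)) as response:
        return response.read().decode(charset)


# Building the request does not touch the network; fetch_text would.
request = build_uncompressed_request("https://wallstreetcn.com/")
# urllib stores header names capitalized as "Accept-encoding".
print(request.get_header("Accept-encoding"))  # identity
```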

7

F1024

2017-12-27 15:59:04 +08:00

import os
import requests

html = requests.get('https://wallstreetcn.com').content.decode('utf-8')
print(html)

os.system("pause")

Is this what you're after?
