1
yangbin9317 2016-04-10 19:05:47 +08:00
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('tmp1.txt'), 'html.parser')
list_a = soup.find_all('a')        # every <a> tag in the file
list_a_text = []
for link in list_a:
    list_a_text.append(link.text)  # collect each link's text
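A slightly tidier equivalent of the snippet above, assuming Python 3: a with block closes the file handle automatically, and a list comprehension replaces the loop.

from bs4 import BeautifulSoup

with open('tmp1.txt') as f:                   # file is closed automatically
    soup = BeautifulSoup(f, 'html.parser')    # parsing happens eagerly, so closing afterwards is safe
list_a = soup.find_all('a')
list_a_text = [link.text for link in list_a]  # same result as the explicit loop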
2
yangbin9317 2016-04-10 19:06:27 +08:00
Are you still hiring crawler developers?
3
omg21 OP @yangbin9317 Thanks, so that's how it's done.
Er... we're not hiring at the moment, but I hope we get a chance to work together in the future.
4
gitb 2016-04-10 23:31:45 +08:00 via Android
I'd recommend parsing with lxml; the built-in parser is slow.
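A quick way to check this claim on your own file, assuming lxml is installed (it is a separate pip install, unlike the built-in html.parser):

import timeit
from bs4 import BeautifulSoup

html = open('tmp1.txt').read()
for parser in ('html.parser', 'lxml'):
    # parse the same document 100 times with each parser and compare
    elapsed = timeit.timeit(lambda: BeautifulSoup(html, parser).find_all('a'),
                            number=100)
    print(parser, elapsed)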
6
niuzhewen 2016-04-10 23:41:57 +08:00 via Android
from bs4 import BeautifulSoup
from lxml import etree  # not strictly needed: passing 'lxml' below is enough

soup = BeautifulSoup(open('tmp1.txt'), 'lxml')
list_a = soup.find_all('a')
list_a_text = []
for link in list_a:
    list_a_text.append(link.text)
7
chevalier 2016-04-11 01:15:02 +08:00
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('tmp1.txt'), 'lxml')
list_a = [tag.get('href') for tag in soup.select('a[href]')]

list_a now holds every hyperlink on the page. # Looking for all kinds of freelance crawler work
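If tmp1.txt contains relative links, they can be resolved against a base URL with urljoin; a minimal sketch, where http://example.com/ is a placeholder for the page's real origin:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('tmp1.txt'), 'lxml')
base = 'http://example.com/'  # hypothetical base URL; substitute the page's real one
# urljoin leaves absolute URLs untouched and resolves relative ones against base
list_a = [urljoin(base, tag.get('href')) for tag in soup.select('a[href]')]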
8
xlzd 2016-04-11 14:13:06 +08:00
list_a, list_a_text = (lambda l: ([_['href'] for _ in l], [_.getText() for _ in l]))(getattr(__import__('bs4'), 'BeautifulSoup')(open('tmp1.txt'), 'lxml').find_all('a'))
The code above pulls every link in tmp1.txt into list_a and each link's text into list_a_text.
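The same logic unpacked into plain statements (note that tag['href'] raises KeyError for any <a> without an href, a behavior the one-liner shares):

from bs4 import BeautifulSoup

tags = BeautifulSoup(open('tmp1.txt'), 'lxml').find_all('a')
list_a = [tag['href'] for tag in tags]         # KeyError if an <a> has no href
list_a_text = [tag.getText() for tag in tags]  # text content of each link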