https://github.com/intohole/xspider is admittedly reinventing the wheel, but let's get familiar with it together.
main.py:
from xspider.spider.spider import BaseSpider
from xspider.filters import urlfilter
from kuailiyu import KuaiLiYu

if __name__ == "__main__":
    # Crawl kuailiyu.cyzone.cn, handing every fetched page to the KuaiLiYu processor.
    spider = BaseSpider(name="kuailiyu", page_processor=KuaiLiYu(),
                        allow_site=["kuailiyu.cyzone.cn"],
                        start_urls=["http://kuailiyu.cyzone.cn/"])
    # Only follow article pages and paginated index pages.
    spider.url_filters.append(urlfilter.UrlRegxFilter([
        r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
        r"kuailiyu.cyzone.cn/index_[0-9]+.html$"]))
    spider.start()
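The two regexes above decide which discovered links the spider will actually fetch. Assuming UrlRegxFilter keeps a URL when any of its patterns matches via a regex search (an assumption about xspider's behaviour, not something stated in the repo), the effect can be sketched with plain re:

import re

# Same patterns as passed to UrlRegxFilter above.
patterns = [r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
            r"kuailiyu.cyzone.cn/index_[0-9]+.html$"]

# Article pages and paginated index pages pass; anything else is dropped.
for url in ["http://kuailiyu.cyzone.cn/article/123.html",
            "http://kuailiyu.cyzone.cn/index_2.html",
            "http://kuailiyu.cyzone.cn/about.html"]:
    print(url, any(re.search(p, url) is not None for p in patterns))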
kuailiyu.py:
from xspider import processor
from xspider.selector import xpath_selector
from xspider import model


class KuaiLiYu(processor.PageProcessor.PageProcessor):
    def __init__(self):
        super(KuaiLiYu, self).__init__()
        # Extract the page title with an XPath selector.
        self.title_extractor = xpath_selector.XpathSelector(path="//title/text()")

    def process(self, page, spider):
        # Return the extracted fields for this page.
        items = model.fileds.Fileds()
        items["title"] = self.title_extractor.find(page)
        items["url"] = page.url
        return items
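As a rough sketch of how the same processor could be extended, the class below reuses only the XpathSelector / Fileds API already shown above; the //h1/text() path is a hypothetical guess and has not been checked against the site's actual markup:

class KuaiLiYuWithHeadline(processor.PageProcessor.PageProcessor):
    def __init__(self):
        super(KuaiLiYuWithHeadline, self).__init__()
        self.title_extractor = xpath_selector.XpathSelector(path="//title/text()")
        # Hypothetical XPath; the real article pages may mark up the headline differently.
        self.headline_extractor = xpath_selector.XpathSelector(path="//h1/text()")

    def process(self, page, spider):
        items = model.fileds.Fileds()
        items["title"] = self.title_extractor.find(page)
        items["headline"] = self.headline_extractor.find(page)
        items["url"] = page.url
        return items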
Replies:

1  xiaozizayang  2017-11-23 16:25:48 +08:00
2  tamlok  2017-11-23 16:49:58 +08:00 via Android
3  intohole (OP)
@xiaozizayang Will take a look and learn from it.
5  j1wu  2017-11-23 20:00:21 +08:00
A JavaScript version as an assist; learning from everyone. Orz https://github.com/j1wu/cli-scraper
6  zhangysh1995  2017-11-23 21:39:56 +08:00
I happen to be learning web crawling right now; bookmarked. Keep it up, OP!
8  intohole (OP)
@zhangysh1995 The API in there hasn't been tidied up yet. This crawler was built specifically for setups short on machines, trading time for hardware.
9  coolloves  2017-12-01 11:12:16 +08:00
Bookmarked; learning from it.