Choosing a framework and storage for a public-opinion monitoring crawler project - V2EX

This topic was created 1974 days ago; the information mentioned may have changed since.

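I have a public-opinion monitoring crawler project on my hands: it needs to track online sentiment about several hundred companies plus a set of keywords. My previous experience is mostly with crawling a single site. After some research, I plan to use scrapy_redis for distributed crawling with mongodb as the database, and I'm also considering Hadoop as the storage framework, which should make data transfer and computation easier. Could anyone who has crawled many sites at once advise on which crawler framework and storage would be the better choice?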
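For reference, the scrapy_redis part of this plan mostly comes down to a few project settings. The following is only a minimal sketch, assuming the scrapy-redis package is installed and Redis runs locally; the Redis URL and the `myproject.pipelines.MongoPipeline` path are hypothetical placeholders, not something from this thread.

```python
# settings.py (sketch) for a Scrapy project that shares its queue through Redis.
# Assumes the scrapy-redis package is installed and Redis is reachable locally.

# Let scrapy-redis schedule requests from a shared Redis queue,
# so several worker processes can pull from the same crawl.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all workers with a Redis-backed fingerprint set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints when a spider stops, so a crawl can resume.
SCHEDULER_PERSIST = True

# Assumed connection string; adjust host, port, and database as needed.
REDIS_URL = "redis://localhost:6379/0"

# Hypothetical pipeline path: a project-specific class that writes items to MongoDB.
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,
}
```

With settings like these, every worker started with the same Redis URL takes requests from one shared queue, which is what makes it practical to spread many target sites across machines.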

16 replies • 2021-01-18 10:53:03 +08:00

1

AntoniotheFuture

Jan 7, 2021

Commercial services for public-opinion crawling already exist. Would you consider one?

2

liwenbest

OP

Jan 7, 2021

@AntoniotheFuture Add me on QQ 986636628 and we can talk privately.

3

AntoniotheFuture

Jan 7, 2021

@liwenbest I don't do this myself. Search Baidu and you'll find plenty.

4

Keyes

Jan 7, 2021

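What's the budget? Buy an off-the-shelf product and send a couple of people over to keep an eye on it. At my company the public-opinion projects have all been turned into SaaS; a single standalone system simply can't recoup its cost.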

5

wzwwzw

Jan 7, 2021

@liwenbest Your QQ asks a verification question and I can't answer it.

6

jr55475f112iz2tu

Jan 7, 2021

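Starting something like this from zero isn't very realistic. DataStory (数说故事), Mininglamp (明略), Miaozhen (秒针) and the like all have ready-made solutions.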

7

murmur

Jan 7, 2021

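Are you really going to build this, or is it just a way to take someone's money? Forget about crawling Weibo and Zhihu, and with that many Tieba forums, where would you even start?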

8

liwenbest

OP

Jan 7, 2021

@wzwwzw 沙滩车 (beach buggy)

9

liwenbest

OP

Jan 7, 2021

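@Keyes The company took on a public-opinion project that has to be developed. The development cycle is one year, and we have to build it ourselves.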

10

liwenbest

OP

Jan 7, 2021

@czfy It has to be developed in-house.

11

jr55475f112iz2tu

Jan 7, 2021

@liwenbest Building it yourselves.. all I can do is wish you luck.

12

smgui

Jan 7, 2021

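You could try this one. I've used it with kafka and rabbitmq as the queue to crawl some web novels:
https://github.com/Insutanto/scrapy-distributed
The source code of these frameworks is quite simple, so you could perfectly well build your own.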
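To give a sense of how small the core of such a wheel can be, here is a rough sketch of a crawl frontier: a queue of pending URLs plus a set of fingerprints for deduplication. It is not how scrapy-distributed or scrapy_redis actually implement it; the `Frontier` class and `fingerprint` helper are hypothetical names, and everything lives in one process. In a distributed setup the queue and the seen-set would sit in Kafka, RabbitMQ, or Redis instead.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def fingerprint(url: str) -> str:
    """Return a stable hash of a URL, ignoring case of scheme/host and the fragment."""
    parts = urlsplit(url.strip())
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class Frontier:
    """In-memory FIFO queue of URLs that silently drops already-seen ones."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url: str) -> bool:
        """Enqueue the URL; return False if an equivalent URL was pushed before."""
        fp = fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


if __name__ == "__main__":
    frontier = Frontier()
    for link in ["https://example.com/a", "https://EXAMPLE.com/a#top", "https://example.com/b"]:
        frontier.push(link)
    print(len(frontier))  # the second link duplicates the first, so two remain
```

The parts a real framework adds on top are mostly operational: persistence, per-site rate limiting, retries, and sharing this state between workers.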

13

liwenbest

OP

Jan 7, 2021

@smgui Thanks a lot, I'll take a look.

14

tisswb

Jan 12, 2021

I did a similar project a few years ago with the combination of scrapy_redis + elasticsearch, and it was basically sufficient.

15

liwenbest

OP

Jan 14, 2021

@tisswb I'm currently on scrapy_redis as well, but I use mongodb for storage.
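When many sites feed one mongodb collection, the same article tends to be crawled more than once, so writes are usually better done as upserts than plain inserts. Below is a sketch of one way to build the arguments; `build_upsert` and the item fields (`url`, `title`, `content`) are assumptions for illustration, and the resulting pair is meant to be passed to pymongo's `collection.update_one(filter, update, upsert=True)`.

```python
import hashlib
from datetime import datetime, timezone


def build_upsert(item: dict):
    """Return (filter, update) documents for an idempotent MongoDB upsert.

    The item fields used here (url, title, content) are assumed names.
    """
    # Derive a deterministic _id from the URL so re-crawls hit the same document.
    doc_id = hashlib.sha1(item["url"].encode("utf-8")).hexdigest()
    now = datetime.now(timezone.utc)
    filter_doc = {"_id": doc_id}
    update_doc = {
        # Fields refreshed on every crawl.
        "$set": {
            "url": item["url"],
            "title": item.get("title", ""),
            "content": item.get("content", ""),
            "updated_at": now,
        },
        # Field written only when the document is first created.
        "$setOnInsert": {"first_seen_at": now},
    }
    return filter_doc, update_doc


if __name__ == "__main__":
    flt, upd = build_upsert({"url": "https://example.com/news/1", "title": "demo"})
    print(flt["_id"], sorted(upd))
```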

16

tisswb

Jan 18, 2021

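@liwenbest The advantage of es is that its indexing, word segmentation, and statistics features are comprehensive, which saves a lot of feature development.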
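As a rough illustration of that point, the sketch below builds an index mapping and a search body as plain Python dicts: a full-text match on a keyword combined with a per-day count and a per-source count. The field names (`content`, `source`, `published_at`) and helper names are assumptions, and Chinese text would normally also need an analyzer suited to Chinese configured on the `content` field, which is left out here.

```python
import json


def build_mapping() -> dict:
    """Index mapping: analyzed text for search, keyword and date fields for statistics."""
    return {
        "mappings": {
            "properties": {
                "content": {"type": "text"},       # tokenized for full-text search
                "source": {"type": "keyword"},     # exact value, usable in aggregations
                "published_at": {"type": "date"},  # enables date histograms
            }
        }
    }


def build_keyword_query(keyword: str) -> dict:
    """Search body: match a keyword, then count hits per day and per source site."""
    return {
        "size": 0,  # return only the aggregation results, not the documents
        "query": {"match": {"content": keyword}},
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "published_at", "calendar_interval": "day"}
            },
            "per_source": {"terms": {"field": "source", "size": 10}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_keyword_query("recall"), indent=2))
```

Getting the same daily and per-site counts out of a plain document store would mean writing the tokenization and counting logic by hand, which is the development effort the reply refers to.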
