Building a distributed Scrapy crawler with Redis

Install Redis

Download

wget http://download.redis.io/releases/redis-stable.tar.gz

Build

tar -zxvf redis-stable.tar.gz
cd redis-stable
make

Run

./src/redis-server
./src/redis-cli
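
A quick way to verify the server is up is to run ping from redis-cli; it should answer PONG:

127.0.0.1:6379> ping
PONG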

Install scrapy-redis

pip install scrapy-redis

Create the distributed crawler

(screenshot)

Create the project

scrapy startproject distributedspider
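
Once the two spider files below are added under spiders/, the project layout should look roughly like this (everything except the two spider files is the standard scaffold generated by scrapy startproject):

distributedspider/
    scrapy.cfg
    distributedspider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            mycrawler_redis.py
            myspider_redis.py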

Create the Redis crawler (mycrawler_redis.py)

import redis
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider

class MyCrawler(RedisCrawlSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'
    start_urls = []

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(MyCrawler, self).__init__(*args, **kwargs)
        self.start_urls.append('http://joke.4399pk.com/funnyimg/find-cate-2.html')

        # Seed the other spider's queue: push the paginated list URLs
        # onto myspider:start_urls so that myspider_redis can consume them.
        r = redis.Redis()
        for pageNum in range(1, 20, 1):
            pageUrl = 'http://joke.4399pk.com/funnyimg/find-cate-2-p-' + str(pageNum) + '.html'
            start_urls_len = r.lpush("myspider:start_urls", pageUrl)
            print('start_urls_len:' + str(start_urls_len))
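
Since the constructor accepts an optional domain keyword argument, allowed_domains can be restricted from the command line with Scrapy's -a flag, for example using the site crawled above:

scrapy crawl mycrawler_redis -a domain=joke.4399pk.com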

Create the Redis spider (myspider_redis.py)

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'myspider_redis'
    redis_key = 'myspider:start_urls'

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        print('spider_____________')
        print(response.url)
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
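
With scrapy_redis.pipelines.RedisPipeline enabled (see the settings below), every dict returned from parse() is also serialized into a Redis list; by default the key is <spider name>:items, so scraped items can be inspected from redis-cli, e.g.:

127.0.0.1:6379> lrange myspider_redis:items 0 0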

Modify the settings (settings.py)

Configure the Redis address here; in a multi-machine deployment, this is the address every node reads the shared queue from.

# Use scrapy-redis' duplicate filter and scheduler so all nodes share one request queue.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

ITEM_PIPELINES = {
    'distributedspider.pipelines.DistributedspiderPipeline': 300,
    # Store scraped items in Redis as well.
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
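
The DistributedspiderPipeline entry above refers to the pipeline class that scrapy startproject scaffolds in distributedspider/pipelines.py; a minimal pass-through version is enough here:

class DistributedspiderPipeline(object):
    def process_item(self, item, spider):
        # Pass the item through unchanged; RedisPipeline (priority 400) stores it in Redis next.
        return item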

Run

Start Redis

(screenshot: redis-server startup output)

Run spider1

(screenshot: spider1 output)

Run spider2

(screenshot: spider2 output)
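
The two screenshots presumably show the same Redis spider started in two separate terminals (or on two machines); in terms of commands this is simply:

scrapy crawl myspider_redis    # run once per terminal/machine

Both instances sit idle until URLs appear in the myspider:start_urls queue, which is what the next step does.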

Add start_urls

Method 1 (add manually):
(screenshot: redis-cli)
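
Manually adding a start URL means pushing it onto the spider's redis_key (myspider:start_urls) from redis-cli, for example with the list page used above:

127.0.0.1:6379> lpush myspider:start_urls http://joke.4399pk.com/funnyimg/find-cate-2.html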

Method 2 (add via script):

scrapy crawl mycrawler_redis
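
Running mycrawler_redis executes the redis.lpush loop shown in mycrawler_redis.py above, which pushes the 19 paginated list URLs onto myspider:start_urls; the queue length can be confirmed from redis-cli:

127.0.0.1:6379> llen myspider:start_urls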

Conclusion

You can see that spider1 and spider2 are processing requests in parallel.

(screenshots: spider1 and spider2 console output, handling requests in parallel)


Source: http://leunggeorge.github.io/
