Python自定义scrapy中间模块避免重复采集的方法

yipeiwu_com6年前Python基础

本文实例讲述了Python自定义scrapy中间模块避免重复采集的方法。分享给大家供大家参考。具体如下:

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from myproject.items import MyItem
class IgnoreVisitedItems(object):
  """Middleware to ignore re-visiting item pages if they
  were already visited before. 
  The requests to be filtered by have a meta['filter_visited']
  flag enabled and optionally define an id to use 
  for identifying them, which defaults the request fingerprint,
  although you'd want to use the item id,
  if you already have it beforehand to make it more robust.
  """
  FILTER_VISITED = 'filter_visited'
  VISITED_ID = 'visited_id'
  CONTEXT_KEY = 'visited_ids'
  def process_spider_output(self, response, result, spider):
    context = getattr(spider, 'context', {})
    visited_ids = context.setdefault(self.CONTEXT_KEY, {})
    ret = []
    for x in result:
      visited = False
      if isinstance(x, Request):
        if self.FILTER_VISITED in x.meta:
          visit_id = self._visited_id(x)
          if visit_id in visited_ids:
            log.msg("Ignoring already visited: %s" % x.url,
                level=log.INFO, spider=spider)
            visited = True
      elif isinstance(x, BaseItem):
        visit_id = self._visited_id(response.request)
        if visit_id:
          visited_ids[visit_id] = True
          x['visit_id'] = visit_id
          x['visit_status'] = 'new'
      if visited:
        ret.append(MyItem(visit_id=visit_id, visit_status='old'))
      else:
        ret.append(x)
    return ret
  def _visited_id(self, request):
    return request.meta.get(self.VISITED_ID) or request_fingerprint(request)

希望本文所述对大家的Python程序设计有所帮助。

相关文章

python标准算法实现数组全排列的方法

本文实例讲述了python标准算法实现数组全排列的方法,代码来自国外网站。分享给大家供大家参考。具体分析如下: 从n个不同元素中任取m(m≤n)个元素,按照一定的顺序排列起来,叫做从n个...

scikit-learn线性回归,多元回归,多项式回归的实现

scikit-learn线性回归,多元回归,多项式回归的实现

匹萨的直径与价格的数据 %matplotlib inline import matplotlib.pyplot as plt def runplt(): plt.figure()...

win8.1安装Python 2.7版环境图文详解

win8.1安装Python 2.7版环境图文详解

Python相信大家都有所耳闻,特别是Python进入山东省小学教材,还列入全国计算机等级考试。 打算爬网易云音乐评论的我,首先要安装一个Python环境。 目前Python有2.x版和...

Python Queue模块详解

Python中,队列是线程间最常用的交换数据的形式。Queue模块是提供队列操作的模块,虽然简单易用,但是不小心的话,还是会出现一些意外。 创建一个“队列”对象 import Queue...

Python 循环终止语句的三种方法小结

在Python循环终止语句有三种: 1、break break用于退出本层循环 示例如下: while True: print "123" break print "45...