Python自定义scrapy中间模块避免重复采集的方法

yipeiwu_com5年前Python基础

本文实例讲述了Python自定义scrapy中间模块避免重复采集的方法。分享给大家供大家参考。具体如下:

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from myproject.items import MyItem
class IgnoreVisitedItems(object):
  """Middleware to ignore re-visiting item pages if they
  were already visited before. 
  The requests to be filtered by have a meta['filter_visited']
  flag enabled and optionally define an id to use 
  for identifying them, which defaults the request fingerprint,
  although you'd want to use the item id,
  if you already have it beforehand to make it more robust.
  """
  FILTER_VISITED = 'filter_visited'
  VISITED_ID = 'visited_id'
  CONTEXT_KEY = 'visited_ids'
  def process_spider_output(self, response, result, spider):
    context = getattr(spider, 'context', {})
    visited_ids = context.setdefault(self.CONTEXT_KEY, {})
    ret = []
    for x in result:
      visited = False
      if isinstance(x, Request):
        if self.FILTER_VISITED in x.meta:
          visit_id = self._visited_id(x)
          if visit_id in visited_ids:
            log.msg("Ignoring already visited: %s" % x.url,
                level=log.INFO, spider=spider)
            visited = True
      elif isinstance(x, BaseItem):
        visit_id = self._visited_id(response.request)
        if visit_id:
          visited_ids[visit_id] = True
          x['visit_id'] = visit_id
          x['visit_status'] = 'new'
      if visited:
        ret.append(MyItem(visit_id=visit_id, visit_status='old'))
      else:
        ret.append(x)
    return ret
  def _visited_id(self, request):
    return request.meta.get(self.VISITED_ID) or request_fingerprint(request)

希望本文所述对大家的Python程序设计有所帮助。

相关文章

Python collections模块使用方法详解

Python collections模块使用方法详解

一、collections模块 1.函数namedtuple (1)作用:tuple类型,是一个可命名的tuple (2)格式:collections(列表名称,列表) (3)̴...

Python下的Mysql模块MySQLdb安装详解

默认情况下,MySQLdb包是没有安装的,不信? 看到类似下面的代码你就信了。复制代码 代码如下: -bash-3.2# /usr/local/python2.7.3/bin/pytho...

Python中使用gzip模块压缩文件的简单教程

压缩数据创建gzip文件 先看一个略麻烦的做法   import StringIO,gzip content = 'Life is short.I use python'...

python 一个figure上显示多个图像的实例

python 一个figure上显示多个图像的实例

方法一:主要是inshow()函数的使用 首先基本的画图流程为: import matplotlib.pyplot as plt #创建新的figure fig = plt.f...

如何用itertools解决无序排列组合的问题

最近我作为Python菜鸟一枚开始征战Codewars,所以打算在这里记下遇到的有意思的题目。今天这第一题叫做“Best Travel”: John和Mary计划去一些小镇旅行。Mary...