Python自定义scrapy中间模块避免重复采集的方法

yipeiwu_com6年前Python基础

本文实例讲述了Python自定义scrapy中间模块避免重复采集的方法。分享给大家供大家参考。具体如下:

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from myproject.items import MyItem
class IgnoreVisitedItems(object):
  """Middleware to ignore re-visiting item pages if they
  were already visited before. 
  The requests to be filtered by have a meta['filter_visited']
  flag enabled and optionally define an id to use 
  for identifying them, which defaults the request fingerprint,
  although you'd want to use the item id,
  if you already have it beforehand to make it more robust.
  """
  FILTER_VISITED = 'filter_visited'
  VISITED_ID = 'visited_id'
  CONTEXT_KEY = 'visited_ids'
  def process_spider_output(self, response, result, spider):
    context = getattr(spider, 'context', {})
    visited_ids = context.setdefault(self.CONTEXT_KEY, {})
    ret = []
    for x in result:
      visited = False
      if isinstance(x, Request):
        if self.FILTER_VISITED in x.meta:
          visit_id = self._visited_id(x)
          if visit_id in visited_ids:
            log.msg("Ignoring already visited: %s" % x.url,
                level=log.INFO, spider=spider)
            visited = True
      elif isinstance(x, BaseItem):
        visit_id = self._visited_id(response.request)
        if visit_id:
          visited_ids[visit_id] = True
          x['visit_id'] = visit_id
          x['visit_status'] = 'new'
      if visited:
        ret.append(MyItem(visit_id=visit_id, visit_status='old'))
      else:
        ret.append(x)
    return ret
  def _visited_id(self, request):
    return request.meta.get(self.VISITED_ID) or request_fingerprint(request)

希望本文所述对大家的Python程序设计有所帮助。

相关文章

利用python如何处理百万条数据(适用java新手)

利用python如何处理百万条数据(适用java新手)

1、前言 因为负责基础服务,经常需要处理一些数据,但是大多时候采用awk以及java程序即可,但是这次突然有百万级数据需要处理,通过awk无法进行匹配,然后我又采用java来处理,文件...

Python3.5以上版本lxml导入etree报错的解决方案

Python3.5以上版本lxml导入etree报错的解决方案

在python中安装了lxml-4.2.1,在使用时发现导入etree时IDE中报错Unresolved reference 其实发现,不影响使用,可以正常运行,对于我这种要刨根问底的...

使用python list 查找所有匹配元素的位置实例

如下所示: import re word = "test" s = "test abcdas test 1234 testcase testsuite" w = [m.start...

pyqt弹出新对话框,以及关闭对话框获取数据的实例

如下所示: from PyQt4 import QtGui,QtCore import sys class Web_Browser(QtGui.QDialog): def __i...

python使用win32com库播放mp3文件的方法

本文实例讲述了python使用win32com库播放mp3文件的方法。分享给大家供大家参考。具体实现方法如下: # Python supports COM, if you have...