generic.pipelines#

Module Contents#

Classes#

GenericPipeline

DropMissingTextPipeline

Drops items without text.

FeedStoragePipeline

Save FeedItem on local disk.

FileItemPipeline

Process FileItem. This pipeline should be placed before FileItemStoragePipeline.

FileItemStoragePipeline

Save FileItem on local disk. This pipeline should be at the end of ITEM_PIPELINES.

SpacyTokenizePipeline

CleanSentencesPipeline

API#

class generic.pipelines.GenericPipeline#
process_item(item, spider)#
class generic.pipelines.DropMissingTextPipeline#

Drops items without text.

process_item(item)#
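
A minimal sketch of how a pipeline of this kind might drop items, assuming the text is stored under a `text` key (the field name and the class name `SketchDropMissingTextPipeline` are hypothetical, not taken from this module). Raising `scrapy.exceptions.DropItem` is the standard Scrapy way to discard an item; the fallback class only lets the sketch run where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Fallback so the sketch runs without Scrapy installed."""


class SketchDropMissingTextPipeline:
    """Hypothetical stand-in, not the actual DropMissingTextPipeline."""

    def process_item(self, item, spider=None):
        # Assumption: the item is dict-like and keeps its text under "text".
        text = item.get("text")
        if not text or not text.strip():
            # Scrapy stops processing this item in later pipelines.
            raise DropItem("Missing text")
        return item
```

Items with a non-empty `text` pass through unchanged; empty or whitespace-only text raises `DropItem`.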
class generic.pipelines.FeedStoragePipeline#

Save FeedItem on local disk.

process_item(item)#
class generic.pipelines.FileItemPipeline#

Process FileItem. This pipeline should be placed before FileItemStoragePipeline.

This pipeline expects the FileItem to have a filename with a proper file extension.

The purpose of the pipeline is:

  • Generate a unique, hashed file name.

  • Process FileItems if necessary, e.g., by adding context or metadata to the FileItem.

process_item(item: generic.items.FileItem, spider: scrapy.Spider) → generic.items.FileItem#

Process FileItem.

  1. Call a specific method to process the FileItem.

  2. Generate a unique, hashed file name.

  3. Create a new FileItem with the generated file name.
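
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: `SketchFileItem` is a hypothetical stand-in for `generic.items.FileItem`, the dispatch is assumed to key on the file extension, and the hashed name is assumed to be a SHA-256 digest of the file body with the original extension kept.

```python
import hashlib
import os
from dataclasses import dataclass, field, replace


@dataclass
class SketchFileItem:
    """Hypothetical stand-in for generic.items.FileItem."""
    filename: str
    body: bytes
    metadata: dict = field(default_factory=dict)


class SketchFileItemPipeline:
    """Hypothetical stand-in, not the actual FileItemPipeline."""

    def process_item(self, item, spider=None):
        ext = os.path.splitext(item.filename)[1].lower()
        # Step 1: call a type-specific method when one exists (assumed: by extension).
        if ext == ".pdf":
            item = self.process_pdf_item(item, spider)
        # Step 2: derive the name from the content (assumed: SHA-256 of the body).
        digest = hashlib.sha256(item.body).hexdigest()
        # Step 3: return a new item carrying the generated name.
        return replace(item, filename=f"{digest}{ext}")

    def process_pdf_item(self, item, spider=None):
        # Placeholder metadata; the real fields are not documented here.
        extra = {"original_filename": item.filename}
        return replace(item, metadata={**item.metadata, **extra})


item = SketchFileItem(filename="report.pdf", body=b"example bytes")
renamed = SketchFileItemPipeline().process_item(item)
```

Hashing the content means two files with identical bytes get the same name, which deduplicates storage; if names must differ per source, the URL could be mixed into the hash instead.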

process_pdf_item(item: generic.items.FileItem, spider: scrapy.Spider) → generic.items.FileItem#

Process PDF FileItem.

  1. Add metadata to the PDF.

class generic.pipelines.FileItemStoragePipeline#

Save FileItem on local disk. This pipeline should be at the end of ITEM_PIPELINES.

process_item(item, spider)#
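
The ordering requirements for FileItemPipeline and FileItemStoragePipeline can be expressed in a project's `settings.py`. Scrapy runs pipelines in ascending order of the integer assigned in `ITEM_PIPELINES` (conventionally 0 to 1000), so the storage pipeline gets the largest value. The specific numbers below are illustrative assumptions, not values taken from this project.

```python
# settings.py (sketch): lower numbers run first.
ITEM_PIPELINES = {
    "generic.pipelines.FileItemPipeline": 300,         # must run before storage
    "generic.pipelines.FileItemStoragePipeline": 900,  # last: writes to local disk
}
```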
class generic.pipelines.SpacyTokenizePipeline(spacy_url)#

Initialization

classmethod from_crawler(crawler)#
async process_item(item, spider)#
async close_spider(spider)#
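
A sketch of the usual Scrapy pattern behind this interface, in which `from_crawler` reads a setting and passes it to the constructor. The setting name `SPACY_URL`, the class name, and the method bodies are assumptions for illustration; a plain dict stands in for `crawler.settings` so the example runs without Scrapy.

```python
import asyncio
from types import SimpleNamespace


class SketchSpacyTokenizePipeline:
    """Hypothetical stand-in, not the actual SpacyTokenizePipeline."""

    def __init__(self, spacy_url):
        self.spacy_url = spacy_url

    @classmethod
    def from_crawler(cls, crawler):
        # "SPACY_URL" is an assumed setting name.
        return cls(spacy_url=crawler.settings.get("SPACY_URL"))

    async def process_item(self, item, spider):
        # A real pipeline would presumably send the item's text to the
        # service at self.spacy_url here; this sketch passes it through.
        return item

    async def close_spider(self, spider):
        # Place to release any client or session opened for the service.
        return None


# Stand-in crawler: only the .settings.get(...) lookup is exercised.
fake_crawler = SimpleNamespace(settings={"SPACY_URL": "http://localhost:8080"})
pipeline = SketchSpacyTokenizePipeline.from_crawler(fake_crawler)
result = asyncio.run(pipeline.process_item({"text": "Hello world."}, spider=None))
```

Reading the URL in `from_crawler` keeps the endpoint configurable per deployment rather than hard-coded in the pipeline.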
class generic.pipelines.CleanSentencesPipeline#
async process_item(item, spider)#