generic.spiders.feed

Module Contents

Classes

FeedEntry

A class for internal use.

Feed

A class for internal use.

FeedConfig

The YAML configuration file format.

FeedSpiderConfig

Configuration for FeedSpider.

FeedSpider

A spider that generates Atom/RSS feeds. The spider crawls the URLs listed in a configuration file, scrapes links, and generates a feed for each page.

API

class generic.spiders.feed.FeedEntry(/, **data: Any)

Bases: pydantic.BaseModel

A class for internal use.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

id: str = None

title: str = None

class generic.spiders.feed.Feed(/, **data: Any)

Bases: pydantic.BaseModel

A class for internal use.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

id: str = None

The URL of the page.

lang: str = None

The language of the page.

type: str = None

The type of the feed. Either “atom” or “rss”.

title: str = None

The title of the page.

class generic.spiders.feed.FeedConfig(/, **data: Any)

Bases: pydantic.BaseModel

The YAML configuration file format.

---
feed_config:
  "http://foo.example.org/latest.html":
    file_name: "latest.xml"
    feed_type: "atom"
    xpath_href: "//li[@class='articles-list__item']/a/@href"
    xpath_title: "//li[@class='articles-list__item']/a/text()"

The top-level key must be feed_config.

feed_config is a dictionary. Each key is the URL of a page to generate a feed for. Each value is a dictionary with the keys file_name, feed_type, xpath_href, and xpath_title.
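The structure described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: FeedConfigSketch and parse_feed_config are hypothetical names, a dataclass stands in for the pydantic model, the raw dictionary is what a YAML loader would be expected to return for the example file, and the check on feed_type is an assumption based on the documented values.

```python
from dataclasses import dataclass


@dataclass
class FeedConfigSketch:
    """Hypothetical stand-in for FeedConfig (the real class is a pydantic model)."""

    file_name: str
    xpath_href: str
    xpath_title: str
    feed_type: str = "atom"

    def __post_init__(self) -> None:
        # Assumption: only the two documented feed types are accepted.
        if self.feed_type not in ("atom", "rss"):
            raise ValueError(f"feed_type must be 'atom' or 'rss', got {self.feed_type!r}")


def parse_feed_config(raw: dict) -> dict[str, FeedConfigSketch]:
    """Turn an already-loaded YAML mapping into one entry per page URL."""
    if "feed_config" not in raw:
        raise KeyError("the top-level key must be feed_config")
    return {url: FeedConfigSketch(**fields) for url, fields in raw["feed_config"].items()}


# The mapping a YAML loader would be expected to produce for the example file.
raw = {
    "feed_config": {
        "http://foo.example.org/latest.html": {
            "file_name": "latest.xml",
            "feed_type": "atom",
            "xpath_href": "//li[@class='articles-list__item']/a/@href",
            "xpath_title": "//li[@class='articles-list__item']/a/text()",
        }
    }
}
configs = parse_feed_config(raw)
```

Keying the result by page URL mirrors the documented type of FeedSpiderConfig.feed_config, dict[str, FeedConfig].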

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

file_name: str = None

The name of the generated feed file. This must be unique per URL. The file is overwritten when a feed is generated.
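Because a feed file is overwritten on generation, two URLs sharing one file_name would clobber each other. The following sketch shows one way to detect that before a crawl. It is not part of the module: duplicate_file_names is a hypothetical helper, and the news.html and blog.html URLs are made up for illustration.

```python
from collections import Counter


def duplicate_file_names(feed_config: dict[str, dict]) -> list[str]:
    """Return every file_name that is used by more than one page URL."""
    counts = Counter(entry["file_name"] for entry in feed_config.values())
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical configuration in which two pages share the same output file.
feed_config = {
    "http://foo.example.org/latest.html": {"file_name": "latest.xml"},
    "http://foo.example.org/news.html": {"file_name": "latest.xml"},
    "http://foo.example.org/blog.html": {"file_name": "blog.xml"},
}
duplicates = duplicate_file_names(feed_config)
```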

xpath_href: str = None

The XPath expression that selects the href attribute of each link.

xpath_title: str = None

The XPath expression that selects the feed title.
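To illustrate what the two example expressions select, here is a rough standard-library sketch. xml.etree.ElementTree supports only a subset of XPath, so the trailing /@href and /text() steps are replaced by .get("href") and .text. The page markup is invented, and pairing the two result lists with zip is an assumption about how links and titles line up, not a description of what parse() does.

```python
import xml.etree.ElementTree as ET

# Invented page fragment that matches the XPath expressions in the example config.
page = """<ul>
  <li class="articles-list__item"><a href="/a.html">First article</a></li>
  <li class="articles-list__item"><a href="/b.html">Second article</a></li>
  <li class="other"><a href="/c.html">Not an article</a></li>
</ul>"""

root = ET.fromstring(page)

# Equivalent of //li[@class='articles-list__item']/a in ElementTree's XPath subset.
links = root.findall(".//li[@class='articles-list__item']/a")

hrefs = [a.get("href") for a in links]  # stands in for .../a/@href
titles = [a.text for a in links]        # stands in for .../a/text()

# Assumption: the n-th href belongs to the n-th title.
entries = list(zip(hrefs, titles))
```

Both expressions share the same li/a prefix in the example configuration, which is what keeps the two lists aligned.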

feed_type: str = 'atom'

The type of the feed. Either atom or rss.

class generic.spiders.feed.FeedSpiderConfig(/, **data: Any)

Bases: generic.spiders.base.GenericSpiderConfig

Configuration for FeedSpider.

Unlike other spiders, this spider does not accept urls. Instead, it requires config, for example -a config=/path/to/config.yml. The configuration file defines the feeds to generate.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

feed_config: dict[str, generic.spiders.feed.FeedConfig] = None

The feed definitions, keyed by page URL.

config: pathlib.Path = None

The path to the configuration file.

class generic.spiders.feed.FeedSpider(*args, **kwargs)

Bases: generic.spiders.base.GenericSpider[generic.spiders.feed.FeedSpiderConfig]

A spider that generates Atom/RSS feeds. The spider crawls the URLs listed in a configuration file, scrapes links, and generates a feed for each page.

The spider has custom_settings. In ITEM_PIPELINES, FeedStoragePipeline is set at 900. The pipeline stores the generated feeds.

Args:
config (str): The path to the configuration file. The default is feed.yml.
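The custom_settings described above would look roughly like the following. This is a sketch under one loud assumption: the dotted path path.to.FeedStoragePipeline is a placeholder, since the module that defines FeedStoragePipeline is not stated on this page.

```python
# Placeholder dotted path: replace it with the real import path of FeedStoragePipeline.
PIPELINE_PATH = "path.to.FeedStoragePipeline"

# Scrapy runs item pipelines in ascending order of these integer values,
# so 900 places the storage step late in the chain.
custom_settings = {
    "ITEM_PIPELINES": {
        PIPELINE_PATH: 900,
    },
}
```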

Initialization

name = 'feed'

custom_settings = None

classmethod get_config_class() -> Type[generic.spiders.feed.FeedSpiderConfig]

Returns the config class for this spider.

_load_config(path: pathlib.Path)

async start()

parse(response: scrapy.http.Response)

_generate_feed(url: str, feed: generic.spiders.feed.Feed, feed_entries: list[generic.spiders.feed.FeedEntry], file_name: str) -> generic.items.FeedItem
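The body of _generate_feed is not documented here. As a rough idea of how a Feed and its FeedEntry list could map onto Atom XML, the following standard-library sketch builds a bare-bones document. FeedSketch, EntrySketch, and build_atom are hypothetical stand-ins, the sample URLs and titles are invented, treating an entry id as its link URL is an assumption, and elements that a complete Atom feed needs (such as updated) are left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

ATOM_NS = "http://www.w3.org/2005/Atom"
XML_NS = "http://www.w3.org/XML/1998/namespace"


@dataclass
class EntrySketch:
    """Hypothetical stand-in for FeedEntry."""

    id: str
    title: str


@dataclass
class FeedSketch:
    """Hypothetical stand-in for Feed."""

    id: str
    title: str
    lang: str = "en"
    type: str = "atom"


def build_atom(feed: FeedSketch, entries: list[EntrySketch]) -> str:
    """Serialize a feed and its entries as a minimal Atom document."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    root = ET.Element(f"{{{ATOM_NS}}}feed")
    root.set(f"{{{XML_NS}}}lang", feed.lang)
    ET.SubElement(root, f"{{{ATOM_NS}}}id").text = feed.id
    ET.SubElement(root, f"{{{ATOM_NS}}}title").text = feed.title
    for entry in entries:
        node = ET.SubElement(root, f"{{{ATOM_NS}}}entry")
        ET.SubElement(node, f"{{{ATOM_NS}}}id").text = entry.id
        ET.SubElement(node, f"{{{ATOM_NS}}}title").text = entry.title
        # Assumption: the entry id doubles as the link target.
        ET.SubElement(node, f"{{{ATOM_NS}}}link", href=entry.id)
    return ET.tostring(root, encoding="unicode")


feed = FeedSketch(id="http://foo.example.org/latest.html", title="Latest articles")
entries = [
    EntrySketch(id="http://foo.example.org/a.html", title="First article"),
    EntrySketch(id="http://foo.example.org/b.html", title="Second article"),
]
xml_text = build_atom(feed, entries)
```

In the spider itself the result is returned as a generic.items.FeedItem together with file_name, which FeedStoragePipeline then stores.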