`generic.spiders.feed`#

Module Contents#

Classes#

`FeedEntry`	A class for internal use.
`Feed`	A class for internal use.
`FeedConfig`	The YAML configuration file format.
`FeedSpiderConfig`	Configuration for FeedSpider.
`FeedSpider`	A spider that generates Atom/RSS feeds. The spider crawls the URLs listed in a configuration file, scrapes links, and generates a feed for each page.

API#

class generic.spiders.feed.FeedEntry(/, **data: Any)#

Bases: pydantic.BaseModel

A class for internal use.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

id: str = None#

title: str = None#

link: str = None#

class generic.spiders.feed.Feed(/, **data: Any)#

Bases: pydantic.BaseModel

A class for internal use.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

id: str = None#: The URL of the page.

lang: str = None#: The language of the page.

type: str = None#: The type of the feed. Either “atom” or “rss”.

title: str = None#: The title of the page.

class generic.spiders.feed.FeedConfig(/, **data: Any)#

Bases: pydantic.BaseModel

The YAML configuration file format.

---
feed_config:
  "http://foo.example.org/latest.html":
    file_name: "latest.xml"
    feed_type: "atom"
    xpath_href: "//li[@class='articles-list__item']/a/@href"
    xpath_title: "//li[@class='articles-list__item']/a/text()"

The top-level key must be feed_config.

feed_config is a dictionary. The key is the page URL for the feed. The value is a dictionary with file_name, feed_type, xpath_href, and xpath_title.
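To illustrate what the two XPath expressions in the sample select, here is a minimal sketch using only the standard library. The page fragment and the pairing of links with titles are assumptions made for illustration; the spider itself works on Scrapy responses, and its actual pairing logic is not shown in this reference. `xml.etree.ElementTree` does not support `/@href` or `/text()` steps, so the sketch selects the `<a>` elements and reads the attribute and text from them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (real pages are HTML parsed by Scrapy).
PAGE = """
<ul>
  <li class="articles-list__item"><a href="/a/1.html">First article</a></li>
  <li class="articles-list__item"><a href="/a/2.html">Second article</a></li>
  <li class="other"><a href="/about.html">About</a></li>
</ul>
"""

root = ET.fromstring(PAGE)

# Same element selection as the sample xpath_href / xpath_title,
# minus the final /@href and /text() steps.
anchors = root.findall(".//li[@class='articles-list__item']/a")

hrefs = [a.get("href") for a in anchors]   # what xpath_href would yield
titles = [a.text for a in anchors]         # what xpath_title would yield

# Assumption: links and titles are paired by position.
entries = list(zip(hrefs, titles))
print(entries)
```

The third list item is skipped because its class does not match the predicate.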

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

file_name: str = None#: The name of the generated feed file. This must be unique per URL. The file is overwritten when a feed is generated.

xpath_href: str = None#: The XPath expression that selects the href attribute of each link.

xpath_title: str = None#: The XPath expression that selects the feed title.

feed_type: str = 'atom'#: Type of the feed. Either atom or rss.
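The constraints described for these fields can be sketched as a small validator. This is an illustration only: it uses standard-library dataclasses rather than pydantic, the names `FeedConfigSketch` and `load_feed_config` are hypothetical, the dictionary stands in for an already-parsed YAML file, and whether the real model enforces these rules at load time is not stated in this reference.

```python
from dataclasses import dataclass

ALLOWED_FEED_TYPES = {"atom", "rss"}


@dataclass
class FeedConfigSketch:
    """Hypothetical stand-in for FeedConfig (not the real pydantic model)."""
    file_name: str
    xpath_href: str
    xpath_title: str
    feed_type: str = "atom"


def load_feed_config(raw: dict) -> dict[str, FeedConfigSketch]:
    """Check the documented rules on a parsed configuration mapping."""
    if "feed_config" not in raw:
        raise ValueError("the top-level key must be 'feed_config'")

    feeds = {url: FeedConfigSketch(**fields)
             for url, fields in raw["feed_config"].items()}

    seen: dict[str, str] = {}  # file_name -> URL that claimed it first
    for url, cfg in feeds.items():
        if cfg.feed_type not in ALLOWED_FEED_TYPES:
            raise ValueError(f"{url}: feed_type must be 'atom' or 'rss'")
        if cfg.file_name in seen:
            raise ValueError(
                f"{url}: file_name {cfg.file_name!r} is already used by {seen[cfg.file_name]}"
            )
        seen[cfg.file_name] = url
    return feeds


# Mirrors the sample YAML shown for this class, without feed_type
# so that the default applies.
raw = {
    "feed_config": {
        "http://foo.example.org/latest.html": {
            "file_name": "latest.xml",
            "xpath_href": "//li[@class='articles-list__item']/a/@href",
            "xpath_title": "//li[@class='articles-list__item']/a/text()",
        }
    }
}
feeds = load_feed_config(raw)
```

Checking `file_name` uniqueness up front matters because the file is overwritten on each run, so two URLs sharing a name would silently clobber each other's output.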

class generic.spiders.feed.FeedSpiderConfig(/, **data: Any)#

Bases: generic.spiders.base.GenericSpiderConfig

Configuration for FeedSpider.

Unlike other spiders, this spider does not accept urls. Instead, it requires config, e.g., -a config=/path/to/config.yml. The configuration file defines the feeds to generate.
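As a rough sketch, the spider argument can be assembled into a `scrapy crawl` command line like this. The helper name `build_crawl_command` is hypothetical, and the sketch assumes the project is run through the standard Scrapy command-line tool with the spider name `feed`.

```python
import shlex
from pathlib import PurePosixPath


def build_crawl_command(config: PurePosixPath) -> list[str]:
    """Build the argv for running the feed spider (hypothetical helper)."""
    # -a passes a spider argument; here it is the path to the YAML file.
    return ["scrapy", "crawl", "feed", "-a", f"config={config}"]


cmd = build_crawl_command(PurePosixPath("/path/to/config.yml"))
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from the project directory.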

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

feed_config: dict[str, generic.spiders.feed.FeedConfig] = None#: A mapping from page URL to the FeedConfig for that page.

config: pathlib.Path = None#: The path to the configuration file.

class generic.spiders.feed.FeedSpider(*args, **kwargs)#

Bases: generic.spiders.base.GenericSpider[generic.spiders.feed.FeedSpiderConfig]

A spider that generates Atom/RSS feeds. The spider crawls the URLs listed in a configuration file, scrapes links, and generates a feed for each page.

The spider has custom_settings. In ITEM_PIPELINES, FeedStoragePipeline is set at 900. The pipeline stores the generated feeds.
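The shape of such a setting might look like the sketch below. The import path `generic.pipelines.FeedStoragePipeline` and the second pipeline are assumptions for illustration; only the pipeline's class name and its order value of 900 come from this page. In Scrapy, `ITEM_PIPELINES` maps a class path to an integer order, and pipelines with lower values run first.

```python
# Sketch of the spider-level settings (import path is an assumption).
custom_settings = {
    "ITEM_PIPELINES": {
        "generic.pipelines.FeedStoragePipeline": 900,
    }
}

# Hypothetical extra pipeline, added only to show how the order value works.
pipelines = {
    **custom_settings["ITEM_PIPELINES"],
    "generic.pipelines.HypotheticalCleanupPipeline": 300,
}

# Lower order values run first, so a pipeline at 900 runs late.
run_order = sorted(pipelines, key=pipelines.get)
print(run_order)
```

An order of 900 places the storage step near the end, after any pipelines with lower values have processed the item.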

Args:

config (str): The path to the configuration file. The default is feed.yml.

Initialization

name = 'feed'#

custom_settings = None#

classmethod get_config_class() → Type[generic.spiders.feed.FeedSpiderConfig]#: Returns the config class for this spider.

_load_config(path: pathlib.Path)#

async start()#

parse(response: scrapy.http.Response)#

_generate_feed(url: str, feed: generic.spiders.feed.Feed, feed_entries: list[generic.spiders.feed.FeedEntry], file_name: str) → generic.items.FeedItem#
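The body of `_generate_feed` is not documented here. Purely as an illustration of the kind of work its signature suggests, the following sketch renders an Atom document from feed metadata and entries using the standard library. `FeedSketch`, `EntrySketch`, and `render_atom` are hypothetical names, the sketch returns an XML string rather than a `FeedItem`, and it should not be read as the spider's actual implementation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"


@dataclass
class EntrySketch:
    """Hypothetical stand-in for FeedEntry."""
    id: str
    title: str
    link: str


@dataclass
class FeedSketch:
    """Hypothetical stand-in for Feed (lang and type omitted for brevity)."""
    id: str
    title: str


def _child(parent: ET.Element, tag: str, text: str = "") -> ET.Element:
    """Append a namespaced Atom child element with optional text."""
    el = ET.SubElement(parent, f"{{{ATOM_NS}}}{tag}")
    if text:
        el.text = text
    return el


def render_atom(feed: FeedSketch, entries: list[EntrySketch], updated: datetime) -> str:
    """Render a minimal Atom document as a string."""
    ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace
    stamp = updated.isoformat()

    root = ET.Element(f"{{{ATOM_NS}}}feed")
    _child(root, "id", feed.id)
    _child(root, "title", feed.title)
    _child(root, "updated", stamp)

    for entry in entries:
        node = _child(root, "entry")
        _child(node, "id", entry.id)
        _child(node, "title", entry.title)
        _child(node, "updated", stamp)
        _child(node, "link").set("href", entry.link)

    return ET.tostring(root, encoding="unicode")


xml_text = render_atom(
    FeedSketch(id="http://foo.example.org/latest.html", title="Latest"),
    [
        EntrySketch("http://foo.example.org/a/1.html", "First article",
                    "http://foo.example.org/a/1.html"),
        EntrySketch("http://foo.example.org/a/2.html", "Second article",
                    "http://foo.example.org/a/2.html"),
    ],
    updated=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Passing the timestamp in as a parameter keeps the output deterministic, which makes the function easier to test.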
