generic.spiders.feed
Module Contents
Classes
FeedEntry | A class for internal use.
Feed | A class for internal use.
FeedConfig | The YAML configuration file format.
FeedSpiderConfig | Configuration for FeedSpider.
FeedSpider | A spider that generates Atom/RSS feeds. The spider crawls the URLs listed in a configuration file, scrapes links, and generates a feed for each page.
API#
- class generic.spiders.feed.FeedEntry(/, **data: Any)
Bases: pydantic.BaseModel
A class for internal use.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- id: str = None
- title: str = None
- link: str = None
- class generic.spiders.feed.Feed(/, **data: Any)
Bases: pydantic.BaseModel
A class for internal use.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- id: str = None
The URL of the page.
- lang: str = None
The language of the page.
- type: str = None
The type of the feed. Either "atom" or "rss".
- title: str = None
The title of the page.
- class generic.spiders.feed.FeedConfig(/, **data: Any)
Bases: pydantic.BaseModel
The YAML configuration file format.
    ---
    feed_config:
      "http://foo.example.org/latest.html":
        file_name: "latest.xml"
        feed_type: "atom"
        xpath_href: "//li[@class='articles-list__item']/a/@href"
        xpath_title: "//li[@class='articles-list__item']/a/text()"
The top-level key must be feed_config. feed_config is a dictionary. The key is the page URL for the feed. The value is a dictionary with file_name, feed_type, xpath_href, and xpath_title.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- file_name: str = None
The name of the generated feed file. This must be unique per URL. The file is overwritten each time the feed is generated.
- xpath_href: str = None
The XPath to the href attribute of the link.
- xpath_title: str = None
The XPath to the feed title.
- feed_type: str = 'atom'
The type of the feed. Either atom or rss.
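The documented shape of the configuration can be mirrored with a small stand-in model. The sketch below is only an illustration using the standard library: `FeedConfigSketch` and `parse_feed_config` are hypothetical names that are not part of generic.spiders.feed, and the real FeedConfig is a pydantic model whose validation may differ. It assumes the YAML file has already been parsed into a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedConfigSketch:
    """Hypothetical stand-in mirroring the documented FeedConfig fields."""

    file_name: str
    xpath_href: str
    xpath_title: str
    feed_type: str = "atom"  # documented default

    def __post_init__(self) -> None:
        # The documentation allows only these two feed types.
        if self.feed_type not in ("atom", "rss"):
            raise ValueError(
                f"feed_type must be 'atom' or 'rss', got {self.feed_type!r}"
            )


def parse_feed_config(data: dict) -> dict[str, FeedConfigSketch]:
    """Turn an already-parsed YAML mapping into {page URL: FeedConfigSketch}."""
    if "feed_config" not in data:
        raise ValueError("the top-level key must be 'feed_config'")
    feeds = {
        url: FeedConfigSketch(**fields)
        for url, fields in data["feed_config"].items()
    }
    # file_name must be unique per URL; otherwise one feed overwrites another.
    names = [feed.file_name for feed in feeds.values()]
    if len(names) != len(set(names)):
        raise ValueError("file_name must be unique per URL")
    return feeds


# The same data as the YAML example, with feed_type left to its default.
example = {
    "feed_config": {
        "http://foo.example.org/latest.html": {
            "file_name": "latest.xml",
            "xpath_href": "//li[@class='articles-list__item']/a/@href",
            "xpath_title": "//li[@class='articles-list__item']/a/text()",
        }
    }
}
feeds = parse_feed_config(example)
```

The uniqueness check reflects the note on file_name: two URLs sharing one file name would overwrite each other's output.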
- class generic.spiders.feed.FeedSpiderConfig(/, **data: Any)
Bases: generic.spiders.base.GenericSpiderConfig
Configuration for FeedSpider.
Unlike other spiders, this spider does not accept urls. Instead, it requires config, e.g., -a config=/path/to/config.yml, and the configuration file defines the feeds to generate.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- feed_config: dict[str, generic.spiders.feed.FeedConfig] = None
The feeds to generate, keyed by page URL.
- config: pathlib.Path = None
The path to the configuration file.
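As a rough illustration of passing config as a spider argument, the sketch below builds a command line for the spider. It assumes the project is run with the standard `scrapy crawl` command and that the spider is registered under its documented name, feed; `build_crawl_command` is a hypothetical helper, and a project may wrap the invocation differently.

```python
import shlex
from pathlib import PurePosixPath


def build_crawl_command(config: PurePosixPath) -> list[str]:
    """Build the argument list for running the spider with a config file."""
    # "feed" is the spider's documented name; -a passes a spider argument.
    return ["scrapy", "crawl", "feed", "-a", f"config={config}"]


command = build_crawl_command(PurePosixPath("/path/to/config.yml"))
# Join the arguments into a shell-safe string for display or logging.
command_line = shlex.join(command)
print(command_line)
```

Keeping the command as a list until the last moment avoids quoting problems when the configuration path contains spaces.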
- class generic.spiders.feed.FeedSpider(*args, **kwargs)
Bases: generic.spiders.base.GenericSpider[generic.spiders.feed.FeedSpiderConfig]
A spider that generates Atom/RSS feeds. The spider crawls the URLs listed in a configuration file, scrapes links, and generates a feed for each page.
The spider defines custom_settings. In ITEM_PIPELINES, FeedStoragePipeline is set at 900. The pipeline stores the generated feeds.
- Args:
- config (str): The path to the configuration file.
The default is feed.yml.
Initialization
- name = 'feed'
- custom_settings = None
- classmethod get_config_class() -> Type[generic.spiders.feed.FeedSpiderConfig]
Returns the config class for this spider.
- _load_config(path: pathlib.Path)
- async start()
- parse(response: scrapy.http.Response)
- _generate_feed(url: str, feed: generic.spiders.feed.Feed, feed_entries: list[generic.spiders.feed.FeedEntry], file_name: str) -> generic.items.FeedItem
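The body of _generate_feed is not documented here. As a rough idea of how feed metadata and entries could be combined into an Atom document, the sketch below uses only the standard library. `EntrySketch` and `build_atom` are hypothetical names, not the spider's actual implementation, and the output is deliberately incomplete: elements that Atom requires, such as updated, are omitted because the documented models carry no timestamp.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

ATOM_NS = "http://www.w3.org/2005/Atom"
XML_NS = "http://www.w3.org/XML/1998/namespace"


@dataclass
class EntrySketch:
    """Hypothetical stand-in for FeedEntry (id, title, link)."""

    id: str
    title: str
    link: str


def build_atom(feed_id: str, lang: str, title: str,
               entries: list[EntrySketch]) -> str:
    """Serialize feed metadata and entries as a minimal Atom document."""
    # Emit Atom as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", ATOM_NS)
    root = ET.Element(f"{{{ATOM_NS}}}feed", {f"{{{XML_NS}}}lang": lang})
    ET.SubElement(root, f"{{{ATOM_NS}}}id").text = feed_id
    ET.SubElement(root, f"{{{ATOM_NS}}}title").text = title
    for entry in entries:
        node = ET.SubElement(root, f"{{{ATOM_NS}}}entry")
        ET.SubElement(node, f"{{{ATOM_NS}}}id").text = entry.id
        ET.SubElement(node, f"{{{ATOM_NS}}}title").text = entry.title
        # Atom links carry their target in the href attribute.
        ET.SubElement(node, f"{{{ATOM_NS}}}link", {"href": entry.link})
    return ET.tostring(root, encoding="unicode")


xml_text = build_atom(
    feed_id="http://foo.example.org/latest.html",
    lang="en",
    title="Latest articles",
    entries=[
        EntrySketch(
            id="http://foo.example.org/a.html",
            title="Article A",
            link="http://foo.example.org/a.html",
        )
    ],
)
```

An RSS variant would follow the same pattern with channel and item elements in place of feed and entry.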