`generic.spiders.archive`#

Module Contents#

Classes#

`ArchiveSpiderConfig`	A spider configuration class for ArchiveSpider.
`ArchiveSpider`	Parse archive pages, follow links to articles, and proceed to the next archive page if any.

API#

class generic.spiders.archive.ArchiveSpiderConfig(/, **data: Any)#

Bases: generic.spiders.read_more.ReadMoreSpiderConfig

A spider configuration class for ArchiveSpider.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

archive_article_xpath: Optional[str] = "//main//li[@class!=' pr']//h2[@class='title']//a/@href"#

XPath expression to extract article links from the archive page.

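archive_next_xpath: Optional[str] = "//div[contains(@class, 'pagination')]//a[contains(text(), '次へ')]/@href"#

XPath expression to extract the link to the next archive page from the current archive page. The default matches a pagination link whose text contains '次へ' (Japanese for "Next").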
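To make the two defaults concrete, the sketch below applies simplified equivalents to a small, made-up archive page. It uses only Python's standard-library `xml.etree.ElementTree`, which supports a subset of XPath, so part of the filtering is done in plain Python; the spider itself presumably evaluates the full expressions through Scrapy's selectors. The sample markup, the URLs, and the helper names `article_links` and `next_link` are assumptions for illustration, not part of this module.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed archive page used only for this illustration.
PAGE = """
<html><body><main>
  <ul>
    <li class="item"><h2 class="title"><a href="/articles/1">First</a></h2></li>
    <li class=" pr"><h2 class="title"><a href="/ads/9">Sponsored</a></h2></li>
    <li><h2 class="title"><a href="/articles/3">No class attribute</a></h2></li>
    <li class="item"><h2 class="title"><a href="/articles/2">Second</a></h2></li>
  </ul>
  <div class="pagination"><a href="/archive?page=2">次へ</a></div>
</main></body></html>
""".strip()


def article_links(root):
    """Approximate //main//li[@class!=' pr']//h2[@class='title']//a/@href."""
    links = []
    for li in root.findall(".//main//li"):
        # In XPath 1.0, @class != ' pr' is false when the attribute is
        # missing, so <li> elements without a class are skipped as well.
        if li.get("class") in (None, " pr"):
            continue
        for a in li.findall(".//h2[@class='title']//a"):
            href = a.get("href")
            if href is not None:
                links.append(href)
    return links


def next_link(root):
    """Approximate the default archive_next_xpath; return None if no match."""
    for div in root.iter("div"):
        if "pagination" not in div.get("class", ""):
            continue
        for a in div.iter("a"):
            if "次へ" in (a.text or "") and a.get("href"):
                return a.get("href")
    return None


root = ET.fromstring(PAGE)
print(article_links(root))
print(next_link(root))
```

Both fields are declared `Optional[str]`. How the spider treats a value of `None` is not stated on this page, so the sketch does not model that case.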

class generic.spiders.archive.ArchiveSpider(*args, **kwargs)#

Bases: generic.spiders.base.GenericSpider[generic.spiders.archive.ArchiveSpiderConfig], generic.mixins.read_more.ReadMoreMixin

Parse archive pages, follow links to articles, and proceed to the next archive page if any.

A typical archive page consists of:

A list of articles with links
A pagination bar with a “Next” link

The spider collects article links with archive_article_xpath, follows the “Next” link selected by archive_next_xpath, and repeats the process on the next archive page.

The spider’s configuration class is ArchiveSpiderConfig, which inherits from ReadMoreSpiderConfig.

archive_article_xpath
An XPath expression that selects the href attribute of the <a> tag for each article.
archive_next_xpath
An XPath expression that selects the href attribute of the <a> tag for the “Next” archive page.
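The traversal described above can be sketched as a loop that stops when a page has no “Next” link. This is only a model of the control flow under stated assumptions: the in-memory `SITE` mapping, its URLs, and the function name `crawl_archive` are hypothetical, and the `seen` set is a safeguard added by this sketch against pagination loops, not something this page documents.

```python
# Hypothetical in-memory site: archive URL -> (article links, "Next" link or None).
SITE = {
    "/archive?page=1": (["/a/1", "/a/2"], "/archive?page=2"),
    "/archive?page=2": (["/a/3"], "/archive?page=3"),
    "/archive?page=3": (["/a/4", "/a/5"], None),
}


def crawl_archive(start_url):
    """Yield article links page by page until no "Next" link remains."""
    url = start_url
    seen = set()  # guard against a "Next" link that points back to a visited page
    while url is not None and url not in seen:
        seen.add(url)
        articles, next_url = SITE[url]
        # Stands in for what archive_article_xpath would select on this page.
        yield from articles
        # Stands in for what archive_next_xpath would select; None ends the crawl.
        url = next_url


for link in crawl_archive("/archive?page=1"):
    print(link)
```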

Initialization

name = 'archive_spider'#

allowed_domains = ['bunshun.jp']#

start_urls = ['https://bunshun.jp/category/latest?page=300']#

classmethod get_config_class() → Type[generic.spiders.archive.ArchiveSpiderConfig]#

Returns the config class for this spider.

async start()#

parse_archive_index(response)#

Parse the index page of an archive, yielding requests for articles and the next archive page.

Args:

response (scrapy.http.Response): The response object containing the archive page content.

Yields:

scrapy.http.Request: Requests for individual articles and the next archive page.
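The shape of such a callback can be sketched without Scrapy installed. The example below is a minimal illustration, not the spider's actual implementation: `Request` is a hypothetical stand-in for `scrapy.http.Request`, the callback name `parse_article`, the function name `sketch_parse_archive_index`, and the `example.com` URLs are all assumptions, and the extracted hrefs are passed in directly instead of being read from a response. Relative links are resolved against the page URL with `urllib.parse.urljoin`.

```python
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for scrapy.http.Request, for illustration only."""
    url: str
    callback: str


def sketch_parse_archive_index(page_url, article_hrefs, next_href):
    """Yield one request per article, then one for the next archive page, if any."""
    for href in article_hrefs:
        # Article links may be relative, so resolve them against the page URL.
        yield Request(urljoin(page_url, href), callback="parse_article")
    if next_href:
        # The next archive page is handled by the same kind of callback.
        yield Request(urljoin(page_url, next_href), callback="parse_archive_index")


page_url = "https://example.com/category/latest?page=300"
requests = list(
    sketch_parse_archive_index(
        page_url,
        ["/articles/1", "/articles/2"],
        "/category/latest?page=301",
    )
)
for request in requests:
    print(request.callback, request.url)
```

On the last archive page the “Next” href is absent, so the generator yields only article requests and the crawl ends.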
