generic.spiders.archive

Module Contents

Classes

ArchiveSpiderConfig

A spider configuration class for ArchiveSpider.

ArchiveSpider

Parse archive pages, follow links to articles, and proceed to the next archive page if any.

API

class generic.spiders.archive.ArchiveSpiderConfig(/, **data: Any)

Bases: generic.spiders.read_more.ReadMoreSpiderConfig

A spider configuration class for ArchiveSpider.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

archive_article_xpath: Optional[str] = "//main//li[@class!=' pr']//h2[@class='title']//a/@href"

XPath expression to extract article links from the archive page.

archive_next_xpath: Optional[str] = "//div[contains(@class, 'pagination')]//a[contains(text(), '次へ')]/@href"

XPath expression to extract the link to the next archive page from the archive page. The default matches an anchor whose text contains “次へ” (“Next”).
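
As an illustration of what the two default expressions select, the sketch below emulates them in plain Python over a small, invented archive page. It uses xml.etree.ElementTree instead of Scrapy's selectors so that it runs with the standard library alone; the markup, the URLs, and the helper names article_links and next_link are assumptions made for this example and are not part of the package. One detail worth noting: in XPath 1.0, @class!=' pr' is false when the attribute is missing, so an <li> without a class attribute is skipped by the default expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed archive markup invented for this sketch.
ARCHIVE_HTML = """
<html><body><main>
  <ul>
    <li class="item"><h2 class="title"><a href="/articles/1">First</a></h2></li>
    <li class=" pr"><h2 class="title"><a href="/ads/9">Sponsored</a></h2></li>
    <li><h2 class="title"><a href="/articles/2">No class on li</a></h2></li>
    <li class="item"><h2 class="title"><a href="/articles/3">Third</a></h2></li>
  </ul>
  <div class="pagination">
    <a href="/category/latest?page=1">前へ</a>
    <a href="/category/latest?page=3">次へ</a>
  </div>
</main></body></html>
"""


def article_links(root):
    """Emulate //main//li[@class!=' pr']//h2[@class='title']//a/@href."""
    links = []
    for main in root.iter("main"):
        for li in main.iter("li"):
            cls = li.get("class")
            # XPath 1.0: the != test is false when the attribute is absent.
            if cls is None or cls == " pr":
                continue
            for h2 in li.iter("h2"):
                if h2.get("class") != "title":
                    continue
                links.extend(a.get("href") for a in h2.iter("a") if a.get("href"))
    return links


def next_link(root):
    """Emulate //div[contains(@class, 'pagination')]//a[contains(text(), '次へ')]/@href."""
    for div in root.iter("div"):
        if "pagination" not in (div.get("class") or ""):
            continue
        for a in div.iter("a"):
            if "次へ" in (a.text or "") and a.get("href"):
                return a.get("href")  # first match only
    return None


root = ET.fromstring(ARCHIVE_HTML)
print(article_links(root))  # ['/articles/1', '/articles/3']
print(next_link(root))      # /category/latest?page=3
```

For a site with different markup, both fields would presumably be overridden with expressions that match that site's article list and pagination bar.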

class generic.spiders.archive.ArchiveSpider(*args, **kwargs)

Bases: generic.spiders.base.GenericSpider[generic.spiders.archive.ArchiveSpiderConfig], generic.mixins.read_more.ReadMoreMixin

Parse archive pages, follow links to articles, and proceed to the next archive page if any.

A typical archive page consists of:

  • A list of articles with links

  • A pagination bar with a “Next” link

The spider collects article links with archive_article_xpath, then follows the “Next” link matched by archive_next_xpath and repeats the process on the next archive page. Crawling stops when a page has no “Next” link.

The spider’s configuration class is ArchiveSpiderConfig, which inherits from ReadMoreSpiderConfig and adds two fields:
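
The traversal described above can be sketched without Scrapy as a loop over an in-memory site. Everything here is hypothetical: the example.com URLs, the FAKE_ARCHIVE table, and the crawl_archive helper exist only for this illustration, and the visited-page guard is an assumption of the sketch rather than a documented feature of the spider.

```python
from urllib.parse import urljoin

# Hypothetical in-memory site: page URL -> (relative article links, relative "Next" link or None).
FAKE_ARCHIVE = {
    "https://example.com/archive?page=1": (["/a/1", "/a/2"], "/archive?page=2"),
    "https://example.com/archive?page=2": (["/a/3"], "/archive?page=3"),
    "https://example.com/archive?page=3": (["/a/4"], None),
}


def crawl_archive(start_url, pages):
    """Yield absolute article URLs, walking archive pages until there is no "Next" link."""
    seen = set()
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)  # guard against a pagination loop
        article_hrefs, next_href = pages[url]
        for href in article_hrefs:
            # Archive pages often use relative links, so resolve against the page URL.
            yield urljoin(url, href)
        url = urljoin(url, next_href) if next_href else None


for article_url in crawl_archive("https://example.com/archive?page=1", FAKE_ARCHIVE):
    print(article_url)
```

In a real Scrapy spider the loop is implicit: each archive page is handled by a callback that yields further requests, and the scheduler drives the traversal.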

  • archive_article_xpath

    An XPath expression that selects the href attribute of the <a> tags linking to articles.

  • archive_next_xpath

    An XPath expression that selects the href attribute of the <a> tag linking to the “Next” archive page.

Initialization

name = 'archive_spider'
allowed_domains = ['bunshun.jp']
start_urls = ['https://bunshun.jp/category/latest?page=300']
classmethod get_config_class() -> Type[generic.spiders.archive.ArchiveSpiderConfig]

Returns the config class for this spider.

async start()

parse_archive_index(response)

Parse the index page of an archive, yielding requests for articles and the next archive page.

Args:
    response (scrapy.http.Response): The response object containing the archive page content.

Yields:
    scrapy.http.Request: Requests for individual articles and the next archive page.
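
A callback of this shape might look roughly like the sketch below. It is not the package's implementation: parse_archive_index_sketch, its parameters, and the two callback arguments are hypothetical names chosen for illustration. The only Scrapy calls it relies on are response.xpath(...), the selector methods getall() and get(), and response.follow(...), which resolves relative links against the response URL.

```python
def parse_archive_index_sketch(response, article_xpath, next_xpath,
                               article_callback, archive_callback):
    """Yield one request per article link, then one for the next archive page, if any."""
    # Both XPath fields are Optional[str], so skip a step whose expression is unset.
    if article_xpath:
        for href in response.xpath(article_xpath).getall():
            yield response.follow(href, callback=article_callback)

    if next_xpath:
        # get() returns the first match, or None when the page has no "Next" link.
        next_href = response.xpath(next_xpath).get()
        if next_href:
            yield response.follow(next_href, callback=archive_callback)
```

Inside a spider, the two expressions would presumably come from the ArchiveSpiderConfig instance, and archive_callback would be the archive-index callback itself so that pagination continues page by page.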