generic.spiders.archive
Module Contents
Classes
ArchiveSpiderConfig | A spider configuration class for ArchiveSpider.
ArchiveSpider | Parse archive pages, follow links to articles, and proceed to the next archive page if any.
API
- class generic.spiders.archive.ArchiveSpiderConfig(/, **data: Any)
Bases: generic.spiders.read_more.ReadMoreSpiderConfig
A spider configuration class for ArchiveSpider.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- archive_article_xpath: Optional[str] = "//main//li[@class!=' pr']//h2[@class='title']//a/@href"
XPath expression to extract article links from the archive page.
- archive_next_xpath: Optional[str] = "//div[contains(@class, 'pagination')]//a[contains(text(), '次へ')]/@href"
XPath expression to extract the link to the next archive page from the current archive page. The default matches an anchor whose text contains '次へ' (Japanese for “Next”).
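To make the two defaults concrete, the sketch below applies their logic to a small, hypothetical page. The standard library's xml.etree.ElementTree supports only a subset of XPath (no contains() and no attribute selection such as /@href), so the predicates are re-expressed in Python rather than evaluated verbatim. The sample markup and the helper names article_links and next_link are assumptions for illustration and are not part of this module. Two details of the defaults are worth noticing: [@class!=' pr'] compares against the exact string ' pr' (with a leading space), and under XPath 1.0 rules it does not match an <li> that has no class attribute at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; not taken from any real site.
SAMPLE = """<html><body><main>
<ul>
  <li class="item"><h2 class="title"><a href="/articles/1">First</a></h2></li>
  <li class=" pr"><h2 class="title"><a href="/ads/9">Sponsored</a></h2></li>
  <li><h2 class="title"><a href="/articles/2">Second</a></h2></li>
</ul>
<div class="pagination"><a href="?page=1">前へ</a><a href="?page=3">次へ</a></div>
</main></body></html>"""


def article_links(root):
    """Approximate //main//li[@class!=' pr']//h2[@class='title']//a/@href."""
    links = []
    for li in root.iterfind(".//main//li"):
        cls = li.get("class")
        # In XPath 1.0, @class != ' pr' is false when the attribute is absent,
        # so an <li> without a class attribute is skipped as well.
        if cls is None or cls == " pr":
            continue
        for a in li.iterfind(".//h2[@class='title']//a"):
            href = a.get("href")
            if href is not None:
                links.append(href)
    return links


def next_link(root):
    """Approximate //div[contains(@class, 'pagination')]//a[contains(text(), '次へ')]/@href."""
    for div in root.iter("div"):
        if "pagination" not in (div.get("class") or ""):
            continue
        for a in div.iter("a"):
            # a.text stands in for the anchor's first text node.
            if "次へ" in (a.text or "") and a.get("href") is not None:
                return a.get("href")
    return None


root = ET.fromstring(SAMPLE)
print(article_links(root))  # ['/articles/1']
print(next_link(root))      # ?page=3
```

The sponsored item and the item without a class attribute are both left out, which is why a site whose list items carry no class attribute would likely need a custom archive_article_xpath.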
- class generic.spiders.archive.ArchiveSpider(*args, **kwargs)
Bases: generic.spiders.base.GenericSpider[generic.spiders.archive.ArchiveSpiderConfig], generic.mixins.read_more.ReadMoreMixin
Parse archive pages, follow links to articles, and proceed to the next archive page if any.
A typical archive page consists of:
- A list of articles with links
- A pagination bar with a “Next” link
The spider collects article links with archive_article_xpath, follows the “Next” link matched by archive_next_xpath, and repeats the process on the next archive page.
The spider’s configuration is ArchiveSpiderConfig, which inherits from ReadMoreSpiderConfig.
- archive_article_xpath
An XPath expression selecting the href attribute of each article's <a> tag.
- archive_next_xpath
An XPath expression selecting the href attribute of the <a> tag that links to the “Next” archive page.
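The crawl flow described above can be sketched without any network access. This is a minimal illustration under stated assumptions, not the spider's actual implementation: the PAGES mapping, the example.com URLs, and the crawl_archive function are hypothetical, and the hrefs stand in for whatever archive_article_xpath and archive_next_xpath would extract from each page. The visited-set guard is an addition of this sketch; in a real crawl Scrapy's duplicate filter plays a similar role.

```python
from urllib.parse import urljoin

# Hypothetical in-memory site: page URL -> (article hrefs, "Next" href or None).
# In the real spider these values would come from archive_article_xpath and
# archive_next_xpath applied to the downloaded page.
PAGES = {
    "https://example.com/archive?page=1": (["/articles/1", "/articles/2"], "?page=2"),
    "https://example.com/archive?page=2": (["/articles/3"], None),
}


def crawl_archive(start_url, pages):
    """Yield absolute article URLs, following "Next" links until none is left."""
    url = start_url
    visited = set()
    while url is not None and url not in visited:
        visited.add(url)  # guard against pagination loops
        article_hrefs, next_href = pages[url]
        for href in article_hrefs:
            # Resolve relative hrefs against the URL of the page they came from.
            yield urljoin(url, href)
        url = urljoin(url, next_href) if next_href else None


print(list(crawl_archive("https://example.com/archive?page=1", PAGES)))
```

Running it lists the three article URLs in page order and stops at the second page, which has no “Next” link.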
Initialization
- name = 'archive_spider'
- allowed_domains = ['bunshun.jp']
- start_urls = ['https://bunshun.jp/category/latest?page=300']
- classmethod get_config_class() -> Type[generic.spiders.archive.ArchiveSpiderConfig]
Returns the config class for this spider.
- async start()
- parse_archive_index(response)
Parse the index page of an archive, yielding requests for articles and the next archive page.
- Args:
- response (scrapy.http.Response):
The response object containing the archive page content.
- Yields:
- scrapy.http.Request:
Requests for individual articles and the next archive page.