generic.mixins.read_more#

Module Contents#

Classes#

ReadMoreCompatible

ReadMoreMixinConfig

ReadMoreMixin

Provides recursive article parsing capability. See also: ReadMoreSpider.

API#

class generic.mixins.read_more.ReadMoreCompatible#

Bases: typing.Protocol

args: generic.spiders.base.GenericSpiderConfig = None#

logger: any = None#

class generic.mixins.read_more.ReadMoreMixinConfig(/, **data: Any)#

Bases: generic.spiders.base.GenericSpiderConfig

read_more: str = '記事全文を読む'#

Text of the <a> tag that links to the main article. The default is Japanese for “Read the full article”.

read_more_xpath: Optional[str] = None#

XPath expression that matches the link to the main article.

read_next: str = '次へ'#

Text of the <a> tag that links to the next page. The default is Japanese for “Next”.

read_next_contains: Optional[str] = None#

source_contains: Optional[str] = None#

source_parent_contains: Optional[str] = None#
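The fields above can be pictured with a small stand-in. This is only a sketch: `ReadMoreSettings` is a hypothetical dataclass that mirrors the documented names and defaults, not the real `ReadMoreMixinConfig` (which derives from `GenericSpiderConfig`), and the English override values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadMoreSettings:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    read_more: str = "記事全文を読む"
    read_more_xpath: Optional[str] = None
    read_next: str = "次へ"
    read_next_contains: Optional[str] = None
    source_contains: Optional[str] = None
    source_parent_contains: Optional[str] = None


# Override only what differs from the defaults, e.g. for an English-language site.
english_site = ReadMoreSettings(read_more="Read more", read_next="Next")
```

Fields left unset keep their documented defaults, so `english_site.read_more_xpath` stays `None`.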
class generic.mixins.read_more.ReadMoreMixin#

Provides recursive article parsing capability. See also: ReadMoreSpider.

parse_summary_page(res: scrapy.http.Response)#

Parse the summary article. If no “Read more” link is found, the page is assumed to be the main article and is parsed as one.

Yields:

Request to the main article if a “Read more” link is found; otherwise an ArticleItem.
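The branching described above might look roughly like the following. It is a framework-free sketch: `parse_summary`, `find_read_more`, and `parse_as_article` are hypothetical names, and a plain dict stands in for the Scrapy request that the real method would yield.

```python
from urllib.parse import urljoin


def parse_summary(page_url, find_read_more, parse_as_article):
    """Yield a request for the main article, or fall back to parsing this page."""
    href = find_read_more()
    if href is not None:
        # Link found: hand off to the main article (resolved against the page URL).
        yield {"request": urljoin(page_url, href)}
    else:
        # No link: treat the current page as the main article.
        yield from parse_as_article()


with_link = list(
    parse_summary("https://example.com/news/1", lambda: "/full/1", lambda: iter([]))
)
without_link = list(
    parse_summary("https://example.com/news/1", lambda: None, lambda: iter([{"title": "t"}]))
)
```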

parse_article(res: scrapy.http.Response, item: generic.items.ArticleItem = None)#

Parse an article.

Yields:

ArticleItem

parse_source_only(res: scrapy.http.Response, parent_item: generic.items.ArticleItem, remaining_urls: list[str])#

Parse source pages and add the source article to the parent item.

This method handles the extraction of source articles from response objects. It attempts to create an ArticleItem from the response and appends it to the parent item’s sources list. If the extraction fails, it logs the error and continues processing the remaining URLs.

English-language news outlets, unlike Japanese ones, generally avoid pagination for better UX. The method therefore assumes that each source page is a single-page article and does not follow “Next page” links.
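The log-and-continue behaviour can be sketched as a plain loop. This is a synchronous stand-in under stated assumptions: the real method works through chained Scrapy requests, and `attach_sources`, `demo_extract`, and the dict-shaped items are hypothetical.

```python
import logging

logger = logging.getLogger("read_more_sketch")


def attach_sources(parent_item, urls, extract):
    """Append one extracted source per URL; log failures and keep going."""
    remaining = list(urls)
    while remaining:
        url = remaining.pop(0)
        try:
            parent_item["sources"].append(extract(url))
        except Exception as exc:
            # A failed source must not abort processing of the remaining URLs.
            logger.error("source extraction failed for %s: %s", url, exc)
    return parent_item


def demo_extract(url):
    """Hypothetical extractor that fails for one URL."""
    if "broken" in url:
        raise ValueError("no article found")
    return {"url": url}


parent = attach_sources(
    {"title": "parent", "sources": []},
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    demo_extract,
)
```

The failing URL is logged and skipped, while the sources before and after it are still attached.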

Find the “Read more…” link in the HTML response.

This method searches for a “Read more…” link in the provided HTTP response. It first checks if a custom XPath expression is provided in the arguments. If so, it uses this XPath to find the link. Otherwise, it searches for a link with the default text “Read more…” using a predefined XPath expression.

Args:

res: The HTTP response to search for the link.

Returns:

str: The href attribute of the found link, or None if no link is found.
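A minimal sketch of this lookup order, using the standard library's limited XPath support instead of Scrapy selectors. It assumes well-formed markup; `find_read_more_href` is a hypothetical name, and the fallback expression is an illustrative guess at the predefined XPath, which this page does not show.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def find_read_more_href(
    html: str,
    read_more: str = "記事全文を読む",
    read_more_xpath: Optional[str] = None,
) -> Optional[str]:
    """Return the href of the "Read more" link, or None if there is none."""
    root = ET.fromstring(html)
    # A custom expression, when given, takes precedence over the text match.
    xpath = read_more_xpath or f".//a[.='{read_more}']"
    link = root.find(xpath)
    return link.get("href") if link is not None else None


page = "<html><body><p>summary</p><a href='/full/1'>記事全文を読む</a></body></html>"
custom = "<html><body><a class='more' href='/full/2'>続き</a></body></html>"

default_hit = find_read_more_href(page)
custom_hit = find_read_more_href(custom, read_more_xpath=".//a[@class='more']")
no_hit = find_read_more_href("<html><body><p>full text</p></body></html>")
```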

Find the “Next Page” link in the provided HTTP response.

This method searches for a link that indicates the next page of content. It can use either a specific text pattern or a direct text match to locate the link.

Args:

res: The HTTP response to search for the next page link.

Returns:

str: The href attribute of the found link, or None if no link is found.
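The two matching modes could be sketched as follows. This assumes that a configured `read_next_contains` value takes precedence over the exact `read_next` match, which this page does not confirm; `find_next_href` is a hypothetical name, and the standard library parser stands in for Scrapy selectors.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def find_next_href(
    html: str,
    read_next: str = "次へ",
    read_next_contains: Optional[str] = None,
) -> Optional[str]:
    """Return the href of the first link that matches, or None."""
    root = ET.fromstring(html)
    for link in root.iter("a"):
        text = "".join(link.itertext()).strip()
        if read_next_contains is not None:
            matched = read_next_contains in text  # substring pattern
        else:
            matched = text == read_next  # direct text match
        if matched:
            return link.get("href")
    return None


exact = find_next_href("<div><a href='/p/1'>前へ</a><a href='/p/3'>次へ</a></div>")
partial = find_next_href(
    "<div><a href='/p/2'>Next page (2/3)</a></div>", read_next_contains="Next page"
)
missing = find_next_href("<div><a href='/top'>Top</a></div>")
```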

_merge_article_body(base_item: generic.items.ArticleItem, next_res: scrapy.http.Response) → generic.items.ArticleItem#

Merge the content of an article from next_res into base_item.

This method extracts the article content from next_res and appends it to the existing content in base_item. It keeps the resulting XML structure valid by merging the new content into the existing <main> element instead of introducing a duplicate <main> tag.

Args:

base_item: The base article item to which the content will be merged.

next_res: The response containing the article content to be merged.

Returns:

ArticleItem: The merged article item with the content from next_res appended.

Raises:

ValueError: If the base_item.body does not contain a <main> tag.

etree.XMLSyntaxError: If there is an error parsing the XML content.
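The merge can be illustrated on plain XML strings. This is a sketch, not the method itself: it uses the standard library's `xml.etree.ElementTree` (which raises `ParseError` rather than the `etree.XMLSyntaxError` listed above), `merge_main` is a hypothetical name, and it assumes both bodies are well-formed XML.

```python
import xml.etree.ElementTree as ET


def merge_main(base_body: str, next_body: str) -> str:
    """Append the children of the next page's <main> to the base <main>."""
    base_root = ET.fromstring(base_body)
    base_main = base_root if base_root.tag == "main" else base_root.find(".//main")
    if base_main is None:
        raise ValueError("base body does not contain a <main> tag")

    next_root = ET.fromstring(next_body)
    next_main = next_root if next_root.tag == "main" else next_root.find(".//main")
    # Move only the children, so the result never contains a second <main>.
    nodes = list(next_main) if next_main is not None else [next_root]
    for node in nodes:
        base_main.append(node)
    return ET.tostring(base_root, encoding="unicode")


merged = merge_main("<main><p>page 1</p></main>", "<main><p>page 2</p></main>")
```

In this sketch, leading text placed directly inside the second `<main>` (outside any child element) is dropped; a fuller version would carry it over as well.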

Find source links.

_find_and_request_sources(res: scrapy.http.Response, item: generic.items.ArticleItem)#

Find source articles.

_request_next_source(item: generic.items.ArticleItem, urls: list[str])#