generic.spiders.read_more#
Module Contents#
Classes#
A spider that extracts an article from a landing page, the main article page, and any “Next” pages. It also supports single-page articles. This spider is useful when an RSS feed returns a link to a landing page rather than to the main article.
API#
- class generic.spiders.read_more.ReadMoreSpiderConfig(/, **data: Any)#
- class generic.spiders.read_more.ReadMoreSpider(*args, **kwargs)#
Bases:
generic.spiders.base.GenericSpider[generic.spiders.read_more.ReadMoreSpiderConfig], generic.mixins.read_more.ReadMoreMixin
A spider that extracts an article from a landing page, the main article page, and any “Next” pages. It also supports single-page articles. This spider is useful when an RSS feed returns a link to a landing page rather than to the main article.
This spider processes summary pages that contain links to main articles. For example, a summary page might contain a link such as
<a href="main.html">Read more...</a>.
It supports the following cases:
Summary page -> Main article page
Summary page -> Main article page -> Next page(s)
Main article page -> Next page(s)
Main article page
The content of the summary page is not included in generic.items.ArticleItem when the page contains a read_more link. Otherwise, the content is included as part of the article.
When the main article is split across multiple pages, specify read_next. The spider crawls all the pages and returns a single ArticleItem.
The spider accepts a comma-separated list of summary page URLs and returns an ArticleItem for each main article.
When no link with the read_more text is found, the spider parses the summary page itself and proceeds to the next page if it finds one.
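The four cases above can be sketched without Scrapy as a walk over an in-memory site. This is only an illustration of the described behaviour, not the spider's actual implementation: `Page`, `SITE`, and `assemble_article` are hypothetical names, and each page is assumed to be already parsed into its body text plus its read_more and read_next targets.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Page:
    """Hypothetical pre-parsed page: body text plus the two links of interest."""
    body: str
    read_more: Optional[str] = None  # URL behind the read_more link, if any
    read_next: Optional[str] = None  # URL behind the read_next link, if any


def assemble_article(site: Dict[str, Page], start_url: str) -> str:
    """Follow read_more once, then chain read_next links, joining page bodies."""
    url: Optional[str] = start_url

    # A start page with a read_more link is treated as a summary page:
    # its own content is skipped and the main article page is used instead.
    if site[start_url].read_more is not None:
        url = site[start_url].read_more

    parts: List[str] = []
    seen: Set[str] = set()  # guards against "next" links that loop back
    while url is not None and url not in seen:
        seen.add(url)
        page = site[url]
        parts.append(page.body)
        url = page.read_next
    return "\n".join(parts)


# Hypothetical site covering all four cases.
SITE = {
    "summary.html": Page("summary", read_more="main.html"),
    "main.html": Page("part 1", read_next="page2.html"),
    "page2.html": Page("part 2"),
    "teaser.html": Page("teaser", read_more="single.html"),
    "single.html": Page("whole article"),
}
```

Starting from `summary.html` or from `main.html` yields the same two-part article, while `single.html` is returned as is because it has neither link.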
allowed_domains is automatically set to the domain names of the given urls. It is recommended to pass URLs under the same domain.
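As a rough illustration of how a comma-separated urls string could map to allowed_domains, the sketch below uses only the standard library. `domains_from_urls` is a hypothetical helper, and the article paths are made up; the spider may derive the domain differently (for example, by using the registered domain instead of the full host name).

```python
from typing import List
from urllib.parse import urlparse


def domains_from_urls(urls: str) -> List[str]:
    """Split a comma-separated URL string and collect unique host names in order."""
    domains: List[str] = []
    for raw in urls.split(","):
        host = urlparse(raw.strip()).hostname  # None for empty or schemeless input
        if host and host not in domains:
            domains.append(host)
    return domains


# Hypothetical input: two summary pages under the same domain.
same_domain = domains_from_urls(
    "https://news.yahoo.co.jp/articles/a, https://news.yahoo.co.jp/articles/b"
)
# Hypothetical input: URLs under two different domains.
mixed = domains_from_urls("https://example.com/x,https://example.org/y")
```

With URLs under one domain the result has a single entry, which keeps the crawl scoped to that site; mixed input widens the allowed set accordingly.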
- Args:
- urls:
Comma-separated string of summary page URLs. Mandatory.
- read_more:
Text of the <a> tag that links to the main article. Default is 記事全文を読む ("Read the full article").
- read_more_xpath:
XPath query that matches the <a> tag. /@href is automatically appended to the query.
When read_more_xpath is not None, read_more is ignored.
When the query matches multiple elements, the first one is used.
Default is None.
An example:
//h3[contains(text(), "関連記事")]/following-sibling::ul[1]/li/a
//h3
Look everywhere: search the entire document for any level-3 heading (<h3>).
[contains(text(), "関連記事")]
Filter by text: of those headings, keep only the ones whose text contains 関連記事 ("Related Articles").
/following-sibling::ul[1]
Find the next list: among the siblings that follow the heading at the same level, pick the first unordered list (<ul>).
/li/a
Go inside the list items: step into each list item (<li>), then into the link tag (<a>) inside it.
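To make the steps of this query concrete, the sketch below walks the same path by hand with the standard library's xml.etree.ElementTree, which does not support contains() or following-sibling itself. It is not how the spider evaluates XPath; the markup, `related_hrefs`, and `first_href` are hypothetical, and the sample is well-formed XML so that ElementTree can parse it.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical, well-formed markup; real pages need an HTML-tolerant parser.
HTML = """
<main>
  <h3>関連記事</h3>
  <ul>
    <li><a href="main.html">first</a></li>
    <li><a href="other.html">second</a></li>
  </ul>
  <h3>Ads</h3>
  <ul>
    <li><a href="ad.html">ad</a></li>
  </ul>
</main>
"""


def related_hrefs(root: ET.Element, label: str) -> List[str]:
    """Mimic //h3[contains(text(), label)]/following-sibling::ul[1]/li/a/@href."""
    hrefs: List[str] = []
    for parent in root.iter():  # any element may be the parent of an <h3>
        children = list(parent)
        for i, child in enumerate(children):
            # //h3[contains(text(), label)]
            if child.tag != "h3" or label not in (child.text or ""):
                continue
            # following-sibling::ul[1]: first <ul> among the later siblings
            first_ul = next((c for c in children[i + 1:] if c.tag == "ul"), None)
            if first_ul is None:
                continue
            # /li/a, then the automatically appended /@href
            for a in first_ul.findall("./li/a"):
                href = a.get("href")
                if href is not None:
                    hrefs.append(href)
    return hrefs


def first_href(root: ET.Element, label: str) -> Optional[str]:
    """When several links match, only the first one is used."""
    hrefs = related_hrefs(root, label)
    return hrefs[0] if hrefs else None


root = ET.fromstring(HTML)
```

Here `related_hrefs(root, "関連記事")` collects both links under the matching heading, and `first_href` keeps only `main.html`, mirroring the rule that the first match wins.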
- read_next:
Text of the <a> tag that links to the next page. The spider follows the link whose text exactly matches this value.
Default is 次へ ("Next").
- read_next_contains:
Text of the <a> tag that links to the next page. The spider follows the link whose text contains this value.
Default is None.
- source_contains:
Matches <a> tags whose text contains contains_text.
When contains_text is US版 ("US edition"), the spider picks all of the following <a> tags:
<main>
  <a href="#">US版</a>
  <p><a href="#">US版</a></p>
</main>
- source_parent_contains:
Matches <a> tags whose parent contains the value.
When parent_contains_text is 英語記事 ("English article"), the spider picks all of the following <a> tags:
<main>
  <p>英語記事: <a href="#">foo</a> / <a href="#">bar</a></p>
</main>
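The difference between the exact, contains, and parent-contains matching described in the arguments above can be sketched with the standard library. This is an illustration under assumptions, not the spider's code: the helper names are hypothetical, "exact" is taken to mean an unstripped string comparison, and "parent contains" is taken to mean that the parent's full text content includes the value. The pager markup is made up; the other two snippets are the examples from the argument descriptions.

```python
import xml.etree.ElementTree as ET
from typing import List


def link_text(a: ET.Element) -> str:
    """Full text content of a link, including text of nested elements."""
    return "".join(a.itertext())


def exact_text(root: ET.Element, value: str) -> List[ET.Element]:
    """read_next style: link text must equal the value (assumed unstripped)."""
    return [a for a in root.iter("a") if link_text(a) == value]


def text_contains(root: ET.Element, value: str) -> List[ET.Element]:
    """read_next_contains / source_contains style: link text includes the value."""
    return [a for a in root.iter("a") if value in link_text(a)]


def parent_contains(root: ET.Element, value: str) -> List[ET.Element]:
    """source_parent_contains style: the link's parent text includes the value."""
    matches: List[ET.Element] = []
    for parent in root.iter():
        if value in "".join(parent.itertext()):
            matches.extend(child for child in parent if child.tag == "a")
    return matches


# Hypothetical pager with one exact and one partial match for 次へ.
pager = ET.fromstring(
    '<nav><a href="p2.html">次へ</a><a href="p3.html">次へ (2)</a></nav>'
)
source = ET.fromstring(
    '<main> <a href="#">US版</a> <p><a href="#">US版</a></p> </main>'
)
parent = ET.fromstring(
    '<main> <p>英語記事: <a href="#">foo</a> / <a href="#">bar</a></p> </main>'
)
```

In this sketch the exact match on the pager selects only `p2.html`, the contains match selects both pager links, and the parent match returns the `foo` and `bar` links because their shared `<p>` holds the label.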
Initialization
- name = 'read-more'#
- allowed_domains = ['news.yahoo.co.jp']#
- classmethod get_config_class() Type[generic.spiders.read_more.ReadMoreSpiderConfig]#
Returns the config class for this spider.
- async start()#
- parse(res: scrapy.http.Response)#