ReadMoreSpider#

This spider is a versatile spider that collects texts from websites. It supports multiple pages in an article, a landing page to the main article page, and articles with sources, e.g., English version of the article.

This spider is suitable for:

  • A single page article

  • A multi-pages article with a navigation bar with a link to Next page

  • An article with links to source articles, e.g., links to original English articles

Usage#

uv run scrapy crawl -a'urls=https://example.org/pickup/6567310' -O foo.jsonl read-more

The above command will crawl the URL, skip the landing page, crawl all the pages of an article, and generate a JSONL file.

To see the result in the file, open it with a text editor, or use jq.

jq < foo.jsonl

How It Works#

Some websites have a landing page for each article, notably, Yahoo! News. The structure is:

  • A landing page (/pickup/${PICKUP_ID}) with a link to the first article page (記事全文を読む, or read_more link)

  • The first page of the article (/articles/${ARTICLE_ID}) with a link to optional, subsequent article pages (次へ, or read_next link)

  • Optional article pages (/article/${ARTICLE_ID}?page=${N}) with a link to the next article page (次へ).

        ---
title: ReadMoreSpider State Diagram
---
stateDiagram-v2

    [*] --> LandingPage
    LandingPage --> MainArticle
    MainArticle --> FindSourceArticle: read_next is not found
    MainArticle --> NextPage: read_next is found
    NextPage --> FindSourceArticle: read_next is not found
    NextPage --> NextPage: read_next is found
    FindSourceArticle --> GenerateItem: source is not found
    FindSourceArticle --> LandingPage: source is found
    GenerateItem --> [*]
    

Given a URL, or URLs, the spider crawls the URL, finds a link to the first page of the article until there is no article pages while collecting texts from the article pages.

If the landing page has a read_more link, the spider follows the link. Because the landing page contains just a summary of the article, the spider does not collect texts from the landing page. However, if the landing page does not have read_more link, the spider considers the page is the first page of the article and collects text from the page.

After following the read_more link, the spider collects texts from the article page, looks for a read_next link. If it finds the link, repeat this step until no read_next link is found in the pages.

When no more read_next link is found, the spider looks for links to source articles. If it finds one, repeat the entire process.

When no source link is found, it finishes crawling.

When it finishes crawling and scraping article texts, it generates ArticleItem, an object that represents an article. The ArticleItem has an attribute, body, which includes all the texts from the article, among other metadata attributes. The object can be exported to a file, such as JSONL or CSV files.