# ReadMoreSpider
This versatile spider collects text from websites. It supports articles that span multiple pages, landing pages that lead to the main article page, and articles with links to source articles, e.g., the English version of an article.
This spider is suitable for:

- A single-page article
- A multi-page article with a navigation bar that links to the next page
- An article with links to source articles, e.g., links to original English articles
## Usage
```shell
uv run scrapy crawl read-more -a 'urls=https://example.org/pickup/6567310' -O foo.jsonl
```
The above command will crawl the URL, skip the landing page, crawl all the pages of an article, and generate a JSONL file.
To see the result in the file, open it with a text editor, or use jq.
```shell
jq . < foo.jsonl
```
## How It Works
Some websites have a landing page for each article, notably Yahoo! News. The structure is:

1. A landing page (`/pickup/${PICKUP_ID}`) with a link to the first article page (記事全文を読む, "Read the full article", the `read_more` link)
2. The first page of the article (`/articles/${ARTICLE_ID}`) with a link to optional, subsequent article pages (次へ, "Next", the `read_next` link)
3. Optional article pages (`/articles/${ARTICLE_ID}?page=${N}`) with a link to the next article page (次へ)
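Given the URL shapes above, the page types can be told apart by pattern matching. The regexes below are illustrative assumptions, not the spider's actual link-extraction rules:

```python
import re

# Illustrative patterns for the page types described above; the real
# spider's link-extraction rules may differ.
LANDING_RE = re.compile(r"/pickup/(?P<pickup_id>\d+)")
ARTICLE_RE = re.compile(r"/articles/(?P<article_id>[\w-]+)")

def classify(url: str) -> str:
    """Return the kind of page a URL points at: landing, article, or other."""
    if LANDING_RE.search(url):
        return "landing"
    if ARTICLE_RE.search(url):
        return "article"
    return "other"

print(classify("https://example.org/pickup/6567310"))          # landing
print(classify("https://example.org/articles/abc123?page=2"))  # article
```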
```mermaid
---
title: ReadMoreSpider State Diagram
---
stateDiagram-v2
    [*] --> LandingPage
    LandingPage --> MainArticle
    MainArticle --> FindSourceArticle: read_next is not found
    MainArticle --> NextPage: read_next is found
    NextPage --> FindSourceArticle: read_next is not found
    NextPage --> NextPage: read_next is found
    FindSourceArticle --> GenerateItem: source is not found
    FindSourceArticle --> LandingPage: source is found
    GenerateItem --> [*]
```
Given one or more URLs, the spider crawls each URL, follows the link to the first page of the article, and keeps following next-page links until there are no more article pages, collecting text from each article page along the way.
If the landing page has a `read_more` link, the spider follows it. Because the landing page contains only a summary of the article, the spider does not collect text from it. If the landing page has no `read_more` link, however, the spider treats the page as the first page of the article and collects text from it.
After following the `read_more` link, the spider collects text from the article page and looks for a `read_next` link. If it finds one, it repeats this step until no `read_next` link is found.
When no more `read_next` links are found, the spider looks for links to source articles. If it finds one, it repeats the entire process starting from that article.
When no source link is found, crawling finishes.
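The control flow above can be simulated without Scrapy. The sketch below models pages as plain dicts whose keys mirror the link names (`read_more`, `read_next`, `source`); it illustrates the state machine, not the spider's actual implementation:

```python
def crawl(pages, start):
    """Simulate ReadMoreSpider's control flow: return one body per article."""
    page = pages[start]
    # LandingPage: only a summary, so follow read_more without collecting text.
    if "read_more" in page:
        page = pages[page["read_more"]]
    # MainArticle / NextPage: collect text while a read_next link exists.
    parts = [page["text"]]
    while "read_next" in page:
        page = pages[page["read_next"]]
        parts.append(page["text"])
    bodies = ["\n".join(parts)]
    # FindSourceArticle: if a source link exists, repeat the whole process.
    if "source" in page:
        bodies += crawl(pages, page["source"])
    return bodies

# A toy site: a two-page article whose last page links to a source article.
pages = {
    "/pickup/1": {"read_more": "/articles/a"},
    "/articles/a": {"text": "p1", "read_next": "/articles/a?page=2"},
    "/articles/a?page=2": {"text": "p2", "source": "/pickup/2"},
    "/pickup/2": {"read_more": "/articles/b"},
    "/articles/b": {"text": "original"},
}
print(crawl(pages, "/pickup/1"))  # ['p1\np2', 'original']
```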
When it finishes crawling and scraping the article text, it generates an `ArticleItem`, an object that represents an article. Among other metadata attributes, `ArticleItem` has a `body` attribute that contains all the text from the article's pages. The object can be exported to a file, such as a JSONL or CSV file.