ArchiveSpider

This spider starts crawling at the given URL, extracts all links to articles on the page, follows the “Next” link to the next archive index page, and repeats the process until no “Next” link is found. It is designed for the typical “Archive” pages found on popular CMSs.

Usage

The spider requires two arguments: archive_next_xpath and archive_article_xpath.

archive_next_xpath is an XPath expression that selects the href attribute of the link to the next page.

archive_article_xpath is an XPath expression that selects the href attributes of the links to archive pages.

Suppose the archive index page has many links to archive pages under a <ul> tag:

<ul>
  <li class="blog__post">
    <div class="post-date">
      <p>January 10, 2025</p>
    </div>
    <div class="post-container">
      <h3>
        <a href="/2025/1/10/this-week-in-rails">Sorted Columns in Schema Dumper, Deprecations, and lots of fixes!</a>
      </h3>
    </div>
  </li>
  <li class="blog__post">
    <!-- other link -->
  </li>
</ul>

One possible XPath expression for the links is:

//li[@class='blog__post']//h3/a/@href

This XPath returns a list of relative URLs to the archive posts on the page.

  • //li[@class='blog__post'] finds all list items (<li>) anywhere in the document whose class attribute is exactly blog__post. A list item that carries additional classes would not match.

  • //h3 searches inside those list items for any Level 3 Heading (<h3>), regardless of how deep they are nested.

  • /a selects the anchor tag (<a>) that is a direct child of the <h3>.

  • /@href extracts the value of the href attribute (the URL) from the link, rather than the text of the link itself.
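The breakdown above can be checked against the sample markup with a small sketch. It uses only the Python standard library, which is an assumption made for portability: xml.etree.ElementTree supports just a subset of XPath and has no attribute step, so the final /@href is replaced by reading href from each selected <a>. In a Scrapy callback the full expression could be passed to response.xpath(...).getall() instead. The last step, resolving the relative URLs with urljoin, is also only an illustration of how such URLs could be made absolute.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The sample markup from above (well-formed, so an XML parser can read it).
HTML = """
<ul>
  <li class="blog__post">
    <div class="post-date">
      <p>January 10, 2025</p>
    </div>
    <div class="post-container">
      <h3>
        <a href="/2025/1/10/this-week-in-rails">Sorted Columns in Schema Dumper, Deprecations, and lots of fixes!</a>
      </h3>
    </div>
  </li>
  <li class="blog__post">
    <!-- other link -->
  </li>
</ul>
"""

root = ET.fromstring(HTML)

# ElementTree cannot end a path with /@href, so select the <a> elements
# and read the href attribute from each one.
anchors = root.findall(".//li[@class='blog__post']//h3/a")
relative_urls = [a.get("href") for a in anchors]

# Resolve each relative URL against the page it was found on.
BASE_URL = "https://rubyonrails.org/blog/"
absolute_urls = [urljoin(BASE_URL, href) for href in relative_urls]

print(relative_urls)
print(absolute_urls)
```

The second <li> contributes nothing because it contains no <h3>/<a> pair, so only the first post is returned.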

Suppose the page has a navigation link to the next archive index page, and the link text starts with “See more posts”:

<div class="blog__pagination">
  <a href="/blog/page/2"><span>See more posts…</span></a>
</div>

One possible XPath expression for the link is:

//div[@class='blog__pagination']//a[contains(., 'See more posts')]/@href
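The expression uses contains(., ...) rather than an exact text comparison because the link text sits inside a nested <span> and ends with an ellipsis. Inside contains(), "." stands for the text of the element together with all of its descendants. The following sketch mimics that test with the standard library only (ElementTree has no contains() function, so the check is written in plain Python); find_next_href is a hypothetical helper, not part of the spider.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The pagination markup from above; the root element is the <div> itself.
HTML = """
<div class="blog__pagination">
  <a href="/blog/page/2"><span>See more posts…</span></a>
</div>
"""

root = ET.fromstring(HTML)


def find_next_href(div, needle):
    """Return the href of the first <a> whose full text contains needle, else None."""
    for a in div.iter("a"):
        # Joining itertext() gives the text of the element and all descendants,
        # which is what "." evaluates to inside contains(., ...).
        if needle in "".join(a.itertext()):
            return a.get("href")
    return None


next_href = find_next_href(root, "See more posts")
# Resolve the relative link against the index page it was found on.
next_url = urljoin("https://rubyonrails.org/blog/", next_href) if next_href else None
print(next_url)
```

When no link matches, the helper returns None, which corresponds to the case where the spider finds no “Next” link and stops.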

With these XPath expressions, all the archive pages can be scraped with:

uv run scrapy crawl -a 'urls=https://rubyonrails.org/blog/' \
    -a "archive_article_xpath=//li[@class='blog__post']//h3/a/@href" \
    -a "archive_next_xpath=//div[@class='blog__pagination']//a[contains(., 'See more posts')]/@href" \
    -O foo.jsonl archive
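XPath expressions contain brackets, parentheses, and quotes, so shell quoting in a command like this is easy to get wrong. One way to sidestep that, shown here as a sketch rather than a recommended workflow, is to build the argument list in Python and let shlex.join produce the quoted command line. The arguments are the ones from the command above.

```python
import shlex

# Each list item is exactly one argument as the scrapy process should receive it.
args = [
    "uv", "run", "scrapy", "crawl",
    "-a", "urls=https://rubyonrails.org/blog/",
    "-a", "archive_article_xpath=//li[@class='blog__post']//h3/a/@href",
    "-a", "archive_next_xpath=//div[@class='blog__pagination']//a[contains(., 'See more posts')]/@href",
    "-O", "foo.jsonl",
    "archive",
]

# shlex.join quotes every argument so that a POSIX shell passes it through unchanged.
command = shlex.join(args)
print(command)
```

Passing the same list to subprocess.run(args) would avoid the shell, and therefore the quoting problem, altogether.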

How It Works

---
title: ArchiveSpider State Diagram
---
flowchart TD
    StartUrls[urls] --> ArchiveIndexPage
    ArchiveIndexPage[Parse an archive index page] --> IfArchivePageFound{Archive page found?}
    IfArchivePageFound --> |Yes| ReadMore
    IfArchivePageFound --> |No| IfNextArchiveIndexPageFound{Next archive page found?}
    ReadMore --> ArchiveIndexPage
    IfNextArchiveIndexPageFound --> |Yes| ArchiveIndexPage
    IfNextArchiveIndexPageFound --> |No| End

The spider parses an index page of archive pages. When it finds links to archive pages, it scrapes them.

The spider internally uses ReadMore to scrape an archive page. That is, when an archive page has a “Next” link, the spider also collects all of its subsequent pages.

The spider then follows the link to the next archive index page and repeats the process above.

When no “Next” link to an archive index page is found, the spider terminates.
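The control flow described above can be sketched without Scrapy. Everything in this example is hypothetical: IndexPage, crawl_archive, and the in-memory SITE stand in for real HTTP responses, scraping an archive page (delegated to ReadMore in the real spider) is reduced to recording its URL, and the visited set is an added safeguard against pagination loops that the description above does not mention.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IndexPage:
    """A hypothetical archive index page: article links plus an optional "Next" link."""
    article_urls: List[str] = field(default_factory=list)
    next_url: Optional[str] = None


def crawl_archive(start_url: str, site: Dict[str, IndexPage]) -> List[str]:
    """Walk the index pages from start_url and return the article URLs in visit order."""
    scraped: List[str] = []
    visited = set()
    url: Optional[str] = start_url
    # Stop when there is no "Next" link, or when a link points back to a seen page.
    while url is not None and url not in visited:
        visited.add(url)
        page = site[url]
        # The real spider would scrape each article here; this sketch only records it.
        scraped.extend(page.article_urls)
        url = page.next_url
    return scraped


# A two-page stand-in for a blog archive; the second article URL is made up.
SITE = {
    "/blog/": IndexPage(["/2025/1/10/this-week-in-rails"], next_url="/blog/page/2"),
    "/blog/page/2": IndexPage(["/hypothetical-older-post"], next_url=None),
}

print(crawl_archive("/blog/", SITE))
```

A loop is used instead of recursion to keep the sketch short; in Scrapy the same flow would normally be expressed by yielding a new request for each link from a callback.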