Overview of Spiders#
Spiders are programs that crawl websites and collect data. They are a fundamental tool in web scraping and data extraction.
Spiders collect one of two kinds of data: `ArticleItem` or files.
`ArticleItem` is structured data: the scraped text plus metadata.
`ArticleItem` has many attributes, including `body`, which holds the text scraped from pages, and metadata fields such as the title of the article page:
{
"acquired_time": "2026-01-29T03:18:20.300844+00:00",
"body": "<main> ... </main>",
"url": "https://example.org/articles/d74c1662a8cd4ba0146d7f334c3058685320f611",
"lang": "ja",
"author": "Someone",
"description": "A description ... ",
"kind": "article",
"modified_time": "2026-01-29T11:05:09+09:00",
"published_time": "2026-01-29T11:05:09+09:00",
"site_name": "Foo website",
"title": "A title ...",
"item_type": "ArticleItem",
"character_count": 42,
"sources": []
}
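For illustration, the fields above could be modeled roughly like this (a sketch only; the real `ArticleItem` is defined in the project, likely as a Scrapy item, and its actual types and defaults may differ):

```python
# Sketch of ArticleItem as a plain dataclass, mirroring the field names
# in the JSON example above. Types and defaults are assumptions.
from dataclasses import dataclass, field


@dataclass
class ArticleItem:
    url: str
    body: str = ""                  # scraped text from the page
    title: str = ""
    lang: str = ""
    author: str = ""
    description: str = ""
    kind: str = "article"
    site_name: str = ""
    acquired_time: str = ""
    published_time: str = ""
    modified_time: str = ""
    item_type: str = "ArticleItem"
    character_count: int = 0
    sources: list = field(default_factory=list)


item = ArticleItem(
    url="https://example.org/articles/abc",
    body="<main>Hello</main>",
    title="A title",
)
```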
There are several spiders for different purposes. They differ in two respects:

- What they collect
- How they collect the data
Most spiders collect `ArticleItem`, a text article from web pages.
Others collect files, such as PDF files.
For example:
- `read-more` spider collects articles from pages, ignoring the landing pages commonly found on news websites.
- `directory` spider collects `ArticleItem` but only scrapes the text of pages under a specific URL path.
- `sitemap` spider collects `ArticleItem` by following the links in `sitemap.xml`.
- `file-download` spider collects files linked in web pages, such as PDF files.
The collected data, or items, are passed to item pipelines for further processing and are eventually saved, usually in JSONL format.
The result is a structured JSONL file that can be fed into a database.
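As a sketch of that last step, a JSONL file can be loaded into a database line by line. The file name and table layout below are illustrative assumptions, not part of the project:

```python
# Sketch: load a spider's JSONL output into SQLite.
# One JSON object per line; the table schema here is illustrative.
import json
import sqlite3


def load_jsonl_into_sqlite(jsonl_path: str, db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles"
        " (url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    count = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            conn.execute(
                "INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
                (item["url"], item.get("title"), item.get("body")),
            )
            count += 1
    conn.commit()
    conn.close()
    return count
```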
The spiders reside under `generic/spiders`.
Name#
A spider has three names:
- Human-friendly name, e.g., `read-more`
- File name, e.g., `read_more.py` (under the `generic/spiders` directory)
- Python class name, e.g., `ReadMoreSpider`
When running a spider from the command line, use the human-friendly name. It is defined as the `name` class variable:
# generic/spiders/read_more.py
class ReadMoreSpider(GenericSpider[ReadMoreSpiderConfig], ReadMoreMixin):
# ...
name = "read-more"
The `scrapy list` command displays all available spiders.
> uv run scrapy list
directory
feed
file-download
read-more
sitemap
...
Arguments#
Spiders accept arguments, which are passed with the `-a` option.
uv run scrapy crawl -a "arg1=value1" -a "arg2=value2" $SPIDER_NAME
Caution
Always quote the arguments.
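Quoting matters because characters that commonly appear in URLs and argument values, such as `&`, `?`, and spaces, are interpreted by the shell. A hypothetical example (the URL is illustrative):

```shell
# Unquoted, the shell would background the command at "&" and try to
# run "lang=ja" as a separate command. Quoting keeps the whole
# key=value pair intact as a single argument.
uv run scrapy crawl -a "urls=http://example.org/page?id=1&lang=ja" read-more
```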
Common arguments#
All spiders require one mandatory argument, `urls`, which tells the spider where to start crawling.
Its value is a comma-separated list of URLs.
uv run scrapy crawl -a "urls=http://example.org/,http://example.net/" read-more
A spider applies the same arguments to all of its start URLs. If different URLs need different arguments, run the spider multiple times with different arguments:
uv run scrapy crawl -a "urls=http://example.org/" -a "arg=value1" read-more
uv run scrapy crawl -a "urls=http://example.net/" -a "arg=value2" read-more
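Under the hood, Scrapy passes each `-a key=value` pair as a keyword argument to the spider's constructor. A minimal sketch of how a spider might receive and split `urls` (illustrative; the project's real spiders subclass a `GenericSpider` base class built on `scrapy.Spider`):

```python
# Sketch: each -a "key=value" arrives as a constructor keyword argument.
# Splitting "urls" on commas mirrors the behavior described above.
class DemoSpider:
    name = "demo"

    def __init__(self, urls: str = "", **kwargs):
        # "urls" is mandatory and comma-separated
        self.start_urls = [u for u in urls.split(",") if u]


spider = DemoSpider(urls="http://example.org/,http://example.net/")
```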
Output options#
Spiders that collect `ArticleItem` can export the items in various formats; JSONL is the recommended one.
The `-O` and `-o` options specify the output file name and the format.
`-O` overwrites the specified file with the collected items, while `-o` appends new items to the file.
# -O overwrites the file, deleting any old items in it
uv run scrapy crawl -a "urls=http://example.org/" -O items.jsonl read-more
# -o appends new items, preserving the existing items in the file
uv run scrapy crawl -a "urls=http://example.org/" -o items.jsonl read-more
# scrapy supports CSV format, too
uv run scrapy crawl -a "urls=http://example.org/" -o items.csv read-more
Spiders that collect files require the `output_dir` argument.
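For example, a `file-download` run might look like this (the `output_dir` value is illustrative):

```shell
# Downloaded files are saved under the directory given by output_dir
# (directory name "downloads" is an example, not a project default)
uv run scrapy crawl -a "urls=http://example.org/" -a "output_dir=downloads" file-download
```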