RSS Reader Overview#
rss_reader.py is a simple RSS reader for spiders.
When feeds have new entries, it crawls the new entries and saves the collected ArticleItem objects to a file.
Usage#
uv run bin/rss_reader.py --help
usage: rss_reader.py [-h] [-c CONFIG] [-i INTERVAL] [-d DATABASE] [-o OUTPUT] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Simple RSS reader for rokujo-collector-scrapy

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Path to RSS feed configuration file
  -i INTERVAL, --interval INTERVAL
                        Update interval in minutes
  -d DATABASE, --database DATABASE
                        Path to RSS feed database
  -o OUTPUT, --output OUTPUT
                        Path to output JSONL file
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level, one of choices in upper case or lower case
Configuration file, rss.yml#
The main configuration file is a YAML file, rss.yml. Use the --config
option to specify its path. The default is rss.yml in the current working
directory.
---
feed_urls:
  - "https://news.example.org/rss/categories/it.xml"
  - "https://news.example.org/rss/categories/domestic.xml"
rules:
  - name: Example News
    url_pattern: "://news\\.example\\.org/(pickup|articles)/"
    spider_name: read-more
    args:
      - "read_more=記事全文を読む"
      - "read_next=次へ"
feed_urls#
This is a list of RSS feed URLs.
The RSS reader polls the RSS feeds and crawls new entries in the feeds.
An item in the list can also be the path to a local RSS feed file.
To create a local RSS feed file, use FeedSpider.
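Because an item in feed_urls may be either a URL or a local path, the reader has to tell the two apart. One plausible way to do so, shown here as an assumption rather than as the reader's actual logic, is to check the URL scheme. The helper name `is_remote_feed` and the local path `feeds/local.xml` are hypothetical.

```python
from urllib.parse import urlparse


def is_remote_feed(feed: str) -> bool:
    """Return True for http(s) URLs; anything else is treated as a local path.

    Assumption: only the http and https schemes denote remote feeds.
    """
    return urlparse(feed).scheme in ("http", "https")


feeds = [
    "https://news.example.org/rss/categories/it.xml",
    "feeds/local.xml",  # hypothetical local feed file
]

# Split the configured feeds into remote URLs and local files.
remote = [f for f in feeds if is_remote_feed(f)]
local = [f for f in feeds if not is_remote_feed(f)]
```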
rules#
This is a list of rules for URLs.
When the URL of an unread entry matches url_pattern, the RSS reader runs spider_name with args.
With rules, one can run different spiders with different arguments.
| Attribute | Optional or Required | Description |
|---|---|---|
| name | Optional | Name of the feed |
| url_pattern | Required | A regular expression to match the URL |
| spider_name | Required | The name of the spider to run for the matched URLs |
| args | Required | A list of arguments to pass to the spider |
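The rule lookup described above can be sketched as follows. This is a minimal illustration, not the reader's actual implementation: the `Rule` dataclass and the `find_rule` function are hypothetical names, and the sketch assumes that url_pattern is applied as an unanchored search (`re.search`) and that the first matching rule wins. The example pattern starts with `://`, which suggests unanchored matching, but the source does not confirm it.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    """Hypothetical in-memory form of one item under `rules` in rss.yml."""

    url_pattern: str
    spider_name: str
    args: list[str] = field(default_factory=list)
    name: Optional[str] = None


def find_rule(url: str, rules: list[Rule]) -> Optional[Rule]:
    """Return the first rule whose url_pattern is found in url, else None.

    Assumptions: unanchored matching via re.search, first match wins.
    """
    for rule in rules:
        if re.search(rule.url_pattern, url):
            return rule
    return None


rules = [
    Rule(
        name="Example News",
        url_pattern=r"://news\.example\.org/(pickup|articles)/",
        spider_name="read-more",
        args=["read_more=記事全文を読む", "read_next=次へ"],
    )
]

# An article URL matches the rule; an unrelated host does not.
matched = find_rule("https://news.example.org/articles/12345", rules)
unmatched = find_rule("https://blog.example.org/posts/1", rules)
```

With more than one rule in the list, the same lookup selects a different spider and argument list per URL, which is how one configuration can drive several spiders.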
Interval, --interval#
The interval, in minutes, at which the RSS reader polls the feeds. The default is 15 minutes.
Database, --database#
Path to the RSS database file. When the database file does not exist, the RSS reader creates one. The database keeps track of feeds, entries, and their status.
Log Level, --loglevel#
The log level of the RSS reader and spiders. The default is info.
Output file, --output#
The output file in which the collected ArticleItem objects are saved.
The default is rss.jsonl.
The RSS reader appends the Unix time to the stem of the specified file name, so the actual file name is, for instance, rss-1769820942.jsonl.
The output file is created atomically, so it is safe to assume that the file is always ready for further processing.
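The naming scheme and the atomic creation can be illustrated with a short sketch. It assumes that the timestamp is placed between the file stem and its extension, as in the rss-1769820942.jsonl example, and that atomicity is obtained by writing to a temporary file in the same directory and then renaming it. The RSS reader may achieve this differently. The function names `timestamped_path` and `write_jsonl_atomically` are hypothetical, and plain dictionaries stand in for ArticleItem.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Iterable, Optional


def timestamped_path(output: str, now: Optional[int] = None) -> Path:
    """Turn e.g. rss.jsonl into rss-<unix time>.jsonl (assumed naming scheme)."""
    path = Path(output)
    stamp = int(time.time()) if now is None else now
    return path.with_name(f"{path.stem}-{stamp}{path.suffix}")


def write_jsonl_atomically(
    items: Iterable[dict], output: str, now: Optional[int] = None
) -> Path:
    """Write items as JSON Lines so the final file appears only when complete."""
    final_path = timestamped_path(output, now)
    # Create the temporary file next to the target so that the rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(final_path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for item in items:
                tmp.write(json.dumps(item, ensure_ascii=False) + "\n")
        # os.replace renames in one step, so consumers never see a partial file.
        os.replace(tmp_name, final_path)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        os.unlink(tmp_name)
        raise
    return final_path
```

A downstream job that picks up files matching `rss-*.jsonl` would, under these assumptions, only ever see complete files, because the temporary file carries a different name until the rename.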