RSS Reader Overview#

rss_reader.py is a simple RSS reader for spiders. When a feed has new entries, it crawls them and saves the collected ArticleItem objects to a file.

Usage#

uv run bin/rss_reader.py --help

usage: rss_reader.py [-h] [-c CONFIG] [-i INTERVAL] [-d DATABASE] [-o OUTPUT] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Simple RSS reader for rokujo-collector-scrapy

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Path to RSS feed configuration file
  -i INTERVAL, --interval INTERVAL
                        Update interval in minutes
  -d DATABASE, --database DATABASE
                        Path to RSS feed database
  -o OUTPUT, --output OUTPUT
                        Path to output JSONL file
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level, one of choices in upper case or lower case

Configuration file, `rss.yml`#

The main configuration file is a YAML file, rss.yml. Use the --config option to specify its path. The default is rss.yml in the current working directory.

---
feed_urls:
  - "https://news.example.org/rss/categories/it.xml"
  - "https://news.example.org/rss/categories/domestic.xml"

rules:
  - name: Example News
    url_pattern: "://news\\.example\\.org/(pickup|articles)/"
    spider_name: read-more
    args:
      - "read_more=記事全文を読む"
      - "read_next=次へ"

`feed_urls`#

This is a list of RSS feed URLs. The RSS reader polls the feeds and crawls the new entries in them. An item can also be a path to a local RSS feed file. To create a local RSS feed file, use FeedSpider.
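Because feed_urls can mix remote URLs and local file paths, a reader has to tell the two apart before fetching. The following is a minimal sketch of one way to do that, not the actual logic of rss_reader.py. The function name is_remote_feed and the local path feeds/local.xml are hypothetical.

```python
from urllib.parse import urlparse


def is_remote_feed(source: str) -> bool:
    """Return True for http(s) URLs, False for anything treated as a local path.

    Assumption: only http and https count as remote; the real reader may
    accept other schemes.
    """
    return urlparse(source).scheme in ("http", "https")


feed_urls = [
    "https://news.example.org/rss/categories/it.xml",
    "feeds/local.xml",  # hypothetical local feed file
]

for source in feed_urls:
    kind = "remote" if is_remote_feed(source) else "local"
    print(f"{kind}: {source}")
```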

`rules`#

This is a list of rules for URLs. When the URL of an unread entry matches url_pattern, the RSS reader runs spider_name with args. Rules make it possible to run different spiders, with different arguments, for different URLs.

Attribute	Optional or Required	Description
`name`	Optional	Name of the feed
`url_pattern`	Required	A regular expression to match the URL
`spider_name`	Required	The name of spider to run for the matched URLs
`args`	Required	A list of arguments to pass to the spider
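The rule lookup described above can be sketched as follows. This is an illustration under assumptions, not the reader's own code: RULES stands in for the parsed rules list, find_rule is a hypothetical helper, and both the first-match-wins order and the use of re.search are guesses. A search rather than an anchored match seems likely because the example pattern starts with "://", which never appears at the start of a URL.

```python
import re
from typing import Optional

# Hypothetical in-memory form of the `rules` list from rss.yml.
RULES = [
    {
        "name": "Example News",
        "url_pattern": r"://news\.example\.org/(pickup|articles)/",
        "spider_name": "read-more",
        "args": ["read_more=記事全文を読む", "read_next=次へ"],
    },
]


def find_rule(url: str, rules: list) -> Optional[dict]:
    """Return the first rule whose url_pattern is found in the URL, or None."""
    for rule in rules:
        # Assumption: the pattern may match anywhere in the URL (re.search).
        if re.search(rule["url_pattern"], url):
            return rule
    return None


rule = find_rule("https://news.example.org/articles/12345", RULES)
if rule is not None:
    print(rule["spider_name"], rule["args"])
```

A URL that matches no rule yields None here. How the real reader handles such entries is not stated in this page.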

Interval, `--interval`#

Interval at which the RSS feeds are polled, in minutes. The default is 15 minutes.

Database, `--database`#

Path to the RSS database file. When the database file does not exist, the RSS reader creates one. The database keeps track of feeds, entries, and their status.
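To make the role of the database concrete, here is a rough sketch of tracking entries and their read status. It assumes a SQLite file and a single hypothetical entries table. The page does not say which storage engine or schema rss_reader.py uses, so every name below is illustrative.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    """Open the database, creating the file and table when they do not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " url TEXT PRIMARY KEY,"
        " feed_url TEXT NOT NULL,"
        " is_read INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def add_entry(conn: sqlite3.Connection, feed_url: str, url: str) -> None:
    """Record an entry; an entry that is already known keeps its status."""
    conn.execute(
        "INSERT OR IGNORE INTO entries (url, feed_url) VALUES (?, ?)",
        (url, feed_url),
    )
    conn.commit()


def unread_urls(conn: sqlite3.Connection) -> list:
    """Return the URLs of entries that have not been crawled yet."""
    rows = conn.execute("SELECT url FROM entries WHERE is_read = 0 ORDER BY url")
    return [row[0] for row in rows]


def mark_read(conn: sqlite3.Connection, url: str) -> None:
    """Flag an entry as crawled so it is not picked up again."""
    conn.execute("UPDATE entries SET is_read = 1 WHERE url = ?", (url,))
    conn.commit()
```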

Log Level, `--loglevel`#

The log level of the RSS reader and spiders. The default is info.
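The help text says the level may be given in upper or lower case. One common way to get that behavior with argparse is to normalize the value before it is checked against the choices. The sketch below shows the idea. It is an assumption about how such an option could be built, not a copy of the script's parser.

```python
import argparse
import logging

LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rss_reader.py")
    parser.add_argument(
        "-l",
        "--loglevel",
        type=str.upper,  # normalize first, so "debug" and "DEBUG" both pass
        choices=LEVELS,
        default="INFO",
        help="Log level, one of choices in upper case or lower case",
    )
    return parser


args = build_parser().parse_args(["--loglevel", "debug"])
# Map the level name to the numeric constant used by the logging module.
logging.basicConfig(level=getattr(logging, args.loglevel))
```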

Output file, `--output`#

The output file in which to save the collected ArticleItem objects. The default is rss.jsonl. The RSS reader appends the Unix time to the base name of the specified file, so the actual file name is, for instance, rss-1769820942.jsonl. The output file is created atomically, so it is safe to assume that any output file that exists is complete and ready for further processing.
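The naming scheme and the atomic creation can be illustrated with a short sketch. It assumes the usual technique of writing to a temporary file in the same directory and then renaming it with os.replace. The page does not say how rss_reader.py achieves atomicity, and the helper names timestamped_path and write_jsonl_atomically are hypothetical.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Iterable, Optional


def timestamped_path(output: str, now: Optional[int] = None) -> Path:
    """Insert the Unix time before the extension: rss.jsonl -> rss-<time>.jsonl."""
    path = Path(output)
    stamp = int(time.time()) if now is None else now
    return path.with_name(f"{path.stem}-{stamp}{path.suffix}")


def write_jsonl_atomically(
    items: Iterable[dict], output: str, now: Optional[int] = None
) -> Path:
    """Write items as JSON Lines so the final file appears only when complete."""
    final_path = timestamped_path(output, now)
    # The temporary file lives in the target directory so that the rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(final_path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for item in items:
                tmp.write(json.dumps(item, ensure_ascii=False) + "\n")
        os.replace(tmp_name, final_path)  # atomic rename
    except BaseException:
        os.unlink(tmp_name)  # remove the partial temporary file
        raise
    return final_path
```

With this approach, a consumer watching the output directory either sees no rss-*.jsonl file or sees a complete one, never a partially written one.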
