generic.items

Module Contents

Classes

FileItem

ArticleItem

Represents an article.

FeedItem

API

class generic.items.FileItem(*args: Any, **kwargs: Any)

Bases: scrapy.Item

acquired_time = 'Field(...)'

The time when the webpage was acquired.

content = 'Field(...)'

The file content as bytes.

filename = 'Field(...)'
url = 'Field(...)'
metadata = 'Field(...)'
output_dir = 'Field(...)'
__repr__()

Suppresses binary data in log output. content is often binary data, which is unnecessary in logs.
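The following is a minimal sketch of this kind of `__repr__` override, not the module's actual implementation. It assumes a plain `dict` subclass as a stand-in for `scrapy.Item` (which is also a mapping); the class name `FileItemSketch` and the placeholder format are hypothetical.

```python
class FileItemSketch(dict):
    """Hypothetical stand-in for a scrapy.Item subclass that holds file data."""

    def __repr__(self) -> str:
        # Work on a copy so the stored content itself is left untouched.
        shown = dict(self)
        content = shown.get("content")
        if isinstance(content, (bytes, bytearray)):
            # Replace the payload with a short placeholder that only gives its size.
            shown["content"] = f"<{len(content)} bytes>"
        return f"{type(self).__name__}({shown!r})"


item = FileItemSketch(
    filename="a.pdf",
    url="https://example.org/a.pdf",
    content=b"\x00" * 2048,
)
print(repr(item))
```

Because logging frameworks usually format items through `repr()`, overriding it in one place keeps large payloads out of every log line without changing the data that is passed on to pipelines.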

class generic.items.ArticleItem

Represents an article. A JSON representation of an ArticleItem looks like the following:

{
  "acquired_time": "2026-01-29T03:18:20.300844+00:00",
  "body": "<main> ... </main>",
  "url": "https://example.org/articles/d74c1662a8cd4ba0146d7f334c3058685320f611",
  "lang": "ja",
  "author": "Someone",
  "description": "A description ... ",
  "kind": "article",
  "modified_time": "2026-01-29T11:05:09+09:00",
  "published_time": "2026-01-29T11:05:09+09:00",
  "site_name": "Foo website",
  "title": "A title ...",
  "item_type": "ArticleItem",
  "character_count": 42,
  "sources": []
}
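As a rough illustration of consuming such a record, the sketch below parses an abbreviated copy of the JSON above with the standard library and converts the timestamp strings to `datetime` objects. How the module itself serializes or deserializes items is not shown here, so treat this as an assumption about one reasonable way to read the output.

```python
import json
from datetime import datetime

# Abbreviated copy of the example record; only a few fields are kept.
raw = """
{
  "acquired_time": "2026-01-29T03:18:20.300844+00:00",
  "published_time": "2026-01-29T11:05:09+09:00",
  "lang": "ja",
  "character_count": 42,
  "sources": []
}
"""

record = json.loads(raw)

# Both timestamps carry a UTC offset, so fromisoformat returns aware datetimes.
acquired = datetime.fromisoformat(record["acquired_time"])
published = datetime.fromisoformat(record["published_time"])

print(acquired.utcoffset(), published.utcoffset())
```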
acquired_time: datetime.datetime = None

The time when the webpage was acquired.

body: str = None

The main content of the webpage.

url: str = None

The URL of the webpage.

lang: str = None

The two-letter language code of the article. When the language is undetermined, “und” is returned.

See also: ISO 639-1:2002; Part 1: Alpha-2 code (JIS X 0412-1:2004)

author: Optional[str] = None

The author of the webpage.

description: Optional[str] = None

A brief description of the webpage.

kind: Optional[str] = None

The type or category of the webpage.

modified_time: Optional[str] = None

The time when the webpage was last modified.

published_time: Optional[str] = None

The time when the webpage was published.

site_name: Optional[str] = None

The name of the website.

title: Optional[str] = None

The title of the webpage.

item_type: str = 'field(...)'

The class name of the item. Automatically set in __post_init__.
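A minimal sketch of how a field like this can be filled in automatically, assuming the class is a dataclass (which the `field(...)` defaults and `__post_init__` suggest). The names `ArticleSketch` and `NewsSketch` are hypothetical and the field set is reduced for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArticleSketch:
    """Hypothetical, reduced stand-in for ArticleItem."""

    url: Optional[str] = None
    title: Optional[str] = None
    # Excluded from __init__ so callers cannot set it by hand.
    item_type: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # type(self) resolves to the subclass when a subclass is instantiated.
        self.item_type = type(self).__name__


@dataclass
class NewsSketch(ArticleSketch):
    """Hypothetical subclass, to show that the name follows the subclass."""


print(ArticleSketch(url="https://example.org/a").item_type)
print(NewsSketch(url="https://example.org/b").item_type)
```

Storing the class name on the instance means a serialized record stays self-describing, which helps when several item types end up in the same output.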

character_count: int = 0

The number of characters in the article.

sources: List[Self] = 'field(...)'

A list of sources.

sentences: List[str] = 'field(...)'

The sentences of the article.

tokens: List[str] = 'field(...)'

The tokens of the article.

uuid: str = 'field(...)'

The UUID of the article.

__post_init__()
static get_json_ld(res: scrapy.http.Response) -> Dict[str, Any]

Extracts and parses JSON-LD from the response.
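To illustrate the general idea, here is a standard-library sketch that pulls JSON-LD out of an HTML string. It is not the actual `get_json_ld` implementation: the real method takes a `scrapy.http.Response`, whereas the hypothetical `extract_json_ld` below takes raw HTML, and its choices (return the first object that parses, return `{}` otherwise) are assumptions.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._buffer: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def extract_json_ld(html: str) -> Dict[str, Any]:
    """Return the first JSON-LD object found in the page, or {} if none parses."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks instead of failing the whole page.
        if isinstance(data, dict):
            return data
    return {}


page = """<html><head>
<script type="application/ld+json">{"@type": "NewsArticle", "headline": "A title"}</script>
</head><body></body></html>"""
print(extract_json_ld(page))
```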

classmethod from_response(res: scrapy.http.Response) -> Self

Create an ArticleItem from a scrapy.http.Response.

Args:

res: scrapy.http.Response.

lang: The language of the Response. When None, the language is guessed from the content.

Returns:

Self: An instance of ArticleItem or its subclass.

class generic.items.FeedItem
file_name: str = None

The file name of the feed.

url: str = None

The URL of the page from which the feed was generated.

content: str = None

The feed XML.

generated_at: str = 'field(...)'

The time at which the feed was generated, in ISO 8601 format.
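A minimal sketch of a per-instance timestamp default like this one, assuming a dataclass with a `default_factory`. The class name `FeedSketch`, the helper `_now_iso`, and the use of UTC are hypothetical; the source does not say which time zone the module uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now_iso() -> str:
    """Current UTC time as an ISO 8601 string (hypothetical helper)."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class FeedSketch:
    """Hypothetical stand-in for FeedItem."""

    file_name: Optional[str] = None
    url: Optional[str] = None
    content: Optional[str] = None
    # default_factory runs once per instance, so each item gets its own timestamp.
    generated_at: str = field(default_factory=_now_iso)


feed = FeedSketch(file_name="feed.xml", url="https://example.org/", content="<rss/>")
print(feed.generated_at)
```

A plain default such as `generated_at: str = _now_iso()` would be evaluated once at class definition time, so every item would share the same timestamp; the factory avoids that.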