`generic.items`#

Module Contents#

Classes#

`FileItem`
`ArticleItem`	Represents an article.
`FeedItem`

API#

class generic.items.FileItem(*args: Any, **kwargs: Any)#

Bases: scrapy.Item

acquired_time = 'Field(...)'#: The time when the webpage was acquired.

content = 'Field(...)'#: The file content as bytes.

filename = 'Field(...)'#

url = 'Field(...)'#

metadata = 'Field(...)'#

output_dir = 'Field(...)'#

__repr__()#: Suppress binary data in log output. The content field often holds binary data, which is unnecessary in logs.
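The idea behind this `__repr__` can be sketched as follows. This is a minimal illustration, not the actual FileItem implementation: `FileItemSketch` is a hypothetical dict-based stand-in (so the example runs without Scrapy), and the `<N bytes>` placeholder format is an assumption.

```python
class FileItemSketch(dict):
    """Hypothetical dict-based stand-in for an item with a binary ``content`` field."""

    def __repr__(self) -> str:
        # Work on a copy so the stored data is left untouched.
        shown = dict(self)
        content = shown.get("content")
        if isinstance(content, (bytes, bytearray)):
            # Replace the payload with a short size summary (assumed format).
            shown["content"] = f"<{len(content)} bytes>"
        return f"{type(self).__name__}({shown!r})"


item = FileItemSketch(filename="report.pdf", content=b"\x00" * 2048)
print(repr(item))
```

Only the representation changes; `item["content"]` still returns the full payload.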

class generic.items.ArticleItem#

Represents an article. A JSON representation of an ArticleItem looks like the following:

{
  "acquired_time": "2026-01-29T03:18:20.300844+00:00",
  "body": "<main> ... </main>",
  "url": "https://example.org/articles/d74c1662a8cd4ba0146d7f334c3058685320f611",
  "lang": "ja",
  "author": "Someone",
  "description": "A description ... ",
  "kind": "article",
  "modified_time": "2026-01-29T11:05:09+09:00",
  "published_time": "2026-01-29T11:05:09+09:00",
  "site_name": "Foo website",
  "title": "A title ...",
  "item_type": "ArticleItem",
  "character_count": 42,
  "sources": []
}
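
A record in this shape can be read back with the standard library alone. The sketch below assumes a trimmed copy of the sample above and parses the timestamps with `datetime.fromisoformat`; it does not use ArticleItem itself.

```python
import json
from datetime import datetime

# Trimmed copy of the sample record above.
raw = """
{
  "acquired_time": "2026-01-29T03:18:20.300844+00:00",
  "url": "https://example.org/articles/d74c1662a8cd4ba0146d7f334c3058685320f611",
  "lang": "ja",
  "published_time": "2026-01-29T11:05:09+09:00",
  "item_type": "ArticleItem",
  "character_count": 42,
  "sources": []
}
"""

record = json.loads(raw)

# Both timestamps carry a UTC offset, so the parsed values are timezone-aware
# and can be compared even though the offsets differ.
acquired = datetime.fromisoformat(record["acquired_time"])
published = datetime.fromisoformat(record["published_time"])

print(record["item_type"], record["lang"], record["character_count"])
print("acquired after publication:", acquired > published)
```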

acquired_time: datetime.datetime = None#: The time when the webpage was acquired.

body: str = None#: The main content of the webpage.

url: str = None#: The URL of the webpage.

lang: str = None#

The two-letter language code of the article. When the language cannot be determined, the value is “und”.

See also: ISO 639-1:2002; Part 1: Alpha-2 code (JIS X 0412-1:2004)

author: Optional[str] = None#: The author of the webpage.

description: Optional[str] = None#: A brief description of the webpage.

kind: Optional[str] = None#: The type or category of the webpage.

modified_time: Optional[str] = None#: The time when the webpage was last modified.

published_time: Optional[str] = None#: The time when the webpage was published.

site_name: Optional[str] = None#: The name of the website.

title: Optional[str] = None#: The title of the webpage.

item_type: str = 'field(...)'#: The class name of the item. Automatically set in __post_init__.

character_count: int = 0#: The number of characters in the article.

sources: List[Self] = 'field(...)'#: A list of sources.

sentences: List[str] = 'field(...)'#: Sentences

tokens: List[str] = 'field(...)'#: Tokens

uuid: str = 'field(...)'#: UUID of the article

__post_init__()#
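
As the description of item_type notes, that field is set automatically in `__post_init__`. A minimal sketch of the pattern, assuming a plain dataclass: `ArticleSketch` and `NewsSketch` are hypothetical names, and the real ArticleItem may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArticleSketch:
    """Hypothetical stand-in for a dataclass-style item."""

    url: Optional[str] = None
    body: Optional[str] = None
    # Excluded from __init__; the value is filled in by __post_init__.
    item_type: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # type(self) is the concrete class, so subclasses record their own name.
        self.item_type = type(self).__name__


@dataclass
class NewsSketch(ArticleSketch):
    """Hypothetical subclass with one extra field."""

    section: Optional[str] = None


print(ArticleSketch(url="https://example.org/a").item_type)
print(NewsSketch(section="world").item_type)
```

Recording the class name this way lets a consumer of the serialized JSON tell item kinds apart without importing the classes.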

static get_json_ld(res: scrapy.http.Response) → Dict[str, Any]#: Extracts and parses JSON-LD from the response.
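
For orientation, JSON-LD is usually embedded in a page as a `<script type="application/ld+json">` element. The sketch below shows one way to extract and parse it with the standard library only. It is not the implementation of get_json_ld: it takes a plain HTML string instead of a scrapy.http.Response, and `JsonLdCollector` and `extract_json_ld` are hypothetical names.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._chunks: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._chunks))
            self._in_json_ld = False


def extract_json_ld(html: str) -> Dict[str, Any]:
    """Return the first JSON-LD object found in the page, or {} if there is none."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks rather than failing the whole page.
        if isinstance(data, dict):
            return data
    return {}


page = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "A title", "inLanguage": "ja"}
</script>
</head><body><main>...</main></body></html>
"""
print(extract_json_ld(page))
```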

classmethod from_response(res: scrapy.http.Response) → Self#

Create an ArticleItem from a scrapy.http.Response.

Args:
    res: scrapy.http.Response.
    lang: The language of the Response. When None, the language is guessed from the content.

Returns:
    Self: An instance of ArticleItem or its subclass.

class generic.items.FeedItem#

file_name: str = None#: The file name of the feed.

url: str = None#: The URL of the page from which the feed was generated.

content: str = None#: The feed XML.

generated_at: str = 'field(...)'#: The time at which the feed was generated, in ISO 8601 format.
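
A timestamp in this format can be produced with the standard library. The sketch below assumes UTC, which the source does not specify; the default used by FeedItem may differ.

```python
from datetime import datetime, timezone

# Timezone-aware "now", serialized as an ISO 8601 string.
generated_at = datetime.now(timezone.utc).isoformat()

# The string round-trips back to an aware datetime.
parsed = datetime.fromisoformat(generated_at)

print(generated_at)
```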
