generic.items
Module Contents
Classes
- FileItem
- ArticleItem: Represents an article.
API
- class generic.items.FileItem(*args: Any, **kwargs: Any)
Bases: scrapy.Item
- acquired_time = 'Field(...)'
The time when the webpage was acquired.
- content = 'Field(...)'
The file content as bytes.
- filename = 'Field(...)'
- url = 'Field(...)'
- metadata = 'Field(...)'
- output_dir = 'Field(...)'
- __repr__()
Suppress binary data in logs. content is often binary data and is unnecessary in log output.
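As a minimal sketch of the idea behind __repr__, the snippet below uses a hypothetical FileRecord class built on a plain dict as a stand-in for the real scrapy.Item subclass, whose actual implementation is not shown here. It swaps the content value for a short placeholder when the object is printed, while leaving the stored bytes untouched.

```python
class FileRecord(dict):
    """Hypothetical dict-based stand-in for FileItem (not the real class)."""

    def __repr__(self) -> str:
        # Work on a copy so the stored bytes are left untouched.
        shown = dict(self)
        content = shown.get("content")
        if isinstance(content, (bytes, bytearray)):
            # Show only the size instead of the raw binary payload.
            shown["content"] = f"<{len(content)} bytes>"
        return f"{type(self).__name__}({shown!r})"


record = FileRecord(
    filename="a.pdf",
    url="https://example.org/a.pdf",
    content=b"\x00" * 2048,
)
print(record)  # the 2048 raw bytes are replaced by a size placeholder
```

Only the printed form changes here; record["content"] still returns the original bytes.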
- class generic.items.ArticleItem
Represents an article. A JSON representation of an ArticleItem looks like the following:
{
  "acquired_time": "2026-01-29T03:18:20.300844+00:00",
  "body": "<main> ... </main>",
  "url": "https://example.org/articles/d74c1662a8cd4ba0146d7f334c3058685320f611",
  "lang": "ja",
  "author": "Someone",
  "description": "A description ... ",
  "kind": "article",
  "modified_time": "2026-01-29T11:05:09+09:00",
  "published_time": "2026-01-29T11:05:09+09:00",
  "site_name": "Foo website",
  "title": "A title ...",
  "item_type": "ArticleItem",
  "character_count": 42,
  "sources": []
}
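A minimal sketch of how a representation like the one above could be produced and read back, assuming a hypothetical ArticleSketch dataclass that carries only a few of the documented fields (the real ArticleItem implementation is not shown here). Setting item_type in __post_init__ mirrors the behaviour documented for that field; serializing with dataclasses.asdict and json.dumps is an assumption, not necessarily what the library does.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ArticleSketch:
    """Hypothetical, reduced stand-in for ArticleItem (illustration only)."""

    acquired_time: datetime
    body: str
    url: str
    lang: str = "und"
    title: Optional[str] = None
    character_count: int = 0
    sources: List["ArticleSketch"] = field(default_factory=list)
    # Excluded from __init__; filled in by __post_init__ below.
    item_type: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # Record the concrete class name, so subclasses report their own name.
        self.item_type = type(self).__name__


def to_json(item: ArticleSketch) -> str:
    """Serialize the item, writing datetimes as ISO 8601 strings."""

    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(asdict(item), default=encode, ensure_ascii=False)


item = ArticleSketch(
    acquired_time=datetime(2026, 1, 29, 3, 18, 20, 300844, tzinfo=timezone.utc),
    body="<main> ... </main>",
    url="https://example.org/articles/d74c1662a8cd4ba0146d7f334c3058685320f611",
    lang="ja",
    title="A title ...",
    character_count=42,
)
decoded = json.loads(to_json(item))
print(decoded["item_type"], decoded["acquired_time"])
```

Because __post_init__ uses type(self).__name__, a subclass of the sketch would report its own class name in item_type without further code.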
- acquired_time: datetime.datetime = None
The time when the webpage was acquired.
- body: str = None
The main content of the webpage.
- url: str = None
The URL of the webpage.
- lang: str = None
The two-letter language code of the article. When the language is undetermined, “und” is returned.
See also: ISO 639-1:2002; Part 1: Alpha-2 code (JIS X 0412-1:2004)
- author: Optional[str] = None
The author of the webpage.
- description: Optional[str] = None
A brief description of the webpage.
- kind: Optional[str] = None
The type or category of the webpage.
- modified_time: Optional[str] = None
The time when the webpage was last modified.
- published_time: Optional[str] = None
The time when the webpage was published.
- site_name: Optional[str] = None
The name of the website.
- title: Optional[str] = None
The title of the webpage.
- item_type: str = 'field(...)'
The class name of the item. Automatically set in __post_init__.
- character_count: int = 0
The number of characters in the article.
- sources: List[Self] = 'field(...)'
A list of sources.
- sentences: List[str] = 'field(...)'
The sentences of the article.
- tokens: List[str] = 'field(...)'
The tokens of the article.
- uuid: str = 'field(...)'
The UUID of the article.
- __post_init__()
- static get_json_ld(res: scrapy.http.Response) -> Dict[str, Any]
Extracts and parses JSON-LD from the response.
- classmethod from_response(res: scrapy.http.Response) -> Self
Create an ArticleItem from a scrapy.http.Response.
- Args:
res: scrapy.http.Response.
lang: The language of the Response. When None, the language is guessed from the content.
- Returns:
Self: An instance of ArticleItem or its subclass.
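For illustration, JSON-LD extraction of the kind get_json_ld is described as performing can be sketched with the standard library alone. The JsonLdParser class and extract_json_ld function below are hypothetical names that operate on a raw HTML string rather than a scrapy.http.Response; the actual method may well rely on Scrapy selectors and handle more edge cases.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class JsonLdParser(HTMLParser):
    """Collects the text of <script type="application/ld+json"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_json_ld(html: str) -> Dict[str, Any]:
    """Return the first JSON-LD object in the page, or {} if none parses."""
    parser = JsonLdParser()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks instead of failing the whole page.
        if isinstance(data, dict):
            return data
    return {}


page = """<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "A title ...", "inLanguage": "ja"}
</script>
</head><body><main> ... </main></body></html>"""
print(extract_json_ld(page))
```

Returning the first object that parses, and an empty dict when nothing does, are choices made for this sketch; a page may carry several JSON-LD blocks, and the library's own handling of that case is not documented here.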