generic.utils
Submodules
Package Contents
Functions
get_meta_property | Extracts the content of a meta property from a response.
extract_article | Extracts an article, or the relevant text in the Response, using trafilatura.
idn2ascii | Returns an ASCII URL from a URL containing an IDN.
get_metadata | Generates metadata from a Response.
count_xml_character | Counts the characters (not words) in an XML string, excluding spaces.
generate_hashed_filename | Generates a safe filename with a hashed prefix, keeping the file name human-friendly but sortable by URL relevance.
is_file_url | Returns whether the given URL points to a file rather than an HTML page.
get_url_without_fragment | Removes the fragment from a URL string.
API
- generic.utils.get_meta_property(response: scrapy.http.Response, name: str) -> str
Extracts the content of a meta property from a response.
- Args:
response: The response object.
name: Name of the property.
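As a rough illustration of what extracting a meta property involves, the following stdlib-only sketch looks up a `<meta property="...">` tag in a raw HTML string. It is not the implementation of `get_meta_property`, which takes a `scrapy.http.Response`. The helper name `find_meta_property` and the choice to return `None` when nothing matches are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class _MetaPropertyParser(HTMLParser):
    """Remembers the content of the first <meta> tag whose property matches."""

    def __init__(self, name: str) -> None:
        super().__init__()
        self.name = name
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Stop looking once a content value has been found.
        if tag != "meta" or self.content is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == self.name:
            self.content = attr_map.get("content")


def find_meta_property(html: str, name: str) -> Optional[str]:
    """Hypothetical helper: return the content of <meta property=name>, or None."""
    parser = _MetaPropertyParser(name)
    parser.feed(html)
    parser.close()
    return parser.content


page = '<head><meta property="og:title" content="Hello"></head>'
print(find_meta_property(page, "og:title"))  # Hello
```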
- generic.utils.extract_article(res: scrapy.http.Response) -> dict
Extracts an article, or the relevant text in the Response, using trafilatura.
Returns a dict containing various metadata extracted from the Response.
- Args:
res: The response object.
- generic.utils.idn2ascii(url_str: str) -> str
Returns an ASCII URL from a URL containing an IDN (internationalized domain name).
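A minimal sketch of this kind of conversion, using only the standard library, is shown here. It assumes the host is encoded with Python's built-in `idna` codec and that the rest of the URL is left untouched. The function name `idn_to_ascii_sketch` is hypothetical, and the actual `idn2ascii` may behave differently.

```python
from urllib.parse import urlsplit, urlunsplit


def idn_to_ascii_sketch(url_str: str) -> str:
    """Hypothetical helper: replace an internationalized host with its ASCII form."""
    parts = urlsplit(url_str)
    host = parts.hostname
    if host is None:
        return url_str

    # Python's built-in "idna" codec converts each label to its ASCII form.
    netloc = host.encode("idna").decode("ascii")

    # Re-attach the port and user info, which are not part of the host name.
    # IPv6 literals are not handled in this sketch.
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"

    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(idn_to_ascii_sketch("http://bücher.de/pfad?q=1"))
# http://xn--bcher-kva.de/pfad?q=1
```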
- generic.utils.get_uniform_metadata(html: str, base_url: str)
- generic.utils.str_to_isoformat(string: str)
- generic.utils.get_metadata(res: scrapy.http.Response) -> dict
Generate metadata from Response.
Returns: dict
- generic.utils.count_xml_character(xml_string: str) -> int
Counts the characters (not words) in an XML string, excluding spaces.
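One possible reading of this description is sketched below. It assumes that markup is not counted and that every whitespace character is excluded. The documentation does not confirm either point, and the name `count_text_characters` is made up for the example.

```python
import xml.etree.ElementTree as ET


def count_text_characters(xml_string: str) -> int:
    """Hypothetical helper: count non-whitespace characters in the text of an XML string."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of every element, without the tags themselves.
    text = "".join(root.itertext())
    return sum(1 for ch in text if not ch.isspace())


print(count_text_characters("<doc><p>Hello world</p><p>a b</p></doc>"))  # 12
```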
- generic.utils.generate_hashed_filename(url, domain_size: int = 8, url_size: int = 32, max_len: int = 255) -> str
Generates a safe filename with a hashed prefix, keeping the file name human-friendly but sortable by URL relevance.
Supports URL-encoded file names.
The generated file name is hashed by the domain and path of the URL.
The length of the generated file name is guaranteed to be less than or equal to max_len.
- Args:
domain_size: The size of domain hash characters.
url_size: The size of URL hash characters.
max_len: Max allowed bytes in file names. Defaults to 255.
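To make the description above more concrete, here is a hedged sketch of one way such a name could be built. The hash algorithm (SHA-256), the `_` separators, the `index` fallback, and the function name `hashed_filename_sketch` are all assumptions. The real `generate_hashed_filename` may use a different layout.

```python
import hashlib
from urllib.parse import unquote, urlsplit


def hashed_filename_sketch(url: str, domain_size: int = 8,
                           url_size: int = 32, max_len: int = 255) -> str:
    """Hypothetical helper: '<domain hash>_<domain+path hash>_<readable name>'."""
    parts = urlsplit(url)
    domain_hash = hashlib.sha256(parts.netloc.encode("utf-8")).hexdigest()[:domain_size]
    url_hash = hashlib.sha256(
        (parts.netloc + parts.path).encode("utf-8")
    ).hexdigest()[:url_size]

    # Decode a URL-encoded last path segment so the name stays readable.
    base = unquote(parts.path.rsplit("/", 1)[-1]) or "index"
    base = base.replace("/", "_")  # an encoded slash must not create a directory

    prefix = f"{domain_hash}_{url_hash}_"
    # Trim by bytes, dropping any multi-byte character that would be cut in half.
    # This assumes max_len is larger than the prefix itself.
    budget = max(max_len - len(prefix.encode("utf-8")), 0)
    base = base.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return prefix + base


name = hashed_filename_sketch("https://example.com/docs/%E3%83%86%E3%82%B9%E3%83%88.pdf")
print(name.endswith("テスト.pdf"))  # True
```

Because the domain hash comes first, files from the same domain sort next to each other in a directory listing.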
- generic.utils.is_path_matched(url: str, regexp: str) -> bool
- generic.utils.is_file_url(url: str, regexp: str = '(?:/|\\.html?|\\.php|\\.aspx?|/[^./]+)$') -> bool
Returns whether the given URL points to a file rather than an HTML page.
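The default `regexp` in the signature matches paths that look like HTML pages: a trailing slash, an `.html`, `.htm`, `.php`, `.asp`, or `.aspx` suffix, or a last segment with no extension. The sketch below assumes that a URL counts as a file when its path does not match this pattern. That negation is an inference from the signature, not something this page states, and `looks_like_file_url` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Default pattern copied from the is_file_url signature (written as a raw string).
PAGE_LIKE = r"(?:/|\.html?|\.php|\.aspx?|/[^./]+)$"


def looks_like_file_url(url: str, regexp: str = PAGE_LIKE) -> bool:
    """Hypothetical helper: True when the URL path does not look like an HTML page."""
    # Only the path is checked, and an empty path is treated as "/" (an assumption).
    path = urlsplit(url).path or "/"
    return re.search(regexp, path) is None


print(looks_like_file_url("https://example.com/report.pdf"))  # True
print(looks_like_file_url("https://example.com/index.html"))  # False
print(looks_like_file_url("https://example.com/about"))       # False
```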
- generic.utils.get_url_without_fragment(url_string: str) -> str
Removes fragment from a URL string.
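For comparison, the standard library can do the same thing with `urllib.parse.urldefrag`. The wrapper below is only an illustration of the behaviour described above, under the hypothetical name `strip_fragment`. Whether `get_url_without_fragment` is implemented this way is not stated here.

```python
from urllib.parse import urldefrag


def strip_fragment(url_string: str) -> str:
    """Hypothetical helper: return the URL with any '#fragment' removed."""
    # urldefrag returns a (url, fragment) named tuple.
    return urldefrag(url_string).url


print(strip_fragment("https://example.com/a?b=1#sec"))  # https://example.com/a?b=1
```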
- async generic.utils.analyze_text_with_spacy(client: httpx.AsyncClient, text: str, url: str)
- generic.utils.tokens_include_predicate(tokens)