generic.utils#

Submodules#

Package Contents#

Functions#

get_meta_property

Extracts the content of a meta property from a response.

extract_article

Extracts an article, or the relevant texts in the Response, with trafilatura.

idn2ascii

Returns an ASCII URL from a URL containing an IDN (internationalized domain name).

get_uniform_metadata

str_to_isoformat

get_metadata

Generate metadata from Response.

count_xml_character

Counts the characters (not words) in an XML string, excluding spaces.

generate_hashed_filename

Generate a safe filename with a hashed prefix, keeping the file name human-friendly but sortable by URL-relevance.

is_path_matched

is_file_url

Returns whether the given URL points to a file rather than an HTML page.

get_url_without_fragment

Removes fragment from a URL string.

analyze_text_with_spacy

tokens_include_predicate

API#

generic.utils.get_meta_property(response: scrapy.http.Response, name: str) → str#

Extracts the content of a meta property from a response.

Args:
  • response The response object.

  • name Name of the property.
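As an illustration of what extracting a meta property involves, here is a minimal sketch that uses only the standard library. It works on a raw HTML string rather than a scrapy.http.Response, and the names meta_property_sketch and _MetaPropertyParser are hypothetical, not part of generic.utils. It also assumes the property is stored in a `property` attribute (as in Open Graph tags) and returns None when nothing matches; the real function may behave differently.

```python
from html.parser import HTMLParser
from typing import Optional


class _MetaPropertyParser(HTMLParser):
    """Hypothetical helper: remembers the first matching <meta> content."""

    def __init__(self, name: str) -> None:
        super().__init__()
        self.name = name
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once a value has been found.
        if tag != "meta" or self.content is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == self.name:
            self.content = attr_map.get("content")


def meta_property_sketch(html: str, name: str) -> Optional[str]:
    """Return the content of <meta property=name>, or None if absent (assumed behavior)."""
    parser = _MetaPropertyParser(name)
    parser.feed(html)
    return parser.content


page = '<html><head><meta property="og:title" content="Hello"></head></html>'
title = meta_property_sketch(page, "og:title")
```

With a scrapy response, the same lookup would more likely be written as a selector query on the response itself.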

generic.utils.extract_article(res: scrapy.http.Response) → dict#

Extracts an article, or the relevant texts in the Response, with trafilatura.

Returns a dict. The dict has various metadata extracted from the Response.

Args:
  • res The response object

generic.utils.idn2ascii(url_str: str) → str#

Returns an ASCII URL from a URL containing an IDN (internationalized domain name).
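The conversion can be sketched with the standard library's `idna` codec. This is a minimal sketch under assumptions, not the actual implementation: the name idn_to_ascii_sketch is hypothetical, only the host name is converted (path and query are left untouched), and user info and IPv6 hosts are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def idn_to_ascii_sketch(url_str: str) -> str:
    """Hypothetical sketch: encode the host part of a URL as ASCII (Punycode)."""
    parts = urlsplit(url_str)
    if parts.hostname is None:
        # No network location, so there is nothing to convert.
        return url_str
    # The built-in "idna" codec converts each label of the host name.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


ascii_url = idn_to_ascii_sketch("http://bücher.example/path")
# ascii_url == "http://xn--bcher-kva.example/path"
```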

generic.utils.get_uniform_metadata(html: str, base_url: str)#
generic.utils.str_to_isoformat(string: str)#
generic.utils.get_metadata(res: scrapy.http.Response) → dict#

Generate metadata from Response.

Returns: dict

generic.utils.count_xml_character(xml_string: str) → int#

Counts the characters (not words) in an XML string, excluding spaces.
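One possible reading of this description can be sketched with xml.etree.ElementTree. The sketch assumes that only text content is counted (tags and attributes are ignored) and that "spaces" means any whitespace; neither assumption is confirmed by the documentation, and the name count_xml_characters_sketch is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_xml_characters_sketch(xml_string: str) -> int:
    """Hypothetical sketch: count non-whitespace characters in the text nodes."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment in document order;
    # "".join(fragment.split()) drops all whitespace from a fragment.
    return sum(len("".join(fragment.split())) for fragment in root.itertext())


count = count_xml_characters_sketch("<p>Hello <b>big world</b></p>")
# "Hello" (5) + "big" (3) + "world" (5) = 13
```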

generic.utils.generate_hashed_filename(url, domain_size: int = 8, url_size: int = 32, max_len: int = 255) → str#

Generate a safe filename with a hashed prefix, keeping the file name human-friendly but sortable by URL-relevance.

Supports URL-encoded file names.

The generated file name is hashed by domain and path of the URL.

The length of the generated file name is guaranteed to be less than or equal to max_len.

Args:
  • domain_size The size of domain hash characters.

  • url_size The size of URL hash characters.

  • max_len Max allowed bytes in file names. Defaults to 255.
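The scheme described above can be sketched as follows. Everything beyond the documented parameters is an assumption made for illustration: the use of SHA-256, the `_` separators, the `index` fallback name, the character replacement rule, and the function name hashed_filename_sketch are all hypothetical, and the real function may lay out the file name differently.

```python
import hashlib
import re
from urllib.parse import unquote, urlsplit


def hashed_filename_sketch(url: str, domain_size: int = 8,
                           url_size: int = 32, max_len: int = 255) -> str:
    """Hypothetical sketch: <domain hash>_<url hash>_<readable name>."""
    parts = urlsplit(url)
    # Files from the same domain share a prefix, so they sort together.
    domain_hash = hashlib.sha256(parts.netloc.encode("utf-8")).hexdigest()[:domain_size]
    url_hash = hashlib.sha256(
        (parts.netloc + parts.path).encode("utf-8")
    ).hexdigest()[:url_size]
    # Decode a URL-encoded last path segment and replace unsafe characters.
    name = unquote(parts.path.rsplit("/", 1)[-1]) or "index"
    name = re.sub(r"[^\w.\-]", "_", name)
    prefix = f"{domain_hash}_{url_hash}_"
    # The prefix is pure ASCII, so its character count equals its byte count.
    budget = max(max_len - len(prefix), 0)
    # Trim by bytes; errors="ignore" drops a multi-byte character cut in half.
    name = name.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return prefix + name


first = hashed_filename_sketch("https://example.com/docs/My%20File.pdf")
second = hashed_filename_sketch("https://example.com/other/page.html")
```

Here `first` ends with `_My_File.pdf`, and `first` and `second` share the same eight-character domain prefix. The sketch only honors max_len when the prefix itself fits within it.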

generic.utils.is_path_matched(url: str, regexp: str) → bool#
generic.utils.is_file_url(url: str, regexp: str = '(?:/|\\.html?|\\.php|\\.aspx?|/[^./]+)$') → bool#

Returns whether the given URL points to a file rather than an HTML page.
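The default regexp appears to describe page-like URLs (a trailing slash, .html, .php, .aspx, or a last segment without a dot), so a plausible reading is that a URL counts as a file when the pattern does not match. The sketch below encodes that reading; applying the pattern to the URL path only, and the name is_file_url_sketch, are assumptions.

```python
import re
from urllib.parse import urlsplit

# The default pattern from the signature, written as a raw string.
PAGE_LIKE = r"(?:/|\.html?|\.php|\.aspx?|/[^./]+)$"


def is_file_url_sketch(url: str, regexp: str = PAGE_LIKE) -> bool:
    """Hypothetical sketch: True when the URL path does not look like an HTML page."""
    # Assumption: match against the path so query strings do not interfere.
    path = urlsplit(url).path
    return re.search(regexp, path) is None


pdf = is_file_url_sketch("https://example.com/report.pdf")    # True
page = is_file_url_sketch("https://example.com/index.html")   # False
plain = is_file_url_sketch("https://example.com/about")       # False
```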

generic.utils.get_url_without_fragment(url_string: str) → str#

Removes fragment from a URL string.
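For reference, the standard library already provides this operation through urllib.parse.urldefrag. The following minimal sketch shows the expected effect; the name url_without_fragment_sketch is hypothetical, and whether generic.utils uses urldefrag internally is not stated here.

```python
from urllib.parse import urldefrag


def url_without_fragment_sketch(url_string: str) -> str:
    """Hypothetical sketch: drop everything from '#' onward."""
    # urldefrag returns a (url, fragment) pair; only the url part is kept.
    return urldefrag(url_string).url


clean = url_without_fragment_sketch("https://example.com/a?b=1#section-2")
# clean == "https://example.com/a?b=1"
```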

async generic.utils.analyze_text_with_spacy(client: httpx.AsyncClient, text: str, url: str)#
generic.utils.tokens_include_predicate(tokens)#