generic.spiders.xml

Module Contents

Classes

XmlSpiderConfig

A configuration class for XmlSpider.

XmlSpider

A spider that scrapes ArticleItem objects from the links listed in an XML file. It is useful when the list of links is served in a dynamic XML response that the browser then renders as a list.

API

class generic.spiders.xml.XmlSpiderConfig(/, **data: Any)

Bases: generic.spiders.base.GenericSpiderConfig

A configuration class for XmlSpider.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

XPath expression to extract URLs, e.g., “//link/text()”.
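As a rough illustration of what such an expression selects, the sketch below pulls link URLs out of a small XML document using only the Python standard library. The sample document and the `extract_urls` helper are hypothetical and are not part of the generic package. Note also that `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` step, so the sketch uses the path `.//link` and reads each element's `.text` as a stand-in for `//link/text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; not taken from the generic package.
SAMPLE_XML = """<?xml version="1.0"?>
<rss>
  <channel>
    <item><link>https://example.com/a</link></item>
    <item><link> https://example.com/b </link></item>
    <item><link></link></item>
  </channel>
</rss>
"""


def extract_urls(xml_text: str, path: str = ".//link") -> list[str]:
    """Return the stripped, non-empty text of every element matching `path`."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iterfind(path):
        # element.text is None for empty elements, so fall back to "".
        text = (element.text or "").strip()
        if text:
            urls.append(text)
    return urls


urls = extract_urls(SAMPLE_XML)
```

Stripping whitespace and skipping empty elements is a design choice of this sketch; whether XmlSpider does the same is not stated in this reference.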

class generic.spiders.xml.XmlSpider(*args, **kwargs)

Bases: generic.spiders.base.GenericSpider[generic.spiders.xml.XmlSpiderConfig]

A spider that scrapes ArticleItem objects from the links listed in an XML file. It is useful when the list of links is served in a dynamic XML response that the browser then renders as a list.

Initialization

name = 'xml'

The human-friendly name of the spider.

classmethod get_config_class() -> Type[generic.spiders.xml.XmlSpiderConfig]

Returns the config class for this spider.

async start()

The entry point. Start crawling from the given URLs.

parse_xml(response: scrapy.http.Response)

A handler to parse the XML.

parse_content(response: scrapy.http.Response)

A handler to parse the article.

Yields:

ArticleItem
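The two handlers form a two-stage callback chain: the XML handler turns each link into a follow-up request, and the content handler turns each fetched page into an item. The sketch below models that flow with plain generators and no Scrapy dependency. `FakeResponse`, `FakeRequest`, `crawl`, and the dict-shaped item are all hypothetical stand-ins for `scrapy.http.Response`, `scrapy.Request`, the Scrapy engine, and `ArticleItem`; the real handlers may differ in detail.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class FakeResponse:
    """Stand-in for scrapy.http.Response: a URL and a text body."""
    url: str
    text: str


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request: a URL plus the callback for its response."""
    url: str
    callback: Callable[[FakeResponse], Iterator]


def parse_content(response: FakeResponse) -> Iterator[dict]:
    # Second stage: one article page yields one item (a dict here, not ArticleItem).
    yield {"url": response.url, "body": response.text}


def parse_xml(response: FakeResponse) -> Iterator[FakeRequest]:
    # First stage: every non-empty <link> element becomes a follow-up request.
    root = ET.fromstring(response.text)
    for link in root.iterfind(".//link"):
        url = (link.text or "").strip()
        if url:
            yield FakeRequest(url, parse_content)


def crawl(start: FakeResponse, pages: Dict[str, str]) -> List[dict]:
    """Tiny loop standing in for the engine; `pages` maps URL to page body."""
    items = []
    queue = list(parse_xml(start))
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request.url, pages[request.url])
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


start = FakeResponse(
    "https://example.com/feed.xml",
    "<feed><link>https://example.com/a</link><link>https://example.com/b</link></feed>",
)
pages = {"https://example.com/a": "Article A", "https://example.com/b": "Article B"}
items = crawl(start, pages)
```

In a real Scrapy spider the engine performs the fetching and scheduling that `crawl` fakes here; the handlers only yield requests and items.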