generic.spiders.xml
Module Contents
Classes
XmlSpiderConfig | A configuration class for XmlSpider.
XmlSpider | A spider that scrapes ArticleItem from links in an XML file. It is useful when the list of links arrives in a dynamically loaded XML response and is rendered by the browser.
API
- class generic.spiders.xml.XmlSpiderConfig(/, **data: Any)
Bases: generic.spiders.base.GenericSpiderConfig
A configuration class for XmlSpider.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- xml_link_xpath: str = None
XPath expression to extract URLs, e.g., “//link/text()”.
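As a rough illustration of what an expression such as "//link/text()" selects, the sketch below pulls link URLs out of a small XML document. It is not the spider's implementation: the sample feed and the helper name `extract_links` are made up for this example, and it relies on the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so `.//link` combined with `.text` stands in for `//link/text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; real XML responses will differ.
SAMPLE_XML = """
<rss>
  <channel>
    <item><link>https://example.com/a</link></item>
    <item><link>https://example.com/b</link></item>
  </channel>
</rss>
"""


def extract_links(xml_text, path=".//link"):
    """Return the stripped text of every element matching `path`."""
    root = ET.fromstring(xml_text.strip())
    return [
        element.text.strip()
        for element in root.iterfind(path)
        if element.text and element.text.strip()
    ]


links = extract_links(SAMPLE_XML)
print(links)
```

With a full XPath engine, the configured expression would presumably be evaluated directly against the response rather than approximated this way.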
- class generic.spiders.xml.XmlSpider(*args, **kwargs)
Bases: generic.spiders.base.GenericSpider[generic.spiders.xml.XmlSpiderConfig]
A spider that scrapes ArticleItem from links in an XML file. It is useful when the list of links arrives in a dynamically loaded XML response and is rendered by the browser.
Initialization
- name = 'xml'
The human-friendly name of the spider.
- classmethod get_config_class() Type[generic.spiders.xml.XmlSpiderConfig]
Returns the config class for this spider.
- async start()
The entry point. Start crawling from the given URLs.
- parse_xml(response: scrapy.http.Response)
A handler to parse the XML.
- parse_content(response: scrapy.http.Response)
A handler to parse the article.
A handler to parse the article.
- Yields:
ArticleItem
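The `GenericSpider[XmlSpiderConfig]` base and the `get_config_class()` classmethod documented above suggest a pattern in which each spider declares its own typed configuration class. The sketch below shows one way such a pattern could be wired together using only the standard library. Everything in it is an assumption made for illustration: `BaseConfig`, `XmlConfig`, `BaseSpider`, and `XmlSpiderSketch` are hypothetical stand-ins (dataclasses instead of the real pydantic models), the `start_urls` field is invented, and the real base classes may use the config class differently.

```python
from dataclasses import dataclass
from typing import Generic, List, Type, TypeVar


@dataclass
class BaseConfig:
    # Hypothetical stand-in for GenericSpiderConfig; `start_urls` is assumed.
    start_urls: List[str]


@dataclass
class XmlConfig(BaseConfig):
    # Mirrors the documented field name; the default is only for this sketch.
    xml_link_xpath: str = "//link/text()"


ConfigT = TypeVar("ConfigT", bound=BaseConfig)


class BaseSpider(Generic[ConfigT]):
    """Hypothetical stand-in for GenericSpider."""

    @classmethod
    def get_config_class(cls) -> Type[ConfigT]:
        raise NotImplementedError

    def __init__(self, **raw_settings):
        # Build the typed config from raw keyword settings.
        self.config: ConfigT = self.get_config_class()(**raw_settings)


class XmlSpiderSketch(BaseSpider[XmlConfig]):
    name = "xml"

    @classmethod
    def get_config_class(cls) -> Type[XmlConfig]:
        return XmlConfig


spider = XmlSpiderSketch(start_urls=["https://example.com/feed.xml"])
print(type(spider.config).__name__, spider.config.xml_link_xpath)
```

Keeping the config class behind a classmethod lets shared base-class code construct and validate settings without knowing which concrete spider it is running.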