generic.spiders.generic_sitemap
Module Contents
Classes
GenericSitemapSpider | A spider that scrapes all the articles within a sitemap.xml. The sitemap.xml may contain another sitemap.xml (nested sitemap.xml). |
API
- class generic.spiders.generic_sitemap.GenericSitemapSpiderConfig(/, **data: Any)
Bases: generic.spiders.base.GenericSpiderConfig
- sitemap_type: str = 'all'
Controls which sitemap XML URLs are rejected, such as those for archive or author pages.
“all” rejects nothing. This is the default.
“wordpress” rejects known URLs that point to index pages of tags, authors, and taxonomies.
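As a rough illustration of the two documented values, the sketch below decides whether a sitemap URL would be skipped. The function name `is_rejected` and the regular expression are hypothetical. The actual reject list used by the spider is not shown in this reference, so the patterns here are only an assumption modelled on common WordPress sitemap file names.

```python
import re

# Hypothetical reject patterns; the spider's real list is not documented here.
WORDPRESS_REJECT = re.compile(r"(author|post_tag|category)-sitemap\d*\.xml$")


def is_rejected(url: str, sitemap_type: str = "all") -> bool:
    """Return True if a sitemap URL should be skipped for the given sitemap_type."""
    if sitemap_type == "all":
        # "all" rejects nothing.
        return False
    if sitemap_type == "wordpress":
        # Skip sitemaps that index authors, tags, or categories.
        return WORDPRESS_REJECT.search(url) is not None
    raise ValueError(f"unknown sitemap_type: {sitemap_type!r}")
```

With these assumed patterns, `author-sitemap.xml` is skipped under "wordpress" but kept under "all".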
- class generic.spiders.generic_sitemap.GenericSitemapSpider(*args, **kwargs)
Bases: scrapy.spiders.SitemapSpider, generic.spiders.base.GenericSpider[generic.spiders.generic_sitemap.GenericSitemapSpiderConfig]
A spider that scrapes all the articles within a sitemap.xml. The sitemap.xml may contain another sitemap.xml (nested sitemap.xml).
The spider crawls almost the entire site, so it is suitable only for relatively small sites. In addition, the scraped ArticleItem objects may contain noise. If you prefer quality over quantity, use one of the more focused spiders.
When the urls argument includes a URL that does not end with “sitemap.xml”, the spider appends “sitemap.xml” to the URL.
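The URL handling described above can be sketched as a small helper. The name `ensure_sitemap_url` is hypothetical. Joining with a single slash is an assumption, since the reference only says that “sitemap.xml” is appended.

```python
def ensure_sitemap_url(url: str) -> str:
    """Return url unchanged if it already ends with 'sitemap.xml', else append it."""
    if url.endswith("sitemap.xml"):
        return url
    # Assumption: normalize the trailing slash so the result has exactly one.
    return url.rstrip("/") + "/sitemap.xml"
```

Under this sketch, a URL such as `https://example.com/post-sitemap.xml` is left untouched, because it already ends with “sitemap.xml”.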
Initialization
- name = 'sitemap'
- custom_settings = None
- sitemap_urls = []
- classmethod get_config_class() -> Type[generic.spiders.generic_sitemap.GenericSitemapSpiderConfig]
Returns the config class for this spider.
- sitemap_filter(entries)
- sitemap_filter_all(entries)
- sitemap_filter_wordpress(entries)
- parse(response: scrapy.http.Response)
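The three `sitemap_filter*` methods are undocumented, but their names suggest that `sitemap_filter` delegates to a variant chosen by `sitemap_type`. The sketch below shows one way that could work. It is an assumption and not the spider's actual code. `FilterDispatchSketch` is a hypothetical stand-in class that does not depend on Scrapy, and the path checks in the WordPress variant are illustrative only. Entries are modelled as dictionaries with a `loc` key, which is the shape Scrapy passes to `SitemapSpider.sitemap_filter`.

```python
from typing import Dict, Iterable, Iterator

Entry = Dict[str, str]


class FilterDispatchSketch:
    """Hypothetical stand-in for the spider; not a Scrapy class."""

    def __init__(self, sitemap_type: str = "all") -> None:
        self.sitemap_type = sitemap_type

    def sitemap_filter(self, entries: Iterable[Entry]) -> Iterator[Entry]:
        # Assumed dispatch: look up sitemap_filter_<sitemap_type> by name.
        handler = getattr(self, f"sitemap_filter_{self.sitemap_type}")
        yield from handler(entries)

    def sitemap_filter_all(self, entries: Iterable[Entry]) -> Iterator[Entry]:
        # Pass every entry through unchanged.
        yield from entries

    def sitemap_filter_wordpress(self, entries: Iterable[Entry]) -> Iterator[Entry]:
        for entry in entries:
            # Illustrative checks only: drop author and tag index pages.
            if "/author/" in entry["loc"] or "/tag/" in entry["loc"]:
                continue
            yield entry
```

A name-based lookup keeps `sitemap_filter` short, at the cost of raising `AttributeError` for an unsupported `sitemap_type`.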