generic.spiders.directory
Module Contents
Classes
DirectorySpider: A spider that crawls pages under a directory. The directory is the one that contains the last path component of the given URL.
API
- class generic.spiders.directory.DirectorySpider(url=None, *args, **kwargs)
Bases: scrapy.spiders.CrawlSpider
A spider that crawls pages under a directory. The directory is the one that contains the last path component of the given URL.
When the URL is “http://example.org/index.html”, the directory is /, so it crawls every page on the site.
When the URL is “http://example.org/foo/index.html”, it crawls pages under /foo/.
When the URL is “http://example.org/foo/bar/index.html”, it crawls pages under /foo/bar/ but not “/foo/bar.html”.
Examples:
When start_urls is ["http://example.org/a/b/c.html"]:
it crawls /a/b/index.html.
it crawls /a/b/foo.html.
it crawls /a/b/c/bar.html.
it does not crawl /index.html.
it does not crawl /a/index.html.
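The directory rule described above can be sketched in a few lines of Python. This is an illustration only, not the spider's actual implementation: the helper names `base_directory` and `is_under_directory` are hypothetical, and the same-host check is an assumption that the documentation does not state.

```python
import posixpath
from urllib.parse import urlsplit


def base_directory(url):
    """Return the directory (with a trailing slash) containing the last
    path component of ``url``. Hypothetical helper for illustration."""
    path = urlsplit(url).path or "/"
    directory = posixpath.dirname(path)
    # dirname("/a/b/c.html") gives "/a/b"; add the trailing slash so that
    # "/a/b.html" is not mistaken for something under "/a/b/".
    return directory if directory.endswith("/") else directory + "/"


def is_under_directory(start_url, candidate_url):
    """Return True if ``candidate_url`` lies under the base directory of
    ``start_url``. The same-host requirement is an assumption."""
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)
    if candidate.netloc != start.netloc:
        return False
    return (candidate.path or "/").startswith(base_directory(start_url))


start = "http://example.org/a/b/c.html"
for page in ("/a/b/index.html", "/a/b/c/bar.html", "/index.html", "/a/index.html"):
    print(page, is_under_directory(start, "http://example.org" + page))
```

Comparing against a prefix that ends in a slash is what keeps /foo/bar.html out of a crawl rooted at /foo/bar/.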
Initialization
- name = 'directory'
- custom_settings = None
- parse_body(response)
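As a usage illustration, the spider could be started from a script roughly as follows. This is a sketch that assumes Scrapy is installed and the `generic` package is importable; the wrapper name `run_directory_spider` is hypothetical, and only the `url` keyword comes from the signature documented above.

```python
def run_directory_spider(url):
    """Run DirectorySpider against ``url`` and block until the crawl ends.
    Hypothetical wrapper for illustration."""
    # Imported here so that defining the function does not require Scrapy.
    from scrapy.crawler import CrawlerProcess
    from generic.spiders.directory import DirectorySpider

    process = CrawlerProcess()
    # Keyword arguments to crawl() are forwarded to the spider constructor.
    process.crawl(DirectorySpider, url=url)
    process.start()


# Example call (performs real network requests when executed):
# run_directory_spider("http://example.org/foo/index.html")
```

Inside a Scrapy project, the equivalent command-line form passes the same argument with `scrapy crawl directory -a url=http://example.org/foo/index.html`.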