generic.spiders.directory#

Module Contents#

Classes#

DirectorySpider

A spider that crawls pages under a directory. The directory is the base directory of the last component of the given URL.

API#

class generic.spiders.directory.DirectorySpider(url=None, *args, **kwargs)#

Bases: scrapy.spiders.CrawlSpider

A spider that crawls pages under a directory. The directory is the base directory of the last component of the given URL.

When the URL is “http://example.org/index.html”, it crawls every page on the site, because the base directory is “/”.

When the URL is “http://example.org/foo/index.html”, it crawls pages under /foo/.

When the URL is “http://example.org/foo/bar/index.html”, it crawls pages under /foo/bar/ but not “/foo/bar.html`.

Examples:

When start_urls is [“http://example.org/a/b/c.html”]:

  • it crawls “/a/b/index.html”.

  • it crawls “/a/b/foo.html”.

  • it crawls “/a/b/c/bar.html”.

  • it does not crawl “/index.html”.

  • it does not crawl “/a/index.html”.
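The directory rule behind the examples above can be sketched as a small standalone helper. `directory_prefix` is a hypothetical name for illustration only, not part of the spider's API:

```python
from urllib.parse import urlparse

def directory_prefix(url):
    """Return the base directory of the last path component of *url*.

    Hypothetical helper illustrating the rule described above; the
    spider's actual internals may differ.
    """
    path = urlparse(url).path
    # Drop the last path component ("index.html", "c.html", ...),
    # keeping everything up to and including the final slash.
    return path.rsplit("/", 1)[0] + "/"
```

Only links whose path starts with this prefix would then be followed, e.g. `directory_prefix("http://example.org/a/b/c.html")` gives `"/a/b/"`.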

Initialization

name = 'directory'#
custom_settings = None#
parse_body(response)#
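Given that the class is registered under name = 'directory' and its constructor accepts a url argument, a crawl restricted to /foo/ could plausibly be started from a Scrapy project like this (hypothetical project setup):

```shell
# Spider arguments are passed with -a, matching the
# DirectorySpider(url=None, ...) constructor signature.
scrapy crawl directory -a url="http://example.org/foo/index.html"
```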