DirectorySpider

A spider that crawls the pages under a directory. It is useful for scraping ArticleItem objects from one part of a website.

How It Works

The spider crawls the directory that contains the last path component of the given URL. When the URL is http://example.org/index.html, it crawls every page under /, that is, the whole site. When the URL is http://example.org/foo/index.html, it crawls pages under /foo/. When the URL is http://example.org/foo/bar/index.html, it crawls pages under /foo/bar/ but not /foo/bar.html.

When start_urls is http://example.org/a/b/c.html:

it crawls /a/b/index.html.
it crawls /a/b/foo.html.
it crawls /a/b/c/bar.html.
it does not crawl /index.html.
it does not crawl /a/index.html.
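
The rule illustrated above can be sketched as a simple path-prefix check. This is a minimal illustration only, not the spider's actual implementation: the helper names `directory_prefix` and `is_in_directory` are hypothetical, and the same-host check is an assumption that the text above does not state.

```python
import posixpath
from urllib.parse import urlsplit


def directory_prefix(start_url: str) -> str:
    """Return the base directory of the URL's last path component, ending in '/'.

    Hypothetical helper for illustration; not part of the spider's API.
    """
    path = urlsplit(start_url).path or "/"
    # Drop the last component (e.g. 'c.html') to get the directory that holds it.
    directory = posixpath.dirname(path)
    return directory if directory.endswith("/") else directory + "/"


def is_in_directory(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url lies under the directory of start_url."""
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)
    # Assumption: only URLs on the same host are considered.
    if start.netloc != candidate.netloc:
        return False
    # The trailing '/' on the prefix keeps /foo/bar.html out of /foo/bar/.
    return (candidate.path or "/").startswith(directory_prefix(start_url))


if __name__ == "__main__":
    start = "http://example.org/a/b/c.html"
    for path in ["/a/b/index.html", "/a/b/foo.html", "/a/b/c/bar.html",
                 "/index.html", "/a/index.html"]:
        print(path, is_in_directory(start, "http://example.org" + path))
```

Keeping the trailing slash on the prefix is what separates /foo/bar/ from /foo/bar.html; a plain comparison against /foo/bar would match both.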
