DirectorySpider

A spider that crawls the pages under a directory. It is useful for scraping ArticleItem objects from one part of a website.

How It Works

The directory to crawl is the one that contains the last path component of the given URL. When the URL is http://example.org/index.html, that directory is /, so it crawls all URLs under /. When the URL is http://example.org/foo/index.html, it crawls pages under /foo/. When the URL is http://example.org/foo/bar/index.html, it crawls pages under /foo/bar/ but not /foo/bar.html, which sits beside that directory rather than inside it.

When start_urls is http://example.org/a/b/c.html:

  • it crawls /a/b/index.html.

  • it crawls /a/b/foo.html.

  • it crawls /a/b/c/bar.html.

  • it does not crawl /index.html.

  • it does not crawl /a/index.html.
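
The rule above can be sketched in a few lines of Python. This is only an illustration of the path logic described in this section, not the spider's actual implementation: the helper names base_directory and should_crawl are hypothetical, and the same-host check is an assumption that this section does not state explicitly.

```python
import posixpath
from urllib.parse import urlsplit


def base_directory(start_url: str) -> str:
    """Return the directory that contains the last path component of start_url.

    Hypothetical helper, written only to illustrate the rule in this section.
    """
    path = urlsplit(start_url).path or "/"
    # Drop the last path component: "/a/b/c.html" becomes "/a/b".
    directory = posixpath.dirname(path)
    # Keep a trailing slash so "/foo/bar/" does not match "/foo/bar.html".
    return directory if directory.endswith("/") else directory + "/"


def should_crawl(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url lies under the base directory of start_url."""
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)
    # Assumption: only URLs on the same host as the start URL are in scope.
    if candidate.netloc != start.netloc:
        return False
    return (candidate.path or "/").startswith(base_directory(start_url))


if __name__ == "__main__":
    start = "http://example.org/a/b/c.html"
    for path in ("/a/b/index.html", "/a/b/foo.html", "/a/b/c/bar.html",
                 "/index.html", "/a/index.html"):
        print(path, should_crawl(start, "http://example.org" + path))
```

Keeping the trailing slash on the base directory is what separates /foo/bar/ from the sibling page /foo/bar.html in a plain prefix comparison.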