From charlesreid1

Revision as of 03:34, 13 May 2012

Installation

Prerequisites

You'll need to install libxml2 and libxslt to use Scrapy. Installation instructions can be found on the Libxml and libxslt page.

Build/Install

Scrapy can be installed by downloading the tarball, unzipping it, and doing the usual Python build/install steps:

$ python setup.py build

$ python setup.py install

Tutorial Project

Following the instructions here, create a new project:

$ scrapy startproject tutorial

Define Data Items

The first thing to do in a project is to define the type of data that the scraper will be extracting (in Scrapy's terms, this is an Item). If you were scraping IMDB for movie information, an Item might be a movie, or an actor; each Item would have various Fields, such as release year, studio, length, and so on.

Later, you will tell Scrapy how to populate these fields using the data scraped from the web page.

As an example, this would go in tutorial/items.py:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Items are covered in great detail here: http://doc.scrapy.org/en/latest/topics/items.html#topics-items-declaring

They are basically a fancy version of a dictionary.
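
To make the dictionary comparison concrete, here is a rough stand-in written in plain Python. It is not Scrapy's Item class: SketchItem and its fields tuple are hypothetical names used only for illustration. It mimics one behavior of an Item, namely that values are read and written by key, but only declared fields are accepted.

```python
class SketchItem(dict):
    """Hypothetical stand-in for an Item: a dict limited to declared fields."""

    # Same field names as the DmozItem example (an assumption for this sketch).
    fields = ("title", "link", "desc")

    def __setitem__(self, key, value):
        # Reject keys that were not declared as fields.
        if key not in self.fields:
            raise KeyError("SketchItem does not support field: %s" % key)
        dict.__setitem__(self, key, value)


item = SketchItem()
item["title"] = "Example book"       # declared field: accepted
print(item["title"])
print(item.get("link", "not set"))   # unset fields act like missing dict keys

try:
    item["author"] = "Someone"       # undeclared field: rejected
except KeyError as err:
    print("rejected: %s" % err)
```

The point of the restriction is that a typo in a field name fails loudly instead of silently creating a new key.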

Create Spider

A spider grabs the HTML content of a web page (using an HTTP request, same as a web browser), but instead of displaying that HTML as a web page (as a browser might), the spider processes the contents to extract information.

The spider must define three things:

  • the spider's name (name)
  • list of URLs to scrape (start_urls)
  • what to do with the scraped material (parse())

The spider's behavior is defined in tutorial/spiders/dmoz_spider.py

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

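    def parse(self, response):
        # Name the output file after the last path segment of the URL.
        filename = response.url.split("/")[-2]
        # Write the raw response body; the with block closes the file.
        with open(filename, 'wb') as f:
            f.write(response.body)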

This spider requests each of the URLs given in start_urls and hands each response to parse(). Here, parse() creates a file named after the last path segment of the URL and writes the entire body of the HTTP response into that file.
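
To see which filenames the split produces, the same expression can be tried outside of Scrapy. This is a small sketch in plain Python; filename_from_url is a hypothetical helper name, not part of Scrapy.

```python
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_from_url(url):
    # These URLs end in "/", so the last piece after splitting is an empty
    # string; the second-to-last piece is the final path segment.
    return url.split("/")[-2]


filenames = [filename_from_url(url) for url in start_urls]
for url, filename in zip(start_urls, filenames):
    print("%s -> %s" % (url, filename))
```

With these two URLs the files end up named Books and Resources.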

If you wanted to download non-HTML content, such as a PDF file, the spider works the same way:

from scrapy.spider import BaseSpider

class FileSpider(BaseSpider):
    name = "file"
    allowed_domains = ["charlesmartinreid.com"]
    start_urls = [
        "http://www.charlesmartinreid.com/files/TabbloidExample.pdf"
    ]

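    def parse(self, response):
        # This URL has no trailing "/", so the last piece is the filename.
        filename = response.url.split("/")[-1]
        # Write the raw bytes of the PDF; the with block closes the file.
        with open(filename, 'wb') as f:
            f.write(response.body)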

With the method above you probably won't need to add an extension, since the last piece of the split URL already includes it. If you build the filename from any other string, you'll probably have to add the file extension yourself:

filename = response.url.split("/")[-2] + '.pdf'
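
One way to avoid deciding this by hand is to add the extension only when the name lacks one. The following is a sketch, assuming the default extension is ".pdf"; filename_with_extension is a hypothetical helper, and the example.com URL is made up for illustration.

```python
import os.path


def filename_with_extension(url, default_ext=".pdf"):
    # Drop a trailing "/" so the last piece is never empty.
    name = url.rstrip("/").split("/")[-1]
    # splitext returns (root, ext); ext is "" when the name has no extension.
    root, ext = os.path.splitext(name)
    if not ext:
        name = name + default_ext
    return name


print(filename_with_extension(
    "http://www.charlesmartinreid.com/files/TabbloidExample.pdf"))
print(filename_with_extension("http://example.com/files/report"))
```

The first URL keeps its own extension; the second gets the default one appended.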

Run Spider

To run the spider named dmoz, execute the following from the project directory:

$ scrapy crawl dmoz

More Complex Parsing

Dealing with Forms

Scrapy has built-in support for HTML forms, via the FormRequest object type.

A spider can simulate someone submitting an HTML POST form, with its key-value fields, by returning a FormRequest object.

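A FormRequest object is a Request that takes the form's key-value fields through its formdata argument and sends them, URL-encoded, in the body of a POST request.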
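
To show what "key-value fields" means for a POST form, here is a sketch that uses only the Python standard library, not Scrapy. It builds the URL-encoded body that a form submission carries. The field names and values are hypothetical; a real form defines its own. In a spider, pairs like these would be given to FormRequest rather than encoded by hand.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# Hypothetical form fields; a real form defines its own field names.
form_fields = [("username", "alice"), ("password", "s3cret")]

# A POST form sends its fields as key=value pairs joined by "&".
body = urlencode(form_fields)
print(body)
```

A list of pairs is used instead of a dictionary so the fields are encoded in a fixed order.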


Errors

Errors Creating a Project

Module object has no attribute HTML_PARSE_RECOVER

You may see an error related to a missing attribute HTML_PARSE_RECOVER when you try to create a new project:


$ scrapy startproject tutorial

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

This error indicates a problem with the libxml or libxslt installation. First, follow the instructions on the Libxml and libxslt page to make sure your installation steps are correct. Also check that libxml and libxslt are on your Python path variable:

$ echo $PYTHONPATH

and make sure the paths contain no minor typos (e.g. libxml vs libxml2). If you keep seeing this error even after installing libxml and libxslt, then the libxml2 module that Python imports is most likely not the one you just installed, or the installation itself is broken.
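
A quick way to narrow this down is to ask Python directly. The sketch below checks whether the libxml2 module can be imported and whether it has the attribute named in the traceback, then prints the module search path. check_libxml2 is a hypothetical helper name, and the messages are my own wording.

```python
import sys


def check_libxml2():
    """Report whether the importable libxml2 module has HTML_PARSE_RECOVER."""
    try:
        import libxml2
    except ImportError:
        return "libxml2 bindings are not importable"
    if hasattr(libxml2, "HTML_PARSE_RECOVER"):
        return "libxml2 bindings look usable"
    return "libxml2 imports, but HTML_PARSE_RECOVER is missing"


print(check_libxml2())

# Show where Python searches for modules, to spot a stale or misspelled path.
for path in sys.path:
    print(path)
```

If the module imports but the attribute is missing, Python is picking up a libxml2 module that does not match what Scrapy expects, and the printed paths show where to look.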


Errors When Trying To Crawl

KeyError: Spider Not Found

If you see a KeyError related to the spider not being found, like this one,


$ scrapy crawl dmoz

2012-05-12 15:11:20-0700 [scrapy] INFO: Scrapy 0.14.3 started (bot: tutorial)
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled item pipelines: 
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 132, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 139, in _run_command
    cmd.run(args, opts)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'

it's probably because your spiders are not in the right place. Spiders should go in the file NAMEOFPROJECT/spiders/NAMEOFSPIDER_spider.py, and the name you pass to scrapy crawl must match the spider's name attribute.


For example, if I'm working on the "tutorial" project, and I have a spider named dmoz, it should be defined in the file

tutorial/spiders/dmoz_spider.py
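
To check what is actually in the spiders directory, a small script can list every name attribute it finds. This is a sketch in plain Python that reads the files as text; it does not use Scrapy's own spider lookup. find_spider_names and NAME_PATTERN are hypothetical names, and the simple pattern assumes the name is written as a quoted string on its own line.

```python
import os
import re

# Matches a class attribute such as: name = "dmoz"
NAME_PATTERN = re.compile(r"^\s*name\s*=\s*['\"]([^'\"]+)['\"]", re.MULTILINE)


def find_spider_names(spiders_dir):
    """Map each spider name found under spiders_dir to the file defining it."""
    found = {}
    for filename in sorted(os.listdir(spiders_dir)):
        if not filename.endswith(".py"):
            continue
        path = os.path.join(spiders_dir, filename)
        with open(path) as handle:
            source = handle.read()
        for name in NAME_PATTERN.findall(source):
            found[name] = path
    return found


if __name__ == "__main__":
    # Run from the project directory; "tutorial" is the project name here.
    spiders_dir = os.path.join("tutorial", "spiders")
    if os.path.isdir(spiders_dir):
        for name, path in sorted(find_spider_names(spiders_dir).items()):
            print("%s: %s" % (name, path))
    else:
        print("No such directory: %s" % spiders_dir)
```

If dmoz does not show up in the output, the spider file is either in the wrong directory or its name attribute says something else.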