
Installation

Prerequisites

You'll need to install libxml2 and libxslt to use Scrapy. Installation instructions can be found here: Libxml and libxslt

Build/Install

Scrapy can be installed by downloading the tarball, unpacking it, and doing the usual Python build/install steps:

$ python setup.py build

$ python setup.py install

Tutorial Project

Following the instructions here, create a new project:

$ scrapy startproject tutorial

Define Data Items

The first thing to do in a project is to define the type of data that the scraper will be extracting (in Scrapy's terms, this is an Item). If you were scraping IMDB for movie information, an Item might be a movie, or an actor; each Item would have various Fields, such as release year, studio, length, and so on.

Later, you will tell Scrapy how to populate these fields using the data scraped from the web page.

As an example, this would go in tutorial/items.py:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Items are covered in great detail here: http://doc.scrapy.org/en/latest/topics/items.html#topics-items-declaring

They are basically a fancy version of a dictionary.
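
For example (a quick illustrative snippet; the values are made up), once DmozItem is defined you can treat it much like a dictionary:

from tutorial.items import DmozItem

item = DmozItem()
item['title'] = 'Some Python Book'
item['link'] = 'http://example.com/some-python-book'
print item['title']     # prints: Some Python Book
print item.keys()       # lists the fields that have been set so far
# item['author'] = '...' would raise a KeyError, since 'author' was never declared as a Field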

Create Spider

A spider grabs the HTML content of a web page (using an HTTP request, same as a web browser), but instead of displaying that HTML as a web page (as a browser might), the spider processes the contents to extract information.

The spider must define three things:

  • the spider's name (name)
  • list of URLs to scrape (start_urls)
  • what to do with the scraped material (parse())

The spider's behavior is defined in tutorial/spiders/dmoz_spider.py:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

This spider will go through each of the URLs given in start_urls and parse the content. The parse() method builds a filename from the URL and dumps the entire body of the HTTP response into that file. (With the two start_urls above, the files will be named Books and Resources.)

If you wanted to download non-HTML content, such as a PDF file, the spider works the same way:

from scrapy.spider import BaseSpider

class FileSpider(BaseSpider):
    name = "file"
    allowed_domains = ["charlesmartinreid.com"]
    start_urls = [
        "http://www.charlesmartinreid.com/files/TabbloidExample.pdf"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)

With the method above you probably won't need to append an extension, since the extension is already part of the last segment of the split URL; but if you build the filename from any other part of the URL, you'll probably have to add the extension yourself:

filename = response.url.split("/")[-2] + '.pdf'

Run Spider

To run the spider named dmoz, execute the following from the project directory:

$ scrapy crawl dmoz

More Complex Parsing

Dealing with Forms

This section illustrates how to use Scrapy spiders to deal with HTML forms.

The scenario: you want to set the values of a form's fields, submit the form, and receive a results page, which is then processed using Scrapy.

Use FormRequest

The FormRequest object type allows a Scrapy spider to deal with forms. It is made available by adding this import to the beginning of the spider's file:

from scrapy.http import FormRequest

Simple Example: Radio Buttons

Sample HTML Form

Suppose you have a simple form and an associated result page. These will look something like this:

The Form:


<html>
<head><title>form test</title></head>
<body>
<form action="form.php" method="post">
<p>Pick one: </p>
<input type="radio" name="thisbutton" value="100" /> 100
<p><input type="submit" name="submit" value="Send it!"></p>
</form>
</body>
</html>

And when this form is submitted, it results in the following page:

The Result Page:


<p>Submit is set.</p>
<p>Its value is:</p>
<div id="scrape">100</div>

Start the request

To submit the above form with a spider, first define start_requests(), which is the method Scrapy calls when no start_urls are specified:

    def start_requests(self):

The start_requests definition begins by creating a FormRequest object (see http://www.pixelmender.com/2010/10/12/scraping-data-using-scrapy-framework/).

This request specifies the value that each form field takes on:

    def start_requests(self):
        # fill in the form fields and submit; the response is handed to after_submit()
        cmrResultRequest = FormRequest("http://charlesmartinreid.com/form.php",
                            formdata={'thisbutton':'100', 'submit':'Send it!'},
                            callback=self.after_submit)
        return [cmrResultRequest]

When Scrapy calls start_requests(), it gets back a list containing this FormRequest. Scrapy then submits the form with the given field values and passes the resulting response to the after_submit() callback.

Processing the result

Once the FormRequest object is created, it submits The Form (see above) and returns The Result Page. Information from the result page can then be processed using Scrapy.

    # requires: from scrapy.selector import HtmlXPathSelector
    def after_submit(self, response):
        hxs = HtmlXPathSelector(response)
        # grab the text inside the <div id="scrape"> element of the result page
        text_of_result = hxs.select('//div[@id="scrape"]/text()')
        print text_of_result
        return

In this case, the action is really boring: a print statement. But it could be something more elaborate, like loading the information into one or more items.
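
For instance, here is a rough sketch of an after_submit() that loads the scraped value into an Item; the FormResultItem class is hypothetical and not part of the tutorial project:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector

class FormResultItem(Item):
    value = Field()

and then, inside the spider:

    def after_submit(self, response):
        hxs = HtmlXPathSelector(response)
        item = FormResultItem()
        # extract() returns the matching text nodes as a list of strings
        item['value'] = hxs.select('//div[@id="scrape"]/text()').extract()
        return [item]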

Traversing Content

Errors

Errors Creating a Project

Module object has no attribute HTML_PARSE_RECOVER

You may see an error related to a missing attribute HTML_PARSE_RECOVER when you try to create a new project:


$ scrapy startproject tutorial

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

This error indicates a problem with libxml2 or libxslt. First, follow the instructions on the Libxml and libxslt page to make sure your installation steps are correct. Also check that the libxml2 and libxslt Python bindings are on your Python path:

$ echo $PYTHONPATH

and ensure there are no minor typos (e.g. libxml vs. libxml2). If you keep seeing this error even after installing libxml2 and libxslt, they are not installed correctly and are not working the way they are supposed to.
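
A quick sanity check (assuming the python on your PATH is the same interpreter Scrapy uses) is to ask the libxml2 bindings for the constant directly:

$ python -c "import libxml2; print libxml2.HTML_PARSE_RECOVER"

If this prints a number rather than raising an ImportError or AttributeError, the bindings are probably fine and the problem lies elsewhere.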


Errors When Trying To Crawl

KeyError: Spider Not Found

If you see a KeyError related to the spider not being found, like this one,


$ scrapy crawl dmoz

2012-05-12 15:11:20-0700 [scrapy] INFO: Scrapy 0.14.3 started (bot: tutorial)
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled item pipelines: 
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 132, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 139, in _run_command
    cmd.run(args, opts)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'

it's probably because your spiders are not in the right place. Spiders belong in the project's spiders/ directory, in a file like NAMEOFPROJECT/spiders/NAMEOFSPIDER_spider.py.


For example, if I'm working on the "tutorial" project, and I have a spider named dmoz, it should be defined in the file

tutorial/spiders/dmoz_spider.py
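
For reference, scrapy startproject tutorial generates a layout roughly like the following (dmoz_spider.py is the file you add yourself):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dmoz_spider.py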