Revision as of 17:39, 12 May 2012

Installation

Prerequisites

You'll need to install libxml2 and libxslt to use Scrapy. Installation instructions can be found on the Libxml and libxslt page.

Build/Install

Scrapy can be installed by downloading the tarball, extracting it, and running the usual Python build/install steps from the extracted directory:

$ python setup.py build

$ python setup.py install
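
A quick way to confirm that the install step worked is to try importing the packages from the same Python interpreter. The following is only a rough sketch: `find_module_version` is a hypothetical helper name, not part of Scrapy, and it assumes a package reports its version through a `__version__` attribute (modules without one are reported as "unknown").

```python
import importlib


def find_module_version(name):
    """Return the module's version string, 'unknown' if it defines none,
    or None if the module cannot be imported at all."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # The two prerequisites' Python bindings, then Scrapy itself.
    for name in ("libxml2", "libxslt", "scrapy"):
        version = find_module_version(name)
        if version is None:
            print("%s: not importable" % name)
        else:
            print("%s: version %s" % (name, version))
```

If any of the three is reported as not importable, the interpreter being run is probably not the one the package was installed into, or the package is not on the module search path.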

Create Project

Following the instructions here, create a new project:

$ scrapy startproject tutorial

Errors

Module object has no attribute HTML_PARSE_RECOVER

You may see an error about a missing attribute HTML_PARSE_RECOVER when you try to create a new project:


$ scrapy startproject tutorial
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

This error indicates a problem with the libxml2 or libxslt installation. First, follow the instructions on the Libxml and libxslt page to confirm that your installation steps are correct. Then check that the libxml2 and libxslt Python bindings are on your PYTHONPATH:

$ echo $PYTHONPATH

and ensure there are no minor typos in the path entries (e.g. libxml vs. libxml2). If you still see this error after installing libxml2 and libxslt, the libraries (or their Python bindings) are not installed correctly.
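
To narrow down which copy of libxml2 Python is actually loading, it can help to ask the interpreter directly. The sketch below is an illustration under stated assumptions: `diagnose_attribute` is a hypothetical helper name, and it only reports where the module was loaded from and whether the attribute named in the traceback exists. It does not repair anything.

```python
import importlib
import sys


def diagnose_attribute(module_name, attribute):
    """Describe whether module_name imports and whether it defines attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return "cannot import %s: %s" % (module_name, exc)
    # Built-in modules have no __file__, so fall back to a placeholder.
    location = getattr(module, "__file__", "(built-in)")
    if hasattr(module, attribute):
        return "%s from %s defines %s" % (module_name, location, attribute)
    return "%s from %s is missing %s" % (module_name, location, attribute)


if __name__ == "__main__":
    print(diagnose_attribute("libxml2", "HTML_PARSE_RECOVER"))
    # Directories Python searches for modules, in order; PYTHONPATH
    # entries appear near the front of this list.
    for entry in sys.path:
        print(entry)
```

If the module is reported as missing the attribute, the path printed alongside it shows which libxml2 binding is being picked up, which may be a different copy from the one you just installed.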

Scrapy: Difference between revisions

From charlesreid1