Scrapy
From charlesreid1
Installation
Prerequisites
Scrapy requires libxml2 and libxslt. Installation instructions are on the Libxml and libxslt page.
Build/Install
To install Scrapy, download the tarball, extract it, and run the usual Python build and install steps from the extracted directory:
$ python setup.py build
$ python setup.py install
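A quick way to confirm that the install step worked is to import the package from Python and look at its version. The following is a minimal sketch, not part of the original instructions: the helper name installed_version is hypothetical, and it assumes the package exposes a __version__ attribute (if it does not, the helper reports "unknown").

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not on the Python path (or failed to load).
        return None
    # Not every package defines __version__; report "unknown" in that case.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # Prints None if Scrapy is not importable from this interpreter.
    print(installed_version("scrapy"))
```

If this prints None, the interpreter you are running is probably not the one that setup.py installed into.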
Create Project
Create a new project with the startproject command (here the project is named tutorial):
$ scrapy startproject tutorial
Errors
Errors Creating a Project
Module object has no attribute HTML_PARSE_RECOVER
You may see an AttributeError about a missing HTML_PARSE_RECOVER attribute when you try to create a new project:
$ scrapy startproject tutorial
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
This error means that the libxml2 Python module imported by Scrapy's selector code does not define the HTML_PARSE_RECOVER constant, which points to a problem with the libxml2 or libxslt installation. First, follow the instructions on the Libxml and libxslt page to make sure your installation steps are correct. Then check that the libxml2 and libxslt Python bindings are on your Python path:
$ echo $PYTHONPATH
and check the path for minor typos (e.g. libxml vs libxml2). If the error persists after you reinstall libxml2 and libxslt, the libraries are still not installed correctly; for example, Python may be importing a different copy than the one you installed.
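To see which libxml2 binding Python is actually picking up, it can help to import the module directly and inspect it. The sketch below is an illustration under stated assumptions rather than an official diagnostic: the helper name inspect_binding is hypothetical, and the module name libxml2 and the attribute HTML_PARSE_RECOVER are taken from the traceback above.

```python
import importlib


def inspect_binding(module_name, attr_name):
    """Report whether a module imports, where it was loaded from,
    and whether it defines the given attribute."""
    report = {"importable": False, "location": None, "has_attr": False}
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module is not on the Python path or failed to load.
        return report
    report["importable"] = True
    # Built-in modules have no __file__, so fall back to None.
    report["location"] = getattr(module, "__file__", None)
    report["has_attr"] = hasattr(module, attr_name)
    return report


if __name__ == "__main__":
    # Names taken from the traceback: the libxml2 binding and the missing constant.
    print(inspect_binding("libxml2", "HTML_PARSE_RECOVER"))
```

If importable is False, the binding is not on the Python path at all. If importable is True but has_attr is False, the location entry shows which copy of the module was loaded, which you can compare against the directory you installed into.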