Scrapy
Installation
Prerequisites
You'll need to install libxml2 and libxslt to use Scrapy. Installation instructions can be found here: Libxml and libxslt
Build/Install
Scrapy can be installed by downloading the tarball, unzipping it, and doing the usual Python build/install steps:
$ python setup.py build
$ python setup.py install
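Once the build and install finish, a quick sanity check is to ask Scrapy for its version, which should print something like Scrapy 0.14.3 (whatever version you installed):
$ scrapy version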
Tutorial Project
Following the instructions here, create a new project:
$ scrapy startproject tutorial
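The startproject command generates a skeleton project. The layout should look roughly like this (exact contents may vary a bit between Scrapy versions):
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py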
Define Data Items
The first thing to do in a project is to define the type of data that the scraper will be extracting (in Scrapy's terms, this is an Item). If you were scraping IMDB for movie information, an Item might be a movie, or an actor; each Item would have various Fields, such as release year, studio, length, and so on.
Later, you will tell Scrapy how to populate these fields using the data scraped from the web page.
As an example, this would go in tutorial/items.py:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
Items are covered in great detail here: http://doc.scrapy.org/en/latest/topics/items.html#topics-items-declaring
They are basically a fancy version of a dictionary.
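As a quick illustration of the "fancy dictionary" behavior, here is a minimal sketch (using the DmozItem declared above) of creating an Item and populating it with dict-style syntax:
from tutorial.items import DmozItem

item = DmozItem()
item['title'] = 'Some Python Book'       # set fields just like dictionary keys
item['link'] = 'http://www.example.com/'
print item['title']                      # read them back the same way
# assigning to an undeclared field, e.g. item['publisher'], raises a KeyError
Unlike a plain dictionary, an Item only accepts the Fields you declared, which catches typos in field names early.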
Create Spider
A spider grabs the HTML content of a web page (using an HTTP request, same as a web browser), but instead of displaying that HTML as a web page (as a browser might), the spider processes the contents to extract information.
The spider must define three things:
- the spider's name (name)
- the list of URLs to scrape (start_urls)
- what to do with the scraped material (the parse() method)
The spider's behavior is defined in tutorial/spiders/dmoz_spider.py:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
This spider will visit each of the URLs given in start_urls and hand the content to parse(). The parser creates a file whose name is taken from the URL and dumps the entire body of the HTTP response into that file.
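Instead of dumping the raw response to a file, a more typical parse() extracts structured data with XPath and returns Items. Here is a minimal sketch against the same DMOZ pages, using the 0.14-era HtmlXPathSelector and the DmozItem defined earlier; the XPath expressions are assumptions about the page layout, so adjust them to the real markup:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozItemSpider(BaseSpider):
    name = "dmoz_items"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # assume each <li> in the directory listing holds one link
        for site in hxs.select('//ul/li'):
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items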
If you wanted to download non-HTML content, such as a PDF file, the spider works the same way:
from scrapy.spider import BaseSpider

class FileSpider(BaseSpider):
    name = "file"
    allowed_domains = ["charlesmartinreid.com"]
    start_urls = [
        "http://www.charlesmartinreid.com/files/TabbloidExample.pdf"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-1]
        open(filename, 'wb').write(response.body)
With the method above you usually won't need to append an extension, since the extension is already part of the last URL segment; but if you build the filename from any other part of the URL, you will probably have to add the file extension yourself:
filename = response.url.split("/")[-2] + '.pdf'
Run Spider
To run the spider named dmoz, execute the following from the project directory:
$ scrapy crawl dmoz
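If your parse() callback returns Items instead of writing files directly, the same command can also dump the scraped Items using Scrapy's built-in feed exporter; with the 0.14-era command line this looks roughly like:
$ scrapy crawl dmoz -o items.json -t json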
More Complex Parsing
Dealing with Forms
This section illustrates how to use a Scrapy spider to deal with HTML forms.
The scenario: you want to set the fields of a form, submit the form, and receive a results page, which is then processed with Scrapy.
Use FormRequest
The FormRequest object type allows a Scrapy spider to deal with forms. It is made available by adding this import at the top of the spider's file:
from scrapy.http import FormRequest
Simple Example: Radio Buttons
Sample HTML Form
Suppose you have a simple form and associated result pages. These will look something like this:
The Form:
<html>
<head><title>form test</title></head>
<body>
<form action="form.php" method="post">
  <p>Pick one: </p>
  <input type="radio" name="thisbutton" value="100" /> 100
  <p><input type="submit" name="submit" value="Send it!"></p>
</form>
</body>
</html>
And when this form is submitted, it results in the following page:
The Result Page:
<p>Submit is set.</p>
<p>Its value is:</p>
<div id="scrape">100</div>
Start the request
To submit the above form with a spider, first define start_requests(), which is the method Scrapy calls when no start_urls are specified:
def start_requests(self):
The start_requests definition begins by creating a FormRequest object (see http://www.pixelmender.com/2010/10/12/scraping-data-using-scrapy-framework/).
This request specifies the value that each field of the form takes on:
def start_requests(self):
    cmrResultRequest = FormRequest("http://charlesmartinreid.com/form.php",
                                   formdata={'thisbutton': '100', 'submit': 'Send it!'},
                                   callback=self.after_submit)
    return [cmrResultRequest]
When start_requests() is called, it creates a FormRequest that sets the given form fields to the given values. Scrapy then submits the form and passes the resulting response to the callback, after_submit().
Processing the result
Once the FormRequest has been submitted, Scrapy fetches The Result Page (see above). Information from the result page can then be processed using Scrapy's selectors.
# at the top of the spider's file:
from scrapy.selector import HtmlXPathSelector

def after_submit(self, response):
    hxs = HtmlXPathSelector(response)
    # extract the text of the <div id="scrape"> element
    text_of_result = hxs.select('//div[@id="scrape"]/text()').extract()
    print text_of_result
    return
In this case, the action is really boring: a print statement. But it could be something more elaborate, such as loading the information into one or more Items.
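For example, a slightly more elaborate after_submit() might load the scraped value into an Item. A minimal sketch, where ResultItem is a hypothetical Item class invented for illustration (not part of the tutorial project):
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector

class ResultItem(Item):
    value = Field()    # hypothetical field to hold the scraped value

# inside the spider:
def after_submit(self, response):
    hxs = HtmlXPathSelector(response)
    item = ResultItem()
    # grab the text of the <div id="scrape"> element from The Result Page
    item['value'] = hxs.select('//div[@id="scrape"]/text()').extract()
    return [item]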
Traversing Content
Errors
Errors Creating a Project
Module object has no attribute HTML_PARSE_RECOVER
You may see an error related to a missing attribute HTML_PARSE_RECOVER when you try to create a new project:
$ scrapy startproject tutorial
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
This error indicates a problem with libxml or libxslt. First, follow the instructions at the Libxml and libxslt page to make sure your installation steps are correct. Also check to ensure that libxml and libxslt are on your Python path variable:
$ echo $PYTHONPATH
and check for minor typos (e.g. libxml vs. libxml2). If you keep seeing this error even after installing libxml and libxslt, it means they are not installed correctly and are not working the way they should.
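A quick way to test whether the libxml2 Python bindings that Scrapy is importing are healthy is to ask them directly for the attribute named in the error; an ImportError or AttributeError from this one-liner confirms the problem is in the libxml2 install, not in Scrapy:
$ python -c "import libxml2; print libxml2.HTML_PARSE_RECOVER"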
Errors When Trying To Crawl
KeyError: Spider Not Found
If you see a KeyError related to the spider not being found, like this one,
$ scrapy crawl dmoz
2012-05-12 15:11:20-0700 [scrapy] INFO: Scrapy 0.14.3 started (bot: tutorial)
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 132, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 139, in _run_command
    cmd.run(args, opts)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'
it's probably because your spiders are not in the right place. Spiders should be defined in the file NAMEOFPROJECT/spiders/NAMEOFSPIDER_spider.py.
For example, if I'm working on the "tutorial" project and I have a spider named dmoz, it should be defined in the file tutorial/spiders/dmoz_spider.py.
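In other words, the relevant part of the project tree should look like this:
tutorial/
    scrapy.cfg
    tutorial/
        items.py
        spiders/
            __init__.py
            dmoz_spider.py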