Scrapy: Difference between revisions
From charlesreid1
Revision as of 22:38, 12 May 2012
Installation
Prerequisites
You'll need to install libxml2 and libxslt to use Scrapy. Installation instructions can be found on the Libxml and libxslt page.
Build/Install
Scrapy can be installed by downloading the tarball, unpacking it, and running the usual Python build/install steps:
$ python setup.py build
$ python setup.py install
Create Project
Following the instructions here, create a new project:
$ scrapy startproject tutorial
Define Data Items
The first thing to do in a project is to define the type of data that the scraper will be extracting (in Scrapy's terms, this is an Item). If you were scraping IMDB for movie information, an Item might be a movie, or an actor; each Item would have various Fields, such as release year, studio, length, and so on.
Later, you will tell Scrapy how to populate these fields using the data scraped from the web page.
As an example, this would go in tutorial/items.py:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
Items are covered in detail in the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/items.html#topics-items-declaring
They are basically a fancy version of a dictionary: you read and write values by key, and the keys are the Fields you declared.
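To make the dictionary comparison concrete, here is a minimal pure-Python sketch of a dictionary that only accepts a fixed set of keys. It is an illustration, not Scrapy's implementation: the names SketchItem, DmozSketchItem, and fields are hypothetical, and it assumes (as I understand Scrapy's behavior) that assigning to an undeclared field is rejected with a KeyError.

```python
class SketchItem(dict):
    """Hypothetical stand-in for an Item: a dict restricted to declared keys."""

    fields = ()  # subclasses list the keys they allow

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class DmozSketchItem(SketchItem):
    # Same three fields as DmozItem above
    fields = ("title", "link", "desc")


item = DmozSketchItem()
item["title"] = "Example title"        # declared field: stored like a dict entry
item["link"] = "http://www.dmoz.org/"

try:
    item["author"] = "someone"         # undeclared field: rejected
    rejected = False
except KeyError:
    rejected = True

print(sorted(item.keys()), rejected)
```

The point of restricting the keys is that a typo in a field name fails loudly instead of silently creating a new entry.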
Create Spider
A spider grabs the HTML content of a web page (using an HTTP request, same as a web browser), but instead of displaying that HTML as a web page (as a browser might), the spider processes the contents to extract information.
The spider should define:
- a URL or list of URLs to crawl
- how to parse the contents of URLs
The spider's behavior is defined in tutorial/spiders/dmoz_spider.py:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # These URLs end in "/", so the last path segment is at index -2
        filename = response.url.split("/")[-2]
        # Write the raw response body to a file named after that segment
        with open(filename, 'wb') as f:
            f.write(response.body)
This spider requests each of the URLs given in start_urls and passes each response to parse. The parse method names a file after the last path segment of the URL (Books or Resources here) and writes the entire body of the HTTP response into that file.
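The filename logic can be tried on its own, without running Scrapy. This short sketch splits the two start URLs the same way the parse method does and shows why index -2 is the useful one:

```python
urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in urls:
    parts = url.split("/")
    # The trailing "/" leaves an empty string as the last element,
    # so the last real path segment sits at index -2
    print("last element: %r, second to last: %r" % (parts[-1], parts[-2]))

# The filenames the spider would write to
filenames = [url.split("/")[-2] for url in urls]
print(filenames)
```

Using index -1 on these URLs would give an empty filename, and open() would fail.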
If you wanted to download non-HTML content, such as a PDF file, the spider works the same way:
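from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    # Match the domain of the start URL below
    allowed_domains = ["charlesmartinreid.com"]
    start_urls = [
        "http://www.charlesmartinreid.com/files/TabbloidExample.pdf"
    ]

    def parse(self, response):
        # The URL ends in the file name, so the last segment already has the extension
        filename = response.url.split("/")[-1]
        # Binary mode keeps the PDF bytes intact
        with open(filename, 'wb') as f:
            f.write(response.body)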
With the above method you probably won't need to add an extension, since it is already part of the last segment of the split URL. If you build the filename from any other string, you'll probably have to add the file extension yourself:
filename = response.url.split("/")[-2] + '.pdf'
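Both cases can be folded into one function. The following is a rough sketch, not part of Scrapy: filename_from_url and its default_ext parameter are hypothetical names, and it ignores query strings and other URL details.

```python
import os.path


def filename_from_url(url, default_ext=".pdf"):
    """Hypothetical helper: derive a local filename from a URL."""
    # Drop empty pieces so a trailing "/" does not yield an empty name
    segments = [s for s in url.split("/") if s]
    name = segments[-1]
    # Only append the default extension when the segment has none
    if not os.path.splitext(name)[1]:
        name += default_ext
    return name


print(filename_from_url("http://www.charlesmartinreid.com/files/TabbloidExample.pdf"))
print(filename_from_url("http://www.charlesmartinreid.com/files/"))
```

The first call keeps TabbloidExample.pdf as it is; the second has no extension in its last segment, so it becomes files.pdf.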
Run Spider
To run the spider named dmoz, execute the following from the project directory:
$ scrapy crawl dmoz
Errors
Errors Creating a Project
Module object has no attribute HTML_PARSE_RECOVER
You may see an error related to a missing attribute HTML_PARSE_RECOVER when you try to create a new project:
$ scrapy startproject tutorial
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
execute()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 112, in execute
cmds = _get_commands_dict(inproject)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
cmds = _get_commands_from_module('scrapy.commands', inproject)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
for cmd in _iter_command_classes(module):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
for module in walk_modules(module_name):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
submod = __import__(fullpath, {}, {}, [''])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
from scrapy.shell import Shell
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 14, in <module>
from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
from scrapy.selector.libxml2sel import *
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
from .factories import xmlDoc_from_html, xmlDoc_from_xml
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
This error indicates a problem with the libxml2 or libxslt installation. First, follow the instructions on the Libxml and libxslt page to make sure your installation steps are correct. Also check that libxml2 and libxslt are on your PYTHONPATH:
$ echo $PYTHONPATH
and make sure there are no minor typos (e.g. libxml vs libxml2). If you keep seeing this error even after installing libxml2 and libxslt, the libraries (or their Python bindings) are still not installed correctly.
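A quick way to narrow this down is to ask Python directly which libxml2 module it imports and whether that module has the constants Scrapy wants. This is only a diagnostic sketch; missing_attributes is a hypothetical helper, and the two constant names are taken from the traceback above.

```python
import importlib


def missing_attributes(module_name, names):
    """Return the names that the module lacks, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # __file__ shows which copy on the path was actually imported
    print("%s imported from %s"
          % (module_name, getattr(module, "__file__", "(built-in)")))
    return [name for name in names if not hasattr(module, name)]


result = missing_attributes("libxml2", ["HTML_PARSE_RECOVER", "HTML_PARSE_NOERROR"])
if result is None:
    print("libxml2 could not be imported at all")
elif result:
    print("libxml2 imported, but is missing: %s" % ", ".join(result))
else:
    print("libxml2 has both constants")
```

If the printed path points somewhere unexpected, a different module named libxml2 on your PYTHONPATH may be shadowing the real bindings.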
KeyError: Spider Not Found
If you see a KeyError related to the spider not being found, like this one,
$ scrapy crawl dmoz
2012-05-12 15:11:20-0700 [scrapy] INFO: Scrapy 0.14.3 started (bot: tutorial)
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-05-12 15:11:21-0700 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/Current/bin/scrapy", line 5, in <module>
pkg_resources.run_script('Scrapy==0.14.3', 'scrapy')
File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
File "/Users/charles/pkg/mechanize/std/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
execute()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 132, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 97, in _run_print_help
func(*a, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/cmdline.py", line 139, in _run_command
cmd.run(args, opts)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/commands/crawl.py", line 43, in run
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/spidermanager.py", line 43, in create
raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'
it's probably because your spiders are not in the right place. Spiders should go in the file NAMEOFPROJECT/spiders/NAMEOFSPIDER_spider.py. Also check that the name you pass to scrapy crawl matches the name attribute of the spider class.
For example, if I'm working on the "tutorial" project, and I have a spider named dmoz, it should be defined in the file
tutorial/spiders/dmoz_spider.py
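If you want to check the location from a script, something like the following sketch works. The helper expected_spider_path is hypothetical and simply encodes the naming convention described above; it does not ask Scrapy where it actually looks for spiders.

```python
import os


def expected_spider_path(project, spider_name):
    """Build the path that the convention above expects for a spider file."""
    return os.path.join(project, "spiders", spider_name + "_spider.py")


path = expected_spider_path("tutorial", "dmoz")
if os.path.isfile(path):
    print("found %s" % path)
else:
    print("missing %s (check the directory you are running from)" % path)
```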