Let's start by creating a virtual environment:
python3 -m venv ~/.venv/bookspider
source ~/.venv/bookspider/bin/activate
Now we can install Scrapy:
pip install scrapy
Let's generate the initial directory structure:
scrapy startproject bookspider
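This scaffolds a project skeleton; roughly (the exact set of files varies slightly between Scrapy versions) it looks like this:
bookspider/
    scrapy.cfg            # deploy configuration
    bookspider/           # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider classes live here
            __init__.py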
items.py allows us to define a data object model for the pages crawled by the spider. This class encapsulates each piece of data being scraped.
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()
file_urls and files are special fields which must be explicitly defined to scrape binary files (images, PDFs, MP3s, etc.).
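We only ever fill in file_urls ourselves; once the FilesPipeline (enabled in settings.py below) downloads a file, it appends a result entry to files automatically. Each entry is a dict of roughly this shape (URL, path, and checksum values are illustrative):
{'url': 'http://books.toscrape.com/media/cache/...jpg',
 'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
 'checksum': '2f8aaa0ba5e36fc3d5333e4a357b2f20'}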
Create a spider inside the spiders/ directory. It's a class that inherits from scrapy.Spider and must have the name and start_urls fields defined.
import scrapy

from bookspider.items import BookItem

class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = [
        'http://books.toscrape.com/catalogue/category/books/romance_8/page-1.html',
        'http://books.toscrape.com/catalogue/category/books/romance_8/page-2.html',
    ]

    def parse(self, response):
        # Follow the link to each book's detail page on the listing page
        for book in response.css('li article.product_pod'):
            href = book.css('a::attr(href)').extract_first()
            full_url = response.urljoin(href)
            yield scrapy.Request(full_url, callback=self.parse_book)

    def parse_book(self, response):
        # Extract the book details and the absolute URL of its cover image
        title = response.css('h1::text').extract_first()
        description = response.css('.product_page > p::text').extract()
        src = response.css('.product_page .thumbnail img::attr(src)').extract_first()
        cover = response.urljoin(src)
        yield BookItem(title=title, description=description, file_urls=[cover])
start_urls is a list of the seed URLs the spider will crawl first. We are targeting the Scrapy playground available at books.toscrape.com, which provides a dummy book store.
Every spider must have at least a parse() method that handles the URLs defined in start_urls. This method can either yield further requests, which will trigger other pages to be crawled, or return the scraped values.
Scrapy allows us to traverse the DOM using both CSS and XPath selectors.
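For example, the title lookup in parse_book() could be written either way; these two expressions select the same node:
response.css('h1::text').extract_first()
response.xpath('//h1/text()').extract_first()
A convenient way to experiment with selectors is scrapy shell <url>, which drops you into an interactive session with the response already loaded.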
The last thing is to enable the item pipelines in settings.py so that Scrapy automatically downloads every file put into file_urls:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'book_cover_images'
Additionally, we must define the FILES_STORE setting, i.e. the path to the output directory where the downloaded images will be stored.
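Note that the FilesPipeline names each download after a SHA-1 hash of its source URL and places it in a full/ subdirectory, so the resulting layout looks like this (the hash shown is illustrative):
book_cover_images/
    full/
        0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg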
Finally, let's run the scraper:
scrapy crawl book -o book.jl
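The -o flag appends the scraped items to book.jl in JSON Lines format, one object per line. Each line of the output will look roughly like this (values are illustrative):
{"title": "Chase Me", "description": ["..."], "file_urls": ["http://books.toscrape.com/media/cache/...jpg"], "files": [...]}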