How to Use Scrapy for Image Download using Pipelines in Python

Scrapy is one of the most popular scraping frameworks in Python, and its features are optimized for web scraping.

In an e-commerce business, for example, you may need to scrape product images to compare competitors' products with your own. Scrapy makes downloading images quick and easy. This article shows exactly how to use Scrapy to download images from a website.

First of all, you need Scrapy installed in your environment.

pip install scrapy

Create Scrapy Project:

scrapy startproject ImageDownload

Create your spider inside the project you just created.

cd ImageDownload
scrapy genspider spiderName www.example.com

Before writing any code, install the Pillow library, which the images pipeline needs to process images.

pip install Pillow

settings.py:

IMAGES_STORE = 'images'  # folder name or path where images are saved
DOWNLOAD_DELAY = 2  # seconds to wait between consecutive requests to the same site
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}  # enable the built-in images pipeline

items.py (the item must have these two fields):

from scrapy import Field, Item


class ImageItem(Item):
    image_urls = Field()  # URLs to download, filled in by the spider
    images = Field()      # download results, filled in by the pipeline

Now it’s time to code your spider!

Let’s do it!

import scrapy
from ImageDownload.items import ImageItem


class ScrapeSpider(scrapy.Spider):
    name = 'spiderName'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        item = ImageItem()
        if response.status == 200:
            # Collect the src attribute of every <img> tag on the page.
            rel_img_urls = response.xpath("//img/@src").extract()
            item['image_urls'] = self.url_join(rel_img_urls, response)
        return item

    def url_join(self, rel_img_urls, response):
        # Turn relative image URLs into absolute ones.
        joined_urls = []
        for rel_img_url in rel_img_urls:
            joined_urls.append(response.urljoin(rel_img_url))

        return joined_urls

image_urls must be a list and must contain ABSOLUTE URLs, which is why you sometimes need a helper function that converts relative URLs into absolute ones.
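To see what the relative-to-absolute conversion does, here is a small standalone sketch using urllib.parse.urljoin from the standard library, which response.urljoin builds on. The page URL and the src values are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and <img src="..."> values, for illustration only.
page_url = "https://www.example.com/products/index.html"
srcs = [
    "/static/a.jpg",                   # relative to the site root
    "img/b.png",                       # relative to the current page
    "//cdn.example.com/d.jpg",         # protocol-relative
    "https://cdn.example.com/c.gif",   # already absolute
]

# Resolve every src against the page URL.
absolute_urls = [urljoin(page_url, src) for src in srcs]

for url in absolute_urls:
    print(url)
```

URLs that are already absolute pass through unchanged, so it is safe to run every src value through the same function.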

Run the spider with scrapy crawl spiderName. If everything works, the console log shows the scraped item with its images field filled in for every downloaded image.

Custom Names for ImageItem Fields:

You can use custom field names such as custom_image_urls and custom_images instead of the default names (image_urls, images) by setting them in settings.py. The item class must then declare fields with the same names.

IMAGES_URLS_FIELD = 'instead_of_image_urls_field_name'
IMAGES_RESULT_FIELD = 'instead_of_images_field_name'

Creating Thumbnails:

Scrapy has a thumbnail feature built into its images pipeline. You can enable it in settings.py like this:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (260, 260),
}

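For every downloaded image, Scrapy will generate one thumbnail per size you define, here a 'small' and a 'big' one. Each size is a (width, height) limit, and the aspect ratio of the image is preserved.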
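It helps to know where these files end up inside IMAGES_STORE. The sketch below assumes Scrapy's default naming scheme as described in its documentation: full-size images go to full/<sha1 of URL>.jpg and thumbnails to thumbs/<size name>/<sha1 of URL>.jpg. The helper names and the image URL are made up for illustration; this is not the pipeline's own code.

```python
import hashlib


def url_hash(url):
    """SHA-1 hex digest of the image URL, used as the default file name."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def default_full_path(url):
    """Assumed default location of the full-size image inside IMAGES_STORE."""
    return f"full/{url_hash(url)}.jpg"


def default_thumb_path(url, size_name):
    """Assumed default location of one thumbnail inside IMAGES_STORE."""
    return f"thumbs/{size_name}/{url_hash(url)}.jpg"


# Hypothetical image URL.
image_url = "https://www.example.com/static/a.jpg"

print(default_full_path(image_url))
for size_name in ("small", "big"):
    print(default_thumb_path(image_url, size_name))
```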

Scrapy Bonus:

Expiration:

Scrapy skips images that were already downloaded in a recent crawl, and you can change the expiration period in the settings.

IMAGES_EXPIRES = 2  # images expire after 2 days (default: 90 days)
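The idea behind this setting is an age check: the stored file's last-modified time is compared with the current time, and the image is downloaded again only when it is older than the configured number of days. The following is a simplified sketch of that check, not Scrapy's actual code; is_expired and its parameters are names invented here.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def is_expired(last_modified, expires_days, now=None):
    """Return True when a stored image is older than expires_days.

    last_modified and now are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


# Hypothetical timestamps: one file saved 3 days ago, one saved 1 day ago.
now = 1_700_000_000
print(is_expired(now - 3 * SECONDS_PER_DAY, expires_days=2, now=now))
print(is_expired(now - 1 * SECONDS_PER_DAY, expires_days=2, now=now))
```

With IMAGES_EXPIRES = 2, the first file would be fetched again on the next crawl and the second would be skipped.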

Renaming image files:

By default, Scrapy saves each image under the SHA-1 hash of its URL, but you may want readable file names that tell you what is in the image.

You can give the files whatever names you want!

To do that, you have to override two methods of ImagesPipeline:

get_media_requests()
file_path()

So how do you do that? Have a look at the code below.

pipelines.py:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class CustomImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Attach the desired file name to each image request via meta.
        return [scrapy.Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None, *, item=None):
        # All images of one item share this name, so use one URL per item.
        return '%s.jpg' % request.meta['image_name']
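One thing to watch out for: file_path() above returns the same name for every image of an item, so an item with several image URLs would keep overwriting one file. A name containing characters such as "/" would also end up creating subfolders. Below is a minimal sketch of a helper that avoids both problems; image_filename, its parameters, and the sample name are hypothetical and not part of Scrapy.

```python
import re


def image_filename(image_name, index, extension="jpg"):
    """Build a file-system-friendly, unique file name for one image of an item."""
    # Replace every run of characters other than letters, digits, "_" and "-".
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", image_name).strip("_")
    # Fall back to a generic name if nothing usable is left.
    return f"{slug or 'image'}_{index}.{extension}"


# Hypothetical product name with three images.
names = [image_filename("Red Shoes / Size 42", i) for i in range(3)]
print(names)
```

One way to wire it in is to loop over the URLs with enumerate() in get_media_requests(), store the index in meta next to image_name, and return image_filename(...) from file_path().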

After this, update items.py and add one more field named “image_name” to the item class.

from scrapy import Field, Item


class ImageItem(Item):
    image_name = Field()
    image_urls = Field()
    images = Field()
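Scrapy only runs the pipelines listed in ITEM_PIPELINES, so the custom class also has to be registered in settings.py in place of the built-in one. A sketch, assuming the project package is named ImageDownload (as in the spider's import) and the class lives in pipelines.py:

```python
# settings.py
# Assumption: project package "ImageDownload", class defined in pipelines.py.
ITEM_PIPELINES = {
    'ImageDownload.pipelines.CustomImagesPipeline': 1,
}
```

The spider also has to give every item a value for the new field before returning it, for example item['image_name'] = 'homepage' (a made-up name); otherwise item["image_name"] in get_media_requests() raises a KeyError.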

That’s All.

Need more help, or want me to handle your web scraping, data mining, or data collection project? Just message me on Fiverr.
