Crawl Product Details on Decathlon Pages Using Scrapy-Splash
In this tutorial, we will scrape product details by following links using the Scrapy-Splash plugin.
First Steps
Create a virtual environment to avoid package conflicts, install the necessary packages, and start a Scrapy project.
Create a Scrapy Project
Install scrapy:
pip install Scrapy
If you have trouble installing Scrapy through pip, you can use conda instead (see the Scrapy installation docs):
conda install -c conda-forge scrapy
Start the project with:
scrapy startproject productscraper
cd productscraper
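The startproject command generates the standard Scrapy layout, roughly like this (exact files may vary slightly between Scrapy versions):
productscraper/
    scrapy.cfg            # deploy configuration file
    productscraper/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings (we will edit this later)
        spiders/          # your spiders go here
            __init__.py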
Also, install scrapy-splash, as we will use it later in the tutorial. I assume you already have Docker installed on your device; otherwise, go ahead and install it first. You will need it to run the Splash service, but you don’t need to know how containers work for this project.
# install it inside your virtual env
pip install scrapy-splash

# this command will pull the splash image and run the container for you
docker run -p 8050:8050 scrapinghub/splash
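If you want to confirm the container is up, list the running containers; you should see the scrapinghub/splash image bound to port 8050:
docker ps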
Now you are ready to scrape data from the web. Let’s try to get some data before using Scrapy-Splash. In this tutorial I will scrape Decathlon’s women’s shoes collection page: https://www.decathlon.com/collections/womens-shoes. Feel free to try different links and websites as well. Take some time to view the page source and inspect the elements you want to extract.
To learn more about Scrapy selectors, check out the documentation.
Open Shell
Use the shell to test the elements you want to extract before running the spider as a script. This way you save time, avoid making the same requests over and over, and reduce the risk of getting banned from the website.
Open your scrapy shell with:
scrapy shell
Now you can try to extract elements here and see if it works. First, fetch the link and check the response. If it does not return 200, check the link in your browser; it might be broken or contain a typo.
>>> fetch('https://www.decathlon.com/collections/womens-shoes')
2021–05–15 12:14:52 [scrapy.core.engine] INFO: Spider opened
2021–05–15 12:14:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.decathlon.com/collections/womens-shoes> (referer: None)
>>> response
<200 https://www.decathlon.com/collections/womens-shoes>
The plan is to get the product URLs on this page, visit them one by one, and scrape the product details.
Try to get one of the product links by selecting the link element:
>>> response.css('a.js-de-ProductTile-link::attr(href)').get()
'/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100'
To grab all of the matching elements, use getall().
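For example, on the same response (output truncated here for readability):
>>> response.css('a.js-de-ProductTile-link::attr(href)').getall()
['/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100', ...]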
Since we get the URLs correctly, we can now fetch one of the product pages and see if we also get the product details correctly.
>>> fetch('https://www.decathlon.com/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100')
2021–05–15 12:32:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.decathlon.com/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100> (referer: None)
Try to get the name of the product:
>>> response.css('h1.de-u-textGrow1::text').get()
"\n Quechua NH100 Mid-Height Hiking Shoes, Women's\n "
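The text comes back padded with whitespace; calling strip() on it gives a clean value:
>>> response.css('h1.de-u-textGrow1::text').get().strip()
"Quechua NH100 Mid-Height Hiking Shoes, Women's"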
Try to get the description, price, image URL:
>>> response.css('h3.de-u-textGrow3::text').get()
"\n Quechua NH100 Mid-Height Hiking Shoes, Women's is designed for Half-day hiking in dry weather conditions and on easy paths.\n "
>>> response.css('span.js-de-PriceAmount::text').get()
'\n $24.99\n '
>>> response.css('img.de-CarouselFeature-image::attr(src)').get()
'//cdn.shopify.com/s/files/1/1330/6287/products/2dbcb677-82a9-48af-92fd-e803f2edfd69_675x.progressive.jpg?v=1608271582'
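Note that the image src is protocol-relative (it starts with //). If you need an absolute URL, response.urljoin() resolves it against the scheme of the current page:
>>> response.urljoin(response.css('img.de-CarouselFeature-image::attr(src)').get())
'https://cdn.shopify.com/s/files/1/1330/6287/products/2dbcb677-82a9-48af-92fd-e803f2edfd69_675x.progressive.jpg?v=1608271582'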
So far so good. Let’s now try to get the other images. You will notice they are in a slider that requires clicking a button to reveal them.
>>> response.css('img.de-CarouselThumbnail-image::attr(srcset)').getall()
[]
Our Scrapy spider cannot select the other images because they are rendered by JavaScript. This is where the Scrapy-Splash plugin comes to the rescue.
I assume your container is still running from the docker command above. Check it at http://localhost:8050/. You should see the Splash welcome page, which means Splash is ready to receive requests from you.
Try rendering the same product page through your splash container:
http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100
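Notice that the url parameter is percent-encoded. If you prefer to build such a URL in Python rather than by hand, the standard library does the encoding for you (a small illustrative snippet, not part of the project code):
>>> from urllib.parse import urlencode
>>> 'http://localhost:8050/render.html?' + urlencode({'url': 'https://www.decathlon.com/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100'})
'http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100'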
You should be able to see the product page on your localhost. Go back to your shell and fetch the Splash URL this time.
>>> fetch('http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100')
2021–05–15 13:55:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100> (referer: None)
Now try again for the images:
>>> response.css('img.de-CarouselThumbnail-image::attr(src)').getall()
['//cdn.shopify.com/s/files/1/1330/6287/products/2dbcb677-82a9-48af-92fd-e803f2edfd69_150x.progressive.jpg?v=1608271582',
 '//cdn.shopify.com/s/files/1/1330/6287/products/934cf5a0-71ae-4d21-9912-722210d4fd4b_150x.progressive.jpg?v=1608271582',
 '//cdn.shopify.com/s/files/1/1330/6287/products/7147eb56-43af-4496-b72c-7806755441aa_150x.progressive.jpg?v=1608271583',
 '//cdn.shopify.com/s/files/1/1330/6287/products/85c3af12-f85e-4ab7-b9e9-f7de16aed656_150x.progressive.jpg?v=1608271583',
 '//cdn.shopify.com/s/files/1/1330/6287/products/a17dcbc1-f497-49e0-88db-50d8e7b51d39_150x.progressive.jpg?v=1608271583',
 '//cdn.shopify.com/s/files/1/1330/6287/products/6071bd3a-dcf8-4455-9dcc-f7d5395774d2_150x.progressive.jpg?v=1608271583',
 '//cdn.shopify.com/s/files/1/1330/6287/products/27c8e41b-9e44-43f1-a779-6890ea84693f_150x.progressive.jpg?v=1608271583',
 '//cdn.shopify.com/s/files/1/1330/6287/products/6c0665f4-279e-4954-9ccd-50587e3d51dd_150x.progressive.jpg?v=1608271583',
 '//cdn.shopify.com/s/files/1/1330/6287/products/ca38ef42-07ec-448e-beb0-7e33a81da085_150x.progressive.jpg?v=1608271583',
 '//cdn.shopify.com/s/files/1/1330/6287/products/3ee8337e-bd74-4a2e-a4ff-a2b889dd79e8_150x.progressive.jpg?v=1608271583']
Boom! It is all there. We were able to get all the data we want thanks to Splash.
To integrate Splash with your own Scrapy project, go to settings.py and add these lines:
# Splash setup
# use http://localhost:8050 if Splash runs locally in Docker
SPLASH_URL = 'http://<YOUR-IP-ADDRESS>:8050'

DOWNLOADER_MIDDLEWARES = {
    # optional: rotates user agents; requires the scrapy-random-useragent package
    'random_useragent.RandomUserAgentMiddleware': 400,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
The only thing left is to prepare our spider to extract the data from the page.
Create Your Spider
Normally, to follow links, you would write:
yield response.follow(link, callback=self.parse_products)
With Splash, you just replace response.follow with SplashRequest. You also need to override the start_requests method so that the initial requests are also made through Splash.
See the example below:
import scrapy
from scrapy_splash import SplashRequest


class DecathlonSpider(scrapy.Spider):
    name = 'Decathlonspider'  # you will run the crawler with this name
    start_urls = ['https://www.decathlon.com/collections/womens-shoes']

    # With Splash, override start_requests so the initial requests
    # also go through the Splash server.
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 1})

    # Extract the product links and start another SplashRequest to follow them.
    def parse(self, response):
        links = response.css('a.js-de-ProductTile-link::attr(href)').getall()
        for link in links:
            splash_link = 'https://www.decathlon.com' + link
            yield SplashRequest(splash_link, callback=self.parse_product)

    # Extract the product details.
    def parse_product(self, response):
        datasets = response.css('img.de-CarouselThumbnail-image::attr(srcset)').getall()
        images = []
        # keep the biggest image of each srcset (the last entry)
        for data in datasets:
            images.append(data.split(',')[-1].strip())
        name = response.css('h1.de-u-textGrow1::text').get().strip()
        yield {
            'url': response.url,  # record the page URL instead of the raw response object
            'brand': name.split(' ')[0],  # the title starts with the brand name
            'name': name,
            'price': response.css('span.js-de-PriceAmount::text').get().strip(),
            'mainImage': response.css('img.de-CarouselFeature-image::attr(src)').get(),
            'images': images,
        }
To better understand how Scrapy and spiders work, you can check out this article I wrote.
To run the spider and export the extracted data to a JSON file, run:
scrapy crawl Decathlonspider -o decathlon.json
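Scrapy picks the export format from the file extension, so you could just as well write CSV or JSON Lines:
scrapy crawl Decathlonspider -o decathlon.csv
scrapy crawl Decathlonspider -o decathlon.jl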
It should create a file with the data; otherwise, check the command-line output to debug any errors.
Conclusion
Here we handled JavaScript-rendered content in a Scrapy project using Scrapy-Splash. Splash is a lightweight web browser that is capable of processing multiple pages in parallel and executing custom JavaScript in the page context. You can find more info on Splash itself in the docs.
If you have any questions regarding this, feel free to ask in the comments!