Here's a step-by-step guide.

First, let's create a folder named politics to work in.
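In a terminal (Linux/macOS):

mkdir politics
cd politics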
Setup Environment
Now let's make a virtual environment using the command:
python3 -m venv .venv
Activate it:
source .venv/bin/activate
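(On Windows, the equivalent is .venv\Scripts\activate.)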
Install Dependencies
Let's install Scrapy:

pip install scrapy
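To confirm the install worked, you can run:

scrapy version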
Initiate Project
scrapy startproject politics
cd politics
scrapy genspider wiki en.wikipedia.org
These commands scaffold a Scrapy project named politics and generate a spider called wiki that is limited to en.wikipedia.org. Project structure:
politics/
    scrapy.cfg
    politics/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            wiki.py
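The genspider command created spiders/wiki.py with boilerplate roughly like this (the exact template varies by Scrapy version):

import scrapy


class WikiSpider(scrapy.Spider):
    name = 'wiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org']

    def parse(self, response):
        pass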
Start Scraping

Open politics/spiders/wiki.py and replace the generated boilerplate with the following:
import scrapy


class WikiSpider(scrapy.Spider):
    name = 'wiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_Nepalese_politicians']

    def parse(self, response):
        # Collect every politician link from the list page, skipping
        # in-page anchors such as "#cite_note-...".
        urls = response.css('div.div-col li a:not([href^="#"])::attr(href)').getall()
        for url in urls:
            yield scrapy.Request(f'https://en.wikipedia.org{url}', callback=self.parse_description)

    def parse_description(self, response):
        # The page title (the politician's name); fall back to '' so the
        # concatenation below never fails on an unexpected page.
        title = response.css('#firstHeading > span::text').get() or ''
        # Join the article's intro paragraphs and strip commas so the text
        # doesn't clash with CSV output.
        descriptions = title + ''.join(
            response.css('div.mw-parser-output p:not([class])::text').getall()
        ).replace(',', '')
        yield {
            'title': title,
            'url': response.url,
            'descriptions': descriptions.strip(),
        }
We are using CSS selectors here because they are easy to work with.

response.css('#firstHeading > span::text').get()

pulls the page title, while

descriptions = title + ''.join(response.css('div.mw-parser-output p:not([class])::text').getall()).replace(',', '')

joins the article's intro paragraphs onto the title, with some preprocessing for our use case: commas are stripped so they don't interfere with the CSV output.
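If you want to try the selectors interactively before running a full crawl, scrapy shell is handy. For example, against the list page:

scrapy shell 'https://en.wikipedia.org/wiki/List_of_Nepalese_politicians'
>>> response.css('div.div-col li a:not([href^="#"])::attr(href)').getall()[:5]

This prints the first five extracted hrefs, so you can sanity-check a selector before crawling every page.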
With the spider in place, we need to check that it works. Let's execute a command.

Crawling the Spider
scrapy crawl wiki
This will print the scraped items in the terminal. Hurray, it worked.
Saving the Results
If we want to save the results to a CSV or JSON file, let's execute:
scrapy crawl wiki -O output.csv
For JSON:
scrapy crawl wiki -O output.json
This will create a file called output.csv in the folder where we run the command. The -O flag overwrites an existing output file; use lowercase -o to append instead.
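If you'd rather not pass -O on every run, you can also configure a feed export in settings.py (available since Scrapy 2.1). A minimal sketch:

# politics/settings.py
FEEDS = {
    'output.csv': {
        'format': 'csv',
        'overwrite': True,  # same effect as the -O flag
    },
}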
Find this project on GitHub:
Don't forget to give a follow :)
Thank you for reading.