Scraping using Scrapy

Nepalese Politicians' Information

Here's a step-by-step guide:

Let's create a working folder named politics and move into it.

Setup Environment

Now let's create a virtual environment with:

python3 -m venv .venv

Activate it:

source .venv/bin/activate
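
On Windows, the equivalent activation command is:

.venv\Scripts\activate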

Install Dependencies

Let's install Scrapy:

pip install scrapy
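
To confirm the install worked, Scrapy's command-line tool can print its version:

scrapy version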

Initialize the Project

scrapy startproject politics

cd politics

scrapy genspider wiki en.wikipedia.org
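
The genspider command generates a spider skeleton in politics/spiders/wiki.py. It looks roughly like this (the exact template varies slightly between Scrapy versions):

import scrapy

class WikiSpider(scrapy.Spider):
    name = 'wiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org']

    def parse(self, response):
        pass

We'll replace this skeleton with our own spider in a moment.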

Project Structure

politics/
    scrapy.cfg
    politics/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
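
A quick note on the structure: the spider we write below yields plain Python dicts, so items.py stays untouched. If you prefer declared fields, a minimal sketch of an Item with the same three fields could go there (optional, and only an illustration):

import scrapy

class PoliticianItem(scrapy.Item):
    # Same three fields the spider below yields as a dict
    title = scrapy.Field()
    url = scrapy.Field()
    descriptions = scrapy.Field()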

Start Scraping

Open politics/spiders/wiki.py and replace the generated skeleton with the following spider:

import scrapy

class WikiSpider(scrapy.Spider):
    name = 'wiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_Nepalese_politicians']

    def parse(self, response):
        # Collect the politician links from the list page, skipping
        # in-page anchors such as section links.
        urls = response.css('div.div-col li a:not([href^="#"])::attr(href)').getall()

        for url in urls:
            yield scrapy.Request(f'https://en.wikipedia.org{url}', callback=self.parse_description)

    def parse_description(self, response):
        # The page title lives inside the #firstHeading element.
        title = response.css('#firstHeading > span::text').get() or ''
        # Join the unstyled paragraphs of the article body and strip commas
        # so the text stays inside a single CSV column.
        descriptions = title + ''.join(response.css('div.mw-parser-output p:not([class])::text').getall()).replace(',', '')
        yield {
            'title': title,
            'url': response.url,
            'descriptions': descriptions.strip(),
        }

We are using CSS selectors here because they are concise and easy to work with.

title = response.css('#firstHeading > span::text').get() or ''
descriptions = title + ''.join(response.css('div.mw-parser-output p:not([class])::text').getall()).replace(',', '')

These two lines select the data and also do some preprocessing on descriptions for our use case: the article's paragraphs are joined into one string and commas are removed so the text fits cleanly into a single CSV field.
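
If a selector doesn't match what you expect, Scrapy's interactive shell is a handy way to test it against the live page before editing the spider:

scrapy shell 'https://en.wikipedia.org/wiki/List_of_Nepalese_politicians'

Inside the shell, the same response object is available, so the link selector can be tried directly:

response.css('div.div-col li a:not([href^="#"])::attr(href)').getall()[:5]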

Now that the spider is in place, let's check whether it works by running a command.

Crawling the Spider

scrapy crawl wiki

This will print the scraped items to the terminal as the spider crawls each page. Hurray, it worked.
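
The list contains many politicians, so a full crawl takes a while. For a quick smoke test, Scrapy's built-in close-spider extension can stop the run after a few items:

scrapy crawl wiki -s CLOSESPIDER_ITEMCOUNT=5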

Saving the Results

If we want to save the results to a CSV or JSON file, let's execute:

scrapy crawl wiki -O output.csv

For JSON:

scrapy crawl wiki -O output.json

This will create a file named output.csv (or output.json) in the folder where we run the command. Note that the uppercase -O flag overwrites the file on every run, while a lowercase -o appends to it.
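
As an alternative to passing the output file on every run, the export can be configured once in settings.py via the FEEDS setting. A minimal sketch, assuming a recent Scrapy (FEEDS needs version 2.1+ and the overwrite key 2.4+):

# politics/settings.py
FEEDS = {
    'output.csv': {
        'format': 'csv',       # built-in CSV exporter
        'overwrite': True,     # behave like -O rather than -o
    },
}

With this in place, a plain scrapy crawl wiki writes output.csv automatically.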

Find this project on GitHub:

https://github.com/kuxall/NPPoliticiansScraper

Don't forget to give a follow :)

Thank you for reading.