How to Build Your Own Web Crawler Using an Ubuntu VPS

Dec 24, 2016 @ 6:53 am

If you want to learn how to build your own web crawler using a VPS, have you considered using Scrapy? In this installment of LowEndTutorials, we’ll go over the basic functions of the Scrapy web crawling app.

Scrapy is an open source application used to extract data from websites. The framework is written in Python, which lets your VPS perform crawling tasks in a fast, simple and extensible way.

How to Install Scrapy on Ubuntu 16.04 LTS

Scrapy depends on Python, the Python development libraries and pip.

Python should already be installed on your Ubuntu VPS, so we only have to install pip and the Python development libraries before installing Scrapy.

Before continuing let’s make sure that our system is up to date. Let’s therefore log into our system and gain root privileges using the following command:

> sudo -i

We can now refresh the package lists and make sure Python is installed using the following two commands:

> apt-get update

> apt-get install python

In the next step we are going to install pip. Pip is the replacement for easy_install and is used to install and manage Python packages, most of which come from the Python Package Index. We can install it using the following command:

> apt-get install python-pip

Once pip is installed, we have to install the Python development libraries using the following command:

> apt-get install python-dev

If this package is missing, the installation of Scrapy will generate an error about the Python.h header file. Make sure to check the output of the previous command before continuing with the next steps of the installation.

The Scrapy framework can now be installed with pip. Run the following command:

> pip install scrapy

The installation will take some time and should end with the following message:

“Successfully installed scrapy queuelib service-identity parsel w3lib PyDispatcher cssselect Twisted pyasn1 pyasn1-modules attrs constantly incremental

Cleaning up...”

If you see that, you have successfully installed Scrapy and you are now ready to start crawling the web!
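If you'd like a second check beyond pip's message, here is a minimal sketch that asks Python whether the package can be imported. The helper name is_installed is our own, not part of Scrapy or pip, and the script should be run with the same Python interpreter that pip installed into.

```python
import importlib


def is_installed(module_name):
    """Return True if this interpreter can import `module_name`."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "scrapy" is the import name of the package installed by `pip install scrapy`.
    if is_installed("scrapy"):
        print("Scrapy can be imported")
    else:
        print("Scrapy was not found for this Python interpreter")
```

If the second message appears, pip most likely installed Scrapy for a different Python version than the one running the script.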

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

> scrapy startproject myProject

This will create a “myProject” directory with the following content:

- scrapy.cfg - the project configuration file

- myProject/ - the project's Python module; you'll import your code from here

- items.py - project items definition file

- pipelines.py - project pipelines file

- settings.py - project settings file

- spiders/ - a directory where you'll later put your spiders
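To confirm that the project was generated as described, something like the following sketch can be used. It only checks the entries listed above; the helper name missing_entries and the EXPECTED list are our own, and your Scrapy version may create additional files.

```python
import os

# Entries from the list above, relative to the top-level "myProject" directory.
EXPECTED = [
    "scrapy.cfg",
    os.path.join("myProject", "items.py"),
    os.path.join("myProject", "pipelines.py"),
    os.path.join("myProject", "settings.py"),
    os.path.join("myProject", "spiders"),
]


def missing_entries(project_root):
    """Return the expected files or directories that do not exist under project_root."""
    return [
        entry for entry in EXPECTED
        if not os.path.exists(os.path.join(project_root, entry))
    ]


if __name__ == "__main__":
    # Run this from the directory where `scrapy startproject myProject` was executed.
    missing = missing_entries("myProject")
    if missing:
        print("Missing: %s" % ", ".join(missing))
    else:
        print("Project layout looks complete")
```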

We are now going to create our first spider and execute it to collect some information from the web.

Spiders are classes that you define. Scrapy uses spiders to scrape information from a website (or a group of websites). This is the code for our first Spider. Save it in a file named “quotes_spider.py” under the “myProject/spiders” directory in your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

This code visits the following two webpages, which contain quotes from different authors, and saves them in HTML files named quotes-1.html and quotes-2.html:

http://quotes.toscrape.com/page/1/

http://quotes.toscrape.com/page/2/
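To see where those file names come from, here is a small standalone sketch of the naming logic in parse(). It needs no Scrapy at all; the function name filename_for is our own.

```python
def filename_for(url):
    """Mirror the naming logic used in the spider's parse() method."""
    # Splitting 'http://quotes.toscrape.com/page/1/' on '/' gives a list ending in
    # 'page', '1', '' so the second-to-last element is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


for url in [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
]:
    print("%s -> %s" % (url, filename_for(url)))
```

Note that this relies on the trailing slash. Without it, the second-to-last element would be "page" rather than the page number.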

Once you have saved the file with the code you are ready to execute your first crawler using the two following commands:

> cd myProject

> scrapy crawl quotes

The execution of the spider should end with the following line:

 

“…..[scrapy] INFO: Spider closed (finished)”

If you list the files in your current directory you should see the new html files generated by the spider:

 
quotes-1.html

quotes-2.html

In the following example we are going to follow the links to each author's page, extract the author's information and save the result in a JSON Lines file. First, create a new spider in a file named author_spider.py with the following content:

 
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
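The spider calls response.urljoin() because the links it finds on a page are usually relative. As a rough illustration, the standard library's urljoin resolves relative links against a base URL in a similar way. The helper name absolute and the example links below are made up for the demonstration and are not taken from the site.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolute(page_url, href):
    """Resolve a link found on page_url into a full URL."""
    return urljoin(page_url, href)


page_url = "http://quotes.toscrape.com/page/2/"

# Hypothetical relative links of the kind a spider might extract from a page.
print(absolute(page_url, "/author/Example-Author"))
print(absolute(page_url, "/page/3/"))
```

The resulting absolute URL is what gets passed to scrapy.Request in the spider.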

We can now execute this new crawler with the following command:

> scrapy crawl author -o author.jl

This will create a file named author.jl with the content of the extraction. The JSON Lines format is useful because it is stream-like: every line is a separate JSON record, so you can easily append new records to it.

This is just a brief overview of the Scrapy app. It looks like you could perform some pretty sophisticated tasks using Scrapy on your Ubuntu VPS.
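Here is a minimal sketch of working with such a file using only Python's standard library. The helper names are our own and the records are placeholders written to a temporary file, not real crawl output. The field names match the ones yielded by the spider.

```python
import json
import os
import tempfile


def append_json_line(path, record):
    """Append one record to a JSON Lines file as a single line."""
    with open(path, "a") as handle:
        handle.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Placeholder data in a temporary file, so a real author.jl is left untouched.
path = os.path.join(tempfile.mkdtemp(), "author.jl")
append_json_line(path, {"name": "Example Author",
                        "birthdate": "January 1, 1900",
                        "bio": "Placeholder biography."})
append_json_line(path, {"name": "Another Example Author",
                        "birthdate": "February 2, 1950",
                        "bio": "Another placeholder."})

for record in read_json_lines(path):
    print(record["name"])
```

Because each record sits on its own line, appending never requires rewriting the records that are already in the file.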

If you’d like to learn more about Scrapy, the best thing to do is to take a deep dive into Scrapy’s documentation.


Jon Biloh


1 Comment

ChrisR:
Don’t forget apt-get upgrade after the apt-get update for some systems.
January 9, 2017 @ 1:26 am