LowEndBox

Hosting Websites on Bare Minimum VPS/Dedicated Servers

How to Archive Websites on your VPS using Httrack

Date/Time: March 1, 2017 @ 5:56 pm, by Matt Zelasko

There might come a time when you’d want to back up a website to your own VPS. Perhaps you’d want to mirror, archive, or preserve a website in its entirety. Luckily, Httrack is an app that lets you accomplish this quite easily.

In this tutorial, we’ll use an Ubuntu 16.04 image. Depending on the amount of data that you’ll be backing up, you might want to get a VPS with enough disk space to accommodate your backup.

At the end of the tutorial, you’ll be able to mirror a website and publish it publicly using Apache. Let’s go ahead and get started.
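If you’re not sure how much room you have, a quick check with df can help before you start a large mirror. The snippet below is only a sketch: the helper names free_kb and has_room, the /tmp path, and the 1 GB threshold are all assumptions chosen for illustration, so adjust them to your own setup.

```shell
#!/bin/sh
# Print the free space (in KB) on the filesystem that holds the given path.
# -P asks df for the portable format, one line per filesystem.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the filesystem holding path $1 has at least $2 KB free.
has_room() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Hypothetical threshold: 1 GB (1048576 KB) free in /tmp.
if has_room /tmp 1048576; then
    echo "At least 1 GB free in /tmp"
else
    echo "Less than 1 GB free in /tmp"
fi
```

A full mirror of a large site can take far more space than its homepage suggests, so it’s worth picking a threshold with some headroom.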

Update Your Repository

> sudo apt-get update

Install Httrack

Once your repository is updated, you’re ready to kick off the Httrack installer.

> sudo apt-get install httrack -y

Congrats, you’ve installed Httrack! Let’s test it out to make sure it works.

Test Out Httrack

The following command will allow you to back up the homepage of Ubuntu.com:

> httrack "https://www.ubuntu.com/" -O "/tmp/www.ubuntu.com/"

The -O switch dictates where the output of the mirrored homepage will reside. In the above example, we’re simply putting the contents of the archived website in the /tmp directory. Httrack is extremely powerful in the sense that if we put this data in Apache’s HTML directory instead, we can see the copied website through a browser.
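To confirm that the test run actually wrote something, you can look inside the output directory. This is a minimal sketch and makes a couple of assumptions: that Httrack left a top-level index.html and its hts-log.txt log file in the directory passed to -O, and that the mirror lives in /tmp/www.ubuntu.com as in the command above. The check_mirror helper is a hypothetical name, not part of Httrack.

```shell
#!/bin/sh
# Directory that was passed to -O in the test command.
mirror_dir="/tmp/www.ubuntu.com"

# Report whether a directory looks like a finished Httrack mirror.
check_mirror() {
    dir=$1
    if [ ! -f "$dir/index.html" ]; then
        echo "No index.html in $dir"
        return 1
    fi
    echo "Mirror found in $dir"
    # Show the tail of Httrack's log, if it is there.
    if [ -f "$dir/hts-log.txt" ]; then
        tail -n 5 "$dir/hts-log.txt"
    fi
}

check_mirror "$mirror_dir" || echo "Run the httrack command first"
```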

Let’s go ahead and try this out.

Install Apache

Let’s go back to our SSH session and kick off the Apache install.

> sudo apt-get install apache2 -y

Open Up Your Firewall for Apache

This is a precautionary step to ensure that the proper ports are open on your server.

> sudo ufw allow in "Apache Full"

To make sure that this worked, go to http://<YOURIPADDRESS> in your browser and you’ll see the default Apache page.
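If you’d rather check from the SSH session than from a browser, curl can report the HTTP status code that Apache returns. The sketch below assumes curl is installed and that Apache is listening on localhost port 80; the helper names http_status and is_ok are made up for this example.

```shell
#!/bin/sh
# Print only the HTTP status code for a URL (requires curl).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Succeed only for a plain "200 OK" response.
is_ok() {
    [ "$1" = "200" ]
}

if command -v curl >/dev/null 2>&1; then
    code=$(http_status "http://localhost/") || true
    if is_ok "$code"; then
        echo "Apache is answering on port 80"
    else
        echo "Unexpected status: $code"
    fi
else
    echo "curl is not installed"
fi
```

This only tests the server from the inside. To confirm the firewall rule, you’d still want to load the page from another machine.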

Mirror the Ubuntu Homepage Publicly

With this command, we’ll mirror the Ubuntu homepage to the default Apache directory. Since /var/www/html is owned by root by default, we’ll run it with sudo.

> sudo httrack "https://www.ubuntu.com/" -O "/var/www/html/www.ubuntu.com"

So if you want to test this out, you can go to http://<YOURIPADDRESS>/www.ubuntu.com/

Use Httrack with a Proxy

You might find yourself in a situation where it’s prudent to mirror websites through a proxy. Here’s how to do this:

> sudo httrack "https://www.ubuntu.com/" -O "/var/www/html/www.ubuntu.com" -P <user>:<pass>@<proxy>:<port>

You’d replace the following placeholders in the above command with your proxy server’s information:

  • Proxy username: <user>
  • Proxy password: <pass>
  • Proxy server’s name: <proxy>
  • Proxy server’s port: <port>

Tip: You can find a list of updated proxy servers at ProxyNova.com. If your proxy server doesn’t require a username and password, you can simply delete the @ sign and everything before it, back to the -P switch.

Your command would look like this:

> sudo httrack "https://www.ubuntu.com/" -O "/var/www/html/www.ubuntu.com" -P <proxy>:<port>
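If you switch between proxies often, the two forms of the -P value can be put together by a small helper. This is just a sketch: proxy_arg is a hypothetical function name, and the host, port, username, and password below are placeholder values rather than a real proxy.

```shell
#!/bin/sh
# Build the value for Httrack's -P switch.
# Usage: proxy_arg HOST PORT [USER PASS]
proxy_arg() {
    if [ -n "${3:-}" ]; then
        # With credentials: <user>:<pass>@<proxy>:<port>
        printf '%s:%s@%s:%s\n' "$3" "${4:-}" "$1" "$2"
    else
        # Without credentials: <proxy>:<port>
        printf '%s:%s\n' "$1" "$2"
    fi
}

# Placeholder proxy details, for illustration only.
proxy_arg proxy.example.com 3128
proxy_arg proxy.example.com 3128 alice secret

# The result could then be passed to httrack, for example:
#   sudo httrack "https://www.ubuntu.com/" -O "/var/www/html/www.ubuntu.com" \
#       -P "$(proxy_arg proxy.example.com 3128)"
```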

More Reading on Httrack

On the Httrack website, you’ll find a complete “User’s Guide” that was written by Fred Cohen. This guide will give you the ins and outs of the Httrack app. For example, with the commands listed in the User’s Guide, you’ll be able to throttle the rate at which you grab web pages and use specific parameters to grab exactly what you need to mirror.
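As a rough illustration of throttling, the sketch below puts together a gentler mirror command using three options from the Httrack manual: --depth (how many links deep to follow), --sockets (number of simultaneous connections), and --max-rate (maximum transfer rate in bytes per second). The specific values and the polite_opts helper name are assumptions for this example, so check the User’s Guide before settling on your own limits. The script only prints the command so that you can review it before running it.

```shell
#!/bin/sh
# Build a set of throttling options for httrack.
# Usage: polite_opts DEPTH SOCKETS MAX_RATE_BYTES_PER_SEC
polite_opts() {
    printf '%s\n' "--depth=$1 --sockets=$2 --max-rate=$3"
}

# Hypothetical limits: 2 links deep, 2 connections, about 25 KB/s.
opts=$(polite_opts 2 2 25000)

# Print the full command for review instead of running it.
echo "httrack \"https://www.ubuntu.com/\" -O \"/tmp/www.ubuntu.com\" $opts"
```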

Have you mirrored a website with Httrack? Tell us about your experiences with Httrack in the comments section below.

 
