
Run Your Own AI LLM Model on a LowEnd VPS for Only $2.49 a Month! Part One: Ollama and the Model

Large Language Models (LLMs) have exploded in popularity, and typically they run in massive datacenters operated by tech giants.  But here on LowEndBox, we’re all about democratizing computing.  Is it possible to run your own AI on a cheap VPS?

Yes!

Now, to be clear, you can’t self-host a digital brain the size of ChatGPT or Claude. Those models  have billions upon billions of parameters and require a lot of specialized compute (GPU) to run.  Even something like Llama 3, which is a generation back from Meta’s latest model, is not going to be happy without a GPU, and consumer GPUs push prices outside the LowEnd realm.

So how can we run an LLM on a LowEnd VPS?  Quantization.

What is Quantization?

Quantization is the process of reducing the precision of the numbers used to represent a neural network’s weights and activations.  Most AI models are originally trained using float32 (32-bit floating-point numbers).  These offer high precision but consume lots of memory and compute.  Quantization shrinks this down to 8-bit, 4-bit, or even fewer bits, drastically reducing the model size.

So instead of 2³² possible values per weight, you get 2⁴, which is 16 possible values.  This can radically shrink the size of models and hence the amount of RAM they use.

Of course, there’s a tradeoff.  Because of the loss of precision, you lose subtlety in the language and get weaker performance on complex tasks.  Or to put it more simply, a 4-bit model is not going to be anywhere near as good as a 32-bit model.

In this example, we’ll be using TinyLlama, a 1.1B-parameter LLM quantized down to 4 bits.
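As a rough back-of-the-envelope calculation (counting only the weights themselves, so this undercounts a bit):

1.1B parameters × 4 bytes (float32) ≈ 4.4 GB
1.1B parameters × 0.5 bytes (4-bit) ≈ 0.55 GB

That’s the difference between hopeless on a 3.5 GB VPS and perfectly comfortable.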

What Kind of VPS Do You Need?

I’m going to be using a RackNerd VPS with these specs:

  • 3 vCPU Cores
  • 60 GB Pure SSD Storage
  • 3.5 GB RAM
  • 5TB Monthly Transfer
  • Only $29.89/YEAR!

That’s only $2.49/month, which is a fantastic deal.  You can GET YOUR OWN HERE.  I placed mine in Los Angeles, CA, but these are available in multiple datacenters in North America, including San Jose, Seattle, Chicago, Dallas, New York, Ashburn, and Toronto.

Setup

I’m using Debian 12 and Ollama, a tool for running LLMs.

Ollama is not in apt.  You have a couple options.

Option #1: Install Script (recommended)

There’s a nice install script from Ollama.  It will install the product and configure it to start at boot.

I recommend going this route.  All you need to do is:

  • Go to ollama.com
  • Click Download in the upper right
  • Select Linux
  • Click the copy icon on the script and paste it into a root shell

[Screenshot: Ollama install script]

That will install Ollama, enable it, and start it.
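In case you’re curious what you’re pasting, at the time of writing the one-liner from ollama.com looks like this (copy the current version from the site rather than trusting a blog post, since it can change):

curl -fsSL https://ollama.com/install.sh | sh

Once it finishes, you can confirm the service came up with:

systemctl status ollama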

Option #2: Homebrew

Ollama is in Homebrew.  I initially went down this path, but I’ve found Homebrew to be a bit awkward and buggy on Linux.  It’s awesome on macOS.  But on Linux it manages services through its own parallel systemd setup, and I kept running into errors (and found others had the same experience).  So again…practically mandatory on macOS, but I prefer not to use it on Linux.

But if you want to try it out, perhaps your experience will be different (or perhaps I’m clueless).

To install Homebrew, visit brew.sh, copy the one-liner under “Install Homebrew,” and paste it in your Linux terminal.  Homebrew will not mess with apt or your system package manager.  You can have both Homebrew and apt running side by side with no issues.  apt manages things it installs and Homebrew manages things it installs.  My advice would be to use apt primarily, fill in anything missing with Homebrew, and don’t install the same packages with both.
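For reference, at the time of writing that one-liner looks like this (again, grab the current version from brew.sh):

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"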

[Screenshot: Homebrew]

Note: Homebrew will not install as root, so paste it in a non-root terminal.  However, the user who’s installing Homebrew should have sudo capabilities.  There are different ways to configure sudo, but the easiest is to install sudo (apt install sudo) and then add whatever user you’re using into the ‘sudo’ group in /etc/group.
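If you’re starting from a bare Debian install, the whole dance looks something like this as root (swap your actual username in for “lowend,” which is just a placeholder here; usermod edits that same sudo line in /etc/group for you):

apt install sudo
usermod -aG sudo lowend

The user needs to log out and back in for the new group to take effect.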

Now install Ollama:

brew install ollama

Start Ollama:

brew services start ollama
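To double-check that everything came up, these should both respond:

brew services list
ollama --version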

Trying It Out

We need to get the TinyLlama LLM.  Ollama makes this easy:

ollama pull tinyllama

[Screenshot: ollama pull output]
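You can verify the download with ollama list, which should show tinyllama weighing in at a bit over 600 MB (the default tag is, last I checked, a 4-bit quantized build, which is exactly what we want on this box):

ollama list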

Now run it:

ollama run tinyllama
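That drops you into an interactive chat; type /bye when you want to exit.  You can also skip the chat loop and pass a prompt directly on the command line, which is handy for scripting:

ollama run tinyllama "Explain quantization in one sentence."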

Here’s a video showing the experience of running a prompt.  As you can see, it starts up quickly and the output is pretty sprightly:

While this was running, the system was under heavy load:

[Screenshot: top output while Ollama is running]

Still plenty of available RAM:

[Screenshot: available RAM while Ollama is running]
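If you want to watch this yourself, open a second SSH session while a prompt is running and check with the usual tools:

top
free -h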

Now that we’ve got Ollama up and running, let’s turn this into a full-fledged web UI like ChatGPT.  Stay tuned for part two!
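One parting tip that part two will build on: Ollama also listens on a local HTTP API (port 11434 by default), which is what web front ends talk to.  You can poke it with curl and see the raw JSON response (the prompt here is just an example):

curl http://localhost:11434/api/generate -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "stream": false}'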

 
