
Want to Run a LLM Like Llama3 at Home for a Self-Hosted ChatGPT? A Mac Might Be Your Best, Cheapest Option

Interested in running a large language model at home? It turns out there is an interesting architectural alternative to the standard “strap a bunch of Nvidia GPUs to a 2,400-watt PSU” approach. Note that here I’m discussing running an LLM, not necessarily training one.

Self-Hosted GPT?

As of this writing, the best LLM in the world (that we know about) is OpenAI’s ChatGPT.  That is a model reportedly trained with trillions of parameters. If you want the best, you’re going to sign up for OpenAI’s service and pay.

But you can also download and run LLMs at home.  They’re not going to be as good as OpenAI’s – unless you have billions – but they are pretty good, and can be uncensored if you wish (go ahead and ask it how to build a nuclear bomb and it’ll help you).

An example is the famous Llama3 model released by Meta. It comes in 8-billion and 70-billion parameter versions.  The largest model I’ve seen is the Goliath-120 model, which has 120B parameters.
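
If you want to try these yourself, Ollama (which I’ll come back to below) can pull either size with a single command. The tags below are what the public Ollama library uses as of this writing; double-check them against the library before relying on them:

    ollama pull llama3:8b
    ollama pull llama3:70b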

Here’s the problem, though…those big models really only run well if you can fit them entirely on a GPU.  They’ll run on a CPU, but as you increase the context and make them work, they…start to run…kind…

of…

(pause)

slow…

It’s like chatting with someone on the other end of a 300-baud modem. Or worse.  Now sometimes that’s OK but if you’re iteratively working on something, this is a severe limitation.

If you can fit that entire LLM inside a GPU, it’s blazing fast.  But Goliath-120 wants 80+ GB of memory.
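
As a rough rule of thumb (assuming a typical 4-bit quantization, and ignoring the extra memory the context/KV cache needs), figure about half a gigabyte per billion parameters: a 70B model lands in the 35-45GB range, and a 120B model like Goliath lands somewhere between 60 and 90GB depending on which quant you pick.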

If you’re stacking Nvidia cards, that’s either a $10,000+ card (just the card!) or it’s multiple 3090s or 4090s, plus a big mobo to space them out, plus cooling, plus a giant PSU…

Or You Could Just Buy a Mac

On Macs, the architecture is a little different.  The CPU and GPU share the same pool of high-speed unified memory, which runs at up to 800GB/sec on the Ultra chips.  That’s well beyond what typical dual-channel DDR5 (on the order of 64-100GB/sec) can offer.  So if you buy a Mac with 64GB, that memory can be used either for CPU processing or as GPU memory.

Out of the box, macOS will cap GPU memory at roughly 75% of total RAM, though this limit can be adjusted.
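
On recent macOS releases that limit is exposed as a sysctl. The exact key has moved around between versions, so treat this as a sketch and verify it on your own machine, but on Sonoma something like the following raises the GPU wired-memory cap (the value is in megabytes, here about 56GB on a 64GB machine, and it resets at reboot):

    sudo sysctl iogpu.wired_limit_mb=57344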

My 64GB M1 Max runs models that require 40GB+ of GPU VRAM just fine.

My current favorite model is taozhiyuai/llama-3-uncensored-lumi-tess-gradient:70b_i1_q4_k_m, which I run on Ollama.  That’s a 42GB model and response time is nearly instant.  (I’m not trying to build a nuclear bomb, but I hate censored models.)
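
If you want to reproduce that, it’s just the standard Ollama workflow; ollama run will download the model on first use (swap in whatever tag you prefer):

    ollama run taozhiyuai/llama-3-uncensored-lumi-tess-gradient:70b_i1_q4_k_m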

Now let’s look at the really high end.  Imagine you want to run Goliath-120B.  That’s almost 90GB for the biggest version.  Let’s say an Nvidia 4090 is about $1,800 (that’s what I see on Amazon).  That gets you 24GB of VRAM.  But you need 90GB, so that’s…4 cards (96GB of VRAM), or roughly $7,200 just in GPUs.  Plus you need a big beefy mobo, plus a fast CPU, system RAM, a case, storage…that’s gotta be $11,000 by the time you’re done.

OTOH, you can buy a Mac Studio with an M2 Ultra chip (24-core CPU, 60-core GPU), 192GB of system RAM, and 2TB of storage for just under $6,000.

Bit of a shocker, eh?

The Mac’s ability to use its fast RAM as either CPU or GPU memory is a game-changer in this arena, given that Nvidia cards are still comparatively RAM-poor.

I’m in the process of getting access to an M2 Ultra configured like this, and I’ll report back on how it performs once I do.

 

 

raindog308
