Want to Run an LLM Like Llama3 at Home for a Self-Hosted ChatGPT? A Mac Might Be Your Best, Cheapest Option

Jun 29, 2024 @ 2:10 pm

ai diy, apple, artificial intelligence, llama3, mac, ollama

Interested in running a large language model at home? It turns out there is an interesting architectural alternative to the standard “strap a bunch of Nvidia GPUs to a 2400-watt PSU” approach. Here I’m discussing running an LLM, not necessarily training one.

Self-Hosted GPT?

As of this writing, the best LLM in the world (that we know about) is the one behind OpenAI’s ChatGPT. OpenAI hasn’t published its size, but it is widely reported to have on the order of a trillion parameters or more. If you want the best, you’re going to sign up with OpenAI’s service and pay.

But you can also download and run LLMs at home. They’re not going to be as good as OpenAI’s – unless you have billions – but they are pretty good, and can be uncensored if you wish (go ahead and ask it how to build a nuclear bomb and it’ll help you).

An example is the famous Llama3 model released by Meta. It comes in 8-billion and 70-billion parameter versions. The largest model I’ve seen is Goliath-120B, which has 120 billion parameters.

Here’s the problem, though…those big models really only run well if you can put them all on a GPU. They’ll run on CPU but as you increase context and make them work, they…start to run…kind…

of…

(pause)

slow…

It’s like chatting with someone on the other end of a 300-baud modem. Or worse. Now sometimes that’s OK but if you’re iteratively working on something, this is a severe limitation.

If you can fit that entire LLM inside a GPU, it’s blazing fast. But Goliath-120 wants 80+ GB of memory.
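As a rough way to see where memory figures like that come from, the weights alone take about (parameter count × bits per weight ÷ 8) bytes. The sketch below is a back-of-the-envelope estimate, not a measurement: the bits-per-weight values are approximate assumptions for common precision levels, and real usage is higher once context and runtime overhead are added.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 10**9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed, approximate bits per weight, for illustration only.
ASSUMED_BITS = {
    "16-bit": 16.0,
    "8-bit": 8.0,
    "4-bit-class quant (~4.85 bits)": 4.85,
}

for label, bits in ASSUMED_BITS.items():
    for params in (70, 120):
        size = estimate_weights_gb(params, bits)
        print(f"{params}B at {label}: about {size:.1f} GB")
```

By this estimate, a 120B model only gets under 100GB once it is quantized well below 8 bits per weight, which is why it is out of reach for a single consumer GPU.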

If you’re stacking Nvidia cards, that’s either a $10,000+ card (just the card!) or it’s multiple 3090s or 4090s, plus a big mobo to space them out, plus cooling, plus a giant PSU…

Or You Could Just Buy a Mac

On Macs, the architecture is a little different. CPU and GPU share the same high-speed memory, which runs at up to 800GB/sec on the Ultra chips (my M1 Max gets 400GB/sec). That’s well beyond what DDR5 (64GB/sec) or the upcoming DDR6 (a projected 134GB/sec) can offer. So if you buy a Mac with 64GB, it can use that memory for either system CPU processing or GPU memory.
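One way to see why memory bandwidth matters so much: when generating text, a model has to read roughly all of its weights for every token, so bandwidth divided by model size gives a loose upper bound on tokens per second. This is a common rule of thumb, not something measured here. The 42GB model size is an assumed example, the bandwidth figures are the ones quoted above, and real-world speeds will be lower.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Loose upper bound: each generated token reads about the whole model once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 42  # assumed example size for a quantized 70B model

for label, bandwidth in (("800GB/sec unified memory", 800), ("64GB/sec DDR5", 64)):
    bound = max_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{label}: at most about {bound:.1f} tokens/sec")
```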

Out of the box, a Mac will max out at 75% of RAM for the GPU, though this can be adjusted.

My 64GB M1 Max runs models that require 40GB+ of GPU VRAM just fine.
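The arithmetic behind that is simple enough to sketch. This assumes only the 75% default mentioned above and ignores whatever else the system needs memory for, so treat it as a first-pass check rather than a guarantee that a given model will load.

```python
def default_gpu_budget_gb(system_ram_gb: float, gpu_fraction: float = 0.75) -> float:
    """Memory the GPU may use, given the out-of-the-box 75% cap."""
    return system_ram_gb * gpu_fraction


def fits_on_gpu(model_gb: float, system_ram_gb: float, gpu_fraction: float = 0.75) -> bool:
    """True if the model fits within the GPU's share of unified memory."""
    return model_gb <= default_gpu_budget_gb(system_ram_gb, gpu_fraction)


print(default_gpu_budget_gb(64))   # 48.0 GB available to the GPU on a 64GB Mac
print(fits_on_gpu(42, 64))         # True: a 42GB model fits
print(fits_on_gpu(90, 64))         # False: a 90GB model does not
print(fits_on_gpu(90, 192))        # True: 192GB leaves 144GB for the GPU
```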

My current favorite model is taozhiyuai/llama-3-uncensored-lumi-tess-gradient:70b_i1_q4_k_m which I run on Ollama. That’s a 42GB model and response time is nearly instant. (I’m not trying to build a nuclear bomb, but I hate censored models).
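If you want to talk to a model like this from a script rather than the Ollama prompt, here is a minimal sketch using only the Python standard library. It assumes Ollama is running locally on its default port (11434), that the model has already been pulled, and that the `/api/generate` endpoint accepts `model`, `prompt`, and `stream` fields and returns a `response` field, which is my understanding of Ollama's REST API. The helper names `build_request` and `ask` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODEL = "taozhiyuai/llama-3-uncensored-lumi-tess-gradient:70b_i1_q4_k_m"


def build_request(prompt: str, model: str = MODEL, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = MODEL, timeout: float = 600.0) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


# Example (needs a running Ollama server with the model pulled):
#   print(ask("Explain unified memory in two sentences."))
```

Setting `stream` to false keeps the example short by returning the whole reply at once; for interactive use you would probably want streaming instead.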

Now let’s look at the really high end. Imagine you want to run goliath-120B. That’s almost 90GB for the biggest version. Let’s say an Nvidia 4090 is about $1800 (that’s what I see on Amazon). That gets you 24GB of VRAM. But you need 90, so that’s…4 cards. About $7,200. Plus you need a big beefy mobo, plus a fast CPU, system RAM, a case, storage…that’s gotta be $11,000 by the time you’re done.

OTOH, you can buy a Mac Studio with an M2 Ultra chip (24-core CPU, 60-core GPU) and 192GB of system RAM and 2TB of storage for just under $6000.
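To make the comparison concrete, here is the same arithmetic as a short sketch. It uses only the prices and capacities quoted above (a $1800 4090 with 24GB, a roughly $6000 Mac Studio with 192GB, and a 90GB model) and leaves out the rest of the PC build, so the Nvidia side is understated if anything.

```python
import math


def cards_needed(required_gb: float, vram_per_card_gb: float) -> int:
    """Smallest number of cards whose combined VRAM covers the model."""
    return math.ceil(required_gb / vram_per_card_gb)


def dollars_per_gb(price: float, memory_gb: float) -> float:
    """Cost per GB of GPU-usable memory."""
    return price / memory_gb


cards = cards_needed(90, 24)
print(cards)                       # 4 cards (96GB total)
print(cards * 1800)                # 7200 dollars for the cards alone
print(dollars_per_gb(1800, 24))    # 75.0 dollars per GB on a 4090
print(dollars_per_gb(6000, 192))   # 31.25 dollars per GB on the Mac Studio
```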

Bit of a shocker, eh?

The Mac’s ability to use fast RAM as either CPU or GPU memory is a game-changer in this arena, given that Nvidia cards are still comparatively RAM-poor.

I’m in the process of getting access to an M2 Ultra configured like this, and I’ll report back once I do on how it performs.

raindog308

Raindog308 is a longtime LowEndTalk community administrator, technical writer, and self-described techno polymath. With deep roots in the *nix world, he has a passion for systems both modern and vintage, ranging from Unix, Perl, Python, and Golang to shell scripting and mainframe-era operating systems like MVS. He’s equally comfortable with relational database systems, having spent years working with Oracle, PostgreSQL, and MySQL.

As an avid user of LowEndBox providers, Raindog runs an empire of LEBs, from tiny boxes for VPNs, to mid-sized instances for application hosting, and heavyweight servers for data storage and complex databases. He brings both technical rigor and real-world experience to every piece he writes.

Beyond the command line, Raindog is a lover of German Shepherds, high-quality knives, target shooting, theology, tabletop RPGs, and hiking in deep, quiet forests.

His goal with every article is to help users, from beginners to seasoned sysadmins, get more value, performance, and enjoyment out of their infrastructure.

You can find him daily in the forums at LowEndTalk under the handle @raindog308.