Run your own LLM locally

In case any of you needs to run a Large Language Model (LLM) locally: there are a few ways to do it, and I'll post what I found to be the simplest.

1. Download and install LM Studio. I use Developer mode; you can choose another option.
2. Download a model (more on this below).
3. Run this from the Windows terminal (it doesn't matter where you open it from):
lms import C:/path/to/your/model.gguf
4. In LM Studio, open the dropdown at the top labeled "Select a model to load (Ctrl+L)" and select your model.
5. Chat

You can figure out the rest of it.
 
By the way, you don't have to pay for any of this.
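If you'd rather call the model from your own code, LM Studio can also run a local server that speaks the OpenAI-compatible API (in Developer mode it's started from the Developer tab and defaults to port 1234). Here's a minimal Python sketch, assuming that default port and that you already have a model loaded; the "model" string is just a placeholder, since the server answers with whatever model is currently loaded:

import requests

# Minimal chat request against LM Studio's local OpenAI-compatible server.
# Assumes the server is running on the default port 1234 with a model loaded.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # placeholder; the loaded model is used
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7,
    },
    timeout=300,  # local generation can be slow, so allow plenty of time
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])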

About models:
Capabilities depend on the model you choose, and there are plenty of models for different purposes.

For example, I installed an unrestricted model that can talk about anything, and it can also write code. Expect sporadic sloppy code and hallucinations (depending on what you're doing), but it's nice that it runs locally.

Requirements:
I tested on a laptop with 16GB RAM, an RTX 3070 with 8GB VRAM, an i7 processor, and enough disk space for the model.

How to choose a model:
There are full models, often well beyond 15GB, and quantized models (reduced in quality and size), usually between 2GB and 15GB; obviously, the bigger the model, the more precise. I'm not sure whether video/image generation models follow the same size and naming conventions as LLMs, but what I have learned so far is this:
1. The extensions are usually .safetensors and .gguf, where .gguf is the format you'll typically see for the quantized models.
2. Quantized models often follow a naming convention: if you see names like Q2, Q3, Q4... Q8 (roughly the bits per weight), I recommend Q4; it's usually a good trade-off between output quality and size.
3. The size you choose has everything to do with your RAM and VRAM; the model has to fit in memory, with some headroom left for the OS and the context. If you have 16GB of RAM, a 15GB model is about the upper limit (there's a rough size estimate after this list).
4. Most models are on Hugging Face, but there are other sources too.
5. Some models don't follow any naming convention and are named after something else.
6. If you're curious, I tested a few "Forgotten Safeword" models. They're NSFW, but they can talk freely if you ask them to; it's not like they'll be flirting with you first thing when you say hi, and they're polite unless you specify otherwise. I will be testing other models soon, but the proof of concept was a fun experiment.
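To make the size point concrete: a ballpark for a quantized model's file size is parameters × bits per weight ÷ 8. Here's a small Python sketch; the bits-per-weight values are my approximations for common GGUF quant levels, and real files also carry metadata and mixed-precision layers, so treat the output as a rough estimate:

# Ballpark GGUF file size: parameters * bits_per_weight / 8.
# The bits-per-weight values below are approximations, not exact figures.
BITS_PER_WEIGHT = {"Q2": 2.6, "Q3": 3.4, "Q4": 4.5, "Q5": 5.5, "Q6": 6.6, "Q8": 8.5}

def estimated_size_gb(params_billions: float, quant: str) -> float:
    """Rough file size in GB for a model with the given parameter count."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

# A 7B model at different quant levels:
for quant in ("Q2", "Q4", "Q8"):
    print(f"7B at {quant}: ~{estimated_size_gb(7, quant):.1f} GB")

That lines up with what you see on Hugging Face: 7B models at Q4 are around 4GB, which fits in 8GB of VRAM with room left for the context.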
 
When you're chatting with a model running locally, if you're on a moderately decent but humble computer like mine, you'll notice that the answers take a couple of seconds to start and a while to complete, depending on their length. It's nothing like the speed you're used to from Claude or ChatGPT. Watching the temperature meters, the CPU also climbs well beyond 90°C while it's answering.
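If you want to put numbers on that, you can stream the reply from the same local endpoint as above and time it. A quick sketch (again assuming LM Studio's default port 1234; I'm counting streamed chunks as tokens, which is close enough for a rough tokens-per-second figure):

import json
import time
import requests

# Stream a reply and measure time-to-first-token plus rough tokens/second.
start = time.time()
first_token_at = None
chunks = 0

with requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # placeholder; the loaded model is used
        "messages": [{"role": "user", "content": "Explain recursion briefly."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # The server streams Server-Sent Events: lines like "data: {json}".
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            if first_token_at is None:
                first_token_at = time.time()
            chunks += 1

if first_token_at and chunks > 1:
    elapsed = time.time() - first_token_at
    print(f"time to first token: {first_token_at - start:.2f}s")
    print(f"~{chunks / elapsed:.1f} tokens/s over {chunks} chunks")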

Makes you wonder how much computer power these big guys are using to provide such quick responses.
 

Enough that if they actually published the power usage, the "green Earth" groups would lynch them. One key to knowing this is the recent discussions (with linked articles) about the need to water-cool the circuitry, using enough water that they could seriously compromise water supplies for farmers. That is NOT a trivial amount of heat.
 
Yeah, I heard they're moving their data centers to water-abundant places too.

You mitigate that by a nano-fraction if you run the thing locally. Sadly, I know most people would simply prefer the convenience of using it online without giving it any thought.
 
