localllama · LocalLLaMA by Svinhufvud

Models for 16 GB vram?

What models are currently good for coding tasks? I just ran Qwen3-14B-Q6_K.gguf with llama.cpp on my card with 16 GB of VRAM (+32 GB DDR4), but I get really close to filling the entire VRAM in a single short conversation, so I am looking for some smaller alternatives to test.

I might throw an OpenCode container into the mix next, if that is relevant.

::: spoiler spoiler

podman run --rm --replace --pull=newer \
  --name llama \
  -p 8080:8080 \
  -v ./llama_models:/models:Z \
  --device /dev/dri/card1:/dev/dri/card1 \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  ghcr.io/ggml-org/llama.cpp:full-vulkan \
  --server \
  -m /models/Qwen3-14B-Q6_K.gguf \
  -ngl 99 \
  -fa on \
  -c 16384 \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --jinja \
  --host 0.0.0.0 --port 8080

:::
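
For anyone wondering why this setup gets so tight, here is a rough back-of-the-envelope sketch in Python. It only adds up quantized weights plus the KV cache for the `-c 16384` context from the command above. The model shape (40 layers, 8 KV heads, head dimension 128, about 14.8B parameters) and the 6.5625 bits per weight for Q6_K are assumptions on my part, not values read from the GGUF file, and the helper names are made up for illustration. Compute buffers and anything else using the card are not counted.

```python
# Rough VRAM estimate for a fully offloaded model: weights + KV cache.
# All model numbers below are ASSUMPTIONS, not values read from the GGUF file.

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, KV head and token.

    bytes_per_elem=2 assumes an unquantized f16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed Qwen3-14B shape and an assumed average of 6.5625 bits/weight for Q6_K.
weights = weight_bytes(n_params=14.8e9, bits_per_weight=6.5625)
kv = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=16384)

print(f"weights : {weights / GIB:.2f} GiB")
print(f"KV cache: {kv / GIB:.2f} GiB")
print(f"total   : {(weights + kv) / GIB:.2f} GiB (before compute buffers)")
```

Under those assumptions the weights come to roughly 11.3 GiB and the cache to 2.5 GiB, so close to 14 GiB is committed before any scratch buffers. The cache term scales linearly with `n_ctx`, so halving the context to 8192 would halve it to 1.25 GiB, while a smaller model or a lower-bit quant shrinks the weights term instead.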

lemmy.dbzer0.com

I came across an article recently that covered sites and tools that tell you what can run on your machine (canirun.ai and llmfit).

whichllm seems to tell you which model gives the best results on your hardware.

6

I run gemma4:26b in 16 GB of RAM. It's slow on my test rig with only 2 GB of VRAM, but it should fit in 16 GB of VRAM fine. I have one of those AMD BC-250 crypto mining units set up as a gaming rig, but my plan was to also run ollama on it, with gemma4:26b as the default model. I haven't messed with it yet since I'm playing through the Steam catalog that was waiting for me to have a PC that could run it lol.

4

