Models for 16 GB VRAM?

Which models are currently good for coding tasks? I just ran Qwen3-14B-Q6_K.gguf with llama.cpp on my card with 16 GB of VRAM (plus 32 GB of DDR4 system RAM), but a single short conversation comes very close to filling the entire VRAM, so I am looking for some (smaller) alternatives to test.

I might throw an OpenCode container into the mix next, if that is relevant.

::: spoiler spoiler

podman run --rm --replace --pull=newer \
  --name llama \
  -p 8080:8080 \
  -v ./llama_models:/models:Z \
  --device /dev/dri/card1:/dev/dri/card1 \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  ghcr.io/ggml-org/llama.cpp:full-vulkan \
  --server \
  -m /models/Qwen3-14B-Q6_K.gguf \
  -ngl 99 \
  -fa on \
  -c 16384 \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --jinja \
  --host 0.0.0.0 --port 8080

:::
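For anyone wondering why this setup sits so close to the 16 GB limit, here is a rough back-of-the-envelope sketch in Python. It adds the quantized weights to the KV cache for a given context size. The architecture numbers (40 layers, 8 KV heads, head dimension 128, about 14.8B parameters) and the 6.5625 bits per weight for Q6_K are my assumptions for Qwen3-14B, not values read from the file, so check them against the llama.cpp load log. Compute buffers and whatever the desktop uses come on top.

```python
# Rough VRAM estimate: quantized weights + KV cache.
# Every model number below is an ASSUMPTION for Qwen3-14B; verify it against
# the model's config or the llama.cpp startup log before trusting the result.

GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Bytes for the K and V caches over n_ctx tokens (f16 = 2 bytes/element)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights at an average bits-per-weight."""
    return n_params * bits_per_weight / 8


# Assumed values (hypothetical until checked against the actual GGUF)
N_PARAMS = 14.8e9     # parameter count
Q6_K_BPW = 6.5625     # nominal Q6_K bits per weight; real files mix tensor types
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128

weights = weight_bytes(N_PARAMS, Q6_K_BPW)
for n_ctx in (4096, 8192, 16384):
    kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx)
    total = weights + kv
    print(f"ctx={n_ctx:>5}: weights {weights / GIB:.1f} GiB "
          f"+ KV {kv / GIB:.2f} GiB = {total / GIB:.1f} GiB")
```

Under these assumptions the KV cache alone is about 2.5 GiB at `-c 16384`, so a smaller context, a lower quant, or a quantized KV cache (llama.cpp has `--cache-type-k` and `--cache-type-v`, e.g. `q8_0`) should each free up a noticeable chunk.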


Comments (4)

SuspiciousCarrot78

aussie.zone

https://huggingface.co/allenai/SERA-8B-GA

https://huggingface.co/bartowski/allenai_SERA-8B-GA-GGUF

Flyswat

lemmy.dbzer0.com

I came across an ~~sticker~~ article recently which talked about sites/tools that tell you what could run on your machine (canirun.ai and llmfit).

whichllm seems to tell you which model gives the best results on your hardware.

Svinhufvud reply

sopuli.xyz

Thanks!

Iced Raktajino

startrek.website

I run gemma4:26b in 16 GB of RAM. It's slow on my test rig with only 2 GB of VRAM, but it should fit in 16 GB of VRAM fine. I have one of those AMD BC-250 crypto mining units set up as a gaming rig, and my plan was to also run ollama on it, with gemma4:26b as the default model. I haven't messed with it yet, since I'm playing through my Steam catalog that was waiting for me to have a PC that could run it lol.