Models for 16 GB VRAM?
What models are currently good for coding tasks? I just ran Qwen3-14B-Q6_K.gguf with llama.cpp on my card with 16 GB of VRAM (+32 GB of DDR4), but I get really close to filling the entire VRAM in a single short conversation, so I am looking for some (smaller) alternatives to test.
I might throw an OpenCode container into the mix next, if that is relevant information.
::: spoiler spoiler
podman run --rm --replace --pull=newer \
--name llama \
-p 8080:8080 \
-v ./llama_models:/models:Z \
--device /dev/dri/card1:/dev/dri/card1 \
--device /dev/dri/renderD128:/dev/dri/renderD128 \
ghcr.io/ggml-org/llama.cpp:full-vulkan \
--server \
-m /models/Qwen3-14B-Q6_K.gguf \
-ngl 99 \
-fa on \
-c 16384 \
--temp 0.6 \
--top-k 20 \
--top-p 0.95 \
--jinja \
--host 0.0.0.0 --port 8080
:::
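For anyone wondering why a 14B model at Q6_K gets so close to 16 GB, here is a rough back-of-the-envelope sketch of the two big consumers: the quantized weights and the KV cache at `-c 16384`. The architecture numbers (40 layers, 8 KV heads, head dimension 128, about 14.8B parameters) are assumptions for Qwen3-14B and should be checked against the model's `config.json`; the figure of about 6.56 bits per weight for Q6_K and the f16 KV cache are also assumptions. Compute buffers and whatever the desktop is using come on top, so treat the result as a lower bound.

```python
# Rough VRAM estimate for a GGUF model served by llama.cpp.
# Every model-specific number here is an ASSUMPTION for illustration;
# read the real values from the model's config.json / GGUF metadata.

GIB = 1024 ** 3


def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes: one K and one V tensor per layer,
    each holding n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed values for Qwen3-14B at Q6_K with a 16384-token context.
weights = weights_bytes(n_params=14.8e9, bits_per_weight=6.5625)
kv = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=16384)

print(f"weights : {weights / GIB:.1f} GiB")
print(f"KV cache: {kv / GIB:.1f} GiB")
print(f"total   : {(weights + kv) / GIB:.1f} GiB (before compute buffers)")
```

With these assumed numbers the estimate comes to roughly 11.3 GiB of weights plus 2.5 GiB of KV cache, which lines up with the card being nearly full. The KV cache scales linearly with the context size, so halving `-c` to 8192 would free about 1.25 GiB, while a smaller model or a lower quant shrinks the weights term.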
https://huggingface.co/allenai/SERA-8B-GA
https://huggingface.co/bartowski/allenai_SERA-8B-GA-GGUF
I came across an article recently which talked about sites/tools that tell you what could run on your machine (canirun.ai and llmfit). whichllm seems to tell you what gives the best results on your hardware.
Thanks!
I run gemma4:26b in 16 GB of RAM. It's slow on my test rig with only 2 GB of VRAM, but it should fit in 16 GB of VRAM fine. I have one of those AMD BC-250 crypto mining units set up as a gaming rig, but my plan was to also run ollama on it. gemma4:26b was the model I planned to make the default. I haven't messed with it yet since I'm playing through my Steam catalog that was waiting for me to have a PC that could run it lol.