ExLlamaV3 adds tensor parallelism support
Title says it - the release is already 10 days old, but I didn't catch it at the time. This might be huge for those of us running on multiple GPUs. At least for Gemma3, I was able to double inference speed by using vLLM with tensor parallelism vs. ollama's homegrown parallelism. Support in ExLlamaV3 could additionally make it possible to pair TP with lower-bit quants. I haven't tested this yet, but I'm very much looking forward to it.
https://github.com/turboderp-org/exllamav3/releases/tag/v0.0.6