Spyke

Posts

showerthoughts · Showerthoughts · by robber

Status symbols could also be called symbols of inequality

A few days ago I saw people attending a Fridays for Future demonstration excitedly putting political stickers on a shiny blue Lamborghini that was obviously parked at the wrong point in spacetime.

When I discussed this with a friend, we concluded that there was quite strong symbolism in that situation: a kind of direct payback for the unnecessary pollution of the planet, with the car as the canvas onto which the activists could project their anger.

We also talked about luxury cars being a symbol of social inequality.

Only later did it hit me that luxury cars, among other things, are usually called status symbols, and that they could just as well be called symbols of inequality.

View original on lemmy.ml
localllama · LocalLLaMA · by robber

Gemma4 12b released with "unified" approach to multi-modality

From the model card, sounds interesting:

The "Unified" in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass.
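
To make the quoted description a bit more concrete, here is a toy sketch of what "projecting raw image patches through a lightweight linear layer" can look like. This is not Gemma's actual code: the patch size, embedding width, random weights, and the helper names (`patchify`, `linear`, `embed_image`) are all made up for illustration, and the image dimensions are assumed to be multiples of the patch size.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors.

    Assumes H and W are multiples of `patch` (illustrative simplification).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch)
                            for dx in range(patch)])
    return patches


def linear(vec, weight, bias):
    """Apply one linear layer; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def embed_image(image, patch, weight, bias):
    """Map every patch to one vector in the (toy) embedding space."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]


# Hypothetical toy sizes: a 4x4 single-channel image, 2x2 patches, 4-dim embeddings.
random.seed(0)
PATCH, EMBED_DIM = 2, 4
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
weight = [[random.uniform(-1, 1) for _ in range(PATCH * PATCH)]
          for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

tokens = embed_image(image, PATCH, weight, bias)
print(len(tokens), len(tokens[0]))  # 4 4 (four patch "tokens", each 4-dimensional)
```

In this picture no separate vision encoder sits between the patches and the decoder; the resulting vectors would simply be placed in the same sequence as the text token embeddings.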

The benchmarks put it closer to the 26b MoE than to the E variants of the Gemma4 series, but mostly below Qwen3.5 9b.

Looking forward to giving it a shot.

https://huggingface.co/google/gemma-4-12B
localllama · LocalLLaMA · by robber

llama.cpp: don't sleep on --split-mode tensor

In case you missed it, 2-3 weeks ago, experimental tensor-parallelism support was merged into llama.cpp.

In a nutshell, this lets multi-GPU setups combine not only the VRAM of the cards but also their computing power. The results depend a lot on the specific setup and model, but on my 3x RTX 2000e Ada rig running Qwen3.6-35b it almost doubled generation throughput (these are low-power cards that are not very fast on their own).
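
For anyone wondering why this can speed things up, here is a toy sketch of the general idea behind tensor parallelism. It is not how llama.cpp implements it; the function names and the tiny matrix are made up for illustration. Each "device" holds only a slice of one weight matrix and computes its share of the output, and the partial results are concatenated at the end.

```python
def matvec(weight, x):
    """Compute y = W @ x, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def shard_rows(weight, n_devices):
    """Split the output rows of W into contiguous shards, one per device."""
    size = -(-len(weight) // n_devices)  # ceiling division
    return [weight[i:i + size] for i in range(0, len(weight), size)]


def tensor_parallel_matvec(weight, x, n_devices):
    """Every device multiplies its own shard by the full input vector."""
    shards = shard_rows(weight, n_devices)
    # On real hardware these products would run at the same time, one per GPU.
    partials = [matvec(shard, x) for shard in shards]
    # Gather step: stitch the partial outputs back together in order.
    return [y for part in partials for y in part]


W = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
x = [1, -1]

# The sharded computation gives the same result as the single-device one.
assert tensor_parallel_matvec(W, x, 3) == matvec(W, x)
print(tensor_parallel_matvec(W, x, 3))  # [-1, -1, -1, -1, -1, -1]
```

With a plain layer split, by contrast, each card holds whole layers and mostly waits for the previous card to finish, so only the memory adds up.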

The option to turn it on is --split-mode tensor.

It's not officially documented yet, I assume because it's still experimental. But since #22362 was merged yesterday, it now also works for the latest Qwen3.6 models, at least in my case.

https://github.com/ggml-org/llama.cpp/pull/19378
localllama · LocalLLaMA · by robber

Relevance of GPU driver version for inference performance

Hey everyone! I was just skimming through other people's inference benchmarks and noticed that the driver version is usually mentioned. It made me wonder how relevant this is. My prod server runs Debian 12, so the packaged nvidia drivers are rather old, but I'd prefer not to mess with the drivers if it won't bring a benefit. Do any of you have experience with this, or have you done some testing?

localllama · LocalLLaMA · by robber

ExLlamaV3 adds tensor parallelism support

Title says it. The release is already 10 days old, but I didn't catch it until now. This might be huge for those of us running on multiple GPUs. At least for Gemma3, I was able to double inference speed by using vLLM with tensor parallelism vs. ollama's homegrown parallelism. Support in ExLlamaV3 could additionally make it possible to pair TP with lower-bit quants. I haven't tested this yet, but I'm very much looking forward to it.

https://github.com/turboderp-org/exllamav3/releases/tag/v0.0.6
localllama · LocalLLaMA · by robber

Do you quantize models yourself?

Hey fellow llama enthusiasts! Great to see that not all of lemmy is AI-sceptical.

I'm in the process of upgrading my server with a bunch of GPUs. I'm really excited about the new Mistral / Magistral Small 3.2 models and would love to serve them for myself and a couple of friends. My research led me to vLLM, with which I was able to double inference speed compared to ollama, at least for qwen3-32b-awq.

Sadly, the most common quantization methods (GGUF, EXL, BNB) are either not fully (GGUF) or not at all (EXL) supported in vLLM, or multi-GPU inference through tensor parallelism is not supported (BNB). And especially for new models, it's hard to find pre-quantized versions in other, more broadly supported formats (AWQ, GPTQ).

Do any of you face a similar problem? Do you quantize models yourself? Are there any up-to-date guides you would recommend? Or did I completely overlook another, obvious solution?

It feels like whatever I researched yesterday is already outdated today, since the landscape is evolving so rapidly.

Anyways, thank you for reading and sharing your thoughts or experience if you feel like it.

selfhosted · Selfhosted · by robber

Any experience with Pangolin?

Hi fellow homelabbers! I hope your day / night is going great.

I just stumbled across this self-hosted Cloudflare Tunnel alternative called Pangolin.

  • Does anyone use it for exposing their homelab? It looks awesome, but I've never heard of it before.

  • Should I be reluctant since it's developed by a US-based company? I mean security-wise. (I'll remove this question if it's too political.)

  • Does anyone know of alternative pieces or stacks of software that achieve the same without relying on Cloudflare?

Your insights are highly appreciated!

selfhosted · Selfhosted · by robber

[Solved] Chaining routers and GUA IPv6 addresses

Hey fellow self-hosting lemmoids

Disclaimer: not at all a network specialist

I'm currently setting up a new home server in a network where I'm given GUA IPv6 addresses in a 64-bit subnet (which means, if I understand correctly, that I can set up many devices in my network that are reachable from the outside world via a fixed IP). Everything works so far, and my services are reachable.
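
As a quick sanity check on the "many devices" part, here is a small sketch using Python's standard `ipaddress` module. The prefix below is a placeholder from the documentation range (2001:db8::/32), not a real assignment, and the helper names `host_address` and `in_gua_range` are made up for illustration.

```python
import ipaddress

# Placeholder /64 from the documentation range; substitute the prefix
# your ISP actually delegates.
prefix = ipaddress.ip_network("2001:db8:1234:5678::/64")

# Global unicast addresses are currently allocated out of 2000::/3.
GUA_RANGE = ipaddress.ip_network("2000::/3")


def host_address(network, host_id):
    """Return the address with the given interface ID inside the network."""
    return network[host_id]


def in_gua_range(address):
    """Check whether an address falls inside the global unicast block."""
    return address in GUA_RANGE


server = host_address(prefix, 0x10)

print(prefix.num_addresses == 2 ** 64)  # True: 64 bits are left for hosts
print(server)                           # 2001:db8:1234:5678::10
print(server in prefix)                 # True
print(in_gua_range(server))             # True
print(in_gua_range(ipaddress.ip_address("fe80::1")))  # False (link-local)
```

So a single /64 leaves room for far more hosts than any home network needs, and every one of them can have its own globally routable address as long as the firewall lets traffic through.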

Now my problem is that I need to use the router provided by my ISP, and it's - big surprise here - crap. My biggest concern is that I don't have fine-grained control over firewall rules. I can only open ports in groups (e.g. "Web", "All other ports"), and I can only do this network-wide, not for specific IPs.

I'm thinking about getting a second router with a better IPv6 firewall and using the ISP router only as a "modem". Now I'm not sure how things would play out regarding my GUA addresses. Could a second router also assign addresses from that globally routable space directly to devices? Or would I need some sort of NAT? I've seen some modern routers with "pass-through" IPv6 address allocation, but I'm unsure whether the router's firewall would still work in such a configuration.

In IPv4 I used to have a similar setup, where router 1 would just forward all packets for some ports to router 2, which then would decide which device should receive them.

Do any of you have experience with a similar setup? And if so, could you even recommend a router?

Many thanks!


Edit: I was able to achieve what I wanted by using OpenWrt and its IPv6 relay mode. Now my ISP router handles all IPv6 addresses directly, but I'm still able to filter the packets using the OpenWrt firewall. For IPv4, I couldn't figure out how to use the ISP's DHCP server at the same time, so I just went with double NAT. Everything works like a charm. Thank you guys for pointing me in the right direction.

mildlyinfuriating · Mildly Infuriating · by robber

Modern online banking

A couple of years ago, QR-bills were introduced in Switzerland as a means to make payments easier. My bank provides an app to scan the QR codes, which I prefer not to install. The only other option they provide to scan the codes is to use the webcam. Am I supposed to print my digital bills to have my webcam scan them again? Just let me upload a goddamn screenshot.

selfhosted · Selfhosted · by robber

Any of you have a self-hosted AI "hub"? (e.g. for LLM, stable-diffusion, ...)

I've been looking into self-hosting LLMs or stable diffusion models using something like LocalAI and / or Ollama and LibreChat.

Some questions to get a nice discussion going:

  • Any of you have experience with this?
  • What are your motivations?
  • What are you using in terms of hardware?
  • Considerations regarding energy efficiency and associated costs?
  • What about renting a GPU? Privacy implications?