I'm still messing around with self-hosting LLMs. Right now I've settled on using Lumo from Proton when I use an LLM.
When I have run an LLM locally, I used koboldcpp. It works pretty well, depending on what you are doing and which models you use. I forget which models I've been using off the top of my head.
I use it for light scripting, but real coding is done by cloud models.
I'm also using it as the brain for my Hermes agent. It sends me digests of news, subreddits, chats that I'd like to read but don't have time for. It does a great job researching things on the web for me, too.
Running decensored Qwen3.6-27b and a 9b Gemma for RAG and scrapes on Ollama with a mostly vibe-coded Discord bot. Just got it to run tools and scrape and post news on a schedule. The first model I can run locally that's smart enough to be useful. May give Jan a try for the back end after reading that other guy's rant.
Mostly use it for stupid questions I could have googled and to brag to friends.
I host my own AI, mostly for testing and because I wanted something that was mine and mine alone. I use Ollama and run models like Llama, Mistral, and Qwen. I honestly don’t use it much, but I wanted to have my own setup just in case online services go down or become less available. It’s part of my whole “own everything I use” mantra that I’ve been on lately.
Partially. I started with hosting my own llama3.2 + granite4 models using Ollama for my Home Assistant smart home and for general chat with OpenWebUI. I also run whisper for speech-to-text locally on my 1080 Ti GPU. I like the privacy and ownership of my self-hosted models, but I started to run into limitations with the small weights. So I built some tools that allow me to selectively route traffic to larger models hosted on DeepInfra depending on my need. For example, to GLM/Kimi models for code reviews or for my custom harnesses or harder problems.
I currently run Qwen3.6-27b on llama.cpp and use it via openwebui. Mostly, I use it for web research via tavily, to a lesser extent for coding and interactively learning about things that are new to me but common in training data (such as basic math or ML concepts).
Yes, llama-swap and I use it for home assistant text-gen notifications, basic coding tasks, etc
If anyone here self-hosts definitely check out llama-swap as it has some nifty features for hotswapping LLMs, image generation models and voice models.
I prefer my critical faculties completely intact and un-altered, thank you very much.
I do not require or desire a 400 watt bullshit-artist yes-man or vulnerability coder cooking my GPU.
Given the 27b is a dense model, I think the numbers are quite ok. Curious about the quant tho.
The cool thing about the Strix is its large unified memory, but it lacks the memory bandwidth for compute intensive workloads. Something like Qwen3.5-122b MoE with only like 12b active parameters might run at twice the speed if it fits the configuration.
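A back-of-the-envelope sketch of why fewer active parameters helps on a bandwidth-limited box. It assumes decoding is purely memory-bound (each token reads all active weights once) and ignores KV cache traffic and compute limits. The bandwidth and bits-per-weight values are placeholder assumptions, not figures from this thread; only the 27B dense and 12B active sizes come from the comments above.

```python
# Rough upper bound on decode speed for a memory-bandwidth-bound model.
# Assumption: every generated token reads all *active* weights once;
# KV cache reads, compute limits and overhead are ignored.

def max_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float,
                       bits_per_weight: float) -> float:
    """Bandwidth (GB/s) divided by GB of weights touched per token."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token

# Hypothetical numbers for illustration only.
BANDWIDTH_GB_S = 256.0   # assumed unified-memory bandwidth, not from the thread
BITS_PER_WEIGHT = 4.5    # assumed average for a 4-bit-class quant

dense_27b = max_tokens_per_sec(BANDWIDTH_GB_S, 27, BITS_PER_WEIGHT)
moe_12b_active = max_tokens_per_sec(BANDWIDTH_GB_S, 12, BITS_PER_WEIGHT)
speedup = moe_12b_active / dense_27b

print(f"dense 27B upper bound: {dense_27b:.1f} t/s")
print(f"MoE, 12B active upper bound: {moe_12b_active:.1f} t/s")
print(f"speedup: {speedup:.2f}x")
```

Under these assumptions the ratio depends only on the active parameter counts (27 / 12 = 2.25), which lines up with the "twice the speed" guess. Real numbers will be lower than the bound.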
My go to model for knowledge. Definitely much faster at Q5 but it lacks the tool calling quality of the Qwen3.6 models. Really hoping we see a Qwen3.6-122b soon...
I hosted Qwen 3.5 9b uncensored on my site at https://masland.tech/ for a while. I didn't really use it and no one else used it so I took it down. These days I'm spending most of my time finding uses for AI and accessibility. One of the next things I'm planning is a video to text reasoning system, primarily for the purpose of grading used electronic devices.
I have a simple slow model running on CPU in my cluster for karakeep. I've tried running a variety of models on my 7900XT but even with 16GB their performance just isn't there. My new work m5 Mac book with 48GB of ram is the first time I've seen usable performance for local models and it has been pretty impressive.
Thanks for this link. Because of this article, I had Claude stand up a llama.cpp container next to my already running Ollama container. It ran side-by-side tests with the same model and parameters, and the results blew Ollama out of the water. I'm in the process of moving Hermes and OpenWebUI over to the llama.cpp instance to see how it goes day to day.
Frankly, I find the description "VC funding a FOSS" offensive. They aren't funding the engine. I've been messing with LLM inference engines since 2022, and Ollama is the worst I've seen in the community.
They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution yet redirecting GH support requests there. They sometimes make their own GGUFs+forked releases which are broken and incompatible with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn't really work and they'll never upstream one line. They set a default context size that's basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.
And if that's all fine, they're enshittifying the app with closed code, and pointers to cloud models.
They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the "default." Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama community hates their guts, and that's not even getting into the interpersonal drama they've stirred.
They are a leech that's a net drag to the whole community, that we can't get rid of because they're attention grifters. And they've gotten worse and worse over time.
It's more moral to use any cloud API than Ollama, in my eyes. They're a grift.
EDIT: And, to be clear, I’m not against VC funded downstream stuff.
LM Studio is good! Even though it’s closed source.
i don't use it at all, i do want some selfhosted speech to text model (whisper?) but my computer is ancient so it would be awfully slow. i have some multi hour audio recordings from presentations, would be nice to have them in text and searchable..
How ancient is ancient? TTS and STT are much lighter than llm. (eg: Whisper, Piper, Kokoro, Coqui etc)....you might have more capability than you think, especially if you're doing batch processing like that.
a haswell xeon e5-1650 machine, i remember running llama 7b in llama.cpp in like 2023 and it was quite sluggish. guess i should try whisper at some point..
Same, toyed with it for creating stupid things like bot for telegram, that basically was a 3rd-person NSFW storyteller in RP chat. Sadly, after I made said bot I remember that I don't have friends to RP with.
That being said, ollama+openwebui kinda sucks: openwebui has a "wider scope" and features that you don't need, like auth via social providers and managing multiple accounts, while ollama itself does the opposite and lacks certain features (like proper mmap support to load big models), is slow in comparison to pure llama.cpp, and is generally easily replaceable with LM Studio, which provides both client and server. So yeah, my advice for anyone who wants to try it locally: just use LM Studio.
Same. It's somewhat useful on some very small scripting or tasks...but it's mostly just to try out a new model or two. It's not really useful for anything big.
I will have to say....even my tiny models are about as good as Chatgpt/Claude/etc... which makes me think about how much people are spending on tokens regularly. I was able to get the same kind of python script started with my local tiny model that was comparable to the newest Claude code offerings.
What local models have you been using? And what hardware are you running them on? I've been playing with local LLMs a bit for exactly your use case.
I have zero interest in vibe coding or full agentic workflows. But having a local LLM generate a Bash script to help me automate parts of my home lab infrastructure would be nice.
It is difficult to understand in the beginning but has great support for premade workflows. It even saves the workflow into its output images so you can drag and drop them into the webui to duplicate the setup that generated the image. Use the internet to get premade workflows and mess around with them to see what the options do and you'll slowly learn how it works. If you don't care about precise control over the generations or understanding how image generators works then just use something else more all-in-one.
Yeah, I'm using qwen 31b a3b on an AMD 9070 XT. It requires a bit of CPU offloading, but it's still plenty fast. Using it with llama.cpp.
Combine that with some mcp's such as ddg-search to make it truly useful by actually being able to search online.
I mostly use it for small tedious tasks with well-defined inputs and outputs. For example, when Hyprland recently changed from their own configuration language to Lua, I started going line by line translating my config to the new Lua syntax until I realized: oh wait, this is exactly the type of thing that ML is useful for, going from the well-defined Hyprland configuration language to their also well-defined Lua syntax. It banged it out in less than a minute with only a single mistake, which I easily fixed: it forgot to translate the comments to Lua. Otherwise it worked first try, whereas I had made several typos and gotten a few lines wrong when I was doing it by hand.
Not to say that I couldn't do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.
I also used it to transform a bunch of unstructured data into JSON, so that I could then use purpose-built tools like jq to parse that. If I'm having trouble finding certain information, I'll ask it to find me some resources to look at.
Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.
What I don't like is the way companies try to market it to people. I don't believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don't expect a machine to be able to decide things for me or to be some filter between me and others.
I ran it through LM Studio because it's really easy. I ran some kind of qwen 3.6 27b imatrix neo code DI; it is the best local model for coding I've tried, and I think it can be better than some cloud models.
No, too expensive. I wish I could but it doesn't make sense financially for me right now, it is much cheaper to buy openrouter credits from time to time
I've been running ministral on CPU on a home-server: works pretty nicely, not very performant for everyday tasks and the savings were not sufficient for it to make sense. It still was cheaper and faster to just use Mistral API and get better models.
This is the default version I got when I first tried using ollama without any experience. It worked, but it's a heavily quantized, lower parameter version of the model -- i.e. it's pretty dumb -- compared to what you can actually run on your hardware.
And that’s not even all of it. Basically they break models in many ways, and they’re slimey Tech Bros.
LM Studio is better, and easy.
If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).
It’s all pretty technical, and… that’s kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no "easy button" to run them without bad results, there can't be.
But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self-hosting your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.
To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that's practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main
I set up ollama on our thinkstation in the lab and I use it for looking up documentation, generating readmes, searching papers, and sometimes coding when I know what to do but don't feel it is worth it to spend time on it myself. So basically the chat with web search.
I have a 5080 and 128gb of ram running on a AMD 9950X.
Depending on the task I can get 170-200 t/s or more when the MoE only activates a few experts and fits inside the VRAM, or as low as 5-10 t/s when it activates more experts and has to hit the system memory. But for grunt work that doesn't need professor level tasks, it's more than capable and if you have the time, it's super worth it because it's basically free tokens.
I only use this for overnight work to save on tokens during the day, when I'm pulling analytics for my work and it just needs basic analysis that doesn't touch multiple tools.
During work hours I'm using GLM5.2 for web development, Kimi k2.7 for complicated data analysis and Minimax m3 if I need the context window to be bigger than what kimik2.7 can give me.
Not anymore. Not with hybrid offloading, where the GPU handles dense tensors and the CPU only runs the sparse MoEs. I'm running a 300B model on a single 3090, and it's faster than I can read.
You just need to use the right framework, and the right model.
I'll check that out - speed isn't my biggest issue so much as coding performance... The qwen 3.5 model I was using can write code, but it's... Meh? Like sometimes it doesn't even compile.
I did try tweaking llama.cpp to do some cpu offloading and it does seem to allow for much larger contexts at a modest performance loss. I'll check out larger models.
CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.
This only offloads "sparse" parts of the model to the CPU, which take up a lot of RAM but are very compute-light to run. In practice, that's most of the size of modern MoE LLMs.
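For anyone who hasn't used it, here is a minimal sketch of what such an invocation might look like, written as a small Python launcher helper. The model path, layer count and context size are hypothetical placeholders; `-m`, `-ngl`, `-c` and `--n-cpu-moe` are llama.cpp options, but check `llama-server --help` on your build before relying on the exact spelling.

```python
import shlex

def llama_server_cmd(model_path: str, n_cpu_moe: int,
                     ctx_size: int = 32768, gpu_layers: int = 99) -> list[str]:
    """Build a llama-server command that keeps MoE expert weights on the CPU.

    All values passed in are illustrative; tune them for your own hardware.
    """
    return [
        "llama-server",
        "-m", model_path,                # path to a GGUF file (hypothetical)
        "-ngl", str(gpu_layers),         # put as many layers as possible on the GPU
        "--n-cpu-moe", str(n_cpu_moe),   # keep expert weights of the first N layers on the CPU
        "-c", str(ctx_size),             # context size in tokens
    ]

cmd = llama_server_cmd("model.gguf", n_cpu_moe=20)
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Raising the `--n-cpu-moe` value moves more expert weights into system RAM and frees VRAM; lowering it does the reverse.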
Since implementation of the --fit parameter and its relatives, and --fit on becoming the default, llama.cpp intelligently decides what to offload. For me, it made --n-cpu-moe obsolete.
Sometimes it’s better to “cut it close,” with (for instance) a 27B model that’s nearly OOMing your VRAM fully offloaded, but you know will be fine in regular use without too many programs open.
In my case, with MiMo 2.5, it fills both my CPU and GPU RAM rather completely, so it’s best to set a static value so I don’t swap CPU RAM, and don’t OOM on the GPU either.
I do, I use ollama. I mostly just tinker, but I use it with Home Assistant for a quasi Alexa-like experience with the voice assistant, I use it for summarizing some YouTube transcripts I'm too lazy to read/watch, and I've tried to see how capable it is with coding.
Can you elaborate on what you are using exactly with home assistant ? And is English your primary language in that context ?
Trying to do something similar, English not primary and it's a bit... Harder than it seems. Can't figure out if it is because I'm not using English or something else. (3060 12GB BTW)
English is my primary, so that does make it easier. I use it for general conversation things, like asking it questions about the Titanic or making up a new story or something. It doesn't work as well as I'd like yet, but like I said, it's just another thing for me to mess around with and change.
I started out playing around with code generation using Ollama/open-webui and qwen 2.5 coder 14b on a 3060 12GB, but ended up on a winding journey with an ex datacenter card called the AMD V620. It's roughly equivalent to an RX 6800XT, but with double the VRAM.
At this point i've really done nothing productive with it but learned a lot about bios settings, GPU/ROCm drivers, and custom fan solutions/PWM controls trying to get it setup and optimized haha.
It's pretty sick though, that amount of VRAM with 512GB/s bandwidth can run Qwen 3.6 27B dense with 100k context window at 20 tokens/sec in LM studio. Draws 300 watts at the wall on my ITX chassis (idling about 30w).
I've been dabbling in building an aviation weather and field condition report application using this, but my next step is to rebuild my VS Code environment into a new machine. I'm kinda enjoying just fucking around with building the hardware too though
If you are having trouble getting the 6800xt to work with pi.dev I'd be surprised if the V620 would be any different, but I haven't tried that tool. I can attempt it and get back to ya in a couple days if you'd like.
I ended up getting it purely as it seemed like the cheapest option for 32GB VRAM that didn't have discontinued driver support. Around Jan/Feb 2026 the MI60s had recently blown up in price but the V620 still seemed niche/slept on, partially because AMD hasn't released an SR-IOV driver for this. Servethehome forums had a big thread about how these aren't particularly useful for home server/virtual machines as a result. I think it's still possible to pass it through to docker containers but I haven't tried it yet.
This guy accepted a $350 offer for mine: www.ebay.com/itm/157133307609
Then you'll need a shroud: www.ebay.com/itm/286347509481
The optional included fan works well, pushes 60CFM but is LOUD. I ended up replacing it with an Arctic P8 Max which is much quieter but only pushes 40CFM, but cools it fine with -100mV undervolt in LACT.
Yup, ollama, various models. I initially downloaded it because I, along with thousands of other people, wanted to see what would happen if I made models debate with each other after RAGging them with various books (The Prince, The Art of War, The complete works of Shakespeare, etc.).
The results were uninteresting and I abandoned the project pretty quickly. I'll sometimes use them for code analysis but they're too slow on my rig to be really useful.
Nothing so fancy. I just made a little python script to prompt the first model, wait for a response, then prompt the next model with the initial prompt + the response, and so on. It was very hacky and slow.
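A minimal sketch of that kind of round-robin loop, for anyone curious. This is not the commenter's script; `run_chain` and `fake_ask` are made-up names, and `ask` is a stand-in for whatever call actually sends a prompt to a model (an HTTP request to a local server, for instance).

```python
from typing import Callable, List, Tuple

def run_chain(models: List[str], opening_prompt: str,
              ask: Callable[[str, str], str]) -> List[Tuple[str, str]]:
    """Prompt each model in turn with the opening prompt plus all replies so far."""
    transcript = []
    prompt = opening_prompt
    for model in models:
        reply = ask(model, prompt)  # blocks until this model has answered
        transcript.append((model, reply))
        # The next model sees the initial prompt plus every earlier response.
        prompt = f"{prompt}\n\n[{model}]: {reply}"
    return transcript

def fake_ask(model: str, prompt: str) -> str:
    """Hypothetical stand-in so the sketch runs without any model server."""
    return f"{model} read {len(prompt)} characters"

for name, reply in run_chain(["model-a", "model-b"], "Debate: is war an art?", fake_ask):
    print(f"{name}: {reply}")
```

Because each call waits for the previous one, total runtime is the sum of all the models' response times, which matches the "hacky and slow" experience.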
Oh neat. Yeah, if something like that had existed (and I'd been aware of it) I probably would have used it instead of building my own shoestring version.
One of the projects I started and never got to a satisfactory end state was basically that, plus a judging round. Every model would respond to the same prompt, then every model would evaluate every other model's response for accuracy and completeness. Then the results would get logged to a spreadsheet.
It's simple enough, but for N models it requires N + N^2 model calls so it takes forever to run any decent dataset on consumer hardware. If I had the resources and a way to run it that didn't fry the planet, I think it would be a cool running set of comparative benchmarks. IDK if it'd be useful at all but I'm still interested to see the data.
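To put numbers on that, here is a small sketch that counts the model calls per prompt. The function name is made up. Note that N + N^2 corresponds to every model also grading its own answer; if each model only judges the *other* models' answers, it comes out to N + N(N-1) = N^2.

```python
def judging_calls(n_models: int, judge_own_answer: bool = False) -> int:
    """Model calls needed for one prompt: N answers plus all the evaluations."""
    answers = n_models
    judges_per_answer = n_models if judge_own_answer else n_models - 1
    return answers + n_models * judges_per_answer

for n in (3, 5, 10):
    print(n, judging_calls(n), judging_calls(n, judge_own_answer=True))
```

Either way the cost grows quadratically, so going from 5 to 10 models roughly quadruples the calls for every prompt in the dataset.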
Every model would respond to the same prompt, then every model would evaluate every other model’s response for accuracy and completeness
If I understand correctly I sorta kinda do that. I'll copy one AI's response, prompt another with something like 'Validate AI response:' and paste it in. HAHA I thought I was being tricky but you're already on it.
I think it's tricky. It's kind of like adding LLMs like vectors, and hopefully the effect can soften or at least reveal the shortcomings of individual models. Is it a good idea? I don't know, I think there are good reasons to think it's a waste of time and resources. I certainly think I'd need a better explanation of what use it would be before I spent more time building it. But I still think about what use it would be from time to time; I haven't decided that it's a bad idea yet.
P.S. This is a hypothesis, I haven't even designed the test for it, much less run it. What follows are my suppositions.
I think whether or not it's a good idea depends on how similar all the models are. I don't have a rigorous definition of "similar" but things like similar training data, similar design methodologies, similar QA processes would all contribute. Theoretically (I think), if they're all dissimilar, they should each catch errors the others miss. However, the more similar they are, the more likely they have the same biases and weak spots, and your error rate from a response + verification may be the same or even higher than the error rate for just the original prompt, and you'd be unlikely to detect those errors using just two similar models. It can instill false confidence in the results because you're doing something that should in theory increase the validity of the data, but in practice might make no difference or even make the quality of responses worse.
Technically, TTS/STT are mostly ML models; I'm pretty sure many people run these. I have a setup but I'm better with buttons than with spoken words, and I listen to ambient sounds or music. I think some day I'll make a voice assistant for talking to while driving, but that's not a trivial task hardware-wise, even if I used a cloud LLM layer, which I won't. Putting AI on bare metal sounds like an interesting project.
I have a homemade "local agent" that can actually "code" somewhat, I use it just to figure out how this thing works on the inside practically. Mostly useless otherwise (also I have GPU that's older than AI, so it's kind of fun technical task to run this stuff on pure RAM+swap). Feels like the whole hype is greatly overrated, but I appreciate a chance to learn something new anyway.
Ollama + about 8 different models at the moment, hosted on a mac mini with open webui as a front end.
Predominantly for transcription, translation, an extra round of security checks on code, a more context friendly home assistant interface, and a daily run of context evaluation on property I'm looking for with a lot of specific needs (acreage, min elevation change, soil type, area, etc).
On the list but haven't gotten to it yet, but I know I should. I could probably get a bit more out of that box with it, expand the context windows a bit...
Nice, I’ve got a Mac Studio M1 Max with 32GB of RAM that I use with Ollama and then I host OpenWebUI and OpenCode on my Arch Server. I use the Mac as a primary workstation, so it’s a little rough when I start running a model. I’m sure I could probably do and learn more about Ollama to improve my experience, but for now it works for certain tasks.
The other day I made a machine learning model that classifies images as either 'a certain type of undesirable image' (no, not porn) or 'any other image'. It is 96.4% accurate and takes 14 ms to classify one image (using CPU only - with a GPU it could be 5x - 10x faster).
I plan to offer this as an API service that social media networks can use to filter posts.
Currently I run my own custom IQ3_KT quantization of MiMo 2.5 300B, and it’s crazy good. It’s better than API models from not that long ago, and it’s served at about reading speed.
Never thought I’d ever run such a thing on my lowly desktop.
For quick scripts or code assistant, sometimes I use Qwen 27B (another custom quant, currently experimenting with exllama). Or Gemini 12B for messing with image/audio input. But TBH MiMo 2.5 with thinking disabled is smarter than 27B with it.
…And honestly, I use GLM 5.2 API a good bit.
I was lucky enough to get a yearly subscription for like $30, 6 months ago. I do self host the UIs or whatever takes the prompts, though.
Jup. Ollama and OpenWebUI is a great stack to tinker with some LLM models. They're kinda useful for aggregating large datasets, translations, frontend development and gathering relevant sources for me to read into. Also, Qwen has been amazing in understanding frameworks without documentation and writing one for me. I had to use some self-developed PHP framework for a task once and without qwen, I would've taken probably two more weeks to get the task done.
MiniCPM has also been REALLY good at image detection, describing it as accurately as possible, feeding it into qwen who then searches what the object could be and returning the result. I always liked Google Lens and that stack gave me a TEMU-Version of Google Lens that isn't quite as reliable, but definitely very useful.
I don’t host it exactly, just use it when I don’t use my graphics card for gaming. I run Qwen3.6-35b on my 16gb vram RX 9700 xt with 34t/s. I use it as an IT advisor, admin and Linux teacher for my cachyOS gaming PC.
I actually ran a series of A|B split tests (using GPT, Claude, Qwen 27B, Qwen 35B, GLM) on some code I'd written.
The Qwen models managed to find issues the others missed and offer useful suggestions.
Coding wise, they're a little too eager to take the next step / be a helpful assistant, and context collapse is a real thing with them. I would say yes, they are capable, and probably even more so in the Qwen specific coding harness.
The thing is, small models can only hold so much in their latent space. If you give them a big job or free range task, they will find a way to monkey paw it. They need short leash and test gates.
Pretty simple. People keep going on about how useful these local models are for coding. So what I wanted to do was to create a standardized test for myself to see if that was true before committing to anything.
( I think the various benchmarks out there are a bit fluffy, so I wanted to try it against a real workload.)
What I did was throw a bunch of money up at OpenRouter and then used Roo to call in diff models, one at a time.
I gave each the same task - that is, here is a piece of code, here is my ticket, here is my repo. Investigate what you want and then do what my ticket says.
I already knew what was wrong with the code, but I wanted to see how obedient the models are at sticking to a scoped ticket and what they would find.
By far the best bang for buck was GPT 5.4 mini. It is exceptionally obedient at doing exactly what you tell it as long as you tell it exactly what to do.
It won't go off piste if properly constrained.
I think for light - med workloads, $20 on ChatGPT is a criminal steal. Chat and Codex have a separate usage pool.
I'm also aware that this is open AI's lock in phase where they provide the samples of crack for free to get you hooked. And, yes, they are crack dealers in every sense of the word.
Anyway, it's good to know that with a little bit of elbow grease and some smarts, the smaller models, which could reasonably be self-hosted, could do a decent enough job if they are narrowly scoped.
You're probably not going to be able to yeet an entire code base at them and go "figure out what's wrong and fix it" while you snooze tho, but I think that's probably a good thing from a human in the middle perspective.
That Qwen 35B model is going to remain the people's champ for a long time I think. Surprisingly capable, even for code. I hear it loops badly at Q4 quant?
Probably that plus a higher quant solves it. Thing is most of us default to Q4_K_M as "precise enough"... and that seems to be kryptonite for the new Qwen's.
That's another thing with hosting AI that's not often discussed. Sure, you can maybe run that 27B model...but if it's at Q3_XS it's going to be .... "mentally challenged".
I've heard the Gemma models with QAT are meant to be near full precision at Q4 size. Haven't tried em yet.
Actually, on that topic - I've heard there's a different architecture (RWKV), that’s supposed to be much more efficient for long context because it uses an entirely different KV system.
Sadly, there are few RWKV native models and retraining a standard transformer to RWKV seems like a pain in the ass. I'd need to hire a cloud GPU, distill into a different architecture, mess with datasets .... honestly ICBF.
I started running LLMs a couple months ago on my own hardware. I have a Framework Desktop that I ordered last year and also recently picked up a refurbished 24GB AMD RX 7900 XTX which I'm doing some performance testing against. The dGPU is much better for dense models, and slightly faster for MoE if I'm willing to run them at a lower quant -- but uses more power and has annoying coil whine. The Framework Desktop uses ~100W under load, is quieter, and for the MoE models already runs them fast enough for most of my needs -- so most of my LLM use happens on that system still.
For software: I'm using ollama on the Framework currently, but I want to replace it with just using llama.cpp directly eventually. I've been using llama-cli for testing the dGPU. I wrote my own chat client to interact with ollama as well as a few other programs for specific tasks.
I've been using the LLMs for a mix of research (both personal and professional), entertainment, practical coding tasks (mostly debugging and brainstorming, plus a bit of UI prototyping, automatic generation of sequence diagrams for documentation, and light scripting), as well as automation of tedious tasks.
As an example of the latter, people often send me requests to prepare data sets by email but don't specify the sources they want precisely so I have to go match the name against the real name in our archives; LLMs are great for mapping the imperfect name -- with typos, missing prefixes, incorrect addition of spaces, addition/removal of hyphens, etc. -- to the exact name I actually need to pull the data off disk when given a lookup table to compare against.
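For comparison, a non-LLM baseline for the same lookup-table problem can be sketched with Python's standard `difflib`. This is a swapped-in technique, not what the commenter uses, and the archive names below are invented for illustration. It handles spacing, hyphen and case differences plus small typos or missing prefixes, but an LLM may cope better with messier requests.

```python
import difflib
import re
from typing import Optional, Sequence

def normalize(name: str) -> str:
    """Drop spaces, hyphens and underscores, and lowercase the rest."""
    return re.sub(r"[\s\-_]+", "", name).lower()

def resolve_name(query: str, canonical_names: Sequence[str],
                 cutoff: float = 0.6) -> Optional[str]:
    """Map an imperfect name to the closest exact name, or None if nothing is close."""
    by_key = {normalize(n): n for n in canonical_names}
    key = normalize(query)
    if key in by_key:  # only spacing/hyphen/case differences
        return by_key[key]
    close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None

# Hypothetical lookup table, not real archive names from the comment.
ARCHIVE_NAMES = ["NOAA-GOES-16", "NOAA-GOES-17", "METOP-B"]

print(resolve_name("goes 16", ARCHIVE_NAMES))   # missing prefix, extra space
print(resolve_name("metop b", ARCHIVE_NAMES))   # space instead of hyphen
```

Returning `None` instead of a weak guess keeps ambiguous requests visible so they can be checked by hand or passed to a model.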
As far as models go, I'm mostly using various Qwen 3.6 and Gemma4 variants. I have multiple versions of each for different purposes. llmfan46's uncensored Qwen 3.6 35B-A3B @ Q6_K (from Hugging Face) is my default model currently.
Bought b70 with egpu enclosure and usb4 connection wasn't really planning to actually run anything but now ended up with llama.cpp with openwebui - kids/parents want to/have to use chat, might as well provide local solution than them using industry options. Also started with ollama and Gemma 4 26b a4b - asked it to write script to setup llama.cpp in container.
Yeah, I've heard the B70 is good bang for buck. My kids love using chat GPT to generate images and I'm aware that there are some really capable local models that can do that as well now - B70 should make short work of it.
That may be something for me to look at later on if I decide to keep self hosting.
OTOH, I'm also aware that I may end up building something that they don't actually use. Been there, done that, and I don't want to do it again.
Actually, on that topic, one interesting use case for me is my youngest one wants to have a YouTube channel.
So obviously, I'm not going to let her become a YouTuber, but what I'm thinking of doing is providing her my old phone (properly locked down) so that she can video record clips of what she wants.
Then - have those clips sent automatically to our jellyfin server so it appears like a channel. Code a fake YT plugin so that AI can do likes, positive comments etc.
It's... work. I dunno...maybe a good enough AI can vibe code the entire project for me.
I've been testing coding capabilities a bit (mostly scripts - so that work done by ai is reproducible).
Context size is very much a required thing along with model capabilities.
Local model can generate good enough script in one shot - but reiterations r crazy
Use git to keep files tracked (easy to revert) and make a modular script - main script calls function a, function b etc where functions are relatively self contained (no need to look at others) when u need new capability add function c
If something need changing- try to do it urself (unless it's whole architectural change, then just start new project)
As for image gen + chat . as long as chat model + context + image model fits, u should be fine.
I use my gaming rig to serve up qwen3.6-coder to Open Web UI and that's been very successful in helping me refactor my home lab to be more efficient and easier to support. Over the years of building my server I got everything working, but let's just say it's a bit of a mess and a lot of shortcuts were taken.
I plan to look into ComfyUI soon but I don't have much of a use case for it at the moment.
Myself - I've self hosted LLMs before, but with only 4-8GB vram (depending which card is in place), I can't run the good stuff at acceptable enough speeds.
(Don't @ me - I know all the tricks with turbo quants, spec decoding, MoE etc. 192GB/s is 192GB/s)
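The "192GB/s is 192GB/s" point can be put into rough numbers. This is a back-of-the-envelope sketch, assuming token generation is memory-bandwidth bound, so every generated token has to stream all active weights once. It ignores KV-cache reads, compute limits and overhead, so real speeds land below this ceiling. The 14 GB figure is an illustrative assumption (roughly a 27B dense model at about 4 bits per weight), not a measurement.

```python
def max_decode_tokens_per_second(bandwidth_gb_s, active_weights_gb):
    """Upper bound on tokens/s if each token streams all active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# 192 GB/s card, ~14 GB of weights read per token (assumed, see above).
ceiling = max_decode_tokens_per_second(192, 14)
print(f"{ceiling:.1f} tok/s ceiling")
```

The same arithmetic explains why MoE models help: only the active parameters count toward `active_weights_gb`, so the ceiling rises even though the full model is larger.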
I do use Handy (STT) which is amazing (my fingers are arthritic and typing hurts after a while).
My personal use case for LLM is quite simple - a trumped up super google and / or self reflection / journalling / sound board. Despite being glib about it, that's actually very useful to me.
Work wise, I use the big winking orange asshole (Claude) when I have to. I have moral tension with it, so am seriously looking at other options. I hear good things about GLM 5.2, but if I can't run Qwen 35B at any kind of decent speed, well....self hosted GLM is a pipe dream.
I have played around with ollama and whisper. It's just too slow to be practical. The cost of the hardware is preclusive.
That said, I do selfhost openwebui and use inference end points from huggingface and ovh.
I've never used chatgpt or claude and I have to wonder whether those alternatives are really as terrible as the models available on huggingface. The output is always super plausible but usually just plain wrong.
Well, I don't exactly host AI. But some of my software uses AI and/or machine learning. My photo gallery does face detection, I've installed text to speech and speech to text. My Home Assistant has a voice satellite (which is a poor-man's Alexa because I lack the hardware to do voice recognition in realtime). And I also regularly try some large language models and chatbots. But I don't have any real application (yet). And it's slow without a proper GPU. So I'm more or less just messing around. Currently that's with Ministral 3.
I've played with it for Home Assistant integration, but I just don't have much interest in it. The whole thing is too inefficient at the moment, and the tiny models that can run in a few gigs of system RAM on an iGPU or NPU aren't good enough in quality or speed to rely on.
Hopefully some future generation micro-models will be more useful for the way I want to use it (aka , ultra light, no dedicated hardware etc.), but for now it's a lot of compute resources, plus heat and energy for a gimmick.
Agreed. It will be ironic if 1.58-bit models (Microsoft) turn out to be the great white hope.
I looked at the recent Steam stats (which is a GPU sample of convenience); the most common GPU size was 6GB. Meanwhile you probably need what...64GB unified memory or a 5090 to drive a decent model at a decent speed/context?
There's a real gap between the haves and the have nots and it's widening.
I’ve got ollama setup with whisper and piper and a HA voice PE, but I honestly haven’t gotten around to configuring much yet. Most notable thing was being able to use the wake word to start a timer, but it was pickier than old Siri about the precise wording.
I've fiddled around with a few models on ollama and opencode but more for the sake of seeing what I can run as ive yet to really find a use for it in my home usage.
I've tried just about most of the small models. Tried NanoClaw. I just don't have the equipment necessary to pull that off and make it a worthwhile in-house tool rather than an in-house oddity. I really, really want to tho. So much so that I have been looking at what it would take to accomplish that, which seems to be in the $4k to $5k USD range. The sweet spot for GPUs seems to be at the 32 GB level. It is pricey, but hell, at my age, I figure wtf....I should treat myself. What's wrong with that? If I do pull the trigger, I want it to be an LTS type computer like the one I built 15 years ago, which is still running like a champ today tho it's probably worth less than a quarter of what I had invested. So, I'd probably overstock it to the max.
Don't waste 5k on an AI computer. If you want a new one, buy for 2k at most. AI will get optimized more and more. Now we have MoE, with which we can run things at home we couldn't even dream about. The companies lose money fast; there will be massive optimization sooner or later.
You can get a P40 for much less than that, if your case can hold full height card. It's an old card but its 24GB, 400GB/s.
Else yeah...$3-4,000 is about table stakes, which doesn't amortise for just AI (not for my use cases anyway). I'd love a Strix but Santa is stingy.
Me - I have a fetish for tiny, low power computers. 1L lenovos, raspberry pis etc. That limits what I can run but with constraint comes ingenuity. So I'm making an expert system for myself.
It's not cooked yet (this is actually the first time I'm sharing it in public; it's not in installable state and the repo is new) but once it's done, I can have an always on local brain in a 2W envelope that runs fast. Might even port it to C64...I need an excuse to purchase the new Commodore ultimate.
Well, I haven't had any new equipment in 15 years. I always buy used or refurb'd. I'm getting old. I figure, I've worked hard enough, might as well enjoy the fruits of my labor.
My server is way too weak for that unfortunately. I run some llms on my laptop with ollama but it's not particularly effective. I use it to run dolphin series models when I need an uncensored LLM. I have tried running some of the coding models but they just aren't smart enough on my level of compute for any useful work, so I've ended up just paying API prices on OpenRouter.
I'm still messing around with self hosting llm, rn ive settled on using lumo from proton if I use an llm.
When I have run llm, I used koboldcpp. Works pretty well, depends on what you are doing and what models you use. Forget which models ive been using off the top of my head
If I wanted AI for some reason, it'd be self-host or nothing.
Running qwen3.6 27b through llama.cpp.
It's about as capable as sonnet 3.5.
I use it for light scripting, but real coding is done by cloud models.
I'm also using it as the brain for my Hermes agent. It sends me digests of news, subreddits, chats that I'd like to read but don't have time for. It does a great job researching things on the web for me, too.
That's a great model and it's the one I use too.
Do you mean Sonnet 4.5?
I don't have the rig to run it at real speeds but I've played with it over API. Seems pretty good.
Running decensored Qwen3.6-27b and a 9b Gemma for RAG and scrapes on Ollama with a mostly vibe coded discord bot. Just got it to run tools and scrape and post news on a schedule. The first model I can run locally that's smart enough to be useful. May give Jan a try for the back end after reading that other guy's rant.
Mostly use it for stupid questions I could have googled and to brag to friends.
I host my own AI, mostly for testing and because I wanted something that was mine and mine alone. I use Ollama and run models like Llama, Mistral, and Qwen. I honestly don’t use it much, but I wanted to have my own setup just in case online services go down or become less available. It’s part of my whole “own everything I use” mantra that I’ve been on lately.
No. I still have no use for it, and everything I use is automated without it at a far lower footprint.
Partially. I started with hosting my own llama3.2 + granite4 models using Ollama for my Home Assistant smart home and for general chat with OpenWebUI. I also run whisper for speech-to-text locally on my 1080 Ti GPU. I like the privacy and ownership of my self-hosted models, but I started to run into limitations with the small weights. So I built some tools that allow me to selectively route traffic to larger models hosted on DeepInfra depending on my need. For example, to GLM/Kimi models for code reviews or for my custom harnesses or harder problems.
No, I have taste.
I currently run Qwen3.6-27b on llama.cpp and use it via openwebui. Mostly, I use it for web research via tavily, to a lesser extent for coding and interactively learning about things that are new to me but common in training data (such as basic math or ML concepts).
Yes, llama-swap and I use it for home assistant text-gen notifications, basic coding tasks, etc
If anyone here self-hosts definitely check out llama-swap as it has some nifty features for hotswapping LLMs, image generation models and voice models.
I prefer my critical faculties completely intact and un-altered, thank you very much.
I do not require or desire a 400 watt bullshit-artist yes-man or vulnerability coder cooking my GPU.
Yes, I got a Strix Halo machine before the RAM price hike and use it to run all my ML stuff on it.
Currently using llama-swap with llama.cpp/ComfyUI and opencode/Open WebUI as frontend.
I'm running Qwen3.6-27b, Voxtral Mini 4b, Piper and Qwen Image. Also, some embedding and reranking models.
I use them for:
What sort of tok/s are you getting on the strix?
About 200 t/s prompt processing and 10-20 t/s with MTP.
Greatly depends on the task, predictable things like code generates at 18-20 t/s. Creative writing more like 10-17 t/s.
Damn - I thought strix would do a bit better than that, for how much it costs.
Given the 27b is a dense model, I think the numbers are quite ok. Curious about the quant tho.
The cool thing about the strix is its large unified memory, but it lacks memory bandwidth for compute intensive workloads. Something like Qwen3.5-122b MoE with only like 12b active parameters might run at twice the speed if it fits the configuration.
Q8 from unsloth.
My go to model for knowledge. Definitely much faster at Q5 but it lacks the tool calling quality of the Qwen3.6 models. Really hoping we see a Qwen3.6-122b soon...
Yeah. Though I think there's a new strix out soon (Medusa? Gorgon? Something like that).
Its a bit like my P40. On paper, it has 24GB. But that 24gb is capped at 400GB/s and the ai compute is what...Pascal era?
AI = Good, fast, cheap - pick 2
I hosted Qwen 3.5 9b uncensored on my site at https://masland.tech/ for a while. I didn't really use it and no one else used it so I took it down. These days I'm spending most of my time finding uses for AI and accessibility. One of the next things I'm planning is a video to text reasoning system, primarily for the purpose of grading used electronic devices.
I'm using anythingllm. It's quite easy to set up and use. I'm impressed by the perf on commodity hardware.
I have a simple slow model running on CPU in my cluster for karakeep. I've tried running a variety of models on my 7900XT but even with 16GB their performance just isn't there. My new work m5 Mac book with 48GB of ram is the first time I've seen usable performance for local models and it has been pretty impressive.
An aside for anyone reading this:
https://sleepingrobots.com/dreams/stop-using-ollama/
And that barely scratches the surface. Please.
Use anything but Ollama. Even APIs.
Thanks for this link. Because of this article, I had Claude stand up a llama.cpp container next to my already running ollama container. It ran side by side tests with the same model and parameters, and the results blew ollama out of the water. I'm in the process of moving hermes and openwebui over to the llama.cpp instance to see how it goes day to day.
If you’re using docker anyway, and “fast” pure GPU models, you might try a vllm container while you’re at it.
It should be much faster than even llama.cpp, albeit at the cost of context length, and it supports some exotic 4-bit quantization like SPQA.
Same with TabbyAPI. Its quantization is SOTA, though it does not support CPU offloading, and its speed is somewhere between vllm and llama.cpp.
Llama.cpp or death!
It's not that hard to use llama.cpp directly anyway. Why would I use a wrapper when I can just run a python script?
Or exllama! Vllm, sglang, Lorax. Koboldcpp, Aphrodite, text-generation-webui, LM Studio, powerinfer, ktransformers, mlc-LLM, really whatever floats your boat. Just not ollama, specifically.
I agree that the concerns listed there are smells, and I wasn't aware of some of the options listed there.
Thank you for sharing this!
looks like extreme nitpicking without any real issues beyond some VC funding a FOSS issues.
//whyre you spamming the comment to everyone? its quite alarmist actually
I completely disagree.
Frankly, I find the description "VC funding a FOSS" offensive. They aren't funding the engine. I've been messing with LLM inference engines since 2022, and Ollama is the worst I've seen in the community.
They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution yet redirecting GH support requests there. They sometimes make their own GGUFs+forked releases which are broken and incompatible with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn't really work and they'll never upstream one line. They set a default context size that's basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.
And if that's all fine, they're enshittifying the app with closed code, and pointers to cloud models.
They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the "default." Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama community hates their guts, and that's not even getting into the interpersonal drama they've stirred.
They are a leech that's a net drag to the whole community, that we can't get rid of because they're attention grifters. And they've gotten worse and worse over time.
It's more moral to use any cloud API over Ollama, in my eyes. They're a grift.
EDIT: And, to be clear, I’m not against VC funded downstream stuff.
LM Studio is good! Even though it’s closed source.
Tons of downstream projects are great.
Yes. My Actual Intelligence lives in my head, and runs mostly on coffee.
Just coffee?!? That's cool.
Mine runs on:
Mostly on coffee, not exclusively. Noticeable amounts of spite & tortilla chips are also present, yes, but... no shame.
Nice!
If that's not already on a shirt it should be
Do you get many hallucinations?
Only when I'm deprived of coffee.
Would flowers work instead?
No. I'm not dead yet.
I'll make sure to send you flowers, Algernon.
@[email protected] this comment is not (directly) for you, I just want it in context.
Before you report someone for breaking rule 1, please look at the context. Specifically, the username someone may be replying to.
LOL.
https://en.wikipedia.org/wiki/Flowers_for_Algernon
Looks like someone got big mad over a harmless, good natured and on topic joke. You love to see it.
Sorry they wasted your time.
Eh, it's fine. Certainly better than the "I don't like this so I'm going to report it" approach.
critical security bug: if coffee is taken away my head hurts :(
As we know AI stands for "An Indian", so if you're not from India, its actually impossible to self host.
Well, unless you manage to trap one in your basement, but that would violate human rights and hopefully also break the laws of your country.
You may be confusing Indians with gremlins (AGI). Which might explain ChatGPTs obsession with gremlins
That doesn't sound artificial.
With sufficient coffee, mine shows considerable artifice.
Plastic flowers.
Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:
3 acronyms in this thread; the most compressed thread commented on today has 3 acronyms.
[Thread #27 for this comm, first seen 25th Jun 2026, 15:40]
i don't use it at all, i do want some selfhosted speech to text model (whisper?) but my computer is ancient so it would be awfully slow. i have some multi hour audio recordings from presentations, would be nice to have them in text and searchable..
How ancient is ancient? TTS and STT are much lighter than llm. (eg: Whisper, Piper, Kokoro, Coqui etc)....you might have more capability than you think, especially if you're doing batch processing like that.
a haswell xeon e5-1650 machine, i remember running llama 7b in llama.cpp in like 2023 and it was quite sluggish. guess i should try whisper at some point..
Ha. You were doing inference on CPU on a haswell era. Been there, done that.
OTOH...whisper.cpp is heavily optimised for it.
Plus, you're doing batch transcription, not real-time, so slow doesn't actually matter.
Fire Whisper small or medium overnight and wake up to searchable text.
PS: if you want a good fast little llm, something like Qwen 3.6 2B will work well on the Xeon.
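For the overnight batch idea above, a minimal sketch of queueing up whisper.cpp runs could look like this. Assumptions to check against your own build: the binary is called `whisper-cli`, the `-m`, `-f` and `-otxt` flags behave as in whisper.cpp's example CLI (`--help` will confirm), and the recordings are already converted to 16 kHz WAV. The folder and model file names are hypothetical.

```python
from pathlib import Path
import subprocess


def build_whisper_command(audio_path, model_path, binary="whisper-cli"):
    """Return the argv for transcribing one file and writing a .txt transcript."""
    return [binary, "-m", str(model_path), "-f", str(audio_path), "-otxt"]


def transcribe_folder(folder, model_path, dry_run=True):
    """Build one command per .wav file (sorted); run them only if dry_run is False."""
    commands = [
        build_whisper_command(wav, model_path)
        for wav in sorted(Path(folder).glob("*.wav"))
    ]
    if not dry_run:
        for command in commands:
            # One file at a time: slow is fine for an overnight job.
            subprocess.run(command, check=True)
    return commands


# Dry run: just show what would be executed (hypothetical paths).
for command in transcribe_folder("recordings", "models/ggml-small.bin"):
    print(" ".join(command))
```

Running it with `dry_run=False` before bed processes the files sequentially; the resulting text files can then be searched with plain grep.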
Yes. Openwebui/ollama for LLM, comfyui for stable diffusion. I just dick around with it as a toy.
Same, toyed with it for creating stupid things like bot for telegram, that basically was a 3rd-person NSFW storyteller in RP chat. Sadly, after I made said bot I remember that I don't have friends to RP with.
That being said, ollama+openwebui kinda sucks: openwebui has a "wider scope" and features that you don't need, like auth via social providers and managing multiple accounts, while ollama itself does the opposite and lacks certain features (like proper mmap support to load big models), is slow in comparison to pure llama.cpp and is generally easily replaceable with lm studio, which provides both client and server. So yeah, my advice for anyone who wants to try it locally - just use lm studio.
Same. It's somewhat useful on some very small scripting or tasks...but it's mostly just to try out a new model or two. It's not really useful for anything big.
I will have to say....even my tiny models are about as good as Chatgpt/Claude/etc... which makes me think about how much people are spending on tokens regularly. I was able to get the same kind of python script started with my local tiny model that was comparable to the newest Claude code offerings.
What local models have you been using? And what hardware are you running them on? I've been playing with local LLMs a bit for exactly your use case.
I have zero interest in vibe coding or full agentic workflows. But having a local LLM generate a Bash script to help me automate parts of my home lab infrastructure would be nice.
What are your hardware specs?
Ryzen 7 5800 X3D Radeon RX 9070XT 32GB of DDR4 system memory.
How hard does it push this setup? How far can you scale up your own models on this hardware?
I was put off by ComfyUI, seems awfully complex. How is your experience?
Any suggestions to start? I have Fooocus installed now
It is difficult to understand in the beginning but has great support for premade workflows. It even saves the workflow into its output images so you can drag and drop them into the webui to duplicate the setup that generated the image. Use the internet to get premade workflows and mess around with them to see what the options do and you'll slowly learn how it works. If you don't care about precise control over the generations or understanding how image generators works then just use something else more all-in-one.
Yeah, I'm using qwen 31b a3b on an amd 9070xt. It requires a bit of cpu offloading, but it's still plenty fast. Using it with llama.cpp. Combine that with some MCPs such as ddg-search to make it truly useful by actually being able to search online.
I mostly use it for small tedious tasks with well defined inputs and outputs. For example when hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua language until I realized oh wait this is exactly the type of thing that ML is useful for. Going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake which I easily fixed. The mistake it made was that it forgot to translate the comments to lua. It did it in less than a minute and worked first try. Whereas I had made several typos and gotten a few lines wrong when I was doing it by hand.
Not to say that I couldn't do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.
I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I'm having trouble finding certain information, I'll ask it to find me some resources to look at.
Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.
What I don't like is the way companies try to market it to people. I don't believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don't expect a machine to be able to decide things for me or to be some filter between me and others.
I tried but I only have 16g of ram and it wouldn't complete a thought alas
I ran it through lmstudio because it's really easy. I ran some kind of qwen 3.6 27b imatrix neo code DI; it is the best local model for coding I've tried, and I think it can be better than some cloud models.
I have the setup, never found a use for it though.
Why would I?
Nope.
No, too expensive. I wish I could but it doesn't make sense financially for me right now, it is much cheaper to buy openrouter credits from time to time
I tried Qwen 3.6 a3b and Gemma 4 a4b, but both were too stupid for everyday work.
I've been running ministral on CPU on a home-server: works pretty nicely, not very performant for everyday tasks and the savings were not sufficient for it to make sense. It still was cheaper and faster to just use Mistral API and get better models.
Yeah, mostly for translation purposes.
I think I currently have gemma 4 set up.
I recently gave it a try with qwen3.5 and deepseek coder v2. I have a RTX3090 and these are the largest models that can run comfortably on it.
Conclusion, they are both fucking useless. Free tier claude runs circles.
If you just pulled the default version of qwen3.5 from ollama's repo you downloaded a mediocre one that only uses ~6GB.
Check ollama show qwen3.5 and see if you get something like this in the result:
This is the default version I got when I first tried using ollama without any experience. It worked, but it's a heavily quantized, lower parameter version of the model -- i.e. it's pretty dumb -- compared to what you can actually run on your hardware.
I will check it later. I loaded whichever one Claude suggested lol
Yeah :(
We're not there yet on consumer rigs.
Did you serve them with ollama?
It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.
Is there an alternative to ollama? The point was to run something locally.
https://sleepingrobots.com/dreams/stop-using-ollama/
And that’s not even all of it. Basically they break models in many ways, and they’re slimy Tech Bros.
LM Studio is better, and easy.
If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).
It’s all pretty technical, and… thats kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no "easy button" to run them without bad results, there can't be.
But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self hosting your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.
Oh, and I just saw you have a 3090.
To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that's practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main
Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF
If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should "beat" the cheapest Claude, give or take.
If you have 64GB, I'd suggest a quantization of Step 3.7.
If you have 32GB or 48, I'm not sure. I'd need to look if any "small" MoE is actually better than Qwen 27B now.
I run Handy with Parakeet for speech to text, and home assistant with Whisper for the same. Whisper+ on my phone.
I think that counts. But I have more relevant and useful things to do on my hardware and no 2000€+ to get LLM-capable hardware 😂
I set up ollama on our thinkstation in the lab and I use it for looking up documentation, generating readmes, searching papers, and sometimes coding when I know what to do but don't feel it is worth it to spend time on it myself. So basically the chat with web search.
Which models did you find particularly useful for those tasks?
Gemma 4, gpt oss, and nemotron. Currently I've been sticking with Gemma more, the 31 billion parameters one.
I'm running dwarfstar which is a 2 bit deepseek v4 flash. It's quite capable even at 2 bit.
This dwarfstar looks interesting, can you elaborate on your setup and what kind of inference speeds you are getting?
I have a 5080 and 128gb of ram running on a AMD 9950X.
Depending on the task I can get 170-200 t/s when the MoE only calls a few experts and everything fits inside the VRAM, or as low as 5-10 t/s when it calls more experts and has to hit the system memory. But for grunt work that doesn't need professor level tasks, it's more than capable, and if you have the time it's super worth it because it's basically free tokens.
I only use this for overnight work to save on tokens during the day, when I'm pulling analytics for my work and it just needs basic analysis that doesn't touch multiple tools.
During work hours I'm using GLM5.2 for web development, Kimi k2.7 for complicated data analysis and Minimax m3 if I need the context window to be bigger than what kimik2.7 can give me.
I've thought about it, but I actually could never think of anything I would do with it.
Found vLLM to be the most efficient local runtime service. And "ray" as a good (but complicated) way to distribute the load: https://docs.ray.io/
I've tried a few times but with only 8gig of vram it's simply not worth it.
Have you tried qwen3.5-9b? It’s pretty solid for its size.
Yeah, it's "good for its size" but it's just too flaky for me to use for any significant coding.
Yeah, I wouldn’t use it for coding. It’s a bit dumb unfortunately.
How much CPU RAM do you have?
64G. But CPU inference is painfully slow.
Not anymore. Not with hybrid offloading, where the GPU handles dense tensors and the CPU only runs the sparse MoEs. I'm running a 300B model on a single 3090, and it's faster than I can read.
You just need to use the right framework, and the right model.
I'd suggest trying ik_llama.cpp and a MoE like one of these: https://huggingface.co/models?other=ik_llama.cpp&sort=modified&search=35B
And speculative decoding like DFlash or MTP (which you can also get specific models for).
EDIT: Wrong link.
I'll check that out - speed isn't my biggest issue so much as coding performance... The qwen 3.5 model I was using can write code, but it's... Meh? Like sometimes it doesn't even compile.
I did try tweaking llama.cpp to do some cpu offloading and it does seem to allow for much larger contexts at a modest performance loss. I'll check out larger models.
CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.
This only offloads the "sparse" parts of the model to the CPU, which take up a lot of RAM but are very compute-light to run. In practice, that's most of the size of modern MoE LLMs.
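As a rough illustration of the --n-cpu-moe setup described above, this sketch just assembles a llama-server command line. The model path, the layer count of 24 and the context size are placeholder assumptions to tune for your own VRAM; `-m`, `-ngl`, `-c` and `--n-cpu-moe` are llama.cpp options as I understand them, so check `llama-server --help` on your build.

```python
import shlex


def llama_server_command(model_path, n_cpu_moe, ctx_size=32768):
    """Build a llama-server argv that keeps the MoE expert weights of the
    first `n_cpu_moe` layers in system RAM and everything else on the GPU."""
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                    # offload all layers to the GPU...
        "--n-cpu-moe", str(n_cpu_moe),   # ...except expert tensors of the first N layers
        "-c", str(ctx_size),
    ]


# Hypothetical model file; raise or lower 24 until VRAM is nearly full.
print(shlex.join(llama_server_command("models/some-moe-Q4_K_M.gguf", 24)))
```

Lowering the `--n-cpu-moe` value moves more expert weights onto the GPU, which is faster until VRAM runs out.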
Since implementation of the --fit parameter and its relatives, and --fit on becoming the default, llama.cpp intelligently decides what to offload. For me, it made --n-cpu-moe obsolete.
Mostly, yeah.
Sometimes it’s better to “cut it close,” with (for instance) a 27B model that’s nearly OOMing your VRAM fully offloaded, but you know will be fine in regular use without too many programs open.
In my case, with MiMo 2.5, it fills both my CPU and GPU RAM rather completely, so it’s best to set a static value so I don’t swap CPU RAM, and don’t OOM on the GPU either.
I do, I use ollama. I mostly just tinker, but I use it with home assistant for a quasi Alexa like experience with the voice assistant, I use it for summarizing some YouTube transcripts I'm too lazy to read/watch, and I've tried to see how capable it is with coding.
Can you elaborate on what you are using exactly with home assistant ? And is English your primary language in that context ?
Trying to do something similar, English not primary and its a bit... Harder than it seems. Can't figure out if it is because I'm not using English or something else. (3060 12GB BTW)
English is my primary, so that does make it easier. I use it for general conversation things, like asking it questions about the Titanic or making up a new story or something. It doesn't work as well as I'd like yet, but like I said, it's just another thing for me to mess around with and change.
I started out playing around with code generation using Ollama/open-webui and qwen 2.5 coder 14b on a 3060 12GB, but ended up on a winding journey with an ex datacenter card called the AMD V620. It's roughly equivalent to an RX 6800XT, but with double the VRAM. At this point I've really done nothing productive with it but learned a lot about bios settings, GPU/ROCm drivers, and custom fan solutions/PWM controls trying to get it set up and optimized haha.
It's pretty sick though, that amount of VRAM with 512GB/s bandwidth can run Qwen 3.6 27B dense with 100k context window at 20 tokens/sec in LM studio. Draws 300 watts at the wall on my ITX chassis (idling about 30w).
I've been dabbling in building an aviation weather and field condition report application using this, but my next step is to rebuild my VS Code environment into a new machine. I'm kinda enjoying just fucking around with building the hardware too though
Oh...i recognise this sickness :)
I went down the same rabbit hole. I have a 6800xt however but have issues getting it to perform outside of llm chats into using tools like pi.dev
Is it worth getting a v620?
If you are having trouble getting the 6800xt to work with pi.dev I'd be surprised if the V620 would be any different, but I haven't tried that tool. I can attempt it and get back to ya in a couple days if you'd like.
I ended up getting it purely as it seemed like the cheapest option for 32GB VRAM that didn't have discontinued driver support. Around Jan/Feb 2026 the MI60s had recently blown up in price but the V620 still seemed niche/slept on, partially because AMD hasn't released an SR-IOV driver for this. Servethehome forums had a big thread about how these aren't particularly useful for home server/virtual machines as a result. I think it's still possible to pass it through to docker containers but I haven't tried it yet.
This guy accepted a $350 offer for mine:
www.ebay.com/itm/157133307609
Then you'll need a shroud:
www.ebay.com/itm/286347509481
The optional included fan works well, pushes 60CFM but is LOUD. I ended up replacing it with an Arctic P8 Max which is much quieter but only pushes 40CFM, but cools it fine with -100mV undervolt in LACT.
Yup, ollama, various models. I initially downloaded it because I, along with thousands of other people, wanted to see what would happen if I made models debate with each other after RAGging them with various books (The Prince, The Art of War, The complete works of Shakespeare, etc.).
The results were uninteresting and I abandoned the project pretty quickly. I'll sometimes use them for code analysis but they're too slow on my rig to be really useful.
Did you use OWUIs native "call simultaneous models to answer" feature for that or one of the AI debate harnesses?
Nothing so fancy. I just made a little python script to prompt the first model, wait for a response, then prompt the next model with the initial prompt + the response, and so on. It was very hacky and slow.
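For anyone curious, a relay like the one described could look roughly like this. It's a minimal sketch, not the original script: each "model" is just a hypothetical callable that takes a prompt and returns text, so you'd plug in whatever client you actually use (Ollama, llama.cpp, an API).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a model is any function from prompt text to reply text.
ModelFn = Callable[[str], str]


def relay_debate(prompt: str, models: Dict[str, ModelFn],
                 rounds: int = 1) -> List[Tuple[str, str]]:
    """Prompt each model in turn with the initial prompt + the previous reply."""
    transcript: List[Tuple[str, str]] = []
    context = prompt  # the first model only sees the initial prompt
    for _ in range(rounds):
        for name, ask in models.items():
            reply = ask(context)  # blocks until this model has answered
            transcript.append((name, reply))
            # The next model gets the initial prompt plus this response.
            context = f"{prompt}\n\nPrevious response from {name}:\n{reply}"
    return transcript


if __name__ == "__main__":
    # Stub models so the sketch runs without any backend.
    stubs = {
        "model_a": lambda p: "War is mostly logistics.",
        "model_b": lambda p: f"Replying to {len(p)} chars of context.",
    }
    for speaker, text in relay_debate("Is war an art?", stubs):
        print(f"{speaker}: {text}")
```

Everything runs strictly one call after another, which is why this approach is slow on a single machine.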
Ah - I thought you might have used something like this
https://github.com/hereisSwapnil/ai-council
Oh neat. Yeah, if something like that had existed (and I'd been aware of it) I probably would have used it instead of building my own shoestring version.
LOL I kind of do that...sort of. I'll ask several AI the very same question to see what they spit out.
You'll like this then
https://aisaywhat.org/
Well I'll be damned. Of course the law of large numbers dictates someone, somewhere has the same thought.
One of the projects I started and never got to a satisfactory end state was basically that, plus a judging round. Every model would respond to the same prompt, then every model would evaluate every other model's response for accuracy and completeness. Then the results would get logged to a spreadsheet.
It's simple enough, but for N models it requires N response calls plus N×(N-1) judging calls, so about N^2 model calls per prompt, and it takes forever to run any decent dataset on consumer hardware. If I had the resources and a way to run it that didn't fry the planet, I think it would be a cool running set of comparative benchmarks. IDK if it'd be useful at all but I'm still interested to see the data.
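To put numbers on the call count, here is a small sketch (the function name is made up). If each model only judges the other models' answers, the total per prompt is N + N×(N-1) = N^2; if models also judge their own answer, it is N + N^2. Either way it grows quadratically with the number of models.

```python
def calls_per_prompt(n_models: int, judge_self: bool = False) -> int:
    """Model calls needed for one prompt: one answer each, then cross-judging."""
    answers = n_models
    judges_per_answer = n_models if judge_self else n_models - 1
    return answers + n_models * judges_per_answer


for n in (3, 5, 8):
    print(n, calls_per_prompt(n), calls_per_prompt(n, judge_self=True))

# Total for a dataset is just prompts * calls_per_prompt(n).
print(100 * calls_per_prompt(5))
```

With 5 models and a 100-prompt dataset that is already 2,500 sequential generations, which is where the "takes forever on consumer hardware" part comes from.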
If I understand correctly I sorta kinda do that. I'll copy and paste one AI's response into another and prompt something like 'Validate AI response:' and paste it in. HAHA I thought I was being tricky but you're already on it.
I think it's tricky. It's kind of like adding LLMs like vectors, and hopefully the effect can soften or at least reveal the shortcomings of individual models. Is it a good idea? I don't know, I think there are good reasons to think it's a waste of time and resources. I certainly think I'd need a better explanation of what use it would be before I spent more time building it. But I still think about what use it would be from time to time; I haven't decided that it's a bad idea yet.
I mean I do it, in my rudimentary way, to check for some semblance of consistency. I'm unclear why you think that's not a good idea?
P.S. This is a hypothesis, I haven't even designed the test for it, much less run it. What follows are my suppositions.
I think whether or not it's a good idea depends on how similar all the models are. I don't have a rigorous definition of "similar" but things like similar training data, similar design methodologies, similar QA processes would all contribute. Theoretically (I think), if they're all dissimilar, they should each catch errors the others miss. However, the more similar they are, the more likely they have the same biases and weak spots, and your error rate from a response + verification may be the same or even higher than the error rate for just the original prompt, and you'd be unlikely to detect those errors using just two similar models. It can instill false confidence in the results because you're doing something that should in theory increase the validity of the data, but in practice might make no difference or even make the quality of responses worse.
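One way to make that supposition concrete is a toy model. This is purely illustrative, not a measured result: the linear blend below and every number in it are assumptions. Say the first model is wrong with some probability, and an independent verifier would miss that error with some probability. A `similarity` knob then moves the verifier from fully independent (0.0) to sharing every blind spot (1.0).

```python
def undetected_error_rate(gen_error: float, verifier_miss: float,
                          similarity: float) -> float:
    """Toy estimate of P(answer is wrong AND the verifier does not flag it)."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    # Assumption: the miss rate blends linearly from the independent case
    # (verifier_miss) to the fully shared-blind-spot case (always misses).
    miss_given_error = verifier_miss + similarity * (1.0 - verifier_miss)
    return gen_error * miss_given_error


for s in (0.0, 0.5, 1.0):
    print(s, round(undetected_error_rate(0.10, 0.20, s), 3))
```

In this toy, a fully similar verifier leaves the undetected error rate exactly where it was without any check. It doesn't capture the "even worse" case (a verifier talking the user out of a correct answer), which would need a false-alarm term as well.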
Technically, TTS/STT are mostly ML models; I'm pretty sure many people run these. I have a setup but I'm better with buttons than with spoken words, and I listen to ambient sounds or music. I think some day I'll make a voice assistant for talking to while driving, but that's not a trivial task hardware-wise, even if I used a cloud LLM layer, which I won't. Putting AI on bare metal sounds like an interesting project.
I have a homemade "local agent" that can actually "code" somewhat; I use it just to figure out how this thing works on the inside in practice. Mostly useless otherwise (also I have a GPU that's older than AI, so it's kind of a fun technical task to run this stuff on pure RAM+swap). Feels like the whole hype is greatly overrated, but I appreciate a chance to learn something new anyway.
Yep.
Ollama + about 8 different models at the moment, hosted on a mac mini with open webui as a front end.
Predominantly for transcription, translation, an extra round of security checks on code, a more context friendly home assistant interface, and a daily run of context evaluation on property I'm looking for with a lot of specific needs (acreage, min elevation change, soil type, area, etc).
I have to recommend switching to llama.cpp. It's SO much faster than ollama.
On the list but haven't gotten to it yet, but I know I should. I could probably get a bit more out of that box with it, expand the context windows a bit...
How? What is your average response time?
Apple silicon is pretty good at it as long as you've got the ram for it. I wouldn't do less than 16GB.
A few seconds for most of the tasks
What spec Mini do you use?
Just an m2 w/ 16gb I repurposed.
Can't really do a lot at once, and the context is limited, but it does the trick. I'd buy a few more if I saw them at the right price.
Nice, I’ve got a Mac Studio M1 Max with 32GB of RAM that I use with Ollama and then I host OpenWebUI and OpenCode on my Arch Server. I use the Mac as a primary workstation, so it’s a little rough when I start running a model. I’m sure I could probably do and learn more about Ollama to improve my experience, but for now it works for certain tasks.
I got mine a few years back for some iOS builds, don't need to do them that often so it became the model host for me
The other day I made a machine learning model that classifies images as either 'a certain type of undesirable image' (no, not porn) or 'any other image'. It is 96.4% accurate and takes 14 ms to classify one image (using CPU only - with a GPU it could be 5x - 10x faster).
I plan to offer this as an API service that social media networks can use to filter posts.
Ollama with gemma 4 for LLM stuff, coding brainstorming, etc.
ComfyUI with z-image or stable diffusion for images.
How is Gemma compared to others?
Yep.
I have a RTX 3090 + 128GB CPU RAM.
Currently I run my own custom IQ3_KT quantization of MiMo 2.5 300B, and it’s crazy good. It’s better than API models from not that long ago, and it’s served at about reading speed.
Never thought I’d ever run such a thing on my lowly desktop.
For quick scripts or code assistant, sometimes I use Qwen 27B (another custom quant, currently experimenting with exllama). Or Gemma 12B for messing with image/audio input. But TBH MiMo 2.5 with thinking disabled is smarter than 27B with it.
…And honestly, I use GLM 5.2 API a good bit.
I was lucky enough to get a yearly subscription for like $30, 6 months ago. I do self host the UIs or whatever takes the prompts, though.
That's impressive and probably within reach of most serious home labs.
I quite like MiMo and I agree with your assessment of its capability.
Mind you, I’m running Mimo, not the big Mimo Pro.
But yeah. I really like the model, even for one of its size. And it hardly feels quantized as a trellis quant.
Jup. Ollama and OpenWebUI is a great stack to tinker with some LLMs. They're kinda useful for aggregating large datasets, translations, frontend development and gathering relevant sources for me to read into. Also, Qwen has been amazing at understanding frameworks without documentation and writing the docs for me. I had to use some self-developed PHP framework for a task once and without qwen, I would've taken probably two more weeks to get the task done.
MiniCPM has also been REALLY good at image detection: it describes the image as accurately as possible and feeds that into qwen, which then searches for what the object could be and returns the result. I always liked Google Lens and that stack gave me a TEMU version of Google Lens that isn't quite as reliable, but definitely very useful.
This sounds like an excellent use case and I don't know why you were downvoted.
Because AI bad I guess. I'm not expecting sane behavior on lemmy when it comes to AI or capitalism lmao
Probably. I wish Lemmy would remove up / down votes entirely. I might ask our lovely new mod if that's possible.
The like / dislike button has been a curse since invented.
https://www.theguardian.com/technology/2017/oct/05/smartphone-addiction-silicon-valley-dystopia
EDIT: ah - I can hide em on my end. Not as good but it will do
I don’t host it exactly, just use it when I don’t use my graphics card for gaming. I run Qwen3.6-35b on my 16gb vram RX 9700 xt with 34t/s. I use it as an IT advisor, admin and Linux teacher for my cachyOS gaming PC.
Do you find you can do coding tasks with it well?
I’m not a coder, so I don’t know exactly. It is able to code, but I would say somebody with experience should guide it and keep an eye on the results.
I actually ran a series of A|B split tests (using GPT, Claude, Qwen 27B, Qwen 35B, GLM) on some code I'd written.
The Qwen models managed to find issues the others missed and offer useful suggestions.
Coding wise, they're a little too eager to take the next step / be a helpful assistant, and context collapse is a real thing with them. I would say yes, they are capable, and probably even more so in the Qwen specific coding harness.
The thing is, small models can only hold so much in their latent space. If you give them a big job or free range task, they will find a way to monkey paw it. They need short leash and test gates.
Neat, what did your setup look like? You mentioned the qwen harness? Run it all on one machine?
Pretty simple. People keep going on about how useful these local models are for coding. So what I wanted to do was to create a standardized test for myself to see if that was true before committing to anything.
( I think the various benchmarks out there are a bit fluffy, so I wanted to try it against a real workload.)
What I did was throw a bunch of money up at OpenRouter and then used Roo to call in diff models, one at a time.
I gave each the same task - that is, here is a piece of code, here is my ticket, here is my repo. Investigate what you want and then do what my ticket says.
I already knew what was wrong with the code, but I wanted to see how obedient the models are at sticking to a scoped ticket and what they would find.
By far the best bang for buck was GPT 5.4 mini. It is exceptionally obedient at doing exactly what you tell it as long as you tell it exactly what to do.
It won't go off piste if properly constrained.
I think for light - med workloads, $20 on ChatGPT is a criminal steal. Chat and Codex have a separate usage pool.
I'm also aware that this is OpenAI's lock-in phase where they provide the samples of crack for free to get you hooked. And, yes, they are crack dealers in every sense of the word.
Anyway, it's good to know that with a little bit of elbow grease and some smarts, the smaller models, which could reasonably be self-hosted, could do a decent enough job if they are narrowly scoped.
You're probably not going to be able to yeet an entire code base at them and go "figure out what's wrong and fix it" while you snooze tho, but I think that's probably a good thing from a human in the middle perspective.
That Qwen 35B model is going to remain the people's champ for a long time I think. Surprisingly capable, even for code. I hear it loops badly at Q4 quant?
Looping was a problem after reaching a certain context window size. The llama.cpp flag --flash-attn on and looping (repetition) penalties helped.
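As a sketch of what such a launch might look like: the snippet below just assembles a llama-server command line. The exact penalty flags and values used above aren't stated, so these are illustrative guesses rather than recommendations, and the model path is hypothetical. Recent llama.cpp builds accept --flash-attn on; older builds took --flash-attn with no value, so check --help for your version.

```python
import shlex
from typing import List


def llama_server_args(model_path: str, ctx_size: int = 32768,
                      repeat_penalty: float = 1.1, repeat_last_n: int = 256,
                      presence_penalty: float = 0.0) -> List[str]:
    """Build an argv list for llama-server with flash attention and repeat penalties."""
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--flash-attn", "on",                        # value form used by recent builds
        "--repeat-penalty", str(repeat_penalty),     # >1.0 discourages repeated tokens
        "--repeat-last-n", str(repeat_last_n),       # how far back the penalty looks
        "--presence-penalty", str(presence_penalty),
    ]


if __name__ == "__main__":
    # Hypothetical model file; print the command instead of running it.
    print(shlex.join(llama_server_args("models/qwen-35b-q5.gguf")))
```

Passing the list to subprocess.run would start the server; printing it first makes it easy to eyeball the flags.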
Probably that plus a higher quant solves it. Thing is most of us default to Q4_K_M as "precise enough"... and that seems to be kryptonite for the new Qwens.
That's another thing with hosting AI that's not often discussed. Sure, you can maybe run that 27B model...but if it's at Q3_XS it's going to be .... "mentally challenged".
I've heard the Gemma models with QAT are meant to be near full precision at Q4 size. Haven't tried em yet.
Actually, on that topic - I've heard there's a different architecture (RWKV) that’s supposed to be much more efficient for long context because it is recurrent: instead of a KV cache that grows with the context, it carries a fixed-size state.
Sadly, there are few RWKV native models and retraining a standard transformer to RWKV seems like a pain in the ass. I'd need to hire a cloud GPU, distill into a different architecture, mess with datasets .... honestly ICBF.
Yeah, a higher quant would be nice. I actually try not to go below Q5, but you can only do so much with 16GB of VRAM and DDR4 system RAM.
But I must say I’m pretty impressed by Qwen3.6-35b, not only by its capabilities but also by its hardware requirements. MoE for the win I guess.
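To see why 16GB gets tight, here is a back-of-the-envelope sketch. The bits-per-weight figures are approximate averages for common llama.cpp quants (real GGUF files vary by model), and this only counts weights, not KV cache or runtime overhead.

```python
from typing import Dict

# Approximate average bits per weight; treat these as ballpark assumptions.
APPROX_BITS_PER_WEIGHT: Dict[str, float] = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}


def quant_sizes_gb(params_billion: float) -> Dict[str, float]:
    """Estimated weight size in GB for each quant level."""
    return {q: params_billion * bpw / 8 for q, bpw in APPROX_BITS_PER_WEIGHT.items()}


def spill_gb(size_gb: float, vram_gb: float) -> float:
    """How many GB of weights would not fit in VRAM and land in system RAM."""
    return max(0.0, size_gb - vram_gb)


for quant, size in quant_sizes_gb(35).items():
    print(f"{quant}: ~{size:.1f} GB, ~{spill_gb(size, 16):.1f} GB spills past 16 GB VRAM")
```

By this estimate a 35B model is already past 16GB at Q4-class quants, so part of it always sits in system RAM. That is where an MoE helps: only a few billion parameters are active per token, so the slower RAM hurts less than it would for a dense model.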
RWKV sounds interesting, have to look into it, thanks!
I started running LLMs a couple months ago on my own hardware. I have a Framework Desktop that I ordered last year and also recently picked up a refurbished 24GB AMD RX 7900 XTX which I'm doing some performance testing against. The dGPU is much better for dense models, and slightly faster for MoE if I'm willing to run them at a lower quant -- but uses more power and has annoying coil whine. The Framework Desktop uses ~100W under load, is quieter, and for the MoE models already runs them fast enough for most of my needs -- so most of my LLM use happens on that system still.
For software: I'm using ollama on the Framework currently, but I want to replace it with just using llama.cpp directly eventually. I've been using llama-cli for testing the dGPU. I wrote my own chat client to interact with ollama as well as a few other programs for specific tasks.
I've been using the LLMs for a mix of research (both personal and professional), entertainment, practical coding tasks (mostly debugging and brainstorming, plus a bit of UI prototyping, automatic generation of sequence diagrams for documentation, and light scripting), as well as automation of tedious tasks.
As an example of the latter, people often send me requests to prepare data sets by email but don't specify the sources they want precisely so I have to go match the name against the real name in our archives; LLMs are great for mapping the imperfect name -- with typos, missing prefixes, incorrect addition of spaces, addition/removal of hyphens, etc. -- to the exact name I actually need to pull the data off disk when given a lookup table to compare against.
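For comparison, a non-LLM baseline for the same lookup can be sketched with the standard library. To be clear, this is a swapped-in technique (plain fuzzy string matching with difflib), not the LLM workflow described above, and the dataset names are made up. It handles spaces, hyphens, case, and small typos, but it won't reason about context the way an LLM can.

```python
import difflib
import re
from typing import List, Optional


def normalize(name: str) -> str:
    """Drop spaces, hyphens, and underscores, and lowercase the rest."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def match_name(requested: str, lookup: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the exact archive name closest to the requested one, or None."""
    by_norm = {normalize(n): n for n in lookup}
    hits = difflib.get_close_matches(normalize(requested), list(by_norm),
                                     n=1, cutoff=cutoff)
    return by_norm[hits[0]] if hits else None


# Hypothetical lookup table of exact on-disk names.
LOOKUP = ["NOAA-GFS-Surface-Temp", "NOAA-GFS-Wind-10m", "ERA5-Precip-Daily"]

print(match_name("gfs surface tmp", LOOKUP))  # missing prefix, a space, a typo
print(match_name("zzzz", LOOKUP))             # nothing close enough
```

A reasonable split might be to use something like this for the easy cases and only hand the ones that return None to the LLM, along with the lookup table.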
As far as models go, I'm mostly using various Qwen 3.6 and Gemma4 variants. I have multiple versions of each for different purposes. llmfan46's uncensored Qwen 3.6 35B-A3B @ Q6_K (from Hugging Face) is my default model currently.
Bought a B70 with an eGPU enclosure and USB4 connection. I wasn't really planning to actually run anything, but I've now ended up with llama.cpp and OpenWebUI. Kids/parents want to (or have to) use chat, so I might as well provide a local solution rather than have them use industry options. I also started with Ollama and Gemma 4 26b a4b and asked it to write a script to set up llama.cpp in a container.
Yeah, I've heard the B70 is good bang for buck. My kids love using ChatGPT to generate images and I'm aware that there are some really capable local models that can do that as well now - B70 should make short work of it.
That may be something for me to look at later on if I decide to keep self hosting.
OTOH, I'm also aware that I may end up building something that they don't actually use. Been there, done that, and I don't want to do it again.
Actually, on that topic, one interesting use case for me is my youngest one wants to have a YouTube channel.
So obviously, I'm not going to let her become a YouTuber, but what I'm thinking of doing is providing her my old phone (properly locked down) so that she can video record clips of what she wants.
Then - have those clips sent automatically to our jellyfin server so it appears like a channel. Code a fake YT plugin so that AI can do likes, positive comments etc.
It's... work. I dunno...maybe a good enough AI can vibe code the entire project for me.
I've been testing coding capabilities a bit (mostly scripts - so that work done by ai is reproducible).
As for image gen + chat: as long as chat model + context + image model fits, u should be fine.
Yes. Currently using Gemma4:12b behind OpenWebUI and Hermes Agent plus a few lighter models for OCR and tagging in Paperless.
I use my gaming rig to serve up qwen3.6-coder to Open Web UI and that's been very successful in helping me refactor my home lab to be more efficient and easier to support. Over the years of building my server I got everything working, but let's just say it's a bit of a mess and a lot of shortcuts were taken.
I plan to look into ComfyUI soon but I don't have much of a use case for it at the moment.
Myself - I've self hosted LLMs before, but with only 4-8GB vram (depending which card is in place), I can't run the good stuff at acceptable enough speeds.
(Don't @ me - I know all the tricks with turbo quants, spec decoding, MoE etc. 192GB/s is 192GB/s)
I do use Handy (STT) which is amazing (my fingers are arthritic and typing hurts after a while).
My personal use case for LLM is quite simple - a trumped up super google and / or self reflection / journalling / sounding board. Despite being glib about it, that's actually very useful to me.
Work wise, I use the big winking orange asshole (Claude) when I have to. I have moral tension with it, so am seriously looking at other options. I hear good things about GLM 5.2, but if I can't run Qwen 35B at any kind of decent speed, well....self hosted GLM is a pipe dream.
The short answer is no.
I have played around with ollama and whisper. It's just too slow to be practical. The cost of the hardware is prohibitive.
That said, I do selfhost openwebui and use inference end points from huggingface and ovh.
I've never used chatgpt or claude and I have to wonder whether those alternatives are really as terrible as the models available on huggingface. The output is always super plausible but usually just plain wrong.
They're not. Call them via API on Open Router and see for yourself.
There's a reason OAI and Anthropic are considered best in class and it's not just hype.
Well, I don't exactly host AI. But some of my software uses AI and/or machine learning. My photo gallery does face detection, I've installed text to speech and speech to text. My Home Assistant has a voice satellite (which is a poor-man's Alexa because I lack the hardware to do voice recognition in realtime). And I also regularly try some large language models and chatbots. But I don't have any real application (yet). And it's slow without a proper GPU. So I'm more or less just messing around. Currently that's with Ministral 3.
I've played with it for Home Assistant integration, but I just don't have much interest in it. The whole thing is too inefficient at the moment, and the tiny models that can run in a few gigs of system RAM on an iGPU or NPU aren't good enough in quality or speed to rely on.
Hopefully some future generation micro-models will be more useful for the way I want to use it (i.e. ultra light, no dedicated hardware etc.), but for now it's a lot of compute resources, plus heat and energy for a gimmick.
Agreed. It will be ironic if 1.58-bit models (Microsoft's BitNet) turn out to be the great white hope.
I looked at the recent Steam stats (which is a GPU sample of convenience); the most common GPU size was 6GB. Meanwhile you probably need what...64GB unified memory or a 5090 to drive a decent model at a decent speed/context?
There's a real gap between the haves and the have nots and it's widening.
I installed LM Studio just for fun on a 6800XT. But it was even less useful than the web-based ones.
I’ve got ollama set up with whisper and piper and a HA voice PE, but I honestly haven’t gotten around to configuring much yet. Most notable thing was being able to use the wake word to start a timer, but it was pickier than old Siri about the precise wording.
I've fiddled around with a few models on ollama and opencode, but more for the sake of seeing what I can run, as I've yet to really find a use for it in my home usage.
I've tried just about all of the small models. Tried NanoClaw. I just don't have the equipment necessary to pull that off and make it a worthwhile, in-house tool rather than an in-house oddity. I really, really want to tho. So much so that I have been looking at what it would take to accomplish that, which seems to be in the $4k to $5k USD range. The sweet spot for GPUs seems to be at the 32 GB level. It is pricey, but hell, at my age, I figure wtf....I should treat myself. What's wrong with that? If I do pull the trigger, I want it to be an LTS type computer like the one I built 15 years ago, which is still running like a champ today tho it's probably worth less than a quarter of what I had invested. So, I'd probably overstock it to the max.
Don't waste 5k on an AI computer. If you want a new one, buy for 2k at most. AI will get optimized more and more. Now we have MoE, with which we can run things at home we couldn't even dream about. The companies lose money fast, there will be massive optimization sooner or later.
Perhaps, but I probably won't be around for that massive optimization. LOL
You can get a P40 for much less than that, if your case can hold a full-height card. It's an old card but it's 24GB, about 350GB/s.
Else yeah...$3-4,000 is about table stakes, which doesn't amortise for just AI (not for my use cases anyway). I'd love a Strix but Santa is stingy.
Me - I have a fetish for tiny, low power computers. 1L lenovos, raspberry pis etc. That limits what I can run but with constraint comes ingenuity. So I'm making an expert system for myself.
https://codeberg.org/BobbyLLM/picoGURU
It's not cooked yet (this is actually the first time I'm sharing it in public; it's not in installable state and the repo is new) but once it's done, I can have an always on local brain in a 2W envelope that runs fast. Might even port it to C64...I need an excuse to purchase the new Commodore ultimate.
I was thinking something along the lines of:
Which, with all the other accoutrements like water cooling, etc, will put me right at the $4k mark.
I'll check it out.
Nice bit of kit that. Very nice. Planning on serious AI shenanigans?
Cool. It's not ready any time soon but when it is, I'll announce it and make sure it's callable via SSH / terminal / OpenAI style chat end point.
That way you don't need anything fancier than a nice terminal to call it.
Absolutely. Like I said, if I'm going to do it, I want to do it up right. I don't want to come back in 5+ minutes for a result. LOL
Damn bro, you're treating yourself for sure! The RAM alone is what, 2k? 🥴
Well, I haven't had any new equipment in 15 years. I always buy used or refurb'd. I'm getting old. I figure, I've worked hard enough, might as well enjoy the fruits of my labor.
My server is way too weak for that unfortunately. I run some LLMs on my laptop with ollama but it's not particularly effective. I use it to run dolphin series models when I need an uncensored LLM. I have tried running some of the coding models but they just aren't smart enough on my level of compute for any useful work, so I've ended up just paying API prices on OpenRouter.