Every now and then I get the itch to run an AI model locally. No API costs, no data leaving my machine, just my own hardware doing the work. I gave it another try this weekend.
The hardware
My desktop is reasonably capable:
- Ryzen 9 5950X
- 64 GB RAM
- AMD Radeon RX 9070 with 16 GB VRAM
- Debian (7.0.7-x64v3-xanmod1 kernel)
The model I picked was Qwen3.6-27B. The plan was simple: grab a GGUF quantized version from Hugging Face, point llama-server at it, hook it into opencode, and have a local coding assistant.
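In rough strokes, that whole setup is one command. A sketch of what I mean, with a hypothetical model filename and illustrative flag values rather than a recipe:

# serve a quantized GGUF with an OpenAI-compatible API on port 8080
llama-server \
  --model Qwen3.6-27B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --port 8080
# opencode then talks to http://localhost:8080/v1 like any OpenAI-style endpoint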
ROCm is still fiddly
AMD’s ROCm has improved a lot, but it’s not Nvidia. I grabbed a pre-built llama.cpp binary from lemonade-sdk/llamacpp-rocm which targets gfx120X (the RDNA4 architecture of the RX 9070). First run:
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
The GPU was visible to the system and the amdgpu kernel module was loaded, but /dev/dri/ was empty. No render nodes. Digging in, it turned out the Navi 48 firmware isn’t named navi48_*.bin like you’d expect, but gc_12_0_1_*.bin, a newer naming convention I couldn’t find documented anywhere. In the end, adding myself to the render group and rebooting fixed it.
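For anyone hitting the same wall, this is roughly the checklist I worked through; the firmware path is Debian’s, other distros may differ:

# is the module loaded, and do render nodes exist?
lsmod | grep amdgpu
ls /dev/dri/                                # should list card0 and renderD128

# did amdgpu complain about firmware at boot?
sudo dmesg | grep -i 'amdgpu.*firmware'
ls /lib/firmware/amdgpu/ | grep gc_12_0_1   # Navi 48 lives here, not navi48_*

# make the render nodes accessible to your user, then reboot
sudo usermod -aG render $USER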
16 GB of VRAM isn’t 16 GB
With ROCm working, I loaded the Q4_K_M quantization, a 15.65 GB file. My GPU has 16 GB. Should fit, right?
Not really. The system reserves about 560 MB, leaving 15.4 GB. The weights alone are 15.65 GB, and the KV cache and compute buffers come on top. Result:
cudaMalloc failed: out of memory
Segmentation fault
Dropping to Q3_K_M (around 13 GB) fits comfortably and leaves room for a decent context window. The quality difference between Q4 and Q3 at this model size is negligible.
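If you’d rather check the fit than trust my arithmetic, rocm-smi shows the budget live. The context size below is a guess at what the leftover VRAM supports, not a measured limit:

# watch VRAM usage while the model loads
watch -n1 rocm-smi --showmeminfo vram

# ~13 GB of Q3_K_M weights + ~0.56 GB system overhead leaves
# roughly 2.4 GB for the KV cache and compute buffers
llama-server -m Qwen3.6-27B-Q3_K_M.gguf -ngl 99 -c 8192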
How fast is it?
With Q3_K_M fully on the GPU, I got around 25 tokens per second. For a 27B model on consumer hardware, that’s genuinely impressive.
But Qwen3.6 is a thinking model. It generates a <think>...</think> block before every response. A moderately complex prompt easily produces 800 thinking tokens; at 25 tokens per second, that’s 32 seconds of watching text scroll before the actual answer shows up. You can disable thinking mode, but then you’re giving up most of the reasoning quality, which was the whole point of running a 27B over something smaller.
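Those numbers are easy to sanity-check against the server’s own token accounting; this assumes llama-server’s OpenAI-compatible endpoint on its default port:

# time a request and read back the token counts the server reports
time curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Explain this stack trace..."}]}' \
  | jq '.usage'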
Was it worth it?
Technically, yes. The tooling has come a long way. ROCm on RDNA4 works. Quantized models are surprisingly good.
But I write code for a living. When I’m working, I need answers in seconds, not after half a minute of watching a thinking block scroll by. While my local model is still warming up its thinking, Claude Code has already solved a complex problem, and the subscription pays for itself in the first hour of work it saves me. For the actual job, that math is hard to beat.
Local LLMs are a fun weekend project. For paying the bills, I’ll stick with the subscription.