Every now and then I get the itch to run an AI model locally. No API costs, no data leaving my machine, just my own hardware doing the work. I gave it another try this weekend.
The hardware
My desktop is reasonably capable:
- Ryzen 9 5950X
- 64 GB RAM
- AMD Radeon RX 9070 with 16 GB VRAM
- Debian (7.0.7-x64v3-xanmod1 kernel)
The model I picked was Qwen3.6-27B. The plan was simple: grab a GGUF quantized version from Hugging Face, point llama-server at it, hook it into opencode, and have a local coding assistant.
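In rough strokes, that whole setup is one command. A sketch of what I mean, with a hypothetical model filename and illustrative flag values rather than a recipe:

# serve a quantized GGUF with an OpenAI-compatible API on port 8080
llama-server \
  --model Qwen3.6-27B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --port 8080
# opencode then talks to http://localhost:8080/v1 like any OpenAI-style endpoint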
ROCm is still fiddly
AMD’s ROCm has improved a lot, but it’s not Nvidia. I grabbed a pre-built llama.cpp binary from lemonade-sdk/llamacpp-rocm which targets gfx120X (the RDNA4 architecture of the RX 9070). First run:
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
The GPU was visible to the system and the amdgpu kernel module was loaded, but /dev/dri/ was empty. No render nodes. Digging in, it turned out the Navi 48 firmware isn’t named navi48_*.bin like you’d expect, but gc_12_0_1_*.bin, a newer naming convention I couldn’t find documented anywhere. In the end, adding myself to the render group and rebooting fixed it.
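For anyone hitting the same wall, this is roughly the checklist I worked through; the firmware path is Debian’s, other distros may differ:

# is the module loaded, and do render nodes exist?
lsmod | grep amdgpu
ls /dev/dri/                                # should list card0 and renderD128

# did amdgpu complain about firmware at boot?
sudo dmesg | grep -i 'amdgpu.*firmware'
ls /lib/firmware/amdgpu/ | grep gc_12_0_1   # Navi 48 lives here, not navi48_*

# make the render nodes accessible to your user, then reboot
sudo usermod -aG render $USER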
16 GB of VRAM isn’t 16 GB
With ROCm working, I loaded the Q4_K_M quantization, a 15.65 GB file. My GPU has 16 GB. Should fit, right?
Not really. The system reserves about 560 MB, leaving 15.4 GB. The weights alone are 15.65 GB, and the KV cache and compute buffers come on top. Result:
cudaMalloc failed: out of memory
Segmentation fault
Dropping to Q3_K_M (around 13 GB) fits comfortably and leaves room for a decent context window. The quality difference between Q4 and Q3 at this model size is negligible.
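If you’d rather check the fit than trust my arithmetic, rocm-smi shows the budget live. The context size below is a guess at what the leftover VRAM supports, not a measured limit:

# watch VRAM usage while the model loads
watch -n1 rocm-smi --showmeminfo vram

# ~13 GB of Q3_K_M weights + ~0.56 GB system overhead leaves
# roughly 2.4 GB for the KV cache and compute buffers
llama-server -m Qwen3.6-27B-Q3_K_M.gguf -ngl 99 -c 8192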
How fast is it?
With Q3_K_M fully on the GPU, I got around 25 tokens per second. For a 27B model on consumer hardware, that’s genuinely impressive.
But Qwen3.6 is a thinking model. It generates a <think>...</think> block before every response. A moderately complex prompt easily produces 800 thinking tokens; at 25 tokens per second, that’s 32 seconds of watching text scroll before the actual answer shows up. You can disable thinking mode, but then you’re giving up most of the reasoning quality, which was the whole point of running a 27B over something smaller.
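Those numbers are easy to sanity-check against the server’s own token accounting; this assumes llama-server’s OpenAI-compatible endpoint on its default port:

# time a request and read back the token counts the server reports
time curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Explain this stack trace..."}]}' \
  | jq '.usage'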
Was it worth it?
Technically, yes. The tooling has come a long way. ROCm on RDNA4 works. Quantized models are surprisingly good.
But I write code for a living. When I’m working, I need answers in seconds, not after half a minute of watching a thinking block scroll by. While my local model is still warming up its thinking, Claude Code has already solved a complex problem, and the subscription pays for itself in the first hour of work it saves me. For the actual job, that math is hard to beat.
Local LLMs are a fun weekend project. For paying the bills, I’ll stick with the subscription.