---
layout: ../layouts/Base.astro
title: Projects — Jesse Anderson
description: Personal AI tooling, the rig, OpenClaw, the operating doctrine. Peer-leaning detail.
slug: projects
---

<main class="projects">

<PageHeader eyebrow="Projects" title="The rig, the tools, the doctrine.">
Higher detail than [/work](/work). The personal AI tooling layer — local LLM infra, the Codex CLI fork, multi-agent routing, the OpenClaw bridges, and the operating doctrine. Aimed at peers who care how the sausage is made.
</PageHeader>

<Project eyebrow="codex-local" title="What 9B can actually do." featured={true}>

The reason I'm bullish on local models: **codex-local** — my fork of the Codex CLI, tuned to babysit a local model through a structured task list — drives multi-turn, multi-file, multi-task coding sessions to completion using just **Qwen3.5-9B-IQ4_XS**.

<ProjectStats stats={[
    ['9B', 'parameters (Qwen3.5 IQ4_XS)'],
    ['8+', 'turns of sustained coherence'],
    ['4+', 'files edited per session'],
    ['Full', 'TODO list completion end-to-end']
]} />

The trick isn't the model — it's the scaffolding. Specifically:

- **Docs-first.** Every repo has an `AGENTS.md` and a `TASK_STATE.md`. The model reads them on every turn. Persistent intent survives compaction.
- **State in files, not in-context.** Long runs survive crashes, context exhaustion, model swaps, OAuth expirations. The agent rehydrates from disk.
- **Explicit task tracking.** TODO list in a file, model crosses items off as it goes. No "where was I" questions.
- **Verify before declare-done.** No green-by-inference. Run the test, read the output, look at the deployed version.

Most local-model coding agents stall past trivial single-edit tasks. With those four disciplines, 9B handles real work. The frontier model is no longer the moat — the scaffolding is.
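
Concretely, a minimal TypeScript sketch of the state-in-files loop, assuming `TASK_STATE.md` keeps a markdown checklist; the `TaskState` shape and the parsing are illustrative, not codex-local's actual internals:

```ts
// Sketch of the "state in files" discipline: on every turn the agent
// rehydrates intent from disk instead of trusting its context window.
// File names (AGENTS.md, TASK_STATE.md) match the pattern above; the
// checklist format and TaskState shape are assumptions for illustration.
import { readFileSync, writeFileSync } from "node:fs";

interface TaskState {
  doctrine: string; // AGENTS.md, read verbatim every turn
  todos: { text: string; done: boolean }[];
}

function rehydrate(repo: string): TaskState {
  const doctrine = readFileSync(`${repo}/AGENTS.md`, "utf8");
  // Assumed checklist format: "- [ ] item" / "- [x] item"
  const todos = readFileSync(`${repo}/TASK_STATE.md`, "utf8")
    .split("\n")
    .filter((l) => l.startsWith("- ["))
    .map((l) => ({ done: l.startsWith("- [x]"), text: l.slice(6).trim() }));
  return { doctrine, todos };
}

function markDone(repo: string, state: TaskState, text: string): void {
  // Crossing items off on disk is what lets a run survive a crash,
  // a compaction, or a model swap: the next turn rehydrates and continues.
  const todo = state.todos.find((t) => t.text === text);
  if (todo) todo.done = true;
  const body = state.todos
    .map((t) => `- [${t.done ? "x" : " "}] ${t.text}`)
    .join("\n");
  writeFileSync(`${repo}/TASK_STATE.md`, body + "\n");
}
```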

The fork itself adjusts compaction behavior, intercepts model routing, and exposes exec-log inspection so a long unattended run can survive overnight without losing the thread.

</Project>

<Project eyebrow="The rig" title="Dual-GPU local-first AI infra.">

- **GPUs.** GTX 1080 (8GB) + RTX 3080. Two hosts, services routed by endpoint.
- **Runtime.** Ollama, tuned: `OLLAMA_FLASH_ATTENTION=1`, `OLLAMA_KV_CACHE_TYPE=q8_0`. Quantizing the KV cache from f16 to q8_0 halves its footprint, so roughly 2× the context fits in the same VRAM, at the cost of a few percent of output quality.
- **Model picks against VRAM budgets:**
  - **Qwen3 4B** for the chat gate — small footprint, fast first-token latency.
  - **Qwen3.5-9B-IQ4_XS** for the worker — sweet spot for code on the 1080.
  - **qwen3.5:35b-a3b** for harder reasoning when the 3080 has headroom.
- **Topology.** Chat gate on one host, Qwen service on another. Cross-host routing via explicit endpoint config — no service mesh, no magic.
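
A sketch of what "explicit endpoint config" means here. The hostnames and routing table are assumptions (the model tags mirror the picks above); the `/api/generate` call is Ollama's standard HTTP API:

```ts
// No service mesh: routing is a static lookup keyed by role.
// Host addresses and tags are illustrative, not the rig's real config.
const ROUTES = {
  gate:   { endpoint: "http://gtx1080-host:11434", model: "qwen3:4b" },
  worker: { endpoint: "http://gtx1080-host:11434", model: "qwen3.5-9b-iq4_xs" },
  reason: { endpoint: "http://rtx3080-host:11434", model: "qwen3.5:35b-a3b" },
} as const;

async function generate(role: keyof typeof ROUTES, prompt: string): Promise<string> {
  const { endpoint, model } = ROUTES[role]; // cross-host routing = a lookup
  const res = await fetch(`${endpoint}/api/generate`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`${role} route failed: ${res.status}`);
  return (await res.json()).response; // Ollama returns { response, ... }
}
```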

<ImagePlaceholder note="Dual-GPU rig — workspace photo TBD" />

</Project>

<Project eyebrow="OpenClaw bridges" title="3D scenes by voice.">

Built so my third son could direct his Roblox / Blender / Daz scenes through natural language (_"make Torus.1 50% thinner on the Z-axis"_) and watch the live environment update. An AI helper bridge runs through OpenClaw, with the language layer driving live scene state. It sits at the intersection of three threads: my AI work, my long-running 3DCG interest, and a kid who's already designing in Blender, Roblox Studio, and Suno.
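
A hedged sketch of the bridge's shape, under heavy assumptions: the command grammar, the `ScaleOp` type, and the `/scene/apply` endpoint are hypothetical stand-ins for OpenClaw's actual protocol:

```ts
// The example command from the prose, parsed into a scene operation.
interface ScaleOp {
  kind: "scale";
  object: string;      // e.g. "Torus.1"
  axis: "X" | "Y" | "Z";
  factor: number;      // 0.5 == "50% thinner"
}

// "make Torus.1 50% thinner on the Z-axis" -> scale Torus.1 by 0.5 on Z.
// The grammar here is a toy; a real language layer would use the model.
function parseCommand(text: string): ScaleOp | null {
  const m = text.match(/make (\S+) (\d+)% thinner on the ([XYZ])-axis/i);
  if (!m) return null;
  return {
    kind: "scale",
    object: m[1],
    axis: m[3].toUpperCase() as ScaleOp["axis"],
    factor: 1 - Number(m[2]) / 100,
  };
}

async function apply(bridgeUrl: string, op: ScaleOp): Promise<void> {
  // Hypothetical endpoint: the bridge forwards the op to the live session.
  await fetch(`${bridgeUrl}/scene/apply`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(op),
  });
}
```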

<ImagePlaceholder note="OpenClaw → Blender mid-edit" />

</Project>

<Project eyebrow="coding-agent-router" title="Multi-agent CLI routing.">

Docs-first multi-agent CLI router. Routes prompts across specialized workers with shared context, built on the assumption that no single model is the right choice for every step in a long task. Pairs with codex-local for the local side and with cloud providers for harder work.
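
A sketch of the routing shape, assuming a keyword classifier and hypothetical worker names; the real router's interface isn't shown here:

```ts
// One shared docs-first context, many specialized workers, per-step choice.
type Worker = "local-coder" | "cloud-reasoner" | "chat-gate";

interface SharedContext {
  agentsMd: string;    // doctrine, read from disk (docs-first)
  taskStateMd: string; // persistent task list
}

// A toy classifier; any real router would classify with a model.
function classify(prompt: string): Worker {
  if (/refactor|implement|fix|test/i.test(prompt)) return "local-coder";
  if (/design|architect|tradeoff|why/i.test(prompt)) return "cloud-reasoner";
  return "chat-gate";
}

async function route(
  prompt: string,
  ctx: SharedContext,
  workers: Record<Worker, (p: string) => Promise<string>>,
): Promise<string> {
  // Every worker sees the same preamble, whichever model it runs on.
  const preamble = `${ctx.agentsMd}\n\n${ctx.taskStateMd}\n\n`;
  return workers[classify(prompt)](preamble + prompt);
}
```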

</Project>

<Project eyebrow="K.O.R.A." title="Same scaffolding, in production.">

Everything above runs in production at Kora Labs as **K.O.R.A.** — the autonomous Discord ops agent. Multi-stage pipeline, multi-model routing with graceful fallback, three-tier test battery, set-me-loose-all-night execution mode. The personal tools and the production agent share the same operating doctrine. See [/work](/work) for the platform context.
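
Graceful fallback reduces to a small loop. A sketch, with the chain and error handling as illustrative assumptions rather than K.O.R.A.'s actual code:

```ts
type Caller = (prompt: string) => Promise<string>;

// Try models in preference order; degrade, don't die.
async function withFallback(chain: [string, Caller][], prompt: string): Promise<string> {
  let lastErr: unknown;
  for (const [name, call] of chain) {
    try {
      return await call(prompt); // first healthy model wins
    } catch (err) {
      lastErr = err;
      console.warn(`model ${name} failed, falling back`, err);
    }
  }
  throw new Error(`all models in chain failed: ${String(lastErr)}`);
}
```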

</Project>

<Project eyebrow="Operating doctrine" title="How I actually work with AI.">

- **"Mitigations, fallbacks, band-aids are the worst. Find the upstream fix."** *The* doctrine. Encoded in personal AGENTS.md, enforced across every repo.
- **Docs-first.** PRD → spec → architecture → AGENTS.md in every repo. Docs aren't notes; they're load-bearing artifacts the agent reads on every turn.
- **No green-by-inference.** Verify before declaring done. Check the deployed version, inspect the running process, read the Discord channel.
- **Persistent state in files.** TODO + TASK_STATE patterns survive crashes and compactions. Agents revive from disk.
- **AI as real infrastructure, not autocomplete.** Instrument it, test it in tiers (hermetic → real-process → live-canary), and surface its silent failures (codex OAuth heartbeat) the same way you'd surface a dead database; see the sketch after this list.
- **Long autonomous runs are normal.** *"Set me loose all night"* is a working mode, not a stunt.
- **Platform-engineering reflexes are non-negotiable.** Redundancy, failover, quick recovery, auto-scaling, reproducible deployment — with minimal overhead. Higher dollar cost beats higher human-operations cost.
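
For the heartbeat bullet above, a minimal sketch of surfacing a silent failure, assuming a hypothetical status file the codex process touches on each successful OAuth refresh; the path, threshold, and alert hook are all illustrative:

```ts
import { statSync } from "node:fs";

const HEARTBEAT_FILE = "/var/run/codex/oauth-heartbeat"; // hypothetical path
const MAX_AGE_MS = 15 * 60 * 1000; // assumed threshold: 15 minutes

function checkHeartbeat(alert: (msg: string) => void): void {
  try {
    const age = Date.now() - statSync(HEARTBEAT_FILE).mtimeMs;
    if (age > MAX_AGE_MS) {
      // Same severity as a dead database: page, don't log-and-forget.
      alert(`codex OAuth heartbeat stale for ${Math.round(age / 60000)} min`);
    }
  } catch {
    alert("codex OAuth heartbeat file missing; treat as hard outage");
  }
}

setInterval(() => checkHeartbeat(console.error), 60_000);
```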

</Project>

</main>
