The Local Agent Stack Architecture
Building a local coding agent requires two distinct components: the inference engine (the "brain") and the agent harness (the "operating environment"). The harness provides the agent with the ability to read files, execute shell commands, and verify code changes. Moving to a local stack offers significant advantages in privacy, predictable costs, and offline capability, though it requires the user to manage hardware resources and security auditing.
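To make the split between the two components concrete, here is a minimal sketch of the harness side only. It assumes a harness that exposes two tools confined to one workspace directory. The class name `Harness` and the tool names `read_file` and `run_command` are hypothetical, chosen for illustration; real harnesses define their own tool sets and add permission prompts, edit tools, and logging.

```python
import subprocess
import sys
from pathlib import Path


class Harness:
    """Toy agent harness: exposes a few tools confined to one workspace directory."""

    def __init__(self, workspace: str) -> None:
        self.workspace = Path(workspace).resolve()
        # Tool names are hypothetical; real harnesses define their own.
        self.tools = {
            "read_file": self.read_file,
            "run_command": self.run_command,
        }

    def _resolve(self, relative_path: str) -> Path:
        # Reject paths that escape the workspace (e.g. "../secrets").
        path = (self.workspace / relative_path).resolve()
        if not path.is_relative_to(self.workspace):
            raise PermissionError(f"{relative_path} is outside the workspace")
        return path

    def read_file(self, path: str) -> str:
        return self._resolve(path).read_text(encoding="utf-8")

    def run_command(self, argv: list[str], timeout: float = 30.0) -> dict:
        # Runs without a shell, inside the workspace, and captures the output
        # so the model can check the effect of its changes.
        result = subprocess.run(
            argv, cwd=self.workspace, capture_output=True, text=True, timeout=timeout
        )
        return {
            "exit_code": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,
        }

    def dispatch(self, tool_name: str, **arguments):
        # The inference engine proposes a tool call; the harness executes it.
        if tool_name not in self.tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self.tools[tool_name](**arguments)
```

In this sketch the model never touches the file system directly: it only names a tool and its arguments, and `dispatch` decides whether and how the call runs. For example, `Harness("/path/to/project").dispatch("run_command", argv=[sys.executable, "-m", "pytest"])` would run a test suite inside that workspace.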
Model Selection and Inference
For local coding tasks, performance depends heavily on how well the model and the harness work together. The author recommends pairing a model with the harness it was optimized for, such as Qwen3.6 35B-A3B with the Qwen-Code harness.
- Inference Engine: Ollama is recommended as a plug-and-play solution for serving models across macOS, Linux, and Windows.
- Hardware Optimization: On Apple Silicon, prioritize `*-mlx` model variants to leverage Metal performance shaders. On Linux, standard versions are preferred.
- Benchmarking: Before deploying an agent, perform a speed and memory assessment. Aim for a generation speed of at least 20-30 tokens/sec, which is comparable to high-reasoning proprietary models. Use scripts to monitor RAM usage during long-context tasks (up to 50k tokens), as local agents often consume 20-30 GB of RAM.
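The speed check described above can be sketched as follows, assuming an Ollama server running on its default local port. Ollama's generate endpoint reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds) in a non-streaming response, and the script derives tokens/sec from those two fields. The function names and the 20 tokens/sec default are illustrative choices, not part of any tool.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's timing fields (durations are in nanoseconds)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds if seconds > 0 else 0.0


def meets_speed_target(tps: float, minimum: float = 20.0) -> bool:
    # 20 tokens/sec is the lower bound of the target range named above.
    return tps >= minimum


def benchmark(model: str, prompt: str) -> float:
    """Send one non-streaming request to a running Ollama server; return tokens/sec."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=600) as reply:
        return tokens_per_second(json.load(reply))
```

Calling `benchmark("<your-model-tag>", "Write a binary search in Python.")` a few times with prompts of increasing length gives a rough picture of how speed changes as context grows. This sketch does not measure memory; running `ollama ps` while the model is loaded is one way to see how much it occupies.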
Security and Reliability Auditing
Because coding agents have the authority to read and modify local files, they present a higher security risk than standard chatbots. Before running any open-source agent harness, perform a rigorous audit:
- Codebase Review: Use a trusted proprietary coding assistant (such as Claude Code or Codex) to scan the agent's source code for data egress points, hardcoded API keys, or insecure file permission defaults.
- Blast Radius Limitation: Run experimental agents in isolated environments, such as a dedicated virtual machine, a separate user account, or a containerized environment.
- Tool-Use Evaluation: Run a custom reasoning benchmark to test the model's ability to select the correct tools (e.g., `read_file` vs. `edit_file`) and handle errors. Models that fail to select the correct tool or hallucinate actions should be restricted to narrow, heavily constrained tasks.
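One way to score the tool-use evaluation is to separate its two failure modes: picking a real but wrong tool, and inventing a tool that does not exist. The sketch below assumes you have already recorded, for each test prompt, the tool you expected and the tool the model chose. The tool set in `KNOWN_TOOLS` is hypothetical, so substitute the tools your harness actually exposes.

```python
from collections import Counter

# Hypothetical tool set; substitute the tools your harness actually exposes.
KNOWN_TOOLS = {"read_file", "edit_file", "run_command"}


def classify_call(expected_tool: str, chosen_tool: str) -> str:
    """Label one model decision as correct, wrong_tool, or hallucinated."""
    if chosen_tool not in KNOWN_TOOLS:
        return "hallucinated"  # the model invented a tool that does not exist
    if chosen_tool != expected_tool:
        return "wrong_tool"
    return "correct"


def summarize(results: list[tuple[str, str]]) -> Counter:
    """Count outcomes over (expected_tool, chosen_tool) pairs."""
    return Counter(classify_call(expected, chosen) for expected, chosen in results)
```

Keeping the two failure modes apart is a design choice: a model that often picks `read_file` when `edit_file` was needed may improve with better tool descriptions, while frequent hallucinated tools suggest the model is a poor fit for the harness.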
Performance Assessment
Don't rely solely on static leaderboard benchmarks. Instead, build a personal evaluation suite that mirrors your actual workflow. A model that scores well on general coding benchmarks may still fail at agentic judgment, that is, the ability to decide which file to edit first or when to ask for clarification. If a model cannot pass a majority of your custom tool-use tasks, it is likely not ready for autonomous operation.
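A personal evaluation suite of this kind can be very small. The sketch below assumes each task carries its own pass/fail check and that `run_agent` is a function you supply that sends a prompt to your local agent and returns its final output. `Task`, `pass_rate`, and `ready_for_autonomy` are hypothetical names, and "a majority" is read here as strictly more than half.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    prompt: str
    passed: Callable[[str], bool]  # judges the agent's output for this task


def pass_rate(tasks: list[Task], run_agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose output satisfies the task's own check."""
    if not tasks:
        return 0.0
    wins = sum(1 for task in tasks if task.passed(run_agent(task.prompt)))
    return wins / len(tasks)


def ready_for_autonomy(rate: float) -> bool:
    # "A majority" is read here as strictly more than half of the tasks.
    return rate > 0.5
```

Tasks drawn from your own recent work, such as "rename this function across the repository" or "find why this test fails", tend to say more about day-to-day usefulness than generic puzzle prompts.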