Context / Harness
published: 2026-05-09
updated: 2026-05-09
Sequence planning shell for context engineering, harnesses, memory, and agent control loops.
Sequence: Context / Harness
Main Ideas And Sequence Order
Blank for collaborative planning.
References
- Ryan Lopopolo: Harness engineering — OpenAI essay framing harnesses as the environment, tools, feedback, and scaffolding that let agents do useful work.
- Ryan Lopopolo: The MLD framework — Thread introducing the Mistakes / Learnings / Desires artifact loop for agent work.
- Ryan Lopopolo: MISTAKES.md / DESIRES.md / LEARNINGS.md — Earlier grounding for the MLD pattern as durable agent memory and feedback.
- Ryan Lopopolo: alternative harnesses versus post-training — Argues that many improvements come from changing the work environment rather than only changing model weights.
- Ryan Lopopolo: long-term coherence of agent-produced artifacts — Raises the problem of preserving coherence across long-running agent work.
- OpenAI Codex: Follow a goal — Official Codex guide for durable objectives, stopping conditions, and long-running task loops.
- Patrick Toulme: OpenCodeMAX — Open-source harness artifact with persistent goals, side quests, subagents, task boards, and operator visibility.
- Patrick Toulme: OpenCodeMAX announcement — Launch thread describing the OpenCodeMAX harness primitives.
- Thomas Ricouard on Codex /goal — Practitioner reaction to goal loops as a coding-agent interaction primitive.
- JQ Lee on Ouroboros and Codex goal — Connects goal-oriented agent loops to Ouroboros-style persistent coding agents.
- Ouroboros — Repository for a persistent coding-agent loop.
- Georgios Konstantopoulos on harness defaults — Notes which defaults and affordances make coding agents more productive.
- Viv on agent + harness engineering — Thread tying agent quality to harness and operating environment design. Core claim: you can outperform any default harness+model on a given task by engineering around it using the same model.
- Viv Trivedy: The Anatomy of an Agent Harness — Defines harness components by working backwards from desired agent behavior: system prompts, tools/MCPs, bundled infrastructure, orchestration logic, hooks/middleware. Key sub-concepts: Ralph Loops (reinject original prompt in clean context to force continuation), Progressive Disclosure for skills (load tool info selectively to prevent context rot), Self-Verification Loops (agent runs its own tests and inspects logs). “Harnesses today are largely delivery mechanisms for good context engineering.” Notes model-harness co-evolution risk: post-training with a specific harness in the loop can overfit the model to it.
- Viv Trivedy: Improving Deep Agents with Harness Engineering — Documents LangChain moving from 52.8% to 66.5% on Terminal Bench 2.0 (top 30 → top 5) by changing only the harness around GPT-5.2-Codex. Three levers: system prompt (plan/build/verify/fix workflow), middleware (PreCompletionChecklistMiddleware, LocalContextMiddleware, LoopDetectionMiddleware), and adaptive reasoning budget (“reasoning sandwich”: xhigh-high-xhigh for planning and verification). Key finding: models don’t naturally enter build-test-verify cycles, so the harness must force them in. Built a Trace Analyzer Skill to automate error analysis across runs. “The goal of a harness is to mold the inherently spiky intelligence of a model for tasks we care about.”
- Viv Trivedy: The Claude Code SDK and the Birth of HaaS — Argues the industry is shifting from LLM APIs (chat endpoints) to Harness-as-a-Service: pre-built runtime environments covering conversation management, tool invocation, permissions, and error handling. Progression: client.chat.completions.create() → client.responses.create() → agent.query(). Claims harnesses commodify agent infra so teams iterate on prompts/tools/context rather than rebuilding infrastructure. Predicts most user-facing AI products adopt existing harnesses within six months. Sees open-source harnesses as the key growth surface (“App Store for Agents”).
- Marius Hobbhahn: coding agents should be treated as untrusted by default — Safety-oriented framing for tool-using coding agents and review boundaries.
- OpenAI: Auto-review of agent actions without synchronous human oversight — Trebacz et al. (April 2026): a dedicated reviewing agent evaluates sandbox-boundary-crossing actions instead of requiring human sign-off, reducing approval interruptions ~200x. Key finding: rejection friction causes users to route around safety (full access mode, permissive rules, rubber-stamping), so the review agent provides reasoning that lets Codex find safer alternatives >50% of the time. Safety recall: 90.3% on agent overreach, 99.3% on prompt injection, 96.1% on MonitoringBench. The authors note that it cannot defend against model scheming and recommend CoT inspection as a complement.
- Tibo on Codex Auto-Review — Example of harnessing models for review loops rather than direct autonomous editing.
- Moraine trace database — Tooling reference for storing and inspecting traces from agent runs.
- Teammind — Repository exploring shared team memory for agents.
- Stash — Repository for persistent agent state/context.
- OpenClaw — Open-source agent framework reference.
- OpenAI Agents SDK update — Official OpenAI update on agent-building primitives.
- Claude Managed Agents / dreaming announcement — Anthropic launch thread for managed agents and background work.
- Danny Cosson on memory + skills + evals — Thread connecting agent memory, skills, and evaluation.
- Chroma Context-1 — Research post on training a self-editing search agent.
- Remember, Refine, Retrieve — Applied Compute post separating memory, refinement, and retrieval as distinct agent subsystems.
- Useful Memories Become Faulty When Continuously Updated by LLMs — Research artifact arguing that repeatedly consolidated textual agent memory can drift, overgeneralize, and perform worse than episodic/raw-rollout baselines.
- Electric Agents — Process-control framing for agents as long-lived systems.
- Electric signal verb RFC — Concrete proposal for interrupt/control semantics in agent processes.
- Kyle Mathews on Electric Agents signals — Thread explaining why agents need signal-like control surfaces.
- Contextual Agentic Memory is a Memo, Not True Memory — Paper arguing that common context-memory schemes should be understood as memoization rather than human-like memory.
- OBLIQ-Bench / harder search queries thread — Benchmark thread focused on difficult search and retrieval tasks.
- Coding agents are not compilers — Essay against treating coding agents as simple spec-to-code compilers.
- Coding agents cost accounting — Thread on the economics and practical cost structure of coding-agent workflows.
- React Doctor v2 — Example of agentic tooling aimed at diagnosing codebase problems.
- fsilavong/agent-eval — Repository for evaluating agents.
- intertwine/dspy-agent-skills — Repository exploring DSPy-style agent skills.
- DeepSeek-TUI — Terminal UI reference for model/agent interaction.
- openprose/prose — Open-source writing/code agent interface reference.
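
Planning aid: the "Ralph Loops" and "Self-Verification Loops" ideas summarized in the Anatomy of an Agent Harness entry above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from any referenced harness. All names here (`ralph_loop`, `toy_agent`, `toy_verify`, `Workspace`) are hypothetical, and the "agent" is a stand-in function that does one unit of work per call and then stops early. The assumption is that durable state lives in a workspace (files, in a real harness) while each iteration starts from the original prompt in a clean context.

```python
from typing import Callable, Dict

# Hypothetical stand-in for durable state (a repo or scratch directory in practice).
Workspace = Dict[str, str]

STEPS = ["plan", "build", "test"]


def ralph_loop(
    prompt: str,
    run_agent: Callable[[str, Workspace], None],
    verify: Callable[[Workspace], bool],
    workspace: Workspace,
    max_iterations: int = 10,
) -> int:
    """Reinject the original prompt until verification passes.

    Returns the number of iterations used. Raises RuntimeError if the
    iteration budget runs out before the verifier accepts the work.
    """
    for iteration in range(1, max_iterations + 1):
        # Clean context: the agent sees only the original prompt and the
        # workspace, never a transcript of earlier iterations.
        run_agent(prompt, workspace)
        # Self-verification: check the work itself, not the agent's claim.
        if verify(workspace):
            return iteration
    raise RuntimeError(f"not verified after {max_iterations} iterations")


def toy_agent(prompt: str, workspace: Workspace) -> None:
    """Toy agent that completes one missing step per call, then quits early."""
    for step in STEPS:
        if step not in workspace:
            workspace[step] = f"{step} done for: {prompt}"
            return


def toy_verify(workspace: Workspace) -> bool:
    """Toy verifier: every step must be recorded in the workspace."""
    return all(step in workspace for step in STEPS)


if __name__ == "__main__":
    ws: Workspace = {}
    used = ralph_loop("add a health check endpoint", toy_agent, toy_verify, ws)
    print(used, sorted(ws))
```

The design choice worth discussing in the sequence is where progress lives: in this sketch it lives only in the workspace, so a fresh context loses nothing, and the stopping condition is an external check instead of the agent declaring itself done.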