Context / Harness

published: 2026-05-09

updated: 2026-05-09

Sequence planning shell for context engineering, harnesses, memory, and agent control loops.

Sequence: Context / Harness

Main Ideas And Sequence Order

Blank for collaborative planning.

References

Ryan Lopopolo: Harness engineering — OpenAI essay framing harnesses as the environment, tools, feedback, and scaffolding that let agents do useful work.
Ryan Lopopolo: The MLD framework — Thread introducing the Mistakes / Learnings / Desires artifact loop for agent work.
Ryan Lopopolo: MISTAKES.md / DESIRES.md / LEARNINGS.md — Earlier grounding for the MLD pattern as durable agent memory and feedback.
Ryan Lopopolo: alternative harnesses versus post-training — Argues that many improvements come from changing the work environment rather than only changing model weights.
Ryan Lopopolo: long-term coherence of agent-produced artifacts — Raises the problem of preserving coherence across long-running agent work.
OpenAI Codex: Follow a goal — Official Codex guide for durable objectives, stopping conditions, and long-running task loops.
Patrick Toulme: OpenCodeMAX — Open-source harness artifact with persistent goals, side quests, subagents, task boards, and operator visibility.
Patrick Toulme: OpenCodeMAX announcement — Launch thread describing the OpenCodeMAX harness primitives.
Thomas Ricouard on Codex /goal — Practitioner reaction to goal loops as a coding-agent interaction primitive.
JQ Lee on Ouroboros and Codex goal — Connects goal-oriented agent loops to Ouroboros-style persistent coding agents.
Ouroboros — Repository for a persistent coding-agent loop.
Georgios Konstantopoulos on harness defaults — Notes which defaults and affordances make coding agents more productive.
Viv on agent + harness engineering — Thread tying agent quality to harness and operating environment design. Core claim: you can outperform any default harness+model on a given task by engineering around it using the same model.
Viv Trivedy: The Anatomy of an Agent Harness — Defines harness components by working backwards from desired agent behavior: system prompts, tools/MCPs, bundled infrastructure, orchestration logic, hooks/middleware. Key sub-concepts: Ralph Loops (reinject original prompt in clean context to force continuation), Progressive Disclosure for skills (load tool info selectively to prevent context rot), Self-Verification Loops (agent runs its own tests and inspects logs). “Harnesses today are largely delivery mechanisms for good context engineering.” Notes model-harness co-evolution risk: post-training with a specific harness in the loop can overfit the model to it.
Viv Trivedy: Improving Deep Agents with Harness Engineering — Documents LangChain moving from 52.8% to 66.5% on Terminal Bench 2.0 (top 30 → top 5) by changing only the harness around GPT-5.2-Codex. Three levers: system prompt (plan/build/verify/fix workflow), middleware (PreCompletionChecklistMiddleware, LocalContextMiddleware, LoopDetectionMiddleware), and adaptive reasoning budget (a “reasoning sandwich” of xhigh-high-xhigh, with the highest budget reserved for planning and verification). Key finding: models do not naturally enter build-test-verify cycles, so the harness must force them in. The team also built a Trace Analyzer Skill to automate error analysis across runs. “The goal of a harness is to mold the inherently spiky intelligence of a model for tasks we care about.”
Viv Trivedy: The Claude Code SDK and the Birth of HaaS — Argues the industry is shifting from LLM APIs (chat endpoints) to Harness-as-a-Service: pre-built runtime environments covering conversation management, tool invocation, permissions, and error handling. Progression: client.chat.completions.create() → client.responses.create() → agent.query(). Claims harnesses commodify agent infra so teams iterate on prompts/tools/context rather than rebuilding infrastructure. Predicts most user-facing AI products adopt existing harnesses within six months. Sees open-source harnesses as the key growth surface (“App Store for Agents”).
Marius Hobbhahn: coding agents should be treated as untrusted by default — Safety-oriented framing for tool-using coding agents and review boundaries.
OpenAI: Auto-review of agent actions without synchronous human oversight — Trebacz et al. (April 2026): a dedicated reviewing agent evaluates sandbox-boundary-crossing actions instead of requiring human sign-off, reducing approval interruptions by roughly 200x. Key finding: rejection friction causes users to route around safety (full access mode, permissive rules, rubber-stamping), so the review agent provides reasoning that lets Codex find safer alternatives more than 50% of the time. Reported safety recall: 90.3% on agent overreach, 99.3% on prompt injection, and 96.1% on MonitoringBench. The authors note that it cannot defend against model scheming and recommend chain-of-thought (CoT) inspection as a complement.
Tibo on Codex Auto-Review — Example of harnessing models for review loops rather than direct autonomous editing.
Moraine trace database — Tooling reference for storing and inspecting traces from agent runs.
Teammind — Repository exploring shared team memory for agents.
Stash — Repository for persistent agent state/context.
OpenClaw — Open-source agent framework reference.
OpenAI Agents SDK update — Official OpenAI update on agent-building primitives.
Claude Managed Agents / dreaming announcement — Anthropic launch thread for managed agents and background work.
Danny Cosson on memory + skills + evals — Thread connecting agent memory, skills, and evaluation.
Chroma Context-1 — Research post on training a self-editing search agent.
Remember, Refine, Retrieve — Applied Compute post separating memory, refinement, and retrieval as distinct agent subsystems.
Useful Memories Become Faulty When Continuously Updated by LLMs — Research artifact arguing that repeatedly consolidated textual agent memory can drift, overgeneralize, and perform worse than episodic/raw-rollout baselines.
Electric Agents — Process-control framing for agents as long-lived systems.
Electric signal verb RFC — Concrete proposal for interrupt/control semantics in agent processes.
Kyle Mathews on Electric Agents signals — Thread explaining why agents need signal-like control surfaces.
Contextual Agentic Memory is a Memo, Not True Memory — Paper arguing that common context-memory schemes should be understood as memoization rather than human-like memory.
OBLIQ-Bench / harder search queries thread — Benchmark thread focused on difficult search and retrieval tasks.
Coding agents are not compilers — Essay against treating coding agents as simple spec-to-code compilers.
Coding agents cost accounting — Thread on the economics and practical cost structure of coding-agent workflows.
React Doctor v2 — Example of agentic tooling aimed at diagnosing codebase problems.
fsilavong/agent-eval — Repository for evaluating agents.
intertwine/dspy-agent-skills — Repository exploring DSPy-style agent skills.
DeepSeek-TUI — Terminal UI reference for model/agent interaction.
openprose/prose — Open-source writing/code agent interface reference.
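
Sketch: Ralph Loop with self-verification

The Anatomy of an Agent Harness entry above describes two patterns: the Ralph Loop (reinject the original prompt in a clean context to force continuation) and the Self-Verification Loop (the agent runs its own tests and inspects logs). The Python sketch below combines them in the simplest way that seemed reasonable. Everything named here is hypothetical and not taken from any referenced harness: `ralph_loop`, `agent_step`, `verify`, and the toy agent are illustrative stand-ins, and carrying only the verifier log between rounds is an assumption about how state might persist outside the context window.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not any real harness API):
# an agent step takes the original prompt plus a fresh context and returns output;
# a verifier returns (passed, log), standing in for "run tests and inspect logs".
AgentStep = Callable[[str, List[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


def ralph_loop(
    prompt: str,
    agent_step: AgentStep,
    verify: Verifier,
    max_rounds: int = 5,
) -> Tuple[bool, int, str]:
    """Re-run the agent on the ORIGINAL prompt in a clean context until verify passes.

    Returns (passed, rounds_used, last_output).
    """
    notes = ""   # durable state kept outside the context window (e.g. a file)
    output = ""
    for round_number in range(1, max_rounds + 1):
        context: List[str] = []      # clean context every round
        if notes:
            context.append(notes)    # assumption: only the verifier log survives
        output = agent_step(prompt, context)
        passed, log = verify(output)
        if passed:
            return True, round_number, output
        notes = f"Previous attempt failed verification: {log}"
    return False, max_rounds, output


def toy_agent(prompt: str, context: List[str]) -> str:
    # Toy stand-in for a model call: it only "fixes" the work after seeing a failure log.
    return "fixed" if context else "broken"


def toy_verify(output: str) -> Tuple[bool, str]:
    # Toy stand-in for running a test suite.
    if output == "fixed":
        return True, ""
    return False, "test_example failed"


if __name__ == "__main__":
    print(ralph_loop("Make the tests pass", toy_agent, toy_verify))
```

With the toy agent, the first round fails verification and the second round succeeds once the failure log is fed back in. The `max_rounds` cap is a simple stopping condition; a real harness would likely pair it with something like the loop detection and pre-completion checks mentioned in the LangChain entry.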