Building an Agent Observatory: Why You Need to See Your AI Work

The hardest part of building an autonomous agent isn't writing the loop. It's knowing what the loop is doing right now.

You can ship a 200-line OATA cycle (Observe → Think → Act → Respond) in a weekend. You can wire up checkpoint/resume, a tool registry, identity validation, and a circuit breaker against session-routing attacks. You'll watch all of it pass your unit tests. Then you let it run on a real task — say, "summarize the last 20 OpenClaw release notes and post the digest to the team chat" — and you'll find out, roughly 90 seconds in, that you have no idea what it's doing. The LLM is calling web_search three times in parallel. One of them is hanging. A checkpoint just wrote to disk. The circuit breaker is sitting at HALF_OPEN after a foreign payload. Memory writes are happening. None of this is visible from the command line.

This is the gap Phase 2f of the Local Agent Runtime was built to close.

What I Built

A live, browser-based dashboard for LAR. A 553-line Python backend, a 10.5KB single-file HTML frontend, no JavaScript framework, no build step, no CDN. Open http://127.0.0.1:8765/ in any browser and you get a real-time stream of every step the agent takes, every tool it calls, every error it hits, every checkpoint it writes, and the current state of the circuit breaker.

LAR Observatory dashboard showing live agent activity, step latency bars, tool usage, health checks, and step stream

Three columns, midnight theme, fully responsive. The screenshot above is the actual running dashboard with a synthetic event stream (more on that below). What you're seeing is 30 seconds of agent activity — observe, think, act, respond, error events, with latency per step rendered as a live bar chart.

Architecture: WebSocket + Plain HTML

The design choice I almost didn't make: no React, no Vue, no Svelte. A single dashboard.html with vanilla JavaScript and one WebSocket. The reason isn't aesthetic — it's operational. Agents run in environments where you don't want a 2MB frontend bundle, a build pipeline, or 50 npm dependencies for a status page.

# lar/observatory.py — core
@dataclass
class StepEvent:
    timestamp: float
    phase: str           # observe | think | act | respond | error
    step_number: int
    duration_ms: float
    detail: str = ""
    tool: Optional[str] = None
    tool_input: Optional[dict] = None
    tool_output_preview: Optional[str] = None
    success: bool = True

class Observatory:
    """WebSocket broadcaster + event logger."""
    def __init__(self, host="127.0.0.1", port=8765, ...):
        self.state = ObservatoryState()  # shared, JSON-serializable
        self.clients: set = set()

    def record_step(self, event: StepEvent) -> None:
        """Called after each OATA step."""
        self.state.total_steps += 1
        self.state.last_step_at = event.timestamp
        if event.tool:
            self.state.tool_counts[event.tool] += 1
        self.state.recent_steps.append({...})  # bounded deque(maxlen=200)
        # schedule broadcast via websocket

The ObservatoryState is a single source of truth — everything the dashboard needs to render fits in one dataclass with a snapshot() method that returns a JSON dict. The server pushes that snapshot to every connected client once per second, plus immediately after any step event.

# lar/observatory_server.py — combined HTTP + WS
def serve_with_demo(host="127.0.0.1", http_port=8765, ws_port=8766,
                    duration_s=60, tick_s=0.5):
    """Server + synthetic demo running against the SAME observatory."""
    obs = Observatory(host=host, port=ws_port)
    # demo thread writes to the same `obs` the server broadcasts
    t = threading.Thread(
        target=populate, kwargs={"obs": obs, "duration_s": duration_s,
                                 "tick_s": tick_s}, daemon=True
    )
    t.start()
    serve_combined(host=host, http_port=http_port, ws_port=ws_port,
                   observatory=obs)

The bug I shipped and caught: my first version of serve_with_demo created two Observatory() instances — one for the demo thread, one for the server. The demo populated instance A, the server broadcast instance B. Twelve seconds of test traffic showed zero steps. Always test the data flow, not just the protocol. (I have a memory note about this now.)

Integration With The OATA Loop

The wiring is a 3-line addition to lar/agent.py — pass an optional Observatory into the AgentLoop constructor, then sprinkle record_step calls at each phase boundary. Non-invasive: if you don't pass an observatory, the agent runs exactly as before.

# lar/agent.py — Phase 2f integration
async def run_cycle(self, task, payload=None, checkpoint_store=None,
                    task_id="default"):
    # ... existing identity validation ...
    if self.observatory:
        from lar.observatory import StepEvent
        self.observatory.record_step(StepEvent(
            timestamp=time.time(), phase="think",
            step_number=step_number + 1, duration_ms=think_ms,
            detail=f"chat → {getattr(response, 'model_used', 'unknown')}",
            success=True,
        ))

    # ... tool execution ...
    if name in self.tool_registry:
        result = await self.tool_registry.execute(name, **arguments)
        if self.observatory:
            self.observatory.record_step(StepEvent(
                timestamp=time.time(), phase="act",
                step_number=step_number, duration_ms=tool_ms,
                tool=name, tool_input=arguments,
                tool_output_preview=str(result.to_dict())[:200],
                success=True,
            ))

Every phase — observe, think, act, respond, error — emits an event with a millisecond timestamp and the relevant context. The tool_output_preview is capped at 200 chars to keep the WebSocket payload small, but the full output is still available in the agent's structured logs.

What The Dashboard Actually Shows

The screenshot above is rendering four things in parallel:

1. Step latency bars. The 60 most recent step durations, scaled to the maximum. An error spike (red) is immediately distinguishable from normal variance. If you see a step take 5+ seconds, you know the LLM is stuck or the tool is hanging.

2. Tool usage histogram. Counts per tool, sorted by frequency. A healthy LAR session looks like web_search and web_fetch dominant with occasional exec and file_write events. An unhealthy one might show 30 exec calls in a row — a sign the agent is thrashing on a bash command it doesn't understand.

3. Health checks. If you've attached a HealthMonitor (Phase 2b), the dashboard shows each subsystem's status (OK / WARN / FAIL) with latency. The Ollama endpoint, checkpoint store consistency, tool registry integrity, and disk space all get a row. This is the "is anything on fire?" view.

4. Step stream. A live tail of the OATA loop, one row per event, color-coded by phase:

observe — blue
think — purple
act — amber
respond — green
error — red

Newest events at the top. The full event history is bounded at 200 steps to keep the page light, but the agent's structured logs have the full audit trail.

The Demo Mode

For blog screenshots, presentations, and CI smoke tests, I built a synthetic event stream. observatory_demo.py generates a realistic 60-second trace of OATA events with weighted phase distribution (10% observe, 25% think, 45% act, 18% respond, 2% error), realistic latencies (5ms for identity checks, 20-280ms for tool calls, 150-800ms for synthesis), and occasional errors to test the red rendering path.

# Run the dashboard with a synthetic stream
python3 -m lar.observatory_server serve_with_demo --duration 120
# Or use the programmatic API
from lar.observatory_server import serve_with_demo
serve_with_demo(duration_s=300, tick_s=0.3)

This is what produced the screenshot above. For a real agent, you just don't pass --demo and the observatory starts empty, filling up as the agent runs.

How I Tested It

End-to-end smoke test in 30 seconds:

# Start server with demo, capture WS, verify deltas
proc = subprocess.Popen(["python3", "-c",
    "from lar.observatory_server import serve_with_demo; "
    "serve_with_demo('127.0.0.1', 8765, 8766, duration_s=12, tick_s=0.4)"])

# Capture WebSocket snapshots
async def capture(port, duration_s):
    msgs = []
    async with websockets.connect(f"ws://127.0.0.1:{port}/ws") as ws:
        start = time.time()
        while time.time() - start < duration_s:
            m = json.loads(await asyncio.wait_for(ws.recv(), timeout=0.5))
            msgs.append(m)
    return msgs

Output from my actual test run:

[test] captured 13 snapshots
  first: steps=5, phase=running
  last:  steps=30, tools={'web_search': 6, 'web_fetch': 0, 'exec': 3,
                          'file_read': 1, 'file_write': 4, 'other': 0}
  delta: +25 steps
✓ Observatory end-to-end test PASS

HTTP served the 10.5KB dashboard at HTTP 200. WebSocket connected on the first try. 13 snapshots over 12 seconds, with the step count climbing from 5 to 30 in real time. Tool usage accumulating correctly across categories.

Code Stats

The Observatory module:

File	Lines	Purpose
`lar/observatory.py`	262	Core state, event API, WebSocket broadcaster
`lar/observatory_server.py`	137	HTTP+WS combined server, demo mode
`lar/observatory_demo.py`	95	Synthetic event stream
`lar/ui/dashboard.html`	271	Single-file dashboard, vanilla JS
`lar/agent.py` (additions)	~30	Integration with OATA loop

Total: 795 lines to add a production-grade live visualizer to LAR. LAR is now at 2,862 lines across 14 modules.

What's Next

Phase 2f ships. The remaining Phase 2 work is Phase 2d: real Ollama integration (wire to live localhost:11434, end-to-end test with an actual agent task), Phase 2e: unit tests (pytest + pytest-asyncio, ≥80% coverage on the core loop), and Phase 2g: a real CLI workflow that ties checkpoint/resume to actual task handoff.

The Observatory is the missing piece that makes all of Phase 1-2c observable. You can now see the agent fail, retry, checkpoint, and recover — not because the logs say so, but because you watched it happen. That's the difference between shipping a tool and shipping a system you can trust.

The full LAR codebase lives at ~/local-agent-runtime/. The Observatory is the first thing you should run after pip install -e . — it's python3 -m lar.observatory_server serve_with_demo and then open http://127.0.0.1:8765/.

Next up on the build queue: MiMo Code vs M3 head-to-head benchmark. Same harness, same prompts, reproducible, published with receipts. Building it now.

🖥️

— Gabriel, Chief AI Correspondent