
Actionbook 1.0: From Browser Tool to Agent Runtime
Actionbook 1.0 redesigns the CLI from the ground up for AI agents.
We spent the past few months watching how agents actually use browser automation tools. The takeaway was pretty clear: traditional CLIs are built for humans. They assume you'll remember context, tolerate ambiguous output, and figure out the next step on your own when something breaks. Agents don't work that way. They need explicit addressing, structured protocols, and predictable error recovery paths.
Here's the core change in 1.0:
Old version (human-oriented):
actionbook browser switch 2
actionbook browser click "#submit"
New version (agent-oriented):
actionbook browser click @e5 --session s1 --tab t1
This isn't a syntax tweak. It's a complete rewrite, from an "implicit-state CLI" to an "explicitly-addressed runtime." Every command carries its own target address. Every response follows a uniform protocol. Every error includes recovery instructions.
This post covers what we changed, why we changed it, and how it makes agents more reliable at operating browsers.
The Problem with the Old Version: The Triple Cost of Implicit State
The Fundamental Flaw of Addressing
The old CLI relied on a global "current page" state:
actionbook browser open https://example.com
actionbook browser pages
actionbook browser switch 2
actionbook browser click "#submit"
Intuitive for humans, but fundamentally broken for agents:
- Implicit dependency: click operates on "whichever page was last activated," not on a target specified by the command itself.
- State pollution: Any switch mutates global state, affecting every subsequent command.
- Difficult recovery: After a failure, the agent has to infer where the "current tab" actually is rather than simply retrying the command.
No Concurrency Safety
A global active tab makes parallel operations dangerous:
# When two commands execute simultaneously, which one wins?
actionbook browser switch 2 && actionbook browser click "#submit"
actionbook browser switch 5 && actionbook browser fill "#input" "text"
Without explicit addressing, there are no concurrency guarantees. Agents must serialize all browser operations, even when they target completely independent tabs.
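The race is easy to reproduce in a toy model. The sketch below is not Actionbook code: ImplicitBrowser and ExplicitBrowser are hypothetical stand-ins, and the interleaving of the two workers is written out by hand so the outcome is deterministic.

```python
# Toy model of the race described above. Both classes are hypothetical
# stand-ins, not Actionbook APIs.

class ImplicitBrowser:
    """Old style: commands act on a shared 'current tab'."""

    def __init__(self):
        self.current_tab = None
        self.log = []

    def switch(self, tab):
        self.current_tab = tab

    def click(self, selector):
        self.log.append((self.current_tab, "click", selector))

    def fill(self, selector, text):
        self.log.append((self.current_tab, "fill", selector))


class ExplicitBrowser:
    """New style: every command names its own target tab."""

    def __init__(self):
        self.log = []

    def click(self, selector, tab):
        self.log.append((tab, "click", selector))

    def fill(self, selector, text, tab):
        self.log.append((tab, "fill", selector))


# Worker A wants to click on tab 2; worker B wants to fill on tab 5.
implicit = ImplicitBrowser()
implicit.switch(2)               # worker A
implicit.switch(5)               # worker B runs before A's click
implicit.click("#submit")        # worker A: lands on tab 5, not tab 2
implicit.fill("#input", "text")  # worker B

# With explicit addressing, the order of the two calls no longer matters.
explicit = ExplicitBrowser()
explicit.fill("#input", "text", tab=5)  # worker B
explicit.click("#submit", tab=2)        # worker A
```

In the implicit model, worker A's click is recorded against tab 5. In the explicit model, each action is recorded against the tab it named, whatever the order.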
The Interface Evolution Dilemma
Over multiple iterations of the old version, we kept adding real capabilities: persistent daemon connections, multi-session support, extension bridge, file uploads. But the underlying interface design (implicit state, a global current tab, human-oriented tab switching) never changed.
We eventually reached a clear conclusion: the problem wasn't missing commands. The abstraction itself was wrong.
Core Design Principles of 1.0
Explicit Addressing: No More "Current Tab"
The single most important design change in 1.0 fits in one sentence:
Every browser command explicitly addresses its target via --session and --tab. There is no current tab.
Before:
actionbook browser switch 2
actionbook browser click "#submit"
After:
actionbook browser click @e5 --session s1 --tab t1
actionbook browser fill @e3 "hello" --session s1 --tab t2
The increased verbosity is intentional. It's the necessary cost of explicit addressing. Every invocation is self-contained, carrying the full address of its target. Three things fall out of this:
Parallel safety: No global active tab means no races. An agent can operate on s1/t1 and s1/t2 simultaneously without coordination.
Recoverability: A failed command doesn't pollute the session. Retry with the same --session --tab, and it picks up from the point of failure.
Stateless reasoning: The agent never needs to track "where am I now." Each command declares where it's going.
Think of it this way: humans open a file in an IDE before editing; agents call write(path, content) directly. --session s1 --tab t1 is the absolute filesystem path of browser automation.
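On the agent side, "self-contained" can be enforced where the command is built. This is a minimal sketch: build_command is a hypothetical helper, and only the --session and --tab flags and the command shapes come from the examples above.

```python
import shlex

# Hypothetical helper: builds a self-contained argv for one browser command.
# It refuses to build anything that lacks a full address.
def build_command(action, args, session, tab):
    if not session or not tab:
        raise ValueError("every command must carry --session and --tab")
    return ["actionbook", "browser", action, *args,
            "--session", session, "--tab", tab]


argv = build_command("fill", ["@e3", "hello"], session="s1", tab="t2")
line = shlex.join(argv)  # shell-safe string form of the same command
```

Because the address is a required argument rather than ambient state, a retry is just a second call with the same inputs.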
Stateless Interface, Stateful Runtime
This brings us to the core architectural principle behind the new version:
The interface is stateless. The runtime is stateful.
Not a contradiction. A complement.
Why the interface must be stateless. Agent tool calls are inherently discrete. Each call may happen in a different reasoning window, a different recovery path, or after a context truncation. Hidden state forces the agent to spend tokens on state synchronization rather than task execution.
Why the runtime must be stateful. Browsers are inherently stateful: CDP connections, open tabs, cookies, DOM snapshots, ref caches. Rebuilding all of this from scratch on every command would be prohibitively expensive.
So the division of responsibility is natural: thin CLI + thick daemon.
| Layer | Responsibility |
|---|---|
| CLI (thin) | Parse arguments, construct actions, call the daemon, format output |
| Daemon (thick) | Session/tab registry, CDP connection pooling, lifecycle management, backend routing (local / extension / cloud) |
The CLI never directly operates the browser. It sends structured actions to the daemon and formats the results. The daemon holds all browser state and handles concurrency, cleanup, and recovery.
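To make the daemon's role concrete, here is a rough sketch of a session/tab registry that resolves a full address and never falls back to a default. The class, its methods, and the SESSION_NOT_FOUND / TAB_NOT_FOUND codes are assumptions for illustration; the post does not describe the daemon's internal data structures.

```python
# Minimal sketch of a daemon-side registry. All names and error codes here
# are hypothetical.

class AddressError(Exception):
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


class SessionRegistry:
    def __init__(self):
        self._sessions = {}  # session_id -> {tab_id: tab_state}

    def open_tab(self, session_id, tab_id, url):
        tabs = self._sessions.setdefault(session_id, {})
        tabs[tab_id] = {"url": url}

    def resolve(self, session_id, tab_id):
        """Look up a tab by its full address; there is no 'current' fallback."""
        tabs = self._sessions.get(session_id)
        if tabs is None:
            raise AddressError("SESSION_NOT_FOUND",
                               f"unknown session '{session_id}'")
        if tab_id not in tabs:
            raise AddressError("TAB_NOT_FOUND",
                               f"unknown tab '{tab_id}' in session '{session_id}'")
        return tabs[tab_id]
```

A lookup either returns exactly the tab that was named or fails with a typed code, which is what lets the CLI stay thin.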
Architecture Layers
┌─────────────────────────────────────────┐
│ Agent (LLM) │
│ - Constructs actions │
│ - Parses responses │
└──────────────┬──────────────────────────┘
│ CLI invocation (tool_use)
┌──────────────▼──────────────────────────┐
│ CLI (Thin Client) │
│ - Argument parsing │
│ - Output formatting │
└──────────────┬──────────────────────────┘
│ Unix socket (length-prefixed JSON)
┌──────────────▼──────────────────────────┐
│ Daemon (Stateful Runtime) │
│ ┌────────────────────────────────────┐ │
│ │ Session Registry │ │
│ │ - s1 → { t1, t2, t3 } │ │
│ │ - s2 → { t1 } │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ CDP Connection Pool │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Ref Cache (per tab) │ │
│ └────────────────────────────────────┘ │
└──────────────┬──────────────────────────┘
│ CDP / Extension API
┌──────────────▼──────────────────────────┐
│ Browser (Chrome / Extension / Cloud) │
└─────────────────────────────────────────┘
Each layer has a clean job:
- Agent layer: Task logic only. Doesn't care about browser connection details.
- CLI layer: Thin protocol translation. Stateless.
- Daemon layer: All runtime state. Lifecycle management.
- Browser layer: The actual execution environment, transparent to everything above.
This is the shift from "tool" to "system."
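The diagram labels the CLI-to-daemon link as length-prefixed JSON over a Unix socket. As a sketch of what that framing could look like, assuming a 4-byte big-endian length header (the actual header size and byte order are not stated here) and a made-up action payload:

```python
import json
import struct

# Assumption: 4-byte big-endian length header before each JSON payload.
HEADER = struct.Struct(">I")


def encode_frame(message):
    """Serialize one message as <length><utf-8 JSON>."""
    payload = json.dumps(message).encode("utf-8")
    return HEADER.pack(len(payload)) + payload


def decode_frames(buffer):
    """Return (messages, remaining_bytes) for a possibly partial buffer."""
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete frame; wait for more bytes
        messages.append(json.loads(buffer[offset + HEADER.size:end]))
        offset = end
    return messages, buffer[offset:]
```

The length prefix tells the reader where each message ends on a stream socket, so neither side has to scan for delimiters or guess at message boundaries.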
Output as Protocol
Explicit addressing solves "how to locate." The output protocol solves "how to communicate."
Traditional CLI output is human-readable text. Output consumed by agents needs to be a protocol.
In the new version, every command response follows a uniform envelope:
{
"ok": true,
"command": "browser snapshot",
"context": {
"session_id": "s1",
"tab_id": "t1",
"url": "https://example.com",
"title": "Example"
},
"data": { ... },
"meta": {
"duration_ms": 142,
"warnings": [],
"truncated": false
}
}
Every response includes a context (session, tab, URL, title), so the agent always knows where it is. Success and failure share the same structure. Text output mode also follows a fixed, token-efficient format rather than free-form prose.
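Since success and failure share one shape, an agent harness can get by with a single parser. A minimal sketch, assuming only the top-level keys shown in the envelopes in this post; the Envelope class and parse_envelope function are illustrative names, not part of Actionbook.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Envelope:
    ok: bool
    command: str
    context: dict = field(default_factory=dict)
    data: dict = field(default_factory=dict)
    error: dict = field(default_factory=dict)
    meta: dict = field(default_factory=dict)


def parse_envelope(raw):
    """Parse one JSON response into an Envelope, success or failure alike."""
    doc = json.loads(raw)
    missing = [key for key in ("ok", "command") if key not in doc]
    if missing:
        raise ValueError(f"not a response envelope, missing {missing}")
    return Envelope(
        ok=bool(doc["ok"]),
        command=doc["command"],
        context=doc.get("context", {}),
        data=doc.get("data", {}),
        error=doc.get("error", {}),
        meta=doc.get("meta", {}),
    )
```

Callers branch on ok and then read either data or error; context is available in both cases.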
Error-Guided Recovery
Failed commands include a typed code and a hint that tell the agent exactly what to do next. For example, when a click fails because a ref is stale:
{
"ok": false,
"command": "browser click",
"error": {
"code": "ELEMENT_NOT_FOUND",
"message": "snapshot ref '@e5' not found",
"hint": "run 'browser snapshot' first to generate element refs"
},
"context": {
"session_id": "s1",
"tab_id": "t1",
"url": "https://example.com"
}
}
The agent follows the hint, gets fresh refs, and retries. No guessing, no trial-and-error. Every error carries an explicit instruction for recovery, which cuts token consumption and task latency significantly.
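Here is roughly what that loop looks like in a harness. The run argument is a hypothetical callable that takes an argv list and returns a parsed response envelope as a dict; only the ELEMENT_NOT_FOUND code and the snapshot-then-retry step come from the example above.

```python
# Sketch of hint-driven recovery for a click. `run` is a hypothetical
# callable: argv list in, parsed response envelope (dict) out.

def click_with_recovery(run, ref, session, tab, max_attempts=2):
    address = ["--session", session, "--tab", tab]
    response = None
    for attempt in range(max_attempts):
        response = run(["browser", "click", ref, *address])
        if response["ok"]:
            return response
        if response["error"]["code"] != "ELEMENT_NOT_FOUND":
            return response  # different failure: surface it unchanged
        if attempt + 1 < max_attempts:
            # Follow the hint: take a snapshot so refs exist, then retry.
            run(["browser", "snapshot", *address])
    return response
```

A real agent would usually re-read the fresh snapshot before retrying, since the element it wants may now carry a different ref. The sketch retries the same ref to keep the control flow visible.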
Key Features
Snapshot Refs: Element Handles Designed for Agents
Traditional browser automation relies on CSS selectors: #submit, .btn-primary, div > form > input:nth-child(3). These selectors are fragile, verbose, and expensive in tokens.
Actionbook 1.0 introduces snapshot refs, stable short handles generated from page snapshots:
# Take a snapshot, get a structured view with refs
actionbook browser snapshot --session s1 --tab t1
# Operate using refs
actionbook browser click @e5 --session s1 --tab t1
actionbook browser fill @e3 "search query" --session s1 --tab t1
The snapshot command converts the page's accessibility tree into an agent-readable structure, assigning each interactive element a ref like @e5. These refs are stable: as long as a DOM node hasn't been destroyed and recreated, the same node gets the same ref across multiple snapshots.
This enables the observe-then-act pattern: snapshot once, operate many times. No need to re-parse the entire DOM on every step. Token costs drop, multi-step interactions get more reliable, and agents reason with refs (@e5) instead of fragile selectors.
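One way to get that stability property is a per-tab cache keyed by node identity. The sketch below assumes each DOM node exposes an identifier that survives between snapshots (called node_id here); how Actionbook actually tracks node identity is not described in this post, so treat RefCache as an illustration of the idea only.

```python
# Sketch of a per-tab ref cache. Assumption: `node_id` is some identifier
# that stays the same for a DOM node until the node is destroyed.

class RefCache:
    def __init__(self):
        self._ref_by_node = {}
        self._next = 1

    def assign(self, node_ids):
        """Return {node_id: ref} for one snapshot, reusing refs for known nodes."""
        refs = {}
        for node_id in node_ids:
            if node_id not in self._ref_by_node:
                self._ref_by_node[node_id] = f"@e{self._next}"
                self._next += 1
            refs[node_id] = self._ref_by_node[node_id]
        # Forget nodes that vanished, so a recreated node gets a fresh ref.
        self._ref_by_node = dict(refs)
        return refs
```

Nodes that persist across snapshots keep their refs, and a ref number is never reused, so a stale ref fails the lookup instead of pointing at a different element.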
Batch Operations
With the new addressing design in place, we extended it to batch operations, opening multiple tabs in a single command:
# Open 3 tabs at once
actionbook browser new-tab https://a.com https://b.com https://c.com --session s1
# Specify custom tab IDs
actionbook browser new-tab https://a.com https://b.com --session s1 --tab inbox --tab docs
Each URL receives an auto-assigned or custom tab ID. The response returns all created tabs:
{
"ok": true,
"command": "browser new-tab",
"data": {
"session_id": "s1",
"requested_urls": 3,
"opened_tabs": 3,
"failed_urls": 0,
"tabs": [
{ "tab_id": "t2", "url": "https://a.com" },
{ "tab_id": "t3", "url": "https://b.com" },
{ "tab_id": "t4", "url": "https://c.com" }
]
}
}
Partial failures are handled gracefully. If 2 out of 3 URLs succeed, the response returns both the successfully opened tabs and the failed URLs with error codes. The agent only needs to retry the failures.
This falls out naturally from the new design. Because there's no global tab state, batch-creating tabs is just a matter of issuing multiple Target.createTarget calls in parallel and registering each resulting tab in the session registry. No locks, no coordination, no confusion about "which tab is current."
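A sketch of the same idea with asyncio: open every URL concurrently, then sort the results into opened tabs and failures. The open_one coroutine is a hypothetical stand-in for whatever actually creates a tab, and the failures list is an assumed shape, since the response above only shows the success case.

```python
import asyncio

# Sketch of batch tab creation with partial-failure reporting.
# `open_one(session_id, url)` is a hypothetical coroutine returning a tab ID.

async def open_tabs(open_one, session_id, urls):
    results = await asyncio.gather(
        *(open_one(session_id, url) for url in urls),
        return_exceptions=True,  # one failure must not cancel the others
    )
    tabs, failures = [], []
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failures.append({"url": url, "error": str(result)})
        else:
            tabs.append({"tab_id": result, "url": url})
    return {
        "session_id": session_id,
        "requested_urls": len(urls),
        "opened_tabs": len(tabs),
        "failed_urls": len(failures),
        "tabs": tabs,
        "failures": failures,  # assumed field, not shown in the post
    }
```

The caller can then retry only the URLs listed under failures, using the same session ID.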
Engineering Practices
The Engineering Cost of Consistency
Going from the old version to the new, a huge chunk of the work wasn't adding capabilities. It was making 50+ browser commands speak the same language.
Case Study: Unifying the Context Field
In the old version, different commands returned context information in inconsistent formats:
# browser snapshot returned
{ "url": "...", "title": "..." }
# browser click returned
{ "page_url": "...", "page_title": "..." }
# browser goto returned
{ "current_url": "..." } # no title
This forced agents to maintain different parsing logic for each command. In the new version, we unified the context structure across all commands:
{
"context": {
"session_id": "s1",
"tab_id": "t1",
"url": "https://example.com",
"title": "Example Domain"
}
}
Looks simple. It required:
- Modifying the output construction logic of 50+ commands
- Updating assertions in all E2E tests
- Ensuring correct version negotiation between the daemon and CLI
The payoff: agents need only one set of parsing logic, and both token consumption and error rates dropped significantly.
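For a sense of what agents had to carry before, here is the kind of shim needed to read all three legacy shapes. The legacy key names are the ones quoted above; normalize_context itself is an illustrative helper, not part of Actionbook.

```python
# Maps the three legacy response shapes onto the unified context object.

def normalize_context(legacy, session_id, tab_id):
    url = (legacy.get("url")
           or legacy.get("page_url")
           or legacy.get("current_url"))
    title = legacy.get("title") or legacy.get("page_title")
    return {
        "session_id": session_id,
        "tab_id": tab_id,
        "url": url,
        "title": title,  # None for shapes that never carried a title
    }
```

With the unified context, this shim and every per-command special case behind it can be deleted.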
What's Next
The new architecture opens the door to large-scale parallel operations. The scenario we're building toward:
A single agent managing hundreds of tabs simultaneously. Data collection, form filling, content monitoring, automated testing. Each tab has an independent lifecycle; failures and recoveries in one don't affect others.
That's Actionbook's evolution from "browser automation tool" to "agent browser runtime."