I Looked Inside Clawdbot's Architecture. Here's What Most Developers Get Wrong About AI Agents.

I took a look inside Clawdbot's (aka Moltbot's) architecture: how it handles agent execution, tool use, browser automation, and more. There are many lessons here for AI engineers.

Learning how Clawdbot works under the hood gives you a better understanding of the system and its capabilities, and most importantly, what it's GOOD at and BAD at.

This started as personal curiosity about how Clawdbot handles its memory and how reliable it is. In this article I'll walk through, at a surface level, how Clawdbot works.

What Clawdbot Actually Is

So everybody knows Clawdbot is a personal assistant you can run locally or through model APIs and reach as easily as from your phone. But what is it really?

At its core, Clawdbot is a TypeScript CLI application.

It’s not Python, Next.js, or a web app.

It’s a process that:

  • Runs on your machine and exposes a gateway server to handle all channel connections (Telegram, WhatsApp, Slack, etc.)
  • Makes calls to LLM APIs (Anthropic, OpenAI, local, etc.)
  • Executes tools locally
  • And does whatever you want on your computer

The Architecture

To explain the architecture more simply, here’s what happens when you message Clawdbot on a messenger:

1. Channel Adapter

A Channel Adapter takes your message and processes it (normalization, attachment extraction). Each messenger and input stream has its own dedicated adapter.
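
An adapter like this can be sketched as a small normalization layer. The interfaces and field names below are illustrative assumptions, not Clawdbot's actual types:

```typescript
// Hypothetical sketch of a channel adapter. Each messenger maps its
// raw platform payload into one normalized shape for the gateway.
interface InboundMessage {
  channel: string;
  senderId: string;
  text: string;
  attachments: { name: string; url: string }[];
}

interface ChannelAdapter<Raw> {
  normalize(raw: Raw): InboundMessage;
}

// Example: a Telegram-style payload with an optional document.
type TelegramUpdate = {
  message: {
    from: { id: number };
    text?: string;
    document?: { file_name: string; file_url: string };
  };
};

const telegramAdapter: ChannelAdapter<TelegramUpdate> = {
  normalize(raw) {
    const m = raw.message;
    return {
      channel: "telegram",
      senderId: String(m.from.id),
      text: m.text ?? "",
      attachments: m.document
        ? [{ name: m.document.file_name, url: m.document.file_url }]
        : [],
    };
  },
};
```

The payoff is that everything downstream (gateway, agent runner) only ever sees one message shape, regardless of which messenger it came from.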

2. Gateway Server

The Gateway Server is the task/session coordinator. It takes your message and routes it to the right session. This is the heart of Clawdbot: it juggles multiple overlapping requests.

Lane-Based Command Queue : An abstraction over queues where serialization is the default architecture. A session has its own dedicated lane, and low-risk parallelizable tasks can run in parallel lanes. This shifts the mental model from “what do I need to lock?” to “what’s safe to parallelize?”

To serialize operations, Clawdbot uses a lane-based command queue. Each session gets its own dedicated lane, while low-risk parallelizable tasks (such as cron jobs) run in separate lanes.

This is in contrast to async/await spaghetti. Over-parallelization hurts reliability and invites a swarm of debugging nightmares.

Default to serial; go parallel explicitly.

If you’ve worked with agents you’ve already realized this to some extent. It’s also the insight behind Cognition’s “Don’t Build Multi-Agents” blog post. A naive async setup per agent leaves you with a dump of interleaved garbage: logs become unreadable, and if agents share state, race conditions are a constant fear you must account for in development.

A lane is an abstraction over queues where serialization is the default rather than an afterthought. As a developer, you write code naturally, and the queue handles the race conditions for you.
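
The core of a lane-based queue fits in a few lines. This is a minimal illustrative sketch (not Clawdbot's implementation): each lane is just a promise chain, so tasks in the same lane serialize while different lanes run concurrently:

```typescript
// Minimal sketch of a lane-based command queue. Tasks enqueued on
// the same lane run strictly in order; separate lanes are independent.
class LaneQueue {
  private tails = new Map<string, Promise<unknown>>();

  enqueue<T>(lane: string, task: () => Promise<T>): Promise<T> {
    const tail = this.tails.get(lane) ?? Promise.resolve();
    // Chain onto the lane's tail; run the task whether the previous
    // one resolved or rejected, so one failure doesn't poison the lane.
    const next = tail.then(task, task);
    this.tails.set(lane, next);
    return next;
  }
}
```

A session's messages would all go to something like lane `session:<id>`, while a cron job gets its own lane and runs in parallel without any locks.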

3. Agent Runner

This is where the actual AI comes in. The runner figures out which model to use, picks an API key (if none work, it marks the profile as in cooldown and tries the next), and falls back to a different model if the primary one fails.
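
The cooldown logic can be sketched like this. Names, shapes, and the cooldown duration are assumptions for illustration:

```typescript
// Illustrative sketch of API-key selection with cooldown. A profile
// that errors gets benched for a while; the runner tries the next one.
interface KeyProfile {
  key: string;
  cooldownUntil: number; // epoch ms; 0 means available now
}

// Pick the first profile that isn't cooling down, or null if all are.
function pickProfile(profiles: KeyProfile[], now: number): KeyProfile | null {
  return profiles.find((p) => p.cooldownUntil <= now) ?? null;
}

// Bench a failing profile for a fixed window (duration is a guess).
function markCooldown(profile: KeyProfile, now: number, ms = 60_000): void {
  profile.cooldownUntil = now + ms;
}
```

When `pickProfile` returns null, the runner would fall back to a different model entirely, per the description above.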

The agent runner assembles the system prompt dynamically from available tools, skills, and memory, then appends the session history (from a .jsonl file).

The assembled prompt then passes through a context window guard, which makes sure there is enough context space. If the context is almost full, it either compacts the session (summarizes the history) or fails gracefully.
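
A guard like that can be sketched as follows. The token estimate and the "summarize the older half" policy are assumptions; a real system would use the model's tokenizer and its own compaction strategy:

```typescript
// Illustrative sketch of a context-window guard with compaction.
interface Turn {
  role: string;
  content: string;
}

// Crude token estimate: roughly 4 characters per token.
const estimateTokens = (t: Turn): number => Math.ceil(t.content.length / 4);

function guardContext(
  history: Turn[],
  maxTokens: number,
  summarize: (turns: Turn[]) => Turn,
): Turn[] {
  const total = history.reduce((n, t) => n + estimateTokens(t), 0);
  if (total <= maxTokens) return history; // plenty of room, pass through
  // Compact: summarize the older half, keep the recent half verbatim.
  const keepFrom = Math.floor(history.length / 2);
  return [summarize(history.slice(0, keepFrom)), ...history.slice(keepFrom)];
}
```

The key design point is that compaction is lossy but bounded: recent turns stay intact, and only older context collapses into a summary.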

4. LLM API Call

The LLM call itself streams responses behind an abstraction over different providers. It can also request extended thinking if the model supports it.

5. Agentic Loop

If the LLM returns a tool-call response, Clawdbot executes it locally and adds the result to the conversation. This repeats until the LLM responds with final text or hits the max-turn limit (default ~20).
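
Stripped of streaming and provider details, the loop looks roughly like this (a simplified sketch; function names are illustrative):

```typescript
// Sketch of the agentic loop: feed tool results back to the model
// until it answers in plain text or exhausts the turn budget.
type LlmReply =
  | { kind: "text"; text: string }
  | { kind: "tool"; name: string; args: unknown };

async function agentLoop(
  callLlm: (history: unknown[]) => Promise<LlmReply>,
  runTool: (name: string, args: unknown) => Promise<string>,
  maxTurns = 20,
): Promise<string> {
  const history: unknown[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callLlm(history);
    if (reply.kind === "text") return reply.text; // final answer
    // Execute the requested tool locally and append the result,
    // so the next LLM call sees what happened.
    const result = await runTool(reply.name, reply.args);
    history.push({ tool: reply.name, result });
  }
  return "(max turns reached)";
}
```

The turn cap matters: without it, a model stuck requesting the same failing tool would loop forever.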

This is also where Computer Use happens, which I’ll get to.

6. Response Path

Pretty standard. Responses travel back to you through the channel. The session is also persisted as a basic JSONL file, with each line a JSON object for a user message, tool call, result, response, and so on. This is how Clawdbot remembers (session-based memory).
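
JSONL persistence is appealingly simple: append one JSON object per line, and replay the file to reload a session. A minimal sketch (the file location here is illustrative):

```typescript
import { appendFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Demo path only; a real session file would live in the app's data dir.
const sessionFile = join(
  tmpdir(),
  `clawdbot-session-demo-${process.pid}-${Date.now()}.jsonl`,
);

// Appending is the whole write path: one JSON object, one line.
function appendEvent(event: Record<string, unknown>): void {
  appendFileSync(sessionFile, JSON.stringify(event) + "\n");
}

// Reloading a session is just parsing the file line by line.
function loadEvents(): Record<string, unknown>[] {
  return readFileSync(sessionFile, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}
```

Append-only writes are crash-friendly (a partial last line loses at most one event) and the transcript doubles as a human-readable debug log.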

How Clawdbot Remembers

Without a proper memory system, an AI assistant has the memory of a goldfish. Clawdbot handles this through two systems:

  1. Session transcripts in JSONL as mentioned
  2. Memory files as markdown in MEMORY.md or the memory/ folder

For searching, it uses a hybrid of vector search and keyword matches. This captures the best of both worlds.

So searching for “authentication bug” finds both documents mentioning “auth issues” (semantic match) and documents containing the exact phrase (keyword match).

Vector search is backed by SQLite, and keyword search by FTS5, which is also a SQLite extension. The embedding provider is configurable.
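
The merge idea can be shown with a toy in-memory version (Clawdbot itself uses SQLite vectors plus FTS5; the 50/50 score weighting here is an arbitrary illustration):

```typescript
// Toy sketch of hybrid retrieval: blend a keyword-overlap score with
// cosine similarity over embeddings, then rank by the combined score.
interface Doc {
  id: string;
  text: string;
  embedding: number[];
}

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
};

function hybridSearch(docs: Doc[], queryText: string, queryVec: number[]) {
  const terms = queryText.toLowerCase().split(/\s+/);
  return docs
    .map((d) => {
      // Fraction of query terms appearing verbatim in the document.
      const kw =
        terms.filter((t) => d.text.toLowerCase().includes(t)).length /
        terms.length;
      return { id: d.id, score: 0.5 * kw + 0.5 * cosine(d.embedding, queryVec) };
    })
    .sort((a, b) => b.score - a.score);
}
```

In the real system the embeddings come from the configurable provider and the keyword side is FTS5's ranked full-text matching, but the blending principle is the same.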

It also benefits from smart syncing: a file watcher triggers a sync whenever the memory files change.

These markdown files are generated by the agent itself using the standard ‘write’ file tool. There’s no special memory-write API; the agent simply writes to memory/*.md.

When a new conversation starts, a hook grabs the previous conversation and writes a summary in markdown.
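
Because memory is just files, the whole hook reduces to an ordinary write. A sketch under that assumption (paths and naming scheme are illustrative):

```typescript
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Sketch of a session-end hook: the summary lands in memory/<id>.md
// via a plain file write — no special memory API involved.
function persistSummary(
  memoryDir: string,
  sessionId: string,
  summary: string,
): string {
  mkdirSync(memoryDir, { recursive: true });
  const file = join(memoryDir, `${sessionId}.md`);
  writeFileSync(file, `# Session ${sessionId}\n\n${summary}\n`);
  return file;
}

// Reading memory back is equally plain.
function readSummary(file: string): string {
  return readFileSync(file, "utf8");
}
```

The elegance is that the same read/write/edit tools the agent already has for regular work double as its memory interface.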

Clawdbot’s memory system is surprisingly simple. No merging of memories, no monthly/weekly memory compressions.

This simplicity can be an advantage or a pitfall depending on your perspective, but I’m always in favor of explainable simplicity rather than complex spaghetti.

Memory persists forever, and old memories carry essentially equal weight, so there’s no forgetting curve.

Computer Use: How It Uses Your Machine

This is one of Clawdbot’s moats: you give it a computer and let it use it. So how does it use the computer?

Clawdbot gives the agent significant computer access, at your own risk. It uses an exec tool to run shell commands in one of three places:

  • Sandbox: the default, where commands run in a Docker container
  • Directly on the host machine
  • On remote devices

Aside from that Clawdbot also has:

  • Filesystem tools (read, write, edit)
  • Browser tool, which is Playwright-based with semantic snapshots
  • Process management (process tool) for long-running background commands, killing processes, etc.

The Safety Model (Or Lack Of?)

Similar to Claude Code, there is an allowlist for commands; the user approves them via allow-once, always-allow, or deny prompts.

```json
// ~/.clawdbot/exec-approvals.json
{
  "agents": {
    "main": {
      "allowlist": [
        { "pattern": "/usr/bin/npm", "lastUsedAt": 1706644800 },
        { "pattern": "/opt/homebrew/bin/git", "lastUsedAt": 1706644900 }
      ]
    }
  }
}
```

Safe commands (such as jq, grep, cut, sort, uniq, head, tail, tr, wc) are pre-approved out of the box.

Dangerous shell constructs are blocked by default:

```shell
# these get rejected before execution:
npm install $(cat /etc/passwd)     # command substitution
cat file > /etc/hosts              # redirection
rm -rf / || echo "failed"          # chained with ||
(sudo rm -rf /)                    # subshell
```
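
A pre-execution check for constructs like these can be sketched with a pattern table. The patterns below are illustrative and deliberately coarse, not Clawdbot's actual rules (a production checker would parse the command rather than regex it):

```typescript
// Sketch of a dangerous-construct gate: scan the command string for
// shell features that could smuggle in unapproved behavior.
const DANGEROUS: Array<[RegExp, string]> = [
  [/\$\(/, "command substitution"],
  [/`/, "backtick substitution"],
  [/>/, "redirection"],
  [/\|\||&&/, "chaining"],
  [/^\(|\s\(/, "subshell"],
];

// Returns the reason a command is rejected, or null if it looks safe.
function rejectReason(command: string): string | null {
  for (const [re, reason] of DANGEROUS) {
    if (re.test(command)) return reason;
  }
  return null;
}
```

The point of blocking these constructs specifically is that each one lets an allowlisted binary do something the allowlist never approved: `$(...)` injects a second command, `>` touches arbitrary files, and `||`/`(...)` chain in unapproved programs.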

The safety model is very similar to Claude Code’s. The idea is to grant as much autonomy as the user allows.

Browser: Semantic Snapshots

Semantic Snapshot : A text-based representation of a web page’s accessibility tree (ARIA) instead of a visual screenshot. This reduces token cost dramatically while preserving the information needed for interaction.

The browser tool does not primarily rely on screenshots; instead it uses semantic snapshots, text-based representations of the page’s accessibility (ARIA) tree.

So an agent would see:

```text
- button "Sign In" [ref=1]
- textbox "Email" [ref=2]
- textbox "Password" [ref=3]
- link "Forgot password?" [ref=4]
- heading "Welcome back"
- list
  - listitem "Dashboard"
  - listitem "Settings"
```
This yields significant advantages. As you may have guessed, browsing websites is not inherently a visual task.

Where a screenshot can run to 5 MB, a semantic snapshot is typically under 50 KB, at a fraction of the token cost of an image.
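
Rendering such a snapshot from an accessibility tree is mostly an indentation exercise. A sketch under assumed node shapes (field names are illustrative; Playwright exposes its own snapshot format):

```typescript
// Sketch: flatten an accessibility tree into the indented text lines
// shown above. Only interactive elements carry a [ref=N] the agent
// can target in follow-up click/type actions.
interface AriaNode {
  role: string;
  name?: string;
  ref?: number;
  children?: AriaNode[];
}

function renderSnapshot(node: AriaNode, depth = 0): string[] {
  const label = node.name ? ` "${node.name}"` : "";
  const ref = node.ref !== undefined ? ` [ref=${node.ref}]` : "";
  const line = `${"  ".repeat(depth)}- ${node.role}${label}${ref}`;
  const childLines = (node.children ?? []).flatMap((c) =>
    renderSnapshot(c, depth + 1),
  );
  return [line, ...childLines];
}
```

The `ref` numbers are what make the format actionable: instead of guessing pixel coordinates, the agent says "click ref=1" and the tool resolves that back to the live element.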

FAQ

Why use TypeScript instead of Python for an AI agent?

TypeScript provides better async handling for the many concurrent operations in a gateway server, type safety for complex message routing, and first-class support for the Node.js ecosystem used by browser automation tools like Playwright.

How does the lane-based queue prevent race conditions?

Each session gets its own dedicated lane where operations execute serially by default. Only explicitly marked parallel-safe operations (like cron jobs) run in separate lanes. This makes concurrency opt-in rather than opt-out.

Why hybrid search instead of just vector search?

Vector search finds semantically similar content but can miss exact matches. Keyword search finds exact phrases but misses related concepts. Combining both captures “authentication bug” whether the document says “auth issues” or the exact phrase.

Is semantic snapshot browsing less capable than screenshot-based?

For most web automation tasks, semantic snapshots are more capable because they provide structured, actionable data. Screenshots require vision models to interpret pixels into actions. Semantic snapshots give the agent direct access to interactive elements.

How does memory persist across conversations?

A hook captures the previous conversation when a new one starts and writes a summary to markdown files in the memory folder. The agent uses standard file tools to read and write memory, no special API.

Conclusion

Key Takeaways

  • Clawdbot is a TypeScript CLI, not a web app or Python script
  • Lane-based command queues serialize operations by default, preventing race conditions
  • The architecture follows “default to serial, go parallel explicitly”
  • Memory uses hybrid search: SQLite vectors plus FTS5 keyword matching
  • Memory files are plain markdown written by the agent using standard file tools
  • Computer use runs in sandbox by default, with allowlist approval for host access
  • Dangerous shell constructs (command substitution, redirects, subshells) are blocked
  • Browser automation uses semantic snapshots of ARIA trees, not screenshots
  • Semantic snapshots are 100x smaller than screenshots and provide structured interaction data
  • Simple, explainable architecture beats complex multi-agent spaghetti

Understanding Clawdbot’s architecture reveals why most custom AI agents fail: they start with parallelism, add complexity for complexity’s sake, and skip the boring reliability work.

The lesson is clear: serial by default, hybrid search, simple memory, explicit safety boundaries. These aren’t exciting architectural choices. They’re the ones that actually work.
