I took a look inside Clawdbot’s (aka Moltbot’s) architecture and how it handles agent execution, tool use, browser automation, and more. There are plenty of lessons here for AI engineers.
Learning how Clawdbot works under the hood gives you a better understanding of the system and its capabilities, and most importantly, of what it’s GOOD at and BAD at.
This started as personal curiosity about how Clawdbot handles its memory and how reliable it is. In this article I’ll walk through, at a surface level, how Clawdbot works.
What Clawdbot Actually Is
So everybody knows Clawdbot is a personal assistant you can run locally or through model APIs, and access as easily as from your phone. But what is it really?
At its core, Clawdbot is a TypeScript CLI application.
It’s not Python, Next.js, or a web app.
It’s a process that:
- Runs on your machine and exposes a gateway server to handle all channel connections (Telegram, WhatsApp, Slack, etc.)
- Makes calls to LLM APIs (Anthropic, OpenAI, local, etc.)
- Executes tools locally
- And does whatever you want on your computer
The Architecture
To explain the architecture more simply, here’s what happens when you message Clawdbot on a messenger:
1. Channel Adapter
A Channel Adapter takes your message and processes it (normalize, extract attachments). Different messengers and input streams have their dedicated adapters.
2. Gateway Server
The Gateway Server is the task/session coordinator. It takes your message and passes it to the right session. This is the heart of Clawdbot. It handles multiple overlapping requests.
To serialize operations, Clawdbot uses a lane-based command queue. Each session has its own dedicated lane, and low-risk parallelizable tasks (e.g., cron jobs) can run in parallel lanes.
This stands in contrast to async/await spaghetti: over-parallelization hurts reliability and breeds a swarm of debugging nightmares.
Default to Serial, go for Parallel explicitly.
If you’ve worked with agents, you’ve probably realized this already. It’s also the insight behind Cognition’s “Don’t Build Multi-Agents” blog post. A naive async setup per agent leaves you with a dump of interleaved garbage: logs become unreadable, and if agents share state, race conditions become a constant fear you have to account for in development.
A lane is an abstraction over queues where serialization is the default rather than an afterthought. As a developer, you write code naturally, and the queue handles the race conditions for you.
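A lane-based queue of this kind can be sketched in a few lines of TypeScript. This is an illustration of the pattern, not Clawdbot’s actual implementation: each lane is a promise chain, so tasks on the same lane run strictly in order while different lanes proceed concurrently.

```typescript
type Task<T> = () => Promise<T>;

class LaneQueue {
  // Each named lane keeps the tail of its promise chain.
  private lanes = new Map<string, Promise<unknown>>();

  // Enqueue a task on a lane. It starts only after every task previously
  // enqueued on that lane has settled.
  run<T>(lane: string, task: Task<T>): Promise<T> {
    const tail = this.lanes.get(lane) ?? Promise.resolve();
    const next = tail.then(task, task); // run even if a predecessor failed
    this.lanes.set(lane, next.catch(() => {})); // keep the chain alive
    return next;
  }
}
```

With this shape, `q.run("session:42", task)` serializes everything for that session, while a separate `"cron"` lane keeps running in parallel.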
3. Agent Runner
This is where the actual AI comes in. It figures out which model to use and picks an API key (if a key fails, that profile goes into cooldown and the next one is tried), and it falls back to a different model if the primary one fails.
The agent runner assembles the system prompt dynamically with available tools, skills, memory, and then adds the session history (from a .jsonl file).
The assembled context is then passed to the context-window guard, which makes sure there is enough context space. If the context is almost full, it either compacts the session (summarizes the history) or fails gracefully.
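The cooldown-and-fallback behaviour described above can be sketched roughly like this (the profile shape, names, and the 60-second cooldown are all assumptions, not Clawdbot’s real types):

```typescript
interface ModelProfile {
  name: string;
  cooldownUntil: number; // epoch ms; 0 = available
  call: (prompt: string) => Promise<string>;
}

// Try profiles in order; a profile that throws goes into cooldown and the
// next one is tried.
async function callWithFallback(
  profiles: ModelProfile[],
  prompt: string,
  cooldownMs = 60_000,
): Promise<string> {
  for (const p of profiles) {
    if (p.cooldownUntil > Date.now()) continue; // still cooling down
    try {
      return await p.call(prompt);
    } catch {
      p.cooldownUntil = Date.now() + cooldownMs; // mark and fall through
    }
  }
  throw new Error("all model profiles failed or are in cooldown");
}
```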
4. LLM API Call
The LLM call itself streams responses and sits behind an abstraction over the different providers. It can also request extended thinking if the model supports it.
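A provider abstraction of this sort usually boils down to one streaming interface with per-provider adapters behind it. The sketch below is illustrative; the interface name and the stub provider are invented, and a real adapter would wrap the Anthropic or OpenAI SDK’s streaming call:

```typescript
interface ChatProvider {
  // Yields text deltas as they arrive; `thinking` is best-effort and
  // ignored by providers without extended-thinking support.
  stream(prompt: string, opts?: { thinking?: boolean }): AsyncIterable<string>;
}

// Stub provider showing the shape: streams the prompt back word by word.
const echoProvider: ChatProvider = {
  async *stream(prompt) {
    for (const word of prompt.split(" ")) yield word + " ";
  },
};

// Consumers only ever see the interface, never the provider SDK.
async function collect(p: ChatProvider, prompt: string): Promise<string> {
  let out = "";
  for await (const delta of p.stream(prompt)) out += delta;
  return out.trimEnd();
}
```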
5. Agentic Loop
If the LLM returns a tool-call response, Clawdbot executes the tool locally and adds the result to the conversation. This repeats until the LLM responds with final text or hits the max-turn limit (default ~20).
This is also where Computer Use happens, which I’ll get to.
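The loop described above can be sketched like this (the message and reply shapes are invented for illustration):

```typescript
type Msg = { role: "user" | "assistant" | "tool"; content: string };
type ModelReply = { text?: string; tool?: { name: string; args: string } };

// Call the model; if it asks for a tool, run it, append the result, and
// call the model again, up to a max-turn cap.
async function agentLoop(
  callModel: (history: Msg[]) => Promise<ModelReply>,
  tools: Record<string, (args: string) => Promise<string>>,
  history: Msg[],
  maxTurns = 20,
): Promise<string> {
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callModel(history);
    if (reply.text !== undefined) return reply.text; // final answer
    if (!reply.tool) throw new Error("model returned neither text nor tool");
    const run = tools[reply.tool.name];
    const result = run ? await run(reply.tool.args) : "unknown tool";
    history.push({ role: "assistant", content: `tool:${reply.tool.name}` });
    history.push({ role: "tool", content: result });
  }
  return "(max turns reached)";
}
```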
6. Response Path
Pretty standard. Responses travel back to you through the channel. The session is also persisted as a simple JSONL file, with each line a JSON object recording a user message, tool call, result, or response. This is how Clawdbot remembers (session-based memory).
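Append-only JSONL persistence is simple enough to sketch directly. The entry shape here is an assumption; the real transcript records richer objects:

```typescript
import { appendFileSync, readFileSync } from "node:fs";

type Entry = { ts: number; role: string; content: string };

// One JSON object per line; appending never rewrites earlier history.
function appendEntry(path: string, entry: Entry): void {
  appendFileSync(path, JSON.stringify(entry) + "\n");
}

// Rebuild the session by parsing each non-empty line.
function loadSession(path: string): Entry[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Entry);
}
```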
How Clawdbot Remembers
Without a proper memory system, an AI assistant is no better than a goldfish. Clawdbot handles this through two systems:
- Session transcripts in JSONL as mentioned
- Memory files as markdown in MEMORY.md or the memory/ folder
For searching, it uses a hybrid of vector search and keyword matches. This captures the best of both worlds.
So searching for “authentication bug” finds both documents mentioning “auth issues” (semantic match) and documents containing the exact phrase (keyword match).
Vector search is backed by SQLite, and keyword search by FTS5, which is itself a SQLite extension. The embedding provider is configurable.
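To make the hybrid idea concrete, here is a toy in-memory version of the scoring, not Clawdbot’s actual SQLite/FTS5 setup: each chunk gets a keyword score and a cosine-similarity score, blended into a single ranking.

```typescript
type Chunk = { text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function hybridSearch(
  chunks: Chunk[],
  queryText: string,
  queryEmbedding: number[],
  alpha = 0.5, // blend weight: 1 = keyword only, 0 = vector only
): Chunk[] {
  const terms = queryText.toLowerCase().split(/\s+/);
  const score = (c: Chunk): number => {
    const hay = c.text.toLowerCase();
    // Fraction of query terms present verbatim (stand-in for FTS5).
    const kw = terms.filter((t) => hay.includes(t)).length / terms.length;
    return alpha * kw + (1 - alpha) * cosine(c.embedding, queryEmbedding);
  };
  return [...chunks].sort((x, y) => score(y) - score(x));
}
```

Because both signals are normalized to roughly [0, 1], a document that matches the exact phrase and one that is only semantically close can both surface near the top.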
It also benefits from smart syncing: a file watcher triggers a re-index whenever memory files change.
This markdown is generated by the agent itself using a standard ‘write’ file tool. There’s no special memory-write API. The agent simply writes to memory/*.md.
Once a new conversation starts, a hook grabs the previous conversation, and writes a summary in markdown.
Clawdbot’s memory system is surprisingly simple. No merging of memories, no monthly/weekly memory compressions.
This simplicity can be an advantage or a pitfall depending on your perspective, but I’m always in favor of explainable simplicity rather than complex spaghetti.
The memory persists forever, and old memories carry essentially the same weight as new ones, so there is effectively no forgetting curve.
Computer Use: How It Uses Your Machine
This is one of Clawdbot’s moats: you give it a computer and let it use it. So how does it use the computer?
Clawdbot gives the agent significant computer access, at your own risk. It uses an exec tool to run shell commands:
- Sandbox: the default, where commands run in a Docker container
- Host: directly on the host machine
- Remote: on remote devices
Aside from that Clawdbot also has:
- Filesystem tools (read, write, edit)
- Browser tool, which is Playwright-based with semantic snapshots
- Process management (process tool) for running long-lived background commands, killing processes, etc.
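The exec tool’s three targets can be pictured as argv builders. The sketch below only composes commands and executes nothing; the target shapes and the image name are invented:

```typescript
type ExecTarget =
  | { kind: "sandbox"; image: string } // Docker container (the default)
  | { kind: "host" }                   // directly on this machine
  | { kind: "remote"; host: string };  // another device over SSH

// Compose the argv that would be handed to a process spawner.
function buildArgv(target: ExecTarget, command: string): string[] {
  switch (target.kind) {
    case "sandbox":
      return ["docker", "run", "--rm", target.image, "sh", "-c", command];
    case "host":
      return ["sh", "-c", command];
    case "remote":
      return ["ssh", target.host, command];
  }
}
```

Keeping the routing in one pure function like this makes the risky part (actually spawning the process) a single, auditable choke point.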
The Safety Model (Or Lack Of?)
Similar to Claude Code, there is an allowlist for commands: the user is prompted to allow once, allow always, or deny.
Safe commands (such as jq, grep, cut, sort, uniq, head, tail, tr, wc) are pre-approved already.
Dangerous shell constructs (command substitution, redirects, subshells) are blocked by default.
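In the same spirit, a pre-approval check plus a blocklist for dangerous constructs can be sketched like this. The patterns are illustrative, not Clawdbot’s actual rule set; the pre-approved list is the one quoted above:

```typescript
// Constructs that should never reach the shell, checked in order.
const DANGEROUS: [RegExp, string][] = [
  [/\$\(/, "command substitution"],   // $(...)
  [/`/, "backtick substitution"],     // `...`
  [/[<>]/, "redirect"],               // > or <
  [/^\(|[|;&]\s*\(/, "subshell"],     // leading or chained (...)
];

function firstViolation(command: string): string | null {
  for (const [pattern, label] of DANGEROUS) {
    if (pattern.test(command)) return label;
  }
  return null;
}

const PREAPPROVED = new Set(["jq", "grep", "cut", "sort", "uniq", "head", "tail", "tr", "wc"]);

// A command needs user approval unless it is clean AND its first word is
// on the pre-approved list.
function needsApproval(command: string): boolean {
  if (firstViolation(command) !== null) return true;
  const first = command.trim().split(/\s+/)[0];
  return !PREAPPROVED.has(first);
}
```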
The safety model is very similar to what Claude Code ships with. The idea is to grant as much autonomy as the user allows.
Browser: Semantic Snapshots
The browser tool does not primarily use screenshots; it uses semantic snapshots instead: text-based representations of the page’s accessibility (ARIA) tree.
So instead of raw pixels, the agent sees a structured text outline of the page’s headings, links, buttons, and form fields.
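As a rough illustration (the exact format and the element references here are invented, not the tool’s real output), a snapshot of a login page might look like:

```
- heading "Sign in"
- textbox "Email" [ref=e2]
- textbox "Password" [ref=e3]
- button "Log in" [ref=e4]
- link "Forgot password?" [ref=e5]
```

The agent can then act on stable references (e.g., click e4) rather than guessing pixel coordinates.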
This has significant advantages. As you may have guessed, browsing websites is not inherently a visual task.
While a screenshot can weigh around 5 MB, a semantic snapshot is typically under 50 KB, at a fraction of the token cost of an image.
FAQ
Why use TypeScript instead of Python for an AI agent?
How does the lane-based queue prevent race conditions?
Why hybrid search instead of just vector search?
Is semantic snapshot browsing less capable than screenshot-based?
How does memory persist across conversations?
Conclusion
Key Takeaways
- Clawdbot is a TypeScript CLI, not a web app or Python script
- Lane-based command queues serialize operations by default, preventing race conditions
- The architecture follows “default to serial, go parallel explicitly”
- Memory uses hybrid search: SQLite vectors plus FTS5 keyword matching
- Memory files are plain markdown written by the agent using standard file tools
- Computer use runs in sandbox by default, with allowlist approval for host access
- Dangerous shell constructs (command substitution, redirects, subshells) are blocked
- Browser automation uses semantic snapshots of ARIA trees, not screenshots
- Semantic snapshots are 100x smaller than screenshots and provide structured interaction data
- Simple, explainable architecture beats complex multi-agent spaghetti
Understanding Clawdbot’s architecture reveals why most custom AI agents fail: they start with parallelism, add complexity for complexity’s sake, and skip the boring reliability work.
The lesson is clear: serial by default, hybrid search, simple memory, explicit safety boundaries. These aren’t exciting architectural choices. They’re the ones that actually work.