Inbox Auto-pilot — teaching an AI to triage my email

I get around 500 emails a week. Most are noise. The ones that need a reply — maybe seven out of five hundred — get buried under calendar invites, vendor pitches, SaaS renewal notices, and team threads where I'm CC'd for visibility.

I've tried filters, labels, priority inbox. None of it works when the signal is contextual. A calendar invite from an external partner matters if it's about a contract renewal — not if it's a recurring all-hands. A cold outreach subject line like "exploring collaboration" is almost always noise — unless it's from someone you already know. A "weekly report" is usually archivable — unless the numbers are off and it showed up in your inbox instead of the usual label.

Gmail can't make those calls. But an AI can — if you teach it.

Inbox Auto-pilot — inbox
╭─ Inbox (7) ──────────────╮
  Arjun Mehta     Re: Q1 integration timeline — need sign-off    5h
· Neha Krishnan   PCI-DSS v4 compliance — vendor assessment      21h
· Vikram Desai    Hiring update — Sr. Platform Engineer          1d
  Riya Sharma     Re: Mobile app release v3.2 — hotfix           1d
· Aditya Rao      Architecture review — event sourcing RFC       1d
· Maya Iyer       Re: SOC 2 audit — evidence collection          2d
· Karthik Nair    API rate limits — follow up from call          2d
Activity
Gateway connected
Inbox marked as read
Enter: open  a: approve  s: skip  x: archive  D: dispatch  L: learn  R: refresh  q: quit
🪺 Nest ready · sonnet-4-5 · 26.3k/200k (13%) · $0.34

The TUI cycles through: queue → thread → draft → learn. Every action is one keystroke.


The problem with email triage

The hard part isn't filtering spam. Gmail already does that. The hard part is the middle — the hundreds of emails that are legitimate but don't require your attention. Figma comment notifications. Notion page updates. Google Drive shares. Accepted/declined calendar invitations that are already in your calendar. Vendor follow-ups on a "gentle reminder" cadence. LinkedIn telling you someone viewed your profile.

Each one reasonable on its own. Together, they bury the handful of emails where someone is actually blocked on you.

Rules help. I have dozens. But rules are brittle — they match patterns, not intent. A noreply@ sender is usually noise, except when it's a compliance notification with a 30-day deadline. A subject matching "sign off for" is a routine release approval, except when the release contains a hotfix for a production crash. The exceptions are where you miss things.

What I needed was a system that could:

  1. Apply fast, cheap rules to the obvious stuff
  2. Use an LLM to classify the ambiguous middle
  3. Let me correct its mistakes quickly
  4. Learn from those corrections

Three-layer architecture

The filtering pipeline has three layers, each cheaper and faster than the next.

Layer 1   Gmail Queries    Server-side filters: labels, categories, time window   ~500 threads
Layer 2   Rule Engine      triage_rules.json: sender, domain, subject patterns    ~250 survive
Layer 3   LLM Classifier   Claude Sonnet: needs_reply · informational · noise     ~40 survive
Output    Queue            JSON files: 5-10 need a reply                          ~7 actionable

Each layer is cheaper and faster than the one after it

Layer 1 runs on Gmail's servers. It's just search queries — newer_than:14d, exclude promotions and social, check for unread in inbox. This is the coarsest filter but it's free and instant.
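Concretely, the Layer 1 pass boils down to a Gmail search along these lines — my reconstruction from the description above, since the post doesn't show the actual query:

```
in:inbox is:unread newer_than:14d -category:promotions -category:social
```

Because Gmail evaluates this server-side, threads that don't match are never downloaded at all.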

Layer 2 is a JSON file of patterns — the triage_rules.json that grows over time. Sender addresses, domains, subject line regexes, cold outreach patterns. This is where most of the filtering happens.

{
  "archiveSenderPatterns": [
    "noreply", "no-reply", "digest",
    "comments-.*@email.figma.com",
    "notify@mail.notion.so"
  ],
  "archiveSubjectPatterns": [
    "updated invitation:", "cancelled event:",
    "accepted:", "declined:",
    "subscription will renew",
    "new login to", "account deleted"
  ],
  "coldOutreachSubjectPatterns": [
    "exploring collaboration",
    "hire top tech talent",
    "gentle follow-up",
    "strategic partnership",
    "closed-door roundtable"
  ]
}

These patterns look obvious in hindsight. They weren't obvious upfront — they emerged from watching my own triage decisions over weeks.

Layer 3 is the expensive one — an LLM call per surviving email. I use Claude Sonnet because it's fast and cheap enough to run on 40 emails without thinking about cost. Each email gets classified as needs_reply (someone is waiting for me), informational (worth reading, no response needed), or noise (shouldn't have survived layers 1-2).

The LLM also writes a one-line reason for each classification. This is critical — it's what I see in the queue view, and it's how I judge whether the classification was right.
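The per-email call asks the model for a tiny structured answer — a label plus that one-line reason. The parsing side might look like this (the JSON response shape is my assumption; the post doesn't show the actual prompt):

```python
import json

VALID_LABELS = {"needs_reply", "informational", "noise"}

def parse_classification(raw: str) -> tuple[str, str]:
    """Parse the model's JSON reply; fall back to a safe default on garbage."""
    try:
        obj = json.loads(raw)
        label = obj.get("label", "")
        reason = obj.get("reason", "")
    except (json.JSONDecodeError, AttributeError):
        label, reason = "", ""
    if label not in VALID_LABELS:
        # When in doubt, surface the email rather than bury it.
        return "informational", "unparseable model output — defaulting to informational"
    return label, reason
```

The defensive default matters: a malformed response should degrade to "show me this email," never to silently archiving it.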


Why a TUI

I built a terminal UI because:

  1. Speed. I'm already in the terminal. No browser tab, no context switch. Open it, triage 10 emails in 2 minutes, close it.
  2. Keyboard-only. Every action is one key. No clicking, no scrolling through menus.
  3. It's a feedback device, not an inbox. I'm not reading email here. I'm reviewing the AI's classifications and correcting them. The UI is optimized for that loop.

The TUI is built with OpenTUI — a Bun-native terminal UI framework with a Zig renderer and Yoga flexbox layout. Screen refreshes are instant. Scrolling is smooth. It feels like a real application, not a curses hack.


The agentic shell

Here's where it gets interesting. Inbox Auto-pilot isn't a standalone app — it's a thin client that talks to an AI agent.

The agent runs on OpenClaw on my Mac Mini. It has access to my org's knowledge graph, people directory, communication history, and writing style. When I press g to generate a draft reply, the TUI doesn't call an LLM API directly. It sends a structured command to the agent through a local gateway HTTP API:

POST /v1/chat/completions
{
  "model": "openclaw:main",
  "user": "inbox-autopilot",
  "messages": [{
    "role": "user",
    "content": "[inbox-pilot] generate_draft\n{...}"
  }]
}

The agent receives this, reads the email, looks up the sender in the org directory, checks recent context, and drafts a reply in my voice. The draft isn't generic — it knows who the sender is, what we've discussed recently, and how I typically respond to this kind of email.
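From the TUI's side, every keystroke that needs the agent becomes one of these chat-completion payloads. Sketched in Python for brevity (the TUI itself is TypeScript; the envelope fields come from the request shown above, the payload contents are an assumption):

```python
import json

def build_command(command: str, payload: dict) -> dict:
    """Wrap a TUI command in the gateway's chat-completions envelope."""
    return {
        "model": "openclaw:main",
        "user": "inbox-autopilot",
        "messages": [{
            "role": "user",
            "content": f"[inbox-pilot] {command}\n{json.dumps(payload)}",
        }],
    }
```

The body is then POSTed to the gateway's /v1/chat/completions endpoint; the `[inbox-pilot]` prefix is what routes the message to the skill.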

g  generate   Draft a reply using org context and writing style
e  edit       "Make it shorter" — rewrite with a natural-language instruction
p  process    "Send a summary to Slack" — free-form action on the email
a  approve    Queue for dispatch
s  skip       No reply needed — logged for learn loop
x  archive    Archive in Gmail — trains the filter
D  dispatch   Send all approved replies in one batch
L  learn      Analyze decisions, propose rules, apply to filter

The key insight: the TUI is stateless and dumb. The agent is stateful and smart. The TUI handles rendering and keyboard input. The agent handles context, reasoning, and tool use. They communicate through a simple HTTP API.

The skill on the OpenClaw side is just a markdown file — no code. It describes how to handle each command (generate_draft, edit_draft, dispatch, process_email, learn) and the agent figures out how to execute them using its existing tools: Gmail CLI, Slack, memory, knowledge graph. The TUI doesn't need to know about any of that.

How the agent reasons

When the agent opens an email, it doesn't just show the text — it explains why this email surfaced. It looks up the sender, pulls recent context, and writes a one-line reason. When it generates a draft, it adds notes explaining its choices.

Inbox Auto-pilot — loading
╭─ Exploring Product Designer Opportunities ─╮
From: Mohit Singh <mohitkr0231@gmail.com>
Date: 2026-03-19 13:53 · 4 messages
Labels: Attention, IMPORTANT, CATEGORY_PERSONAL, INBOX
Reason:
Unsolicited job application for a Product Designer role with 4 messages in thread — given Saurabh's oversight of design hiring, this warrants a response or routing decision, especially with the Attention/IMPORTANT labels and multi-message thread suggesting follow-ups.
── Email Body ──
Hi Saurabh,

I'm Mohit Singh, a product designer with 8+ years of
experience. I came across Plum and was impressed by the
product and mission.

I'd love to explore any open design roles. I've attached
my portfolio and LinkedIn for your reference.

Looking forward to hearing from you.
── Draft Reply ──
Hi Mohit,

Thanks for reaching out. I'm forwarding your details to
our design team — they'll get in touch if there's a fit.

Best,
Saurabh
Notes: Cold inbound job inquiry. Polite redirect — Saurabh
shouldn't be handling hiring pipeline directly. Could
forward to the design lead or HR/talent team.
Activity
Gateway connected
Inbox marked as read
Loading email...
[Esc]back [a]pprove [s]kip [x]archive [e]dit [g]enerate [p]rocess
🪺 Nest ready · sonnet-4-5 · 27.8k/200k (14%) · $0.55

The agent loads the email, reasons about the sender and context, drafts a reply, and explains its choices.

The activity log on the right tracks everything the agent does across the session — emails loaded, senders looked up, drafts generated, items triaged. You can see the rhythm: load, reason, decide, move on.


The learn loop

This is the part I'm most proud of.

Press L from the queue. Inbox Auto-pilot sends your entire action history — every approve, skip, and archive decision — to the agent and asks: "what patterns do you see?"

Inbox Auto-pilot — learn · analyzing
╭─ Learnings & Rule Suggestions ─╮
Analyzing 198 triage decisions...
Activity
Gateway connected
Analyzing patterns...
[Esc]back [a]pply rules & refresh [j/k]scroll
🪺 Nest ready · sonnet-4-5 · 32.1k/200k (16%) · $0.71

The learn loop: analyze decisions → propose rules → apply → pipeline re-runs → queue shrinks.

The agent produces a structured report:

Auto-archive candidates. Patterns where emails were consistently skipped or archived. Calendar accepts/declines (~25 emails), vendor cold outreach (~15), SaaS notifications, LinkedIn alerts, bulk digests. Each one with a specific rule — sender pattern, subject regex, domain match — ready to be added to triage_rules.json.

Misclassifications. Emails that were skipped but probably shouldn't have been — a tax notice with a 30-day deadline, an overdue vendor invoice, an employee resignation. The system watches for its own mistakes.

Priority adjustments. Tax/legal notices should auto-escalate. Vendor payment failures should flag finance. Employee resignations should surface as informational at minimum.

The bottom line. The 9 new rules would eliminate ~120 of 198 emails (60%) from the queue — caught at Layer 2 instead of Layer 3. For free, instantly, forever.

Press a to apply. The agent writes the patterns into triage_rules.json, re-runs the pipeline, and the queue refreshes smaller.
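The simplest version of that analysis is just counting: a sender that appeared several times and was never approved is a rule candidate. A sketch of the idea — the threshold, field names, and function are all mine, not the actual implementation:

```python
from collections import defaultdict

def propose_sender_rules(decisions: list[dict], min_seen: int = 3) -> list[str]:
    """decisions: [{"sender": ..., "action": "approve" | "skip" | "archive"}, ...]
    Propose auto-archive patterns for senders that were never approved."""
    seen = defaultdict(int)
    dismissed = defaultdict(int)
    for d in decisions:
        seen[d["sender"]] += 1
        if d["action"] in ("skip", "archive"):
            dismissed[d["sender"]] += 1
    # Only promote a rule when the evidence is consistent: enough
    # sightings, and every single one was skipped or archived.
    return sorted(s for s in seen if seen[s] >= min_seen and dismissed[s] == seen[s])
```

The real learn loop goes further — subject regexes, domain matches, misclassification checks — but the shape is the same: decisions in, deterministic rules out.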

Session 1:  500 emails → 40 survive → 8 need reply
  You skip 15, archive 12, approve 5

Session 2:  Press L → agent finds patterns → adds 9 rules
  500 emails → 22 survive → 7 need reply

Session 3:  Fewer corrections needed
  500 emails → 16 survive → 6 need reply

The filter gets better each time you use it. Not because I'm writing rules — because I'm making decisions, and the system is watching.


The triage pipeline

The Python pipeline runs on a cron — four times a day. It queries Gmail via the gog CLI, classifies each thread through the three layers, writes two JSON queue files (pending-reply-queue.json and informational-queue.json) to a vault directory, and marks everything as read.

The pipeline never deletes emails. It reads, classifies, and marks as read. The worst failure mode is a missed email, not a lost one. If the LLM call fails (rate limit, timeout), the email falls back to rule-based classification and I get a notification.
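That failure handling is worth spelling out: classification is wrapped so a rate limit or timeout degrades to the rule verdict instead of dropping the email. A sketch of the structure, with function names that are mine:

```python
def classify_with_fallback(email, llm_classify, rule_classify, notify):
    """Try the LLM; on any failure, fall back to rules and notify."""
    try:
        return llm_classify(email)
    except Exception as exc:  # rate limit, timeout, malformed response
        notify(f"LLM classification failed ({exc}); used rule fallback")
        # The email still gets a verdict — it's just the coarser,
        # rule-based one, so the worst case is a missed reply, not a lost email.
        return rule_classify(email)
```

Keeping the fallback path boring and deterministic is what makes the "worst failure mode is a missed email" guarantee hold.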


What I learned

More context makes AI worse, not better. Early versions sent the agent entire email threads — headers, forwarded chains, signature blocks, legal disclaimers. Classification got worse. The LLM would fixate on irrelevant details. Now I send just the latest message body and metadata. Accuracy went up. Cost went down.

Rules beat LLMs for the easy stuff. The temptation is to throw everything at the LLM. But a regex that matches noreply@ is infinitely faster, cheaper, and more reliable than an LLM call. The LLM should only see the emails that rules can't decide on. The learn loop bridges the two — it watches the LLM's work and extracts deterministic rules from it.

The learning loop is the whole product. Without it, this is just another email filter. With it, the system gets better every day. The first week I was correcting 30% of classifications. After a month, it's under 5%. The rules file grew from 20 patterns to over 100, all from learn sessions.

A TUI is the right form factor for triage. Speed matters more than features. I don't need rich text rendering or inline images. I need to see the sender, subject, reason, and make a decision. One keystroke. Next.


The stack

  • TUI: TypeScript, OpenTUI — Bun, Zig renderer, Yoga layout
  • Agent: OpenClaw on a Mac Mini — gateway HTTP API, skill system
  • Pipeline: Python, gog CLI for Gmail, Claude Sonnet for classification
  • Data: JSON queue files in a local vault
  • Terminal: Ghostty

~1,200 lines of TypeScript for the TUI. ~400 lines of Python for the pipeline. A markdown file for the skill. No database. No web server. No deployment.


What's next

More sources. Slack messages, calendar invites, PR reviews — anything that competes for attention. The architecture generalizes: pipeline filters → queue → TUI → learn → better filters.

Autonomy tiers. Right now, every reply needs my approval. But for certain patterns — "yes, let's schedule that" or "thanks, noted" — the agent could draft and send without asking. The learn loop already identifies these patterns. The missing piece is confidence calibration.

For now, it works. I open it once or twice a day, spend five minutes triaging, and close it. The inbox stays at zero. The filter gets better. The agent gets smarter.

That's the whole point — not to automate email, but to build a feedback loop between human judgment and machine filtering. The human stays in the loop. The machine does the boring part.