Teaching AI Agents to See and Touch: A macOS GUI Automation Bridge
AI coding agents can edit files and run commands, but they can't click buttons. I built a 1,500-line Swift MCP server that gives them eyes (OCR + Accessibility API) and hands (CGEvent synthesis) — with a focus lock that refuses to act blind.
The Gap
AI coding agents are good at files. They read code, write diffs, run terminal commands. But ask one to click "Export" in a video editing app, and it's stuck. The agent can't see the screen. It can't move the mouse. It doesn't know if the dialog that was open three seconds ago is still there.
I hit this wall while batch-exporting subtitles from CapCut (剪映). Forty audio files needed SRT export. Each one required: import → add to timeline → recognize speech → wait → export → set filename → choose folder → confirm. Manually, this takes a full afternoon. An AI agent could do it in minutes — if it could interact with the GUI.
There are existing automation tools. AppleScript is slow and brittle. Hammerspoon requires Lua scripting. Neither speaks MCP, and neither solves the real problem: the agent doesn't know what state the application is in before it acts.
What MacAutoBridge Is
A Swift MCP server that runs locally and exposes 20 tools over stdio JSON-RPC. Zero external dependencies — pure Apple frameworks (Accessibility API, Vision, ScreenCaptureKit, CGEvent).
The core principle: observe before you act, verify after you act. Every write operation (click, type, drag) starts by checking that the correct app and window have focus. If focus is lost mid-operation, the action aborts immediately. No blind typing into the wrong window.
```
AI Agent (Claude Code / Codex)
        │ JSON-RPC over stdio
        ▼
MacAutoBridge (Swift process)
 ├── Perception:  AX tree + OCR + display topology
 ├── Action:      CGEvent mouse/keyboard + focus lock
 ├── Transaction: observe → judge → act → verify
 └── Facade:      high-level composite operations
```
GitHub: RuoqiJin/mac-auto-bridge
Five-Pillar Architecture
Perception (read-only)
Three sensors, each covering a different gap:
- Accessibility API — reads the UI element tree. Gives you button labels, text field values, window titles. Fast and structured, but some apps (CapCut's export dialog) return incomplete trees.
- Vision OCR — captures the window via ScreenCaptureKit and runs Apple's on-device text recognition. Finds any visible text with screen-global coordinates. Slower but universal.
- HybridLocator — tries AX first, falls back to OCR automatically. The agent doesn't need to know which method worked.
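To make the fallback concrete, here is a minimal sketch of an AX-first lookup with an OCR escape hatch, assuming you already have the target app's pid. The names (axCenter, ocrCenter, locate) are illustrative, not the project's actual API, and the OCR path is stubbed out since it appears again further down.

```swift
import ApplicationServices
import CoreGraphics

// Walks the AX tree of the app with the given pid and returns the screen
// center of the first element whose title matches. Illustrative names.
func axCenter(ofElementTitled title: String, pid: pid_t) -> CGPoint? {
    let app = AXUIElementCreateApplication(pid)

    func search(_ element: AXUIElement, depth: Int = 0) -> CGPoint? {
        guard depth < 25 else { return nil }            // avoid pathological trees

        var titleRef: CFTypeRef?
        if AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &titleRef) == .success,
           let t = titleRef as? String, t == title,
           let frame = axFrame(of: element) {
            return CGPoint(x: frame.midX, y: frame.midY)
        }

        var childrenRef: CFTypeRef?
        if AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString, &childrenRef) == .success {
            for child in (childrenRef as? [AXUIElement]) ?? [] {
                if let hit = search(child, depth: depth + 1) { return hit }
            }
        }
        return nil
    }
    return search(app)
}

// Reads an element's screen frame from its kAXPosition / kAXSize attributes.
func axFrame(of element: AXUIElement) -> CGRect? {
    var posRef: CFTypeRef?, sizeRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, kAXPositionAttribute as CFString, &posRef) == .success,
          AXUIElementCopyAttributeValue(element, kAXSizeAttribute as CFString, &sizeRef) == .success
    else { return nil }
    var origin = CGPoint.zero
    var size = CGSize.zero
    guard AXValueGetValue(posRef as! AXValue, .cgPoint, &origin),
          AXValueGetValue(sizeRef as! AXValue, .cgSize, &size) else { return nil }
    return CGRect(origin: origin, size: size)
}

// Stand-in for the ScreenCaptureKit + Vision path (sketched further down).
func ocrCenter(of text: String, pid: pid_t) -> CGPoint? { nil }

// The hybrid lookup: trust AX when it answers, otherwise fall back to OCR.
func locate(_ text: String, pid: pid_t) -> CGPoint? {
    axCenter(ofElementTitled: text, pid: pid) ?? ocrCenter(of: text, pid: pid)
}
```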
Multi-display is handled correctly. Coordinates are always screen-global, with Retina scaling factored in. Negative Y values for monitors above the primary display work as expected.
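As an illustration of that mapping, going from a Vision observation back to a global click point is a single flip and scale, assuming the window's bounds were read from CGWindowList, which reports frames in points in the same top-left-origin global space CGEvent uses. A sketch, not the project's code:

```swift
import Vision
import CoreGraphics

// Maps a Vision text observation (normalized, bottom-left origin, relative to
// the captured window image) to a screen-global point in the top-left-origin
// space that CGEvent expects. Normalized coordinates make Retina scale a no-op.
func screenPoint(for observation: VNRecognizedTextObservation,
                 windowBounds: CGRect) -> CGPoint {
    let box = observation.boundingBox                         // 0…1 in both axes
    let x = windowBounds.origin.x + box.midX * windowBounds.width
    let y = windowBounds.origin.y + (1 - box.midY) * windowBounds.height
    return CGPoint(x: x, y: y)   // stays valid for displays above the primary (negative y)
}
```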
Action (write-only)
CGEvent synthesis for mouse clicks, drags, scrolling, and keyboard input. Every action method calls focus.verify() before executing. The FocusManager tracks which app should be in front and uses a two-strategy activation sequence:
1. NSRunningApplication.activate() (standard)
2. AppleScript: tell application id "..." to activate (forceful fallback)
This handles the common case where Chrome or another app fights for focus.
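A minimal sketch of that two-strategy sequence, with illustrative names rather than the project's actual FocusManager API:

```swift
import AppKit

// Tries the standard AppKit activation first, then AppleScript as a forceful
// fallback, verifying against the frontmost application after each attempt.
func bringToFront(bundleID: String) -> Bool {
    func isFrontmost() -> Bool {
        NSWorkspace.shared.frontmostApplication?.bundleIdentifier == bundleID
    }
    guard let app = NSRunningApplication
        .runningApplications(withBundleIdentifier: bundleID).first else { return false }

    // Strategy 1: standard activation.
    _ = app.activate(options: [.activateIgnoringOtherApps])
    usleep(200_000)                       // activation is not instantaneous
    if isFrontmost() { return true }

    // Strategy 2: AppleScript, which pushes harder when another app resists.
    var error: NSDictionary?
    let source = "tell application id \"\(bundleID)\" to activate"
    _ = NSAppleScript(source: source)?.executeAndReturnError(&error)
    usleep(200_000)
    return isFrontmost()
}
```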
Transaction
Multi-step action sequences with verification conditions. You can define: "click this button, then wait until this text appears on screen, then type this path." If any verification fails, the transaction aborts. No half-completed operations.
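A hypothetical shape for such a transaction, just to show the act-then-verify contract; the types and names here are illustrative, not the project's:

```swift
import Foundation

// One step: an action plus the condition that must hold before moving on.
struct Step {
    let name: String
    let act: () throws -> Void
    let verify: () -> Bool          // e.g. "this text is now visible on screen"
}

enum TransactionError: Error { case verificationFailed(step: String) }

// Runs steps in order and aborts on the first verification that never passes,
// so a failure can't leave a half-completed operation behind.
func run(_ steps: [Step],
         pollEvery interval: TimeInterval = 0.5,
         timeout: TimeInterval = 10) throws {
    for step in steps {
        try step.act()
        let deadline = Date().addingTimeInterval(timeout)
        while !step.verify() {
            guard Date() < deadline else {
                throw TransactionError.verificationFailed(step: step.name)
            }
            Thread.sleep(forTimeInterval: interval)
        }
    }
}
```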
Facade
High-level convenience tools that combine multiple steps:
- snapshot — returns window list + AX tree + optional OCR in one call (replaces three separate tool calls)
- goto_folder — sends Cmd+Shift+G, types a path, presses Enter, verifies arrival (replaces three tool calls)
- click_text — captures the screen, finds text via OCR, clicks its center (replaces two tool calls)
These cut agent tool calls by roughly 60% in file dialog workflows.
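As an example of what one of these composites does under the hood, here is a rough sketch of the goto_folder keystroke sequence built on CGEvent; the arrival check is omitted, and this is an illustration rather than the project's implementation:

```swift
import Foundation
import CoreGraphics
import Carbon.HIToolbox   // kVK_* virtual key codes

// Cmd+Shift+G opens the "Go to Folder" sheet in a file dialog, then the path
// is typed as a Unicode string and confirmed with Return.
func gotoFolder(_ path: String) {
    let src = CGEventSource(stateID: .combinedSessionState)

    func press(_ key: CGKeyCode, flags: CGEventFlags = []) {
        for down in [true, false] {
            let e = CGEvent(keyboardEventSource: src, virtualKey: key, keyDown: down)
            e?.flags = flags
            e?.post(tap: .cghidEventTap)
        }
    }

    press(CGKeyCode(kVK_ANSI_G), flags: [.maskCommand, .maskShift])
    usleep(300_000)                                   // let the sheet appear

    // Typing via keyboardSetUnicodeString sidesteps per-key layout mapping.
    let chars = Array(path.utf16)
    for down in [true, false] {
        let e = CGEvent(keyboardEventSource: src, virtualKey: 0, keyDown: down)
        e?.keyboardSetUnicodeString(stringLength: chars.count, unicodeString: chars)
        e?.post(tap: .cghidEventTap)
    }
    usleep(300_000)

    press(CGKeyCode(kVK_Return))                      // confirm the path
}
```

The real tool follows this with a verification that the dialog actually landed in the target folder, in the observe-act-verify spirit described above.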
The 20 Tools
| Category | Tools |
|----------|-------|
| Perception | snapshot · list_windows · list_displays · ax_snapshot · capture_window · capture_app · find_text_on_screen |
| Action | focus_app · click · click_text · right_click · drag · type_text · type_in_focused_field · scroll · press_key |
| Transaction | wait_until · focus_and_assert · goto_folder |
| Diagnostic | diagnose |
What I Learned Building This
Focus is the hardest problem
Not OCR accuracy. Not coordinate mapping. Focus. On macOS, any application can steal focus at any time. A notification pops up. The user moves their mouse. Spotlight activates. If your automation tool types "rm -rf" into Terminal when it thinks it's typing into a filename field, you have a very bad day.
The focus lock pattern — verify before every keystroke, abort on mismatch — is the single most important safety feature. It's also the most annoying to implement because NSWorkspace.shared.frontmostApplication has latency, and activation isn't instantaneous.
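In practice that means every write path waits for the focus check to pass, with a deadline, instead of trusting a single read. A sketch (the name waitForFrontmost is illustrative):

```swift
import AppKit

// Polls the frontmost application until it matches the expected bundle id or
// the deadline passes. Callers abort instead of typing into the wrong window.
func waitForFrontmost(bundleID: String, timeout: TimeInterval = 2.0) -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while Date() < deadline {
        if NSWorkspace.shared.frontmostApplication?.bundleIdentifier == bundleID {
            return true
        }
        Thread.sleep(forTimeInterval: 0.05)   // frontmostApplication lags activation
    }
    return false
}
```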
AX trees are unreliable for non-native apps
CapCut's export dialog returns an AX tree where the focused window title is nil. System save/open panels sometimes have incomplete element hierarchies. Electron apps expose minimal accessibility information.
The fix is the hybrid approach: trust AX when it works, fall back to OCR when it doesn't, and use CGWindowList as a third source for window titles. Three data sources covering each other's blind spots.
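For instance, the CGWindowList source amounts to a few lines; note that window names of other processes only show up once the Screen Recording permission has been granted:

```swift
import CoreGraphics

// Returns the on-screen window titles owned by a given process, as reported
// by the window server rather than by the app's accessibility tree.
func windowTitles(forPID pid: pid_t) -> [String] {
    let options: CGWindowListOption = [.optionOnScreenOnly, .excludeDesktopElements]
    let info = CGWindowListCopyWindowInfo(options, kCGNullWindowID) as? [[String: Any]] ?? []
    return info
        .filter { ($0[kCGWindowOwnerPID as String] as? Int) == Int(pid) }
        .compactMap { $0[kCGWindowName as String] as? String }
        .filter { !$0.isEmpty }
}
```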
OCR speed vs. accuracy is a real tradeoff
Apple Vision's .accurate mode with language correction on a complex UI (video editor timeline, 2880×2374 pixels) can take over two minutes. .fast mode returns in seconds but misses some button labels.
The solution: use .accurate for action tools (where missing a button label means clicking the wrong thing) and .fast for observation tools (where speed matters more than completeness).
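In Vision terms the whole tradeoff is two properties on the request. A sketch, assuming a window capture has already produced a CGImage:

```swift
import Vision
import CoreGraphics

// .accurate + language correction for action tools (don't click the wrong
// thing); .fast without correction for observation tools (answer quickly).
func recognizeText(in image: CGImage, forAction: Bool) throws -> [VNRecognizedTextObservation] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = forAction ? .accurate : .fast
    request.usesLanguageCorrection = forAction

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
    return request.results ?? []
}
```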
MCP servers are surprisingly easy to build
The stdio transport is just newline-delimited JSON-RPC. Read a line from stdin, parse it, dispatch to a handler, write the response to stdout. The entire MCP server implementation is about 100 lines of Swift. The other 1,400 lines are the actual automation logic.
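A sketch of that loop, with a stubbed dispatcher standing in for the real tool handlers:

```swift
import Foundation

// Placeholder for the tool handlers (snapshot, click, type_text, …).
func dispatch(method: String, params: Any?) -> Any {
    ["ok": true, "method": method]
}

// Newline-delimited JSON-RPC over stdio: one request per line in, one
// response per line out. Notifications (no "id") get no response.
while let line = readLine(strippingNewline: true) {
    if line.isEmpty { continue }
    guard let data = line.data(using: .utf8),
          let request = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
          let method = request["method"] as? String else { continue }
    guard let id = request["id"] else { continue }

    let response: [String: Any] = [
        "jsonrpc": "2.0",
        "id": id,
        "result": dispatch(method: method, params: request["params"])
    ]
    if let out = try? JSONSerialization.data(withJSONObject: response),
       let text = String(data: out, encoding: .utf8) {
        print(text)
        fflush(stdout)                 // stdout is block-buffered when piped
    }
}
```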
The hardest part was discovering that Claude Code reads MCP server config from ~/.claude.json, not ~/.claude/settings.json. That cost me three debugging sessions.
Running It
```
git clone https://github.com/RuoqiJin/mac-auto-bridge.git
cd mac-auto-bridge
swift build
```
Add to ~/.claude.json:
```json
{
  "mcpServers": {
    "mac-auto-bridge": {
      "type": "stdio",
      "command": "/path/to/mac-auto-bridge/.build/debug/MacAutoBridge",
      "args": []
    }
  }
}
```
Grant Accessibility and Screen Recording permissions in System Settings → Privacy & Security.
Restart Claude Code. The 20 tools appear automatically.
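If the tools load but come back empty, a missing permission is the most common cause. A quick standalone check might look like this sketch (it is not part of the project's CLI):

```swift
import ApplicationServices
import CoreGraphics

// Reports whether the two permissions the bridge depends on are in place.
// Without them, AX reads fail and window captures return nothing useful.
func checkPermissions() {
    let axTrusted = AXIsProcessTrusted()                  // Accessibility
    let screenOK = CGPreflightScreenCaptureAccess()       // Screen Recording (macOS 10.15+)
    print("Accessibility:    \(axTrusted ? "granted" : "missing")")
    print("Screen Recording: \(screenOK ? "granted" : "missing")")
}
```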
What's Next
The bridge works. I've used it to batch-export 39 SRT files from CapCut in a single agent session. The agent imported audio, triggered speech recognition, waited for completion, navigated to the output folder, and exported — all through MCP tool calls.
What's missing:
- App-specific adapters — state machines for common workflows (CapCut export, Finder file operations, Safari form filling)
- Visual anchoring — template matching for icons and images, not just text
- Workflow recording — watch a human do it once, replay the sequence with variation handling
The code is MIT licensed. If you're building AI agents that need to interact with macOS GUI applications, this is the foundation.