by Sam

Building Token-Efficient AI Applications: Lessons from Screen Capture

AI APIs charge by the token. When your app processes thousands of screenshots daily, efficiency isn't optional. Here's how we cut token usage by roughly 100x without losing accuracy.

When you’re building an AI-powered application that processes screen activity, you face a fundamental tension: more context means better results, but AI APIs charge by the token. Send everything and you’ll go broke. Send too little and your AI is useless.

Stubble processes thousands of screenshots per day per user. We had to get creative about token efficiency. Here’s what we learned.

The Naive Approach Doesn’t Scale

The obvious implementation: capture screenshots, send them to a vision model, get back descriptions. Simple, right?

Here’s the math problem. A single screenshot encoded for a vision API consumes thousands of tokens. If you capture one screenshot per minute during an 8-hour workday, that’s 480 screenshots. At roughly 3,000 tokens each, that’s about 1.4 million tokens per user per day, and more if you capture more often.

At current API pricing, that’s financially unsustainable for a consumer product. And it’s also slow—vision model inference takes time, and processing hundreds of images sequentially creates unacceptable latency.

We needed a different approach.

Principle 1: Extract Locally, Send Summaries

The single biggest optimization: don’t send images to the AI at all.

Modern operating systems include capable OCR engines. Apple’s Vision framework can extract text from screenshots entirely on-device, with no network calls. The text is available in milliseconds.

Instead of sending a 3000-token image, we send 200 tokens of extracted text. That’s a 15x reduction right there.

But raw OCR text is noisy. It includes UI chrome, button labels, menu items—a lot of low-signal content. So we built a local extraction pipeline that pulls out high-value signals:

  • URLs — Collapsed to domain + path for deduplication
  • File paths — Truncated to the meaningful parts
  • Code symbols — Function names, class definitions, imports
  • Document titles — Heuristically identified from content
  • Terminal commands — Extracted from shell prompts

Each category has strict count limits and deduplication. The result is a structured digest of ~2-3K characters that captures what matters without the noise.
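To make the extraction step concrete, here is a minimal sketch of what such a pipeline could look like. It covers only three of the five categories (URLs, terminal commands, code symbols). The regular expressions, the per-category cap of 10, and the names build_digest, collapse_url, and dedupe_and_cap are all assumptions for illustration, not the actual implementation.

```python
import re
from urllib.parse import urlsplit

# Hypothetical cap: the article says limits exist but does not give values.
MAX_PER_CATEGORY = 10

URL_RE = re.compile(r"https?://[^\s<>)]+")
# Assumes shell prompt lines look like "$ git status" or "% ls".
COMMAND_RE = re.compile(r"^\s*[$%]\s+(.+)$", re.MULTILINE)
# Python-style definitions and imports only, purely as an illustration.
SYMBOL_RE = re.compile(
    r"^\s*(?:def|class|import|from)\s+([A-Za-z_][\w.]*)", re.MULTILINE
)


def dedupe_and_cap(items, limit=MAX_PER_CATEGORY):
    """Keep first occurrences in order, stopping at `limit` items."""
    kept = []
    for item in items:
        if item not in kept:
            kept.append(item)
        if len(kept) == limit:
            break
    return kept


def collapse_url(url):
    """Reduce a URL to domain + path, dropping scheme, query, and fragment."""
    parts = urlsplit(url)
    return parts.netloc + parts.path.rstrip("/")


def build_digest(ocr_text):
    """Return a structured digest of high-value signals found in OCR text."""
    return {
        "urls": dedupe_and_cap(collapse_url(u) for u in URL_RE.findall(ocr_text)),
        "commands": dedupe_and_cap(c.strip() for c in COMMAND_RE.findall(ocr_text)),
        "symbols": dedupe_and_cap(SYMBOL_RE.findall(ocr_text)),
    }
```

Collapsing URLs before deduplicating means two links that differ only in query string or fragment count once, which is where much of the saving comes from on browser-heavy screenshots.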

Principle 2: Aggregate Before You Send

Screen activity is inherently fragmented. You click between apps constantly. A raw activity log might show hundreds of app switches per day.

But the AI doesn’t need to know about every switch. It needs to understand work sessions. So before building our prompt, we aggregate:

Block formation: Consecutive activities in the same app become a single block. Five minutes in VS Code is one block, not 300 separate events.

Short block merging: Blocks under 60 seconds that occur near other blocks of the same app get merged. This handles the “quick check email, back to coding” pattern without creating micro-tasks.

Minimum duration filtering: Blocks under 5 seconds are dropped entirely. Accidental clicks and brief pauses aren’t meaningful work.

This aggregation typically reduces hundreds of raw events to 20-30 meaningful blocks, which is roughly a 10x reduction in prompt size.
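The three aggregation rules can be sketched as follows. The 60-second and 5-second thresholds come from the text; everything else is an assumption. In particular, the event format (timestamp, app), the Block type, and the reading of "short block merging" used here (a short block is absorbed when the blocks on both sides of it belong to the same app) are one plausible interpretation, not a description of the production code.

```python
from dataclasses import dataclass

MERGE_UNDER_SECONDS = 60  # from the text: short blocks are merge candidates
DROP_UNDER_SECONDS = 5    # from the text: blocks this short are dropped


@dataclass
class Block:
    app: str
    start: float  # seconds since the start of the day
    end: float

    @property
    def duration(self):
        return self.end - self.start


def form_blocks(events):
    """Collapse time-ordered (timestamp, app) events into per-app blocks."""
    blocks = []
    for ts, app in events:
        if blocks:
            blocks[-1].end = ts  # the open block extends to this event
        if not blocks or blocks[-1].app != app:
            blocks.append(Block(app, ts, ts))
    return blocks


def coalesce(blocks):
    """Join adjacent blocks that share an app."""
    out = []
    for b in blocks:
        if out and out[-1].app == b.app:
            out[-1].end = b.end
        else:
            out.append(Block(b.app, b.start, b.end))
    return out


def aggregate(events):
    blocks = form_blocks(events)
    # Minimum duration filtering, then rejoin neighbours left adjacent.
    blocks = coalesce([b for b in blocks if b.duration >= DROP_UNDER_SECONDS])
    # Short block merging (assumed reading): absorb a short block whose
    # neighbours on both sides are the same app, and join those neighbours.
    merged = []
    i = 0
    while i < len(blocks):
        b = blocks[i]
        if (merged and i + 1 < len(blocks)
                and b.duration < MERGE_UNDER_SECONDS
                and merged[-1].app == blocks[i + 1].app):
            merged[-1].end = blocks[i + 1].end
            i += 2
            continue
        merged.append(b)
        i += 1
    return merged
```

With this sketch, a day that looks like "VS Code, 30 seconds of Mail, VS Code, a 2-second Slack flicker, VS Code" collapses to a single VS Code block, while a Mail visit of a few minutes survives as its own block.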

Principle 3: Limit Samples, Not Coverage

You can’t include the full text of every screenshot. But you also can’t just pick random samples—you might miss important context.

Our approach: include limited samples, but choose them strategically.

For each activity block, we include up to 5 OCR samples. Across the entire day, we cap at 20 total samples. Each sample is truncated to 800 characters max.

The samples are deduplicated by prefix matching—if two screenshots have nearly identical text (common when you’re reading a long document), we only include one.

This gives the AI enough context to understand each block without drowning it in repetitive content.
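A sketch of this sampling policy is below. The limits of 5 per block, 20 per day, and 800 characters come from the text. The 100-character prefix used for duplicate detection and the round-robin order (every block gets a first sample before any block gets a second) are assumptions added here so that the daily cap does not get used up by the first few blocks.

```python
MAX_PER_BLOCK = 5    # from the text
MAX_TOTAL = 20       # from the text
MAX_CHARS = 800      # from the text
PREFIX_CHARS = 100   # hypothetical: leading text compared to detect duplicates


def select_samples(blocks_ocr):
    """Pick OCR samples for each block.

    blocks_ocr: one list of OCR strings per activity block.
    Returns one list of selected (truncated) samples per block.
    """
    # Step 1: per block, drop near-duplicates by prefix and apply the block cap.
    candidates = []
    for texts in blocks_ocr:
        seen, unique = set(), []
        for text in texts:
            prefix = " ".join(text.split())[:PREFIX_CHARS]  # normalise whitespace
            if prefix and prefix not in seen:
                seen.add(prefix)
                unique.append(text[:MAX_CHARS])
        candidates.append(unique[:MAX_PER_BLOCK])

    # Step 2: hand out the daily budget round-robin across blocks.
    selected = [[] for _ in blocks_ocr]
    total = 0
    for rank in range(MAX_PER_BLOCK):
        for i, unique in enumerate(candidates):
            if total == MAX_TOTAL:
                return selected
            if rank < len(unique):
                selected[i].append(unique[rank])
                total += 1
    return selected
```

Note that with more than 20 blocks in a day, even round-robin leaves the later blocks without a sample, so the order in which blocks are visited is itself a design choice.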

Principle 4: Sanitize Before Sending

This isn’t just about privacy—it’s about token efficiency too.

API keys, JWTs, AWS credentials, private keys: these are long strings that consume tokens without adding value for task understanding. Our sanitizer strips them before the prompt is built.

Similarly, we normalize control characters, escape JSON properly, and clean up formatting artifacts. Malformed text doesn’t just waste tokens—it can confuse the model.
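As an illustration, a sanitizer along these lines can be built from a handful of regular expressions. The patterns below rely on well-known shapes (PEM private key headers, the "eyJ" prefix of JWT segments, the AKIA/ASIA prefix of AWS access key IDs), but the list is an assumption and certainly incomplete. The generic 40-character rule is a blunt heuristic that will also catch some harmless long identifiers, and the "[secret]" placeholder is a choice made for this sketch.

```python
import re

SECRET_PATTERNS = [
    # PEM private key blocks, e.g. "-----BEGIN RSA PRIVATE KEY----- ..."
    re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    # JWTs: three base64url segments; header and payload start with "eyJ".
    re.compile(r"\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    # AWS access key IDs: AKIA or ASIA followed by 16 characters.
    re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b"),
    # Assumed heuristic: any unbroken run of 40+ key-like characters.
    re.compile(r"\b[A-Za-z0-9_-]{40,}\b"),
]

# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize(text, placeholder="[secret]"):
    """Replace likely secrets with a short placeholder and drop control chars."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return CONTROL_CHARS.sub("", text)
```

For the JSON escaping step, passing the sanitized string through json.dumps from the standard library is usually enough; hand-rolled escaping is where malformed prompts tend to come from.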

Principle 5: Target Your Output

Unbounded AI output is expensive. If you ask an LLM to “describe my day,” it might generate thousands of tokens of prose.

Instead, we target specific output volumes:

  • Calculate expected task count based on active hours and desired granularity
  • Include this target in the prompt (“produce approximately 12-15 tasks”)
  • Use JSON mode to enforce structured output
  • Set explicit output token limits

The output stays close to what we need, and the token limit guarantees it never runs past the cap.
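The four steps above might be wired together like this. The rates of 1.5 to 1.875 tasks per active hour are chosen only so that an 8-hour day reproduces the "12-15 tasks" example; the per-task token budget and the request field names are hypothetical and would need to be mapped onto whatever your provider's API actually calls them.

```python
# Hypothetical tuning values, picked so an 8-hour day yields 12-15 tasks.
MIN_TASKS_PER_HOUR = 1.5
MAX_TASKS_PER_HOUR = 1.875
TOKENS_PER_TASK = 120  # assumed output budget per task


def target_task_range(active_hours):
    """Return (low, high) task counts for the given number of active hours."""
    low = max(1, round(active_hours * MIN_TASKS_PER_HOUR))
    high = max(low, round(active_hours * MAX_TASKS_PER_HOUR))
    return low, high


def build_request(activity_digest, active_hours):
    """Build a provider-agnostic request dict with bounded output."""
    low, high = target_task_range(active_hours)
    prompt = (
        "Summarize the activity below as a JSON object with a 'tasks' array. "
        f"Produce approximately {low}-{high} tasks.\n\n{activity_digest}"
    )
    return {
        "prompt": prompt,
        "response_format": "json",                    # hypothetical field name
        "max_output_tokens": high * TOKENS_PER_TASK,  # hypothetical field name
    }
```

The count in the prompt is a soft target that the model may miss by a task or two, while the token limit is a hard ceiling. Using both keeps the usual case tidy and the worst case bounded.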

Principle 6: Parallelize Inference

If you need to extract multiple things from the same data—say, tasks AND user preferences AND project patterns—don’t make sequential API calls.

We run multiple extraction goals concurrently against the same activity data. The prompt is built once, and multiple inferences run in parallel. This doesn’t reduce token usage per se, but it cuts latency significantly and amortizes the fixed costs of context preparation.
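A minimal sketch of this pattern with asyncio is shown below. The goal names and instructions are illustrative, and run_inference is a stand-in that only sleeps; in practice it would wrap your provider's async client.

```python
import asyncio

# Illustrative extraction goals; the instructions are placeholders.
GOALS = {
    "tasks": "List the tasks the user worked on.",
    "preferences": "List tool and workflow preferences you can infer.",
    "projects": "List the projects that appear in the activity.",
}


async def run_inference(prompt):
    """Stand-in for a real model call; replace with an async API client."""
    await asyncio.sleep(0.01)
    return f"result for: {prompt[:30]}"


async def extract_all(activity_context, infer=run_inference):
    """Run every extraction goal concurrently against the same context."""
    names = list(GOALS)
    # The shared context is prepared once and reused in every prompt.
    prompts = [f"{GOALS[name]}\n\n{activity_context}" for name in names]
    results = await asyncio.gather(*(infer(p) for p in prompts))
    return dict(zip(names, results))


if __name__ == "__main__":
    print(asyncio.run(extract_all("...activity digest...")))
```

Because asyncio.gather returns results in the order the awaitables were passed, zipping them back onto the goal names is safe even when the calls finish out of order.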

Principle 7: Post-Process Aggressively

Sometimes the AI generates redundant output despite your best efforts. Post-processing catches this:

  • Merge tasks with identical titles
  • Combine overlapping time ranges
  • Deduplicate extracted entities
  • Repair malformed JSON responses

This cleanup is cheap (just local string processing) and can significantly improve output quality without additional API calls.
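Here is a rough sketch of three of these cleanup steps using only the standard library. The task shape (a "title" plus a list of [start, end] ranges) is an assumption, titles are compared after trimming and lowercasing, and the JSON repair handles only two common failure modes (text around the object and trailing commas). It is a heuristic, not a general JSON fixer, and it can mangle string values that happen to contain ",}" or ",]".

```python
import json
import re


def parse_json_lenient(raw):
    """Try strict parsing first, then a couple of cheap repairs."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Keep only the outermost {...} span, dropping any surrounding chatter.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    candidate = raw[start:end + 1]
    # Remove trailing commas before a closing brace or bracket.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)


def merge_ranges(ranges):
    """Combine overlapping or touching [start, end] ranges."""
    out = []
    for start, end in sorted(ranges):
        if out and start <= out[-1][1]:
            out[-1][1] = max(out[-1][1], end)
        else:
            out.append([start, end])
    return out


def merge_tasks(tasks):
    """Merge tasks whose titles match, then tidy each task's time ranges."""
    by_title = {}
    for task in tasks:
        title = task["title"].strip()
        entry = by_title.setdefault(title.lower(), {"title": title, "ranges": []})
        entry["ranges"].extend(task["ranges"])
    for entry in by_title.values():
        entry["ranges"] = merge_ranges(entry["ranges"])
    return list(by_title.values())
```

If the repaired text still fails to parse, the second json.loads raises, and the caller can decide whether a retry is worth the extra tokens.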

The Results

These optimizations compound. Let’s trace a typical day:

Stage                         | Data Volume
Raw screenshots captured      | 480 images
After local OCR extraction    | 480 text blocks (~200 tokens each)
After structured extraction   | 480 digests (~50 tokens each)
After block aggregation       | 25 blocks
After sampling & limits       | ~15K tokens total prompt

We went from potentially millions of tokens (vision API on every screenshot) to ~15K tokens for a full day’s summarization. That’s roughly a 100x reduction.
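The arithmetic behind that figure is easy to check. The snippet below uses the numbers quoted in this article (480 screenshots, about 3,000 tokens per image, about 200 and 50 tokens per text block and digest, and a final prompt of about 15K tokens); the variable names are just for illustration.

```python
SCREENSHOTS_PER_DAY = 8 * 60      # one per minute over an 8-hour day
TOKENS_PER_IMAGE = 3_000          # per-image figure used earlier
TOKENS_PER_OCR_BLOCK = 200
TOKENS_PER_DIGEST = 50
FINAL_PROMPT_TOKENS = 15_000

vision_tokens = SCREENSHOTS_PER_DAY * TOKENS_PER_IMAGE
ocr_tokens = SCREENSHOTS_PER_DAY * TOKENS_PER_OCR_BLOCK
digest_tokens = SCREENSHOTS_PER_DAY * TOKENS_PER_DIGEST

print(vision_tokens)                        # 1440000
print(ocr_tokens)                           # 96000
print(digest_tokens)                        # 24000
print(vision_tokens / FINAL_PROMPT_TOKENS)  # 96.0
```

At one screenshot per minute the reduction works out to 96x, which is where "roughly 100x" comes from. Capturing more often raises the vision-API baseline without changing the final prompt much, so the ratio only gets larger.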

More importantly, the accuracy didn’t suffer. By extracting the right signals locally, we preserved the information that matters for task understanding while discarding the noise.

Lessons for Other Applications

If you’re building an AI-powered application that processes continuous data:

  1. Do as much as possible locally. Modern devices are capable. OCR, pattern matching, entity extraction—these don’t need cloud AI.

  2. Aggregate aggressively. Raw event streams are almost always over-detailed for AI consumption.

  3. Sample strategically. You don’t need all the data, but you need representative data.

  4. Sanitize everything. Remove noise, normalize formats, strip irrelevant content.

  5. Be specific about output. Tell the model exactly what you want and in what format.

  6. Post-process the results. Clean up redundancy and errors locally.

The goal isn’t to minimize AI usage—it’s to maximize value per token. Send the right context, get better results, spend less money.

That’s how you build AI applications that actually scale.