Inside Cursor: CursorBench, Internal PMF and Agents
With David Gomes and Eric Zakariasson
Software Synthesis analyses the evolution of software companies in the age of AI - from how they're built and scaled, to how they go to market and create enduring value. You can reach me at akash@earlybird.com.
Last week we hosted David and Eric from Cursor to discuss Cursor’s approach to benchmarking models, product development, the future of IDEs and CLIs, and more.
Cursor 2.2 Launch
1. Debug Mode
Collects runtime context by automatically adding logging statements to code
Agent reproduces issues and analyses actual data flow
Launches a Node.js server locally to capture logs across multiple repos
Works across different languages (Python, JavaScript) via HTTP requests; a minimal sketch of the idea follows below
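Cursor hasn't published the protocol, but the mechanism is easy to picture: instrumented code in any language POSTs structured log entries over HTTP to a small local collector that the agent then reads back. Below is a minimal sketch of that idea; the /log endpoint, port, and payload shape are assumptions for illustration, not Cursor's actual implementation.

```typescript
// Minimal sketch of a local log-collection server, assuming instrumented code
// POSTs JSON entries like { repo, file, line, message, values } to /log.
// Endpoint name, port, and payload shape are assumptions, not Cursor's protocol.
import http from "node:http";

interface LogEntry {
  repo: string;      // which repository the instrumented statement lives in
  file: string;
  line: number;
  message: string;
  values?: Record<string, unknown>; // captured runtime values
}

const entries: LogEntry[] = [];

const server = http.createServer((req, res) => {
  if (req.method === "POST" && req.url === "/log") {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      entries.push(JSON.parse(body) as LogEntry);
      res.writeHead(204).end();
    });
  } else if (req.method === "GET" && req.url === "/entries") {
    // The agent can read back everything it captured, across repos and languages.
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(entries));
  } else {
    res.writeHead(404).end();
  }
});

server.listen(7331, () => console.log("debug-mode log collector on :7331"));
```

Because the collector speaks plain HTTP, code in any language can report into it, which is how a single server can cover Python and JavaScript alike.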
2. Plan Mode Improvements
Emerged from developers using markdown files to steer models
Allows upfront alignment between developer and AI before code execution
Treats plans as markdown files that agents can search and reference
Plans may evolve into long-term memory/context for projects
Moving toward storing all artifacts (plans, chat histories) as files rather than in databases
3. Multi-Agent Judging
Runs same task across multiple models simultaneously
Uses LLM-as-judge (currently Opus 4.5) to recommend best solution
The judge model sees only the original prompt and each model’s output, not tool calls, thinking traces, or implementation details (a sketch of this flow follows below)
Future: Tournament-style bracket judging for 8+ models
Already shown to outperform any individual model on Cursor’s benchmarks
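The internal implementation isn't public, but the shape of the flow is straightforward: fan the task out to several models, strip everything except each model's final output, and ask a judge model to pick a winner. A hedged sketch, with callModel standing in for whatever completion API is used and the judge prompt wording invented:

```typescript
// Sketch of multi-agent judging: run the same task on several models, then ask
// a judge model to pick the best result. callModel stands in for a real
// completion API; the judge model name and prompt wording are assumptions.
type CallModel = (model: string, prompt: string) => Promise<string>;

interface Candidate {
  model: string;
  output: string; // final output only: no tool calls or thinking traces
}

async function judgeBest(
  callModel: CallModel,
  task: string,
  candidates: Candidate[]
): Promise<Candidate> {
  // The judge sees the original prompt and each candidate's output, nothing else.
  const judgePrompt = [
    `Task:\n${task}`,
    ...candidates.map((c, i) => `Candidate ${i + 1}:\n${c.output}`),
    "Reply with only the number of the best candidate.",
  ].join("\n\n");

  const verdict = await callModel("judge-model", judgePrompt);
  const index = parseInt(verdict.trim(), 10) - 1;
  return candidates[index] ?? candidates[0];
}

async function runMultiAgent(
  callModel: CallModel,
  task: string,
  models: string[]
): Promise<Candidate> {
  // Run the same task across every model in parallel, then judge the results.
  const candidates = await Promise.all(
    models.map(async (model) => ({ model, output: await callModel(model, task) }))
  );
  return judgeBest(callModel, task, candidates);
}
```

A tournament-style bracket would simply apply judgeBest pairwise and advance winners until one candidate remains.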
Technical Philosophy & Infrastructure
Context as Core Focus
The team emphasised gathering context at two critical points:
Pre-generation: Plan mode for upfront alignment
Post-generation: Debug mode for validation
“Everything as Files” Strategy
Models excel at reading/writing files
Converting all artifacts to files enables agents to grep and semantic search
Chat histories are moving from SQLite to markdown files (a sketch of the idea is below)
Leverages existing primitives agents already understand
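As a rough illustration of the "everything as files" direction, here is a sketch that writes a chat transcript out as a markdown file an agent could grep or semantically search; the message shape and the .cursor/chats/ directory are assumptions, not Cursor's actual layout.

```typescript
// Sketch: persist chat transcripts as markdown files instead of database rows,
// so an agent can grep or semantically search them like any other file.
// The message shape and .cursor/chats/ layout are assumptions for illustration.
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

function saveChatAsMarkdown(chatId: string, title: string, messages: ChatMessage[]): string {
  const dir = join(".cursor", "chats");
  mkdirSync(dir, { recursive: true });

  // One heading per turn keeps the file readable for both humans and agents.
  const body = messages.map((m) => `## ${m.role}\n\n${m.content}`).join("\n\n");

  const path = join(dir, `${chatId}.md`);
  writeFileSync(path, `# ${title}\n\n${body}\n`);
  return path; // an agent can now find this with grep or semantic search
}
```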
CursorBench (Private Benchmark)
Internal benchmark using real software engineering tasks
Must remain private to prevent training data contamination
Currently tracks the best single model vs. the ensemble approach
A better predictor than public benchmarks (SWE-bench is now potentially in training data)
Model Evaluation Insights
Quality Metrics
Two key signals tracked:
Code persistence: how long generated code remains in the codebase after subsequent commits (a rough approximation is sketched below)
Sentiment analysis: whether the next user message signals success (topic change) or failure (error logs, continuing on the same issue)
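Cursor's actual persistence pipeline isn't public, but one way to approximate the signal locally is to blame the current tree and count how many lines are still attributed to the commit that introduced the generated code. A rough sketch, with the commit hash and file list supplied by the caller:

```typescript
// Rough sketch of a "code persistence" signal: for a commit that introduced
// generated code, count how many lines in the current tree are still attributed
// to that commit via git blame. An approximation, not Cursor's pipeline.
import { execSync } from "node:child_process";

function survivingLines(commit: string, file: string): number {
  // --line-porcelain prints the originating commit hash for every surviving line.
  const blame = execSync(`git blame --line-porcelain HEAD -- ${file}`, {
    encoding: "utf8",
  });
  return blame.split("\n").filter((line) => line.startsWith(commit)).length;
}

function persistenceRatio(commit: string, files: string[], linesAdded: number): number {
  const surviving = files.reduce((sum, f) => sum + survivingLines(commit, f), 0);
  return surviving / linesAdded; // 1.0 = everything still in place, 0 = all rewritten
}
```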
Model Usage Patterns
Internal data shows many engineers using Tab less and shifting to more agent-heavy workflows
Sonnet family most popular, with usage spikes when new versions are released
Brief Gemini adoption, then return to Claude/OpenAI models
Development Culture & Process
Internal PMF First
Ship features internally before external release
Debug mode initially built on weekends, faced skepticism
Rapid Iteration
Minimal RFC process for most features
Each person is empowered to ship internally when they have ideas
RFCs mainly for context/knowledge sharing and alignment
RFC writing now valuable because agents can implement from specs
Sub-Agents
Had sub-agents in April but no internal PMF
Bringing back with Composer model (fast, nearly Sonnet-quality)
Models improved enough to make the experience acceptable
Product Strategy & Roadmap
Interfaces (IDE, CLI)
The P0 (core product goal) is to automate coding; the interface itself will evolve over time
Computer Use for Validation
Using computer use to verify code changes work correctly
Example: Bug reported on Slack → Cursor fixes → Returns video proof → PR merged
Async workflow advantage - not waiting for human to test
Other Features
Deep Links for sharing: Commands and rules shareable via links
Voice Input: Already works well; voice output back from Cursor is in progress
Future Features Discussed
Extension Marketplace: Reuse VS Code marketplace for commands, rules, MCP servers
Video Input: Waiting for model providers to add capability
Technical Deep Dives
Tool Management for MCP
Solutions being explored:
Lazy loading: the agent sees the tool list and loads full definitions only when needed (sketched below)
File-based discovery: Tools stored as files, agents grep/search
Code Mode: Wrap MCPs as APIs (models better at APIs than MCP definitions), enables programmatic chaining
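A hedged sketch of the lazy-loading idea: the model's context only ever carries tool names and one-line descriptions, and the full JSON schema is read from disk when a tool is actually chosen. The .cursor/tools/ directory and definition shape are assumptions for illustration.

```typescript
// Sketch of lazy tool loading for MCP: keep only names and short descriptions
// in the model's context, and read the full definition from disk on demand.
// The .cursor/tools/ layout and definition shape are assumptions.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: unknown; // full JSON schema, only surfaced when the tool is chosen
}

const TOOLS_DIR = join(".cursor", "tools");

// Cheap index the agent always sees: one line per tool. The saving is in what
// enters the model's context, not in disk reads.
function listToolSummaries(): string[] {
  return readdirSync(TOOLS_DIR)
    .filter((f) => f.endsWith(".json"))
    .map((f) => {
      const def = JSON.parse(readFileSync(join(TOOLS_DIR, f), "utf8")) as ToolDefinition;
      return `${def.name}: ${def.description}`;
    });
}

// Expensive detail, loaded only when the agent actually picks a tool.
function loadToolDefinition(name: string): ToolDefinition {
  return JSON.parse(readFileSync(join(TOOLS_DIR, `${name}.json`), "utf8")) as ToolDefinition;
}
```

File-based discovery falls out of the same layout: because definitions live as files, an agent can also grep or semantically search them instead of holding them in context.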
Context Window Challenges
Opus 4.5 has only a 180K context window (surprisingly small)
The team stops using a chat at ~60% of context capacity, where quality degrades (a simple budget check is sketched below)
Preference for one-shot prompts vs. multi-turn conversations
Summarisation degrades quality significantly
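The ~60% rule is easy to operationalise as a budget check. A minimal sketch, assuming a crude chars/4 token estimate rather than a real tokenizer, and using the 180K figure quoted above:

```typescript
// Sketch of a context-budget check: estimate how full the window is and
// suggest starting a fresh chat past ~60%, where quality reportedly degrades.
// The chars/4 token estimate is a rough heuristic, not a real tokenizer.
const CONTEXT_WINDOW_TOKENS = 180_000; // figure quoted for Opus 4.5 in the discussion
const STOP_THRESHOLD = 0.6;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function shouldStartFreshChat(messages: string[]): boolean {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
  return used / CONTEXT_WINDOW_TOKENS > STOP_THRESHOLD;
}
```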
Other Insights
BugBot Feature
Automatic PR review bot in Cursor
Intentionally catches fewer bugs than competitors
Reason: filtering false positives; too many warnings and users start ignoring them
Custom model trained for false-positive detection (the filtering idea is sketched below)
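A sketch of the filtering idea: score each candidate finding with a classifier and only surface the confident ones, accepting lower recall for higher signal. scoreFinding stands in for Cursor's custom model, and the 0.8 threshold is invented:

```typescript
// Sketch of BugBot-style false-positive filtering: only surface findings a
// classifier is confident about, trading recall for signal. scoreFinding
// stands in for the custom model; the 0.8 threshold is an invented number.
interface Finding {
  file: string;
  line: number;
  message: string;
}

type ScoreFinding = (f: Finding) => Promise<number>; // 0..1 likelihood of a real bug

async function filterFindings(
  findings: Finding[],
  scoreFinding: ScoreFinding,
  threshold = 0.8
): Promise<Finding[]> {
  const scored = await Promise.all(
    findings.map(async (f) => ({ finding: f, score: await scoreFinding(f) }))
  );
  // Dropping low-confidence findings means flagging fewer bugs overall, but the
  // comments that do get posted are far more likely to be acted on.
  return scored.filter((s) => s.score >= threshold).map((s) => s.finding);
}
```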
Internal Tools Usage
Can see internal data on feature usage, model preferences
Agent quality team (8 people) solely focused on harness/evals/tools
Competitive Positioning
CursorBench used as a de facto model ranking
Model selector in IDE trusted more than public benchmarks
Signals
What I’m Reading
Titans + MIRAS: Helping AI have long-term memory
Portable Memory & Behavioral Signatures: the missing layer for AI personalisation
OS Agent | Byte’s Mobile Assistant & Implications
B2B “AI products” that are really services in a trenchcoat
Earnings Commentary
You can achieve performance-wise so much better in the custom purpose-designed, hardware-driven XPU. And we see that in the TPU and we see that in all the accelerators we are doing for our other customers, much, much better in areas of sparse core, training, inference, reasoning, all that stuff.
Hock E. Tan, Broadcom Q4 2025 Earnings Call
I’ll give you one example from a financials or economics perspective. Let’s take a media and entertainment company we’re working on... that organization was spending $10 million with us ARR on our core creative products... We were able to sell them Firefly Services and Firefly Foundry for about $7 million, so a pretty significant step up in terms of the engagement that we have with the customer.
David Wadhwani, Adobe Q4 2025 Earnings Call
Have any feedback? Email me at akash@earlybird.com.





