Inside Cursor: CursorBench, Internal PMF and Agents
With David Gomes and Eric Zakariasson
Software Synthesis analyses the evolution of software companies in the age of AI - from how they're built and scaled, to how they go to market and create enduring value. You can reach me at akash@earlybird.com.
Last week we hosted David and Eric from Cursor to discuss Cursor’s approach to benchmarking models, product development, the future of IDEs and CLIs, and more.
Cursor 2.2 Launch
1. Debug Mode
Collects runtime context by automatically adding logging statements to code
Agent reproduces issues and analyses actual data flow
Launches a Node.js server locally to capture logs across multiple repos
Works across different languages (Python, JavaScript) via HTTP requests; a minimal sketch of the idea follows below
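Cursor hasn't published the protocol, but the mechanism is easy to picture: instrumented code in any language POSTs structured log entries over HTTP to a small local collector that the agent then reads back. Below is a minimal sketch of that idea; the /log endpoint, port, and payload shape are assumptions for illustration, not Cursor's actual implementation.

```typescript
// Minimal sketch of a local log-collection server, assuming instrumented code
// POSTs JSON entries like { repo, file, line, message, values } to /log.
// Endpoint name, port, and payload shape are assumptions, not Cursor's protocol.
import http from "node:http";

interface LogEntry {
  repo: string;      // which repository the instrumented statement lives in
  file: string;
  line: number;
  message: string;
  values?: Record<string, unknown>; // captured runtime values
}

const entries: LogEntry[] = [];

const server = http.createServer((req, res) => {
  if (req.method === "POST" && req.url === "/log") {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      entries.push(JSON.parse(body) as LogEntry);
      res.writeHead(204).end();
    });
  } else if (req.method === "GET" && req.url === "/entries") {
    // The agent can read back everything it captured, across repos and languages.
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(entries));
  } else {
    res.writeHead(404).end();
  }
});

server.listen(7331, () => console.log("debug-mode log collector on :7331"));
```

Because the collector speaks plain HTTP, code in any language can report into it, which is how a single server can cover Python and JavaScript alike.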
2. Plan Mode Improvements
Emerged from developers using markdown files to steer models
Allows upfront alignment between developer and AI before code execution
Treats plans as markdown files that agents can search and reference
Plans may evolve into long-term memory/context for projects
Moving toward storing all artifacts (plans, chat histories) as files rather than in databases
3. Multi-Agent Judging
Runs same task across multiple models simultaneously
Uses LLM-as-judge (currently Opus 4.5) to recommend best solution
The judge model sees only the original prompt and each model’s output, not tool calls, thinking traces, or implementation details (a sketch of this flow follows below)
Future: Tournament-style bracket judging for 8+ models
Already shown to outperform any individual model on Cursor’s benchmarks
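The internal implementation isn't public, but the shape of the flow is straightforward: fan the task out to several models, strip everything except each model's final output, and ask a judge model to pick a winner. A hedged sketch, with callModel standing in for whatever completion API is used and the judge prompt wording invented:

```typescript
// Sketch of multi-agent judging: run the same task on several models, then ask
// a judge model to pick the best result. callModel stands in for a real
// completion API; the judge model name and prompt wording are assumptions.
type CallModel = (model: string, prompt: string) => Promise<string>;

interface Candidate {
  model: string;
  output: string; // final output only: no tool calls or thinking traces
}

async function judgeBest(
  callModel: CallModel,
  task: string,
  candidates: Candidate[]
): Promise<Candidate> {
  // The judge sees the original prompt and each candidate's output, nothing else.
  const judgePrompt = [
    `Task:\n${task}`,
    ...candidates.map((c, i) => `Candidate ${i + 1}:\n${c.output}`),
    "Reply with only the number of the best candidate.",
  ].join("\n\n");

  const verdict = await callModel("judge-model", judgePrompt);
  const index = parseInt(verdict.trim(), 10) - 1;
  return candidates[index] ?? candidates[0];
}

async function runMultiAgent(
  callModel: CallModel,
  task: string,
  models: string[]
): Promise<Candidate> {
  // Run the same task across every model in parallel, then judge the results.
  const candidates = await Promise.all(
    models.map(async (model) => ({ model, output: await callModel(model, task) }))
  );
  return judgeBest(callModel, task, candidates);
}
```

A tournament-style bracket would simply apply judgeBest pairwise and advance winners until one candidate remains.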
Technical Philosophy & Infrastructure
Context as Core Focus
The team emphasised gathering context at two critical points:
Pre-generation: Plan mode for upfront alignment
Post-generation: Debug mode for validation
“Everything as Files” Strategy
Models excel at reading/writing files
Converting all artifacts to files enables agents to grep and semantic search
Chat histories are moving from SQLite to markdown files (a sketch of the idea is below)
Leverages existing primitives agents already understand
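As a rough illustration of the "everything as files" direction, here is a sketch that writes a chat transcript out as a markdown file an agent could grep or semantically search; the message shape and the .cursor/chats/ directory are assumptions, not Cursor's actual layout.

```typescript
// Sketch: persist chat transcripts as markdown files instead of database rows,
// so an agent can grep or semantically search them like any other file.
// The message shape and .cursor/chats/ layout are assumptions for illustration.
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

function saveChatAsMarkdown(chatId: string, title: string, messages: ChatMessage[]): string {
  const dir = join(".cursor", "chats");
  mkdirSync(dir, { recursive: true });

  // One heading per turn keeps the file readable for both humans and agents.
  const body = messages.map((m) => `## ${m.role}\n\n${m.content}`).join("\n\n");

  const path = join(dir, `${chatId}.md`);
  writeFileSync(path, `# ${title}\n\n${body}\n`);
  return path; // an agent can now find this with grep or semantic search
}
```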
CursorBench (Private Benchmark)
Internal benchmark using real software engineering tasks
Must remain private to prevent training data contamination
Currently tracks the best single model vs. the ensemble approach
A better predictor than public benchmarks (SWE-bench is now potentially in training data)
Model Evaluation Insights
Quality Metrics
Two key signals tracked:
Code persistence: how long generated code remains in the codebase after subsequent commits (a rough approximation is sketched below)
Sentiment analysis: whether the next user message signals success (topic change) or failure (error logs, continuing on the same issue)
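Cursor's actual persistence pipeline isn't public, but one way to approximate the signal locally is to blame the current tree and count how many lines are still attributed to the commit that introduced the generated code. A rough sketch, with the commit hash and file list supplied by the caller:

```typescript
// Rough sketch of a "code persistence" signal: for a commit that introduced
// generated code, count how many lines in the current tree are still attributed
// to that commit via git blame. An approximation, not Cursor's pipeline.
import { execSync } from "node:child_process";

function survivingLines(commit: string, file: string): number {
  // --line-porcelain prints the originating commit hash for every surviving line.
  const blame = execSync(`git blame --line-porcelain HEAD -- ${file}`, {
    encoding: "utf8",
  });
  return blame.split("\n").filter((line) => line.startsWith(commit)).length;
}

function persistenceRatio(commit: string, files: string[], linesAdded: number): number {
  const surviving = files.reduce((sum, f) => sum + survivingLines(commit, f), 0);
  return surviving / linesAdded; // 1.0 = everything still in place, 0 = all rewritten
}
```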
Model Usage Patterns
Internal data shows many engineers using Tab less and shifting to more agent-heavy workflows
Sonnet family most popular, with usage spikes when new versions are released
Brief Gemini adoption, then return to Claude/OpenAI models
Development Culture & Process
Internal PMF First
Ship features internally before external release
Debug mode initially built on weekends, faced skepticism
Rapid Iteration
Minimal RFC process for most features
Each person is empowered to ship internally when they have ideas
RFCs mainly for context/knowledge sharing and alignment
RFC writing now valuable because agents can implement from specs
Sub-Agents
Had sub-agents in April but no internal PMF
Bringing back with Composer model (fast, nearly Sonnet-quality)
Models improved enough to make the experience acceptable
Product Strategy & Roadmap
Interfaces (IDE, CLI)
The P0 (core product goal) is to automate coding; the interface itself will evolve over time
Computer Use for Validation
Using computer use to verify code changes work correctly
Example: Bug reported on Slack → Cursor fixes → Returns video proof → PR merged
Async workflow advantage - not waiting for human to test
Other Features
Deep Links for sharing: Commands and rules shareable via links
Voice Input: Already works well; voice output back from Cursor is in progress
Future Features Discussed
Extension Marketplace: Reuse VS Code marketplace for commands, rules, MCP servers
Video Input: Waiting for model providers to add capability
Technical Deep Dives
Tool Management for MCP
Solutions being explored:
Lazy loading: the agent sees the tool list and loads full definitions only when needed (sketched below)
File-based discovery: Tools stored as files, agents grep/search
Code Mode: Wrap MCPs as APIs (models better at APIs than MCP definitions), enables programmatic chaining
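A hedged sketch of the lazy-loading idea: the model's context only ever carries tool names and one-line descriptions, and the full JSON schema is read from disk when a tool is actually chosen. The .cursor/tools/ directory and definition shape are assumptions for illustration.

```typescript
// Sketch of lazy tool loading for MCP: keep only names and short descriptions
// in the model's context, and read the full definition from disk on demand.
// The .cursor/tools/ layout and definition shape are assumptions.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: unknown; // full JSON schema, only surfaced when the tool is chosen
}

const TOOLS_DIR = join(".cursor", "tools");

// Cheap index the agent always sees: one line per tool. The saving is in what
// enters the model's context, not in disk reads.
function listToolSummaries(): string[] {
  return readdirSync(TOOLS_DIR)
    .filter((f) => f.endsWith(".json"))
    .map((f) => {
      const def = JSON.parse(readFileSync(join(TOOLS_DIR, f), "utf8")) as ToolDefinition;
      return `${def.name}: ${def.description}`;
    });
}

// Expensive detail, loaded only when the agent actually picks a tool.
function loadToolDefinition(name: string): ToolDefinition {
  return JSON.parse(readFileSync(join(TOOLS_DIR, `${name}.json`), "utf8")) as ToolDefinition;
}
```

File-based discovery falls out of the same layout: because definitions live as files, an agent can also grep or semantically search them instead of holding them in context.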
Context Window Challenges
Opus 4.5 has only a 180K context window (surprisingly small)
The team stops using a chat at ~60% of context capacity, where quality degrades (a simple budget check is sketched below)
Preference for one-shot prompts vs. multi-turn conversations
Summarisation degrades quality significantly
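The ~60% rule is easy to operationalise as a budget check. A minimal sketch, assuming a crude chars/4 token estimate rather than a real tokenizer, and using the 180K figure quoted above:

```typescript
// Sketch of a context-budget check: estimate how full the window is and
// suggest starting a fresh chat past ~60%, where quality reportedly degrades.
// The chars/4 token estimate is a rough heuristic, not a real tokenizer.
const CONTEXT_WINDOW_TOKENS = 180_000; // figure quoted for Opus 4.5 in the discussion
const STOP_THRESHOLD = 0.6;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function shouldStartFreshChat(messages: string[]): boolean {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
  return used / CONTEXT_WINDOW_TOKENS > STOP_THRESHOLD;
}
```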
Other Insights
BugBot Feature
Automatic PR review bot in Cursor
Intentionally catches fewer bugs than competitors
Reason: filtering false positives; too many warnings and users start ignoring them
Custom model trained for false-positive detection (the filtering idea is sketched below)
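A sketch of the filtering idea: score each candidate finding with a classifier and only surface the confident ones, accepting lower recall for higher signal. scoreFinding stands in for Cursor's custom model, and the 0.8 threshold is invented:

```typescript
// Sketch of BugBot-style false-positive filtering: only surface findings a
// classifier is confident about, trading recall for signal. scoreFinding
// stands in for the custom model; the 0.8 threshold is an invented number.
interface Finding {
  file: string;
  line: number;
  message: string;
}

type ScoreFinding = (f: Finding) => Promise<number>; // 0..1 likelihood of a real bug

async function filterFindings(
  findings: Finding[],
  scoreFinding: ScoreFinding,
  threshold = 0.8
): Promise<Finding[]> {
  const scored = await Promise.all(
    findings.map(async (f) => ({ finding: f, score: await scoreFinding(f) }))
  );
  // Dropping low-confidence findings means flagging fewer bugs overall, but the
  // comments that do get posted are far more likely to be acted on.
  return scored.filter((s) => s.score >= threshold).map((s) => s.finding);
}
```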
Internal Tools Usage
Can see internal data on feature usage, model preferences
Agent quality team (8 people) solely focused on harness/evals/tools
Competitive Positioning
CursorBench used as a de facto model ranking
Model selector in IDE trusted more than public benchmarks
Signals
What I’m Reading
Titans + MIRAS: Helping AI have long-term memory
Portable Memory & Behavioral Signatures: the missing layer for AI personalisation
OS Agent | Byte’s Mobile Assistant & Implications
B2B “AI products” that are really services in a trenchcoat
Earnings Commentary
You can achieve performance-wise so much better in the custom purpose-designed, hardware-driven XPU. And we see that in the TPU and we see that in all the accelerators we are doing for our other customers, much, much better in areas of sparse core, training, inference, reasoning, all that stuff.
Hock E. Tan, Broadcom Q4 2025 Earnings Call
I’ll give you one example from a financials or economics perspective. Let’s take a media and entertainment company we’re working on... that organization was spending $10 million with us ARR on our core creative products... We were able to sell them Firefly Services and Firefly Foundry for about $7 million, so a pretty significant step up in terms of the engagement that we have with the customer.
David Wadhwani, Adobe Q4 2025 Earnings Call
Have any feedback? Email me at akash@earlybird.com.





