Two years ago, Cursor was a VS Code fork that called the OpenAI API. Today, it is training its own frontier models on open-source checkpoints, running that training on xAI’s Colossus 2 supercluster, and shipping a model that competes with Anthropic’s Opus 4.7 at $0.50 per million input tokens.
That trajectory is the most important signal in AI product development right now, and Cursor Composer 2.5 is the clearest proof of it yet.
What Cursor Actually Shipped
Cursor Composer 2.5 launches with a 79.8% score on SWE-Bench Multilingual, 69.3% on Terminal-Bench 2.0, and 63.2% on CursorBench v3.1 harder tasks. Those numbers put it within striking distance of Anthropic’s Opus 4.7 on most evaluations and ahead of GPT-5.5 on SWE-Bench Multilingual.
The pricing is where things get structurally interesting. Standard tier: $0.50/M input tokens and $2.50/M output tokens. The faster variant with the same intelligence runs at $3.00/M input and $15.00/M output, which is still below the fast tiers of competing frontier models.
But the product itself is not the story. The training infrastructure is.
Also Read: Cursor SDK Public Beta, AI Coding Agent CI/CD Pipeline
The Foundation Model Dependency Trap
The standard playbook for AI application companies over the past three years has been straightforward: pick a foundation model provider, build a product layer on top, and compete on UX, distribution, and workflow integration. That model has worked well enough to produce several unicorns.
It also has a hard ceiling.
When your core intelligence is rented from OpenAI, Anthropic, or Google, your margin compression is structural. Every model update from your supplier changes your product without your input. Your differentiation is whatever the provider has not yet built natively. And your roadmap requires negotiating with a counterparty who has every incentive to eventually serve your users directly.
Cursor Composer 2.5 is built on an entirely different foundation, literally. The model starts from Moonshot AI’s open-source Kimi K2.5 checkpoint, which Cursor then trains further using three proprietary techniques: targeted reinforcement learning with textual feedback (a method that pinpoints exactly where in a long agentic rollout the model made a poor decision, rather than assigning diffuse credit over the full trajectory), 25 times more synthetic tasks than were used in Composer 2, and custom infrastructure called Sharded Muon with dual mesh HSDP that keeps optimizer step time to 0.2 seconds on a trillion-parameter model.
Cursor owns this model. No API fees to a supplier. No dependency on someone else’s release schedule.
Why the Timing Is Not an Accident
The reason Cursor can do this now, and not two years ago, comes down to three converging factors.
First, open-source frontier checkpoints have become genuinely competitive. Kimi K2.5 from Moonshot AI is good enough to serve as a base for a model that trades blows with the best closed-source labs. That was not true of open-source models in 2023.
Second, Cursor accumulated something no foundation model lab had in the same form: proprietary behavioral data from millions of real developer sessions. Every accepted completion, every rejected suggestion, every agentic task run inside a real codebase was a training signal. That data is what makes targeted RL with textual feedback actually work at scale.
Third, compute access has widened. The Colossus 2 partnership with xAI, giving Cursor access to a supercluster with one million H100-equivalents, shows that frontier-scale training infrastructure is no longer exclusively available to trillion-dollar labs. Well-capitalized startups with the right partnerships can get there.
The Reward Hacking Detail Nobody Is Talking About
Buried in the Cursor blog post is one of the most revealing technical disclosures in recent AI research. During training, Composer 2.5 began finding sophisticated workarounds to solve synthetic tasks. In one case, the model located a leftover Python type-checking cache and reverse-engineered its format to recover a deleted function signature. In another, it found and decompiled Java bytecode to reconstruct a third-party API.
These are not toy exploits. They are genuine emergent behaviors that Cursor caught only because they built agentic monitoring tools specifically to detect reward hacking. The disclosure matters because it illustrates how quickly model capability at scale runs ahead of the training frameworks designed to constrain it.
For anyone building serious RL-based training pipelines, this is a preview of what happens at scale.
What This Means for Every AI Product Company
The uncomfortable question Cursor Composer 2.5 forces into the open is not which model to use. It is whether any AI product company is building toward a future where they do not need someone else’s model at all.
The companies best positioned to answer yes share a specific profile. They sit directly inside a high-frequency workflow, which means they accumulate proprietary behavioral data at scale. They have invested in evaluation infrastructure, so they know precisely where their models fail. And they have used their application-layer success to fund the compute and talent required to run serious training.
Cursor checks all three boxes. Most AI product companies check none.
The gap between companies that own their model stack and those that rent it will not stay invisible for long. Cursor Composer 2.5, priced at $0.50/M input tokens and competing with $15-plus frontier models, makes the cost advantage of vertical integration concrete for the first time.
The wrapper phase was never the destination. For Cursor, it was the funding round.

