From electrons to inference with Crusoe's Erwan Menard

The Vertically Integrated Neocloud

Jun 08, 2026

The levers to drive down cost-per-token

Inference is now the dominant cost line in most production AI businesses, and the shape of that cost is changing faster than most operators are pricing in. The software techniques that defined the last eighteen months (quantisation, speculative decoding, prefill/decode disaggregation, smarter KV caching) are real and still being mined, but they are first-order improvements on a stack whose deeper constraints are physical: gigawatts of available power, lead times on InfiniBand, the maturity of chip ecosystems, and the time between signing an Nvidia order and seeing tokens come out the other end. Operators who treat inference as a pure software optimisation problem are quietly losing ground to ones who are integrating downward into the data centre and energy stack.

This week's discussion with Erwan Menard, SVP Product at Crusoe, surfaced the bets Crusoe is making as an AI cloud.

1. Vertical integration

The four-layer stack (energy sourcing, data centre construction, GPU cloud, managed AI services) has to be owned end-to-end to be defensible by 2027. Today, customers don’t ask whether their inference provider built its own power, but by 2028, when the US is projected to face a double-digit gigawatt gap between data centre demand and grid supply, that question becomes the buying criterion.

Crusoe has publicly committed $1.5 billion to turbine purchases for 2027 deployment. The turbines are built with Boom Supersonic and operate at temperatures roughly 20°C above standard GE or Siemens designs. Even if those turbines are not consumed by Crusoe’s own cloud, they can be resold at a higher price than they were bought for. The downside is bounded; the upside is preferential access to power when peers are waiting in the grid interconnect queue (currently four years in the US).

The tension is the 1990s telecoms parallel: massive infrastructure buildout against speculative demand led to a decade of overcapacity. The 1996-era fibre overbuild preceded the demand wave. The current GPU buildout is happening against an H100 cluster price index that is up 35% year-to-date, on hardware introduced in 2022. That is the inverse of an overcapacity signal.

2. The Pareto curve is the unit of analysis, not the model

The mental model Crusoe pushes customers toward is the throughput-versus-latency Pareto frontier: tokens per second per user (latency) on one axis, tokens per second per megawatt (fleet throughput) on the other. Every inference deployment is a point on that curve, and every GPU generation defines a new curve that sits above and to the right of the last one.

Customers graduate along the curve as the business matures. Early-stage customers cluster at the throughput end because they’re optimising GPU spend during proof-of-concept. Customers with human-facing products migrate towards latency as it starts to bind on user experience.

The complication is that the newest GPU generation always offers the best Pareto curve in principle but not always the best dollar economics in practice, because it’s earliest in its depreciation cycle. A B300 outperforms a Hopper on every axis, but Hoppers remain the workhorses of AI infrastructure globally and are, for example, a good economic fit for small-model inference. Crusoe routinely helps its customers navigate these trade-offs.
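To make the frontier idea concrete, here is a minimal Python sketch that filters a set of serving configurations down to the ones that are not beaten on both axes at once. The configuration names and all numbers are invented for illustration; they are not Crusoe measurements.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    tok_s_per_user: float  # interactivity axis (higher is better)
    tok_s_per_mw: float    # fleet throughput axis (higher is better)


def pareto_frontier(configs):
    """Return the configs that no other config matches or beats on both axes."""
    frontier = []
    for c in configs:
        dominated = any(
            o.tok_s_per_user >= c.tok_s_per_user
            and o.tok_s_per_mw >= c.tok_s_per_mw
            and (o.tok_s_per_user > c.tok_s_per_user
                 or o.tok_s_per_mw > c.tok_s_per_mw)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    # Order from the throughput end of the curve to the latency end.
    return sorted(frontier, key=lambda c: c.tok_s_per_user)


# Hypothetical operating points for one GPU generation.
configs = [
    Config("batch-256", tok_s_per_user=20, tok_s_per_mw=900_000),
    Config("batch-64", tok_s_per_user=60, tok_s_per_mw=600_000),
    Config("batch-8", tok_s_per_user=150, tok_s_per_mw=200_000),
    Config("misconfigured", tok_s_per_user=50, tok_s_per_mw=400_000),
]

frontier = pareto_frontier(configs)
for c in frontier:
    print(c.name, c.tok_s_per_user, c.tok_s_per_mw)
```

In this toy data the "misconfigured" point drops out because "batch-64" is better on both axes. The three survivors are the choices a customer actually trades off between as latency starts to matter more than GPU spend.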

3. Three product tiers, mapped to customer maturity

Crusoe’s serverless inference is structured as a three-step ladder rather than a single product:

Off-the-shelf open models on shared infrastructure, paid per token. DeepSeek, Qwen, Kimi K2, Minimax, GPT-OSS, Nemotron. Available through Crusoe’s own Intelligence Foundry or on OpenRouter and other brokerages. This is the proof-of-concept tier.
Bring-your-own open model on dedicated infrastructure, with throughput-optimised or latency-optimised “recipes” selectable by the customer. Self-serve, on-demand GPU-hour billing. Releasing in June.
Custom inference engine optimisation by a forward-deployed engineering team. Customers at this tier typically commit at the dollar level (tens of millions over twelve months) rather than at the GPU level, and revisit parameters quarterly as the workload shifts.

The same inference engine sits under all three tiers.
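One way to think about the step from the per-token tier to dedicated GPU-hour billing is as a break-even on volume. The sketch below uses purely hypothetical prices (they are not Crusoe's) and assumes a single dedicated GPU can actually sustain the token volume in question.

```python
def breakeven_tokens_per_hour(gpu_hour_usd, usd_per_million_tokens):
    """Hourly token volume at which one GPU-hour costs the same as per-token billing."""
    return gpu_hour_usd / usd_per_million_tokens * 1_000_000


def cheaper_tier(tokens_per_hour, gpu_hour_usd, usd_per_million_tokens):
    """Pick the cheaper billing model for a given sustained hourly volume."""
    per_token_cost = tokens_per_hour / 1_000_000 * usd_per_million_tokens
    return "dedicated" if gpu_hour_usd < per_token_cost else "per-token"


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
GPU_HOUR_USD = 3.0
USD_PER_MILLION_TOKENS = 0.5

breakeven = breakeven_tokens_per_hour(GPU_HOUR_USD, USD_PER_MILLION_TOKENS)
low_volume = cheaper_tier(2_000_000, GPU_HOUR_USD, USD_PER_MILLION_TOKENS)
high_volume = cheaper_tier(10_000_000, GPU_HOUR_USD, USD_PER_MILLION_TOKENS)
print(breakeven, low_volume, high_volume)
```

Under these assumed prices the crossover sits at six million tokens per hour per GPU. Below that, shared per-token serving is cheaper. Above it, dedicated capacity wins, provided utilisation stays high.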

4. The KV cache as proprietary IP

Most of Crusoe’s inference engine is assembled from open-source components (vLLM, SGLang, others) with the value-add being curation: picking which open-source advances to integrate and when. The exception is the KV cache layer, which Crusoe has built in-house and calls “memory alloy”. The mechanism pools the memory available across all GPUs in a serving group and treats it as a unified address space. The payoff is on long-context workloads, where KV cache becomes the bottleneck before compute does. Crusoe published a white paper on the technique earlier this year.

This matters strategically because long-context coding workloads are now a significant fraction of the inference business, and the per-token economics on those workloads are dominated by cache behaviour rather than raw FLOP throughput.
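A back-of-the-envelope calculation shows why the cache binds before compute on long contexts. The sketch below uses the standard KV cache sizing for transformer attention (one key and one value vector per layer, per KV head, per token). The model dimensions are assumptions typical of a 70B-class model with grouped-query attention, not figures from Crusoe, and the code says nothing about how Crusoe's own cache layer works.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """KV cache size in bytes: keys plus values for every layer, head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Assumed dimensions: 80 layers, 8 KV heads, head_dim 128, 16-bit cache entries.
per_seq = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                         seq_len=128 * 1024)
per_seq_gib = per_seq / 1024**3
print(f"{per_seq_gib:.0f} GiB of KV cache for one 128k-token sequence")
```

With these assumptions a single 128k-token sequence needs 40 GiB of cache, a large share of one accelerator's memory before any model weights are counted. That is the pressure that makes pooling memory across a serving group attractive.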

The tokeniser Crusoe open-sourced at GTC, reportedly 40x faster than the Hugging Face baseline, is a related but distinct optimisation.

If the inference engine is proprietary, how does a customer compare outputs against vLLM or a different provider’s stack? Crusoe’s upcoming serverless fine-tuning service (launching end of June) will include an eval harness, and customers operating at billions of tokens per day already maintain their own eval infrastructure across multiple inference providers.

5. Supply chain tensions are plentiful, and not only on chips

Lead times for SSDs, memory, and InfiniBand networking are increasing and now exceed GPU lead times. A neocloud’s effective time-to-market is determined by its supply commitments, made nearly a year before the corresponding revenue is recognised. This is the engineering rationale for Crusoe’s modular data centre approach: 20-megawatt sites assembled from ten 2-megawatt prefabricated units called Spark, manufactured in-house using designs inherited from the company’s bitcoin mining origins. These can be brought online in weeks rather than the year a gigawatt site requires, with equivalent per-megawatt economics for a given GPU SKU. This end-to-end industrial approach lets Crusoe smooth out supply chain tensions and deliver capacity programmatically and reliably. The modular strategy complements the gigawatt campuses Crusoe is known for: it developed the first Stargate campus in the US, whose capacity will reach 2.1 GW.

The first Spark deployment is a 16-megawatt campus in Nevada, co-located with a solar farm and a battery bank built from retired EV cells. The grid is present as backup, not primary supply. The first two Spark units power Crusoe’s Managed Inference service. This is the template Crusoe expects to scale through 2026 and 2027.

6. Inference-specific chips: when and which

Crusoe expects to deploy inference-specialised silicon in 2027 alongside continued GPU procurement. Crusoe will aim to offer two such chips:

LPU (Groq) will likely be one of them, primarily because the Nvidia acquisition gives it the financing and ecosystem support that other inference-specific architectures lack.

Two dimensions matter here. The first is current LLM inference performance, which is measurable today. The second is architectural adaptability to whatever post-transformer architecture might emerge, which is irreducibly speculative.

The financing point is worth flagging separately. Nvidia actively finances its chips’ entry to market: CoreWeave, Nebius, and Crusoe itself have benefited from this. A challenger chip vendor that cannot offer similar support places the cost of ecosystem development on the neocloud, which is a meaningful drag on adoption regardless of per-token economics.

7. The last-mile network constraint

If inference moves to the edge in 20-megawatt increments, the constraint shifts from data centre supply to last-mile bandwidth. For example, a single user running five concurrent cloud-based coding agents pulls roughly a terabyte per month of inference traffic. Multiply by agent population and the network buildout starts to look undersized.

However, the last mile is the part of the network that has been most consistently overbuilt, against 5G machine-to-machine traffic projections that never materialised. Edge inference deployments fit comfortably within existing last-mile capacity in most markets. The bottleneck, again, is the data centre side.
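For scale, the terabyte-per-month figure can be converted into a sustained data rate. The sketch below assumes a decimal terabyte (10^12 bytes) and a 30-day month. It gives an average only, so real traffic will be burstier than this number suggests.

```python
def average_mbps(bytes_per_month, days_per_month=30):
    """Average sustained rate in megabits per second for a monthly data volume."""
    seconds = days_per_month * 24 * 3600
    return bytes_per_month * 8 / seconds / 1e6


# One user running five concurrent coding agents: roughly 1 TB per month.
per_user_mbps = average_mbps(1e12)
print(f"{per_user_mbps:.2f} Mbps average per user")
```

That works out to about 3.1 Mbps averaged over the month, which is small next to typical fixed-line access speeds. It is consistent with the view that the last mile is not where the constraint sits.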

For those interested in learning more about Crusoe, check out CEO Chase Lochmiller’s talk at Stanford, which dives deep into the economics of the AI stack from power to compute.
