Small Language Models & Context Engineering Roundtable
With Microsoft AI's Marlene Mhangami
Software Synthesis analyses the evolution of software companies in the age of AI - from how they're built and scaled, to how they go to market and create enduring value. You can reach me at akash@earlybird.com.
Gradient Descending Roundtables in London
November 18th: Agent Frameworks & Memory with Cloudflare
November 19th: Designing AI-Native Software with SPACING
This week, we hosted Marlene and Chris from Microsoft AI to discuss Small Language Models, Context Engineering and MCP. Thanks to everyone who came and made the discussion so insightful!
I’m sharing the summary of our discussion below.
Edge AI & Microsoft Phi Models
Microsoft’s SLM Strategy
Marlene presented Microsoft’s edge AI push, centered around the Phi model family (currently at v4):
Key components:
Phi 4 flagship: General-purpose SLM optimized for edge deployment
Phi 4 reasoning: A variant tuned for reasoning, with particular strength in mathematics
Foundry Local: Microsoft’s Ollama-equivalent platform for downloading and running local models, with a path to Azure cloud services
Competitive positioning:
Compared favourably to Qwen models (which many practitioners described as their “model of choice”)
Also competing with Mistral 7B, Gemma, and other distilled models
Cost advantage: Phi 4 is approximately 150x cheaper than GPT-4.5 for serverless compute
The Email Agent Case Study: A Context Engineering Example
Marlene’s email agent project serves as an excellent case study in the practical challenges of deploying SLMs:
Architecture:
User Query → Supervisor Agent → {
- Manage Email Agent
- Scheduling Event Agent
- Search Email History Agent (with Postgres + semantic search)
} → MCP Server (M365 tools) → Results
Critical Design Decision: Sub-agent Architecture
The most significant insight was the necessity of dividing context across specialised sub-agents rather than loading all MCP tools into a single agent. This addresses a fundamental problem: when you connect an MCP server with dozens of tools, the tool descriptions alone overwhelm the SLM’s context window, degrading performance catastrophically.
Solution strategies employed:
Tool segmentation: Each sub-agent receives only the tools relevant to its domain (sketched after this list)
Middleware layer: Custom JSON generation to work around Phi’s lack of native function calling (native support is reportedly coming soon)
Result summarisation: Critical challenge of managing tool output that floods context windows
Semantic search: Using Postgres for email history rather than raw context stuffing
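To make the pattern concrete, here is a minimal Python sketch of the tool-segmentation idea. The agent and tool names are hypothetical stand-ins (the real project routes to Phi via Foundry Local and an M365 MCP server); the point is that each sub-agent’s prompt only ever contains the tool descriptions for its own domain.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., str]

@dataclass
class SubAgent:
    """A sub-agent only ever sees the tools registered for its own domain."""
    name: str
    tools: list[Tool]

    def system_prompt(self) -> str:
        # Only this agent's tool descriptions enter the SLM's context window.
        lines = [f"- {t.name}: {t.description}" for t in self.tools]
        return f"You are the {self.name} agent. Available tools:\n" + "\n".join(lines)

# Hypothetical segmentation of an M365-style toolset by domain.
manage_email = SubAgent("manage-email", [
    Tool("send_email", "Send an email on the user's behalf", lambda **kw: "sent"),
    Tool("move_email", "Move an email into a folder", lambda **kw: "moved"),
])
scheduling = SubAgent("scheduling", [
    Tool("create_event", "Create a calendar event", lambda **kw: "created"),
])
ROUTES = {"email": manage_email, "calendar": scheduling}

def supervisor(user_query: str) -> SubAgent:
    # Toy router: the real supervisor would ask the SLM to classify the query.
    return ROUTES["calendar"] if "meeting" in user_query.lower() else ROUTES["email"]

print(supervisor("Schedule a meeting with Chris tomorrow").system_prompt())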
Performance metrics:
Response latency: ~2 seconds (surprisingly fast for local inference)
Key bottleneck: Not compute, but memory/storage on the device
The Gaming Paradigm
LLM Paradigm vs. SLM Paradigm
LLM world:
Time is abundant (10+ seconds acceptable)
Quality is paramount
Single-threaded workflows
Cloud-centric
Cost scales with usage
SLM world (especially gaming):
Sub-second latency requirements
Minimum viable quality threshold
Massively parallel workflows
Device-constrained
Fixed cost model
The GPU Budget War
Game developers traditionally allocate GPU budgets across departments (sound: 10%, graphics: 40%, etc.). AI is now demanding its own substantial budget allocation, forcing painful trade-offs. This explains why gaming hasn’t yet widely adopted generative AI - it’s not just a technical challenge but a fundamental resource reallocation problem.
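To put rough numbers on the trade-off (the department percentages are the illustrative ones above; the 60 fps frame budget and the 15% inference share are assumptions for the sake of the arithmetic):
# Rough frame-budget arithmetic for a 60 fps title: every millisecond an
# on-device model consumes has to come out of another department's share.
FRAME_MS = 1000 / 60  # ~16.7 ms per frame at 60 fps

budget = {"graphics": 0.40, "sound": 0.10, "physics_and_other": 0.50}

def with_inference(budget: dict[str, float], inference_share: float) -> dict[str, float]:
    # Scale the existing allocations down to make room for SLM inference.
    remaining = 1.0 - inference_share
    scaled = {name: share * remaining for name, share in budget.items()}
    scaled["slm_inference"] = inference_share
    return scaled

for name, share in with_inference(budget, inference_share=0.15).items():
    print(f"{name:>18}: {share * FRAME_MS:5.2f} ms/frame ({share:.0%})")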
Client-Side AI in Gaming: Current Reality vs. Future Vision
Current constraints:
Can typically run 1-2 SLMs on-device simultaneously
Most studios prioritise a single distilled model to maximise quality
Hybrid approaches: Pre-generate assets server-side, deliver real-time locally
Examples: “Operators” game doing text-to-speech with smart caching
Future paradigm shift:
Multiple specialised SLMs running in parallel
One for dialogue generation
One for text-to-speech
Others for translation, behaviour trees, etc.
The killer use case: NPC conversations
Imagine walking into a room where 10 NPCs are holding simultaneous conversations. Cloud models can’t scale to this (10 parallel LLM calls per interaction is a cost explosion), but device-based SLMs could handle multiple concurrent agents efficiently.
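A hedged sketch of what that could look like: local_slm below is a stub standing in for an on-device model call (via Foundry Local, llama.cpp or similar), and ten NPC agents generate dialogue concurrently without a single cloud round-trip.
import asyncio
import random

async def local_slm(prompt: str) -> str:
    # Stand-in for an on-device SLM call (e.g. a model served by Foundry Local).
    await asyncio.sleep(random.uniform(0.1, 0.5))  # fake sub-second local latency
    return f"reply to: {prompt!r}"

async def npc_turn(npc_name: str, scene: str) -> str:
    prompt = f"You are {npc_name}. React briefly to the scene: {scene}"
    return f"{npc_name}: {await local_slm(prompt)}"

async def room_scene() -> None:
    scene = "The player bursts through the tavern door."
    npcs = [f"npc_{i}" for i in range(10)]
    # All ten NPC agents run concurrently on-device; ten parallel frontier-model
    # calls per interaction would be prohibitively expensive in the cloud.
    for line in await asyncio.gather(*(npc_turn(n, scene) for n in npcs)):
        print(line)

asyncio.run(room_scene())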
The Streaming Gaming Convergence
A strategic insight emerged about cloud gaming services (AWS Luna, etc.): they create an opportunity to co-locate GPU streaming and GPU inference, enabling hybrid architectures where:
Latency-critical elements run locally
High-fidelity generation happens cloud-side but close to streaming source
New compression paradigms could enable prompt-based video streaming
Context Engineering: The Central Challenge
Context management is the defining challenge for SLM deployment:
Problem 1: Tool Description Overload
Loading all MCP tools overwhelms context window
Model performance degrades even before executing any tools
Anthropic’s recent blog post confirmed this industry-wide issue
Problem 2: Tool Result Flooding
Example: “Find emails from past 3 months” returns massive data
Keeping results in context causes model failure
Need to balance information preservation with context limits
Mitigation Strategies Discussed
Architectural solutions:
Sub-agent decomposition (Marlene’s approach)
Virtual file systems (Cloudflare’s code mode approach)
Durable execution patterns (Temporal workflows)
Data management:
Result trimming (lossy but pragmatic)
Summarisation layers (risk of losing critical context)
Semantic indexing (the Postgres approach; a sketch follows this list)
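As an illustration of the semantic-indexing route, here is a minimal sketch assuming Postgres with the pgvector extension, psycopg 3, and some embedding function; the discussion only specified “Postgres plus semantic search”, so the schema and libraries here are assumptions.
import psycopg  # assumes psycopg 3 and a Postgres instance with the pgvector extension

def embed(text: str) -> list[float]:
    # Placeholder: swap in a real embedding model, ideally a small local one.
    raise NotImplementedError

def search_email_history(conn: psycopg.Connection, query: str, k: int = 5) -> list[str]:
    # Return only the k most relevant snippets instead of stuffing three months
    # of raw email text into the SLM's context window.
    rows = conn.execute(
        """
        SELECT subject || ': ' || body
        FROM emails
        ORDER BY embedding <-> %s::vector   -- pgvector L2 distance
        LIMIT %s
        """,
        (str(embed(query)), k),
    ).fetchall()
    return [r[0] for r in rows]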
Emerging solutions:
Samba model: New Microsoft Research model claiming “unlimited context windows” through novel compression/persistence mechanisms
Middleware patterns for context transformation
The Agent vs. Workflow Debate
Agent-heavy approaches:
Tool-calling paradigm with autonomous decision-making
Fills context windows quickly
Unpredictable latency
Popular but problematic for constrained environments
Workflow/chaining approaches:
Airflow-style deterministic pipelines
More predictable, cheaper, faster
Better for real-time requirements
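A rough sketch of the contrast: a deterministic chain fixes its steps (and therefore its latency and cost) up front, while an agent loop lets the model decide how many tool calls to make, so its context and cost grow unpredictably. The function names are illustrative.
# Airflow-style deterministic chain: fixed steps, bounded latency and cost.
def summarise_inbox_workflow(fetch, summarise, notify) -> None:
    emails = fetch(since_days=1)   # step 1: always runs exactly once
    digest = summarise(emails)     # step 2: exactly one model call
    notify(digest)                 # step 3: deterministic side effect

# Agent-style loop: the model chooses tools until it decides it is done, so the
# number of model calls (and the context growth) is unbounded.
def agent_loop(model, tools, goal: str, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = model("\n".join(history))   # context grows every turn
        if action.get("done"):
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']} -> {result}")
    return "gave up"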
Hardware Evolution: The Coming NPU Revolution
The CUDA Parallel
When Nvidia first released CUDA, AMD had faster chips, and people questioned dedicating silicon to such a niche use case. Today, CUDA lock-in dominates AI infrastructure.
The NPU trajectory will follow a similar path:
Current state: Hardware lagging software (unusual reversal)
Microsoft shipping next-gen PCs with NPUs optimized for inference
Gaming consoles (PlayStation, Xbox) will include AI-specific silicon
Mobile devices (especially iPhones) already have powerful NPUs
AI is already embedded in every layer of computing (OS, browser, applications). Without optimized hardware, most features become unusable. This creates an inevitable forcing function for NPU adoption.
Infrastructure Layer: Inference Providers & Trade-offs
Groq/Cerebras Discussion
The conversation revealed nuanced understanding of specialized inference providers:
Groq advantages:
Extreme speed (2-second responses vs. 20 seconds for Anthropic)
Quality now on par with frontier models
Critical for customer support, voice agents
Cost-effective for fire-and-forget tasks
Groq/Cerebras limitations:
No prompt caching (due to architectural choices around on-chip memory)
Makes agentic workflows economically unviable
Can’t support coding workflows that rely on caching
Optimal use cases:
Summarisation (no caching benefit anyway)
Single-pass tasks
Speed-critical applications without iterative loops
The Caching Dependency
The entire agentic paradigm is built on prompt caching. Providers without caching are fundamentally non-viable for agent workflows, regardless of speed advantages. This suggests prompt caching has become infrastructure-critical, not just an optimisation.
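A back-of-the-envelope sketch of why: an agent loop re-sends a growing prefix every turn, so without prefix caching the input-token bill grows roughly quadratically with the number of turns. The token counts and prices below are purely illustrative, not any provider’s actual rates.
# Illustrative numbers only: a 10,000-token system prompt + tool schema prefix,
# 2,000 new tokens per agent turn, and made-up per-token prices.
PREFIX, PER_TURN = 10_000, 2_000
PRICE_FULL = 3.00 / 1_000_000    # hypothetical $ per uncached input token
PRICE_CACHED = 0.30 / 1_000_000  # hypothetical $ per cached input token

def input_cost(turns: int, cached: bool) -> float:
    total = 0.0
    for t in range(1, turns + 1):
        context = PREFIX + PER_TURN * (t - 1)        # everything re-sent this turn
        new_tokens = PREFIX if t == 1 else PER_TURN  # only this much is actually new
        if cached:
            total += (context - new_tokens) * PRICE_CACHED + new_tokens * PRICE_FULL
        else:
            total += context * PRICE_FULL            # whole prefix billed at full rate
    return total

for turns in (5, 20, 50):
    print(f"{turns:>3} turns: uncached ${input_cost(turns, False):6.2f}"
          f" vs cached ${input_cost(turns, True):6.2f}")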
Cost-Performance Frontier
The 150x Cost Advantage
Implications:
B2C AI becomes economically viable (currently mostly B2B due to costs)
Multiple SLM agents cheaper than single LLM call
Enables new business models previously impossible
Quality Threshold Debate
Current consensus:
Adequate for experimentation
Not yet production-ready for most applications
Quality gap expected to close within 1-2 years
Gaming sector: Still struggling to ship SLM-powered features
The deployment readiness spectrum:
Ready now: Summarisation, classification, simple extraction
Close: Customer support, basic agents
Not ready: Complex reasoning, multimodal generation, parallel agent orchestration
Model Context Protocol
Why MCP matters:
Abstracts away complex APIs (Microsoft Graph example)
Pre-built authentication flows
Rapid prototyping without API expertise
Standardised tool interface
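For flavour, here is roughly what a tool server looks like with the official MCP Python SDK’s FastMCP helper; the search_inbox tool is a hypothetical stub, whereas a real M365 server would call Microsoft Graph.
# pip install "mcp[cli]"  (the official MCP Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-email-tools")

@mcp.tool()
def search_inbox(query: str, limit: int = 5) -> list[str]:
    """Hypothetical tool: return subject lines matching a query.
    A real M365 server would call Microsoft Graph here."""
    return [f"(stub) result {i} for {query!r}" for i in range(limit)]

if __name__ == "__main__":
    mcp.run()  # serve the tool over stdio to any MCP client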
The Cybersecurity Shadow
Chris, drawing on his book, introduced a crucial caution: MCP creates supply-chain attack vectors similar to those in the NPM/pip package ecosystems:
Malicious MCP servers could exfiltrate credentials
Users connecting servers via Claude Desktop or GitHub Copilot may not understand risks
Analogous to the JavaScript package ecosystem’s security challenges
The tension: Ease of use vs. security. MCP democratises AI tool integration, but potentially at the cost of new attack surfaces.
Is MCP the Right Abstraction?
Marlene questioned whether MCP is even the right abstraction for providing context to SLMs. The underlying question: Should we be using RPC-style protocols for context at all?
Quantisation Quality Cliff
Model quantisation has a quality cliff:
16-bit quantisation: Acceptable quality
12-bit and below: “Absolute garbage”
Industry marketing problem: Everyone reports performance on 16-bit variants
Implications for edge deployment: The vaunted small size of SLMs may be illusory if quality degradation from aggressive quantisation makes them unusable. This suggests edge devices need more capable NPUs than currently assumed.
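Some rough memory arithmetic shows why the bit-width matters so much on-device (a ~14B-parameter model is assumed as a Phi-4-scale example, and real runtimes add KV-cache and activation overhead on top):
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    # Approximate memory needed just to hold the weights.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A ~14B-parameter model (roughly Phi-4 scale) at different precisions.
for bits in (16, 12, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(14, bits):.1f} GB of weights")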
Further Reading
The Phi Cookbook
Foundry Local (Microsoft’s offering for local models)
Samba (a new local model moving towards unlimited context)
Edge AI for Beginners (a Microsoft course on working with local models)
Have any feedback? Email me at akash@earlybird.com.



