Small Language Models & Context Engineering Roundtable
With Microsoft AI's Marlene Mhangami
Software Synthesis analyses the evolution of software companies in the age of AI - from how they're built and scaled, to how they go to market and create enduring value. You can reach me at akash@earlybird.com.
Gradient Descending Roundtables in London
November 18th: Agent Frameworks & Memory with Cloudflare
November 19th: Designing AI-Native Software with SPACING
This week, we hosted Marlene and Chris from Microsoft AI to discuss Small Language Models, Context Engineering and MCP. Thanks to everyone who came and made the discussion so insightful!
I’m sharing the summary of our discussion below.
Edge AI & Microsoft Phi Models
Microsoft’s SLM Strategy
Marlene presented Microsoft’s edge AI push, centered around the Phi model family (currently at v4):
Key components:
Phi 4 flagship: General-purpose SLM optimized for edge deployment
Phi 4 reasoning: A variant tuned for reasoning, with particular strength in mathematics
Foundry Local: Microsoft’s Ollama-equivalent platform for downloading and running local models, with a path to Azure cloud services
Competitive positioning:
Compared favourably to Qwen models (which many practitioners described as their “model of choice”)
Also competing with Mistral 7B, Gemma, and other distilled models
Cost advantage: Phi 4 is approximately 150x cheaper than GPT-4.5 for serverless compute
The Email Agent Case Study: A Context Engineering Example
Marlene’s email agent project serves as an excellent case study in the practical challenges of deploying SLMs:
Architecture:
User Query → Supervisor Agent → {
- Manage Email Agent
- Scheduling Event Agent
- Search Email History Agent (with Postgres + semantic search)
} → MCP Server (M365 tools) → Results
Critical Design Decision: Sub-agent Architecture
The most significant insight was the necessity of dividing context across specialised sub-agents rather than loading all MCP tools into a single agent. This addresses a fundamental problem: when you connect an MCP server with dozens of tools, the tool descriptions alone overwhelm the SLM’s context window, degrading performance catastrophically.
Solution strategies employed:
Tool segmentation: Each sub-agent receives only the tools relevant to its domain (sketched after this list)
Middleware layer: Custom JSON generation to work around Phi’s lack of native function calling (native support is reportedly coming soon)
Result summarisation: Critical challenge of managing tool output that floods context windows
Semantic search: Using Postgres for email history rather than raw context stuffing
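To make the pattern concrete, here is a minimal Python sketch of the tool-segmentation idea. The agent and tool names are hypothetical stand-ins (the real project routes to Phi via Foundry Local and an M365 MCP server); the point is that each sub-agent’s prompt only ever contains the tool descriptions for its own domain.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., str]

@dataclass
class SubAgent:
    """A sub-agent only ever sees the tools registered for its own domain."""
    name: str
    tools: list[Tool]

    def system_prompt(self) -> str:
        # Only this agent's tool descriptions enter the SLM's context window.
        lines = [f"- {t.name}: {t.description}" for t in self.tools]
        return f"You are the {self.name} agent. Available tools:\n" + "\n".join(lines)

# Hypothetical segmentation of an M365-style toolset by domain.
manage_email = SubAgent("manage-email", [
    Tool("send_email", "Send an email on the user's behalf", lambda **kw: "sent"),
    Tool("move_email", "Move an email into a folder", lambda **kw: "moved"),
])
scheduling = SubAgent("scheduling", [
    Tool("create_event", "Create a calendar event", lambda **kw: "created"),
])
ROUTES = {"email": manage_email, "calendar": scheduling}

def supervisor(user_query: str) -> SubAgent:
    # Toy router: the real supervisor would ask the SLM to classify the query.
    return ROUTES["calendar"] if "meeting" in user_query.lower() else ROUTES["email"]

print(supervisor("Schedule a meeting with Chris tomorrow").system_prompt())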
Performance metrics:
Response latency: ~2 seconds (surprisingly fast for local inference)
Key bottleneck: Not compute, but memory/storage on the device
The Gaming Paradigm
LLM Paradigm vs. SLM Paradigm
LLM world:
Time is abundant (10+ seconds acceptable)
Quality is paramount
Single-threaded workflows
Cloud-centric
Cost scales with usage
SLM world (especially gaming):
Sub-second latency requirements
Minimum viable quality threshold
Massively parallel workflows
Device-constrained
Fixed cost model
The GPU Budget War
Game developers traditionally allocate GPU budgets across departments (sound: 10%, graphics: 40%, etc.). AI is now demanding its own substantial budget allocation, forcing painful trade-offs. This explains why gaming hasn’t yet widely adopted generative AI - it’s not just a technical challenge but a fundamental resource reallocation problem.
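To put rough numbers on the trade-off (the department percentages are the illustrative ones above; the 60 fps frame budget and the 15% inference share are assumptions for the sake of the arithmetic):
# Rough frame-budget arithmetic for a 60 fps title: every millisecond an
# on-device model consumes has to come out of another department's share.
FRAME_MS = 1000 / 60  # ~16.7 ms per frame at 60 fps

budget = {"graphics": 0.40, "sound": 0.10, "physics_and_other": 0.50}

def with_inference(budget: dict[str, float], inference_share: float) -> dict[str, float]:
    # Scale the existing allocations down to make room for SLM inference.
    remaining = 1.0 - inference_share
    scaled = {name: share * remaining for name, share in budget.items()}
    scaled["slm_inference"] = inference_share
    return scaled

for name, share in with_inference(budget, inference_share=0.15).items():
    print(f"{name:>18}: {share * FRAME_MS:5.2f} ms/frame ({share:.0%})")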
Client-Side AI in Gaming: Current Reality vs. Future Vision
Current constraints:
Can typically run 1-2 SLMs on-device simultaneously
Most studios prioritise a single distilled model to maximise quality
Hybrid approaches: Pre-generate assets server-side, deliver real-time locally
Examples: “Operators” game doing text-to-speech with smart caching
Future paradigm shift:
Multiple specialised SLMs running in parallel
One for dialogue generation
One for text-to-speech
Others for translation, behaviour trees, etc.
The killer use case: NPC conversations
Imagine walking into a room where 10 NPCs are holding simultaneous conversations. Cloud models can’t scale to this (10 parallel LLM calls per interaction is a cost explosion), but device-based SLMs could handle multiple concurrent agents efficiently.
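A hedged sketch of what that could look like: local_slm below is a stub standing in for an on-device model call (via Foundry Local, llama.cpp or similar), and ten NPC agents generate dialogue concurrently without a single cloud round-trip.
import asyncio
import random

async def local_slm(prompt: str) -> str:
    # Stand-in for an on-device SLM call (e.g. a model served by Foundry Local).
    await asyncio.sleep(random.uniform(0.1, 0.5))  # fake sub-second local latency
    return f"reply to: {prompt!r}"

async def npc_turn(npc_name: str, scene: str) -> str:
    prompt = f"You are {npc_name}. React briefly to the scene: {scene}"
    return f"{npc_name}: {await local_slm(prompt)}"

async def room_scene() -> None:
    scene = "The player bursts through the tavern door."
    npcs = [f"npc_{i}" for i in range(10)]
    # All ten NPC agents run concurrently on-device; ten parallel frontier-model
    # calls per interaction would be prohibitively expensive in the cloud.
    for line in await asyncio.gather(*(npc_turn(n, scene) for n in npcs)):
        print(line)

asyncio.run(room_scene())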
The Streaming Gaming Convergence
A strategic insight emerged about cloud gaming services (AWS Luna, etc.): they create an opportunity to co-locate GPU streaming and GPU inference, enabling hybrid architectures where:
Latency-critical elements run locally
High-fidelity generation happens cloud-side but close to streaming source
New compression paradigms could enable prompt-based video streaming
Context Engineering: The Central Challenge
Context management is the defining challenge for SLM deployment:
Problem 1: Tool Description Overload
Loading all MCP tools overwhelms context window
Model performance degrades even before executing any tools
Anthropic’s recent blog post confirmed this industry-wide issue
Problem 2: Tool Result Flooding
Example: “Find emails from past 3 months” returns massive data
Keeping results in context causes model failure
Need to balance information preservation with context limits
Mitigation Strategies Discussed
Architectural solutions:
Sub-agent decomposition (Marlene’s approach)
Virtual file systems (Cloudflare’s code mode approach)
Durable execution patterns (Temporal workflows)
Data management:
Result trimming (lossy but pragmatic)
Summarisation layers (risk of losing critical context)
Semantic indexing (the Postgres approach; a sketch follows this list)
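As an illustration of the semantic-indexing route, here is a minimal sketch assuming Postgres with the pgvector extension, psycopg 3, and some embedding function; the discussion only specified “Postgres plus semantic search”, so the schema and libraries here are assumptions.
import psycopg  # assumes psycopg 3 and a Postgres instance with the pgvector extension

def embed(text: str) -> list[float]:
    # Placeholder: swap in a real embedding model, ideally a small local one.
    raise NotImplementedError

def search_email_history(conn: psycopg.Connection, query: str, k: int = 5) -> list[str]:
    # Return only the k most relevant snippets instead of stuffing three months
    # of raw email text into the SLM's context window.
    rows = conn.execute(
        """
        SELECT subject || ': ' || body
        FROM emails
        ORDER BY embedding <-> %s::vector   -- pgvector L2 distance
        LIMIT %s
        """,
        (str(embed(query)), k),
    ).fetchall()
    return [r[0] for r in rows]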
Emerging solutions:
Samba model: New Microsoft Research model claiming “unlimited context windows” through novel compression/persistence mechanisms
Middleware patterns for context transformation
The Agent vs. Workflow Debate
Agent-heavy approaches:
Tool-calling paradigm with autonomous decision-making
Fills context windows quickly
Unpredictable latency
Popular but problematic for constrained environments
Workflow/chaining approaches:
Airflow-style deterministic pipelines
More predictable, cheaper, faster
Better for real-time requirements
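A rough sketch of the contrast: a deterministic chain fixes its steps (and therefore its latency and cost) up front, while an agent loop lets the model decide how many tool calls to make, so its context and cost grow unpredictably. The function names are illustrative.
# Airflow-style deterministic chain: fixed steps, bounded latency and cost.
def summarise_inbox_workflow(fetch, summarise, notify) -> None:
    emails = fetch(since_days=1)   # step 1: always runs exactly once
    digest = summarise(emails)     # step 2: exactly one model call
    notify(digest)                 # step 3: deterministic side effect

# Agent-style loop: the model chooses tools until it decides it is done, so the
# number of model calls (and the context growth) is unbounded.
def agent_loop(model, tools, goal: str, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = model("\n".join(history))   # context grows every turn
        if action.get("done"):
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']} -> {result}")
    return "gave up"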
Hardware Evolution: The Coming NPU Revolution
The CUDA Parallel
When Nvidia first released CUDA, AMD had faster chips, and people questioned dedicating silicon to such a niche use case. Today, CUDA lock-in dominates AI infrastructure.
The NPU trajectory will follow a similar path:
Current state: Hardware lagging software (unusual reversal)
Microsoft shipping next-gen PCs with NPUs optimized for inference
Gaming consoles (PlayStation, Xbox) will include AI-specific silicon
Mobile devices (especially iPhones) already have powerful NPUs
AI is already embedded in every layer of computing (OS, browser, applications). Without optimized hardware, most features become unusable. This creates an inevitable forcing function for NPU adoption.
Infrastructure Layer: Inference Providers & Trade-offs
Groq/Cerebras Discussion
The conversation revealed nuanced understanding of specialized inference providers:
Groq advantages:
Extreme speed (2-second responses vs. 20 seconds for Anthropic)
Quality now on par with frontier models
Critical for customer support, voice agents
Cost-effective for fire-and-forget tasks
Groq/Cerebras limitations:
No prompt caching (due to architectural choices around on-chip memory)
Makes agentic workflows economically unviable
Can’t support coding workflows that rely on caching
Optimal use cases:
Summarisation (no caching benefit anyway)
Single-pass tasks
Speed-critical applications without iterative loops
The Caching Dependency
The entire agentic paradigm is built on prompt caching. Providers without caching are fundamentally non-viable for agent workflows, regardless of speed advantages. This suggests prompt caching has become infrastructure-critical, not just an optimisation.
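A back-of-the-envelope sketch of why: an agent loop re-sends a growing prefix every turn, so without prefix caching the input-token bill grows roughly quadratically with the number of turns. The token counts and prices below are purely illustrative, not any provider’s actual rates.
# Illustrative numbers only: a 10,000-token system prompt + tool schema prefix,
# 2,000 new tokens per agent turn, and made-up per-token prices.
PREFIX, PER_TURN = 10_000, 2_000
PRICE_FULL = 3.00 / 1_000_000    # hypothetical $ per uncached input token
PRICE_CACHED = 0.30 / 1_000_000  # hypothetical $ per cached input token

def input_cost(turns: int, cached: bool) -> float:
    total = 0.0
    for t in range(1, turns + 1):
        context = PREFIX + PER_TURN * (t - 1)        # everything re-sent this turn
        new_tokens = PREFIX if t == 1 else PER_TURN  # only this much is actually new
        if cached:
            total += (context - new_tokens) * PRICE_CACHED + new_tokens * PRICE_FULL
        else:
            total += context * PRICE_FULL            # whole prefix billed at full rate
    return total

for turns in (5, 20, 50):
    print(f"{turns:>3} turns: uncached ${input_cost(turns, False):6.2f}"
          f" vs cached ${input_cost(turns, True):6.2f}")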
Cost-Performance Frontier
The 150x Cost Advantage
Implications:
B2C AI becomes economically viable (currently mostly B2B due to costs)
Multiple SLM agents cheaper than single LLM call
Enables new business models previously impossible
Quality Threshold Debate
Current consensus:
Adequate for experimentation
Not yet production-ready for most applications
Quality gap expected to close within 1-2 years
Gaming sector: Still struggling to ship SLM-powered features
The deployment readiness spectrum:
Ready now: Summarisation, classification, simple extraction
Close: Customer support, basic agents
Not ready: Complex reasoning, multimodal generation, parallel agent orchestration
Model Context Protocol
Why MCP matters:
Abstracts away complex APIs (Microsoft Graph example)
Pre-built authentication flows
Rapid prototyping without API expertise
Standardised tool interface
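For flavour, here is roughly what a tool server looks like with the official MCP Python SDK’s FastMCP helper; the search_inbox tool is a hypothetical stub, whereas a real M365 server would call Microsoft Graph.
# pip install "mcp[cli]"  (the official MCP Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-email-tools")

@mcp.tool()
def search_inbox(query: str, limit: int = 5) -> list[str]:
    """Hypothetical tool: return subject lines matching a query.
    A real M365 server would call Microsoft Graph here."""
    return [f"(stub) result {i} for {query!r}" for i in range(limit)]

if __name__ == "__main__":
    mcp.run()  # serve the tool over stdio to any MCP client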
The Cybersecurity Shadow
Chris, drawing on his book, introduced a crucial caution: MCP creates supply-chain attack vectors similar to those in the NPM/pip package ecosystems:
Malicious MCP servers could exfiltrate credentials
Users connecting servers via Claude Desktop or GitHub Copilot may not understand risks
Analogous to the JavaScript package ecosystem’s security challenges
The tension: Ease of use vs. security. MCP democratises AI tool integration, but potentially at the cost of new attack surfaces.
Is MCP the Right Abstraction?
Marlene questioned whether MCP is even the right abstraction for providing context to SLMs. The underlying question: Should we be using RPC-style protocols for context at all?
Quantisation Quality Cliff
Model quantisation has a quality cliff:
16-bit quantisation: Acceptable quality
12-bit and below: “Absolute garbage”
Industry marketing problem: Everyone reports performance on 16-bit variants
Implications for edge deployment: The vaunted small size of SLMs may be illusory if quality degradation from aggressive quantisation makes them unusable. This suggests edge devices need more capable NPUs than currently assumed.
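Some rough memory arithmetic shows why the bit-width matters so much on-device (a ~14B-parameter model is assumed as a Phi-4-scale example, and real runtimes add KV-cache and activation overhead on top):
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    # Approximate memory needed just to hold the weights.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A ~14B-parameter model (roughly Phi-4 scale) at different precisions.
for bits in (16, 12, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(14, bits):.1f} GB of weights")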
Further Reading
The Phi Cookbook
Foundry Local (Microsoft’s offering for local models)
Samba (a new local model moving towards unlimited context)
Edge AI for Beginners (a Microsoft course on working with local models)
Have any feedback? Email me at akash@earlybird.com.



