Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Codifying Tribal Knowledge Into Vertical-Specific Reasoning
Software Synthesis analyses the evolution of software companies in the age of AI - from how they're built and scaled, to how they go to market and create enduring value. You can reach me at akash@earlybird.com.
Gradient Descending Roundtables in London
November 12th: Small Language Models, Context Engineering and MCP with Microsoft Azure AI
November 18th: Agent Frameworks & Memory with Cloudflare
November 19th: Designing AI-Native Software with SPACING
Last week, Cursor finally released their first frontier coding model, Composer 1 Alpha.
Many are speculating whether the model is a fine-tuned Chinese MoE model, which seems plausible given Cursor’s stance on where their edge lies:
Our primary focus is on RL post-training. We think that is the best way to get the model to be a strong interactive agent.
Earlier that morning, we hosted Aidan and Matt in our office to discuss Scale AI’s recently published research on ‘Rubrics as Rewards’ and its implications for post-training base models.
I’ve written at length about how application-layer companies (or ‘agent labs’) capture reward signals that are becoming increasingly valuable as the cost and complexity of post-training collapse thanks to fine-tuning APIs and managed infra.
‘Reward Engineering’ has proven to be one of the defining AI themes of the year, as companies like Thinking Machines Lab, Applied Compute, Osmosis and others emerged with a value proposition of post-training custom models underpinned by high-quality reward data. This is all against a backdrop of rapid advances in large model capabilities across objectively verifiable domains like coding and mathematics (where the reward is binary: a proposed code change either runs or it doesn’t).
As soon as this vector of scaling RL became clear, the immediate next question was how to apply it to non-verifiable domains where subjectivity plays a bigger role.
That’s what the Scale paper focuses on, proposing a richer way of capturing tribal knowledge and reasoning than simple RLHF or preference ranking. In effect, this is closer to a process reward model that rewards not only the final outcome but also the steps taken to get there - and the results are striking.
Below are the notes from our discussion. The relevant papers are here and here - you can also see examples of Rubrics here on the Scale website.
Evolution of Model Training: Post-Training
The progression of post-training approaches:
Supervised Fine-Tuning (SFT) - Early approach
Provides prompt-response pairs with “correct” answers
Useful for smaller, specialised models
Becomes less effective for larger, more generalised models
Can lead to overfitting in complex reasoning scenarios
RLHF (Reinforcement Learning from Human Feedback) - Current approach
Humans select preferred responses between model outputs
Effective for simple queries but struggles with multi-step reasoning
Volume-intensive and requires extensive human evaluation
Risk of reward hacking
Rubrics-Based Approach - New approach
Combines expert knowledge with scalable automation
Particularly valuable for unverifiable domains
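To make the contrast concrete, here is roughly what a single training example looks like under the first two regimes - the field values are placeholders, and the rubric equivalent is sketched in the next section:

```python
# Illustrative data shapes only - not tied to any specific library or paper.

# SFT: imitate a "correct" reference answer.
sft_example = {
    "prompt": "Summarise the attached filing.",
    "response": "A reference answer written or vetted by a human.",
}

# RLHF: a human picks the better of two model outputs; the reward model
# is trained on these pairwise preferences.
rlhf_example = {
    "prompt": "Summarise the attached filing.",
    "chosen": "The output the labeller preferred.",
    "rejected": "The output the labeller rejected.",
}
```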
The Continuous Evaluation Loop
Training follows a cyclical process:
Evaluation → Data Production → Performance Improvement → Re-evaluation
Typically runs in 6-month cycles with 2-3 iterations per model development phase
Requires constant creation of new data based on previous evaluation results
Includes adversarial components (red teaming) to identify security and safety weaknesses
Understanding Rubrics as Rewards
What Are Rubrics?
Rubrics are structured evaluation frameworks consisting of:
Binary or weighted criteria that can be objectively assessed
Multiple independent factors that together evaluate quality
Specific rules stating what makes an ideal response
Examples to help models understand expectations
Simple Example Structure
For a basic prompt, a rubric might include:
Must address the core question
Should provide specific details
Must avoid certain pitfalls
Should follow appropriate tone/style
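As a minimal sketch, the simple rubric above might be represented as data like this - the criterion wording and weights are hypothetical:

```python
# A hypothetical rubric: independent criteria, each binary or weighted.
rubric = {
    "prompt": "What are the main drivers of EV adoption?",
    "criteria": [
        {"description": "Addresses the core question directly", "weight": 0.4},
        {"description": "Provides at least two specific details or figures", "weight": 0.3},
        {"description": "Avoids unsupported or speculative claims", "weight": 0.2},
        {"description": "Uses a professional, neutral tone", "weight": 0.1},
    ],
}
```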
Complex Professional Example
In an investment context (e.g., market entry analysis for EVs in South America):
20-30 different evaluation components
Checks for specific data points (market size, timelines, partnerships)
Validates reasoning steps
Assesses structural clarity and relevance
Can be binary (present/not present) or weighted
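For the weighted case, aggregation can be as simple as a normalised weighted sum over per-criterion verdicts. A sketch - the exact aggregation rule here is my assumption, not the paper’s formula:

```python
def score_response(criterion_results, criteria):
    """Collapse per-criterion verdicts (bools, or floats in [0, 1]) into a
    single scalar in [0, 1] using the expert-assigned weights."""
    total_weight = sum(c["weight"] for c in criteria)
    weighted = sum(c["weight"] * float(r) for c, r in zip(criteria, criterion_results))
    return weighted / total_weight
```

With the four-criterion rubric above, `score_response([True, True, False, True], rubric["criteria"])` returns 0.8.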
Three Key Advantages of Rubrics
1. Mitigating Reward Hacking
The Problem: When using simple preference selection, models may learn unintended patterns
Example: Accidentally training preference for “red cars over blue cars” when color wasn’t the intended differentiator
Human evaluators have intrinsic biases that can skew preference data
How Rubrics Help:
Break down evaluation into specific, objective criteria
Exclude components that might drive unintended behavior
Allow larger volumes of diverse data to be generated
A recent Scale AI paper empirically demonstrated reduced reward hacking with rubrics
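A cheap diagnostic in this spirit - checking whether a reward signal is really just tracking verbosity, one of the most common hacks - might look like this (my sketch, not a check the paper prescribes):

```python
import statistics

def length_bias(responses, rewards):
    """Pearson correlation between response length and reward.
    A value near 1.0 suggests the reward can be hacked by padding answers.
    Requires Python 3.10+ for statistics.correlation."""
    lengths = [len(r.split()) for r in responses]
    return statistics.correlation(lengths, rewards)
```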
2. Adaptability
The Challenge: Models need constant refinement based on real-world performance
Errors emerge during deployment
Requirements change based on user feedback
New security issues are discovered
Rubrics Solution:
Much easier to tweak a rubric than retrain thousands of human evaluators
Can quickly adjust weighting or add new criteria
Enables faster iteration cycles
More agile response to identified problems
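In practice, the iteration loop can literally be a data edit. A hypothetical helper to illustrate how cheap the change is compared to re-collecting thousands of preference labels:

```python
import copy

def adjust_rubric(rubric, reweight=None, add=None, drop=None):
    """Return an edited copy of a rubric: drop criteria, change weights,
    append new criteria. No retraining of human evaluators required."""
    updated = copy.deepcopy(rubric)
    drop = drop or set()
    reweight = reweight or {}
    updated["criteria"] = [
        c for c in updated["criteria"] if c["description"] not in drop
    ]
    for c in updated["criteria"]:
        if c["description"] in reweight:
            c["weight"] = reweight[c["description"]]
    updated["criteria"].extend(add or [])
    return updated
```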
3. Domain-Specific Control
Different domains require different priorities:
Creative writing: Style, tone, human-like quality matter most
Mathematics: Correct answer and logical reasoning are paramount
Professional contexts: Structural clarity, verifiable data, reasoning transparency
Rubrics enable precise weighting of different evaluation components based on what matters in each specific domain.
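Concretely, the same criterion categories can carry different weights per domain - the numbers below are invented purely for illustration:

```python
# Hypothetical per-domain weight profiles over shared criterion categories.
DOMAIN_WEIGHTS = {
    "creative_writing": {"style": 0.50, "factual_accuracy": 0.10, "structure": 0.15, "reasoning": 0.25},
    "mathematics":      {"style": 0.05, "factual_accuracy": 0.45, "structure": 0.10, "reasoning": 0.40},
    "professional":     {"style": 0.10, "factual_accuracy": 0.30, "structure": 0.30, "reasoning": 0.30},
}
```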
The Role of LLMs in Rubric Evaluation
Automation Through AI
Humans design the rubrics (expert knowledge)
LLMs automate the evaluation against those rubrics
This creates scalable volume while maintaining expert-defined standards
Why This Works
LLMs are particularly good at:
Checking if specific elements are present
Verifying binary conditions
Following clear, objective rules
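A minimal sketch of the judge step using the OpenAI client - the model choice and prompt wording are my assumptions, not the paper’s setup:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_criterion(prompt, response, criterion, model="gpt-4o-mini"):
    """Ask the LLM judge one binary question about one rubric criterion."""
    judge_prompt = (
        f"Prompt:\n{prompt}\n\n"
        f"Response:\n{response}\n\n"
        f"Criterion: {criterion}\n"
        "Does the response satisfy this criterion? Answer YES or NO only."
    )
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return out.choices[0].message.content.strip().upper().startswith("YES")
```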
Model Selection for Evaluation
For research purposes (as in the paper): used GPT-4o mini as the LLM judge to ensure standardisation
For production: Smaller models can work if rubrics are well-designed
Frontier labs building cutting-edge models still prefer human-written rubrics
Smaller budget projects can use LLM-generated rubrics with acceptable results
Practical Applications
Investment Due Diligence
Multi-step reasoning example:
Initial analysis of market conditions
Financial projections
Partnership identification
Risk assessment
Each step can have its own rubric
Important to maintain observable reasoning pathways
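Sketching the ‘one rubric per step’ idea, reusing judge_criterion and score_response from above - the step names and criteria are hypothetical:

```python
# Hypothetical per-step rubrics for a multi-step due-diligence workflow.
STEP_RUBRICS = {
    "market_analysis": [
        {"description": "Quantifies market size with a sourced figure", "weight": 0.5},
        {"description": "Names at least two local incumbents or competitors", "weight": 0.5},
    ],
    "risk_assessment": [
        {"description": "Identifies regulatory and currency risk", "weight": 0.6},
        {"description": "Proposes a mitigation for each risk raised", "weight": 0.4},
    ],
}

def score_workflow(prompt, step_outputs):
    """Score each intermediate step so the reasoning pathway stays observable."""
    scores = {}
    for step, rubric in STEP_RUBRICS.items():
        text = step_outputs.get(step)
        if text is None:
            continue
        verdicts = [judge_criterion(prompt, text, c["description"]) for c in rubric]
        scores[step] = score_response(verdicts, rubric)
    return scores
```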
Healthcare/Medicine
Unverifiable domain characteristics:
No single “correct” diagnosis for complex presentations
Expert intuition difficult to codify
Rubrics help capture decision-making criteria
Can evaluate diagnostic reasoning steps
Legal Services
Contract analysis requirements:
Specific clause identification
Risk assessment
Compliance checking
Precedent application
Insurance
Knowledge extraction challenge:
Much expertise exists only “in people’s heads”
Retiring workforce creates knowledge drain
Rubrics help codify expert decision-making
Quality assurance already uses similar scoring systems
Technical Details from the Research
Experimental Setup
Base model: Qwen 2
Comparison: Rubrics approach vs. direct preference ranking
Evaluation: GPT-4o mini as judge for standardisation
Training: Offline RL (not online due to cost)
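Piecing that setup together, the offline pipeline plausibly looks like the sketch below: sample several responses per prompt, score each against the rubric, then train on the scored triples. This is my reconstruction from the notes, not the paper’s code:

```python
def build_offline_dataset(prompts, sample_fn, rubric, n_samples=4):
    """Collect (prompt, response, reward) triples for offline RL or
    reward-weighted fine-tuning. sample_fn is any generation function,
    e.g. a call to the base model; judge_criterion and score_response
    are the helpers sketched earlier."""
    dataset = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample_fn(prompt)
            verdicts = [
                judge_criterion(prompt, response, c["description"])
                for c in rubric["criteria"]
            ]
            reward = score_response(verdicts, rubric["criteria"])
            dataset.append({"prompt": prompt, "response": response, "reward": reward})
    return dataset
```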
Performance Results
The rubrics approach showed improvements in:
Context awareness
Communication quality
Accuracy in reasoning-heavy tasks
Key Finding
Rubrics are particularly effective for:
Complex, multi-step reasoning
Domains without clear “correct” answers
Professional/expert knowledge domains
Tasks requiring observable reasoning chains
Critical Considerations and Debates
Evaluating the Evaluators
The recursion problem:
How do you evaluate whether rubrics are good?
Ultimately requires testing impact on model performance
Need comprehensive evaluation datasets for each domain
Iterative refinement based on real-world results
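One concrete instantiation of ‘testing impact’ that stops short of a full training run: rank-correlate rubric scores against held-out expert ratings on the same responses. A sketch, assuming you have both sets of scores:

```python
from scipy.stats import spearmanr  # pip install scipy

def rubric_agreement(rubric_scores, human_ratings):
    """Spearman rank correlation between rubric scores and expert ratings.
    Low agreement is a signal the rubric needs another design iteration."""
    rho, p_value = spearmanr(rubric_scores, human_ratings)
    return rho, p_value
```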
Human vs. Automated Rubric Creation
For frontier performance: Human-written rubrics still superior
For smaller models/budgets: LLM-generated rubrics can work
Key insight: The intersection of domain expertise and understanding of model behaviour is rare and valuable
Market Implications
The Changing Nature of AI Data Work
From volume to expertise:
Less emphasis on mass preference labelling
More focus on expert rubric design
Smaller teams of highly skilled data creators
Higher per-person cost but better outcomes
Professional Knowledge Extraction
The hidden expertise problem:
Most professional knowledge isn’t documented
Exists in practitioners’ intuition and experience
Examples: Insurance brokers, doctors, lawyers, investment analysts
Rubrics provide a framework to codify this tacit knowledge
Future Opportunities
Industries with retiring workforces (insurance, legal)
Domains with complex, multi-step reasoning
Applications requiring explainable AI decisions
Situations where “average” data isn’t sufficient
Autonomous Improvement Discussion
The Holy Grail Question
Can frontier models achieve fully autonomous self-improvement using only AI-generated rubrics?
Current state:
DeepSeek recently claimed full self-improvement in specific domains
Depends heavily on model size and domain complexity
Still an open research question
Likely varies significantly by application
Hybrid Approaches
Most realistic near-term:
Human-designed rubrics for frontier domains
LLM-automated evaluation and data generation
Human oversight for refinement and validation
Iterative improvement cycles
Key Takeaways
Rubrics represent a middle ground between fully manual RLHF and purely automated approaches
Most effective for unverifiable domains where there’s no single correct answer but expert judgment exists
Enables scaling expert knowledge by codifying it into structured, scalable evaluation criteria
Reduces but doesn’t eliminate human involvement - shifts humans to higher-level rubric design rather than individual preference selection
Particularly valuable for professional applications in law, medicine, finance, and other expert domains
The data market is shifting from volume-based preference ranking to expert-driven, rubric-based approaches
Quality of rubric design is critical - requires intersection of domain expertise and understanding of AI model behavior
Future Directions
Application to visual language models and robotics
Integration with existing quality assurance frameworks
Automated rubric generation for non-frontier applications
Expansion into multimodal domains
Development of standardised rubric libraries for common domains
Signals
What I’m Reading
Top Down and Bottom Up Investors
The Case for the Return of Fine-Tuning
Is Diffusion the Future of LLMs?
Marc Andreessen and Charlie Songhurst on the past, present, and future of Silicon Valley
Europe’s Hidden Sovereign AI Gem: Nebius
Earnings Commentary
“Today, virtually every business is becoming a software business, and AI has made software easier than ever to create. In this world, we believe your design, your craft and your brand’s point of view is what’s going to make your product and your company stand out. Design is now the differentiator. It’s how companies win or lose.”
Dylan Field, Figma Q2 Earnings Call
When people hear [Data Cloud], they naturally assume this must be a Snowflake competitor or a Databricks competitor, and that’s just not the case. Snowflake and Databricks and BigQuery and Redshift are among our biggest partners.
Salespeople do not log in to Snowflake. Snowflake is fantastic, but it’s really analysts that tend to log into Snowflake, not salespeople. People in the contact center do not log into Databricks... That’s the problem that Data 360 really, really, really solves.
Stephen Fisher, Salesforce Analyst Day
Our most recent Adobe Digital Index data which is based on online transactions across over 1 trillion visits to U.S. retail sites, shows that LLM traffic grew 4,700% year-over-year in July 2025. The rapid changes in consumer behavior and expectations in the era of AI are forcing brands to reinvent marketing and customer experience.
Anil Chakravarthy, Adobe Q3 Earnings Call
We estimate 80% of the leading AI companies already rely on us. A huge percentage of the Internet sits behind us. The agents of the future will inherently have to pass through our network and abide by its rules. And as they do, we will help set the protocols, guardrails and business rules for the Agentic Internet of the future.
Matthew Prince, Cloudflare Q3 Earnings Call
Machines simply can’t govern themselves, AI is like any other enterprise asset, it needs to be cataloged, tracked, supervised and secured. ServiceNow’s configuration management leadership gives us and our customers a clean single pane of glass to govern all artificial intelligence.
Bill McDermott, ServiceNow Q3 Earnings Call
Have any feedback? Email me at akash@earlybird.com.