Skip to main content

One post tagged with "metrics"

View All Tags

· 8 min read
VibeGov Team

AI is producing a weird measurement problem.

This is the first piece in a short VibeGov series about AI throughput, governance, budgets, and organizational control. It sets the foundation for the rest: tokens, governance movement, and delivered value are different layers, and teams get into trouble when they treat them as the same thing.

A lot of people now casually claim that AI gives developers 10x leverage. Maybe it does in some contexts. Maybe it does not in others. But if the claim is going to mean anything operationally, the gain should show up somewhere more concrete than vibes.

The tempting answer is tokens. If models are doing more work, then token usage should tell us how much extra throughput we are getting.

That sounds reasonable for about five minutes.

After that, it collapses.

A team can burn through huge amounts of context and still produce:

  • unclear issues
  • weak specs
  • unverified implementation
  • stalled reviews
  • false completion claims
  • expensive confusion

So the problem is not that tokens are meaningless. The problem is that tokens are being asked to do a job they are not good at.

Tokens are fuel, not throughput

The cleanest way to think about AI usage is this:

  • tokens are input / fuel
  • governance movement is throughput
  • delivered outcome is value

Those are not the same thing.

This matters because a lot of AI measurement talk quietly collapses them into one blurry number. More tokens become more work. More work becomes more productivity. More productivity becomes more value.

That chain breaks all the time.

A model can consume a large budget while doing low-quality search, retrying avoidable mistakes, or wandering around an under-specified problem. A smaller, well-governed run can move work much further with fewer tokens because the issue is clearer, the spec is tighter, and the evidence path is already defined.

That is why token burn alone is a poor productivity metric. It measures effort expended more reliably than progress achieved.

Why token counts are still useful

Rejecting tokens as a standalone productivity metric does not mean ignoring them.

Token usage still tells you useful things about a system:

  • cost pressure
  • orchestration overhead
  • prompt inefficiency
  • context drag
  • model verbosity
  • retry churn
  • search breadth

Those are real operational signals. They just are not the same thing as throughput.

Counting tokens as productivity is a bit like counting fuel burned by a delivery truck. The fuel matters. It affects cost, efficiency, and route design. But it does not tell you whether the right packages arrived at the right places in a usable state.

What throughput should mean in AI-native delivery

If AI is part of real delivery, then throughput should be measured by movement through governed work.

That means asking questions like:

  • Did a vague intake item become a real issue?
  • Did the issue get bound to a requirement or spec?
  • Did implementation stay inside scope?
  • Did validation actually run?
  • Did blockers get surfaced instead of hidden?
  • Did the work reach PR, review, merge, and release-readiness?
  • Were follow-up gaps captured instead of disappearing into chat?

That is throughput. Not because it is bureaucratic, but because it reflects actual work becoming safer, clearer, and closer to ship.

In a governed system, movement is visible. You can see work progress from:

  • idea
  • issue
  • spec
  • implementation
  • verification
  • review
  • release candidate
  • shipped result
  • follow-up backlog

That visibility matters more in AI-assisted delivery, not less. AI can generate activity extremely quickly. Without governance, that speed can multiply ambiguity just as easily as it multiplies useful output.

Governance movement is the output signal

A practical measurement model for AI-native teams should separate three layers.

1. Effort / input

Examples:

  • tokens consumed
  • runtime spend
  • tool calls
  • elapsed model time
  • retries and restarts

Useful for:

  • cost management
  • efficiency tuning
  • routing decisions
  • identifying churn

2. Throughput / governed progress

Examples:

  • issues clarified
  • requirements bound
  • specs created or updated
  • validations passed
  • blockers routed
  • PRs opened
  • PRs merged
  • release-readiness checks completed

Useful for:

  • delivery measurement
  • backlog movement
  • execution quality
  • team/system effectiveness

3. Delivered value

Examples:

  • shipped outcomes
  • risk reduced
  • incidents avoided
  • user problems solved
  • business constraints removed

Useful for:

  • strategic prioritization
  • ROI discussion
  • portfolio decisions

These layers should inform each other, but they should not be confused.

A team with low token spend and no governed movement is not efficient. A team with huge token spend and no shipped outcomes is not productive. A team with strong governed movement but weak value selection may be operating well on the wrong things.

Different failures live at different layers. That is exactly why the layers should stay separate.

The quadrants teams should watch

Once tokens and governance movement are split apart, the picture gets much clearer.

High token use, low governance movement

Usually means:

  • churn
  • vague requirements
  • poor orchestration
  • too much search, not enough convergence
  • hidden blocker loops

Low token use, high governance movement

Usually means:

  • clear issues
  • strong specs
  • tight execution
  • efficient validation
  • disciplined scope

High token use, high governance movement

Usually means:

  • expensive but productive work
  • sometimes justified on hard or ambiguous problems
  • worth optimizing, not dismissing

Low token use, low governance movement

Usually means:

  • under-engagement
  • stalled delivery
  • low urgency
  • blocked or abandoned work

That is a much more useful operating picture than pretending token totals alone are a scoreboard.

Progress over perfection

AI-native delivery creates a new temptation: teams can generate enough activity to simulate momentum.

That makes perfection theater strangely easy. It also makes false precision easy. A team can produce impressive-looking drafts, long transcripts, and massive token counts while staying weak on the thing that matters most: governed progress.

A better principle is progress over perfection.

That does not mean lowering standards. It means measuring whether work is moving through real gates:

  • from ambiguity into issues
  • from issues into spec binding
  • from implementation into evidence
  • from blockers into explicit follow-up
  • from review into trustworthy status

In other words, do not reward volume. Reward visible movement toward validated outcomes.

This is one reason VibeGov treats governed artifacts as important:

  • issue quality
  • spec binding
  • validation evidence
  • checkpoint honesty
  • blocker routing
  • traceable completion

Those things make progress legible. And once progress is legible, throughput becomes measurable in a way that survives contact with reality.

What organizations should actually track

A useful AI delivery scorecard probably mixes all three layers.

Input metrics

  • tokens consumed
  • model/runtime cost
  • average run length
  • retries per task
  • context size

Throughput metrics

  • issues advanced to implementation-grade quality
  • spec gaps closed
  • validations passed
  • PRs opened and merged
  • release checks passed
  • blocker turnaround time

Quality and risk metrics

  • regressions introduced
  • reopen rate
  • false completion rate
  • post-merge correction rate
  • residual risk left untracked

Over time, teams can also look at ratio metrics such as:

  • tokens per validated issue
  • tokens per passed governance gate
  • tokens per merged PR
  • cost per release-ready increment

Those ratios are imperfect. That is fine. They are still more honest than pretending raw token consumption is the same thing as productivity.

The real question

The wrong question is:

How much did the AI say?

A better question is:

How much governed work moved forward because of it?

That is the measurement shift AI-native teams need.

Tokens matter. They affect cost, efficiency, and operating model design. But tokens are fuel. Throughput is what gets through the gates. And value is what survives after the gates were worth crossing in the first place.

If AI is going to change software delivery in a serious way, we should expect serious measurement in return. Not activity theater. Not giant prompt transcripts mistaken for proof. Not cost without throughput, or throughput without value.

Just a clearer model:

  • input
  • governed progress
  • delivered outcome

That is a better foundation for the next stage of AI-native delivery.

Series navigation

The next pieces in this series take that model outward:

  • budgets as delivery infrastructure
  • company-governed runtime as a delivery requirement
  • progress over perfection as an operating discipline
  • unbudgeted AI as unmanaged production capacity