The Measurement Problem: When Your Metrics Reward the Wrong Behavior

Last year, I sat in a quarterly business review where an engineering director presented what he called “the best quarter in the team’s history.” Velocity was up 42%. Pull requests per developer had nearly doubled. Sprint burndowns were textbook smooth. The slides were polished, the trend lines all pointed up, and the room was nodding along.

Then the VP of Product asked a simple question: “So why are we three weeks late on the feature customers are actually waiting for?”

The room went quiet. Not because nobody had an answer, but because the answer exposed something uncomfortable: every metric on the dashboard was improving while the thing that mattered to the business was getting worse. The team had adopted AI coding assistants six months earlier, and the numbers had responded exactly the way everyone hoped. More code. More velocity. More throughput. The dashboard was a success story. The product roadmap was not.

This is the measurement problem, and it is more common than most engineering leaders want to admit. It is not a tooling issue. It is not a calibration issue. It is a structural failure in how we measure software development, one that AI adoption doesn’t create but exposes with uncomfortable clarity.

The Metrics Were Already Fragile

Let me be direct about something: the metrics most engineering organizations rely on were never as reliable as we treated them. Velocity in story points, lines of code, sprint burndown charts, pull requests per developer. These were always proxies, rough approximations of progress that worked well enough when the relationship between effort and output was relatively stable.

AI breaks that stability.

Story points were designed to capture relative effort in a world where implementation complexity varied dramatically across tasks. A “1” was a config change. An “8” was a new service with database integration, API design, and test coverage. AI collapses that range. When AI generates the implementation, the remaining human work (reviewing, validating, deciding) is more uniform across tasks. The “8” becomes a “2” not because the work is simpler, but because the mechanical portion has been automated. Velocity goes up, but the unit of measurement has changed. It is like measuring distance in miles and then switching to kilometers without telling anyone: the number is bigger, but you haven’t gone farther.

I worked with a healthcare company where this played out over six months. Their velocity increased from an average of 45 points per sprint to 78 points. Leadership celebrated. But when I asked the product owner how many features had shipped to production in the same period, the answer was sobering: roughly the same as before. The velocity increase was entirely an artifact of point deflation. Stories that used to be “5s” were now “2s” because AI handled the implementation. The team was completing more points but delivering the same amount of business value.
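
To put numbers on the deflation, here is a minimal sketch. The velocity figures are the ones from that engagement; the features-per-sprint count is an illustrative assumption, because the argument rests on the ratio, not the absolute values.

```python
# Point deflation in miniature. Velocity figures are from the
# engagement above; the features-per-sprint count is illustrative.

velocity_before = 45     # avg story points per sprint, pre-AI
velocity_after = 78      # avg story points per sprint, post-AI
features_per_sprint = 6  # shipped features, roughly flat (assumed)

nominal_gain = (velocity_after - velocity_before) / velocity_before
cost_before = velocity_before / features_per_sprint
cost_after = velocity_after / features_per_sprint

print(f"Nominal velocity gain:      {nominal_gain:.0%}")              # 73%
print(f"Points per shipped feature: {cost_before:.1f} -> {cost_after:.1f}")
# Each feature now "costs" 13 points instead of 7.5. The gain lives
# entirely in the unit: the team counts more, it does not ship more.
```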

The dashboard said they were winning. The product said otherwise.

When Individual Output Metrics Meet Collaborative Work

The distortion gets worse when you look at individual output metrics in the context of how AI-driven development actually works.

Pull requests per developer, commits per developer, stories completed per developer. These metrics assume that the unit of productivity is the individual. That assumption was already questionable in any team-based development environment, but it becomes actively harmful when AI changes the nature of the work.

A fintech company I worked with saw pull requests per developer increase by 80% after AI adoption. They also saw the average review time drop from 45 minutes to 12 minutes. Leadership read this as efficiency. It was not efficiency. It was rubber-stamping. Reviewers could not keep up with the volume of AI-generated code, so they skimmed instead of reviewed. The defects that should have been caught in review were instead caught in production.
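
The arithmetic makes the rubber-stamping visible. A back-of-the-envelope sketch, assuming a hypothetical baseline of ten PRs per developer per sprint (the 80% increase and the review times are the figures above):

```python
# Back-of-the-envelope check on the review numbers above. The PR
# baseline is a hypothetical ten per developer per sprint; the 80%
# increase and the 45 -> 12 minute review times are from the text.

prs_before = 10
prs_after = prs_before * 1.8       # +80% after AI adoption
minutes_per_review_before = 45
minutes_per_review_after = 12

load_before = prs_before * minutes_per_review_before   # 450 minutes
load_after = prs_after * minutes_per_review_after      # 216 minutes
attention_drop = 1 - minutes_per_review_after / minutes_per_review_before

print(f"Review effort per developer: {load_before:.0f} -> {load_after:.0f} min")
print(f"Scrutiny per PR:             down {attention_drop:.0%}")      # down 73%
# 80% more code flows through review while total review effort falls
# by half. That gap is not efficiency; it is scrutiny that stopped
# happening.
```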

The metric improved. The system degraded.

This pattern accelerates when organizations use individual metrics for performance evaluation. If a developer knows she is measured on PRs merged per sprint, she will optimize for that number. In a pre-AI world, that optimization was mostly harmless because merging a PR required writing the code, which took real effort. In an AI-assisted world, generating code is fast. The bottleneck shifts to validation, integration, and architectural coherence, none of which show up in a PR count.

The result is a team where everyone is individually productive and collectively ineffective. Each person’s dashboard looks strong. The system they are building together does not hold up.

In the developer’s evolving role, I wrote about how AI shifts a developer’s value from code generation to validation and judgment. The measurement problem is the organizational mirror of that shift: if you keep measuring generation while the value has moved to validation, you are rewarding the wrong behavior.

The Behavioral Trap

This is where the measurement problem becomes a management problem.

Goodhart’s Law is usually paraphrased as “when a measure becomes a target, it ceases to be a good measure.” In software development, this plays out with devastating precision.

When teams know they are measured on velocity, they inflate estimates. A task that the team privately agrees is a “3” gets estimated as a “5” because the sprint commitment needs to look achievable. After AI adoption, this dynamic intensifies: the gap between actual effort and estimated effort widens, making inflation both easier to do and harder to detect.

When managers compare velocity across teams, they create perverse incentives. Team A runs at 60 points per sprint. Team B runs at 90. A director who doesn’t understand that story points are team-specific, that they were never designed for cross-team comparison, concludes that Team B is 50% more productive. Team A responds by inflating their estimates. Now both teams run at 90 points, and the director concludes that the organization improved. Nothing changed except the numbers.
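
A small sketch makes the fallacy concrete. Assume, purely for illustration, that both teams deliver the same real work each sprint and differ only in how large a point is:

```python
# Why cross-team velocity comparison fails: a point is an arbitrary,
# team-specific unit. Both teams below complete identical real work;
# only their point scales differ. All numbers are invented.

tasks_per_sprint = 12                       # same actual output for both
points_per_task = {"Team A": 5.0, "Team B": 7.5}

for team, scale in points_per_task.items():
    print(f"{team}: {tasks_per_sprint * scale:.0f} points per sprint")
# Team A: 60 points per sprint
# Team B: 90 points per sprint
# Identical delivery, yet the dashboard shows one team "50% more
# productive." The comparison measures the size of each team's
# invented unit, nothing else.
```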

When organizations tie bonuses or promotions to sprint completion rates, they incentivize scope reduction. Teams learn to commit to less, complete it reliably, and present a perfect burndown chart. The metric is flawless. The ambition is gone.

AI amplifies every one of these dynamics because it increases the volume of measurable output without proportionally increasing the volume of measurable outcomes. There is more to count, and counting is easier than evaluating. So organizations count.

This is productivity theater: the AI version of the DevOps theater I wrote about recently. DevOps theater was about renaming silos without removing them. Productivity theater is about inflating dashboards without improving delivery. Same dynamic, different domain.

The Political Dimension

If the measurement problem were purely technical, it would be easy to fix. Swap the old metrics for better ones. Update the dashboards. Move on.

But metrics are not just measurement instruments. They are political artifacts. Every metric has a constituency.

The project manager who uses velocity for forecasting. The director who uses team velocity comparisons for performance reviews. The executive who reports sprint completion rates to the board. The HR team that uses individual output metrics for promotion decisions. Each of these stakeholders has built processes, reports, and decisions around the existing metrics. Retiring a metric means telling these people that the number they have been relying on is no longer meaningful.

That is an uncomfortable conversation, and most leaders avoid it.

The most common avoidance strategy is running dual metrics: keep the old ones “for continuity” while introducing new ones “for experimentation.” This sounds reasonable. It is a trap. When teams face two measurement systems, they optimize for whichever one has consequences. The old metrics, because they are tied to existing processes and incentives, always win the attention battle. The new metrics, because they are unfamiliar and not yet tied to anything, get ignored.

I have seen this pattern across multiple organizations. The leadership team announces a new measurement approach. The old dashboards stay up. Within two months, every conversation defaults back to velocity and sprint burndown because those are the numbers people know how to interpret. The new metrics become a side project that nobody owns.

The measurement problem is not “we don’t know what to measure.” It is “we can’t let go of what we’ve been measuring.”

What This Actually Costs

The cost of measuring the wrong things is not abstract, and its scale is larger than most leaders realize. A Wharton study of 800 senior decision-makers found that 74% of companies report positive ROI from generative AI. Meanwhile, MIT’s research on enterprise AI pilots found that 95% fail to deliver measurable P&L impact, with only 5% achieving substantial value at scale. Both studies are credible. Both are measuring different things. That gap between reported ROI and actual business impact is the measurement problem at industry level: organizations are measuring activity and calling it value.

Inside a single organization, the cost shows up in three concrete ways.

First, talent attrition. The engineers who understand that the metrics are broken, the ones with the judgment and systems thinking you most need to retain, are the first to leave. They see the disconnect between what the dashboard rewards and what the work requires. They watch colleagues get recognized for high PR counts while the codebase deteriorates. They raise the issue, get told “the numbers look good,” and start updating their resumes. I explored this dynamic in the conversation this book is really about: the gap between what metrics say and what teams experience is one of the most corrosive forces in engineering organizations.

Second, misallocated investment. When leadership makes resource decisions based on metrics that no longer correlate with value, they invest in the wrong things. The team with the highest velocity gets more headcount. The team that is actually delivering the most business value but has lower velocity because they spend time on validation and architectural coherence gets questioned. Resources flow toward output, not outcomes.

Third, delayed recognition of real problems. When the dashboard says everything is fine, nobody investigates. The integration issues, the declining code quality, the security vulnerabilities that AI-generated code introduces, all of these accumulate silently behind a wall of green metrics. By the time the problems surface in production, they are expensive to fix and embarrassing to explain.

The Questions Worth Asking

I am not going to prescribe a replacement framework in this post. The measurement problem is first a diagnostic problem: you have to understand what is broken before you can fix it. And the diagnosis starts with honest questions.

If your team adopted AI coding assistants in the last year, ask yourself:

Has velocity increased? If yes, has the number of features shipped to production increased proportionally? If the answer is no, your velocity metric has lost its calibration.
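
One way to test this is to track points per shipped feature over time, sketched here with illustrative numbers:

```python
# A minimal calibration check for the first question: track points per
# shipped feature over time. The series below is illustrative; pull
# yours from the sprint tool and the release log.

quarters = ["Q1", "Q2", "Q3", "Q4"]
velocity = [45, 52, 68, 78]    # avg story points per sprint
shipped = [6, 6, 7, 6]         # features reaching production per sprint

for quarter, points, features in zip(quarters, velocity, shipped):
    print(f"{quarter}: {points / features:5.1f} points per shipped feature")
# Q1: 7.5 ... Q4: 13.0. A climbing ratio with flat delivery means the
# point itself is inflating: you are counting more, not shipping more.
```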

Are you measuring individual developer output? If yes, are those metrics creating incentives that conflict with collaborative work? If a developer can improve her personal metrics by working alone instead of participating in team reviews, the metrics are working against you.

Do your metrics have consequences? If velocity is tied to performance reviews, sprint completion is tied to bonuses, or PR count is tied to promotions, you have created a system where people will optimize for the metric, not the outcome. That was risky before AI. It is dangerous now.

Can you answer the question “are we delivering business value?” with your current metrics? Not “are we completing sprints” or “are we shipping code,” but “is the software we built making a difference for the business?” If your metrics cannot answer that question, they are measuring activity, not impact.

When was the last time you retired a metric? If the answer is “never,” you are accumulating measurement debt the same way codebases accumulate technical debt. Old metrics that no longer correlate with reality are not harmless. They actively mislead.

The Uncomfortable Truth

The measurement problem is ultimately a leadership problem. It is not about dashboards or tools or frameworks. It is about whether leaders are willing to admit that the numbers they have been relying on no longer tell the truth.

That admission is hard. It means acknowledging that decisions made based on those numbers may have been wrong. It means having uncomfortable conversations with stakeholders who depend on familiar metrics. It means accepting a period of uncertainty while new measurement approaches are established and validated.

But the alternative is worse. The alternative is an organization that optimizes for metrics that reward the wrong behavior, loses the people who see the disconnect, and discovers the real problems only when they become crises.

AI did not create the measurement problem. But it removed the margin of error that allowed fragile metrics to appear reliable. The question is not whether your metrics need to change. The question is whether you will change them before the cost of not changing becomes impossible to ignore.

Ricardo