Arguments rage on around the baseball world about who should be inducted into the Hall of Fame, but there are still other arguments within these arguments. These inner squabbles focus on the statistics used to measure the greatness of players, and one stat seems to get the most attention. Damian Dydyn lays out the case for why WAR is an imperfect stat and how we should be using WAR for good, not evil.
Wins Above Replacement, or WAR, is having an irrevocably negative impact on the discourse surrounding America’s pastime because of how it is being utilized. Whether you prefer the Fangraphs version, Baseball Reference’s or Baseball Prospectus’, WAR can be a useful statistic. Its popularity has reached the point that mainstream media outlets are using it along with conventional measures like batting average, on-base percentage and earned run average. WAR has been marketed incredibly well and it’s wonderful that fans in general are branching out beyond the traditional numbers that have long defined the baseball watching experience. However, it can be, and often is, terribly misused.
Comparing the Limitations of OPS and WAR
In that way, WAR is similar to OPS, another statistic that is built from component stats and purports to tell the whole story, even though it ignores baserunning and defensive value. OPS is a combination of OBP (on-base percentage) and SLG (slugging percentage): you simply add them together to generate one number that supposedly tells you how proficient a player is at hitting. Like WAR, it's decent at first glance, but it doesn't do a great job of comparing two players because it fails to provide context.
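The arithmetic here is simple enough to sketch out. The component formulas below are the standard definitions of OBP and SLG; the sample numbers in the test are invented for illustration.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times reaching base per plate appearance."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at-bat."""
    return (singles + 2 * doubles + 3 * triples + 4 * hr) / ab

def ops(obp_val, slg_val):
    # Equal weighting is the entire formula -- no context adjustment.
    return obp_val + slg_val
```

Note that a .300 OBP with a .500 SLG and a .400 OBP with a .400 SLG both collapse to the same .800 OPS, even though those two hitters affect run scoring very differently.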
OPS weighs OBP and SLG equally, but OBP and SLG don't correlate equally well with run scoring over the course of one or several seasons. Recently, SLG has correlated slightly better, while this study from 1996-2000 shows OBP with a stronger correlation. Different run-scoring environments will have teams scoring runs in different ways, depending on a number of factors: how lineups are constructed, how pitchers and coaches adjust to those changes, how the strike zone is called, how often and how effectively defensive shifts are employed, etc. If the correlation between each component and run scoring is not static, then the relationship between OPS and run scoring will not be static either. And if the relationship between OPS and run scoring isn't stable, then OPS does not encapsulate total value at the plate as well as we might think it does.
Therefore, when using OPS to evaluate a player, it is important to understand what the underlying components mean in the context of the player’s run-scoring environment. If a player has an .800 OPS with a .500 SLG and a .300 OBP, his impact on run scoring will be different than if he had a .400 SLG and a .400 OBP. For example, if a player is hitting before the pitcher, you want someone with a higher slugging than on-base percentage, since being on base in front of the pitcher means the chances of being driven in are likely as low as they are going to get for any of the hitters in that lineup. Further, if a player with a .400 OBP has achieved that with a .365 batting average fueled by an abnormally high BABIP, there’s a pretty good chance he won’t be repeating that slash line when his BABIP returns to normal. Like WAR, OPS lacks context, and understanding the context of the numbers is vital to directly compare two different players.
Context is Crucial When Using WAR
WAR, in any form, breaks down into three fundamental components before including a positional adjustment: offense, baserunning, and defense. The offensive component is quite stable while the other two are not, and one of the major reasons for this is that offense does not require subjective measurements.
Offensive Component of WAR
We’ll start with the offensive component, which is computed using linear weights to yield an expected number of runs scored. For a comprehensive explanation of linear weights, check out this write-up at Fangraphs; the general idea is that not all events contribute equally to a game’s outcome, so each event should be weighted a little differently. For instance, two players with the same number of hits and runs driven in can have different WAR values if they played in different leagues or years. The offensive component requires no guesswork, however, and thus is quite reliable – even in relatively small sample sizes. Over the course of a season, a player’s wOBA, for example, is very likely to tell us precisely how good a hitter was. Unfortunately, WAR incorporates far more than the offensive component, and this is where it can go awry.
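The linear-weights idea behind wOBA can be sketched in a few lines. The weights below are representative of published FanGraphs wOBA weights for a recent season, but the actual coefficients are recomputed for each run environment; the denominator here is simplified to plate appearances (the real formula uses AB + BB - IBB + SF + HBP).

```python
# Approximate, illustrative run values per event (real values vary by season).
WOBA_WEIGHTS = {
    "bb": 0.69, "hbp": 0.72, "single": 0.89,
    "double": 1.27, "triple": 1.62, "hr": 2.10,
}

def woba(events, pa):
    """Weighted on-base average: each event is credited with its average
    run value instead of being counted equally, then rated per opportunity."""
    total = sum(WOBA_WEIGHTS[event] * count for event, count in events.items())
    return total / pa
```

The key contrast with OPS is visible in the weights themselves: a home run is worth roughly 2.4 times a single in run value, not the 4:1 ratio that slugging percentage implies.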
Baserunning Component of WAR
The baserunning component is based on two things: stolen-base numbers (which include both steals and times caught stealing) and an evaluation of how runners advance once the batter puts the ball in play, which relies on subjective measures. The first is fairly straightforward, but the latter is not. It requires watching video and making judgements about each play: Should the runner have made it to third base on that single to right? Should the runner have scored from second on that base hit? This adds a level of subjectivity into the stat which varies from person to person. You can ask two people to evaluate the same set of plays and they may come up with very different numbers, which makes those numbers tough to trust. On top of it all, the number of baserunning events for any player during a typical year is much smaller than the number of batting events, leading to even less consistency in summary statistics.
Defensive Component of WAR
The same issue exists with defensive WAR values, which also require analysts to make subjective judgements while watching video. These measures of a player’s value require very large samples of data to make up for the random biases of the observers. In the specific case of defensive metrics like Ultimate Zone Rating (UZR) and Defensive Runs Saved (DRS), anything less than three years of data lacks predictive value, and the further below three years you go, the less stable the number becomes. The following graph shows single-season values of dWAR (defensive) and bWAR (batting) from the Fangraphs version of the stat over the three-year span of 2013-2015:
It is plain to see that there is far more variability in the defensive WAR scatter and histograms, which means far less stability and reliability. While both components correlate fairly consistently with themselves across multiple years, the offensive stat is much more reliable than the defensive one within individual seasons. None of this is a secret. None of these sites claim that their defensive metrics are stable in smaller samples. Fangraphs has a pretty solid breakdown of the shortcomings:
Beware of sample sizes! If a player only spent 50 innings at a position last season, it’d be a good idea not to draw too many conclusions from their UZR score over that time. Like with any defensive statistic, you should always use three years of UZR data before trying to draw any conclusions on the true talent level of a fielder.
Subjective Nature of the Defensive Component of WAR
The subjective nature of the defensive measurements is a significant reason why the sample sizes must be so large. Subjective measurement of a player’s defense is inherently susceptible to errors in judgement and bias, and that means every player’s defensive value is going to have much larger error bars than objective statistics like batting average or even wOBA. Because of this, a larger number of chances is required before those errors can reasonably be expected to even out with errors on the other side. It’s similar to the idea that over the course of a season, errors by an umpire when calling balls and strikes will even out for a team. In theory it makes sense, but in practice a team is more likely to end up with a non-zero number of runs to one side or the other as October rolls around. If we look at a single season of data for an individual defender, we’ll probably end up making him look better (or worse) than he really is.
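A toy simulation makes the sample-size point concrete. This is not a real defensive metric; it just models each observed play as the true value plus random observer error, and shows that averaging over many plays is what shrinks the error bars.

```python
import random

def observed_rating(true_runs_per_play, n_plays, observer_sd=0.5, seed=0):
    """Average of n_plays noisy observations of a defender's true value.
    observer_sd models the random error/bias in each subjective judgement."""
    rng = random.Random(seed)
    observations = [
        true_runs_per_play + rng.gauss(0, observer_sd) for _ in range(n_plays)
    ]
    return sum(observations) / len(observations)
```

The standard error of the average falls off as 1 over the square root of the number of plays, so an estimate built from one season of chances carries error bars several times wider than one built from three seasons.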
Additionally, defensive metrics measure players against their peers and that means the comparative data is limited to those players who are on the field that season. If a great defender like Jose Iglesias is out for the season with an injury, the league’s average UZR is going to be higher. It’s sort of like having the really smart kid in class miss a test that is graded on a curve. Without that student getting all the answers right, the average number of correct answers is lower, and the grades for everyone else will be higher. This is also true when a great defender has a poor season, or even when an average defender has a bad year.
So why is context so important when using WAR?
Well, did the first baseman not get to that ball to his right because the second baseman is really good and usually covers that ground, allowing him to shade more toward the line to protect against doubles? Was the third baseman creeping toward the plate to protect against a bunt? Did the ball take just enough of a bad hop to get past the shortstop, but not so much so that it was obvious to the person watching the video? Was the ground wet from a rain delay, preventing the outfielder from getting a good jump toward the ball’s path? One person might answer these questions differently than another, and that’s before we even consider the idea of bias.
It’s human nature to see things the way we believe them to be true. When someone we know to be a spectacular defender, say Jackie Bradley Jr. or Andrelton Simmons, makes a diving catch, we assume it is because it was a very difficult play, and many times that will be the case. But what if such a defender got a bad jump or slipped on his first step, before the camera switched over to him? We wouldn’t realize that there is a reason for him to be behind the play, forcing him to dive to make up for it, and we can’t fairly determine that he should have made the play without diving. This example of slipping is extreme, and unlikely to be missed on replays, but it demonstrates the point clearly. We expect these players to be great defensively, so our default take on such a play (without conclusive video evidence) would be that it was too difficult for an average defender to make.
Regressing the Defensive Component of WAR
Yes, UZR and DRS are the best metrics we have for measuring defense right now, but that doesn’t mean that we should ignore their flaws. We need to adjust these numbers to account for sample size. How? I will point to Mitchel Lichtman, the creator of UZR, and Tom Tango. When asked how to compensate for smaller sample sizes, they said the following:
“If you see one fielder at +30 and another at -20, it’s almost certain there is not a 50 run difference between the two players, in terms of their performance. Chances are, the +30 fielder had good context that was not apparent or considered. And the -20 had bad context that wasn’t considered.”
~ Tom Tango
“And the more extreme the recorded performance, the +30 and -20, the more likely it is that the data is bad and the larger the magnitude of those errors, always in one direction or another. That’s why we MUST regress (some unknown amount) in order to estimate the true performance of the fielder, at least given the algorithms that metrics like UZR and DRS use, and the quality of the data.”
~ Mitchel Lichtman
The above quotes are from Tom Tango’s blog. The entire entry and the brief run of comments that follow are worth careful study if you want to understand defensive metrics.
Using WAR for Good not Evil
Now, let’s go back to WAR as a complete statistic. We most frequently see it computed in single-season samples. That is two seasons short of the time required for the defensive component to stabilize, and the baserunning component does not stabilize as quickly as the offensive component because there are not enough contested events on the basepaths. This means we can’t trust single-season WAR to be a reliable measure of how one player stacks up against another. If you are looking for a very general guess at what a player was worth in a season, it’s a fine statistic to use. If you are trying to argue about who the MVP is or which player your favorite team ought to sign, you need to dig deeper.
Regression, while not as precise or intuitive as a confidence interval, is easier to do and can help to minimize the variance inherent in a single-season sample. If you are not, at the very least, regressing WAR’s defensive component as described above by Lichtman, you are doing exactly what the creators of UZR and DRS said not to do with those metrics.
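A minimal sketch of what such a regression might look like: shrink the observed number toward the league mean, with the shrinkage weight driven by playing time. The `regression_innings` constant below is an illustrative assumption, not a value published by Lichtman or Tango; the "unknown amount" of regression they describe is exactly the part a careful analyst has to estimate.

```python
def regressed_uzr(observed_uzr, innings, league_mean=0.0, regression_innings=2500):
    """Shrink a single-sample UZR toward the league mean.
    regression_innings is a hypothetical stabilization constant: the fewer
    innings observed, the more weight the league mean receives."""
    weight = innings / (innings + regression_innings)
    return weight * observed_uzr + (1 - weight) * league_mean
```

Under these assumed numbers, a +30 put up in roughly 1200 innings regresses to about +10, which is much closer to the "true performance" range the quotes above describe than the raw figure.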
In short, WAR is a poor statistic for direct comparisons of players in single seasons. It’s even worse in partial seasons. As a portal into a deeper discussion, it’s absolutely fine. As a very quick and dirty look at general value on the field, it’s OK. If you want to get into anything more precise than that, you need to get into the messy stuff. Look at offensive numbers like wOBA, wRC+, or OPS+, look at scouting reports for defense and baserunning, use your own observations of plays that you’ve seen or can review via video.
The idea of one clean and easy number wrapping up a player’s worth is enticing, but we just aren’t there with any kind of reliability. We’re headed in some really exciting directions, especially with the continuing development of Statcast, Major League Baseball’s proprietary new tracking technology. In the meantime, falling back on WAR lacks nuance and deprives us of what makes arguing about baseball so much fun. There are myriad ways to look at a player or a play or a season or a trend, and finding those avenues to knowledge is both rewarding and fascinating. Why would we want to move away from the meat and potatoes of baseball analysis?
In part two, we will take a look at the concept of WAR scaling linearly in an environment in which value is relative, as well as the differences between an individual player’s WAR and the combined WAR of a roster.