Using WAR for Good not Evil


Arguments rage on around the baseball world about who should be inducted into the Hall of Fame, but there are still other arguments within these arguments. These inner squabbles focus on the statistics used to measure the greatness of players, and one stat seems to get the most attention. Damian Dydyn lays out the case for why WAR is an imperfect stat and how we can use WAR for good, not evil.

Wins Above Replacement, or WAR, is having an irrevocably negative impact on the discourse surrounding America’s pastime because of how it is being utilized. Whether you prefer the Fangraphs version, Baseball Reference’s or Baseball Prospectus’, WAR can be a useful statistic. Its popularity has reached the point that mainstream media outlets are using it along with conventional measures like batting average, on-base percentage and earned run average. WAR has been marketed incredibly well and it’s wonderful that fans in general are branching out beyond the traditional numbers that have long defined the baseball watching experience. However, it can be, and often is, terribly misused.

Comparing the Limitations of OPS and WAR

In that way, WAR is actually similar to OPS, another statistic that is built from component stats and purports to tell the whole story, even though it ignores baserunning and defensive value. OPS is a combination of OBP (on-base percentage) and SLG (slugging percentage). You simply add them together to produce one number that supposedly tells you how proficient a player is at hitting. Like WAR, it’s decent at first glance, but it doesn’t do a great job of comparing two players because it fails to provide context.
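To make the arithmetic concrete, here is a minimal sketch of how OPS is assembled from the standard counting-stat definitions of its components. The profiles in the usage lines are invented, not real players.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base per plate appearance (standard definition)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at-bat."""
    return (singles + 2 * doubles + 3 * triples + 4 * hr) / ab

def ops(obp_val, slg_val):
    """OPS is a straight, unweighted sum of two rate stats with different denominators."""
    return obp_val + slg_val

# Two very different offensive profiles collapse to the same OPS:
print(ops(0.300, 0.500))  # 0.8 -- power-heavy profile
print(ops(0.400, 0.400))  # 0.8 -- on-base-heavy profile
```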

OPS weighs OBP and SLG equally, but OBP and SLG don’t correlate equally well with run scoring over the course of one or several seasons. Recently, SLG has correlated slightly better, while this study from 1996-2000 shows OBP with a stronger correlation. Different run-scoring environments will have teams scoring runs in different ways, depending on a number of factors: how lineups are constructed, how pitchers and coaches adjust to those changes, how the strike zone is called, how often and how effectively defensive shifts are employed, and so on. If the correlation between each component and run scoring is not static, then the relationship between OPS and run scoring will not be static. And if the relationship between OPS and run scoring isn’t stable, then OPS does not encapsulate total value at the plate as well as we might think it does.
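Checking that claim is straightforward if you have team-season totals handy. The sketch below assumes a hypothetical CSV (`team_seasons.csv` with `season`, `obp`, `slg`, and `runs` columns); the file name and columns are placeholders, not a real data source.

```python
import pandas as pd

# Hypothetical file of team-season totals; the name and columns are assumptions.
teams = pd.read_csv("team_seasons.csv")  # columns: season, team, obp, slg, runs

# Compare how strongly each component tracks run scoring, era by era,
# using five-year buckets of seasons.
for era, group in teams.groupby(teams["season"] // 5 * 5):
    r_obp = group["obp"].corr(group["runs"])
    r_slg = group["slg"].corr(group["runs"])
    print(f"{era}-{era + 4}: r(OBP, R) = {r_obp:.3f}, r(SLG, R) = {r_slg:.3f}")
```

If the two correlations trade places from one era to the next, that alone tells you an equal weighting cannot be right in every environment.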

This is why, when using OPS to evaluate a player, it is important to understand what the underlying components mean in the context of the player’s run-scoring environment. If a player has an .800 OPS with a .500 SLG and a .300 OBP, his impact on run scoring will be different than if he had a .400 SLG and a .400 OBP. For example, if a player is hitting in front of the pitcher, you want someone with a higher slugging percentage than on-base percentage, since being on base ahead of the pitcher means the chances of being driven in are about as low as they are going to get for any hitter in that lineup. Further, if a player with a .400 OBP has achieved it with a .365 batting average fueled by an abnormally high BABIP, there’s a pretty good chance he won’t repeat that slash line when his BABIP returns to normal. Like WAR, OPS lacks context, and understanding the context of the numbers is vital when directly comparing two players.

Context is Crucial When Using WAR

WAR, in any form, breaks down into three fundamental components before including a positional adjustment: offense, baserunning, and defense. The offensive component is quite stable while the other two are not, and one of the major reasons for this is that offense does not require subjective measurements.
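As a rough illustration of how those pieces fit together, here is a sketch of the general shape of the calculation: component run values are summed with the adjustments and converted to wins. The flat runs-per-win divisor and the sample inputs below are illustrative assumptions, not any site’s published implementation; real versions also vary the divisor with the run environment and include league adjustments.

```python
RUNS_PER_WIN = 10.0  # rough rule of thumb; real implementations vary this by season

def war(batting_runs, baserunning_runs, fielding_runs,
        positional_adj, replacement_runs):
    """Illustrative shape of WAR: runs above average from each component,
    plus positional and replacement-level adjustments, converted to wins."""
    total_runs = (batting_runs + baserunning_runs + fielding_runs
                  + positional_adj + replacement_runs)
    return total_runs / RUNS_PER_WIN

# A hypothetical everyday player: strong bat, slightly below-average glove.
print(war(25.0, 2.0, -3.0, 2.5, 20.0))  # 4.65
```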

Offensive Component of WAR

We’ll start with the offensive component, which is computed using linear weights to yield an expected run contribution. For a comprehensive explanation of linear weights, check out this write up at Fangraphs; the general idea is that not all offensive events contribute equally to run scoring, so each event in a game should be weighted a little differently. Two players with the same number of hits and runs driven in, playing in different leagues or years, can have different WAR values, for instance. It requires no guesswork, however, and thus is quite reliable, even in relatively small sample sizes. Over the course of a season, a player’s wOBA, for example, is very likely to tell us precisely how good a hitter he was. Unfortunately, WAR uses far more than the offensive component, and this is where it can go awry.
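A minimal sketch of the linear-weights idea, using wOBA’s standard form. The weights below are in the neighborhood of recently published values but are re-derived for every season’s run environment, so treat them as illustrative placeholders.

```python
# Illustrative linear weights; the real values are recalculated each season.
WEIGHTS = {"ubb": 0.69, "hbp": 0.72, "1b": 0.89, "2b": 1.27, "3b": 1.62, "hr": 2.10}

def woba(ubb, hbp, singles, doubles, triples, hr, ab, bb, ibb, sf):
    """wOBA: credit each event with its average run value, scaled so the
    result reads like OBP. The denominator is the standard definition."""
    numerator = (WEIGHTS["ubb"] * ubb + WEIGHTS["hbp"] * hbp
                 + WEIGHTS["1b"] * singles + WEIGHTS["2b"] * doubles
                 + WEIGHTS["3b"] * triples + WEIGHTS["hr"] * hr)
    return numerator / (ab + bb - ibb + sf + hbp)

# A hypothetical 600-PA season: 100 singles, 30 doubles, 3 triples, 25 homers.
print(round(woba(50, 5, 100, 30, 3, 25, 530, 55, 5, 10), 3))  # 0.374
```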

Baserunning Component of WAR

The baserunning component is based on two things: stolen-base numbers (both steals and times caught stealing) and an evaluation of what a runner does once the ball is put in play, which rests on subjective measures. The first is fairly straightforward, but the latter is not. It requires watching video and making judgments about each play: Should the runner have made it to third base on that single to right? Should the runner have scored from second on that base hit? This injects a level of subjectivity into the stat that varies from person to person. You can ask two people to evaluate the same set of plays and they may come up with very different numbers, which makes those numbers tough to trust. On top of it all, the number of baserunning events for any player during a typical year is much smaller than the number of batting events, leading to even less consistency in the summary statistics.
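A toy simulation makes the point. Every number below is invented: a season of roughly 60 judged baserunning chances, a per-play credit of ±0.2 runs, and a 15% chance that a given rater’s judgment call flips the credit. Even under those generous assumptions, two raters scoring identical plays can land several runs apart.

```python
import random

random.seed(1)

# One fixed season of judged baserunning chances; the run values are invented.
plays = [random.choice([0.2, -0.2]) for _ in range(60)]

def rate(plays, disagreement=0.15):
    """Score the same plays as one hypothetical rater would: on some fraction
    of plays the subjective call goes the other way, flipping the credit."""
    return sum(-v if random.random() < disagreement else v for v in plays)

# Two raters watching identical plays, two different season totals.
print(round(rate(plays), 1), "vs", round(rate(plays), 1))
```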

Defensive Component of WAR

The same issue exists with defensive WAR values, which also require analysts to make subjective judgments while watching video. These measures of a player’s value require very large samples of data to make up for the random biases of the observers. In the specific case of defensive metrics like Ultimate Zone Rating (UZR) and Defensive Runs Saved (DRS), anything less than three years of data lacks predictive value, and the further below three years the sample falls, the less stable the number becomes. The following graph shows single-season values of dWAR (defensive) and bWAR (batting) from the Fangraphs version of the stat over the three-year span of 2013-2015:

The defensive WAR scatter and histograms show far more variability, which means far less stability and reliability. While both components correlate with themselves fairly well across multiple years, the offensive stat is much more dependable than the defensive one within individual seasons. None of this is a secret, and none of these sites claim that their defensive metrics are stable in smaller samples. Fangraphs has a pretty solid breakdown of the shortcomings:

Beware of sample sizes! If a player only spent 50 innings at a position last season, it’d be a good idea not to draw too many conclusions from their UZR score over that time. Like with any defensive statistic, you should always use three years of UZR data before trying to draw any conclusions on the true talent level of a fielder.
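One way to check the variability pattern from the graph yourself is to compare year-over-year correlations of the two components. The sketch below assumes a hypothetical CSV of player-season run values (`player_seasons.csv` with `player`, `season`, `off_runs`, and `def_runs` columns); all names are placeholders.

```python
import pandas as pd

# Hypothetical player-season data; the file and column names are assumptions.
df = pd.read_csv("player_seasons.csv")  # columns: player, season, off_runs, def_runs

# Pair each player-season with that player's next listed season (this sketch
# assumes the listed seasons are consecutive).
df = df.sort_values(["player", "season"])
for col in ("off_runs", "def_runs"):
    df[f"next_{col}"] = df.groupby("player")[col].shift(-1)

pairs = df.dropna(subset=["next_off_runs", "next_def_runs"])
print("offense year-to-year r:", pairs["off_runs"].corr(pairs["next_off_runs"]))
print("defense year-to-year r:", pairs["def_runs"].corr(pairs["next_def_runs"]))
```

A markedly lower year-to-year correlation for the defensive column is exactly the instability the graph above depicts.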

Subjective Nature of the Defensive Component of WAR

The subjective nature of the defensive measurements is a significant reason why the sample sizes must be so large. Subjective measurements of a player’s defense are inherently susceptible to errors in judgment and bias, and that means every player’s defensive value is going to have much larger error bars than objective statistics like batting average or even wOBA. Because of this, a larger number of chances is required before those errors can reasonably be expected to even out with errors on the other side. It’s similar to the idea that over the course of a season, errors by an umpire when calling balls and strikes will even out for a team. In theory that makes sense, but in practice a team is more likely to end up with a non-zero number of runs to one side or the other as October rolls around. If we look at a single season of data for an individual defender, we’ll probably end up making him look better (or worse) than he really is.
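The umpire analogy is easy to simulate. In the toy model below every figure is invented: a dozen borderline calls per game, each randomly going for or against the team and worth about 0.13 runs. The errors cancel in expectation, but most simulated seasons still end several runs off even.

```python
import random

random.seed(7)

def season_residual(games=162, calls_per_game=12, run_value=0.13):
    """Toy model: each borderline call randomly goes for or against the team.
    The per-call run value and call count are invented for illustration."""
    return sum(random.choice([run_value, -run_value])
               for _ in range(games * calls_per_game))

residuals = [season_residual() for _ in range(1000)]
within_two = sum(abs(r) < 2 for r in residuals) / len(residuals)
print(f"seasons ending within 2 runs of even: {within_two:.0%}")  # roughly a quarter
```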

Additionally, defensive metrics measure players against their peers and that means the comparative data is limited to those players who are on the field that season. If a great defender like Jose Iglesias is out for the season with an injury, the league’s average UZR is going to be higher. It’s sort of like having the really smart kid in class miss a test that is graded on a curve. Without that student getting all the answers right, the average number of correct answers is lower, and the grades for everyone else will be higher. This is also true when a great defender has a poor season, or even when an average defender has a bad year.

So why is context so important when using WAR?

Well, did the first baseman not get to that ball to his right because the second baseman is really good and usually covers that ground, allowing him to shade more toward the line to protect against doubles? Was the third baseman creeping toward the plate to protect against a bunt? Did the ball take just enough of a bad hop to get past the shortstop, but not so much so that it was obvious to the person watching the video? Was the ground wet from a rain delay, preventing the outfielder from getting a good jump toward the ball’s path? One person might answer these questions differently than another, and that’s before we even consider the idea of bias.

It’s human nature to see things the way we believe them to be true. When someone we know to be a spectacular defender, say Jackie Bradley Jr. or Andrelton Simmons, makes a diving catch, we assume it is because it was a very difficult play, and many times that will be the case. But what if such a defender got a bad jump or slipped on his first step, before the camera switched over to him? We wouldn’t realize that there is a reason for him to be behind the play, forcing him to dive to make up for it, and we can’t fairly determine that he should have made the play without diving. This example of slipping is extreme, and unlikely to be missed on replays, but it demonstrates the point clearly. We expect these players to be great defensively, so our default take on such a play (without conclusive video evidence) would be that it was too difficult for an average defender to make.

Regressing the Defensive Component of WAR

Yes, UZR and DRS are the best metrics we have for measuring defense right now, but that doesn’t mean we should ignore their flaws. We need to adjust these numbers to account for sample size. How? I will point to Mitchel Lichtman, the creator of UZR, and Tom Tango. When asked how to compensate for smaller sample sizes, they said the following:

“If you see one fielder at +30 and another at -20, it’s almost certain there is not a 50 run difference between the two players, in terms of their performance. Chances are, the +30 fielder had good context that was not apparent or considered. And the -20 had bad context that wasn’t considered.”

~ Tom Tango

“And the more extreme the recorded performance, the +30 and -20, the more likely it is that the data is bad and the larger the magnitude of those errors, always in one direction or another. That’s why we MUST regress (some unknown amount) in order to estimate the true performance of the fielder, at least given the algorithms that metrics like UZR and DRS use, and the quality of the data.”

~ Mitchel Lichtman

The above quotes are from Tom Tango’s blog. The entire entry and the brief run of comments that follow are worth careful study if you want to understand defensive metrics.

Using WAR for Good not Evil

Now, let’s go back to WAR as a complete statistic. We most frequently see it computed in single-season samples. That is two seasons short of the time required for the defensive component to stabilize, and the baserunning component does not stabilize as quickly as the offensive component because a single season simply does not contain enough contested events on the basepaths. This means we can’t trust it to be a reliable measure of how one player stacks up against another. If you are looking for a very general guess at what a player was worth in a season, it’s a fine statistic to use. If you are trying to argue about who the MVP is or which player your favorite team ought to sign, you need to dig deeper.

Regression, while not as precise or intuitive as a confidence interval, is easier to do and can help to minimize the variance inherent in a single-season sample. If you are not, at the very least, regressing WAR’s defensive component as described above by Lichtman, you are doing exactly what the creators of UZR and DRS said not to do.
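Lichtman deliberately leaves the amount of regression unspecified (“some unknown amount”), so any concrete number is a guess. The sketch below uses the common n/(n+k) shrinkage shape with a placeholder constant, purely to show the mechanics of pulling a small-sample rating toward the league mean.

```python
def regress_to_mean(observed_runs, innings, league_mean=0.0, k_innings=2500):
    """Shrink a small-sample defensive rating toward the league mean using the
    standard n/(n+k) shrinkage shape. The constant k_innings is a placeholder,
    not a published value: Lichtman's advice is only that *some* unknown
    amount of regression is required."""
    weight = innings / (innings + k_innings)
    return weight * observed_runs + (1 - weight) * league_mean

# A +15 UZR over roughly half a season (~700 innings) regresses to a far
# tamer estimate of true performance.
print(round(regress_to_mean(15.0, 700), 1))  # 3.3
```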

In short, WAR is a poor statistic for direct comparisons of players in single seasons, and it’s even worse in partial seasons. As a portal into a deeper discussion, it’s absolutely fine. As a very quick and dirty look at general value on the field, it’s OK. If you want anything more precise than that, you need to get into the messy stuff: look at offensive numbers like wOBA, wRC+, or OPS+; look at scouting reports for defense and baserunning; and use your own observations of plays you’ve seen or can review on video.

The idea of one clean and easy number wrapping up a player’s worth is enticing, but we just aren’t there with any kind of reliability. We’re headed in some really exciting directions, especially with the continuing development of Statcast, Major League Baseball’s proprietary new tracking technology. In the meantime, falling back on WAR lacks nuance and deprives us of what makes arguing about baseball so much fun. There are myriad ways to look at a player or a play or a season or a trend, and finding those avenues to knowledge is both rewarding and fascinating. Why would we want to move away from the meat and potatoes of baseball analysis?

In part two, we will take a look at the concept of WAR scaling linearly in an environment in which value is relative, and at the differences between an individual player’s WAR and the combined WAR of a roster.


This article was truly a collaborative effort and would not have been possible without the hard work of the entire SoSH team.

Follow Damian on Twitter @ddydyn.

Featured image courtesy of ESPN.


Damian grew up smack dab in the middle of Connecticut and was indoctrinated into the culture of Red Sox fandom from the moment he was old enough to start swinging a bat. A number of trips to Fenway Park and meeting Ellis Burks at his dad’s bar cemented what would become a lifelong obsession that would pay off in spades in both the recent run of postseason success and the extra bit of connection he would have with his father throughout the years.

After a brief three-year stint living in the Bronx with his wife, where he enjoyed leisurely strolls through the neighborhood with a Red Sox t-shirt on to provoke the natives, he settled in Roanoke, Virginia, where he can fall out of bed and land at a Salem Red Sox game.

Damian is a co-host for SoSHCast (the Sons of Sam Horn podcast) along with Justin Gorman.

4 COMMENTS

  1. You’re not making the distinction between evaluating players’ talents and evaluating players’ performances.

    Most of what you said re: the necessity of larger samples only applies to evaluation of talent, not performance. Yes, we need 3 years of data to be somewhat confident in our evaluation that Player A is a better defender than Player B, but we do not need 3 years of data to be confident that Player A performed better defensively in 2015 than Player B.

    Trevor Story this year hit 7 HRs in his first 6 games. That’s an amazing performance and nobody can take that away from him. We don’t need to regress or get a larger sample size when evaluating a performance. The performance just is.

    Of course you’d want more data to evaluate his talent as a hitter, and probably you’ll need more than his 2016 season. But if what you’re doing is measuring the performance of his single year, there’s nothing wrong with using the smaller sample. It’s not big enough to evaluate talent, but it’s exactly as big as it needs to be to measure that season’s performance.

  2. First off, thanks for reading and thanks for the input.

    As for your comment, the flaws that make single season samples so treacherous when trying to measure talent are still there impacting our measuring of performance. The inherent biases and subjectivity at play in the defensive and base running components don’t have less of an impact just because we are applying the stat one way or another. If an evaluator who is working on determining UZR, for example, misjudges a diving play, the impact of that on the recorded UZR and thus the recorded WAR is the same.

    So the issue isn’t so much how we are using the stat, but how we arrived at that number in the first place. And again, this isn’t an argument that WAR is a bad stat or that it shouldn’t be used; we just need to be cognizant of its weaknesses and either adjust for them or supplement with additional data, observations, or even video when possible.

  3. Do we happen to know if any kind of inter-rater reliability testing is performed on the subjective analysis of the base running and defensive components? We can say that subjectivity increases the error bars, and it is certainly true, but the magnitude of the error introduced here is really important. If you have three people making the evaluations, and they agree on the same play (such as whether the runner should have advanced from first to third on a particular base hit) 90% of the time, then we can actually put a good deal of trust in the component metric. While it’s an important question to raise, the fact that a human is making the subjective call doesn’t necessarily invalidate or damage the data; it simply means that it is possible the error bars widen.

    It’s also important to note that the offensive stats aren’t actually free of this kind of thing. OBP is heavily walk dependent, which in turn relies on the umpire calling the strike zone, etc. In both cases, increasing the sample size will allow you to trust the metrics more because we assume that errors will be randomly distributed and thus ‘even out’ over time, but the reason we can trust walk rates and plate discipline is specifically because most umpires will call 90-95% of pitches the same way over even a small sample. Knowing who is actually evaluating the base running and defensive plays for those components, and how stable and consistent the rubric is, would go a long way toward identifying whether or not there actually is a problem with those WAR components, and if so, how much.

    We use the higher variance of those numbers, such as in your chart above, to say that there is an inherent problem in the coding of the events, but we haven’t actually proven that anywhere, as far as I have seen. It’s possible that we are actually seeing a true reflection of outcomes, and that there is simply more year-to-year variation in a player’s defensive performance than there is in his offensive performance. It seems likely that the evaluation process is adding to that instability, but we don’t know for sure that it is, or by how much. We also don’t know, for instance, with base running, how much of that variation has to do with the runner’s decision making versus “true talent level” and the third base coach’s decision making (i.e., maybe Xander could go to third, but the coach held him up), etc.

  4. You raise some good points, Matt, but no, we do not have any real visibility into exactly how these plays are evaluated and scored, which is part of the problem. What we do have is Mitchel Lichtman, the creator of UZR, and Tom Tango suggesting three seasons of data are needed for it to stabilize.

    And of course, that’s problematic in its own right since a player’s defensive skill can, and often does, change over time, so who is to say that the UZR accumulated in 2014 by player X was done with the same level of defensive ability as the UZR accumulated in 2016? What if that player has lost a step or two over the course of those three seasons, posting an excellent UZR in the first season and a slightly better-than-average UZR this season? His “stable” three-season UZR might rate him more highly than he deserves if we are taking a guess at what kind of defender he might be in 2017.

    We also mentioned that a player who plays the same level of defense for three straight seasons might have three different UZRs if the players at his position around the league change in name or in performance. That fluidity adds uncertainty.

    As for base running, we may have even less certainty with that component since we aren’t sure to what degree things like the third base coach holding a runner who might have made it to third or scored are measured. It’s possible that each play is watched and scored by 100 people, in which case the subjective ratings would probably be very stable. It’s also possible they are rated by one person, which would make them quite unstable.

    The truth is probably somewhere in between, and while we can’t be certain how close it is to one side or the other, we do know that these metrics take longer to stabilize than something like OBP, which takes around 460 PAs. That’s less than a full season, so we can be fairly certain that the offensive component of WAR is also pretty stable in that span. Slugging takes around 320, if you were curious, and batting average takes 910, which is about 1.5 seasons if you are looking for a hitting metric that exceeds one season.

    You can check out stabilization rates here for more: http://www.fangraphs.com/library/principles/sample-size/

    We appreciate the feedback!
