Does a .300 hitter actually hit .300? A small sample size can make a seemingly incredible stat not all that impressive, but when does a sample stop being a small sample? Mike Richmond looks to uncover the answer to that question using a random-number generator.
If a baseball player hits .300, does that mean that he’s a .300 hitter? It may sound like a silly question, but if we examine it closely, it can lead us to some interesting conclusions. Let’s make a simple mathematical model of the batter and see what it tells us; in particular, how large a sample do we need to draw reliable conclusions? For example, Boston Red Sox fans have been frustrated by Jackie Bradley Jr.’s continued inability to hit major-league pitching. Is his sub-.200 batting average a true reflection of his skill level, or has he just been unlucky?
Batting as a roll of the die
In real life, athletes are subject to a host of influences: major injuries, minor injuries, sleepless nights, sick kids, food poisoning – the list goes on and on. Major league batters face a number of factors which are peculiar to the game of baseball: night games and day games, right-handers or southpaws, bases empty or runners in scoring position. On any given night, in any given inning, the sum of all these influences may give the batter a small boost, or drag him down. Trying to separate his innate ability from the vagaries of any one plate appearance would be… difficult.
So, let’s make things simple. Instead of considering a real player, we will examine the performance of an ideal system:
Not a die, exactly, but a random-number generator running on a computer. To simulate one at-bat by a player with a “true” .270 batting average, we’ll generate a random number between 0 and 1. If the value is less than or equal to .270, we’ll say that the batter got a hit; otherwise, we’ll say he made an out.
Since discussions of statistical interpretation sometimes involve at-bats, and other times involve games played, we’ll say that our random player gets exactly 4 at-bats (AB) per game played. Moreover, we’ll ignore the possibility of walks, errors, and other complications. He has four chances per game, each of which results in either a hit or an out.
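The model described above is small enough to sketch in a few lines of Python. This is only an illustration under stated assumptions, not the program used for this article: the function names `simulate_at_bat` and `simulate_game` and the seed value are made up for the example.

```python
import random

def simulate_at_bat(true_avg, rng):
    """One at-bat: draw a number in [0, 1); a value <= true_avg counts as a hit."""
    return rng.random() <= true_avg

def simulate_game(true_avg, rng, at_bats=4):
    """One game: exactly `at_bats` chances, each either a hit or an out."""
    return sum(simulate_at_bat(true_avg, rng) for _ in range(at_bats))

# Hypothetical seed, chosen only so that a run can be repeated.
rng = random.Random(42)
hits = simulate_game(0.270, rng)
print(f"{hits}-4")
```

Walks, errors, and everything else are left out, just as in the text: each at-bat has only two outcomes.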
Two representative groups of batters
To make things more interesting, let’s watch the performance of two different ideal batters. The “average” player will have a true batting average of .270, while the “good” player will have a true batting average of .330. It ought to be easy to distinguish the good player from the average one – right?
Of course, since both the real game and this simple simulation contain a large element of randomness, it could be misleading to follow the results of just a single example of each type. Let’s create a large group of identical copies of each player: 1,000 “average” batters and 1,000 “good” batters. With such a large group of players, we are very likely to see examples of the best of good luck and the worst of bad luck.
After one game…
Our ideal batters step up to the plate four times in this single game. What does the box score tell us the next day?
Well, there are only five possibilities for each player: going 0-4, 1-4, 2-4, 3-4 or 4-4. We can make a table showing the number of batters which fell into each category:
Hits-ABs | Batting Average | Red Group | Blue Group |
0-4 | .000 | 200 | 265 |
1-4 | .250 | 419 | 430 |
2-4 | .500 | 271 | 244 |
3-4 | .750 | 92 | 57 |
4-4 | 1.000 | 18 | 4 |
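For readers who want to build a table like this themselves, here is a minimal sketch. It assumes the same rules as above (four at-bats, hit or out) and compares a simulated tally with the counts expected from the binomial formula. The names `tally_one_game` and `expected_counts` and the default seed are assumptions made for the example, so the simulated counts will not match the table digit for digit.

```python
import random
from collections import Counter
from math import comb

def tally_one_game(true_avg, n_batters=1000, at_bats=4, seed=0):
    """Count how many simulated batters finish with 0, 1, ..., at_bats hits."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_batters):
        hits = sum(rng.random() <= true_avg for _ in range(at_bats))
        counts[hits] += 1
    return [counts[k] for k in range(at_bats + 1)]

def expected_counts(true_avg, n_batters=1000, at_bats=4):
    """Expected number of batters in each category from the binomial formula."""
    return [
        n_batters * comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(at_bats + 1)
    ]

for avg in (0.270, 0.330):
    simulated = tally_one_game(avg)
    expected = [round(x) for x in expected_counts(avg)]
    print(f"true {avg:.3f}: simulated {simulated}, expected {expected}")
```

Changing the seed shifts the simulated counts by a handful of batters in each row, which gives a feel for how much of the table is noise.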
We can also display the results graphically, using a histogram:
Can you tell which color – red or blue – represents the “average” players and which the “good” ones? It isn’t clear to me. Clearly, we need a larger sample.

After five games (or about one week)
After each player has 20 ABs, the distribution becomes considerably less broad:
It’s obvious now that the blue symbols represent the “average” players, and the red bars the “good” players, right?
But – wait a minute. The “average” players should have a .270 batting average and the “good” players a .330 average. That’s a difference of .060. So why do the two groups have so much overlap?
The answer is that this dataset is still too small to make definitive statements about the players’ intrinsic ability. Twenty trips to the plate is so few that even a couple of lucky bounces can make the difference between a nice .300 (6-20) and an eye-catching .400 (8-20).
One way to quantify the range of values in each group is to compute the mean and the standard deviation of its batting averages. I’d rather not delve into the mathematical details in this article, but these quantities are relatively quick and easy to calculate, used widely by people in a range of communities, and, to a rough approximation, give us a simple handle on the range of the distributions:
Statistic | Red Group | Blue Group |
Mean | .335 | .269 |
Standard Deviation | .105 | .098 |
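As a rough check on numbers like these, the sketch below simulates 1,000 batters with 20 at-bats each and computes the mean and standard deviation of their batting averages. It also prints the standard deviation given by the binomial formula, sqrt(p(1 - p) / n). The helper names and the seed are assumptions for illustration only.

```python
import random
import statistics
from math import sqrt

def simulated_averages(true_avg, at_bats, n_batters=1000, seed=0):
    """Batting averages of n_batters identical batters after at_bats at-bats."""
    rng = random.Random(seed)
    return [
        sum(rng.random() <= true_avg for _ in range(at_bats)) / at_bats
        for _ in range(n_batters)
    ]

def binomial_stdev(true_avg, at_bats):
    """Standard deviation of a batting average under the binomial model."""
    return sqrt(true_avg * (1 - true_avg) / at_bats)

for avg in (0.270, 0.330):
    sample = simulated_averages(avg, at_bats=20)
    print(
        f"true {avg:.3f}: mean {statistics.mean(sample):.3f}, "
        f"stdev {statistics.pstdev(sample):.3f}, "
        f"formula {binomial_stdev(avg, 20):.3f}"
    )
```

The simulated values wander a little from run to run, while the formula gives the same answer every time.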
If we assume that the outcomes follow a normal distribution (yes, yes, this isn’t true; if you have been yelling “Poisson” at the screen, give yourself a cookie) then two-thirds of all the batters will fall within one standard deviation of the mean, and 95 percent of all batters within two standard deviations:
Range | Red Group | Blue Group |
+/- 1 stdev | .229 – .439 | .171 – .367 |
+/- 2 stdev | .123 – .544 | .073 – .465 |
It’s quicker and easier to see the results on a graph. The colored bands on the figure below show the range from two standard deviations below the mean of each group to two standard deviations above the mean. Only five percent or so of each group will fall outside the region of the bands.
No matter how you look at it, though, it’s hard to tell the difference between these two groups of batters. There’s still much more overlap than separation. We need a bigger sample!
After twenty games (or a bit less than one month)
Let’s give each player almost a month of playing time:
The two groups are starting to pull away from each other, but the bulk of players in each still fall within the 95-percent region of the other.
After eighty games (or half a season)
As I write this article, it’s July 1, and the Red Sox have played 79 games – just about half the season. Surely a player who had participated in every game so far would have had a chance to show his true worth by now – right?
Alas, it’s still not possible to separate with certainty the batters with a “true” batting average of .270 from those with .330. For example, more than 10% of the “average” players have batting averages over .300, while 16% of the “good” players are sitting below .300.
After 162 games
Suppose we give each player a full season to perform. Is that enough?
For some purposes, yes, this is enough. The two distributions still touch, and there’s just a little bit of overlap between the colored bands which contain 95 percent of each group… but if one is willing to overlook occasional outliers, one can make a pretty good guess at the “true” batting average of a player after this much time has passed.
A brief statistical summary
It might be useful for future reference to summarize what we’ve seen so far: just how confident can one be in the batting average of a player after some number of games? As long as the “true” value lies somewhere in the range between about .250 and .350, the following table (updated July 8) should be a decent guide:
After this many games, … | … or this many ABs | 95 percent of batting averages will be within this much of “true” value |
5 | 20 | +/- .200 |
20 | 80 | +/- .102 |
80 | 320 | +/- .052 |
162 | 648 | +/- .036 |
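The entries in this table can be approximated from the binomial standard deviation. The sketch below assumes a “true” average of .300, the middle of the range mentioned above, and prints two standard deviations for each sample size. The function name `two_sigma` is made up for the example, and the results should land close to the table rather than match it exactly.

```python
from math import sqrt

def two_sigma(true_avg, at_bats):
    """Two standard deviations of a batting average under the binomial model."""
    return 2 * sqrt(true_avg * (1 - true_avg) / at_bats)

# Assumption: 4 at-bats per game, as in the simulated players above.
for games in (5, 20, 80, 162):
    at_bats = 4 * games
    print(f"{games:>3} games, {at_bats:>3} AB: +/- {two_sigma(0.300, at_bats):.3f}")
```

Because the width shrinks only as the square root of the number of at-bats, cutting the uncertainty in half takes four times as many trips to the plate.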
In our next article, we’ll put this model to the test, by comparing its predictions to the actual performances of several MLB hitters – Jackie Bradley Jr. among them. Stay tuned!
Whoopsie. Those of you who have been yelling, “That’s not a Poisson distribution — it’s a BINOMIAL distribution” should get two cookies. My mistake.
1) Your table displays the range containing 68% of the values; the 95% range is twice as wide.
2) Why did you do a Monte Carlo simulation to determine the distributions when the analytical expression is well known and fairly simple? The errors in the Monte Carlo calculation with only 1000 trials are quite large; the true distribution for 320 AB doesn’t have multiple peaks.
Alan — thank you very much for a careful reading of the article.
1) Yes, you are exactly right. I _did_ put the 1-sigma values into the second table of the article, instead of the 2-sigma values. My mistake. I have just now (1 PM on July 8) updated the text so that the proper 2-sigma values are in the table. In addition, I updated the text in the introduction of the “Part 2” article so that the proper 2-sigma ranges are there, too. Fortunately, in all the specific examples of the “Part 2” article, I did correctly use the 2-sigma ranges.
2) You are again correct to say that the expected distribution of batting averages for any given number of attempts is known exactly as an analytic expression. Why did I not use this analytic expression to generate nice, symmetric regions in the graphical illustrations? If I were a cunning fellow, I would say something along the lines of “well, that nice bell-shaped curve would not give the reader a feeling for the size of the typical fluctuations expected in these small sample sizes … whereas the Monte Carlo results at least hint at them.” But I’m not cunning, I’m just lazy. It’s often quicker and easier for me these days to write a short program to use the Monte Carlo approach to almost anything I do (whether it’s astronomy or physics or baseball) than to figure out which analytic expression is the proper one to use, so I just use the Monte Carlo method. It’s a bad habit, for sure.
I and others who try to help explain things would do better to follow Alan’s advice and use the analytic expressions when it _is_ possible, as in this case.
Again, thanks, Alan, for your comments.