Everything you wanted to know about small samples but were afraid to ask
Readers of this site are probably aware that you can’t read too much into statistics this early in the season because players have only given us a small sample of data to work with. The idea is that a small sample simply doesn’t contain enough data to support informed conclusions. This is technically true, but one of my biggest pet peeves in baseball analysis is that analysts seldom, if ever, define the population of data they are sampling, and therefore never actually demonstrate that the sample is small. This drives me up the wall, because even good analysts frequently call a sample small when, in fact, it is not. Good baseball analysts also tend to dismiss trends drawn from small samples too quickly. The purpose of this post is to correct a few of these assumptions and, hopefully, offer something helpful to the debate.
The number one mistake baseball analysts make when discussing a small sample of data is failing to understand what actually defines one. Instead, the term is thrown around to describe any data collected over a period of time that the analyst feels does not accurately reflect a player’s or team’s true level of performance. You may, for example, see someone suggest that through early May it is unwise to read too much into a player’s performance because the amount of data available constitutes a small sample. This seems true because we know, for example, that Mark Teixeira’s numbers through May of last year did not reflect his season totals, but in Tex’s case early May actually was a large sample, by definition. Don’t take my word for it. According to Kaplan, a small sample is "defined as a sample size of 30 or fewer items" (page 98). That means that by virtually any measure (games, at-bats, plate appearances) Tex provided us a large sample of data through the first five weeks of that season. That sample just didn’t reflect the performance we wanted to see.
The number 30 is not an arbitrary cutoff for separating small and large samples. It is rooted in an important distinction of statistical inference: the use of statistics to draw conclusions about populations of data (the entire collection of data possibly available) using only samples (subsets of those populations). At 31 data points or more, the t-distribution, the distribution statisticians use to draw conclusions when only a small sample is available, closely approximates the normal distribution (Kaplan, page 104). At 30 observations or fewer, the t-distribution is still bell-shaped, like the normal distribution, but it has fatter tails. Those fatter tails reflect the fact that an estimate of the mean drawn from a small sample is more easily pulled around by outlier values than it would be under a normal distribution.
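To see that convergence concretely, here is a quick sketch in Python (using SciPy, which is my addition and not part of the original argument) comparing the 95% critical values of the two distributions as the sample grows; by roughly 31 observations the gap is already small.

```python
# Sketch: the t-distribution's critical values approach the normal's as n grows.
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # 95% two-sided critical value for the normal
for n in (5, 10, 31, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)  # same critical value for a sample of size n
    print(f"n={n:>4}  t={t_crit:.3f}  normal={z_crit:.3f}  gap={t_crit - z_crit:.3f}")
```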
This latter point is why it is correct to caveat, but not entirely reject, conclusions drawn from small samples. It’s not that those conclusions are wrong; it’s just that extreme outlier values in the sample can pull the sample mean far from the true one. Robinson Cano provided us a perfect example on Saturday. Through Friday he was batting .276/.300/.483. He had a great game Saturday and entered Sunday’s game batting .324/.343/.618. Cano’s performances before and during Saturday’s game were both outliers, and together they knocked his slash stats all over the place.
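To see how little it takes for one game to move the needle, here is the arithmetic sketched out in Python. The at-bat, hit, and total-base counts below are assumptions chosen only because they reproduce the quoted slash lines; they are not official totals.

```python
# Hypothetical counts that happen to reproduce the quoted slash lines.
before   = {"AB": 29, "H": 8, "TB": 14, "PA": 30, "OB": 9}   # .276/.300/.483 through Friday
saturday = {"AB": 5,  "H": 3, "TB": 7,  "PA": 5,  "OB": 3}   # one big game

after = {k: before[k] + saturday[k] for k in before}

def slash(line):
    """Return (AVG, OBP, SLG) for a dict of counting stats."""
    return line["H"] / line["AB"], line["OB"] / line["PA"], line["TB"] / line["AB"]

print("through Friday:  %.3f/%.3f/%.3f" % slash(before))   # 0.276/0.300/0.483
print("after Saturday:  %.3f/%.3f/%.3f" % slash(after))    # 0.324/0.343/0.618
```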
If by now you accept that samples of 31 observations or more are, by definition, large, then baseball players provide us with large samples quickly. For batters, a large sample can be said to begin at 31 plate appearances, a threshold most Yankee position players have already crossed. For pitchers, a large sample can be said to begin at 31 innings. This will take a little while longer, but is almost there.
This, then, raises a different question. If large samples have more than 30 observations, and can be used to draw conclusions with few caveats, why do the first 31 plate appearances of a player’s season, or the first 31 games of a team’s season, so often fail to predict the rest of that season? There are several answers. The first has to do with variability: as more observations are collected, the standard error of the sample mean shrinks, which improves the accuracy of conclusions drawn from it. Another perfectly valid explanation may be that the true population of baseball data is not, in fact, normally distributed, which throws almost all of our assumptions about mean tendencies out the window (and is a far more complex topic).
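The first point is easy to see with a small simulation. The sketch below assumes a hitter whose true average is .300 and treats every at-bat as an independent coin flip (a big simplification); the spread of the observed average shrinks steadily as the sample grows.

```python
# Sketch: how the spread of an observed batting average shrinks with sample size.
import numpy as np

rng = np.random.default_rng(0)
true_avg = 0.300
for n in (31, 100, 300, 600):
    # 10,000 simulated stretches of n at-bats for a true .300 hitter
    observed = rng.binomial(n, true_avg, size=10_000) / n
    print(f"n={n:>3}  mean={observed.mean():.3f}  spread (std dev)={observed.std():.3f}")
```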
There is, however, a third explanation that is often overlooked. Just because two samples are large does not mean that they were drawn from the same population of data. They may, in fact, be samples of two different populations that need to be kept separate. Two examples illustrate the point. First, imagine you want to estimate the height of people living in New York City. You measure the heights of individuals as they leave a bus. Over a sample of 40 individuals (a large sample) you get an average height of 4’7″, with only three individuals coming in at over 5′ tall. This happened because I declined to mention that the bus was a school bus. The only adults on board were two teachers and the driver; the rest were school children. This is an example of omitted variable bias: we missed a critical fact about our data, one that would have changed our analysis. School children are not representative of New York’s adult population and need to be separated from the adults.
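A quick simulation makes the bias plain. The height figures below are invented purely for illustration; the point is that a sample mixing two populations yields a mean that describes neither one.

```python
# Sketch: mixing two populations (adults and school children) biases the sample mean.
import numpy as np

rng = np.random.default_rng(1)
adults   = rng.normal(loc=67, scale=3, size=3)    # two teachers and the driver (inches)
children = rng.normal(loc=53, scale=3, size=37)   # the rest of the school bus
bus_sample = np.concatenate([adults, children])   # 40 people: a "large" sample

print(f"bus sample mean:  {bus_sample.mean():.1f} inches")   # roughly 4'6", far from the adult mean
print(f"adults-only mean: {adults.mean():.1f} inches")
```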
Returning to baseball for the second example, we all know that in 2010 Tex hit like garbage in April, a bit better in May and June, was on fire in July and August, and then went ice cold in September. Tex’s performance was so different in each of these periods that, while they are all parts of the same season, they are probably samples of different, independent populations of data (April Tex, Tex swinging the bat well, injured Tex, etc.). Throughout the course of a season, things happen to players that have the same effect as the school bus in the first example. As players get hurt or make adjustments that change their performance, one sample ends and another begins. As analysts, it is our job to recognize when these changes have occurred and to separate the samples.
In conclusion, small samples get beaten up a lot in baseball analysis because they are misunderstood. While there are justifiable caveats on conclusions drawn from small samples, those caveats are not as damning as we often think. Furthermore, small samples are much smaller than we often realize: once we are observing more than 30 data points for whatever we are analyzing, we are officially working with a large sample by the statistical definition. Baseball performance varies wildly over the course of a season not because it takes time for a player or team to submit a large sample, but because analysts often fail to separate independent samples whose means appear similar at first blush but are in fact different.
13 Responses to Everything you wanted to know about small samples but were afraid to ask
I appreciate the post, and it was made with a great deal of thought. Having said that, we have to consider that samples of 30 still have high standard deviations and are, as you said, subject to a lot of variability. In addition, many statistics tend to “stabilize” only after a certain number of plate appearances. This Fangraphs article is probably very valuable in this discussion as well.
Also, since baseball generally provides us with samples much larger than 30 PA, would it not make sense to hold back judgment until we reach a sample with a smaller SD?
I think there are two issues here. First, since the gap between a good hitter and a bad hitter is a pretty fine line, a large sample with a huge standard deviation is about as functionally valuable as a small sample. It gives me an equal lack of confidence in the conclusions.
Second, I think your last paragraph is troublesome. You assume that we can actually identify the independent samples that go into a player’s season, something I doubt. You also seem to be saying that it would be statistically sound to isolate all of these samples, whereas I’m not so sure that it would really aid your predictive abilities much. I’m quite confident that taking an entire season as a sample would give you more accurate conclusions than breaking the sample down by various factors and then attempting to apply those results to future seasons.
Thanks so much for this post, I really enjoyed it!
I don’t know a ton about this sort of thing, but the benchmark of 30 for an adequate sample is premised on those 30 observations being selected at random, right? Thirty consecutive plate appearances in May, or at any other point in a year, is not a randomly selected sample. If you took 30 plate appearances selected randomly from across Mark Teixeira’s entire career, though, I would bet it would give you a good idea of what kind of hitter he is.
Damian Reply:
April 12th, 2011 at 10:18 am
To expound a little bit, the necessary controls might be lacking for 30 consecutive plate appearances in May to tell us all that much about Teixeira. He faces various pitchers, hits from different sides of the plate, faces various defenses, and plays under various weather conditions, all of which might not represent what is normal for a baseball player generally. If I run an experiment to determine how long it takes for mineral X to have reaction Y with chemical Z, maybe I can do it 30 times and get accurate results, but only because I create a controlled environment where I can isolate the characteristic I intend to study. There are too many uncontrolled aspects of playing baseball to make a one-month body of statistics reliable.
Again, I don’t actually know anything about this, so this is just my gut talking. Please correct me where appropriate.
Moshe Mandel Reply:
April 12th, 2011 at 10:32 am
No, I think you make an excellent point, which is why even a technically large enough sample isn’t really that meaningful in this context. The only thing that can ‘fix’ the data is more data, IMO.
Jeremy T Reply:
April 12th, 2011 at 8:09 pm
So maybe for batter vs pitcher matchups 30 would be the point where it becomes useful? I feel like I remember seeing a post on this not too long ago.
Hi everyone,
Thank you for your thoughtful comments. All of these are valid points. When I went into this post I wanted to draw attention to something specific, but found that I had to limit myself greatly.
We definitely don’t go as far into the analysis as we should if we want to draw more robust conclusions. For example, we focus a lot on the size of the sample, but we tend to ignore its standard deviation, confidence level, and power. All of these statistics would help us understand the predictive ability of a dataset of 31 observations, or a dataset of 3,100. If the variance in either set is too high, neither may allow us to draw a conclusion that is statistically different from zero.
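As a rough illustration of that last point, the sketch below uses a simple normal approximation for a proportion (an assumption, and not the only way to do this) to show how wide a 95% confidence interval around a .300 observed average remains at different sample sizes.

```python
# Sketch: 95% confidence interval for a .300 observed average, normal approximation.
import math

p = 0.300
for n in (31, 100, 600):
    se = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
    half_width = 1.96 * se            # 95% two-sided margin
    print(f"n={n:>3}  .300 +/- {half_width:.3f}  ->  ({p - half_width:.3f}, {p + half_width:.3f})")
```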
With regard to random draws, yes, absolutely correct. This was another instance where I felt the need to draw a line on what to include in the post and see what emerged from the comments. Sampling Tex’s performance from only April or May is not a random draw, and therefore violates the assumptions of statistical inference. In that vein, most hypothesis testing requires that the population of data be normally distributed, which it almost never is.
Regarding the ability to segregate independent populations, I stand by my conclusion. If I were in the business of forecasting player performance I would want to introduce dummy variables to capture when a player was injured, at the very least. The probability of injury, and performance while injured, are important to segregate when forecasting, similar to seasonality in business.
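For anyone curious what that could look like, here is a bare-bones sketch. The monthly numbers and the injury flag are entirely hypothetical; the point is only that a dummy variable lets a single model estimate a healthy baseline and an injury effect separately.

```python
# Sketch: an injury dummy variable separates a healthy baseline from an injury effect.
import numpy as np

# Hypothetical monthly wOBA and a flag marking months played while hurt.
woba    = np.array([0.380, 0.395, 0.310, 0.300, 0.390, 0.385])
injured = np.array([0,     0,     1,     1,     0,     0], dtype=float)

X = np.column_stack([np.ones_like(woba), injured])    # intercept + dummy
coefs, *_ = np.linalg.lstsq(X, woba, rcond=None)
print(f"healthy baseline: {coefs[0]:.3f}, injury effect: {coefs[1]:+.3f}")
```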
I’m glad everyone enjoyed the post. I’ve done my best to share my thoughts on a difficult topic. Hopefully I’ve done more good than harm.
Terrific stuff, MJR. One thing that has been annoying me lately is the people running around saying “We only have 9 games with Jeter,” as if last season never happened. What’s most troubling about these 9 games is that they are a continuation of what happened last year, both in his approach at the plate and in his results. April of 2010 was actually the last month Jeter produced anything resembling his career norms, and he’s not a Tex type with a history of slow starts. If anything, April of 2010 pulled UP his averages for last season. So if he continues producing with May 2010 forward as the baseline, we can expect further erosion year over year. That is why I put little value in those touting his preseason projections showing him poised for a rebound; those numbers were skewed upward by totals produced while he was still in his prime.
I’d like to add something else. When we look at a down few weeks or month, we’re looking solely at the player. But what about the competition? Who was that pitcher facing? Which pitchers was the hitter facing?
For example, if a hitter who suffers from a big platoon split happens to face a bunch of lefties one month, his numbers will be down. If a pitcher with an R/L split faces a team loaded with good hitters of the troublesome handedness twice in one month, his month will look bad and everyone will be searching for answers.
We would be well served to understand who a player is, know his scouting report, and look at his competition before dismissing something as a small sample. While it may be a few games, it may highlight who someone is as a player.
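To put rough numbers on the platoon point, the sketch below uses invented splits to show how much a lefty-heavy month alone can drag down a hitter’s expected average, with no change in his true talent.

```python
# Sketch: a skewed mix of opposing pitchers moves a month's numbers on its own.
vs_rhp, vs_lhp = 0.310, 0.240   # hypothetical true averages by pitcher handedness

for lhp_share in (0.25, 0.60):  # a typical month vs. a lefty-heavy month
    expected = (1 - lhp_share) * vs_rhp + lhp_share * vs_lhp
    print(f"{lhp_share:.0%} of PAs vs. LHP -> expected average {expected:.3f}")
```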