Understanding p values through simulation

statistics
frequentist
teaching
Author
Affiliation

School of Psychology, University of Sussex

Published

June 6, 2022

tldr

The properties of p values can be difficult to understand. Therefore, one way to develop a good conceptual understanding of p values is through simulation. The document below allows you to simulate experiments and to examine how the distribution of data and p-values change.

Background

This set of simulations gives a brief conceptual introduction to p values. It starts with an introduction to the sampling distribution and demonstrates how a p value and a significance threshold are derived from it. Following this, it demonstrates how p values are distributed under two conditions: first, when drawing samples from the null distribution and, second, when drawing samples from a population where non-null effects are present.

The sampling distribution

To understand where p values come from you must have some understanding of the sampling distribution, where it comes from, and how it is used. A simple definition of the sampling distribution is that it is the distribution of a test statistic under some specified model (usually the null model). If that definition seems a bit opaque then we can illustrate it by way of an example and a simulation.

Let us say that a researcher is interested in comparing how much people enjoy interacting with a robot under two conditions. In the control condition, the robot is a stock robot in the standard configuration as it is received from the factory. In the experimental condition, the robot has had a large yellow smiley face painted onto it.

The researcher could just collect data under these two conditions and compare the average ratings. But if the researcher examined the ratings in the two conditions, they would expect to see some difference. That is, even if the smiley face had no impact on participants’ ratings, it seems unlikely that the difference between the two conditions would be exactly zero.

Therefore, if the researcher does observe some non-zero value, what are they to conclude? Rather than just giving you the answer for how the researcher should proceed, let us try to solve the problem with the information we have to hand. To do this, we’ll run some thought experiments (simulations).

In our first thought experiment, we won’t have two conditions. We’ll have only one condition, our control condition, and we’ll simply test people twice under it. For each person, we’ll calculate the difference between their score on the first test and their score on the second test. People might not get identical scores each time. Some people might score higher on the first test and lower on the second. Other people might score lower on the first and higher on the second. On average, however, if we tested very many people, we’d expect the average difference to be somewhere around zero.

We can simulate this by drawing random values from some distribution that has an average of zero. We can really draw these numbers any way we want. We could draw them from the familiar bell-shaped normal distribution, but we don’t need to. We could equally well roll dice. A single die has 6 sides, and each side comes up roughly an equal number of times. This means that the average value a die shows is 3.5, which is just the middle of all the possible values from 1 to 6. If we wanted our average to be zero, we could generate the data for our thought experiment by rolling the die and simply subtracting 3.5 from whatever value it shows.

So let’s start running our thought experiment. We’ll roll the die once, look at the number, and subtract 3.5. This value will be the score for participant 1 in condition 1. We’ll then roll the die again, look at the number, and subtract 3.5 to give us the score for participant 1 in condition 2. Next we work out the difference between condition 1 and condition 2, because what we’re really interested in is the difference between conditions (that is, the impact of drawing the smiley face on the robot). We now have one data point. But this isn’t enough, so we repeat the process, for example, 11 times. This gives us pretend data for 11 participants. We can now work out the average across all participants, so we can see whether, on average, the scores are higher in condition 1 or condition 2. Remember, we expect this difference to be zero, because we don’t actually have two conditions; we have one condition repeated twice.
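
If you would like to follow along in code, here is a minimal sketch of one such thought experiment. This is illustrative Python (with numpy as an assumed dependency), not the code that drives the interactive widget below.

```python
# Illustrative sketch (not the widget's code): one simulated "experiment".
# Roll a fair six-sided die for each condition, centre the rolls on zero by
# subtracting 3.5, and average the per-participant differences.
import numpy as np

rng = np.random.default_rng(seed=1)
n_participants = 11

cond1 = rng.integers(1, 7, size=n_participants) - 3.5  # condition 1 scores
cond2 = rng.integers(1, 7, size=n_participants) - 3.5  # condition 2 scores
differences = cond1 - cond2                             # per-participant differences
mean_difference = differences.mean()                    # the outcome of this experiment

print(f"Mean difference for this simulated experiment: {mean_difference:.2f}")
```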

This average difference that we’ve now calculated from our dice rolls will be the outcome of our first thought experiment. But this isn’t enough. We’re not just going to do this once. Instead, we’ll do it 100,000 times. And we’ll take the outcomes from these 100,000 experiments and draw them on a plot. We can see a histogram of the outcomes in the top row of Figure 1.
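
A sketch of the full simulation might look something like this (again, illustrative Python with numpy and matplotlib as assumed dependencies, not the widget’s own code):

```python
# Illustrative sketch: repeat the thought experiment 100,000 times and plot a
# histogram of the outcomes (one mean difference per simulated experiment).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
n_experiments, n_participants = 100_000, 11

# One row per simulated experiment, one column per participant.
cond1 = rng.integers(1, 7, size=(n_experiments, n_participants)) - 3.5
cond2 = rng.integers(1, 7, size=(n_experiments, n_participants)) - 3.5
mean_differences = (cond1 - cond2).mean(axis=1)

plt.hist(mean_differences, bins=100)
plt.xlabel("Mean difference (condition 1 minus condition 2)")
plt.ylabel("Number of simulated experiments")
plt.show()
```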

The plot in Figure 1 shows a distribution centered at about zero, but with values spread out either side. This spread is because we won’t get identical values on consecutive rolls. Sometimes the values in condition 1 might be slightly higher than in condition 2, and the average difference for our experiment will be positive. Other times, values in condition 2 might be slightly higher than in condition 1, in which case the average difference for that thought experiment will be negative.

You might be wondering, however, why the scores are spread out in the exact way they are. The spread of the scores is determined by the process we used to generate the random values for our thought experiment. We rolled a 6-sided die, but we could have rolled two 6-sided dice, or a 20-sided die, or used some other process to generate the random numbers. For a 6-sided die, the theoretical maximum difference you could get between the two conditions is 5. This would occur if we just so happened to roll 1 every time we were rolling for condition 1 and 6 every time we were rolling for condition 2. This is an unlikely outcome, but it could happen! If we were rolling a 20-sided die, then the theoretical maximum difference would be larger, so we’d expect the spread to potentially be wider. You can adjust the spread on the slider labelled Variability (SD) and re-run the simulations by clicking Generate sampling distribution on the widget below. When we adjust the slider and re-run the simulation, we can see that the histogram of our 100,000 thought experiments changes dramatically. The obvious question now is, what value do we set the slider to? What is the correct way of generating the random numbers for our thought experiment?
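
To see in code how the choice of random-number process changes the spread, here is an illustrative comparison of a 6-sided and a 20-sided die (an assumed example, not the widget’s internals):

```python
# Illustrative sketch: the spread of the outcomes depends on how the random
# numbers are generated. Compare a 6-sided die with a 20-sided die.
import numpy as np

rng = np.random.default_rng(seed=1)
n_experiments, n_participants = 100_000, 11

def spread_of_outcomes(sides: int) -> float:
    """SD of the mean difference when scores come from a `sides`-sided die."""
    centre = (sides + 1) / 2  # e.g. 3.5 for a six-sided die
    cond1 = rng.integers(1, sides + 1, size=(n_experiments, n_participants)) - centre
    cond2 = rng.integers(1, sides + 1, size=(n_experiments, n_participants)) - centre
    return (cond1 - cond2).mean(axis=1).std()

print(f"SD of outcomes with a 6-sided die : {spread_of_outcomes(6):.2f}")
print(f"SD of outcomes with a 20-sided die: {spread_of_outcomes(20):.2f}")  # noticeably wider
```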

Instead of trying to come up with the correct value for the slider, we’re just going to use a trick so that the exact value of the slider doesn’t make a difference.

Figure 1: Top: The mean score for each of the 100,000 experiments. Bottom left: Mean score re-scaled to Cohen’s d. Bottom right: Mean score re-scaled to a t statistic. The vertical lines indicate the central 95% of the distribution.

Our trick is going to involve scaling our results. Previously, we just worked out the average difference for each experiment and plotted that. This is what’s shown on the top row of Figure 1. To scale the value, we’ll divide the average difference for each simulated experiment by the variability in the scores for that simulated experiment. Technically, this means we’re working out an effect size (a Cohen’s d value, to be exact). Our new scaled scores are shown on the bottom left of Figure 1. Try changing the value on the variability slider again and re-running the simulations. Notice that no matter what value you select, the histogram looks the same! This is going to be the first piece of our puzzle.
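
In code, the scaling step might look like this. The simulated_cohens_d helper is a hypothetical name used only for illustration; the point is simply that dividing by the variability makes the choice of die roughly irrelevant:

```python
# Illustrative sketch: rescale each simulated experiment to Cohen's d by
# dividing its mean difference by the SD of its difference scores. The shape
# of the resulting distribution is roughly the same whichever die we roll.
import numpy as np

rng = np.random.default_rng(seed=1)
n_experiments, n_participants = 100_000, 11

def simulated_cohens_d(sides: int) -> np.ndarray:
    centre = (sides + 1) / 2
    cond1 = rng.integers(1, sides + 1, size=(n_experiments, n_participants)) - centre
    cond2 = rng.integers(1, sides + 1, size=(n_experiments, n_participants)) - centre
    diffs = cond1 - cond2
    return diffs.mean(axis=1) / diffs.std(axis=1, ddof=1)

for sides in (6, 20):
    d = simulated_cohens_d(sides)
    print(f"{sides:>2}-sided die: SD of the Cohen's d values = {d.std():.2f}")
```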

What does this distribution tell us, and how can it help us? It tells us the range of values (in units of Cohen’s d) that we would expect to see if in fact there was no difference between condition 1 and condition 2 (that is, if the numbers for both conditions were generated in the same way). We can see from the histogram that values near zero occur very frequently, while values at the extremes (for example, values near +/- 2) occur far less frequently, at least for simulated experiments with a sample size of around 11. Try changing the sample size slider and re-running the simulations. Try a value of 5 and a value of 49. You’ll see that the Cohen’s d plot changes dramatically, from wide at 5 to very skinny at 49. This means that if we want to work out the range of values our thought experiments will produce, we’ll also have to take the sample size into account. We can do this by scaling our result in a slightly different way.

To work out our new scaled value, we’ll take our Cohen’s d value and multiply it by the square root of the sample size. This new scaled value is the t statistic. On the bottom right of Figure 1 we can see the histogram of t values from our 100,000 simulated experiments. Try adjusting the sliders (both variability and sample size) and see how the histogram of these new scaled values hardly changes. To make the relatively unchanging nature of this plot clear, click the checkbox to lock the y axis limits. You’ll see that adjusting the variability makes no difference to the shape of this t distribution. Adjusting the sample size does make a small difference, not to the spread, but to the edges of the distribution (known as the tails), which get fatter or skinnier. So even with our new scaled t scores we will have to take sample size into account, but the impact is far less dramatic, which makes the maths a little easier. We’re going to use this distribution of t values (the sampling distribution) to help guide the researcher on what to do with the result of their real experiment. This will involve inventing the idea of statistical significance and the p value.
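
A sketch of this final scaling step, again in illustrative Python:

```python
# Illustrative sketch: rescale Cohen's d to a t statistic by multiplying by
# the square root of the sample size. The histogram of these values is the
# (simulated) sampling distribution of t.
import numpy as np

rng = np.random.default_rng(seed=1)
n_experiments, n_participants = 100_000, 11

cond1 = rng.integers(1, 7, size=(n_experiments, n_participants)) - 3.5
cond2 = rng.integers(1, 7, size=(n_experiments, n_participants)) - 3.5
diffs = cond1 - cond2

cohens_d = diffs.mean(axis=1) / diffs.std(axis=1, ddof=1)
t_values = cohens_d * np.sqrt(n_participants)

print(f"SD of the simulated t values: {t_values.std():.2f}")
```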

Using the sampling distribution

If you take a look at the sampling distribution (bottom right) you’ll see that there are two red vertical lines on the plot. 95% of the distribution falls between these two red lines. Let us think about what this means. It means that if there’s no difference between the two conditions then 95% of experiment outcomes would fall somewhere between these two lines. We know this, because that’s the process we used to generate the distribution. Only 5% of results (2.5% on each side) would fall outside the red lines. From this, we can say two things: First, if a t value from an experiment falls between the lines then it would be unsurprising. It would be unsurprising because this would happen 95% of the time! Second, if the t value from an experiment falls outside the lines then it would be surprising, because it would only happen 5% of the time. Statisticians don’t use the terms surprising and unsurprising. Instead, they use the terms statistically significant (surprising) and not statistically significant (unsurprising). So if the result of an experiment produced a t value that fell outside of the red lines, we’d say that “there is a statistically significant difference between condition 1 and condition 2”.
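
In code, finding the two red lines from a simulated null distribution and checking a result against them might look like this (the observed t value here is a made-up example):

```python
# Illustrative sketch: find the two red lines (the cut-offs containing the
# central 95% of the simulated null t values) and check whether a made-up
# observed t value would count as statistically significant.
import numpy as np

rng = np.random.default_rng(seed=1)
n_experiments, n_participants = 100_000, 11

cond1 = rng.integers(1, 7, size=(n_experiments, n_participants)) - 3.5
cond2 = rng.integers(1, 7, size=(n_experiments, n_participants)) - 3.5
diffs = cond1 - cond2
t_null = diffs.mean(axis=1) / (diffs.std(axis=1, ddof=1) / np.sqrt(n_participants))

lower, upper = np.quantile(t_null, [0.025, 0.975])  # the two red lines
print(f"Central 95% of the null t distribution: [{lower:.2f}, {upper:.2f}]")

observed_t = 2.8  # a hypothetical result from a real experiment
if observed_t < lower or observed_t > upper:
    print("Statistically significant (a surprising result under the null)")
else:
    print("Not statistically significant (an unsurprising result under the null)")
```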

Usually researchers will want to say more than just whether a result is statistically significant or not. A p value is also reported. The p value just gives you a measure of where a result falls relative to the sampling distribution. If a result falls exactly on the red line then the p value is 0.05. This means that only 5% of possible values would be more surprising than that experiment’s result. If the result fell slightly outside the red lines then the p value might be something like 0.03, or 0.01. This means that only 3% or 1% of possible values would be more surprising than the current result. In contrast, a t value that fell just within the red lines might correspond to a p value of, for example, 0.1. This would mean that 10% of simulated results are more surprising than that result. So a p value, or statistical significance, just tells you how surprising the researcher’s result is relative to the range of possible results that would occur if there was in fact no difference between the conditions. This might be some useful information. But how useful? In particular, how informative is a single p value from a single experiment?
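
In practice, the p value is usually read off the theoretical t distribution rather than a simulated one. Here is a sketch using scipy (an assumed dependency) for a hypothetical observed t value with 11 participants:

```python
# Illustrative sketch: a two-sided p value for a hypothetical observed t,
# read off the theoretical t distribution with n - 1 degrees of freedom.
from scipy import stats

n_participants = 11
observed_t = 2.8  # hypothetical result

p_value = 2 * stats.t.sf(abs(observed_t), df=n_participants - 1)
print(f"p = {p_value:.3f}")  # roughly 0.02: about 2% of null results would be more surprising
```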

p values for a real difference

In Figure 2 we can see the distribution of t values when there is in fact a difference between conditions (in this case, a difference of 0.5 Cohen’s d units) and when there is no difference between conditions (the null distribution). The vertical red lines show the bounds of statistical significance derived from the null distribution. For the null distribution, 95% of results fall within the lines and 5% of results fall outside of the lines. For the “true effect” distribution the percentage of values outside the lines is larger. That is, statistically significant p values would occur more often than 5% of the time if there is in fact a difference between conditions. You’ll be able to explore exactly how often in the section that follows. For now, the point is a simple one. Statistically significant p values occur whether there is a difference between conditions or not. That is, simply obtaining a single statistically significant p value does not tell you much about whether there is a real difference or not, because statistically significant p values occur in both the null setup and the real effect setup. However, how often statistically significant p values will occur if we run the experiment several times (that is, when we replicate the experiment) will be different under the two setups. We can see this in the next section where we’ll look at the distribution of p values themselves.

Figure 2: Left: The distribution of t values when the true effect size is d = 0.5 and n = 10. Right: The null distribution (d = 0, n = 10).
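
To get a feel for how much more often significant results occur under a true effect, here is an illustrative simulation that mirrors the setup in Figure 2 (d = 0.5, n = 10). The normal draws and the use of scipy are assumptions made for the illustration:

```python
# Illustrative sketch mirroring Figure 2: how often does a statistically
# significant t occur when the true effect is d = 0.5 versus d = 0 (null)?
# Difference scores are drawn from a normal distribution for simplicity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_experiments, n = 100_000, 10

def proportion_significant(true_d: float) -> float:
    diffs = rng.normal(loc=true_d, scale=1.0, size=(n_experiments, n))
    t = diffs.mean(axis=1) / (diffs.std(axis=1, ddof=1) / np.sqrt(n))
    critical = stats.t.ppf(0.975, df=n - 1)  # two-sided 5% cut-off
    return np.mean(np.abs(t) > critical)

print(f"Proportion significant when d = 0.0: {proportion_significant(0.0):.3f}")  # about 5%
print(f"Proportion significant when d = 0.5: {proportion_significant(0.5):.3f}")  # noticeably more
```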

The distribution of p values

The outcomes of 100,000 simulated experiments are shown below. Figure 3 shows the distribution of data (the average result from each experiment) that would be generated from the 100,000 simulated experiments if the true effect size was Cohen’s d = 0 and the sample size for each simulated experiment was n = 11. Figure 4 shows the distribution of p values computed from these data.

Adjust the sliders below to see how the distributions change. Adjusting the Effect size slider will change where the data distribution is centered. Adjusting the sample size slider will change the width of the data distribution. As the sample size increases, the data distribution will get narrower. Decreasing the sample size will make the data distribution wider.

When the effect size is set to 0 (that is, when samples are drawn from the null distribution), the distribution of p values is uniform. That is, p values between, for example, 0 and 0.05 occur just as often as p values between 0.5 and 0.55, or 0.9 and 0.95, and so on. When the effect size is not 0, the p value distribution is skewed, with smaller p values occurring more often than larger p values.
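
Here is an illustrative check of both claims, comparing the p value distribution when the true effect size is 0 and when it is 0.5 (the effect size, sample size, and normal draws are assumptions made for the illustration):

```python
# Illustrative sketch: the distribution of p values is uniform when the true
# effect is 0 and skewed towards small values when it is not (here d = 0.5).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_experiments, n = 100_000, 11

def simulated_p_values(true_d: float) -> np.ndarray:
    diffs = rng.normal(loc=true_d, scale=1.0, size=(n_experiments, n))
    t = diffs.mean(axis=1) / (diffs.std(axis=1, ddof=1) / np.sqrt(n))
    return 2 * stats.t.sf(np.abs(t), df=n - 1)  # two-sided p for each experiment

for d in (0.0, 0.5):
    p = simulated_p_values(d)
    below_05 = np.mean(p < 0.05)
    mid_band = np.mean((p >= 0.50) & (p < 0.55))
    print(f"d = {d}: {below_05:.3f} of p values are below 0.05, "
          f"{mid_band:.3f} are between 0.50 and 0.55")
```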

From this we can see that although a single p value might not tell us much, the distribution of p values across many replicated experiments is very informative. How informative will depend on the sample sizes of the experiments, the effect size being studied, and whether we have replicated the experiment enough times to get a good idea of the p value distribution.

Figure 3: The distribution of results from simulated experiments. The true effect is indicated with a vertical line.

Figure 4: The distribution of p-values from simulated experiments. p = 0.05 is indicated with the vertical line.

Although it’s outside the scope of the current post, by way of a preview: the skewness of the p value distribution in Figure 4 is what will allow us to derive the concept of statistical power! But that’s for another time.

Citation

BibTeX citation:
@online{colling2022,
  author = {Lincoln Colling},
  editor = {},
  title = {Understanding *p* Values Through Simulation},
  date = {2022-06-06},
  url = {https://research.colling.net.nz/p-values},
  langid = {en},
  abstract = {The properties of *p* values can be difficult to
    understand. Therefore, one way to develop a good conceptual
    understanding of *p* values is through simulation. The document
    below allows you to simulate experiments and to examine how the
    distribution of *data* and *p-values* change.}
}
For attribution, please cite this work as:
Lincoln Colling. 2022. “Understanding *p* Values Through Simulation.” June 6, 2022. https://research.colling.net.nz/p-values.