Speeding up BFDA

Part 1: Functional style code

Categories: functional programming, r

Author: Lincoln J Colling
Affiliation: School of Psychology, University of Sussex
Published: May 24, 2022

tldr

There’s recently been a bit of talk on Twitter about the relative benefits of loops versus more functional-style code. Most of these discussions have centred around the relative performance benefits. However, I think this slightly misses the point. Functional-style code allows you to separate the what and the how of your code in a way that opens up new ways of thinking about problems. The end result is often faster code, but this benefit comes from writing cleaner code that encourages clearer thinking. In this post, I show how to speed up some of the simulation code in the {BFDA} R package using functional design patterns: specifically, reductions and accumulations (aka scans).

Motivation

There has recently been some discussion on Twitter about the relative merits of loops versus functional-style programming in R. Most of these discussions have centered on the relative performance (in terms of speed) of for loops versus the apply and purrr::map family of higher-order functions. In the past, it may have been the case that for loops were slower when compared with apply; however, nowadays it is not so clear-cut. I, personally, prefer writing functional-style code. But my preference isn’t about performance per se. I think writing functional-style code changes the way you think about solving problems, and this lets you, in the end, write more performant code. In this blog post, I plan to take some slow code that relies on loops and re-write it in a functional style. In the end, we’ll solve the problem in a slightly different way, but the output will be the same, and the code will be substantially quicker!

Background

The code that I’ve decided to use as my starting point is the simulation code from the {BFDA} R package. Before I start, I should note that I really like {BFDA}, and I completely understand the design decisions that were made in how it was developed, so this post is not intended to pick on the developers. With that out of the way, let’s take a look at what {BFDA} does and why it’s slow. If you’re not interested in the background, and you just want to jump ahead to the code, then feel free to skip ahead to Spotting unnecessary computations.

BFDA and Bayesian “power” analysis

The concept of statistical power is a frequentist concept. There’s no such thing as statistical power in Bayesian statistics. Theoretically, statistical power tells you something about how often you’ll make particular kinds of errors (false negative errors) when certain conditions hold. Controlling the rate of certain errors and estimating the rate at which these errors occur is the bedrock of frequentist inference. Bayesian statistics, on the other hand, is concerned with quantifying evidence and, therefore, controlling error rates does not figure into it (see Colling & Szűcs, 2021 for more details on the distinction between frequentist and Bayesian approaches).

The practical reason why one might want to perform a power analysis is that power analyses can be used for planning studies. Specifically, power analyses can help you estimate the sample size required for a particular study (when certain conditions hold). While Bayesians might not care about power, they do care about planning studies. Specifically, they care about planning studies that have a high chance of providing strong evidence.

When it comes to planning sample sizes, Bayesians have an advantage relative to frequentists, but this advantage also leads to a slight complication. When planning a study, frequentists have to come up with a fixed sample size before commencing the study¹. Bayesians, however, are not restricted to fixed sample sizes and can instead decide to add more participants to their sample until they have achieved a satisfactory level of evidence. That is, Bayesians are permitted to engage in a practice known as optional stopping, which is forbidden under the rules of frequentist inference². It is optional stopping that adds the complication to Bayesian sample size planning.

Bayesian sample size planning

So how does one actually go about planning the sample size for a study using Bayes factors? The most straightforward way is simulation. When employing simulation, one simply draws samples from some theoretical population that matches the predicted statistical properties of the population the researcher is planning on studying.

Here is a toy example:

A researcher might believe that a particular group (for example, Jedi Initiates) has higher than normal midi-chlorian levels. To test this, the researcher uses a galaxy-normed midi-chlorian scale, where 100 represents the average³. The researcher predicts that on this scale, Jedi Initiates will have an average level of 110 or 0.5 standard deviations above the galaxy-normed value.

How many Jedi Initiates should the researcher test so that they have a good chance of finding strong evidence for this hypothesis if it is indeed true?⁴

Sample size planning for this experiment is straightforward when using simulation, but there are two ways to ask, “how many participants?” The simplest way is to ask, “would a sample size of N be sufficient to have a good chance of finding strong evidence?” To answer this, we could employ the following simulation procedure:

  1. Draw a sample of size N from a distribution with a mean of 0.5 and a standard deviation of 1 (this corresponds to a standardized mean difference of 0.5, which is what the effect size would be if Jedi Initiates scored 110 or 0.5 standard deviations above the mean).

  2. Run the analysis (for example, a one-sample default Bayesian t-test) and record the Bayes factor.

  3. Repeat steps 1 and 2 many times (e.g., 1000 times)

  4. Count up the proportion of samples that produced sufficient evidence (e.g., a Bayes factor above 10) and see whether this proportion is above some predefined threshold (e.g., 90%).
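To make that concrete, here’s a minimal sketch of what steps 1 to 4 might look like in R. It assumes the {BayesFactor} package, and the specific numbers (a candidate N of 50, 1000 repetitions, a Bayes factor threshold of 10, and a 90% criterion) are illustrative choices of mine rather than anything taken from {BFDA}.

simulate_fixed_n <- function(n, d = 0.5, n_sims = 1000, threshold = 10) {
  bfs <- replicate(n_sims, {         # step 3: repeat steps 1 and 2 many times
    x <- rnorm(n, mean = d, sd = 1)  # step 1: draw a sample of size N
    bf <- BayesFactor::ttestBF(x)    # step 2: default Bayesian one-sample t-test
    BayesFactor::extractBF(bf)$bf    # ...and record the Bayes factor
  })
  mean(bfs > threshold)              # step 4: proportion of samples with strong evidence
}

# For example: is a sample of 50 enough to clear the 90% criterion?
# simulate_fixed_n(50) >= 0.9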

When we throw optional stopping into the mix—as is permitted by Bayes factor analyses—the situation is complicated a bit. Now the procedure would be something like the following:

  1. Draw a sample of size Nmin

  2. Perform the analysis and record the evidence level.

  3. Add Nstep (the step size) more participants and repeat the analysis.

  4. Repeat steps 2 and 3 until reaching a sample size of Nmax.

  5. Repeat steps 1 to 4 many times (e.g., 1000 times)

  6. Count up the proportion of samples that produced sufficient evidence (e.g., a Bayes factor above 10) for all the sample sizes tested.

  7. The sample size that produced sufficient levels of evidence in, for example, 90% of samples is the target sample size.
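And here’s a rough sketch of a single sequential run (steps 1 to 4), again assuming {BayesFactor}; the boundaries (Nmin = 20, Nmax = 300, a step size of 10) are again just illustrative. Notice that every look recomputes the Bayes factor on the whole sub-sample from scratch; this is exactly the kind of wasted work we’ll come back to later.

one_sequential_run <- function(d = 0.5, n_min = 20, n_max = 300, step = 10) {
  x <- rnorm(n_max, mean = d, sd = 1)  # the full sample for this run
  ns <- seq(n_min, n_max, by = step)   # the sample sizes at which we "look"
  bfs <- vapply(ns, function(n) {
    BayesFactor::extractBF(BayesFactor::ttestBF(x[1:n]))$bf
  }, numeric(1))
  data.frame(n = ns, bf = bfs)         # the evidence at each look
}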

The second procedure may be conceptually simple, but it can get a little difficult to actually program up. This is where the {BFDA} package comes in. It performs that complex simulation procedure for you. And it does it more-or-less in the manner described above.

The {BFDA} R package

A common problem with simulation-based approaches is that they are incredibly slow, because they involve performing thousands or even hundreds of thousands of computations. All this adds up, meaning that simulations can take many minutes or even hours to complete. This is definitely true of the simulations performed by {BFDA}.

However, when studying the source code, I realized that {BFDA} didn’t have to be as slow as it was. The code, as it stands, performs many unnecessary computations that present opportunities for optimizing. And it was thinking about the problem in terms of functional-style code that helped me identify them.

Spotting unnecessary computations

To identify the first set of inefficiencies let us take a look at the simulation procedure in a bit more detail.

One of the core steps in performing an analysis of a single sample of size Ni is to work out the summary statistics (e.g., the mean, standard deviation, effect size, etc.) and the Bayes factor. The {BFDA} package does this by performing roughly the following steps.

  1. For a sample of Nmax participants, take a sub-sample of Ni participants.

  2. Work out the mean of the sub-sample

  3. Work out the standard deviation of the sub-sample

  4. Compute the effect size, and other statistics, of the sub-sample

  5. Compute the Bayes factor of the sub-sample

Once this is done, the sampling process will repeat, but a sub-sample of Ni + step size participants will now be taken from the sample of Nmax participants.

Loops, loops, everywhere

Let’s zoom in to steps 2 and 3 above. In these steps we’re working out a mean and a standard deviation. How do we do this? Well, we use the mean() and sd() functions in R. But computationally, what are we actually doing? Well, to work out a mean, we’re adding together all the numbers in the sample and dividing by the sample size. So we could re-write the mean function as follows:

my_mean <- function(x) {
  x_sum <- sum(x)
  x_n <- length(x)
  x_sum / x_n
}

But now let us think about what the sum() function is doing. When we sum up a list of numbers by hand we have a running total (which starts at 0) and we take each number in the list and add it to this running total. If we wanted to program that up, we could use a loop. That is, we could do something like the code below.

numbers <- c(1, 2, 32, 423, 4)
looped_sum <- 0
for (n in numbers) {
  looped_sum <- looped_sum + n
}
looped_sum == sum(numbers)
[1] TRUE

So taking a mean (because it involves taking a sum) means that we’re looping through a set of numbers. Once we have our mean, we can work out our standard deviation, more-or-less how we’d do it by hand. Notice in the code below that working out the standard deviation relies on the mean, which means we’re looping over the same set of numbers again. And then we’ll loop over them yet another time to work out the difference between each number and the mean, producing a list of squared distances.

numbers <- c(1, 2, 32, 423, 4)
mean_of_numbers <- mean(numbers)
squared_differences <- vector()
for (n in numbers) {
  squared_differences <- c(squared_differences, (n - mean_of_numbers)^2)
}

Once we have our new list of squared differences we sum them—that is, we loop over them—divide this value by one less than the sample size, and then take the square root.

my_sd <- sqrt(sum(squared_differences) / (length(numbers) - 1))
my_sd == sd(numbers)
[1] TRUE

You’ve probably lost count of the number of loops we have in there. But it’s a lot. We’ve looped over the same set of numbers several times, and even created a new set of numbers which we’ve then looped over. But matters get worse when we think about what happens when we perform the next simulation step and increase the sample size. When we do this, we repeat all the loops above for our new sub-sample, and that means we loop over the numbers that we’ve already looped over before. In fact, the first number in our full sample of random numbers might get looped over hundreds of times! The problem here is not with loops per se, but the fact that we’re performing so many loops with some data points looped over hundreds of times. There has to be a better way!

From loops to reductions

We’re going to figure out the better way in steps, and along the way, we’ll learn a bit about some functional programming concepts. The first concept will be the reduction, implemented in base R as Reduce() or in purrr as purrr::reduce().

To see how a reduction can help us, let us go back to our looped_sum example. In this example, the solution actually consists of two parts. We’ve got the loop itself. The loop means we’re repeatedly doing something. And then we have the something we’re actually doing. In the looped_sum case this something is adding each number to the running total. In the looped example, these two bits are mixed up. We can’t neatly separate the loop and the thing we’re doing.

Before we solve the problem with a reduction, let’s get clear about what the something is that we’re doing. The something is applying the function + with arguments representing the running total and the next number in the list. The output is going to be the new running total. We might not usually think of this process as involving functions because usually we’d spell it like this (as we wrote it above):

n <- 2
looped_sum <- 0
looped_sum <- looped_sum + n
looped_sum
[1] 2

But we can easily re-write that as follows:

n <- 2
looped_sum <- 0
looped_sum <- `+`(looped_sum, n)
looped_sum
[1] 2

This respelling makes it clearer that the + operator is just a regular function that takes two arguments. The first time it “loops”, the two arguments are some initial value, which we must supply, and the first item in our list. On each subsequent pass, the first argument is the output of the previous step and the second argument is the next value in our list, and this structure repeats for all the remaining values. The higher-order function Reduce() / purrr::reduce() allows us to perform this kind of loop while separately specifying the function that is applied at each step.
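To make the shape of this concrete, here are the first few passes unrolled by hand (just an illustrative expansion; it’s not how Reduce() is implemented internally):

step_1 <- `+`(0, numbers[1])      # initial value + first item
step_2 <- `+`(step_1, numbers[2]) # previous output + next item
step_3 <- `+`(step_2, numbers[3]) # and so on, until the list runs out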

Note that reductions expect the function you supply (the something that you’re doing) to be a binary function—that is, a function that takes two arguments. Luckily, + is a function that takes exactly two arguments. To work, the Reduce() and purrr::reduce() higher-order functions need to know three things: the binary function (.f for reduce() and f for Reduce()), which in our case is +; the list of values (.x or x) the function is applied to, which in our case is the vector called numbers; and the initial value (.init or init) that will be used as the first argument the first time the function is called, which for us will be 0. The code snippets below show this in action.

Reduce(f = `+`, x = numbers, init = 0)
[1] 462

Or with purrr as follows:

purrr::reduce(.x = numbers, .f = `+`, .init = 0)
[1] 462

We can see that, compared to our looped_sum example above, the code is a lot more compact. The versions using reductions are also a lot faster than the loop (although all versions are slower than the native sum() function, which is actually implemented in C rather than R). But the point here isn’t about performance. Rather, the point I want to emphasize is that the versions using reductions allow us to clearly separate out the thing we’re doing—applying the + function—and the “loop”. Separating out the what from the how is going to make thinking about the problem a lot easier. But it’ll also make it easier for us to switch our thinking about what we’re “looping” over. Rather than looping over sets of data, we’re going to think about processing streams of data.

Streams not sets

When we use loops, there are essentially two kinds: for loops, which I’ve used above, and while loops. for loops repeat a specified number of times, while while loops repeat until some condition is no longer met. for loops force us to think about how many times our loop is going to repeat. That means we’re encouraged to think of the items we’re looping over as a set of items of a fixed size. When we don’t know in advance how big the set will be, we can use a while loop instead and just loop until some condition is no longer met. But a while loop means that we have to think about how to specify that ending condition, which can sometimes be complicated.

But there is an alternative to thinking about things in terms of fixed sets or tricky end conditions: thinking about things in terms of streams. We can think of our sum problem above as doing something (applying the + function) to an incoming stream of data (the items in our vector). The whole process just stops whenever the stream ends—when the vector has run out of new items. Dealing with streams of data is actually a really common problem in computing and even electronics in general. For example, anything to do with signal processing (like applying a filter to a sound being picked up by a microphone) is dealing with a stream of data. And because dealing with streams of data is such a pervasive problem, there are lots of clever algorithms that have been developed to tackle problems with streams. We’ll use one of these clever algorithms later, but for now I just want you to make this connection.

Streaming sums

We’ll get on to more clever algorithms soon, but for now I want to introduce a new concept. This is what’s known in some programming languages (e.g., Haskell and Rust) as a scan, but in R it goes by the name purrr::accumulate(). This is a variation on a reduction: an accumulation performs the same kind of “looping” operation as a reduction; however, unlike a reduction, the intermediate results are also kept. Why might this be useful? Let’s go back to what {BFDA} is doing.
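Before we do, here’s a quick throwaway contrast between the two (using the vector 1 to 5): a reduction returns only the final value, while an accumulation keeps every intermediate value along the way.

purrr::reduce(1:5, `+`)
# [1] 15
purrr::accumulate(1:5, `+`)
# [1]  1  3  6 10 15

Now, back to {BFDA}.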

It’s looping over a set of numbers, then increasing the set size, and then looping over the set again. For each loop it’s working out a mean—but for simplicity’s sake, we’ll say it’s working out a sum. So you have a sum of items 1 to N and then a sum of items 1 to N + step size (again, for simplicity we’ll say the step size is 1). We might write that out as follows:

numbers <- c(1, 2, 32, 423, 4, 8, 9)
cumulative_sum <- vector()
for (i in seq_along(numbers)) {
  running_sum <- sum(numbers[1:i]) # remember a sum is a loop
  cumulative_sum <- c(cumulative_sum, running_sum)
}

Unpacking this, we can see that we have the outer loop. This is explicitly coded as a for loop. But then inside this loop, we have another loop hidden inside that sum() function. So we have loops inside loops! This is more-or-less what the {BFDA} package is doing in one of its steps. We could simplify it down into a single loop. To do this, it would look like the following:

numbers <- c(1, 2, 32, 423, 4, 8, 9)
cumulative_sum <- vector()
looped_sum <- 0
for (n in numbers) {
  looped_sum <- looped_sum + n
  cumulative_sum <- c(cumulative_sum, looped_sum)
}
cumulative_sum
[1]   1   3  35 458 462 470 479

Although the second code chunk is computationally simpler, the code itself isn’t. In fact, the code is arguably more complex, and it’s certainly difficult to read.

But if we think in terms of functional-style code, we can actually write code that is easier to understand. Remember, as with our Reduce()/purrr::reduce() example above, the key to functional-style thinking was to separate out the what and the how. The what (the function we’re applying to our list of values) is still the + function. And the how is still taking an initial value, adding it to the next item in the list, and repeating the process. But where a reduction only gave you the final output, purrr::accumulate() gives you an output that contains all the intermediate values. Let’s see this in action.

numbers <- c(1, 2, 32, 423, 4, 8, 9)
cumulative_sum <- purrr::accumulate(.x = numbers, .f = `+`)
cumulative_sum
[1]   1   3  35 458 462 470 479

You can see that the syntax is almost identical to purrr::reduce(). We again have our list of values (.x) and our binary function (.f).

The similarity between a reduction and an accumulation is further highlighted when we look at how we’d solve the same problem using base R.

numbers <- c(1, 2, 32, 423, 4, 8, 9)
cumulative_sum <- Reduce(f = `+`, x = numbers, accumulate = TRUE)
cumulative_sum
[1]   1   3  35 458 462 470 479

To do it in base R we actually use the Reduce() function, but we just set the accumulate argument to TRUE.

We can see that with both of these solutions we end up with some very clean code that clearly separates out the what from the how. These versions are a lot easier to read than either of the looping versions. This is the power of functional-style code. It’s a different way of thinking through the problem you’re trying to solve.

The eagle-eyed among you might be thinking, “why don’t you just use the cumsum() function?” For this specific problem, the cumsum() function is actually the best choice.

numbers <- c(1, 2, 32, 423, 4, 8, 9)
cumulative_sum <- cumsum(numbers)
cumulative_sum
[1]   1   3  35 458 462 470 479

As you can see, the solution is very clean, and it’s also the fastest solution. The version using Reduce() is about 5 times slower. The version using purrr::accumulate() is about 130 times slower than cumsum(). However, the version using the loop inside the loop is about 1000 times slower than using cumsum(). So why not just use cumsum()? Well, with cumsum() we’re limited to just summing. But the versions using Reduce()/purrr::accumulate() are general purpose. We could replace + with any binary function.
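For example, here’s an illustrative swap of + for max, which turns the accumulation into a running maximum (the equivalent of base R’s cummax()):

purrr::accumulate(numbers, max)
# [1]   1   2  32 423 423 423 423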

Working out a cumulative mean

Our accumulate versions above allow us to use any binary function instead of just +. This means we could, for example, write a binary function that works out a cumulative mean. If you have the mean of N values and you want to add a new value and work out what the mean would be if you had averaged together all N + 1 values, that’s fairly trivial to do. You can do it as follows.

  1. If \(X_{\mathrm{mean}}\) is the mean of \(N\) values, then work out \(X_{\mathrm{sum}} = X_{\mathrm{mean}} \times N\)

  2. If \(Y\) is the new value, then work out \(Y_{\mathrm{sum}} = X_{\mathrm{sum}} + Y\)

  3. Finally work out the new mean as \(Y_{\mathrm{mean}} = \frac{Y_{\mathrm{sum}}}{N + 1}\)
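As a quick made-up check of the arithmetic: if the mean of \(N = 4\) values is 10 and the new value is 20, then \(X_{\mathrm{sum}} = 40\), \(Y_{\mathrm{sum}} = 60\), and the new mean is \(60 / 5 = 12\).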

To write a function that can do that, you might write something like the following:

cumulative_mean <- function(old_mean, n_values, new_value) {
  sum_value <- old_mean * n_values
  new_sum <- sum_value + new_value
  new_mean <- new_sum / (n_values + 1)
  return(new_mean)
}

But you’ll immediately notice that this is not a binary function. That is, it doesn’t take two arguments. It takes three! So what do we do? Well, we can package the first two arguments together in a vector/tuple. This turns the function into something like this:

cumulative_mean <- function(old_mean_and_n_values, new_value) {
  old_mean <- old_mean_and_n_values[1]
  n_values <- old_mean_and_n_values[2]
  sum_value <- old_mean * n_values
  new_sum <- sum_value + new_value
  new_mean <- new_sum / (n_values + 1)
  new_mean
}

This new function is an improvement, but remember: when using reductions/accumulations, the output of the function applied to one item is going to be the first argument when the function is applied to the next item. What does this mean in practical terms? It means our function above should not just output the new mean, but also output the new number of values. And it should output this as a vector, because the first argument of the cumulative_mean() function is a vector with two values. Amending the function, it now looks like the following:

cumulative_mean <- function(old_mean_and_n_values, new_value) {
  old_mean <- old_mean_and_n_values[1]
  n_values <- old_mean_and_n_values[2]
  sum_value <- old_mean * n_values
  new_sum <- sum_value + new_value
  new_n_values <- n_values + 1
  new_mean <- new_sum / (new_n_values)
  c(new_mean, new_n_values)
}

We can now use this function with Reduce() or purrr::accumulate() to work out a cumulative mean. It would look something like the following:

cummean_accumulate <- purrr::accumulate(
  numbers, cumulative_mean,
  .init = c(0, 0)
)
cummean_accumulate
[[1]]
[1] 0 0

[[2]]
[1] 1 1

[[3]]
[1] 1.5 2.0

[[4]]
[1] 11.66667  3.00000

[[5]]
[1] 114.5   4.0

[[6]]
[1] 92.4  5.0

[[7]]
[1] 78.33333  6.00000

[[8]]
[1] 68.42857  7.00000
cummean_reduce <- Reduce(
  cumulative_mean, numbers,
  accumulate = TRUE, init = c(0, 0)
)
cummean_reduce
[[1]]
[1] 0 0

[[2]]
[1] 1 1

[[3]]
[1] 1.5 2.0

[[4]]
[1] 11.66667  3.00000

[[5]]
[1] 114.5   4.0

[[6]]
[1] 92.4  5.0

[[7]]
[1] 78.33333  6.00000

[[8]]
[1] 68.42857  7.00000

You’ll notice we’re now passing a vector of two values as the initial value, because the first argument of our cumulative_mean() function must be a vector of two values. You’ll also notice that the output above looks a little messy. That’s because the output is a list of vectors of two elements. We probably just want the means. To make things easier, I’ve written a small helper function that will turn the output above into a data frame.

list_vec_to_dataframe <- function(x, col_names) {
  n_vectors <- length(x) # number of list elements; each becomes one row
  x |>
    unlist() |>
    matrix(ncol = n_vectors) |> # one column per list element...
    t() |> # ...then transpose so each list element is a row
    as.data.frame() |>
    setNames(col_names)
}
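Just to show what the helper does, here’s a throwaway example with made-up values; each element of the input list becomes one row of the resulting data frame.

list_vec_to_dataframe(list(c(1, 10), c(2, 20)), c("a", "b"))
#   a  b
# 1 1 10
# 2 2 20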

To just get a vector of means, we can use the above function as follows:

vec_of_means <- (purrr::accumulate(
  numbers,
  cumulative_mean,
  .init = c(0, 0)
) |>
  list_vec_to_dataframe(c("mean", "n")))$mean

vec_of_means
[1]   0.00000   1.00000   1.50000  11.66667 114.50000  92.40000  78.33333
[8]  68.42857

We could also write a function to perform a cumulative mean using a loop, and that might look something like the following:

cummean_loop <- function(numbers) {
  cum_mean <- vector()
  for (i in seq_along(numbers)) {
    running_mean <- mean(numbers[1:i])
    cum_mean <- c(cum_mean, running_mean)
  }
  c(0, cum_mean) # prepend 0 to match the .init value in the accumulate version
}

But remember, a mean relies on a sum, and a sum uses a loop. So the function above contains a loop within a loop. Not only that, but we’ll loop over some values many times. To see how this impacts performance, let’s benchmark it on a small set of numbers.

version             elapsed  relative
cummean_accumulate    0.646     6.212
cummean_loop          0.104     1.000
cummean_reduce        0.110     1.058
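For reference, a table along these lines can be produced with something like the sketch below. I’m assuming the {rbenchmark} package here purely for illustration (it isn’t necessarily what produced the numbers above), and the replication count is an arbitrary choice.

rbenchmark::benchmark(
  cummean_accumulate = purrr::accumulate(numbers, cumulative_mean, .init = c(0, 0)),
  cummean_reduce     = Reduce(cumulative_mean, numbers, accumulate = TRUE, init = c(0, 0)),
  cummean_loop       = cummean_loop(numbers),
  replications = 10000,
  columns = c("test", "elapsed", "relative")
)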

These benchmarks might seem a little surprising. We’re meant to be improving the performance of {BFDA} but that doesn’t seem to be the case. Instead, we seem to have made things worse! Why has this happened? With purrr::accumulate() there is some overhead when we call the function. This is a fixed or nearly fixed amount of time that will be added to the computation, so for a relatively small set of numbers this has a relatively large impact on compute time. But as we increase the size of the set you’ll notice something starts to happen. Below, I’ve plotted some benchmarks for different set sizes. I’ve picked set sizes from 10 to just over 300, because, in the examples on the {BFDA} webpage at least, Nmax values of around 300 are typical.

From this plot we can see that as set size increases, the loop version blows up in compute time. For the reduce/accumulate versions, compute time does not increase anywhere near as rapidly. Why? Because the loop version is looping over some numbers hundreds of times, while the reduce/accumulate versions only touch each number once. Thinking of the numbers as a stream instead of a set allowed us to come up with a clever solution for what to do “when we see a new number”. This is the power of functional-style programming. Yes, we get a performance benefit, but we get that performance benefit from changing how we think about the data we’re dealing with: as a stream, rather than a set.

Clever algorithms for streams

As I said earlier, problems that deal with streams of data are ubiquitous in computing and electronics. Anything to do with signal processing is dealing with a stream. Similarly, when computers had far less memory than they do today, programmers needed to treat data as streams, because computers couldn’t store all the data in memory as a set. One of these clever algorithms, which we’ll make use of to speed up {BFDA}, is known as Welford’s algorithm (see also Welford, 1962). Welford’s algorithm allows us to calculate the mean and the standard deviation of a stream of data. From the mean and standard deviation, we can then compute many other things. We could, for example, calculate the standard error, a t statistic, and even a Bayes factor value—all on a stream of data, touching each number only once.
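For reference, here are the core update rules of (standard) Welford’s algorithm: given a running count \(n\), a running mean \(\bar{x}_{n-1}\), and a running sum of squared deviations \(M_{2,n-1}\), a new value \(x_n\) is folded in as

\[
\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}, \qquad
M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n), \qquad
s_n^2 = \frac{M_{2,n}}{n - 1}.
\]

The count, the mean, and \(M_2\) are all the state that needs to be carried from one value to the next.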

The code chunk below shows an example of Welford’s algorithm. Note that I’ve added a few additional lines of code so that it also works out the standard deviation, the standard error, the Cohen’s d value, and the t statistic. These are all statistics that we’ll need for replicating the output of the {BFDA} package. Note that the three values that we need to propagate from one step to the next are the count and the mean (just like the running mean example above), but also the running sum of squared distances. These are the minimum that we need for Welford’s algorithm to work. However, we also output the other values because, why not!

welfords <- function(x, new_value) {
  count <- x[1]
  mean <- x[2]
  squared_distances <- x[3]
  count <- count + 1
  delta <- new_value - mean          # distance from the old mean
  new_mean <- mean + (delta / count) # update the running mean
  delta2 <- new_value - new_mean     # distance from the new mean
  squared_distances <- squared_distances + (delta * delta2)
  sd <- sqrt(squared_distances / (count - 1))
  se <- sd / sqrt(count)
  t <- new_mean / se
  d <- new_mean / sd
  c(count, new_mean, squared_distances, sd, se, d, t)
}

And this is how we’d use it on a set of numbers.

welfords_accumulate <- function(numbers) {
  purrr::accumulate(numbers, welfords, .init = c(0, 0, 0, 0, 0, 0, 0)) |>
    list_vec_to_dataframe(c("i", "mean", "sq_dist", "sd", "se", "d", "t")) |>
    dplyr::select(-i, -sq_dist)
}

Or

welfords_reduce <- function(numbers) {
  Reduce(welfords, numbers, init = c(0, 0, 0, 0, 0, 0, 0), accumulate = TRUE) |>
    list_vec_to_dataframe(c("i", "mean", "sq_dist", "sd", "se", "d", "t")) |>
    dplyr::select(-i, -sq_dist)
}

And now let’s take a look at what the looped version would look like—that is, the version that matches more-or-less what {BFDA} is doing:

welfords_loop <- function(numbers) {
  cumulative_t <- vector()
  cumulative_mean <- vector()
  cumulative_d <- vector()
  cumulative_sd <- vector()
  cumulative_se <- vector()
  for (i in seq_along(numbers)) {
    if (i > 1) {
      data <- numbers[1:i]
      m <- mean(data)
      s <- sd(data)
      n <- length(data)
      se <- s / sqrt(n)
      d <- m / s
      t <- unname(t.test(data)$stat) # better to replace this with m / se

      cumulative_t <- c(cumulative_t, t)
      cumulative_mean <- c(cumulative_mean, m)
      cumulative_sd <- c(cumulative_sd, s)
      cumulative_se <- c(cumulative_se, se)
      cumulative_d <- c(cumulative_d, d)
    } else {
      cumulative_t <- c(cumulative_t, NA)
      cumulative_mean <- c(cumulative_mean, NA)
      cumulative_sd <- c(cumulative_sd, NA)
      cumulative_se <- c(cumulative_se, NA)
      cumulative_d <- c(cumulative_d, NA)
    }
  }
  tibble::tibble(
    mean = cumulative_mean,
    sd = cumulative_sd,
    se = cumulative_se,
    d = cumulative_d,
    t = cumulative_t
  )
}
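Before benchmarking, it might be reassuring to check that the streaming and loop versions agree. Here’s a quick sketch of such a check (the random input and the variable names are mine):

some_numbers <- rnorm(100)
streamed <- welfords_accumulate(some_numbers)
looped <- welfords_loop(some_numbers)
# streamed has an extra leading row (the .init value), and its n = 1 row is
# NaN where the loop version records NA, so compare from n = 2 onwards
all.equal(streamed$mean[-(1:2)], looped$mean[-1])
# expected: TRUE (any differences are at floating-point rounding level)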

The loop version has a lot more going on, so let’s take a look at some benchmarks. Again, we’ll benchmark for different set sizes.

Looking at the plot above we can see that we’ve got a staggering performance boost. That completes the first part of our reworking of {BFDA}. But notice that we’ve only dealt with one set of numbers. What {BFDA} does is perform the process above many times (maybe 1000 times or more) on different sets of numbers. {BFDA} does it with a loop, but we’re going to do it a different way. This post is already far too long, though, so we’ll leave that for next time, when we’ll tackle map. In the next post I’ll show you how map works and, importantly, how map makes it easier to work with chunks of data in parallel. In the end, we’ll write a single pipeline that performs all the work done by the {BFDA} simulation code with its many, sometimes nested, loops.

References

Colling, L. J., & Szűcs, D. (2021). Statistical inference and the replication crisis. Review of Philosophy and Psychology, 12, 121–147. https://doi.org/10.1007/s13164-018-0421-4
Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420. https://doi.org/10.1080/00401706.1962.10490022

Footnotes

  1. It is possible in certain circumstances to modify the planned sample size for a study after data collection has already started. However, doing so requires special statistical procedures, and these procedures are not common in psychological science.

  2. It’s outside the scope of the current post to explain why optional stopping is permissible when using Bayes factors; however, this might be the subject of a future post.

  3. This scale uses arbitrary units and is not expressed in terms of raw midi-chlorian concentrations. For example, it is known that the midi-chlorian concentration in Anakin Skywalker exceeded 20,000.

  4. A researcher might also be interested in knowing how many participants to test so that there is a good chance of finding strong evidence against this hypothesis if it is instead false.
