welfords_accumulate <- function(numbers) {
  purrr::accumulate(numbers, welfords, .init = c(0, 0, 0, 0, 0, 0, 0)) |>
    list_vec_to_dataframe(c("i", "mean", "sq_dist", "sd", "se", "d", "t")) |>
    dplyr::select(-sq_dist)
}
Part 2: maps and parallel processing
July 7, 2022
In Part 1 of this series, I showed how functional-style code allows you to separate the what and the how of your code, and how doing so opens up new ways of solving programming problems. Part 1 covered the reduce / accumulate family of higher-order functions, and in the current post I cover the apply / map family of higher-order functions. Finally, I show how, once you switch from loops to the apply / map family of higher-order functions, parallel programming becomes a trivial matter of slightly modifying a function name.
If you haven’t read Part 1 of this series then I suggest that you go back and read it. In that post I cover the basics of thinking in functional style code, and I cover the reduce/accumulate family of higher-order functions.1 This family of higher-order functions is useful when you want to “loop” over some data to produce some final result, either with (accumulate) or without (reduce) keeping the intermediate results. Both reduce and accumulate allow us to use the intermediate results at step n - 1 for the computation at step n, and whenever we want to do this, using the reduce/accumulate family makes sense. But sometimes we are faced with a different scenario, where we want to “loop” over some data and do something to each element, but what we do to one element has no bearing on what we do to any other element. That is, the computation on element n does not depend on the output of some computation on element n - 1. In this scenario, each element can be treated totally independently. For this second scenario, we’ll be using the map/apply family of higher-order functions.
The {BFDA} package
In Part 1 I explained that the motivation of this series of posts was to apply functional style programming to the problem of speeding up the {BFDA} package. As a refresher on what the {BFDA} package is doing, check out the relevant section of Part 1.
In Part 1 we solved the core of the problem in the {BFDA} package. The core of the problem meant performing the following steps:
For a sample of Nmax participants, take a sub-sample of Ni participants.
Work out the mean of the sub-sample
Work out the standard deviation of the sub-sample
Compute the effect size, and other statistics, of the sub-sample
Compute the Bayes factor of the sub-sample
Once this is done, the sampling process will repeat, but a sub-sample of Ni + step size participants will now be taken from the sample of Nmax participants.
To do this, we used Welford’s algorithm to work out the mean, standard deviation, effect size, and t statistic on a stream of random numbers (our sample). We then passed the function for Welford’s algorithm and the stream of numbers to the higher-order function purrr::accumulate or Reduce. The purrr::accumulate solution is the welfords_accumulate() function shown at the top of this post.
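The Reduce solution is its base R counterpart. A sketch of it, assuming the same welfords() and list_vec_to_dataframe() helpers described below, might look like this:
welfords_reduce <- function(numbers) {
  # sketch: the base R counterpart of welfords_accumulate() above
  Reduce(welfords, numbers, init = c(0, 0, 0, 0, 0, 0, 0), accumulate = TRUE) |>
    list_vec_to_dataframe(c("i", "mean", "sq_dist", "sd", "se", "d", "t")) |>
    subset(x = _, select = c("i", "mean", "sd", "se", "d", "t"))
}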
Where welfords() is defined as follows:
welfords <- function(x, new_value) {
  # unpack the accumulator: running count, mean, and sum of squared distances
  count <- x[1]
  mean <- x[2]
  squared_distances <- x[3]
  # Welford's update of the running mean and sum of squared distances
  count <- count + 1
  delta <- new_value - mean
  new_mean <- mean + (delta / count)
  delta2 <- new_value - new_mean
  squared_distances <- squared_distances + (delta * delta2)
  # derived statistics for the current sub-sample
  sd <- sqrt(squared_distances / (count - 1))
  se <- sd / sqrt(count)
  t <- new_mean / se # one-sample t statistic
  d <- new_mean / sd # effect size (Cohen's d)
  c(count, new_mean, squared_distances, sd, se, d, t)
}
and list_vec_to_dataframe() is defined as below:
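The exact definition of this helper isn’t critical: it just needs to take a list of equal-length numeric vectors plus a vector of column names and return a dataframe. A minimal sketch (the internals here are an assumption) would be:
list_vec_to_dataframe <- function(list_vec, col_names) {
  # sketch: bind the vectors into rows, then name the columns
  df <- as.data.frame(do.call(rbind, list_vec))
  names(df) <- col_names
  df
}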
The code above performs the calculations we need on just one list of random numbers. But with data simulations, we can’t just simulate one sample of numbers. Rather, we need to repeat the whole process many times with a different set of random numbers each time. Each set of simulated numbers can be processed independently, because the outcome of one simulation has no bearing on the outcome of another simulation. With this kind of scenario it makes sense to use the map/apply family of higher-order functions. And these higher-order functions are the topic of this post.
Before we start solving the problem with map/apply let’s think about how we might go about solving it by using a loop. We know that the functions above, welfords_accumulate() and welfords_reduce(), will perform all the calculations that we need. To perform the full set of simulations we just need to run those functions on many different sets of random numbers. For now, we’ll just do it 10 times on 10 different sets of numbers. That might look something like this:
outcomes <- list()
for (i in 1:10) { # 10 sets of random numbers
  numbers <- rnorm(15, 0.5, 1) # simulate 15 random numbers
  outcome <- welfords_reduce(numbers) # process the set of numbers
  outcome$group <- i # label the set
  # store the result for this set in the list
  outcomes[[i]] <- outcome
}
If we wanted the output in a single dataframe we might write something like the example below instead.
outcomes <- data.frame()
for (i in 1:10) {
  numbers <- rnorm(15, 0.5, 1) # simulate 15 random numbers
  outcome <- welfords_reduce(numbers) # process the set of numbers
  outcome$group <- i # label the set
  # store everything in a single dataframe by row-binding with previous results
  outcomes <- rbind(outcomes, outcome)
}
This will work fine without any issues, and many programmers are happy with these constructs. The issue we might have with them, though, is that we’re mixing parts of the code that might be better understood if they were kept separate. We construct a container outside of the loop (to hold our results), we then loop, doing two things inside the loop, and then, because we want to store all the results in a single object, we have to direct the output of a function called inside the loop to an object created outside the loop. In this example it’s fairly easy to keep things straight, but as things get more complex, keeping track of what lives inside the loop and what lives outside it gets trickier, and this is particularly true when we start thinking about running the code inside the loop in parallel. What this construction demonstrates is that we’re mixing the what and the how parts of our solution.
As we covered in Part 1, functional style code allows us to separate the what (running the welfords_reduce() function on a set of numbers) and the how (by looping over 10 sets of numbers). Following this separation of the what and the how we might break down our task as follows:
Generate 10 sets of random numbers
Feed each set of numbers into the function welfords_reduce() or welfords_accumulate()
For step 1 we just need to generate 10 sets of random numbers. The easy way to do this is to generate one big set of random numbers and then split it into 10. There are a couple of ways to do this in R. The tidyverse version might look something like the following:
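A sketch, using the same tibble construction that appears in the full pipeline later in this post:
set.seed(123)
tib <- tibble::tibble(
  group = sort(rep(seq(1:10), 15)), # 10 groups of 15 values each
  value = rnorm(15 * 10, mean = 0.5, sd = 1)
) |>
  dplyr::group_by(group) |>
  dplyr::group_split() # list of 10 tibbles, one per group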
The output of this would be a list of 10 tibbles where each tibble has 15 rows. The group column would just have the number 1 for the first tibble, 2 for the second, 3 for the third, and so on.
The base R version might instead look something like this:
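Again as a sketch, mirroring the data.frame and split(~group) construction used in the full base R pipeline later in this post:
set.seed(123)
df <- data.frame(
  group = sort(rep(seq(1:10), 15)), # 10 groups of 15 values each
  value = rnorm(15 * 10, mean = 0.5, sd = 1)
) |>
  split(~group) # list of 10 dataframes, one per group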
Now that we have the list of 10 sets of numbers we just need to feed this list to the welfords_accumulate() or welfords_reduce() functions. Before we give these functions the entire list we’ll just try giving them the first element of the list. But before we can do this, we’ll need to make a slight change to our two functions. Both of the functions expect a vector of numbers, but we’re going to give them a tibble / dataframe. So we’ll modify the functions so that they expect tibbles / dataframes and just extract the appropriate column from the tibble / dataframe.
First, the tidyverse version:
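A sketch of the modified function, assuming it mirrors the base R version shown below (extract the value column, and add the group label back at the end):
welfords_accumulate <- function(df) {
  # sketch: tibble-in version of the function from the top of the post
  output <- purrr::accumulate(df$value, welfords, .init = c(0, 0, 0, 0, 0, 0, 0)) |>
    list_vec_to_dataframe(c("i", "mean", "sq_dist", "sd", "se", "d", "t")) |>
    dplyr::select(-sq_dist)
  output$group <- unique(df$group) # add back group label
  output
}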
And now the base R version:
welfords_reduce <- function(df) {
output <- Reduce(welfords,
df$value,
init = c(0, 0, 0, 0, 0, 0, 0), accumulate = TRUE
) |>
list_vec_to_dataframe(c("i", "mean", "sq_dist", "sd", "se", "d", "t")) |>
subset(x = _, select = c("i", "mean", "sd", "se", "d", "t"))
output$group <- unique(df$group) # add back group label
output
}
Now let’s try them out. We’ll just feed the first element of the list into each function. First the tidyverse version with the list of tibbles:
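That is, a call along these lines (with tib being the list of tibbles created above):
welfords_accumulate(tib[[1]]) # first element of the list of tibbles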
i mean sd se d t group
1 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1
2 1 0.3680832 NaN NaN NaN NaN 1
3 2 0.5832444 0.3042839 0.2151612 1.9167772 2.7107323 1
4 3 0.3158131 0.5107375 0.2948745 0.6183471 1.0710086 1
5 4 0.1284216 0.5606819 0.2803410 0.2290453 0.4580907 1
6 5 -0.1894330 0.8607733 0.3849495 -0.2200730 -0.4920982 1
7 6 0.1495126 1.1322762 0.4622498 0.1320461 0.3234455 1
8 7 0.2860859 1.0949613 0.4138565 0.2612749 0.6912684 1
9 8 0.2033086 1.0404229 0.3678450 0.1954096 0.5527018 1
10 9 0.4358236 1.1973884 0.3991295 0.3639785 1.0919355 1
11 10 0.3988977 1.1349317 0.3588969 0.3514729 1.1114548 1
12 11 0.4620389 1.0968673 0.3307179 0.4212350 1.3970784 1
13 12 0.3584237 1.1057020 0.3191887 0.3241594 1.1229211 1
14 13 0.4756519 1.1398892 0.3161484 0.4172791 1.5045211 1
15 14 0.4469950 1.1004066 0.2940960 0.4062090 1.5198948 1
16 15 0.5399715 1.1198534 0.2891449 0.4821806 1.8674773 1
And now the base R version:
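That is, something like (with df being the list of dataframes created above):
welfords_reduce(df[[1]]) # first element of the list of dataframes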
i mean sd se d t group
1 0 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 1
2 1 0.6757752 NaN NaN NaN NaN 1
3 2 1.1368009 0.6519888 0.4610257 1.7435897 2.465808 1
4 3 1.0238565 0.5008134 0.2891447 2.0443872 3.540982 1
5 4 0.7004197 0.7652809 0.3826404 0.9152452 1.830490 1
6 5 0.7645629 0.6780950 0.3032533 1.1275158 2.521202 1
7 6 0.7908571 0.6099169 0.2489975 1.2966637 3.176164 1
8 7 0.7780616 0.5578037 0.2108300 1.3948665 3.690470 1
9 8 0.7762601 0.5164510 0.1825930 1.5030664 4.251314 1
10 9 0.5178709 0.9133818 0.3044606 0.5669818 1.700946 1
11 10 0.5160371 0.8611641 0.2723240 0.5992319 1.894938 1
12 11 0.5417464 0.8214097 0.2476644 0.6595324 2.187422 1
13 12 0.4975652 0.7979975 0.2303620 0.6235173 2.159927 1
14 13 0.4561626 0.7784714 0.2159091 0.5859722 2.112753 1
15 14 0.4845721 0.7554471 0.2019017 0.6414375 2.400039 1
16 15 0.4502547 0.7400009 0.1910674 0.6084516 2.356523 1
Both give us the output we expect. Before we continue, let’s just think about what we’ve done here. We’ve applied the function to the first element of our list. What we want to do is apply the function to every element of the list. To do this, we’ll need a higher-order function that accepts a function (welfords_reduce() or welfords_accumulate() in our case) and a list (tib or df in our case). In base R this is the higher-order function called lapply() and in tidyverse land it’s the function purrr::map().2
Let’s take a look at them in action. First the tidyverse version:
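A sketch of the purrr::map() call, assuming tib is the list of tibbles from above:
results <- purrr::map(tib, welfords_accumulate) # one output tibble per input tibble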
And now the base R version:
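And the equivalent lapply() sketch, assuming df is the list of dataframes from above:
results <- lapply(df, welfords_reduce) # one output dataframe per input dataframe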
We can put it all together in a single pipeline if we want. And for both versions, we’ll add a second step where we transform the list of dataframes / tibbles into one big tibble or dataframe by adding a reduction.
In tidyverse:
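A sequential sketch of the full pipeline (the same pipeline as the parallel version further down, but with purrr::map() in place of furrr::future_map()):
set.seed(123)
tib <- tibble::tibble(
  group = sort(rep(seq(1:10), 15)),
  value = rnorm(15 * 10, mean = 0.5, sd = 1)
) |>
  dplyr::group_by(group) |>
  dplyr::group_split() |>
  purrr::map(welfords_accumulate) |>
  purrr::reduce(\(x, y) rbind(x, y)) # lambda shorthand requires R 4.1+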
Or in base R:
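And a sequential sketch of the base R pipeline (again, the same as the parallel version further down, but with lapply() in place of parallel::mclapply()):
set.seed(123)
df <- data.frame(
  group = sort(rep(seq(1:10), 15)),
  value = rnorm(15 * 10, mean = 0.5, sd = 1)
) |>
  split(~group) |>
  lapply(welfords_reduce) |>
  Reduce(function(x, y) rbind(x, y), x = _) # pipe placeholder _ requires R 4.2+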
In the code above, each element in our list of dataframes/tibbles can be processed independently. The output of one call to welfords_reduce / welfords_accumulate has no bearing on another call to the same function. Currently, however, we’re processing each element in sequence. But there’s no need to do this sequentially. Instead, we could process the elements in parallel. There are a handful of languages that make parallel processing easy. In a subsequent blog post we’ll see that Rust is one example of such a language. However, R is also such an example, with the built-in {parallel} package giving us access to parallel versions of the map/apply functions. In tidyverse land there’s the {furrr} package, which gives you drop-in replacements for the map functions in {purrr}.
Let’s see how to turn our sequential function calls into parallel function calls. Before we can go parallel, we need a quick digression into parallel processing. For our purposes, there are two ways to perform operations in parallel. The first is by forking. This means that each of the parallel operations is run in a separate process that has been forked from the main process. In general, Unix-like (that is, Linux and MacOS) operating systems will support forking. Windows, however, does not support forking. On Windows, the alternative is to run separate R sessions instead, and have these sessions communicate with each other. In general, forking will give better performance than the multisession approach, because of lower overheads. However, the multisession approach has the advantage of greater compatibility. With that out of the way, we’ll get to the code.
First, we’ll do the tidyverse version because it’s the most straightforward. The parallel version looks like this:
# uncomment the relevant line to choose multisession or forking (aka multicore)
# future::plan("multisession", workers = parallel::detectCores())
future::plan("multicore", workers = parallel::detectCores())
set.seed(123)
tib <- tibble::tibble(
  group = sort(rep(seq(1:10), 15)),
  value = rnorm(15 * 10, mean = 0.5, sd = 1)
) |>
  dplyr::group_by(group) |>
  dplyr::group_split() |>
  furrr::future_map(welfords_accumulate) |> # replace furrr with purrr to run sequentially
  purrr::reduce(\(x, y) rbind(x, y)) # syntax for lambdas in R v4.1+
For the base R version I’ll also use the forking approach. To do this, we’ll just replace lapply() with parallel::mclapply(). We just need to include one additional argument where we specify the number of cores we want to use. The code looks like the following:
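A sketch, identical to the sequential base R pipeline above except for the parallel::mclapply() call:
set.seed(123)
df <- data.frame(
  group = sort(rep(seq(1:10), 15)),
  value = rnorm(15 * 10, mean = 0.5, sd = 1)
) |>
  split(~group) |>
  parallel::mclapply(welfords_reduce, mc.cores = parallel::detectCores()) |> # forked workers
  Reduce(function(x, y) rbind(x, y), x = _)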
Up until now, we haven’t actually computed any Bayes factors. The whole point of the {BFDA} package is that we compute a Bayes factor for each sub-sample; so far we’ve only calculated the mean, sd, se, d, and t. The reason we’ve left calculating the Bayes factors until the very end is that this is probably the most computationally intensive step. This means that we only want to compute as many as we need and not a single one more. To decide which sub-samples to calculate a Bayes factor on we need to decide on our minimum sample size and the step size. For example, if our minimum sample size was 10 and the step size was 5 then we’d calculate Bayes factors for the sub-samples of size 10, 15, 20, 25 and so on. To do this, the first step is to filter the tibbles / dataframes that we created in the previous step so that we only keep the rows we want. The column i tells us the size of that sub-sample, so we can use that for filtering.
To do this the tidyverse way is very simple. We could just use dplyr::filter(). It would look something like this.
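Assuming a minimum sample size of 10 and a step size of 5 (and tib being the combined tibble from the pipeline above), the filter is just:
tib |>
  dplyr::filter(i >= 10 & (i %% 5) == 0) # keep n = 10, 15, 20, ...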
i mean sd se d t group
1 10 0.5746256 0.9537841 0.3016130 0.6024693 1.9051753 1
2 15 0.6523843 0.8453394 0.2182657 0.7717425 2.9889459 1
3 10 0.1880977 1.0652856 0.3368729 0.1765702 0.5583641 2
4 15 0.2534081 1.0925713 0.2821007 0.2319374 0.8982897 2
5 10 0.8220446 0.5273024 0.1667477 1.5589623 4.9298716 3
6 15 0.7952895 0.8669419 0.2238434 0.9173504 3.5528828 3
7 10 0.5028680 0.6869852 0.2172438 0.7319925 2.3147634 4
8 15 0.5613871 0.8147373 0.2103643 0.6890407 2.6686430 4
9 10 0.6230837 0.9363923 0.2961132 0.6654088 2.1042075 5
10 15 0.3692777 1.0470315 0.2703424 0.3526902 1.3659632 5
11 10 0.5006767 0.6168828 0.1950755 0.8116237 2.5665796 6
12 15 0.6795631 0.6475100 0.1671864 1.0495020 4.0647037 6
13 10 0.9370942 1.0688850 0.3380111 0.8767025 2.7723767 7
14 15 0.6581044 0.9811103 0.2533216 0.6707752 2.5979010 7
15 10 0.1919476 0.8838902 0.2795106 0.2171622 0.6867273 8
16 15 0.1541178 0.7777717 0.2008198 0.1981529 0.7674430 8
17 10 0.3895617 0.8105153 0.2563075 0.4806346 1.5199001 9
18 15 0.3904774 0.9464620 0.2443754 0.4125653 1.5978586 9
19 10 0.1627094 1.3517393 0.4274575 0.1203704 0.3806446 10
20 15 0.2423618 1.3508068 0.3487768 0.1794200 0.6948908 10
The base R version is a little bit more complex, because the base Filter() function expects a list and a function that returns a boolean. This means that Filter() is another example of a higher-order function.
As mentioned above, Filter() expects a list, but we have a dataframe. Luckily, I’ve written a helper function that can turn a dataframe into a list where every element of the list is a row of the dataframe.
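The details of this helper aren’t important; a minimal sketch (the internals here are an assumption) would be:
dataframe_to_list_vec <- function(df) {
  # sketch: one numeric vector per row of the dataframe
  lapply(seq_len(nrow(df)), function(row) as.numeric(df[row, ]))
}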
Now we just need a function that will perform the filtering. This function will take a vector (a row of the dataframe) and it must return a boolean. The first element of the vector is what was previously in the column labelled i, so this is the value that we’ll look at in deciding to return TRUE or FALSE. To do this, we might write something like the following:
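For a minimum sample size of 10 and a step size of 5, a sketch of the filtering function (matching the select_items() helper used in the final solution below) would be:
select_items <- function(x) {
  # x is a row of the dataframe as a numeric vector; x[1] is the sub-sample size i
  if (x[1] >= 10 & (x[1] %% 5) == 0) {
    return(TRUE)
  }
  FALSE
}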
Putting it all together, it’ll look like this:
cols <- names(df)
df |>
dataframe_to_list_vec() |>
Filter(select_items, x = _) |>
list_vec_to_dataframe(col_names = cols)
i mean sd se d t group
1 10 0.5746256 0.9537841 0.3016130 0.6024693 1.9051753 1
2 15 0.6523843 0.8453394 0.2182657 0.7717425 2.9889459 1
3 10 0.1880977 1.0652856 0.3368729 0.1765702 0.5583641 2
4 15 0.2534081 1.0925713 0.2821007 0.2319374 0.8982897 2
5 10 0.8220446 0.5273024 0.1667477 1.5589623 4.9298716 3
6 15 0.7952895 0.8669419 0.2238434 0.9173504 3.5528828 3
7 10 0.5028680 0.6869852 0.2172438 0.7319925 2.3147634 4
8 15 0.5613871 0.8147373 0.2103643 0.6890407 2.6686430 4
9 10 0.6230837 0.9363923 0.2961132 0.6654088 2.1042075 5
10 15 0.3692777 1.0470315 0.2703424 0.3526902 1.3659632 5
11 10 0.5006767 0.6168828 0.1950755 0.8116237 2.5665796 6
12 15 0.6795631 0.6475100 0.1671864 1.0495020 4.0647037 6
13 10 0.9370942 1.0688850 0.3380111 0.8767025 2.7723767 7
14 15 0.6581044 0.9811103 0.2533216 0.6707752 2.5979010 7
15 10 0.1919476 0.8838902 0.2795106 0.2171622 0.6867273 8
16 15 0.1541178 0.7777717 0.2008198 0.1981529 0.7674430 8
17 10 0.3895617 0.8105153 0.2563075 0.4806346 1.5199001 9
18 15 0.3904774 0.9464620 0.2443754 0.4125653 1.5978586 9
19 10 0.1627094 1.3517393 0.4274575 0.1203704 0.3806446 10
20 15 0.2423618 1.3508068 0.3487768 0.1794200 0.6948908 10
Now that we have our nicely filtered dataframes / tibbles, we can go ahead and add the Bayes factor values. To do this, we’ll use the map / apply family of higher-order functions again.
There are actually many ways to write this solution, particularly for the tidyverse version. For this tidyverse version we’ll use pmap_df() rather than the standard map(). Using purrr::pmap_df() means that we’ll be able to pass each row of our tibble to a function that will calculate the Bayes factor, and we’ll be able to use the column names as named arguments to this function (this function will just be a lambda expression). The output of this will also be a tibble, which is exactly what we want. Putting it all together, it might look something like this (note that we could easily switch out purrr::pmap_df() for furrr::future_pmap_dfr() if we wanted to do things in parallel, and this is what I’ll do for the final solution below):
tib |>
dplyr::filter(i >= 10 & (i %% 5) == 0) |>
purrr::pmap_df(function(i, mean, sd, se, d, t, group) {
tibble::tibble(
i = i,
mean = mean,
sd = sd,
se = se,
d = d,
t = t,
group = group,
logBF = BayesFactor::ttest.tstat(
t = t,
n1 = i,
rscale = 0.707,
simple = TRUE
) |> log()
)
})
# A tibble: 20 × 8
i mean sd se d t group logBF
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 10 0.575 0.954 0.302 0.602 1.91 1 0.137
2 15 0.652 0.845 0.218 0.772 2.99 1 1.74
3 10 0.188 1.07 0.337 0.177 0.558 2 -1.04
4 15 0.253 1.09 0.282 0.232 0.898 2 -0.990
5 10 0.822 0.527 0.167 1.56 4.93 3 3.88
6 15 0.795 0.867 0.224 0.917 3.55 3 2.67
7 10 0.503 0.687 0.217 0.732 2.31 4 0.632
8 15 0.561 0.815 0.210 0.689 2.67 4 1.22
9 10 0.623 0.936 0.296 0.665 2.10 5 0.374
10 15 0.369 1.05 0.270 0.353 1.37 5 -0.560
11 10 0.501 0.617 0.195 0.812 2.57 6 0.949
12 15 0.680 0.648 0.167 1.05 4.06 6 3.53
13 10 0.937 1.07 0.338 0.877 2.77 7 1.21
14 15 0.658 0.981 0.253 0.671 2.60 7 1.11
15 10 0.192 0.884 0.280 0.217 0.687 8 -0.977
16 15 0.154 0.778 0.201 0.198 0.767 8 -1.08
17 10 0.390 0.811 0.256 0.481 1.52 9 -0.289
18 15 0.390 0.946 0.244 0.413 1.60 9 -0.296
19 10 0.163 1.35 0.427 0.120 0.381 10 -1.11
20 15 0.242 1.35 0.349 0.179 0.695 10 -1.13
Now for the base R version. Here we’ll first write a separate function that will perform the relevant calculations for us. This function will take a vector of numbers (a row of our dataframe) and output another vector of numbers.
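A sketch of such a function, here hard-coding the default rscale of 0.707 and simply appending the log Bayes factor to the incoming row:
bf_func <- function(df_row) {
  # df_row is c(i, mean, sd, se, d, t, group); i is element 1, t is element 6
  i <- df_row[1]
  t <- df_row[6]
  logBF <- BayesFactor::ttest.tstat(
    t = t,
    n1 = i,
    rscale = 0.707,
    simple = TRUE
  ) |> log()
  c(df_row, logBF) # append the log Bayes factor to the row
}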
Putting it all together, it would look something like this:
cols <- names(df)
df |>
dataframe_to_list_vec() |>
Filter(select_items, x = _) |>
lapply(bf_func) |>
list_vec_to_dataframe(col_names = c(cols, "logBF"))
i mean sd se d t group logBF
1 10 0.5746256 0.9537841 0.3016130 0.6024693 1.9051753 1 0.1371568
2 15 0.6523843 0.8453394 0.2182657 0.7717425 2.9889459 1 1.7370969
3 10 0.1880977 1.0652856 0.3368729 0.1765702 0.5583641 2 -1.0426135
4 15 0.2534081 1.0925713 0.2821007 0.2319374 0.8982897 2 -0.9895525
5 10 0.8220446 0.5273024 0.1667477 1.5589623 4.9298716 3 3.8824949
6 15 0.7952895 0.8669419 0.2238434 0.9173504 3.5528828 3 2.6710526
7 10 0.5028680 0.6869852 0.2172438 0.7319925 2.3147634 4 0.6320565
8 15 0.5613871 0.8147373 0.2103643 0.6890407 2.6686430 4 1.2228219
9 10 0.6230837 0.9363923 0.2961132 0.6654088 2.1042075 5 0.3736238
10 15 0.3692777 1.0470315 0.2703424 0.3526902 1.3659632 5 -0.5599658
11 10 0.5006767 0.6168828 0.1950755 0.8116237 2.5665796 6 0.9487774
12 15 0.6795631 0.6475100 0.1671864 1.0495020 4.0647037 6 3.5292408
13 10 0.9370942 1.0688850 0.3380111 0.8767025 2.7723767 7 1.2113699
14 15 0.6581044 0.9811103 0.2533216 0.6707752 2.5979010 7 1.1117285
15 10 0.1919476 0.8838902 0.2795106 0.2171622 0.6867273 8 -0.9765229
16 15 0.1541178 0.7777717 0.2008198 0.1981529 0.7674430 8 -1.0817560
17 10 0.3895617 0.8105153 0.2563075 0.4806346 1.5199001 9 -0.2892276
18 15 0.3904774 0.9464620 0.2443754 0.4125653 1.5978586 9 -0.2961170
19 10 0.1627094 1.3517393 0.4274575 0.1203704 0.3806446 10 -1.1128791
20 15 0.2423618 1.3508068 0.3487768 0.1794200 0.6948908 10 -1.1271358
We’ve now completed all the hard parts. The final step is just to combine everything together in a nice wrapper function so that it works the same way as the BFDA::BFDA.sim() function.
The tidyverse version would look something like this (note that everything that can run in parallel now runs in parallel):
BFDA_sim_tidy <- function(expected.ES, type, design,
prior, n.min, n.max,
alternative, boundary, B,
verbose, cores, stepsize, seed) {
set.seed(seed)
def_alternative <- function(alternative) {
if (alternative == "greater") {
return(c(0, Inf))
} else if (alternative == "less") {
return(c(-Inf, 0))
} else {
return(c(-Inf, Inf))
}
}
future::plan("multicore", workers = cores)
sim <- tibble::tibble(
group = sort(rep(seq(1:B), n.max)),
value = rnorm(n.max * B, mean = expected.ES, sd = 1)
) |>
dplyr::group_by(group) |>
dplyr::group_split() |>
furrr::future_map(welfords_accumulate) |>
purrr::reduce(\(x, y) rbind(x, y)) |>
dplyr::filter(i >= n.min & (i %% stepsize) == 0) |>
furrr::future_pmap_dfr(function(i, mean, sd, se, d, t, group) {
tibble::tibble(
id = group,
true.ES = expected.ES,
n = i,
logBF = BayesFactor::ttest.tstat(
t = t,
n1 = i,
rscale = prior[[2]]$prior.scale,
simple = TRUE,
nullInterval = def_alternative(alternative)
) |> log(),
emp.ES = d,
statistic = t,
p.value = ifelse(alternative == "two.sided",
2 * pt(q = abs(t), df = i - 1, lower.tail = F),
pt(q = t, df = i - 1, lower.tail = TRUE)
)
)
}, .options = furrr::furrr_options(seed = NULL))
settings <- list(
n.min = n.min,
n.max = n.max,
design = "sequential",
prior = prior,
boundary = c(0, Inf),
alternative = alternative,
type = type,
options.sample = list(),
extra = list(),
packageVersion = "0.5.0"
)
output <- list(sim = sim, settings = settings)
class(output) <- "BFDA"
return(output)
}
And now the base R version. This version is more verbose, because it requires more helper functions, but it would look something like this:
BFDA_sim_base <- function(expected.ES, type, design,
prior, n.min, n.max,
alternative, boundary, B,
verbose, cores, stepsize, seed) {
select_items <- function(x) {
if (x[1] >= n.min & (x[1] %% stepsize) == 0) {
return(TRUE)
}
FALSE
}
def_alternative <- function(alternative) {
if (alternative == "greater") {
return(c(0, Inf))
} else if (alternative == "less") {
return(c(-Inf, 0))
} else {
return(c(-Inf, Inf))
}
}
bf_func <- function(df_row) {
i <- df_row[1]
mean <- df_row[2]
sd <- df_row[3]
se <- df_row[4]
d <- df_row[5]
t <- df_row[6]
group <- df_row[7]
logBf <- BayesFactor::ttest.tstat(
t = t,
n1 = i,
rscale = prior[[2]]$prior.scale,
simple = TRUE,
nullInterval = def_alternative(alternative)
) |>
log()
p_value <- ifelse(alternative == "two.sided",
2 * pt(q = abs(t), df = i - 1, lower.tail = F),
pt(q = t, df = i - 1, lower.tail = TRUE)
)
c(i, mean, sd, se, d, t, group, logBf, p_value)
}
set.seed(seed)
df <- data.frame(
group = sort(rep(seq(1:B), n.max)),
value = rnorm(n.max * B, mean = expected.ES, sd = 1)
) |>
split(~group) |>
parallel::mclapply(welfords_reduce, mc.cores = cores) |>
Reduce(function(x, y) rbind(x, y), x = _) |>
dataframe_to_list_vec() |>
Filter(select_items, x = _) |>
parallel::mclapply(bf_func, mc.cores = cores) |>
list_vec_to_dataframe(col_names = c("i", "mean", "sd", "se", "d", "t", "group", "logBF", "p.value"))
sim <- data.frame(
id = df$group,
true.ES = expected.ES,
n = df$i,
logBF = df$logBF,
emp.ES = df$d,
statistic = df$t,
p.value = df$p.value
)
settings <- list(
n.min = n.min,
n.max = n.max,
design = "sequential",
prior = prior,
boundary = c(0, Inf),
alternative = alternative,
type = type,
options.sample = list(),
extra = list(),
packageVersion = "0.5.0"
)
output <- list(sim = sim, settings = settings)
class(output) <- "BFDA"
return(output)
}
Note that both of these functions only cover a subset of the models that {BFDA} is capable of simulating. My aim here is not to rewrite {BFDA} in its entirety, but rather just to show how it could be re-written.
Now that we’ve written the new functions, we can try them out. First, we’ll run the BFDA.sim() function from the {BFDA} package.
sim.H1 <- BFDA.sim(
expected.ES = 0.4,
prior = list(
"Cauchy",
list(prior.location = 0, prior.scale = sqrt(2) / 2)
),
n.min = 50,
stepsize = 5,
n.max = 300,
type = "t.paired",
design = "sequential",
alternative = "two.sided",
B = 1000,
cores = 12,
verbose = FALSE,
seed = 123
)
Now we’ll run the tidyverse version:
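A sketch of the call, mirroring the arguments passed to BFDA.sim() above (the object name sim.H1.tidy is just for illustration):
sim.H1.tidy <- BFDA_sim_tidy( # arguments mirror the BFDA.sim() call above
  expected.ES = 0.4,
  prior = list(
    "Cauchy",
    list(prior.location = 0, prior.scale = sqrt(2) / 2)
  ),
  n.min = 50,
  stepsize = 5,
  n.max = 300,
  type = "t.paired",
  design = "sequential",
  alternative = "two.sided",
  B = 1000,
  cores = 12,
  verbose = FALSE,
  seed = 123
)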
And the base R version:
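And similarly for the base R version (again, the object name is just for illustration):
sim.H1.base <- BFDA_sim_base( # arguments mirror the BFDA.sim() call above
  expected.ES = 0.4,
  prior = list(
    "Cauchy",
    list(prior.location = 0, prior.scale = sqrt(2) / 2)
  ),
  n.min = 50,
  stepsize = 5,
  n.max = 300,
  type = "t.paired",
  design = "sequential",
  alternative = "two.sided",
  B = 1000,
  cores = 12,
  verbose = FALSE,
  seed = 123
)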
And now let’s take a look at the results:
outcome percentage
1 Studies terminating at n.max (n=300) 0%
2 Studies terminating at a boundary 100%
3 --> Terminating at H1 boundary 97.7%
4 --> Terminating at H0 boundary 2.3%
Average sample number (ASN) at stopping point (both boundary hits and n.max): n = 66
Sample number quantiles (50/80/90/95%) at stopping point:
50% 80% 90% 95%
50 80 95 115
outcome percentage
1 Studies terminating at n.max (n=300) 0%
2 Studies terminating at a boundary 100%
3 --> Terminating at H1 boundary 98.2%
4 --> Terminating at H0 boundary 1.8%
Average sample number (ASN) at stopping point (both boundary hits and n.max): n = 66
Sample number quantiles (50/80/90/95%) at stopping point:
50% 80% 90% 95%
55 80 100 111
outcome percentage
1 Studies terminating at n.max (n=300) 0%
2 Studies terminating at a boundary 100%
3 --> Terminating at H1 boundary 98.2%
4 --> Terminating at H0 boundary 1.8%
Average sample number (ASN) at stopping point (both boundary hits and n.max): n = 66
Sample number quantiles (50/80/90/95%) at stopping point:
50% 80% 90% 95%
55 80 100 111
We can see from the output that the two rewrites provide very similar outputs to the original {BFDA} function. The outputs aren’t identical, because there is a difference in how the two sets of functions generate their random numbers. However, the general conclusions are the same.
The primary purpose here was, however, to improve the running time of the simulations, so that’s where we turn our attention next. Table 1 shows the running times for the various versions. From this we can see that the original {BFDA} version is the slowest at over 6 minutes. The two rewrites are significantly faster, with the base R version coming in at just under one and a half minutes, and the tidyverse version coming in at just over one and a half minutes. Our aim of speeding up the {BFDA} package has been successful!
Table 1: Running times for the three versions.

version | running time
---|---
BFDA | 397.119s (~6.62 minutes)
tidyverse-based rewrite | 95.661s (~1.59 minutes)
base R-based rewrite | 77.866s (~1.3 minutes)
The aim of the current post and Part 1 was to demonstrate how functional-style thinking can help lead to code where it is easier for the programmer to separate out the what (the functions transforming the data) and the how (how the functions are applied to data sets). This can lead to code that is easier to understand and easier to debug. We also saw that this style of coding makes it clear when parts of the dataset can be processed independently and when they can’t. And we saw that when parts of the dataset can be treated independently, the code can be parallelised in a fairly trivial way: by simply changing the name of the relevant higher-order function. Although the primary aim of functional style code isn’t to write faster code, we also saw how an upshot of better code design (which came with functional style thinking) was code that was significantly faster.
One thing I didn’t talk about in this post, although it was a feature I made significant use of, is chaining together operations with the pipe operator (|>). The result of making extensive use of the pipe operator was that the core of the {BFDA} simulation routines could be performed with a single long chain of functional transformations. It is my personal preference to write code like this, because I find it fits well with functional style thinking. To my mind, chaining together operations feels like building a production line on a conveyor belt, with each step performing some kind of transformation on the incoming data stream. But I understand that your mileage may vary, and therefore you might want to break up the long pipelines into mini sub-pipelines. The choice here is yours.
The final upshot of this style of code that I haven’t talked about is that it will make it easier to translate the R code I’ve written here into other languages. Although it is certainly possible to translate the non-functional style version into other languages, in my experience functional style code tends to be easier to translate, particularly into lower-level languages where more care needs to be taken with matters like allocating memory. In a subsequent blog post I’ll show how the rewrite I did here can be easily translated into the Rust programming language. But that is for another day. Hopefully, however, I’ve given you enough here to want to try out functional style code in your own projects.
By higher-order functions I just mean functions that take functions as one of their arguments.↩︎
The {purrr} package actually contains many different versions of the map function that make working with different kinds of inputs and outputs easier. For example, if we had a tibble as our input rather than the list of tibbles we might use the pmap() function instead. And if we wanted our output to be a tibble rather than a list of tibbles then we might use map_df() or pmap_df() instead. For now I’ll just use map() because it works most similarly to lapply().↩︎
@online{jcolling2022,
author = {Lincoln J Colling},
editor = {},
title = {Functional Style Code Part 2: {Maps} and Parallel Maps},
date = {2022-07-07},
langid = {en},
abstract = {Speeding up the BFDA R package using functional style
code.}
}