<title>Please Avoid detectCores() in your R Packages</title>
<link>https://www.jottr.org/2022/12/05/avoid-detectcores/</link>
<pubDate>Mon, 05 Dec 2022 21:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2022/12/05/avoid-detectcores/</guid>
<description>
<p>The <code>detectCores()</code> function of the <strong>parallel</strong> package is probably
one of the most used functions when it comes to setting the number of
parallel workers to use in R. In this blog post, I’ll try to explain
why using it is not always a good idea. Already now, I am going to
make a bold request and ask you to:</p>
<blockquote>
<p>Please <em>avoid</em> using <code>parallel::detectCores()</code> in your package!</p>
</blockquote>
<p>By reading this blog post, I hope you become more aware of the
different problems that arise from using <code>detectCores()</code> and how they
might affect you and the users of your code.</p>
<figure style="margin-top: 3ex;">
<img src="https://www.jottr.org/post/detectCores_bad_vs_good.png" alt="Screenshots of two terminal-based, colored graphs, each showing near 100% load on all 24 CPU cores. The load bars to the left are mostly red, whereas the ones to the right are mostly green. In between the two graphs is a shrug emoji, with the text 'do you want this?' pointing to the left and the text 'or that?' pointing to the right." style="width: 100%; margin: 0; margin-bottom: 2ex;"/>
<figcaption style="font-style: italic">
Figure 1: Using <code>detectCores()</code> risks overloading the
machine where R runs, even more so if there are other things already
running. The machine seen at the left is heavily loaded, because too
many parallel processes compete for the 24 CPU cores available, which
results in an extensive amount of kernel context switching (red),
which wastes precious CPU cycles. The machine to the right is
near-perfectly loaded at 100%, where no process uses more than it is
allowed to (mostly green).
</figcaption>
</figure>
<h2 id="tl-dr">TL;DR</h2>
<p>If you don’t have time to read everything, but will take my word that
we should avoid <code>detectCores()</code>, then the quick summary is that you
basically have two choices for the number of parallel workers to use
by default:</p>
<ol>
<li><p>Have your code run with a single core by default
(i.e. sequentially), or</p></li>
<li><p>replace all <code>parallel::detectCores()</code> with
<a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>parallelly::availableCores()</code></a>.</p></li>
</ol>
<p>I’m in the conservative camp and recommend the first alternative.
Using sequential processing by default, where the user has to make an
explicit choice to run in parallel, significantly lowers the risk of
clogging up the CPUs (left panel in Figure 1), especially when
there are other things running on the same machine.</p>
<p>The second alternative is useful if you’re not ready to make the move
to run sequentially by default. The <code>availableCores()</code> function of
the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package is fully backward compatible with
<code>detectCores()</code>, while it avoids the most common problems that come
with <code>detectCores()</code>. It is also agile to many more CPU-related
settings, including settings that the end-user, the systems
administrator, job schedulers, and Linux containers control. It is
designed to take care of common overuse issues so that you do not have
to spend time worrying about them.</p>
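<p>In code, the two alternatives could look something like this (a
minimal sketch; <code>fast_fcn()</code> is a hypothetical function):</p>
<pre><code class="language-r">## Alternative 1: sequential by default; the user opts in to parallelism
fast_fcn <- function(x, ncores = 1L) {
  ...
}

## Alternative 2: a safer drop-in replacement for parallel::detectCores()
fast_fcn <- function(x, ncores = parallelly::availableCores()) {
  ...
}
</code></pre>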
<h2 id="background">Background</h2>
<p>There are several problems with using <code>detectCores()</code> from the
<strong>parallel</strong> package for deciding how many parallel workers to use.
But before we get there, I want you to know that we find this function
commonly used in R scripts and R packages, and frequently suggested in
tutorials. So, do not feel ashamed if you use it.</p>
<p>If we scan the code of the R packages on CRAN (e.g. by <a href="https://github.com/search?q=org%3Acran+language%3Ar+%22detectCores%28%29%22&type=code">searching
GitHub</a><sup>1</sup>), or on Bioconductor (e.g. by <a href="https://code.bioconductor.org/search/search?q=detectCores%28%29)">searching
Bioc::CodeSearch</a>) we find many cases where <code>detectCores()</code> is used.
Here are some variants we see in the wild:</p>
<pre><code class="language-r">cl <- makeCluster(detectCores())
cl <- makeCluster(detectCores() - 1)
y <- mclapply(..., mc.cores = detectCores())
registerDoParallel(detectCores())
</code></pre>
<p>We also find functions that let the user choose the number of workers
via some argument, which defaults to <code>detectCores()</code>. Sometimes the
default is explicit, as in:</p>
<pre><code class="language-r">fast_fcn <- function(x, ncores = parallel::detectCores()) {
if (ncores > 1) {
cl <- makeCluster(ncores)
...
}
}
</code></pre>
<p>and sometimes it’s implicit, as in:</p>
<pre><code class="language-r">fast_fcn <- function(x, ncores = NULL) {
if (is.null(ncores))
ncores <- parallel::detectCores() - 1
if (ncores > 1) {
cl <- makeCluster(ncores)
...
}
}
</code></pre>
<p>As we will see next, all the above examples are potentially buggy and
might result in run-time errors.</p>
<h2 id="common-mistakes-when-using-detectcores">Common mistakes when using detectCores()</h2>
<h3 id="issue-1-detectcores-may-return-a-missing-value">Issue 1: detectCores() may return a missing value</h3>
<p>A small, but important detail about <code>detectCores()</code> that is often
missed is the following section in <code>help("detectCores", package =
"parallel")</code>:</p>
<blockquote>
<p><strong>Value</strong></p>
<p>An integer, <strong>NA if the answer is unknown</strong>.</p>
</blockquote>
<p>Because of this, we cannot rely on:</p>
<pre><code class="language-r">ncores <- detectCores()
</code></pre>
<p>to always work, i.e. we might end up with errors like:</p>
<pre><code class="language-r">ncores <- detectCores()
workers <- parallel::makeCluster(ncores)
Error in makePSOCKcluster(names = spec, ...) :
numeric 'names' must be >= 1
</code></pre>
<p>We need to account for this, especially as package developers. One
way to handle it is simply by using:</p>
<pre><code class="language-r">ncores <- detectCores()
if (is.na(ncores)) ncores <- 1L
</code></pre>
<p>or, by using the following shorter, but also harder to understand,
one-liner:</p>
<pre><code class="language-r">ncores <- max(1L, detectCores(), na.rm = TRUE)
</code></pre>
<p>This construct is guaranteed to always return at least one core.</p>
<p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In
contrast to <code>detectCores()</code>, <code>parallelly::availableCores()</code> handles
the above case automatically, and it guarantees to always return at
least one core.</p>
<h3 id="issue-2-detectcores-may-return-one">Issue 2: detectCores() may return one</h3>
<p>Although it’s rare to run into hardware with single-core CPUs these
days, you might run into a virtual machine (VM) configured to have a
single core. Because of this, you cannot reliably use:</p>
<pre><code class="language-r">ncores <- detectCores() - 1L
</code></pre>
<p>or</p>
<pre><code class="language-r">ncores <- detectCores() - 2L
</code></pre>
<p>in your code. If you use these constructs, a user of your code might
end up with zero or a negative number of cores here, which another way
we can end up with an error downstream. A real-world example of this
problem can be found in continous integration (CI) services,
e.g. <code>detectCores()</code> returns 2 in GitHub Actions jobs. So, we need to
account also for this case, which we can do by using the above
<code>max()</code> solution, e.g.</p>
<pre><code class="language-r">ncores <- max(1L, detectCores() - 2L, na.rm = TRUE)
</code></pre>
<p>This is guaranteed to always return at least one.</p>
<p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In
contrast, <code>parallelly::availableCores()</code> handles this case via
argument <code>omit</code>, which makes it easier to understand the code, e.g.</p>
<pre><code class="language-r">ncores <- availableCores(omit = 2)
</code></pre>
<p>This construct is guaranteed to return at least one core, e.g. if
there are one, two, or three CPU cores on this machine, <code>ncores</code> will
be one in all three cases.</p>
<h3 id="issue-3-detectcores-may-return-too-many-cores">Issue 3: detectCores() may return too many cores</h3>
<p>When we use PSOCK, SOCK, or MPI clusters as defined by the
<strong>parallel</strong> package, the communication between the main R session and
the parallel workers is done via R socket connections. Low-level
functions <code>parallel::makeCluster()</code>, <code>parallelly::makeClusterPSOCK()</code>,
and legacy <code>snow::makeCluster()</code> create these types of clusters. In
turn, there are higher-level functions that rely on these low-level
functions, e.g. <code>doParallel::registerDoParallel()</code> uses
<code>parallel::makeCluster()</code> if you are on MS Windows,
<code>BiocParallel::SnowParam()</code> uses <code>snow::makeCluster()</code>, and
<code>plan(multisession)</code> and <code>plan(cluster)</code> of the <strong><a href="https://future.futureverse.org">future</a></strong> package
use <code>parallelly::makeClusterPSOCK()</code>.</p>
<p>R has a limit on the number of connections it can have open at any
time. As of R 4.2.2, <a href="https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28">the limit is 125 open connections</a>. Because of
this, we can use at most 125 parallel PSOCK, SOCK, or MPI workers. In
practice, this limit is lower, because some connections may already be
in use elsewhere. To find the current number of free connections, we
can use <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>parallelly::freeConnections()</code></a>. If we try to launch a
cluster with too many workers, there will not be enough connections
available for the communication and the setup of the cluster will
fail. For example, a user running on a 192-core machine will get
errors such as:</p>
<pre><code class="language-r">> cl <- parallel::makeCluster(detectCores())
Error in socketAccept(socket = socket, blocking = TRUE, open = "a+b", :
all connections are in use
</code></pre>
<p>and</p>
<pre><code class="language-r">> cl <- parallelly::makeClusterPSOCK(detectCores())
Error: Cannot create 192 parallel PSOCK nodes. Each node needs
one connection, but there are only 124 connections left out of
the maximum 128 available on this R installation
</code></pre>
<p>Thus, if we use <code>detectCores()</code>, our R code will not work on larger,
modern machines. This is a problem that will become more and more
common as more users get access to more powerful computers.
Hopefully, R will increase this connection limit in a future release,
but until then, you as the developer are responsible for handling also
this case. To make your code agile to this limit, even if R increases
it later, you can use:</p>
<pre><code class="language-r">ncores <- max(1L, detectCores(), na.rm = TRUE)
ncores <- min(parallelly::freeConnections(), ncores)
</code></pre>
<p>This is guaranteed to return at least zero (sic!) and never more than
what is required to create a PSOCK, SOCK, or MPI cluster with that
many parallel workers.</p>
<p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In the
upcoming <strong>parallelly</strong> 1.33.0 version, you can use
<code>parallelly::availableCores(constraints = "connections")</code> to limit
the result to the current number of available R connections. In
addition, you can control the maximum number of cores that
<code>availableCores()</code> returns by setting R option
<code>parallelly.availableCores.system</code>, or environment variable
<code>R_PARALLELLY_AVAILABLECORES_SYSTEM</code>,
e.g. <code>R_PARALLELLY_AVAILABLECORES_SYSTEM=120</code>.</p>
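<p>As a rough sketch of these two controls (the <code>constraints</code>
argument assumes <strong>parallelly</strong> 1.33.0 or newer):</p>
<pre><code class="language-r">## Never return more workers than there are free R connections:
ncores <- parallelly::availableCores(constraints = "connections")

## Alternatively, cap what availableCores() may ever return, here at 120:
options(parallelly.availableCores.system = 120L)
ncores <- parallelly::availableCores()
</code></pre>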
<h2 id="issue-4-detectcores-does-not-give-the-number-of-allowed-cores">Issue 4: detectCores() does not give the number of “allowed” cores</h2>
<p>There’s a note in <code>help("detectCores", package = "parallel")</code> that
touches on the above problems, but also on other important limitations
that we should know of:</p>
<blockquote>
<p><strong>Note</strong></p>
<p>This [= <code>detectCores()</code>] is not suitable for use directly for the <code>mc.cores</code> argument of
<code>mclapply</code> nor specifying the number of cores in
<code>makeCluster</code>. First because it may return <code>NA</code>, second because it
does not give the number of <em>allowed</em> cores, and third because on
Sparc Solaris and some Windows boxes it is not reasonable to try to
use all the logical CPUs at once.</p>
</blockquote>
<p><strong>When is this relevant? The answer is: Always!</strong> This is because, as
package developers, we cannot really know when this occurs; we
never know on what type of hardware and system our code will run. So,
we have to account for these unknowns too.</p>
<p>Let’s look at some real-world cases where using <code>detectCores()</code> can
become a real issue.</p>
<h3 id="4a-a-personal-computer">4a. A personal computer</h3>
<p>A user might want to run other software tools at the same time while
running the R analysis. A very common pattern we find in R code is
to save one core for other purposes, say, browsing the web, e.g.</p>
<pre><code class="language-r">ncores <- detectCores() - 1L
</code></pre>
<p>This is a good start. It is the first step toward your software tool
acknowledging that there might be other things running on the same
machine. However, unlike the end-user, we as package developers
cannot know how many cores the user needs, or wishes, to set
aside. Because of this, it is better to let the user make this
decision, as illustrated below.</p>
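<p>For example, reusing the hypothetical <code>fast_fcn()</code> from above with a
sequential default, a user who wants to set aside one core for other
purposes can opt in explicitly:</p>
<pre><code class="language-r">## The user, not the package, decides how many cores to set aside:
y <- fast_fcn(x, ncores = parallelly::availableCores(omit = 1))
</code></pre>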
<p>A related scenario is when the user wants to run two concurrent R
sessions on the same machine, both using your code. If your code
assumes it can use all cores on the machine (i.e. <code>detectCores()</code>
cores), the user will end up running the machine at 200% of its
capacity. Whenever we use over 100% of the available CPU resources,
we get penalized and waste our computational cycles on overhead from
context switching, sub-optimal memory access, and more. This is where
we end up with the situation illustrated in the left part of
Figure 1.</p>
<p>Note also that users might not know that they use an R function that
runs on all cores by default. They might not even be aware that this
is a problem. Now, imagine if the user runs three or four such R
sessions, resulting in a 300-400% CPU load. This is when things start
to run slowly. The computer will be sluggish, maybe unresponsive, and
most likely going to get very hot (“we’re frying the computer”). By
the time the four concurrent R processes complete, the user might have
been able to finish six to eight similar processes, had they not been
fighting each other for the limited CPU resources.</p>
<h3 id="4b-a-shared-computer">4b. A shared computer</h3>
<p>In academia and industry, it is common for several users to
share the same compute server or set of compute nodes. It might be
as simple as everyone SSHing into a shared machine with many cores and
large amounts of memory to run their analyses there. On such setups, load
balancing between users is often based on an honor system, where each
user checks how many resources are available before launching an
analysis. This helps to make sure they don’t end up using too many
cores, or too much memory, slowing down the computer for everyone
else.</p>
<div style="width: 38%; float: right;">
<figure style="margin-top: 1ex;">
<img src="https://www.jottr.org/post/detectCores_bad.png" alt="The left-handside graph of Figure 1, which shows mostly red bars at near 100% load for 24 CPU cores." style="width: 100%; margin: 0; margin-bottom: 2ex;"/>
<figcaption>
Figure 2: Overusing the CPU cores brings everything to a halt.
</figcaption>
</figure>
</div>
<p>Now, imagine they run a software tool that uses all CPU cores by
default. In that case, there is a significant risk they will step on
the other users’ processes, slowing everything down for everyone,
especially if there is already a big load on the machine. From my
experience in academia, this happens frequently. The user causing the
problem is often not aware of it, because they just launch the problematic
software with the default settings and leave it running, planning to
come back to it a few hours or days later. In the meantime,
other users might wonder why their command-line prompts become
sluggish or even non-responsive, and their analyses suddenly take
forever to complete. Eventually, someone or something alerts the
systems administrators to the problem, who end up having to drop
everything else and start troubleshooting. This often results in them
terminating the runaway processes and reaching out to the user
who runs the problematic software, which leads to a large amount of
time and resources being wasted among users and administrators. All
this is only because we designed our R package to use all cores by
default. This is not a made-up toy story; it is a very likely
scenario that happens on shared servers if you make <code>detectCores()</code>
the default in your R code.</p>
<p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In
contrast to <code>detectCores()</code>, if you use <code>parallelly::availableCores()</code>
the user, or the systems administrator, can limit the default number
of CPU cores returned by setting environment variable
<code>R_PARALLELLY_AVAILABLECORES_FALLBACK</code>. For instance, by setting
<code>R_PARALLELLY_AVAILABLECORES_FALLBACK=2</code> centrally,
<code>availableCores()</code> will, unless there are other settings that allow
the process to use more, return two cores regardless of how many CPU
cores the machine has. This lowers the damage any single process
can inflict on the system. It would take many such processes running
at the same time for them to have an overall negative impact, and the
risk of that happening by mistake is much lower than when using
<code>detectCores()</code> by default.</p>
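<p>A minimal illustration of this mechanism: the environment variable is
read when the <strong>parallelly</strong> package is loaded, so it must be set before
then (e.g. in <code>~/.Renviron</code> or a site profile), whereas the corresponding
R option can be set at runtime:</p>
<pre><code class="language-r">## The corresponding R option, consulted each time availableCores() is called:
options(parallelly.availableCores.fallback = 2L)
parallelly::availableCores()  ## 2, unless other settings allow more
</code></pre>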
<h3 id="4c-a-shared-compute-cluster-with-many-machines">4c. A shared compute cluster with many machines</h3>
<p>Other, larger compute systems, often referred to as high-performance
compute (HPC) clusters, have a job scheduler for running scripts in
batches distributed across multiple machines. When users submit their
scripts to the scheduler’s job queue, they request how many cores and
how much memory each job requires. For example, a user on a Slurm
cluster can request that their <code>run_my_rscript.sh</code> script gets to run
with 48 CPU cores and 256 GiB of RAM by submitting it to the scheduler
as:</p>
<pre><code class="language-sh">sbatch --cpus-per-task=48 --mem=256G run_my_rscript.sh
</code></pre>
<p>The scheduler keeps track of all running and queued jobs, and when
enough compute slots are freed up, it will launch the next job in the
queue, giving it the compute resources it requested. This is a very
convenient and efficient way to batch process a large amount of
analyses coming from many users.</p>
<p>However, just like with a shared server, it is important that the
software tools running this way respect the compute resources that the
job scheduler allotted to the job. The <code>detectCores()</code> function does
<em>not</em> know about job schedulers - all it does is return the number of
CPU cores on the current machine regardless of how many cores the job
has been allotted by the scheduler. So, if your R package uses
<code>detectCores()</code> cores by default, then it will overuse the CPUs and
slow things down for everyone running on the same compute node.
Again, when this happens, it often slows everything down and triggers
lots of wasted user and admin effort spent on troubleshooting and
communication back and forth.</p>
<p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In
contrast, <code>parallelly::availableCores()</code> respects the number of CPU
slots that the job scheduler has given to the job. It recognizes
environment variables set by our most common HPC schedulers, including
Fujitsu Technical Computing Suite (PJM), Grid Engine (SGE), Load
Sharing Facility (LSF), PBS/Torque, and Simple Linux Utility for
Resource Management (Slurm).</p>
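<p>For example, here is roughly what to expect inside a Slurm job
submitted with <code>--cpus-per-task=48</code> on a 192-core machine (the
environment variable below is set by Slurm itself for the job):</p>
<pre><code class="language-r">Sys.getenv("SLURM_CPUS_PER_TASK")  ## "48", set by the scheduler
parallel::detectCores()            ## 192 - all cores on the machine
parallelly::availableCores()       ## 48 - the CPU slots allotted to the job
</code></pre>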
<h3 id="4d-running-r-via-cgroups-on-in-a-linux-container">4d. Running R via cgroups or in a Linux container</h3>
<p>So far, we have been concerned about overuse of the CPU cores
affecting other processes and other users running on the same machine.
Some systems are configured to prevent misbehaving software
from affecting other users. In Linux, this can be done with so-called
control groups (“cgroups”), where a process gets allotted a certain
amount of CPU cores. If the process uses too many parallel workers,
they cannot break out of the sandbox set up by cgroups. From the
outside, it will look like the process never uses more than its
allocated CPU cores. Some HPC job schedulers have this feature
enabled, but not all of them. You find the same feature for Linux
containers, e.g. we can limit the number of CPU cores, or throttle the
CPU load, using command-line options when launching a Docker
container, e.g. <code>docker run --cpuset-cpus=0-2,8 …</code> or <code>docker run
--cpus=3.4 …</code>.</p>
<p>So, if you are a user on a system where compute resources are
compartmentalized this way, you run a much lower risk of wreaking
havoc on a shared system. That is good news, but if you run too many
parallel workers, that is, try to use more cores than are available to
you, then you will clog up your own analysis. The behavior would be
the same as if you requested 96 parallel workers on your local
eight-core notebook (the scenario in the left panel of Figure 1),
with the exception that you will not overheat the computer.</p>
<p>The problem with <code>detectCores()</code> is that it returns the number of CPU
cores on the hardware, regardless of the cgroups settings. So, if
your R process is limited to eight cores by cgroups, and you use
<code>ncores = detectCores()</code> on a 96-core machine, you will end up running
96 parallel workers fighting for the resources on eight cores. A
real-world example of this happens for those of you who have a free
account on RStudio Cloud. In that case, you are given only a single
CPU core to run your R code on, but the underlying machine typically
has 16 cores. If you use <code>detectCores()</code> there, you will end up
creating 16 parallel workers, all running on the same single CPU core,
which is a very inefficient way to run the code.</p>
<p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In
contrast to <code>detectCores()</code>, <code>parallelly::availableCores()</code> respects
cgroups, and will return eight cores instead of 96 in the above
example, and a single core on a free RStudio Cloud account.</p>
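<p>To make this concrete, here is what the above cgroups example would
look like (the hypothetical 96-core machine with the R process limited
to eight cores):</p>
<pre><code class="language-r">parallel::detectCores()       ## 96 - the hardware, ignoring cgroups
parallelly::availableCores()  ## 8 - what cgroups actually allow
</code></pre>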
<h2 id="my-opinionated-recommendation">My opinionated recommendation</h2>
<div style="width: 38%; float: right;">
<figure style="margin-top: 1ex;">
<img src="https://www.jottr.org/post/detectCores_good.png" alt="The right-handside graph of Figure 1, which shows mostly green bars at near 100% load for 24 CPU cores." style="width: 100%; margin: 0; margin-bottom: 2ex;"/>
<figcaption>
Figure 3: If we avoid overusing the CPU cores, then everything will run
much smoother and much faster.
</figcaption>
</figure>
</div>
<p>As developers, I think we should at least be aware of these problems,
and acknowledge that they exist and that they are indeed real problems
that people run into “out there”. We should also accept that we cannot
predict what type of compute environment our R code will run on.
Unfortunately, I don’t have a magic solution that addresses all the
problems reported here. That said, I think the best we can do is to
be conservative and not make hard-coded decisions on parallelization
in our R packages and R scripts.</p>
<p>Because of this, I argue that <strong>the safest is to design your R package
to run sequentially by default (e.g. <code>ncores = 1L</code>), and leave it to
the user to decide on the number of parallel workers to use.</strong></p>
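<p>Revisiting the earlier <code>fast_fcn()</code> example, such a design could look
like this (a sketch only):</p>
<pre><code class="language-r">fast_fcn <- function(x, ncores = 1L) {
  if (ncores > 1) {
    cl <- makeCluster(ncores)
    on.exit(stopCluster(cl), add = TRUE)
    ...  ## parallel code path
  } else {
    ...  ## sequential code path
  }
}
</code></pre>
<p>The user can then make an explicit, informed choice, e.g.
<code>fast_fcn(x, ncores = 4)</code> or <code>fast_fcn(x, ncores = parallelly::availableCores())</code>.</p>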
<p>The <strong>second-best alternative</strong> that I can come up with is to replace
<code>detectCores()</code> with <code>availableCores()</code>, e.g. <code>ncores =
parallelly::availableCores()</code>. It is designed to respect common
system and R settings that control the number of allowed CPU cores.
It also respects R options and environment variables commonly used to
limit CPU usage, including those set by our most common HPC job
schedulers. In addition, it is possible to control the <em>fallback</em>
behavior so that only a few cores are used when nothing else is set.
For example, if the environment variable
<code>R_PARALLELLY_AVAILABLECORES_FALLBACK</code> is set to <code>2</code>, then
<code>availableCores()</code> returns two cores by default, unless other settings
allowing more are available. A conservative systems administrator may
want to set <code>export R_PARALLELLY_AVAILABLECORES_FALLBACK=1</code> in
<code>/etc/profile.d/single-core-by-default.sh</code>. To see other benefits
from using <code>availableCores()</code>, see
<a href="https://parallelly.futureverse.org">https://parallelly.futureverse.org</a>.</p>
<p>Believe it or not, there’s actually more to be said on this topic, but
I think this is already more than a mouthful, so I will save that for
another blog post. If you made it this far, I applaud you and I thank
you for your interest. If you agree, or disagree, or have additional
thoughts around this, please feel free to reach out on the <a href="https://github.com/HenrikBengtsson/future/discussions/">Future
Discussions Forum</a>.</p>
<p>Over and out,</p>
<p>Henrik</p>
<p><small><sup>1</sup> Searching code on GitHub requires you to log in to
GitHub.</small></p>
<p>UPDATE 2022-12-06: <a href="https://github.com/HenrikBengtsson/future/discussions/656">Alex Chubaty pointed out another problem</a>, where
the value of <code>detectCores()</code> can be too large on modern machines,
e.g. machines with 128 or 192 CPU cores. I’ve added Section ‘Issue 3:
detectCores() may return too many cores’ explaining and addressing this
problem.</p>
<p>UPDATE 2022-12-11: Mention upcoming
<code>parallelly::availableCores(constraints = "connections")</code>.</p>
</description>