# Using Kubernetes and the Future Package to Easily Parallelize R in the Cloud

This is a guest post by Chris Paciorek, Department of Statistics, University of California at Berkeley. In this post, I’ll demonstrate that you can easily use the future package in R on a cluster of machines running in the cloud, specifically on a Kubernetes cluster. This allows you to easily doing parallel computing in R in the cloud. One advantage of doing this in the cloud is the ability to easily scale the number and type of (virtual) machines across which you run your parallel computation.

# future.BatchJobs - End-of-Life Announcement

This is an announcement that future.BatchJobs - A Future API for Parallel and Distributed Processing using BatchJobs has been archived on CRAN. The package has been deprecated for years with a recommendation of using future.batchtools instead. The latter has been on CRAN since June 2017 and builds upon the batchtools package, which itself supersedes the BatchJobs package. To wrap up the three-and-a-half year long life of future.

# My Keynote 'Future' Presentation at the European Bioconductor Meeting 2020

Luke Zappia's summary of the talk I presented Future: A Simple, Extendable, Generic Framework for Parallel Processing in R at the European Bioconductor Meeting 2020, which took place online during the week of December 14-18, 2020. You’ll find my slides (39 slides + Q&A slides; 35 minutes) below: Title & Abstract HTML (Google Slides; requires online access) PDF (flat slides) Video (to be uploaded by the organizers) I want to thank the organizers for inviting me to this Bioconductor conference.

# NYC R Meetup: Slides on Future

I presented Future: Simple, Friendly Parallel Processing for R (67 minutes; 59 slides + Q&A slides) at New York Open Statistical Programming Meetup, on November 9, 2020: HTML (incremental Google Slides; requires online access) PDF (flat slides) Video (presentation starts at 0h10m30s, Q&A starts at 1h17m40s) I like to thanks everyone who attented and everyone who asked lots of brilliant questions during the Q&A. I’d also want to express my gratitude to Amada, Jared, and Noam for the invitation and making this event possible.

# future 1.20.1 - The Future Just Got a Bit Brighter

future 1.20.1 is on CRAN. It adds some new features, deprecates old and unwanted behaviors, adds a couple of vignettes, and fixes a few bugs. Interactive debugging First out among the new features, and a long-running feature request, is the addition of argument split to plan(), which allows us to split, or “tee”, any output produced by futures. The default is split = FALSE for which standard output and conditions are captured by the future and only relayed after the future has been resolved, i.

# parallelly, future - Cleaning Up Around the House

parallelly adverb par·​al·​lel·​ly | \ ˈpa-rə-le(l)li \ Definition: in a parallel manner future noun fu·​ture | \ ˈfyü-chər \ Definition: existing or occurring at a later time I’ve cleaned up around the house - with the recent release of future 1.20.1, the package gained a dependency on the new parallelly package. Now, if you’re like me and concerned about bloating package dependencies, I’m sure you immediately wondered why I chose to introduce a new dependency.

# Trust the Future

Each time we use R to analyze data, we rely on the assumption that functions used produce correct results. If we can’t make this assumption, we have to spend a lot of time validating every nitty detail. Luckily, we don’t have to do this. There are many reasons for why we can comfortably use R for our analyses and some of them are unique to R. Here are some I could think of while writing this blog post - I’m sure I forgot something:

# future 1.19.1 - Making Sure Proper Random Numbers are Produced in Parallel Processing

Parallel ‘Digital Rain’ by Jahobr After two-and-a-half months, future 1.19.1 is now on CRAN. As usual, there are some bug fixes and minor improvements here and there (NEWS), including things needed by the next version of furrr. For those of you who use Slurm or LSF/OpenLava as a scheduler on your high-performance compute (HPC) cluster, future::availableCores() will now do a better job respecting the CPU resources that those schedulers allocate for your R jobs.

# Detect When the Random Number Generator Was Used

If you ever need to figure out if a function call in R generated a random number or not, here is a simple trick that you can use in an interactive R session. Add the following to your ~/.Rprofile(*): if (interactive()) { invisible(addTaskCallback(local({ last <- .GlobalEnv$.Random.seed function(...) { curr <- .GlobalEnv$.Random.seed if (!identical(curr, last)) { msg <- "TRACKER: .Random.seed changed" if (requireNamespace("crayon", quietly=TRUE)) msg <- crayon::blurred(msg) message(msg) last <<- curr } TRUE } }), name = "RNG tracker")) } It works by checking whether or not the state of the random number generator (RNG), that is, .

# future and future.apply - Some Recent Improvements

There are new versions of future and future.apply - your friends in the parallelization business - on CRAN. These updates are mostly maintenance updates with bug fixes, some improvements, and preparations for upcoming changes. It’s been some time since I blogged about these packages, so here is the summary of the main updates this far since early 2020: future: values() for lists and other containers was renamed to value() to simplify the API [future 1.

# e-Rum 2020 Slides on Progressr

Source: Wiktionary.org I presented Progressr: An Inclusive, Unifying API for Progress Updates (15 minutes; 20 slides) at e-Rum 2020, on June 17, 2020: Abstract HTML (incremental Google Slides; requires online access) PDF (flat slides) Video (starts at 00h49m58s) I am grateful for everyone involved who made e-Rum 2020 possible. I cannot imagine having to cancel the on-site Milano conference that had planned for more than a year and then start over to re-organize and create a fabulous online experience for ~1,500 participants in such short notice.

# rstudio::conf 2020 Slides on Futures

Design: Dan LaBar I presented Future: Simple Async, Parallel & Distributed Processing in R Why and What’s New? at rstudio::conf 2020 in San Francisco, USA, on January 29, 2020. Below are the slides for my talk (17 slides; ~18+2 minutes): HTML (incremental Google Slides; requires online access) PDF (flat slides) Video with closed captions (official rstudio::conf recording) First of all, a big thank you goes out to Dan LaBar (@embiggenData) for proposing and contributing the original design of the future hex sticker.

# future 1.15.0 - Lazy Futures are Now Launched if Queried

No dogs were harmed while making this release future 1.15.0 is now on CRAN, accompanied by a recent, related update of future.callr 0.5.0. The main update is a change to the Future API: resolved() will now also launch lazy futures Although this change does not look much to the world, I’d like to think of this as part of a young person slowly finding themselves. This change in behavior helps us in cases where we create lazy futures upfront;

# useR! 2019 Slides on Futures

Below are the slides for my Future: Simple Parallel and Distributed Processing in R that I presented at the useR! 2019 conference in Toulouse, France on July 9-12, 2019. My talk (25 slides; ~15+3 minutes): Title: Future: Simple Parallel and Distributed Processing in R HTML (incremental Google Slides; requires online access) PDF (flat slides) Video (official recording) I want to send out a big thank you to everyone making the useR!

# startup - run R startup files once per hour, day, week, ...

New release: startup 0.12.0 is now on CRAN. This version introduces support for processing some of the R startup files with a certain frequency, e.g. once per day, once per week, or once per month. See below for two examples. startup::startup() is cross platform. The startup package makes it easy to split up a long, complicated .Rprofile startup file into multiple, smaller files in a .Rprofile.d/ folder. For instance, setting R option repos in a separate file ~/.

# SatRday LA 2019 Slides on Futures

A bit late but here are my slides on Future: Friendly Parallel Processing in R for Everyone that I presented at the satRday LA 2019 conference in Los Angeles, CA, USA on April 6, 2019. My talk (33 slides; ~45 minutes): Title: : Friendly Parallel and Distributed Processing in R for Everyone HTML (incremental slides; requires online access) PDF (flat slides) Video (44 min; YouTube; sorry, different page numbers) Thank you all for making this a stellar satRday event.

# SatRday Paris 2019 Slides on Futures

Below are links to my slides from my talk on Future: Friendly Parallel Processing in R for Everyone that I presented last month at the satRday Paris 2019 conference in Paris, France (February 23, 2019). My talk (32 slides; ~40 minutes): Title: Future: Friendly Parallel Processing in R for Everyone HTML (incremental slides; requires online access) PDF (flat slides) A big shout out to the organizers, all the volunteers, and everyone else for making it a great satRday.

# Parallelize a For-Loop by Rewriting it as an Lapply Call

A commonly asked question in the R community is: How can I parallelize the following for-loop? The answer almost always involves rewriting the for (...) { ... } loop into something that looks like a y <- lapply(...) call. If you can achieve that, you can parallelize it via for instance y <- future.apply::future_lapply(...) or y <- foreach::foreach() %dopar% { ... }. For some for-loops it is straightforward to rewrite the code to make use of lapply() instead, whereas in other cases it can be a bit more complicated, especially if the for-loop updates multiple variables in each iteration.

# Maintenance Updates of Future Backends and doFuture

New versions of the following future backends are available on CRAN: future.callr - parallelization via callr, i.e. on the local machine future.batchtools - parallelization via batchtools, i.e. on a compute cluster with job schedulers (SLURM, SGE, Torque/PBS, etc.) but also on the local machine future.BatchJobs - (maintained for legacy reasons) parallelization via BatchJobs, which is the predecessor of batchtools These releases fix a few small bugs and inconsistencies that were identified with help of the future.

# future 1.9.0 - Output from The Future

future 1.9.0 - Unified Parallel and Distributed Processing in R for Everyone - is on CRAN. This is a milestone release: Standard output is now relayed from futures back to the master R session - regardless of where the futures are processed! Disclaimer: A future’s output is relayed only after it is resolved and when its value is retrieved by the master R process. In other words, the output is not streamed back in a “live” fashion as it is produced.
• page 1 of 3

#### Henrik Bengtsson

MSc CS | PhD Math Stat | Associate Professor | R Foundation | R Consortium

Associate Professor