JottR on R https://www.jottr.org/categories/r/ Recent content in R on JottR Wed, 25 Jun 2025 00:00:00 +0000 Setting Future Plans in R Functions — and Why You Probably Shouldn't https://www.jottr.org/2025/06/25/with-plan/ Wed, 25 Jun 2025 00:00:00 +0000 https://www.jottr.org/2025/06/25/with-plan/ <p><a href="https://www.jottr.org/2025/06/19/futureverse-10-years/"><img src="https://www.jottr.org/post/future-logo-balloons.png" alt="The 'future' hexlogo balloon wall" style="width: 20%; padding-left: 2ex; padding-bottom: 2ex; float: right;"/></a></p> <p>The <strong>future</strong> package <a href="https://www.jottr.org/2025/06/19/futureverse-10-years/">celebrates ten years on CRAN</a> as of June 19, 2025. This is the second in a series of blog posts highlighting recent improvements to the <strong><a href="https://www.futureverse.org">futureverse</a></strong> ecosystem.</p> <h2 id="tl-dr">TL;DR</h2> <p>You can now use</p> <pre><code class="language-r">my_fcn &lt;- function(...) { with(plan(multisession), local = TRUE) ... } </code></pre> <p>to <em>temporarily</em> set a future backend for use in your function. 
This guarantees that any changes are undone when the function exits, even if there is an error or an interrupt.</p> <p>But, I really recommend <em>not</em> doing any of that, as I&rsquo;ll try to explain below.</p> <h2 id="decoupling-of-intent-to-parallelize-and-how-to-execute-it">Decoupling of intent to parallelize and how to execute it</h2> <p>The core design philosophy of <strong>futureverse</strong> is:</p> <blockquote> <p>&ldquo;The developer decides what to parallelize, the user decides where and how.&rdquo;</p> </blockquote> <p>This decoupling of <em>intent</em> (what to parallelize) and <em>execution</em> (how to do it) makes code written using futureverse flexible, portable, and easy to maintain.</p> <p>Specifically, the developer <em>controls what to parallelize</em> by using <code>future()</code> or higher-level abstractions like <code>future_lapply()</code> and <code>future_map()</code> to mark code regions that may run concurrently. The code makes no assumptions about the compute environment and is therefore agnostic to which future backend is being used, e.g.</p> <pre><code class="language-r">y &lt;- future_lapply(X, slow_fcn) </code></pre> <p>and</p> <pre><code class="language-r">y &lt;- future_map(X, slow_fcn) </code></pre> <p>Note how there is nothing in those two function calls that specifies how they are parallelized, if at all. Instead, the end user (e.g., data analyst, HPC user, or script runner) <em>controls the execution strategy</em> by setting the <a href="https://www.futureverse.org/backends.html">future backend</a> via <code>plan()</code>, e.g., built-in sequential, built-in multisession, <strong><a href="https://future.callr.futureverse.org">future.callr</a></strong>, and <strong><a href="https://future.mirai.futureverse.org">future.mirai</a></strong> backends. 
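To make this split concrete, here is a minimal sketch of the user's side, assuming the <strong>future.apply</strong> package is installed; <code>slow_fcn</code> here is a placeholder workload, not a real function from the text:

```r
library(future.apply)  # attaches the 'future' package too, providing plan()

# Placeholder workload, standing in for any slow function
slow_fcn <- function(x) {
  Sys.sleep(0.05)
  sqrt(x)
}

plan(sequential)                   # user choice: run in the current R session
y1 <- future_lapply(1:4, slow_fcn)

plan(multisession, workers = 2)    # user choice: two background R sessions
y2 <- future_lapply(1:4, slow_fcn)

plan(sequential)                   # shut down the background workers again
identical(y1, y2)                  # same results, regardless of backend
```

The `future_lapply()` call is untouched between the two runs; only the `plan()` statements differ.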
This allows the user to scale the same code from a notebook to an HPC cluster or cloud environment without changing the original code.</p> <p>We can find this design of <em>decoupling intent and execution</em> also in traditional R parallelization frameworks. In the <strong>parallel</strong> package we have <code>setDefaultCluster()</code>, which the user can set to control the default cluster type when none is explicitly specified. For that to be used, the developer needs to make sure to use the default <code>cl = NULL</code>, either explicitly as in:</p> <pre><code class="language-r">y &lt;- parLapply(cl = NULL, X, slow_fcn) </code></pre> <p>or implicitly<sup class="footnote-ref" id="fnref:1"><a href="#fn:1">1</a></sup>, by making sure all arguments are named, as in:</p> <pre><code class="language-r">y &lt;- parLapply(X = X, FUN = slow_fcn) </code></pre> <p>Unfortunately, this is rarely used - instead <code>parLapply(cl, X, FUN)</code> is by far the most common way of using the <strong>parallel</strong> package, resulting in little to no control for the end user.</p> <p>The <strong>foreach</strong> package had greater success with this design philosophy. There the developer writes:</p> <pre><code class="language-r">y &lt;- foreach(x = X) %dopar% { slow_fcn(x) } </code></pre> <p>with no option in that call to specify which parallel backend to use. Instead, the user typically controls the parallel backend via the so called &ldquo;dopar&rdquo; foreach adapter, e.g. <code>doParallel::registerDoParallel()</code>, <code>doMC::registerDoMC()</code>, and <code>doFuture::registerDoFuture()</code>. Unfortunately, there are ways for the developer to write <code>foreach()</code> with <code>%dopar%</code> statements such that the code works only with a specific parallel backend<sup class="footnote-ref" id="fnref:2"><a href="#fn:2">2</a></sup>. 
Regardless, it is clear from their designs that both of these packages shared the same fundamental design philosophy of <em>decoupling intent and execution</em> as is used in the <strong>futureverse</strong>. You can read more about this in the introduction of my <a href="https://journal.r-project.org/archive/2021/RJ-2021-048/index.html">H. Bengtsson (2021)</a> article.</p> <p>When writing scripts or R Markdown documents, I recommend putting code that controls the execution (e.g. <code>plan()</code>, <code>registerDoNnn()</code>, and <code>setDefaultCluster()</code>) at the very top, immediately after any <code>library()</code> statements. This is also where I, like many others, prefer to put global settings such as <code>options()</code> statements. This makes it easier for anyone to identify which settings are available and used by the script. It also avoids cluttering up the rest of the code with such details.</p> <h2 id="straying-away-from-the-core-design-philosophy">Straying away from the core design philosophy</h2> <p>One practical advantage of the above decoupling design is that there is only one place where parallelization is controlled, instead of it being scattered throughout the code, e.g. as special parallel arguments to different function calls. This makes it easier for the end user, but also for the package developer, who does not have to worry about what their APIs should look like and what arguments they should take.</p> <p>That said, some package developers prefer to expose control of parallelization via special function arguments. If we search CRAN packages, we find arguments like <code>parallel = FALSE</code>, <code>ncores = 1</code>, and <code>cluster = NULL</code> that are then used internally to set up the parallel backend. 
If you write functions that take this approach, it is <em>critical</em> that you remember to set the backend only temporarily, which can be done via <code>on.exit()</code>, e.g.</p> <pre><code class="language-r">my_fcn &lt;- function(xs, ncores = 1) { if (ncores &gt; 1) { cl &lt;- parallel::makeCluster(ncores) on.exit(parallel::stopCluster(cl)) y &lt;- parallel::parLapply(cl = cl, xs, slow_fcn) } else { y &lt;- lapply(xs, slow_fcn) } y } </code></pre> <p>If you use futureverse, you can use:</p> <pre><code class="language-r">my_fcn &lt;- function(xs, ncores = 1) { old_plan &lt;- plan(multisession, workers = ncores) on.exit(plan(old_plan)) y &lt;- future_lapply(xs, slow_fcn) y } </code></pre> <p>And, since <strong>future</strong> 1.40.0 (2025-04-10), you can achieve the same with a single line of code<sup class="footnote-ref" id="fnref:3"><a href="#fn:3">3</a></sup>:</p> <pre><code class="language-r">my_fcn &lt;- function(xs, ncores = 1) { with(plan(multisession, workers = ncores), local = TRUE) y &lt;- future_lapply(xs, slow_fcn) y } </code></pre> <p>I hope that this addition lowers the risk of forgetting to undo any changes done by <code>plan()</code> inside functions. If you forget, then you may override what the user intends to use elsewhere. For instance, they might have set <code>plan(batchtools_slurm)</code> to run their R code across a Slurm high-performance-compute (HPC) cluster, but if you change the <code>plan()</code> inside your package function without undoing your changes, then the user is in for a surprise and maybe also hours of troubleshooting.</p> <h2 id="but-please-avoid-switching-future-backends-if-you-can">But, please avoid switching future backends if you can</h2> <p>I still want to plead with package developers to avoid setting the future backend, even temporarily, inside their functions. There are other reasons for not doing this. 
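Before getting to those reasons: as a sanity check that such a temporary switch really is undone, here is a sketch, assuming the <strong>future</strong> and <strong>future.apply</strong> packages are installed:

```r
library(future)
library(future.apply)

my_fcn <- function(xs, ncores = 1) {
  # Backend change is local to this call: reverted on exit, error, or interrupt
  with(plan(multisession, workers = ncores), local = TRUE)
  future_lapply(xs, function(x) x^2)
}

plan(sequential)                           # the user's choice
y <- my_fcn(1:3, ncores = 2)
stopifnot(inherits(plan(), "sequential"))  # the user's plan is untouched
```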
For instance, if you provide users with an <code>ncores</code> argument for controlling the amount of parallelization, you risk locking the user into a specific parallel backend. A common pattern is to use <code>plan(multisession, workers = ncores)</code> as in the above examples. However, this prevents the user from taking advantage of other closely related parallel backends, e.g. <code>plan(callr, workers = ncores)</code> and <code>plan(mirai_multisession, workers = ncores)</code>. The <strong>future.callr</strong> backend runs each parallel task in a fresh R session that is shut down immediately afterward, which is beneficial when memory is the limiting factor. The <strong>future.mirai</strong> backend is optimized for low latency, meaning it can also parallelize shorter tasks, which might otherwise not be worth parallelizing. Also, unlike <code>multisession</code>, these alternative backends can make use of all CPU cores available on modern hardware, e.g. 192- and 256-core machines. The <code>multisession</code> backend, which builds upon <strong>parallel</strong> PSOCK clusters, is limited to a maximum of 125 parallel workers, because each parallel worker consumes one R connection, and R can only have 125 connections open at any time. There are ways to increase this limit, but it still requires work. See <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>parallelly::availableConnections()</code></a> for more details on this problem and how to increase the maximum number of connections.</p> <p>You can of course add another &ldquo;parallel&rdquo; argument to let your users also control which future backend to use, e.g. <code>backend = multisession</code> and <code>ncores = 1</code>. But, this might not be sufficient - there are backends that take additional arguments, which you then also need to support in each of your functions. 
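One way to soften the lock-in is to accept the backend itself as an argument; a hypothetical sketch (the <code>backend</code> argument and its handling are illustrative, not an established API), assuming <strong>future</strong> and <strong>future.apply</strong>:

```r
library(future)
library(future.apply)

# Hypothetical API: the caller passes a future backend plus backend arguments
my_fcn <- function(xs, backend = NULL, ...) {
  if (!is.null(backend)) {
    # Use the caller-supplied backend temporarily; reverted when we return
    with(plan(backend, ...), local = TRUE)
  }
  future_lapply(xs, function(x) x + 1)
}

y <- my_fcn(1:3)  # default: honors whatever the caller set via plan()
## Other conceivable calls (require the respective backend packages):
# y <- my_fcn(1:3, backend = multisession, workers = 4)
# y <- my_fcn(1:3, backend = future.callr::callr, workers = 4)
```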
Finally, new backends will be implemented by others in the future (pun intended and not), and we can&rsquo;t predict what they will require.</p> <p>Related to this, I am working on ways for futureverse to (i) choose among a set of parallel backends - not just one - and (ii) do so based on resource specifications (e.g. memory needs and maximum run times) attached to specific future statements. This will give back some control to the developer over how and where execution happens, and more options for the end user to scale out to different types of compute resources. For instance, a <code>future_map()</code> call with a 192-GiB memory requirement may only be sent to &ldquo;large-memory&rdquo; backends and, if not available, throw an instant error. Another example is a <code>future_map()</code> call with a 256-MiB memory and 5-minute runtime requirement - that is small enough to be sent to an AWS Lambda or Google Cloud Functions backend, if the user has specified such a backend.</p> <p>In summary, I argue that it&rsquo;s better to let the user be in full control of the future backend, by letting them set it via <code>plan()</code>, preferably at the top of their scripts. If not possible, please make sure to use <code>with(plan(...), local = TRUE)</code>.</p> <p><em>May the future be with you!</em></p> <p>Henrik</p> <h2 id="reference">Reference</h2> <ul> <li>H. 
Bengtsson, A Unifying Framework for Parallel and Distributed Processing in R using Futures, The R Journal (2021) 13:2, pages 208-227 [<a href="https://journal.r-project.org/archive/2021/RJ-2021-048/index.html">abstract</a>, <a href="https://journal.r-project.org/archive/2021/RJ-2021-048/RJ-2021-048.pdf">PDF</a>]</li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>If the argument <code>cl = NULL</code> of <a href="https://rdrr.io/r/parallel/clusterApply.html"><code>parLapply()</code></a> had been the last argument instead of the first, then <code>parLapply(X, slow_fcn)</code>, which resembles <code>lapply(X, slow_fcn)</code>, would have also resulted in the default cluster being used.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2"><p><code>foreach()</code> takes backend-specific options (e.g. <code>.options.multicore</code>, <code>.options.parallel</code>, <code>.options.mpi</code>, and <code>.options.future</code>). The developer can use these to adjust the default behavior of a given foreach adapter. Unfortunately, when used - or rather, when needed - the code is no longer agnostic to the backend - what will happen if a foreach adapter is used that the developer did not anticipate?</p> <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3"><p>The <strong><a href="https://cran.r-project.org/package=withr">withr</a></strong> package has <code>with_nnn()</code> and <code>local_nnn()</code> functions for evaluating code with various settings temporarily changed. Following this lead, I was very close to adding <code>with_plan()</code> and <code>local_plan()</code> to <strong>future</strong> 1.40.0, but then I noticed that <strong><a href="https://cran.r-project.org/package=mirai">mirai</a></strong> supports <code>with(daemons(ncores), { ... })</code>. This works because <code>with()</code> is an S3 generic function. 
I like this approach, especially since it avoids adding more functions to the API. I added similar support for <code>with(plan(multisession, workers = ncores), { ... })</code>. More importantly, this allowed me to also add the <code>with(..., local = TRUE)</code> variant to be used inside functions, which makes it very easy to safely switch to a temporary future backend inside a function.</p> <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> </ol> </div> Future Got Better at Finding Global Variables https://www.jottr.org/2025/06/23/future-got-better-at-finding-global-variables/ Mon, 23 Jun 2025 00:00:00 +0000 https://www.jottr.org/2025/06/23/future-got-better-at-finding-global-variables/ <p><a href="https://www.jottr.org/2025/06/19/futureverse-10-years/"><img src="https://www.jottr.org/post/future-logo-balloons.png" alt="The 'future' hexlogo balloon wall" style="width: 20%; padding-left: 2ex; padding-bottom: 2ex; float: right;"/></a></p> <p>The <strong>future</strong> package <a href="https://www.jottr.org/2025/06/19/futureverse-10-years/">celebrates ten years on CRAN</a> as of June 19, 2025. This is the first in a series of blog posts highlighting recent improvements to the <strong><a href="https://www.futureverse.org">futureverse</a></strong> ecosystem.</p> <p>The <strong><a href="https://globals.futureverse.org">globals</a></strong> package is part of the futureverse and has had two recent releases on 2025-04-15 and 2025-05-08. These updates address a few corner cases that would otherwise lead to unexpected errors. 
They also made it possible to close several long-outstanding issues reported on the <strong><a href="https://future.futureverse.org">future</a></strong>, <strong><a href="https://future.apply.futureverse.org">future.apply</a></strong>, <strong><a href="https://furrr.futureverse.org">furrr</a></strong>, and <strong><a href="https://doFuture.futureverse.org">doFuture</a></strong> package issue trackers, and elsewhere.</p> <p>The significant update is that <a href="https://globals.futureverse.org/reference/globalsOf.html"><code>findGlobals()</code></a> gained argument <code>method = &quot;dfs&quot;</code>, which finds globals in an R expression by walking its abstract syntax tree (AST) using a <em>depth-first-search</em> algorithm. <strong>This new approach does a better job of emulating how the R engine identifies global variables, which results in an even smoother ride for anyone using futureverse for parallel and distributed processing.</strong> Previously, a tweaked search algorithm adopted from <code>codetools::findGlobals()</code> was used. The <strong><a href="https://cran.r-project.org/package=codetools">codetools</a></strong> search algorithm is mainly designed for <code>R CMD check</code> to detect undefined variables being used in package code. To limit the number of false positives reported by <code>R CMD check</code>, such algorithms tend to be &ldquo;conservative&rdquo; by nature, so that we can trust what is reported. This strategy is not always sufficient for automatically detecting globals needed in parallel processing. 
As an example, in</p> <pre><code class="language-r">fcn &lt;- function() { a &lt;- b b &lt;- 1 } </code></pre> <p>variable <code>b</code> is a global variable, but if we ask <strong>codetools</strong>, it does not pick up <code>b</code> as a global;</p> <pre><code class="language-r">codetools::findGlobals(fcn) #&gt; [1] &quot;{&quot; &quot;&lt;-&quot; </code></pre> <p>This false negative is alright for <code>R CMD check</code>, but, in contrast, for parallel processing, we need to use a &ldquo;liberal&rdquo; search algorithm. In parallel processing it is okay to pick up and export too many variables to the parallel worker. If a variable is not used, little harm is done, but if we fail to export a needed variable, we&rsquo;ll end up with an object-not-found error. Futureverse has since the early days (December 2015) used a modified version of the <strong>codetools</strong> algorithm that is liberal, but not too liberal. It detects <code>b</code> as a global variable;</p> <pre><code class="language-r">globals::findGlobals(fcn) #&gt; [1] &quot;{&quot; &quot;&lt;-&quot; &quot;b&quot; </code></pre> <p>This liberal search strategy turns out to work surprisingly well for detecting globals needed in parallel processing, but there were corner cases where it failed. 
For example, <strong>futureverse</strong> struggled to identify global variables in cases such as:</p> <pre><code class="language-r">library(future) plan(multisession, workers = 2) x &lt;- 2 f &lt;- future(local({ h &lt;- function(x) -x h(x) })) value(f) </code></pre> <p>which resulted in</p> <pre><code>Error in eval(quote({ : object 'x' not found </code></pre> <p>This is because there are several different variables named <code>x</code>, and the one in the calling environment is &ldquo;masked&rdquo; by argument <code>x</code>, which results in <code>x</code> never being picked up and exported to the parallel worker.</p> <p>It might look as if this type of code was carefully crafted to fail and would rarely, if ever, be spotted in real code. As a matter of fact, this is a distilled version of a large real-world scenario reported by at least one person. It&rsquo;s thanks to such feedback that we together can make improvements to the <strong>futureverse</strong> ecosystem 🙏 I cannot know for sure, but I&rsquo;d suspect this has impacted several R developers already - the <strong>future</strong> package is after all among the 0.6% most downloaded packages and there are <a href="https://r-universe.dev/search?q=needs%3Afuture">1,300 packages that &ldquo;need&rdquo; it</a> as of May 2025. The above problem was fixed in <strong>globals</strong> 0.18.0 (2025-05-08) and <strong>future</strong> 1.49.0 (2025-05-09), which now make use of the new <code>findGlobals(..., method = &quot;dfs&quot;)</code> search strategy internally. 
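With an updated <strong>globals</strong> (&gt;= 0.18.0) installed, the new strategy can also be exercised directly; a quick check on the earlier <code>fcn()</code> example (the exact return value may include more symbols than shown here):

```r
library(globals)

fcn <- function() {
  a <- b  # 'b' is read before the local 'b' below exists, so it is a global here
  b <- 1
}

# The new depth-first-search walker also picks up 'b' as a global
"b" %in% findGlobals(fcn, method = "dfs")
```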
After updating these packages, the above code snippet gives us</p> <pre><code class="language-r">value(f) #&gt; [1] -2 </code></pre> <p>as we&rsquo;d expect.</p> <p>Another corner-case bug fix is where</p> <pre><code class="language-r">library(future) library(magrittr) x &lt;- list() f &lt;- future({ x %&gt;% `$&lt;-`(&quot;a&quot;, 42) }) </code></pre> <p>would result in the rather obscure error</p> <pre><code class="language-r">Error in e[[4]] : subscript out of bounds </code></pre> <p>This is due to <a href="https://gitlab.com/luke-tierney/codetools/-/issues/16">a bug</a> in the <strong>codetools</strong> package, which <strong>globals</strong> (&gt;= 0.17.0) [2025-04-15] works around. After updating, things work as expected;</p> <pre><code class="language-r">f &lt;- future({ x %&gt;% `$&lt;-`(&quot;a&quot;, 42) }) value(f) #&gt; $a #&gt; [1] 42 </code></pre> <p>Yet another fix in <strong>globals</strong> (&gt;= 0.17.0) is that previous versions would throw an error if they ran into an S7 object. The S7 class system was introduced in 2023.</p> <p><em>May the future be with you!</em></p> <p>Henrik</p> <p>PS. Did you know that the <strong>codetools</strong> package is <a href="https://gitlab.com/luke-tierney/codetools/-/blob/master/noweb/codetools.nw?ref_type=heads">written using literate programming</a> following the vision of Donald Knuth? Neat, eh? 
And, it&rsquo;s almost like it was vibe coded, but with the large-language model (LLM) part being replaced by human knowledge and expertise 🤓</p> Futureverse – Ten-Year Anniversary https://www.jottr.org/2025/06/19/futureverse-10-years/ Thu, 19 Jun 2025 00:00:00 +0000 https://www.jottr.org/2025/06/19/futureverse-10-years/ <figure style="margin-top: 3ex;"> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/future-logo-balloons.png" alt="The 'future' hexlogo balloon wall" style="width: 80%;"/> </center> </div> <figcaption style="font-style: italic"> The future package turns ten on CRAN today – June 19, 2025. <small>(Image credits: Dan LaBar for the future logo; Hadley Wickham and Greg Swinehart for the ggplot2 logo and balloon wall; The future balloon wall was inspired by ggplot2’s recent real-world version and generated with ChatGPT.)</small> </figcaption> </figure> <p>The <strong><a href="https://future.futureverse.org">future</a></strong> package turns ten years old today. I released version 0.6.0 to CRAN on June 19, 2015, a year before I presented the package and shared my visions at <a href="https://www.jottr.org/2016/07/02/future-user2016-slides/">useR! 2016</a>. I had no idea adoption would snowball the way it has. It&rsquo;s been an exciting, fun journey, and the best part has been you - the users and developers who shaped the futureverse through questions, discussions, bug reports, and feature requests. 
Thank you!</p> <p>To celebrate, I’m kicking off a series of posts over the next few weeks covering the latest improvements that make it easier than ever to scale existing code up or out on a parallel or distributed backend of your choice - and eventually in ways that are neater than what our trusty workhorses <strong><a href="https://future.apply.futureverse.org">future.apply</a></strong> and <strong><a href="https://furrr.futureverse.org">furrr</a></strong> offer.</p> <p>These gains come from a slow, steady, multi-year process of remodelling: internal redesigns, working with package maintainers to retire use of deprecated functions, releasing, fixing regressions, and repeating - all while going unnoticed by end-users and most developers, except for a few. The first CRAN release where this work could be noticed was <strong>future</strong> 1.40.0 (April 10, 2025), followed by regression fixes and additional features in 1.49.0 (May 9), and most recently 1.57.0 (June 5, 2025). More polishing and features are coming before we hit <strong>future</strong> 2.0.0 – in the near future (pun firmly intended). 
Thanks for helping make future a cornerstone of scalable R programming.</p> <p>Posts in this series thus far:</p> <ul> <li>2025-06-23: <a href="https://www.jottr.org/2025/06/23/future-got-better-at-finding-global-variables/">Future Got Better at Finding Global Variables</a></li> <li>2025-06-25: <a href="https://www.jottr.org/2025/06/25/with-plan/">Setting Future Plans in R Functions — and Why You Probably Shouldn&rsquo;t</a></li> </ul> <p><em>Stay tuned and may the future be with you!</em></p> <p>Henrik</p> parallelly: Querying, Killing and Cloning Parallel Workers Running Locally or Remotely https://www.jottr.org/2023/07/01/parallelly-managing-workers/ Sat, 01 Jul 2023 18:00:00 +0200 https://www.jottr.org/2023/07/01/parallelly-managing-workers/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.36.0 is on CRAN since May 2023. The <strong>parallelly</strong> package is part of the <a href="https://www.futureverse.org">Futureverse</a> and enhances the <strong>parallel</strong> package of base R, e.g. it adds several features you&rsquo;d otherwise expect to see in <strong>parallel</strong>. The <strong>parallelly</strong> package is one of the internal work horses for the <strong><a href="https://future.futureverse.org">future</a></strong> package, but it can also be used outside of the future ecosystem.</p> <p>In this most recent release, <strong>parallelly</strong> gained several new skills in how cluster nodes (a.k.a. parallel workers) can be managed. Most notably,</p> <ul> <li><p>the <a href="https://parallelly.futureverse.org/reference/isNodeAlive.html"><code>isNodeAlive()</code></a> function can now also query parallel workers running on remote machines. 
Previously, this was only possible for workers running on the same machine.</p></li> <li><p>the <a href="https://parallelly.futureverse.org/reference/killNode.html"><code>killNode()</code></a> function gained the power to terminate parallel workers also running on remote machines.</p></li> <li><p>the new function <a href="https://parallelly.futureverse.org/reference/cloneNode.html"><code>cloneNode()</code></a> can be used to &ldquo;restart&rdquo; a cluster node, e.g. if a node was determined to no longer be alive by <code>isNodeAlive()</code>, then <code>cloneNode()</code> can be called to launch a new parallel worker on the same machine as the previous worker.</p></li> <li><p>The <code>print()</code> functions for PSOCK clusters and PSOCK nodes report on the status of the parallel workers.</p></li> </ul> <h2 id="examples">Examples</h2> <p>Assume we&rsquo;re running a PSOCK cluster of two parallel workers - one running on the local machine and the other on a remote machine that we connect to over SSH. Here is how we can set up such a cluster using <strong>parallelly</strong>:</p> <pre><code class="language-r">library(parallelly) cl &lt;- makeClusterPSOCK(c(&quot;localhost&quot;, &quot;server.remote.org&quot;)) print(cl) # Socket cluster with 2 nodes where 1 node is on host 'server.remote.org' (R # version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu), 1 node is on host # 'localhost' (R version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu) </code></pre> <p>We can check whether these two parallel workers are running, even if they are busy processing parallel tasks. The way <code>isNodeAlive()</code> works is that it checks whether the <em>process</em> is running on the worker&rsquo;s machine, which is something that can be done even when the worker is busy. 
For example, let&rsquo;s check the first worker process that runs on the current machine:</p> <pre><code class="language-r">print(cl[[1]]) ## RichSOCKnode of a socket cluster on local host 'localhost' with pid 2457339 ## (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) using socket connection ## #3 ('&lt;-localhost:11436') isNodeAlive(cl[[1]]) ## [1] TRUE </code></pre> <p>In <strong>parallelly</strong> (&gt;= 1.36.0), we can now also query the remote machine:</p> <pre><code class="language-r">print(cl[[2]]) ## RichSOCKnode of a socket cluster on remote host 'server.remote.org' with ## pid 7731 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) using socket ## connection #4 ('&lt;-localhost:11436') isNodeAlive(cl[[2]]) ## [1] TRUE </code></pre> <p>We can also query <em>all</em> parallel workers of the cluster at once, e.g.</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] TRUE TRUE </code></pre> <p>Now, imagine if, say, the remote parallel process terminates for some unknown reason. For example, the code running in parallel calls some code that causes the parallel R process to crash and terminate. Although this &ldquo;should not&rdquo; happen, we all experience it once in a while. Another example is that the machine is running out of memory, for instance due to other misbehaving processes on the same machine. When that happens, the operating system might start killing processes in order not to completely crash the machine.</p> <p>When one of our parallel workers has crashed, it will obviously not respond to requests for processing our R tasks. 
Instead, we will get obscure errors like:</p> <pre><code class="language-r">y &lt;- parallel::parLapply(cl, X = X, fun = slow_fcn) ## Error in summary.connection(connection) : invalid connection </code></pre> <p>We can see that the second parallel worker in our cluster is no longer alive by:</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] TRUE FALSE </code></pre> <p>We can also see that there is something wrong with one of our workers if we call <code>print()</code> on our <code>RichSOCKcluster</code> and <code>RichSOCKnode</code> objects, e.g.</p> <pre><code class="language-r">print(cl) ## Socket cluster with 2 nodes where 1 node is on host 'server.remote.org' ## (R version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu), 1 node is ## on host 'localhost' (R version 4.3.1 (2023-06-16), platform ## x86_64-pc-linux-gnu). 1 node (#2) has a broken connection (ERROR: ## invalid connection) </code></pre> <p>and</p> <pre><code class="language-r">print(cl[[1]]) ## RichSOCKnode of a socket cluster on local host 'localhost' with pid ## 2457339 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) using ## socket connection #3 ('&lt;-localhost:11436') print(cl[[2]]) ## RichSOCKnode of a socket cluster on remote host 'server.remote.org' ## with pid 7731 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) ## using socket connection #4 ('ERROR: invalid connection') </code></pre> <p>If we end up with a broken parallel worker like this, then since <strong>parallelly</strong> 1.36.0 we can use <code>cloneNode()</code> to re-create the original worker. 
In our example, we can do:</p> <pre><code class="language-r">cl[[2]] &lt;- cloneNode(cl[[2]]) print(cl[[2]]) ## RichSOCKnode of a socket cluster on remote host 'server.remote.org' ## with pid 19808 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) ## using socket connection #4 ('&lt;-localhost:11436') </code></pre> <p>to get a working parallel cluster, e.g.</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] TRUE TRUE </code></pre> <p>and</p> <pre><code class="language-r">y &lt;- parallel::parLapply(cl, X = X, fun = slow_fcn) str(y) ## List of 8 ## $ : num 1 ## $ : num 1.41 ## $ : num 1.73 </code></pre> <p>We can also use <code>cloneNode()</code> to launch <em>additional</em> workers of the same kind. For example, say we want to launch two more local workers and one more remote worker, and append them to the current cluster. One way to achieve that is:</p> <pre><code class="language-r">cl &lt;- c(cl, cloneNode(cl[c(1,1,2)])) print(cl) ## Socket cluster with 5 nodes where 3 nodes are on host 'localhost' ## (R version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu), 2 ## nodes are on host 'server.remote.org' (R version 4.3.1 (2023-06-16), ## platform x86_64-pc-linux-gnu) </code></pre> <p>Now, suppose we launch many heavy parallel tasks, some of which run on remote machines. However, after some time, we realize that we have launched tasks that will take much longer to resolve than we first anticipated. If we don&rsquo;t want to wait for this to resolve by itself, we can choose to terminate some or all of the workers using <code>killNode()</code>. For example,</p> <pre><code class="language-r">killNode(cl) ## [1] TRUE TRUE TRUE TRUE TRUE </code></pre> <p>will kill all parallel workers in our cluster, even if they are busy running tasks. 
We can confirm that these worker processes are no longer alive by calling:</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] FALSE FALSE FALSE FALSE FALSE </code></pre> <p>If we attempted to use the cluster, we&rsquo;d get the &ldquo;Error in unserialize(node$con) : error reading from connection&rdquo; as we saw previously. After having killed our cluster, we can re-launch it using <code>cloneNode()</code>, e.g.</p> <pre><code class="language-r">cl &lt;- cloneNode(cl) isNodeAlive(cl) ## [1] TRUE TRUE TRUE TRUE TRUE </code></pre> <h2 id="the-new-cluster-managing-skills-enhances-the-future-ecosystem">The new cluster-managing skills enhance the future ecosystem</h2> <p>When we use the <a href="https://future.futureverse.org/reference/cluster.html"><code>cluster</code></a> and <a href="https://future.futureverse.org/reference/multisession.html"><code>multisession</code></a> parallel backends of the <strong>future</strong> package, we rely on the <strong>parallelly</strong> package internally. Thanks to these new abilities, the Futureverse can now give more informative error messages whenever we fail to launch a future or fail to retrieve the results of one. For example, if a parallel worker has terminated, we might get:</p> <pre><code class="language-r">f &lt;- future(slow_fcn(42)) ## Error: ClusterFuture (&lt;none&gt;) failed to call grmall() on cluster ## RichSOCKnode #1 (PID 29701 on 'server.remote.org'). The reason reported ## was 'error reading from connection'. Post-mortem diagnostic: No process ## exists with this PID on the remote host, i.e. the remote worker is no ## longer alive </code></pre> <p>That post-mortem diagnostic is often enough to realize that something quite exceptional has happened. It also gives us enough information to troubleshoot the problem further, e.g. 
if we keep seeing the same problem occur over and over on a particular machine, it might suggest that there is an issue with that machine and that we may want to exclude it from further processing.</p> <p>We could imagine that the <strong>future</strong> package would not only give us information on why things went wrong, but theoretically also try to fix the problem automatically. For instance, it could automatically re-create the crashed worker using <code>cloneNode()</code> and re-launch the future. It is on the roadmap to add such robustness to the future ecosystem later on. However, there are several things to consider when doing so. For instance, what should happen if it was not a glitch, but there is one parallel task that keeps crashing the parallel workers over and over? Most certainly, we want to retry only a fixed number of times before giving up; otherwise we might get stuck in a never-ending procedure. But even so, what if the problematic parallel code brings down the machine where it runs? If we have automatic restarts of workers and parallel tasks, we might end up bringing down multiple machines before we notice the problem. So, although it appears fairly straightforward to handle crashed workers automatically, we need to come up with a robust, well-behaved strategy for doing so before we can implement it.</p> <p>I hope you find this useful. 
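</p> <p>As a closing aside, the retry strategy discussed above can already be sketched in user code today. The following is <em>not</em> part of the future API; the helper <code>retry_parLapply()</code> and its retry logic are hypothetical, and assume a cluster <code>cl</code> created with <code>makeClusterPSOCK()</code>:</p> <pre><code class="language-r">library(parallelly)

## Hypothetical sketch: retry a parLapply() call, re-creating crashed
## workers with cloneNode() at most 'retries' times before giving up
retry_parLapply &lt;- function(cl, X, fun, retries = 3L) {
  for (kk in seq_len(retries)) {
    res &lt;- tryCatch(parallel::parLapply(cl, X = X, fun = fun),
                    error = identity)
    if (!inherits(res, &quot;error&quot;)) return(res)
    ## Re-create any workers that are no longer alive
    dead &lt;- which(!isNodeAlive(cl))
    for (ii in dead) cl[[ii]] &lt;- cloneNode(cl[[ii]])
  }
  stop(&quot;Giving up after &quot;, retries, &quot; attempts&quot;)
}
</code></pre> <p>Note how this sketch only repairs the local copy of <code>cl</code>; a production-quality solution would also have to decide how to hand the repaired cluster back to the caller, which is one of the design questions mentioned above.</p> <p>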
If you have questions or comments on <strong>parallelly</strong>, or the Futureverse in general, please use the <a href="https://github.com/HenrikBengtsson/future/discussions/">Futureverse Discussion forum</a>.</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>Futureverse</strong>: <a href="https://www.futureverse.org">https://www.futureverse.org</a></li> </ul> %dofuture% - a Better foreach() Parallelization Operator than %dopar% https://www.jottr.org/2023/06/26/dofuture/ Mon, 26 Jun 2023 19:00:00 +0200 https://www.jottr.org/2023/06/26/dofuture/ <div style="margin: 2ex; width: 100%;"/> <center> <img src="https://www.jottr.org/post/dopar-to-dofuture.png" alt="Two lines of code, where the first line shows 'y <- foreach(...) %dopar% { ... }'. The second line 'y <- foreach(...) %dofuture% { ... }'. The %dopar% operator is crossed out and there is a line down to %dofuture% directly below." style="width: 80%; border: 1px solid black;"/> </center> </div> <p><strong><a href="https://doFuture.futureverse.org">doFuture</a></strong> 1.0.0 is on CRAN since March 2023. It introduces a new <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> operator <code>%dofuture%</code>, which makes it even easier to use <code>foreach()</code> to parallelize via the <strong>future</strong> ecosystem. This new operator is designed to be an alternative to the existing <code>%dopar%</code> operator for <code>foreach()</code> - an alternative that works in similar ways but better. 
If you already use <code>foreach()</code> together with futures, or plan on doing so, I recommend using <code>%dofuture%</code> instead of <code>%dopar%</code>. I&rsquo;ll explain why I think so below.</p> <h2 id="introduction">Introduction</h2> <p>The traditional way to parallelize with <code>foreach()</code> is to use the <code>%dopar%</code> infix operator together with a registered foreach adaptor. The popular <strong><a href="https://cran.r-project.org/package=doParallel">doParallel</a></strong> package provides <code>%dopar%</code> backends for parallelizing on the local machine. Here is an example that uses four local workers:</p> <pre><code class="language-r">library(foreach) workers &lt;- parallel::makeCluster(4) doParallel::registerDoParallel(cl = workers) xs &lt;- rnorm(1000) y &lt;- foreach(x = xs, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>I highly recommend the Futureverse for parallelization because of its advantages, such as relaying standard output, messages, warnings, and errors generated on the parallel workers to the main R process, support for near-live progress updates, and more descriptive backend error messages. Almost from the very beginning of the Futureverse, you have been able to use futures with <code>foreach()</code> and <code>%dopar%</code> via the <strong>doFuture</strong> package. For instance, we can rewrite the above example to use futures as:</p> <pre><code class="language-r">library(foreach) doFuture::registerDoFuture() future::plan(multisession, workers = 4) xs &lt;- rnorm(1000) y &lt;- foreach(x = xs, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>In this blog post, I am proposing to move to</p> <pre><code class="language-r">library(foreach) future::plan(multisession, workers = 4) xs &lt;- rnorm(1000) y &lt;- foreach(x = xs, .export = &quot;slow_fcn&quot;) %dofuture% { slow_fcn(x) } </code></pre> <p>instead. So, why is that better? 
It is because:</p> <ol> <li><p><code>%dofuture%</code> removes the need to register a foreach backend, i.e. no more <code>registerDoMC()</code>, <code>registerDoParallel()</code>, <code>registerDoFuture()</code>, etc.</p></li> <li><p><code>%dofuture%</code> is unaffected by any foreach backends that the end-user has registered.</p></li> <li><p><code>%dofuture%</code> uses a consistent <code>foreach()</code> &ldquo;options&rdquo; argument, regardless of the parallel backend used, and <em>not</em> different ones for different backends, e.g. <code>.options.multicore</code>, <code>.options.snow</code>, and <code>.options.mpi</code>.</p></li> <li><p><code>%dofuture%</code> is guaranteed to always parallelize via the Futureverse, using whatever <code>plan()</code> the end-user has specified. It also means that you, as a developer, have full control of the parallelization code.</p></li> <li><p><code>%dofuture%</code> provides proper parallel random number generation (RNG). There is no longer a need to use <code>%dorng%</code> of the <strong><a href="https://cran.r-project.org/package=doRNG">doRNG</a></strong> package.</p></li> <li><p><code>%dofuture%</code> automatically identifies global variables and packages that are needed by the parallel workers.</p></li> <li><p><code>%dofuture%</code> relays errors generated in parallel as-is such that they can be handled using standard R methods, e.g. <code>tryCatch()</code>.</p></li> <li><p><code>%dofuture%</code> relays standard output, messages, warnings, and other types of conditions generated in parallel as-is such that they can be handled using standard R methods, e.g. 
<code>capture.output()</code> and <code>withCallingHandlers()</code>.</p></li> <li><p><code>%dofuture%</code> supports near-live progress updates via the <strong><a href="https://progressr.futureverse.org">progressr</a></strong> package.</p></li> <li><p><code>%dofuture%</code> gives more informative error messages, which helps troubleshooting if a parallel worker crashes.</p></li> </ol> <p>Below are the details.</p> <h2 id="problems-of-dopar-that-dofuture-addresses">Problems of <code>%dopar%</code> that <code>%dofuture%</code> addresses</h2> <p>Let me discuss a few of the unfortunate drawbacks that come with <code>%dopar%</code>. Most of these stem from a slightly too lax design. Although convenient, the flexible design prevents us from having full control and from writing code that can parallelize on any parallel backend.</p> <h3 id="problem-1-dopar-requires-registering-a-foreach-adaptor">Problem 1. <code>%dopar%</code> requires registering a foreach adaptor</h3> <p>If we write code that others will use, say, an R package, then we can never know what compute resources the user has, or will have in the future. Traditionally, this means that one user might want to use <strong>doParallel</strong> for parallelization, another <strong>doMC</strong>, and yet another, maybe, <strong>doRedis</strong>. Because of this, we must not have any calls to one of the many <code>registerDoNnn()</code> functions in our code. If we do, we lock users into a specific parallel backend. We could of course support a few different backends, but we would still be locking users into a small set of parallel backends. If someone develops a new backend in the future, our code has to be updated before users can take advantage of the new backends.</p> <p>One can argue that <code>doFuture::registerDoFuture()</code> somewhat addresses this problem. On one hand, when used, it does lock the user into the future framework. 
On the other hand, the user has many parallel backends to choose from in the Futureverse, including backends that will be developed in the future. In this sense, the lock-in is less severe, especially since we do not have to update our code for new backends to be supported. Also, to avoid destructive side effects, <code>registerDoFuture()</code> allows you to change the foreach backend used inside your functions temporarily, e.g.</p> <pre><code class="language-r">## Temporarily use futures oldDoPar &lt;- registerDoFuture() on.exit(with(oldDoPar, foreach::setDoPar(fun=fun, data=data, info=info)), add = TRUE) </code></pre> <p>This avoids changing the foreach backend that the user might already have set elsewhere.</p> <p>That said, I never wanted to say that people <em>should use</em> <code>registerDoFuture()</code> whenever using <code>%dopar%</code>, because I think that would be against the philosophy behind the <strong>foreach</strong> framework. The <strong>foreach</strong> ecosystem is designed to separate the <code>foreach()</code> + <code>%dopar%</code> code, describing what to parallelize, from the <code>registerDoNnn()</code> call, describing how and where to parallelize.</p> <p>Using <code>%dofuture%</code>, instead of <code>%dopar%</code> with user-controlled foreach backend, avoids this dilemma. With <code>%dofuture%</code> the developer is in full control of the parallelization code.</p> <h3 id="problem-2-chunking-and-load-balancing-differ-among-foreach-backends">Problem 2. Chunking and load-balancing differ among foreach backends</h3> <p>When using parallel map-reduce functions such as <code>mclapply()</code>, <code>parLapply()</code> of the <strong>parallel</strong> package, or <code>foreach()</code> with <code>%dopar%</code>, the tasks are partitioned into subsets and distributed to the parallel workers for processing. 
This partitioning is often referred to as &ldquo;chunking&rdquo;, because we chunk up the elements into smaller chunks, and then each chunk is processed by one parallel worker. There are different strategies to chunk up the elements. One approach is to use uniformly sized chunks and have each worker process one chunk. Another approach is to use chunks with a single element, and have each worker process one or more chunks.</p> <p>The chunks may be pre-assigned (&ldquo;prescheduled&rdquo;) to the parallel workers up-front, which is referred to as <em>static load balancing</em>. An alternative is to assign chunks to workers on-the-fly as the workers become available, which is referred to as <em>dynamic load balancing</em>.</p> <p>If the processing times differ a lot between elements, it is beneficial to use dynamic load balancing together with small chunk sizes.</p> <p>However, if we dig into the documentation and source code of the different foreach backends, we will find that they use different chunking and load-balancing strategies. For example, assume we are running on a Linux machine, which supports forked processing. Then, if we use</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(cores = 8) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>the data will be processed by eight fork-based parallel workers using <em>dynamic load balancing with single-element chunks</em>. 
However, if we use PSOCK clusters:</p> <pre><code class="language-r">library(foreach) cl &lt;- parallel::makeCluster(8) doParallel::registerDoParallel(cl = cl) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>the data will be processed by eight PSOCK-based parallel workers using <em>static load balancing with uniformly sized chunks</em>.</p> <p>Which of these two chunking and load-balancing strategies is the most efficient one depends on how much the processing time of <code>slow_fcn(x)</code> varies with different values of <code>x</code>. For example, and without going into details, if the processing times differ a lot, dynamic load balancing often makes better use of the parallel workers and results in a shorter overall processing time.</p> <p>Regardless of which is faster, the problem with different foreach backends using different strategies is that, as a developer with little control over the registered foreach backend, you have equally poor control over the chunking and load-balancing strategies.</p> <p>Using <code>%dofuture%</code> avoids this problem. If you use <code>%dofuture%</code>, then dynamic load balancing will always be used for processing the data, regardless of which parallel future backend is in place, with the option to control the chunk size. As a side note, <code>%dopar%</code> with <code>registerDoFuture()</code> will also do this.</p> <h3 id="problem-3-different-foreach-backends-use-different-foreach-options">Problem 3. Different foreach backends use different <code>foreach()</code> options</h3> <p>In the previous section, I did not mention that for some foreach backends it is indeed possible to control whether static or dynamic load balancing should be used, and what the chunk sizes should be. This can be controlled by special <code>.options.*</code> arguments for <code>foreach()</code>. However, each foreach backend has its own <code>.options.*</code> argument, e.g. 
you might find that some use <code>.options.multicore</code>, others <code>.options.snow</code>, or something else. Because they are different, we cannot write code that works with any type of foreach backend.</p> <p>To give two examples, when using <strong>doParallel</strong> and <code>registerDoParallel(cores = 8)</code>, we can replace the default dynamic load balancing with static load balancing as:</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(cores = 8) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;, .options.multicore = list(preschedule = TRUE)) %dopar% { slow_fcn(x) } </code></pre> <p>This change will also switch from chunks with a single element to (eight) chunks of similar size.</p> <p>If we instead use <code>registerDoParallel(cl)</code>, which gives us the reverse situation, we can replace the static load balancing with dynamic load balancing by using:</p> <pre><code class="language-r">library(foreach) cl &lt;- parallel::makeCluster(8) doParallel::registerDoParallel(cl = cl) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;, .options.snow = list(preschedule = FALSE)) %dopar% { slow_fcn(x) } </code></pre> <p>This will also switch from uniformly sized chunks to single-element chunks.</p> <p>As we can see, the fact that we have to use different <code>foreach()</code> &ldquo;options&rdquo; arguments (here <code>.options.multicore</code> and <code>.options.snow</code>) for different foreach backends prevents us from writing code that works with any foreach backend.</p> <p>Of course, we could specify &ldquo;options&rdquo; arguments for known foreach backends and hope we haven&rsquo;t missed any and that no new ones show up later, e.g.</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(cores = 8) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;, .options.multicore = list(preschedule = TRUE), .options.snow = list(preschedule = TRUE), .options.future = 
list(preschedule = TRUE), .options.mpi = list(chunkSize = 1) ) %dopar% { slow_fcn(x) } </code></pre> <p>Regardless, this still limits the end-user to a set of commonly used foreach backends, and our code can never be agile to foreach backends that are developed at a later time.</p> <p>Using <code>%dofuture%</code> avoids these problems. It supports the <code>.options.future</code> argument in a consistent way across all future backends, which means that your code will be the same regardless of parallel backend. By the core design of the Futureverse, any new future backends developed later on will automatically work with your <strong>foreach</strong> code if you use <code>%dofuture%</code>.</p> <h3 id="problem-4-global-variables-are-not-always-identified-by-foreach">Problem 4. Global variables are not always identified by <code>foreach()</code></h3> <p>When parallelizing code, the parallel workers must have access to all functions and variables required to evaluate the parallel code. As we have seen in the above examples, you can use the <code>.export</code> argument to help <code>foreach()</code> export the necessary objects to each of the parallel workers.</p> <p>However, a developer who uses <code>doMC::registerDoMC()</code>, or equivalently <code>doParallel::registerDoParallel(cores)</code>, might forget to specify the <code>.export</code> argument. This can happen because the mechanism of forked processing makes all objects available to the parallel workers. If they test their code using only these foreach backends, they will not notice that <code>.export</code> is not declared. The same may happen if the developer assumes <code>doFuture::registerDoFuture()</code> is used. However, without specifying <code>.export</code>, the code will <em>not</em> work on other types of foreach backends, e.g. <code>doParallel::registerDoParallel(cl)</code> and <code>doMPI::registerDoMPI()</code>. 
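</p> <p>To illustrate the failure mode, here is a minimal sketch, assuming a user-defined <code>slow_fcn()</code> in the global environment; the exact error message may vary between setups:</p> <pre><code class="language-r">library(foreach)
slow_fcn &lt;- function(x) { Sys.sleep(0.1); sqrt(x) }

cl &lt;- parallel::makeCluster(2)
doParallel::registerDoParallel(cl = cl)

## Without '.export', PSOCK workers may fail to find the function, e.g.
##   y &lt;- foreach(x = 1:4) %dopar% { slow_fcn(x) }
##   Error in { : task 1 failed - "could not find function \"slow_fcn\""
## whereas the same code happens to work with fork-based backends.

## Declaring the global explicitly works on all foreach backends:
y &lt;- foreach(x = 1:4, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) }

parallel::stopCluster(cl)
</code></pre> <p>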
If an R package forgets to specify the <code>.export</code> argument and is not comprehensively tested, then it will be the end-user, for instance on MS Windows, who runs into the bug.</p> <p>When using <code>%dofuture%</code>, global variables and required packages are by default automatically identified and exported to the parallel workers by the future framework. This is done the same way regardless of parallel backend.</p> <h3 id="problem-5-easy-to-forget-parallel-random-number-generation">Problem 5. Easy to forget parallel random number generation</h3> <p>The <strong>foreach</strong> package and <code>%dopar%</code> do not have built-in support for parallel random number generation (RNG). Statistically sound parallel RNG is critical for many statistical analyses. If it is not used, then the results can be biased and incorrect conclusions might be drawn. The <strong><a href="https://cran.r-project.org/package=doRNG">doRNG</a></strong> package comes to the rescue when using <code>%dopar%</code>. It provides the operator <code>%dorng%</code>, which uses <code>%dopar%</code> internally while automatically setting up parallel RNG. Whenever you use <code>%dopar%</code> and find yourself needing parallel RNG, I recommend simply replacing <code>%dopar%</code> with <code>%dorng%</code>. The <strong>doRNG</strong> package also provides <code>registerDoRNG()</code>, which I do not recommend, because as a developer you do not have full control over whether that is registered or not.</p> <p>Because <strong>foreach</strong> does not have built-in support for parallel RNG, it is easy to forget that it should be used. A developer who is aware of the importance of using proper parallel RNG will find out about <strong>doRNG</strong> and how to best use it, but a developer who is not aware of the problem can easily miss it and publish an R package that produces potentially incorrect results.</p> <p>The future framework, however, will detect if we forget to use parallel RNG. 
When this happens, a warning will alert us to the problem and suggest how to fix it. This is the case if you use <code>doFuture::registerDoFuture()</code>, and it&rsquo;s also the case when using <code>%dofuture%</code>. For example,</p> <pre><code class="language-r">library(doFuture) plan(multisession, workers = 3) y &lt;- foreach(ii = 1:4) %dofuture% { runif(ii) } </code></pre> <p>produces</p> <pre><code>Warning messages: 1: UNRELIABLE VALUE: Iteration 1 of the foreach() %dofuture% { ... }, part of chunk #1 ('doFuture2-1'), unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify foreach() argument '.options.future = list(seed = TRUE)'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, set option 'doFuture.rng.onMisuse' to &quot;ignore&quot;. </code></pre> <p>To fix this, we can specify the <code>foreach()</code> argument <code>.options.future = list(seed = TRUE)</code> to declare that we need to draw random numbers in parallel, i.e.</p> <pre><code class="language-r">library(doFuture) plan(multisession, workers = 3) y &lt;- foreach(ii = 1:4, .options.future = list(seed = TRUE)) %dofuture% { runif(ii) } </code></pre> <p>This makes sure that statistically sound random numbers are generated.</p> <h2 id="migrating-from-dopar-to-dofuture-is-straightforward">Migrating from %dopar% to %dofuture% is straightforward</h2> <p>If you already have code that uses <code>%dopar%</code> and want to start using <code>%dofuture%</code> instead, it only takes a few changes, which are all straightforward and quick:</p> <ol> <li><p>Replace <code>%dopar%</code> with <code>%dofuture%</code>.</p></li> <li><p>Replace <code>%dorng%</code> with <code>%dofuture%</code> and set <code>.options.future = list(seed = TRUE)</code>.</p></li> <li><p>Replace <code>.export = &lt;character vector of 
global variables&gt;</code> with <code>.options.future = list(globals = &lt;character vector of global variables&gt;)</code>.</p></li> <li><p>Drop any other <code>registerDoNnn()</code> calls inside your function, if you use them.</p></li> <li><p>Update your documentation to mention that the parallel backend should be set using <code>future::plan()</code> and no longer via different <code>registerDoNnn()</code> calls.</p></li> </ol> <p>In brief, if you use <code>%dofuture%</code> instead of <code>%dopar%</code>, your life as a developer will be easier and so will the end-user&rsquo;s be too.</p> <p>If you have questions or comments on <strong>doFuture</strong> and <code>%dofuture%</code>, or the Futureverse in general, please use the <a href="https://github.com/HenrikBengtsson/future/discussions/">Futureverse Discussion forum</a>.</p> <p>Happy futuring!</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a>, <a href="https://doFuture.futureverse.org">pkgdown</a></li> <li><strong>Futureverse</strong>: <a href="https://www.futureverse.org">https://www.futureverse.org</a></li> </ul> Edmonton R User Group Meetup: Futureverse - A Unifying Parallelization Framework in R for Everyone https://www.jottr.org/2023/05/22/future-yegrug-2023-slides/ Mon, 22 May 2023 18:00:00 -0700 https://www.jottr.org/2023/05/22/future-yegrug-2023-slides/ <div style="margin: 2ex; width: 100%;"/> <center> <img src="https://www.jottr.org/post/YEGRUG_20230522.jpeg" alt="The YEGRUG poster slide for the Futureverse presentation on 2023-05-22" style="width: 80%; border: 1px solid black;"/> </center> </div> <p>Below are the slides from my presentation at the <a href="https://www.meetup.com/edmonton-r-user-group-yegrug/events/fxvdbtyfchbhc/">Edmonton R User Group Meetup (YEGRUG)</a> on May 22, 2023:</p> <p>Title: Futureverse - A Unifying 
Parallelization Framework in R for Everyone<br /> Speaker: Henrik Bengtsson<br /> Slides: <a href="https://docs.google.com/presentation/d/e/2PACX-1vQfbnVRHZhIkEAd3_pNG14N5JQqE0jqCohSq-m-uWAcA7StF-BuHdOz0IGDhcRI3K681DxoXoqA7pwp/pub?start=true&amp;loop=false&amp;delayms=60000">HTML</a>, <a href="https://www.jottr.org/presentations/yegrug2023/BengtssonH_20230522-Futureverse-YEGRUG.pdf">PDF</a> (46 slides)<br /> Video: <a href="https://www.youtube.com/watch?v=6Dp6zMelrmg">official recording</a> (~60 minutes)</p> <p>Thank you Péter Sólymos and the YEGRUG for the invitation and the opportunity!</p> <p>/Henrik</p> <h2 id="links">Links</h2> <ul> <li>YEGRUG: <a href="https://yegrug.github.io/">https://yegrug.github.io/</a></li> <li><strong>Futureverse</strong> website: <a href="https://www.futureverse.org/">https://www.futureverse.org/</a></li> <li><strong>future</strong> package <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org/">pkgdown</a></li> </ul> parallelly 1.34.0: Support for CGroups v2, Killing Parallel Workers, and more https://www.jottr.org/2023/01/18/parallelly-1.34.0-support-for-cgroups-v2-killing-parallel-workers-and-more/ Wed, 18 Jan 2023 14:00:00 -0800 https://www.jottr.org/2023/01/18/parallelly-1.34.0-support-for-cgroups-v2-killing-parallel-workers-and-more/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p>With the recent releases of <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.33.0 (2022-12-13) and 1.34.0 (2023-01-13), <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> and <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a> gained better support for Linux CGroups, options for 
avoiding running out of R connections when setting up <strong>parallel</strong>-style clusters, and <code>killNode()</code> for forcefully terminating one or more parallel workers. I summarize these updates below. For other updates, please see the <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <h2 id="added-support-for-cgroups-v2">Added support for CGroups v2</h2> <p><a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> and <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a> gained support for Linux Control Groups v2 (CGroups v2), besides CGroups v1, which has been supported since <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.31.0 (2022-04-07) and partially since 1.22.0 (2020-12-12). This means that if you use <code>availableCores()</code> and <code>availableWorkers()</code> in your R code, they will better respect the number of CPU cores that the Linux system has made available to you. Not all systems use CGroups, but it is becoming more popular, so if the Linux system you run on does not use it right now, it is likely it will at some point.</p> <h2 id="avoid-running-out-of-r-connections">Avoid running out of R connections</h2> <p>If you run parallel code on a machine with many CPU cores, there&rsquo;s a risk that you run out of available R connections, which are needed when setting up <strong>parallel</strong> cluster nodes. This is because R has a limit of a maximum of 125 connections being used at the same time(*) and each cluster node consumes one R connection. If you try to set up more parallel workers than this, you will get an error. The <strong>parallelly</strong> package already has built-in protection against this, e.g.</p> <pre><code class="language-r">&gt; cl &lt;- parallelly::makeClusterPSOCK(192) Error: Cannot create 192 parallel PSOCK nodes. 
Each node needs one connection, but there are only 124 connections left out of the maximum 128 available on this R installation </code></pre> <p>This error is <em>instant</em>, with no parallel workers being launched. In contrast, if you use <strong>parallel</strong>, you will only get an error after R has launched the first 124 cluster nodes and fails to launch the 125th one, e.g.</p> <pre><code class="language-r">&gt; cl &lt;- parallel::makePSOCKcluster(192) Error in socketAccept(socket = socket, blocking = TRUE, open = &quot;a+b&quot;, : all connections are in use </code></pre> <p>Now, assume you use:</p> <pre><code class="language-r">&gt; library(parallelly) &gt; nworkers &lt;- availableCores() &gt; cl &lt;- makeClusterPSOCK(nworkers) </code></pre> <p>to set up a maximum-sized cluster on the current machine. This works as long as <code>availableCores()</code> returns something less than 125. However, if you are on a machine with, say, 192 CPU cores, you will get the above error. You could do something like:</p> <pre><code class="language-r">&gt; nworkers &lt;- availableCores() &gt; nworkers &lt;- min(nworkers, 124L) </code></pre> <p>to work around this problem. Or, if you want to be more agile to what R supports, you could do:</p> <pre><code class="language-r">&gt; nworkers &lt;- availableCores() &gt; nworkers &lt;- min(nworkers, freeConnections()) </code></pre> <p>With the latest versions of <strong>parallelly</strong>, you can simplify this to:</p> <pre><code class="language-r">&gt; nworkers &lt;- availableCores(constraints = &quot;connections&quot;) </code></pre> <p>The <code>availableWorkers()</code> function also supports <code>constraints = &quot;connections&quot;</code>.</p> <p>(*) The only way to increase this limit is to change the R source code and build R from source, cf. 
<a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>freeConnections()</code></a>.</p> <h2 id="forcefully-terminate-psock-cluster-nodes">Forcefully terminate PSOCK cluster nodes</h2> <p><code>parallel::stopCluster()</code> should be used for stopping a parallel cluster. This works by asking the cluster nodes to shut themselves down. However, a parallel worker will only shut down this way when it receives the message, which can only happen when the worker is done processing any parallel tasks. So, if a worker runs a very long-running task, which can take minutes, hours, or even days, it will not shut down until after that completes.</p> <p>Thus far, we have had to turn to special operating-system tools to kill the R process of such a cluster worker. With <strong>parallelly</strong> 1.33.0, you can now use <code>killNode()</code> to kill any parallel worker that runs on the local machine and that was launched by <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a>. 
For example,</p> <pre><code class="language-r">&gt; library(parallelly) &gt; cl &lt;- makeClusterPSOCK(10) &gt; cl Socket cluster with 10 nodes where 10 nodes are on host 'localhost' (R version 4.2.2 (2022-10-31), platform x86_64-pc-linux-gnu) &gt; which(isNodeAlive(cl)) [1] 1 2 3 4 5 6 7 8 9 10 &gt; success &lt;- killNode(cl[1:3]) &gt; success [1] TRUE TRUE TRUE &gt; which(isNodeAlive(cl)) [1] 4 5 6 7 8 9 10 &gt; cl &lt;- cl[isNodeAlive(cl)] &gt; cl Socket cluster with 7 nodes where 7 nodes are on host 'localhost' (R version 4.2.2 (2022-10-31), platform x86_64-pc-linux-gnu) </code></pre> <p>Over and out,</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> </ul> progressr 0.13.0: cli + progressr = ♥ https://www.jottr.org/2023/01/10/progressr-0.13.0/ Tue, 10 Jan 2023 19:00:00 -0800 https://www.jottr.org/2023/01/10/progressr-0.13.0/ <p><strong><a href="https://progressr.futureverse.org">progressr</a></strong> 0.13.0 is on CRAN. In recent releases, <strong>progressr</strong> gained support for using <strong><a href="https://cli.r-lib.org/">cli</a></strong> to generate progress bars. Vice versa, <strong>cli</strong> can now report on progress via the <strong>progressr</strong> framework. Here are the details. For other updates to <strong>progressr</strong>, see <a href="https://progressr.futureverse.org/news/index.html">NEWS</a>.</p> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/three_in_chinese.gif" alt="Three strokes writing three in Chinese"/> </center> </div> <p>The <strong>progressr</strong> package, part of the <a href="https://www.futureverse.org">futureverse</a>, provides a minimal API for reporting progress updates in R.
The design is to separate the representation of progress updates from how they are presented. What type of progress to signal is controlled by the developer. How these progress updates are rendered is controlled by the end user. For instance, some users may prefer visual feedback, such as a horizontal progress bar in the terminal, whereas others may prefer auditory feedback. The <strong>progressr</strong> package also works when R processes tasks in parallel or distributed, using the <strong><a href="https://future.futureverse.org">future</a></strong> framework.</p> <h2 id="use-cli-progress-bars-for-progressr-reporting">Use &lsquo;cli&rsquo; progress bars for &lsquo;progressr&rsquo; reporting</h2> <p>In <strong>progressr</strong> (&gt;= 0.12.0) [2022-12-13], you can report on progress using <strong>cli</strong> progress bars. To do this, just set:</p> <pre><code class="language-r">progressr::handlers(global = TRUE) ## automatically report on progress progressr::handlers(&quot;cli&quot;) ## ... using a 'cli' progress bar </code></pre> <p>With these global settings (e.g. in your <code>~/.Rprofile</code> file; see below), R reports progress as:</p> <pre><code class="language-r">library(progressr) y &lt;- slow_sum(1:10) </code></pre> <p><img src="https://www.jottr.org/post/handler_cli-default-slow_sum.svg" alt="Animation of a one-line, green-blocks cli progress bar in the terminal growing from 0% to 100% with an ETA estimate at the end" /></p> <p>Just like regular <strong>cli</strong> progress bars, you can customize these in the same way.
For instance, if you use the following from one of the <strong>cli</strong> examples:</p> <pre><code class="language-r">options(cli.progress_bar_style = list( complete = cli::col_yellow(&quot;\u2605&quot;), incomplete = cli::col_grey(&quot;\u00b7&quot;) )) </code></pre> <p>you&rsquo;ll get:</p> <p><img src="https://www.jottr.org/post/handler_cli-default-slow_sum-yellow-starts.svg" alt="Animation of a one-line, yellow-stars cli progress bar in the terminal growing from 0% to 100% with an ETA estimate at the end" /></p> <h2 id="configure-cli-to-report-progress-via-progressr">Configure &lsquo;cli&rsquo; to Report Progress via &lsquo;progressr&rsquo;</h2> <p>You might have heard that <strong><a href="https://purrr.tidyverse.org/">purrr</a></strong> recently gained support for reporting on progress. If you didn&rsquo;t, you can read about it in the tidyverse blog post &lsquo;<a href="https://www.tidyverse.org/blog/2022/12/purrr-1-0-0/#progress-bars">purrr 1.0.0</a>&rsquo; on 2022-12-20. The gist is to pass <code>.progress = TRUE</code> to the <strong>purrr</strong> function of interest, and it&rsquo;ll show a progress bar while it runs. For example, assume we have the following slow function for calculating the square root:</p> <pre><code class="language-r">slow_sqrt &lt;- function(x) { Sys.sleep(0.1); sqrt(x) } </code></pre> <p>If we call</p> <pre><code class="language-r">y &lt;- purrr::map(1:30, slow_sqrt, .progress = TRUE) </code></pre> <p>we&rsquo;ll see a progress bar appearing after about two seconds:</p> <p><img src="https://www.jottr.org/post/handler_cli-default.svg" alt="Animation of a one-line, green-blocks cli progress bar in the terminal growing from 0% to 100% with an ETA estimate at the end" /></p> <p>This progress bar is produced by the <strong>cli</strong> package.
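</p> <p>For comparison, here is a minimal sketch of how a developer can signal the same kind of progress directly via the <strong>progressr</strong> API (the function <code>my_sqrt()</code> is made up for illustration):</p> <pre><code class="language-r">library(progressr)
handlers(global = TRUE)

my_sqrt &lt;- function(xs) {
  p &lt;- progressor(along = xs)  ## one progress step per element
  lapply(xs, function(x) {
    Sys.sleep(0.1)             ## emulate a slow task
    p()                        ## signal one unit of progress
    sqrt(x)
  })
}

y &lt;- my_sqrt(1:30)
</code></pre> <p>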
Now, the neat thing with the <strong>cli</strong> package is that you can tell it to pass on the progress reporting to another progress framework, including that of the <strong>progressr</strong> package. To do this, set the R option:</p> <pre><code class="language-r">options(cli.progress_handlers = &quot;progressr&quot;) </code></pre> <p>This causes <em>all</em> <strong>cli</strong> progress updates to be reported via <strong>progressr</strong>, so if you, for instance, already have set:</p> <pre><code class="language-r">progressr::handlers(global = TRUE) red_heart &lt;- cli::col_red(cli::symbol$heart) handlers(handler_txtprogressbar(char = red_heart)) </code></pre> <p>the above <code>purrr::map()</code> call will report on progress in the terminal using a classical R progress bar tweaked to use red hearts to fill the bar:</p> <p><img src="https://www.jottr.org/post/handler_txtprogressbar-custom-hearts.svg" alt="Animation of a one-line, text-based red-hearts progress bar in the terminal growing from 0% to 100%" /></p> <p>As another example, if you set:</p> <pre><code class="language-r">progressr::handlers(global = TRUE) progressr::handlers(c(&quot;beepr&quot;, &quot;cli&quot;, &quot;rstudio&quot;)) </code></pre> <p>R will report progress <em>concurrently</em> via audio using different <strong><a href="https://cran.r-project.org/package=beepr">beepr</a></strong> sounds, via the terminal as a <strong>cli</strong> progress bar, and via RStudio&rsquo;s built-in progress bar, whenever progress is reported via the <strong>progressr</strong> framework <em>or</em> the <strong>cli</strong> framework.</p> <h2 id="customize-progress-reporting-when-r-starts">Customize progress reporting when R starts</h2> <p>To safely configure the above for all your <em>interactive</em> R sessions, I recommend adding something like the following to your <code>~/.Rprofile</code> file (or in a standalone file using the <strong><a 
href="https://cran.r-project.org/package=startup">startup</a></strong> package):</p> <pre><code class="language-r">if (interactive() &amp;&amp; requireNamespace(&quot;progressr&quot;, quietly = TRUE)) { ## progressr reporting without need for with_progress() progressr::handlers(global = TRUE) ## Use 'cli', if installed ... if (requireNamespace(&quot;cli&quot;, quietly = TRUE)) { progressr::handlers(&quot;cli&quot;) ## Hand over all 'cli' progress reporting to 'progressr' options(cli.progress_handlers = &quot;progressr&quot;) } else { ## ... otherwise use the one that comes with R progressr::handlers(&quot;txtprogressbar&quot;) } ## Use 'beepr', if installed ... if (requireNamespace(&quot;beepr&quot;, quietly = TRUE)) { progressr::handlers(&quot;beepr&quot;, append = TRUE) } ## Reporting via RStudio, if running in the RStudio Console, ## but not the terminal if ((Sys.getenv(&quot;RSTUDIO&quot;) == &quot;1&quot;) &amp;&amp; !nzchar(Sys.getenv(&quot;RSTUDIO_TERM&quot;))) { progressr::handlers(&quot;rstudio&quot;, append = TRUE) } } </code></pre> <p>See the <strong><a href="https://progressr.futureverse.org">progressr</a></strong> website for additional ways of reporting on progress.</p> <p>Now, go make some progress!</p> <h2 id="other-posts-on-progressr-reporting">Other posts on progressr reporting</h2> <ul> <li><a href="https://www.jottr.org/2022/06/03/progressr-0.10.1/">progressr 0.10.1: Plyr Now Supports Progress Updates also in Parallel</a>, 2022-06-03</li> <li><a href="https://www.jottr.org/2021/06/11/progressr-0.8.0/">progressr 0.8.0 - RStudio&rsquo;s Progress Bar, Shiny Progress Updates, and Absolute Progress</a>, 2021-06-11</li> <li><a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 Slides on Progressr</a>, 2020-07-04</li> <li>See also <a href="https://www.jottr.org/tags/#progressr-list">&lsquo;progressr&rsquo;</a> tag.</li> </ul> <h2 id="links">Links</h2> <ul> <li><strong>progressr</strong> package: <a 
href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> <li><strong>cli</strong> package: <a href="https://cran.r-project.org/package=cli">CRAN</a>, <a href="https://github.com/r-lib/cli">GitHub</a>, <a href="https://cli.r-lib.org/">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> Please Avoid detectCores() in your R Packages https://www.jottr.org/2022/12/05/avoid-detectcores/ Mon, 05 Dec 2022 21:00:00 -0800 https://www.jottr.org/2022/12/05/avoid-detectcores/ <p>The <code>detectCores()</code> function of the <strong>parallel</strong> package is probably one of the most used functions when it comes to setting the number of parallel workers to use in R. In this blog post, I&rsquo;ll try to explain why using it is not always a good idea. Right away, I am going to make a bold request and ask you to:</p> <blockquote> <p>Please <em>avoid</em> using <code>parallel::detectCores()</code> in your package!</p> </blockquote> <p>By reading this blog post, I hope you become more aware of the different problems that arise from using <code>detectCores()</code> and how they might affect you and the users of your code.</p> <figure style="margin-top: 3ex;"> <img src="https://www.jottr.org/post/detectCores_bad_vs_good.png" alt="Screenshots of two terminal-based, colored graphs each showing near 100% load on all 24 CPU cores. The load bars to the left are mostly red, whereas the ones to the right are mostly green. There is a shrug emoji, with the text &quot;do you want this?&quot; pointing to the left and the text &quot;or that?&quot; pointing to the right, located in between the two graphs." 
style="width: 100%; margin: 0; margin-bottom: 2ex;"/> <figcaption style="font-style: italic"> Figure&nbsp;1: Using <code>detectCores()</code> risks overloading the machine where R runs, even more so if there are other things already running. The machine on the left is heavily loaded, because too many parallel processes compete for the 24 CPU cores available, resulting in an extensive amount of kernel context switching (red) that wastes precious CPU cycles. The machine on the right is near-perfectly loaded at 100%, where no process uses more than its allotted share (mostly green). </figcaption> </figure> <h2 id="tl-dr">TL;DR</h2> <p>If you don&rsquo;t have time to read everything, but will take my word that we should avoid <code>detectCores()</code>, then the quick summary is that you basically have two choices for the number of parallel workers to use by default:</p> <ol> <li><p>Have your code run with a single core by default (i.e. sequentially), or</p></li> <li><p>replace all <code>parallel::detectCores()</code> with <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>parallelly::availableCores()</code></a>.</p></li> </ol> <p>I&rsquo;m in the conservative camp and recommend the first alternative. Using sequential processing by default, where the user has to make an explicit choice to run in parallel, significantly lowers the risk of clogging up the CPUs (left panel in Figure&nbsp;1), especially when there are other things running on the same machine.</p> <p>The second alternative is useful if you&rsquo;re not ready to make the move to run sequentially by default.
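</p> <p>As a sketch, the second alternative amounts to a one-line change:</p> <pre><code class="language-r">## Before:
ncores &lt;- parallel::detectCores()

## After (assuming the 'parallelly' package is installed):
ncores &lt;- parallelly::availableCores()
</code></pre> <p>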
The <code>availableCores()</code> function of the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package is fully backward compatible with <code>detectCores()</code>, while it avoids the most common problems that come with <code>detectCores()</code>, and it adapts to many more CPU-related settings, including settings that the end-user, the systems administrator, job schedulers, and Linux containers control. It is designed to take care of common overuse issues so that you do not have to spend time worrying about them.</p> <h2 id="background">Background</h2> <p>There are several problems with using <code>detectCores()</code> from the <strong>parallel</strong> package for deciding how many parallel workers to use. But before we get there, I want you to know that we find this function commonly used in R scripts and R packages, and frequently suggested in tutorials. So, do not feel ashamed if you use it.</p> <p>If we scan the code of the R packages on CRAN (e.g. by <a href="https://github.com/search?q=org%3Acran+language%3Ar+%22detectCores%28%29%22&amp;type=code">searching GitHub</a><sup>1</sup>), or on Bioconductor (e.g. by <a href="https://code.bioconductor.org/search/search?q=detectCores%28%29)">searching Bioc::CodeSearch</a>), we find many cases where <code>detectCores()</code> is used. Here are some variants we see in the wild:</p> <pre><code class="language-r">cl &lt;- makeCluster(detectCores()) cl &lt;- makeCluster(detectCores() - 1) y &lt;- mclapply(..., mc.cores = detectCores()) registerDoParallel(detectCores()) </code></pre> <p>We also find functions that let the user choose the number of workers via some argument, which defaults to <code>detectCores()</code>. Sometimes the default is explicit, as in:</p> <pre><code class="language-r">fast_fcn &lt;- function(x, ncores = parallel::detectCores()) { if (ncores &gt; 1) { cl &lt;- makeCluster(ncores) ... 
} } </code></pre> <p>and sometimes it&rsquo;s implicit, as in:</p> <pre><code class="language-r">fast_fcn &lt;- function(x, ncores = NULL) { if (is.null(ncores)) ncores &lt;- parallel::detectCores() - 1 if (ncores &gt; 1) { cl &lt;- makeCluster(ncores) ... } } </code></pre> <p>As we will see next, all the above examples are potentially buggy and might result in run-time errors.</p> <h2 id="common-mistakes-when-using-detectcores">Common mistakes when using detectCores()</h2> <h3 id="issue-1-detectcores-may-return-a-missing-value">Issue 1: detectCores() may return a missing value</h3> <p>A small, but important detail about <code>detectCores()</code> that is often missed is the following section in <code>help(&quot;detectCores&quot;, package = &quot;parallel&quot;)</code>:</p> <blockquote> <p><strong>Value</strong></p> <p>An integer, <strong>NA if the answer is unknown</strong>.</p> </blockquote> <p>Because of this, we cannot rely on:</p> <pre><code class="language-r">ncores &lt;- detectCores() </code></pre> <p>to always work, i.e. we might end up with errors like:</p> <pre><code class="language-r">ncores &lt;- detectCores() workers &lt;- parallel::makeCluster(ncores) Error in makePSOCKcluster(names = spec, ...) : numeric 'names' must be &gt;= 1 </code></pre> <p>We need to account for this, especially as package developers. 
One way to handle it is simply by using:</p> <pre><code class="language-r">ncores &lt;- detectCores() if (is.na(ncores)) ncores &lt;- 1L </code></pre> <p>or, by using the following shorter, but also harder to understand, one-liner:</p> <pre><code class="language-r">ncores &lt;- max(1L, detectCores(), na.rm = TRUE) </code></pre> <p>This construct is guaranteed to always return at least one core.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast to <code>detectCores()</code>, <code>parallelly::availableCores()</code> handles the above case automatically, and it guarantees to always return at least one core.</p> <h3 id="issue-2-detectcores-may-return-one">Issue 2: detectCores() may return one</h3> <p>Although it&rsquo;s rare to run into hardware with single-core CPUs these days, you might run into a virtual machine (VM) configured to have a single core. Because of this, you cannot reliably use:</p> <pre><code class="language-r">ncores &lt;- detectCores() - 1L </code></pre> <p>or</p> <pre><code class="language-r">ncores &lt;- detectCores() - 2L </code></pre> <p>in your code. If you use these constructs, a user of your code might end up with zero or a negative number of cores here, which is another way we can end up with an error downstream. A real-world example of this problem can be found in continuous integration (CI) services, e.g. <code>detectCores()</code> returns 2 in GitHub Actions jobs.
So, we also need to account for this case, which we can do by using the above <code>max()</code> solution, e.g.</p> <pre><code class="language-r">ncores &lt;- max(1L, detectCores() - 2L, na.rm = TRUE) </code></pre> <p>This is guaranteed to always return at least one.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast, <code>parallelly::availableCores()</code> handles this case via argument <code>omit</code>, which makes the code easier to understand, e.g.</p> <pre><code class="language-r">ncores &lt;- availableCores(omit = 2) </code></pre> <p>This construct is guaranteed to return at least one core, e.g. if there are one, two, or three CPU cores on this machine, <code>ncores</code> will be one in all three cases.</p> <h3 id="issue-3-detectcores-may-return-too-many-cores">Issue 3: detectCores() may return too many cores</h3> <p>When we use PSOCK, SOCK, or MPI clusters as defined by the <strong>parallel</strong> package, the communication between the main R session and the parallel workers is done via R socket connections. Low-level functions <code>parallel::makeCluster()</code>, <code>parallelly::makeClusterPSOCK()</code>, and legacy <code>snow::makeCluster()</code> create these types of clusters. In turn, there are higher-level functions that rely on these low-level functions, e.g. <code>doParallel::registerDoParallel()</code> uses <code>parallel::makeCluster()</code> if you are on MS Windows, <code>BiocParallel::SnowParam()</code> uses <code>snow::makeCluster()</code>, and <code>plan(multisession)</code> and <code>plan(cluster)</code> of the <strong><a href="https://future.futureverse.org">future</a></strong> package use <code>parallelly::makeClusterPSOCK()</code>.</p> <p>R has a limit on the number of connections it can have open at any time. As of R 4.2.2, <a href="https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28">the limit is 125 open connections</a>.
Because of this, we can use at most 125 parallel PSOCK, SOCK, or MPI workers. In practice, this limit is lower, because some connections may already be in use elsewhere. To find the current number of free connections, we can use <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>parallelly::freeConnections()</code></a>. If we try to launch a cluster with too many workers, there will not be enough connections available for the communication, and the setup of the cluster will fail. For example, a user running on a 192-core machine will get errors such as:</p> <pre><code class="language-r">&gt; cl &lt;- parallel::makeCluster(detectCores()) Error in socketAccept(socket = socket, blocking = TRUE, open = &quot;a+b&quot;, : all connections are in use </code></pre> <p>and</p> <pre><code class="language-r">&gt; cl &lt;- parallelly::makeClusterPSOCK(detectCores()) Error: Cannot create 192 parallel PSOCK nodes. Each node needs one connection, but there are only 124 connections left out of the maximum 128 available on this R installation </code></pre> <p>Thus, if we use <code>detectCores()</code>, our R code will not work on larger, modern machines. This is a problem that will become more and more common as more users get access to more powerful computers. Hopefully, R will increase this connection limit in a future release, but until then, you as the developer are responsible for handling this case too. To make your code adapt to this limit, even if R changes it, you can use:</p> <pre><code class="language-r">ncores &lt;- max(1L, detectCores(), na.rm = TRUE) ncores &lt;- min(parallelly::freeConnections(), ncores) </code></pre> <p>This is guaranteed to return at least zero (sic!) 
and never more than what is required to create a PSOCK, SOCK, or MPI cluster with that many parallel workers.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In the upcoming <strong>parallelly</strong> 1.33.0 version, you can use <code>parallelly::availableCores(constraints = &quot;connections&quot;)</code> to limit the result to the current number of available R connections. In addition, you can control the maximum number of cores that <code>availableCores()</code> returns by setting R option <code>parallelly.availableCores.system</code>, or environment variable <code>R_PARALLELLY_AVAILABLECORES_SYSTEM</code>, e.g. <code>R_PARALLELLY_AVAILABLECORES_SYSTEM=120</code>.</p> <h2 id="issue-4-detectcores-does-not-give-the-number-of-allowed-cores">Issue 4: detectCores() does not give the number of &ldquo;allowed&rdquo; cores</h2> <p>There&rsquo;s a note in <code>help(&quot;detectCores&quot;, package = &quot;parallel&quot;)</code> that touches on the above problems, but also on other important limitations that we should know of:</p> <blockquote> <p><strong>Note</strong></p> <p>This [= <code>detectCores()</code>] is not suitable for use directly for the <code>mc.cores</code> argument of <code>mclapply</code> nor specifying the number of cores in <code>makeCluster</code>. First because it may return <code>NA</code>, second because it does not give the number of <em>allowed</em> cores, and third because on Sparc Solaris and some Windows boxes it is not reasonable to try to use all the logical CPUs at once.</p> </blockquote> <p><strong>When is this relevant? The answer is: Always!</strong> This is because, as package developers, we cannot really know when this occurs, because we never know what type of hardware and system our code will run on.
So, we have to account for these unknowns too.</p> <p>Let&rsquo;s look at some real-world cases where using <code>detectCores()</code> can become a real issue.</p> <h3 id="4a-a-personal-computer">4a. A personal computer</h3> <p>A user might want to run other software tools at the same time while running the R analysis. A very common pattern we find in R code is to save one core for other purposes, say, browsing the web, e.g.</p> <pre><code class="language-r">ncores &lt;- detectCores() - 1L </code></pre> <p>This is a good start. It is the first step toward your software tool acknowledging that there might be other things running on the same machine. However, contrary to end-users, we as package developers cannot know how many cores the user needs, or wishes, to set aside. Because of this, it is better to let the user make this decision.</p> <p>A related scenario is when the user wants to run two concurrent R sessions on the same machine, both using your code. If your code assumes it can use all cores on the machine (i.e. <code>detectCores()</code> cores), the user will end up running the machine at 200% of its capacity. Whenever we use over 100% of the available CPU resources, we get penalized and waste our computational cycles on overhead from context switching, sub-optimal memory access, and more. This is where we end up with the situation illustrated in the left part of Figure&nbsp;1.</p> <p>Note also that users might not know that they use an R function that runs on all cores by default. They might not even be aware that this is a problem. Now, imagine if the user runs three or four such R sessions, resulting in a 300-400% CPU load. This is when things start to run slowly. The computer will be sluggish, maybe unresponsive, and most likely going to get very hot (&ldquo;we&rsquo;re frying the computer&rdquo;). 
By the time the four concurrent R processes complete, the user might have been able to finish six to eight similar processes had they not been fighting each other for the limited CPU resources.</p> <!-- If this happens on a shared system, the user might get an email from the systems administrator asking why they are "trying to fry the computer". The user gets blamed for something that is our fault; it is we who decided to run on `detectCores()` CPU cores by default. This leads us to another scenario where a user might run into a case where the CPUs are overwhelmed because a software tool assumes it has exclusive right to all cores. --> <h3 id="4b-a-shared-computer">4b. A shared computer</h3> <p>In academia and industry, it is common for several users to share the same compute server or set of compute nodes. It might be as simple as users SSHing into a shared machine with many cores and large amounts of memory to run their analyses there. On such setups, load balancing between users is often based on an honor system, where each user checks how many resources are available before launching an analysis. This helps to make sure they don’t end up using too many cores, or too much memory, slowing down the computer for everyone else.</p> <div style="width: 38%; float: right;"> <figure style="margin-top: 1ex;"> <img src="https://www.jottr.org/post/detectCores_bad.png" alt="The left-hand side graph of Figure 1, which shows mostly red bars at near 100% load for 24 CPU cores." style="width: 100%; margin: 0; margin-bottom: 2ex;"/> <figcaption> Figure 2: Overusing the CPU cores brings everything to a halt. </figcaption> </figure> </div> <p>Now, imagine they run a software tool that uses all CPU cores by default. In that case, there is a significant risk they will step on the other users&rsquo; processes, slowing everything down for everyone, especially if there is already a big load on the machine. From my experience in academia, this happens frequently. 
The user causing the problem is often not aware, because they just launch the problematic software with the default settings and leave it running, with a plan to come back to it a few hours or a few days later. In the meantime, other users might wonder why their command-line prompts become sluggish or even non-responsive, and their analyses suddenly take forever to complete. Eventually, someone or something alerts the systems administrators to the problem, who end up having to drop everything else and start troubleshooting. This often results in them terminating the wild-running processes and reaching out to the user who runs the problematic software, which leads to a large amount of time and resources being wasted among users and administrators. All this, only because we designed our R package to use all cores by default. This is not a made-up toy story; it is a very likely scenario that happens on shared servers if you make <code>detectCores()</code> the default in your R code.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast to <code>detectCores()</code>, if you use <code>parallelly::availableCores()</code>, the user, or the systems administrator, can limit the default number of CPU cores returned by setting the environment variable <code>R_PARALLELLY_AVAILABLECORES_FALLBACK</code>. For instance, by setting it to <code>R_PARALLELLY_AVAILABLECORES_FALLBACK=2</code> centrally, <code>availableCores()</code> will, unless there are other settings that allow the process to use more, return two cores regardless of how many CPU cores the machine has. This will lower the damage any single process can inflict on the system. It will take many such processes running at the same time for them to have an overall negative impact. 
The risk of that happening by mistake is much lower than when using <code>detectCores()</code> by default.</p> <h3 id="4c-a-shared-compute-cluster-with-many-machines">4c. A shared compute cluster with many machines</h3> <p>Other, larger compute systems, often referred to as high-performance compute (HPC) clusters, have a job scheduler for running scripts in batches distributed across multiple machines. When users submit their scripts to the scheduler&rsquo;s job queue, they request how many cores and how much memory each job requires. For example, a user on a Slurm cluster can request that their <code>run_my_rscript.sh</code> script gets to run with 48 CPU cores and 256 GiB of RAM by submitting it to the scheduler as:</p> <pre><code class="language-sh">sbatch --cpus-per-task=48 --mem=256G run_my_rscript.sh </code></pre> <p>The scheduler keeps track of all running and queued jobs, and when enough compute slots are freed up, it will launch the next job in the queue, giving it the compute resources it requested. This is a very convenient and efficient way to batch process a large number of analyses coming from many users.</p> <p>However, just like with a shared server, it is important that the software tools running this way respect the compute resources that the job scheduler allotted to the job. The <code>detectCores()</code> function does <em>not</em> know about job schedulers: all it does is return the number of CPU cores on the current machine, regardless of how many cores the job has been allotted by the scheduler. So, if your R package uses <code>detectCores()</code> cores by default, then it will overuse the CPUs and slow things down for everyone running on the same compute node. 
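</p> <p>For example, in a hypothetical session inside a Slurm job submitted with <code>sbatch --cpus-per-task=48</code> on a 192-core compute node, we would see something like:</p> <pre><code class="language-r">&gt; Sys.getenv(&quot;SLURM_CPUS_PER_TASK&quot;)
[1] &quot;48&quot;
&gt; parallel::detectCores()       ## the whole machine
[1] 192
&gt; parallelly::availableCores()  ## what the scheduler allotted
[1] 48
</code></pre> <p>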
Again, when this happens, it often slows everything down and triggers lots of wasted user and admin effort spent on troubleshooting and communication back and forth.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast, <code>parallelly::availableCores()</code> respects the number of CPU slots that the job scheduler has given to the job. It recognizes environment variables set by our most common HPC schedulers, including Fujitsu Technical Computing Suite (PJM), Grid Engine (SGE), Load Sharing Facility (LSF), PBS/Torque, and Simple Linux Utility for Resource Management (Slurm).</p> <h3 id="4d-running-r-via-cgroups-on-in-a-linux-container">4d. Running R via cgroups or in a Linux container</h3> <p>Thus far, we have been concerned about the overuse of the CPU cores affecting other processes and other users running on the same machine. Some systems are configured to prevent misbehaving software from affecting other users. In Linux, this can be done with so-called control groups (&ldquo;cgroups&rdquo;), where a process gets allotted a certain number of CPU cores. If the process uses too many parallel workers, they cannot break out from the sandbox set up by cgroups. From the outside, it will look like the process uses its maximum amount of allocated CPU cores. Some HPC job schedulers have this feature enabled, but not all of them. You find the same feature for Linux containers, e.g. we can limit the number of CPU cores, or throttle the CPU load, using command-line options when launching a Docker container, e.g. <code>docker run --cpuset-cpus=0-2,8 …</code> or <code>docker run --cpus=3.4 …</code>.</p> <p>So, if you are a user on a system where compute resources are compartmentalized this way, you run a much lower risk of wreaking havoc on a shared system. 
That is good news, but if you run too many parallel workers, that is, try to use more cores than are available to you, then you will clog up your own analysis. The behavior would be the same as if you request 96 parallel workers on your local eight-core notebook (the scenario in the left panel of Figure&nbsp;1), with the exception that you will not overheat the computer.</p> <p>The problem with <code>detectCores()</code> is that it returns the number of CPU cores on the hardware, regardless of the cgroups settings. So, if your R process is limited to eight cores by cgroups, and you use <code>ncores = detectCores()</code> on a 96-core machine, you will end up running 96 parallel workers fighting for the resources on eight cores. A real-world example of this happens for those of you who have a free account on RStudio Cloud. In that case, you are given only a single CPU core to run your R code on, but the underlying machine typically has 16 cores. If you use <code>detectCores()</code> there, you will end up creating 16 parallel workers, running on the same CPU core, which is a very inefficient way to run the code.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast to <code>detectCores()</code>, <code>parallelly::availableCores()</code> respects cgroups, and will return eight cores instead of 96 in the above example, and a single core on a free RStudio Cloud account.</p> <h2 id="my-opinionated-recommendation">My opinionated recommendation</h2> <div style="width: 38%; float: right;"> <figure style="margin-top: 1ex;"> <img src="https://www.jottr.org/post/detectCores_good.png" alt="The right-hand side graph of Figure 1, which shows mostly green bars at near 100% load for 24 CPU cores." style="width: 100%; margin: 0; margin-bottom: 2ex;"/> <figcaption> Figure 3: If we avoid overusing the CPU cores, then everything will run much smoother and much faster.
</figcaption> </figure> </div> <p>I think that we, as developers, should at least be aware of these problems, and acknowledge that they exist and are indeed real problems that people run into &ldquo;out there&rdquo;. We should also accept that we cannot predict what type of compute environment our R code will run on. Unfortunately, I don&rsquo;t have a magic solution that addresses all the problems reported here. That said, I think the best we can do is to be conservative and not make hard-coded decisions on parallelization in our R packages and R scripts.</p> <p>Because of this, I argue that <strong>the safest approach is to design your R package to run sequentially by default (e.g. <code>ncores = 1L</code>), and leave it to the user to decide on the number of parallel workers to use.</strong></p> <p>The <strong>second-best alternative</strong> that I can come up with is to replace <code>detectCores()</code> with <code>availableCores()</code>, e.g. <code>ncores = parallelly::availableCores()</code>. It is designed to respect common system and R settings that control the number of allowed CPU cores. It also respects R options and environment variables commonly used to limit CPU usage, including those set by our most common HPC job schedulers. In addition, it is possible to control the <em>fallback</em> behavior so that it uses only a few cores when nothing else is set. For example, if the environment variable <code>R_PARALLELLY_AVAILABLECORES_FALLBACK</code> is set to <code>2</code>, then <code>availableCores()</code> returns two cores by default, unless other settings allowing more are available. A conservative systems administrator may want to set <code>export R_PARALLELLY_AVAILABLECORES_FALLBACK=1</code> in <code>/etc/profile.d/single-core-by-default.sh</code>.
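</p> <p>To make the first recommendation concrete, here is a sketch of what such a function could look like; <code>slow_analysis()</code> and <code>slow_fcn()</code> are hypothetical names used for illustration only:</p> <pre><code class="language-r">## Sketch: sequential by default; the user opts in to parallelism,
## and the request is capped by what the system actually allows
slow_analysis &lt;- function(X, ncores = 1L) {
  ncores &lt;- min(ncores, parallelly::availableCores())
  if (ncores &gt; 1L) {
    cl &lt;- parallel::makeCluster(ncores)
    on.exit(parallel::stopCluster(cl), add = TRUE)
    parallel::parLapply(cl, X, slow_fcn)
  } else {
    lapply(X, slow_fcn)
  }
}
</code></pre> <p>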
To see other benefits from using <code>availableCores()</code>, see <a href="https://parallelly.futureverse.org">https://parallelly.futureverse.org</a>.</p> <p>Believe it or not, there&rsquo;s actually more to be said on this topic, but I think this is already more than a mouthful, so I will save that for another blog post. If you made it this far, I applaud you and I thank you for your interest. If you agree, or disagree, or have additional thoughts around this, please feel free to reach out on the <a href="https://github.com/HenrikBengtsson/future/discussions/">Future Discussions Forum</a>.</p> <p>Over and out,</p> <p>Henrik</p> <p><small><sup>1</sup> Searching code on GitHub requires you to log in to GitHub.</small></p> <p>UPDATE 2022-12-06: <a href="https://github.com/HenrikBengtsson/future/discussions/656">Alex Chubaty pointed out another problem</a>, where the value of <code>detectCores()</code> can be too large on modern machines, e.g. machines with 128 or 192 CPU cores. I&rsquo;ve added Section &lsquo;Issue 3: detectCores() may return too many cores&rsquo; explaining and addressing this problem.</p> <p>UPDATE 2022-12-11: Mention upcoming <code>parallelly::availableCores(constraints = &quot;connections&quot;)</code>.</p> useR! 2022: My 'Futureverse: Profile Parallel Code' Slides https://www.jottr.org/2022/06/23/future-user2022-slides/ Thu, 23 Jun 2022 17:00:00 -0700 https://www.jottr.org/2022/06/23/future-user2022-slides/ <figure style="margin-top: 3ex;"> <img src="https://www.jottr.org/post/BengtssonH_20220622-Future-useR2022_slide18.png" alt="Screenshot of Slide #18 in my presentation. A graphical time-chart representation of the events that take place when calling the following code in R: plan(cluster, workers = 2); fs <- lapply(1:2, function(x) future(slow(x))); vs <- value(fs); There are two futures displayed in the time chart. Each future is represented by a blue, horizontal 'lifespan' bar. The second future starts slightly after the first one.
Each future is evaluated in a separate worker, which is represented as a pink horizontal 'evaluate' bar. The two 'lifespan' and the two 'evaluation' bars are overlapping, indicating that they run in parallel." style="width: 100%; margin: 0;"/> <figcaption> Figure 1: A time chart of logged events for two futures resolved by two parallel workers. This is a screenshot of Slide #18 in my talk. </figcaption> </figure> <p><img src="https://www.jottr.org/post/user2022-logo_450x300.webp" alt="The useR 2022 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>Below are the slides for my <em>Futureverse: Profile Parallel Code</em> talk that I presented at the <a href="https://user2022.r-project.org/">useR! 2022</a> conference, held online and hosted by the Department of Biostatistics at Vanderbilt University Medical Center.</p> <p>Title: Futureverse: Profile Parallel Code<br /> Speaker: Henrik Bengtsson<br /> Session: <a href="https://user2022.r-project.org/program/talks/#session-21-parallel-computing">#21: Parallel Computing</a>, chaired by Ilias Moutsopoulos<br /> Slides: <a href="https://docs.google.com/presentation/d/e/2PACX-1vTnpyj7qvyKr-COHaJAYjoGveoOJPYrstTmvC4farFk2vdwWb8O79kA5tn7klTS67_uoJJdKFPgKNql/pub?start=true&amp;loop=false&amp;delayms=60000&amp;slide=id.gf778290f24_0_165">HTML</a>, <a href="https://www.jottr.org/presentations/useR2022/BengtssonH_20220622-Future-useR2022.pdf">PDF</a> (24 slides)<br /> Video: <a href="https://www.youtube.com/watch?v=_lrPgNqT3SM&amp;t=2528s">official recording</a> (27m30s long, starting at 42m10s)</p> <p>Abstract:</p> <p>&ldquo;In this presentation, I share recent enhancements that allow developers and end-users to profile R code running in parallel via the future framework. With these new, frequently requested features, we can study how and where our computational resources are used. With the help of visualization (e.g., ggplot2 and Shiny), we can identify bottlenecks in our code and parallel setup.
For example, if we find that some parallel workers are more idle than expected, we can tweak settings to improve the overall CPU utilization and thereby increase the total throughput and decrease the turnaround time (latency). These new benchmarking tools work out of the box on existing code and packages that build on the future package, including future.apply, furrr, and doFuture.</p> <p>The future framework, available on CRAN since 2016, has been used by hundreds of R packages and is among the top 1% of most downloaded packages. It is designed to unify and leverage common parallelization frameworks in R and to make new and existing R code faster with minimal efforts of the developer. The futureverse allows you, the developer, to stay with your favorite programming style, and end-users are free to choose the parallel backend to use (e.g., on a local machine, across multiple machines, in the cloud, or on a high-performance computing (HPC) cluster).&rdquo;</p> <hr /> <p>I want to send out a big thank you to useR! organizers, staff, and volunteers, and everyone else who contributed to this event.</p> <p>/Henrik</p> <h2 id="links">Links</h2> <ul> <li>useR! 
2022: <a href="https://user2022.r-project.org/">https://user2022.r-project.org/</a></li> <li><strong>futureverse</strong> website: <a href="https://www.futureverse.org/">https://www.futureverse.org/</a></li> <li><strong>future</strong> package <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org/">pkgdown</a></li> </ul> parallelly: Support for Fujitsu Technical Computing Suite High-Performance Compute (HPC) Environments https://www.jottr.org/2022/06/09/parallelly-support-for-fujitsu-technical-computing-suite-high-performance-compute-hpc-environments/ Thu, 09 Jun 2022 13:00:00 -0700 https://www.jottr.org/2022/06/09/parallelly-support-for-fujitsu-technical-computing-suite-high-performance-compute-hpc-environments/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.32.0 is now on CRAN. One of the major updates is that <code>availableCores()</code> and <code>availableWorkers()</code>, and therefore also the <strong>future</strong> framework, gained support for the &lsquo;Fujitsu Technical Computing Suite&rsquo; job scheduler. For other updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <p>The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. 
The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <h2 id="support-for-the-fujitsu-technical-computing-suite">Support for the Fujitsu Technical Computing Suite</h2> <p>Functions <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> and <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a> now support the Fujitsu Technical Computing Suite. Fujitsu Technical Computing Suite is a high-performance compute (HPC) job scheduler, which is popular in Japan among other places, e.g. at RIKEN and Kyushu University.</p> <p>Specifically, these functions now recognize environment variables <code>PJM_VNODE_CORE</code>, <code>PJM_PROC_BY_NODE</code>, and <code>PJM_O_NODEINF</code> set by the Fujitsu Technical Computing Suite scheduler. For example, if we submit a job script with:</p> <pre><code class="language-sh">$ pjsub -L vnode=4 -L vnode-core=10 script.sh </code></pre> <p>the scheduler will allocate four slots with ten cores each on one or more compute nodes. 
For example, we might get:</p> <pre><code class="language-r">parallelly::availableCores()
#&gt; [1] 10

parallelly::availableWorkers()
#&gt; [1] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot;
#&gt; [6] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot;
#&gt; [11] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot;
#&gt; [16] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot;
#&gt; [21] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot;
#&gt; [26] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot;
#&gt; [31] &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot;
#&gt; [36] &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot;
</code></pre> <p>In this example, the scheduler allocated three 10-core slots on compute node <code>node032</code> and one 10-core slot on compute node <code>node109</code>, totalling 40 CPU cores, as requested.
Because of this, users on these systems can now use <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> to set up a parallel PSOCK cluster as:</p> <pre><code class="language-r">library(parallelly)
cl &lt;- makeClusterPSOCK(availableWorkers(), rshcmd = &quot;pjrsh&quot;)
</code></pre> <p>As shown above, this code picks up whatever <code>vnode</code> and <code>vnode-core</code> configuration was requested via the <code>pjsub</code> submission, and launches 40 parallel R workers via the <code>pjrsh</code> tool that is part of the Fujitsu Technical Computing Suite.</p> <p>This also means that we can use:</p> <pre><code class="language-r">library(future)
plan(cluster, rshcmd = &quot;pjrsh&quot;)
</code></pre> <p>when using the <strong>future</strong> framework, which uses <code>makeClusterPSOCK()</code> and <code>availableWorkers()</code> internally.</p> <h2 id="avoid-having-to-specify-rshcmd-pjrsh">Avoid having to specify rshcmd = &ldquo;pjrsh&rdquo;</h2> <p>To avoid having to specify argument <code>rshcmd = &quot;pjrsh&quot;</code> manually, we can set it via environment variable <a href="https://parallelly.futureverse.org/reference/parallelly.options.html"><code>R_PARALLELLY_MAKENODEPSOCK_RSHCMD</code></a> (sic!) before launching R, e.g.</p> <pre><code class="language-sh">export R_PARALLELLY_MAKENODEPSOCK_RSHCMD=pjrsh
</code></pre> <p>To make this persistent, the user can add this line to their <code>~/.bashrc</code> shell startup script.
Alternatively, the system administrator can add it to a <code>/etc/profile.d/*.sh</code> file of their choice.</p> <p>With this environment variable set, it&rsquo;s sufficient to do:</p> <pre><code>library(parallelly)
cl &lt;- makeClusterPSOCK(availableWorkers())
</code></pre> <p>and</p> <pre><code class="language-r">library(future)
plan(cluster)
</code></pre> <p>In addition to not having to remember to use <code>rshcmd = &quot;pjrsh&quot;</code>, a major advantage of this approach is that the same R script also works on other systems, including the user&rsquo;s local machine and HPC environments such as Slurm and SGE.</p> <p>Over and out, and welcome to all Fujitsu Technical Computing Suite users!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> parallelly 1.32.0: makeClusterPSOCK() Didn't Work with Chinese and Korean Locales https://www.jottr.org/2022/06/08/parallelly-1.32.0-makeclusterpsock-didnt-work-with-chinese-and-korean-locales/ Wed, 08 Jun 2022 14:00:00 -0700 https://www.jottr.org/2022/06/08/parallelly-1.32.0-makeclusterpsock-didnt-work-with-chinese-and-korean-locales/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.32.0 is on CRAN. This release fixes an important bug that affected users running with the Simplified Chinese, Traditional Chinese (Taiwan), or Korean locale.
The bug caused <code>makeClusterPSOCK()</code>, and therefore also <code>future::plan(&quot;multisession&quot;)</code>, to fail with an error. For other updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <p>The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <h2 id="important-bug-fix-for-chinese-and-korean-users">Important bug fix for Chinese and Korean users</h2> <p>It turns out that <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> has never<sup>[1]</sup> worked for users who have their computers set to use a Korean (<code>LANGUAGE=ko</code>), a Simplified Chinese (<code>LANGUAGE=zh_CN</code>), or a Traditional Chinese (Taiwan) (<code>LANGUAGE=zh_TW</code>) locale. For example,</p> <pre><code class="language-r">Sys.setLanguage(&quot;zh_CN&quot;)
library(parallelly)
cl &lt;- parallelly::makeClusterPSOCK(2)
#&gt; 错误: ‘node$session_info$process$pid == pid’ is not TRUE
#&gt; 此外: Warning message:
#&gt; In add_cluster_session_info(cl[ii]) : 强制改变过程中产生了NA
</code></pre> <p>The workaround was to pass <code>validate = FALSE</code>, e.g.</p> <pre><code class="language-r">cl &lt;- parallelly::makeClusterPSOCK(2, validate = FALSE)
</code></pre> <p>This bug was due to an internal assertion that made incorrect assumptions about what <code>print()</code> for <code>SOCK0node</code> and <code>SOCKnode</code> objects would output. It worked with most locales, but not with the above three.
I have fixed this in the most recent release of <strong>parallelly</strong>.</p> <p>Since the &lsquo;multisession&rsquo; strategy of the <strong><a href="https://future.futureverse.org">future</a></strong> framework relies on <code>makeClusterPSOCK()</code>, this bug also affected the <strong>future</strong> package, e.g.</p> <pre><code class="language-r">Sys.setLanguage(&quot;ko&quot;)
library(future)
plan(multisession)
#&gt; 에러: 'node$session_info$process$pid == pid' is not TRUE
#&gt; 추가정보: 경고메시지(들):
#&gt; add_cluster_session_info(cl[ii])에서: 강제형변환에 의해 생성된 NA 입니다
</code></pre> <p>So, if you run into these errors, upgrade to the latest version of <strong>parallelly</strong>, e.g. <code>update.packages()</code>, restart R, and it will work as you would expect.</p> <!-- Source: https://chinesefor.us/lessons/say-sorry-chinese-apologize-duibuqi/ and https://www.wikihow.com/Apologize-in-Korean --> <p>To prevent this from happening again, I am now making sure to always check the package with these locales as well, in addition to English. CRAN already checks packages <a href="https://cran.r-project.org/web/checks/check_flavors.html">with different English and German locales</a>.</p> <p>I am sorry, 对不起, 미안해요, about this.
Hopefully, it&rsquo;ll work smoother from now on.</p> <p>Happy parallelization!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> <p><sup>[1]</sup> The last time it worked was with <strong>future</strong> 1.4.0 (2017-03-13), when this function was still part of the <strong>future</strong> package.</p> progressr 0.10.1: Plyr Now Supports Progress Updates also in Parallel https://www.jottr.org/2022/06/03/progressr-0.10.1/ Fri, 03 Jun 2022 13:00:00 -0700 https://www.jottr.org/2022/06/03/progressr-0.10.1/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/three_in_chinese.gif" alt="Three strokes writing three in Chinese"/> </center> </div> <p><strong><a href="https://progressr.futureverse.org">progressr</a></strong> 0.10.1 is on CRAN. I dedicate this release to all <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> users and developers out there.</p> <p>The <strong>progressr</strong> package provides a minimal API for reporting progress updates in R. The design is to separate the representation of progress updates from how they are presented. What type of progress to signal is controlled by the developer. How these progress updates are rendered is controlled by the end user. For instance, some users may prefer visual feedback, such as a horizontal progress bar in the terminal, whereas others may prefer auditory feedback. 
The <strong>progressr</strong> package also works when processing R code in parallel or distributed using the <strong><a href="https://future.futureverse.org">future</a></strong> framework.</p> <h2 id="plyr-future-progressr-parallel-progress-reporting"><strong>plyr</strong> + <strong>future</strong> + <strong>progressr</strong> ⇒ parallel progress reporting</h2> <p>The major update in this release is that <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> (&gt;= 1.8.7) now has built-in support for the <strong>progressr</strong> package when running in parallel. For example,</p> <pre><code class="language-r">library(plyr)

## Parallelize on the local machine
future::plan(&quot;multisession&quot;)
doFuture::registerDoFuture()

library(progressr)
handlers(global = TRUE)

y &lt;- llply(1:100, function(x) {
  Sys.sleep(1)
  sqrt(x)
}, .progress = &quot;progressr&quot;, .parallel = TRUE)
#&gt; |============ | 28%
</code></pre> <p>Previously, <strong>plyr</strong> only had built-in support for progress reporting when running sequentially. Note that <strong>progressr</strong> is the only package that supports progress reporting when using <code>.parallel = TRUE</code> in <strong>plyr</strong>.</p> <p>Also, whenever using <strong>progressr</strong>, the user has plenty of options for where and how progress is reported. For example, <code>handlers(&quot;rstudio&quot;)</code> uses the progress bar in the RStudio job interface, <code>handlers(&quot;progress&quot;)</code> uses terminal progress bars of the <strong>progress</strong> package, and <code>handlers(&quot;beep&quot;)</code> reports on progress using sounds. It&rsquo;s also possible to report progress in Shiny apps.
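</p> <p>As a small, self-contained sketch of the <strong>progressr</strong> API itself (no <strong>plyr</strong> involved), showing how the developer signals progress while the user picks the handler:</p> <pre><code class="language-r">library(progressr)
handlers(&quot;progress&quot;)  ## user choice; requires the 'progress' package

with_progress({
  p &lt;- progressor(steps = 10)  ## developer: declare the amount of work
  y &lt;- lapply(1:10, function(x) {
    Sys.sleep(0.1)
    p()  ## developer: signal one step of progress
    sqrt(x)
  })
})
</code></pre> <p>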
See my blog post <a href="https://www.jottr.org/2021/06/11/progressr-0.8.0/">&lsquo;progressr 0.8.0 - RStudio’s Progress Bar, Shiny Progress Updates, and Absolute Progress&rsquo;</a> for more information.</p> <h2 id="there-s-actually-a-better-way">There&rsquo;s actually a better way</h2> <p>I actually recommend another way for reporting on progress with <strong>plyr</strong> map-reduce functions, which is more in line with the design philosophy of <strong>progressr</strong>:</p> <blockquote> <p>The developer is responsible for providing progress updates, but it’s only the end user who decides if, when, and how progress should be presented. No exceptions will be allowed.</p> </blockquote> <p>Please see Section &lsquo;plyr::llply(…, .parallel = TRUE) with doFuture&rsquo; in the <a href="https://progressr.futureverse.org/articles/progressr-intro.html">&lsquo;progressr: An Introduction&rsquo;</a> vignette for this alternative approach, which has worked for a long time already. But, of course, adding <code>.progress = &quot;progressr&quot;</code> to your already existing <strong>plyr</strong> <code>.parallel = TRUE</code> code is as simple as it gets.</p> <p>Now, make some progress!</p> <h2 id="other-posts-on-progress-reporting">Other posts on progress reporting</h2> <ul> <li><a href="https://www.jottr.org/2021/06/11/progressr-0.8.0/">progressr 0.8.0 - RStudio&rsquo;s Progress Bar, Shiny Progress Updates, and Absolute Progress</a>, 2021-06-11</li> <li><a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 Slides on Progressr</a>, 2020-07-04</li> <li>See also <a href="https://www.jottr.org/tags/#progressr-list">&lsquo;progressr&rsquo;</a> tag.</li> </ul> <h2 id="links">Links</h2> <ul> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> <li><strong>plyr</strong> package:
<a href="https://cran.r-project.org/package=plyr">CRAN</a>, <a href="https://github.com/hadley/plyr">GitHub</a>, <a href="http://plyr.had.co.nz/">pkgdown-ish</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> parallelly 1.31.1: Better at Inferring Number of CPU Cores with Cgroups and Linux Containers https://www.jottr.org/2022/04/22/parallelly-1.31.1/ Fri, 22 Apr 2022 11:00:00 -0700 https://www.jottr.org/2022/04/22/parallelly-1.31.1/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.31.1 is on CRAN. The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <p>Since my <a href="https://www.jottr.org/2021/11/22/parallelly-1.29.0/">previous post on <strong>parallelly</strong></a> in November 2021, I&rsquo;ve fixed a few bugs and added some new features to the package;</p> <ul> <li><p><code>availableCores()</code> detects more cgroups settings, e.g. 
it now detects the number of CPUs available to your RStudio Cloud session</p></li> <li><p><code>makeClusterPSOCK()</code> gained argument <code>default_packages</code> to control which packages to attach at startup on the R workers</p></li> <li><p><code>makeClusterPSOCK()</code> gained <code>rscript_sh</code> to explicitly control what type of shell quotes to use on the R workers</p></li> </ul> <p>Below is a detailed description of these new features. Some of them, and some of the bug fixes, were added to version 1.30.0, while others were added in versions 1.31.0 and 1.31.1.</p> <h2 id="availablecores-detects-more-cgroups-settings">availableCores() detects more cgroups settings</h2> <p><em><a href="https://www.wikipedia.org/wiki/Cgroups">Cgroups</a></em>, short for control groups, is a low-level feature in Linux to control which resources, and how much of them, a process may use. This prevents individual processes from taking up all resources. For example, an R process can be limited to use at most four CPU cores, even if the underlying hardware has 48 CPU cores. Imagine we parallelize with <code>parallel::detectCores()</code> background workers, e.g.</p> <pre><code class="language-r">library(future)
plan(multisession, workers = parallel::detectCores())
</code></pre> <p>This will spawn 48 background R processes. Without cgroups, these 48 parallel R workers will run across all 48 CPU cores on the machine, competing with all other software and all other users running on the same machine. With cgroups limiting us to, say, four CPU cores, there will still be 48 parallel R workers running, but they will now run isolated on only four CPU cores, leaving the other 44 CPU cores alone.</p> <p>Of course, running 48 parallel workers on four CPU cores is not very efficient. There will be a lot of wasted CPU cycles due to context switching. The problem is that we use <code>parallel::detectCores()</code> here, which is what gives us 48 workers.
If we instead use <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> of <strong>parallelly</strong>;</p> <pre><code class="language-r">library(future)
plan(multisession, workers = parallelly::availableCores())
</code></pre> <p>we get four parallel workers, which reflects the four CPU cores that cgroups gives us. Basic support for this was introduced in <strong>parallelly</strong> 1.22.0 (2020-12-12), by querying <code>nproc</code>. This required that <code>nproc</code> was installed on the system, and although it worked in many cases, it did not work for all cgroups configurations. Specifically, it would not work when cgroups was <em>throttling</em> the CPU usage rather than limiting the process to a specific set of CPU cores. To illustrate this, assume we run R via Docker using <a href="https://www.rocker-project.org/">Rocker</a>:</p> <pre><code class="language-sh">$ docker run --cpuset-cpus=0-2,8 rocker/r-base </code></pre> <p>then cgroups will isolate the Linux container to run on CPU cores 0, 1, 2, and 8 of the host. In this case <code>nproc</code>, e.g. <code>system(&quot;nproc&quot;)</code> from within R, returns four (4), and therefore so does <code>parallelly::availableCores()</code>. Starting with <strong>parallelly</strong> 1.31.0, <code>parallelly::availableCores()</code> detects this also when <code>nproc</code> is not installed on the system. An alternative to limiting the CPU resources is to throttle the average CPU load. Using Docker, this can be done as:</p> <pre><code class="language-sh">$ docker run --cpus=3.5 rocker/r-base </code></pre> <p>In this case, cgroups will throttle our R process to consume at most 350% worth of CPU on the host, where 100% corresponds to a single CPU. Here, <code>nproc</code> is of no use and simply gives the number of CPUs on the host (e.g. 48).
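</p> <p>Under the hood, such throttling is expressed as a CPU quota per scheduling period in the cgroups filesystem. A sketch of what one might see from within the throttled container (cgroups v2 path shown; cgroups v1 uses <code>cpu.cfs_quota_us</code> and <code>cpu.cfs_period_us</code> instead):</p> <pre><code class="language-r">## Inside the container launched with --cpus=3.5 (sketch)
readLines(&quot;/sys/fs/cgroup/cpu.max&quot;)
#&gt; [1] &quot;350000 100000&quot;
</code></pre> <p>that is, a quota of 350,000&nbsp;µs of CPU time per 100,000-µs period, i.e. 3.5 CPUs&rsquo; worth of load.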
Starting with <strong>parallelly</strong> 1.31.0, <code>parallelly::availableCores()</code> can detect that cgroups throttles R to an average load of 3.5 CPUs. Since we cannot run 3.5 parallel workers, <code>parallelly::availableCores()</code> rounds down to the nearest integer and returns three (3). The <a href="https://rstudio.cloud/">RStudio Cloud</a> is one example where CPU throttling is used, so if you work in RStudio Cloud, use <code>parallelly::availableCores()</code> and you will be good.</p> <p>While talking about RStudio Cloud, if you use a free account, you have access to only a single CPU core (&ldquo;nCPUs = 1&rdquo;). In this case, <code>plan(multisession, workers = parallelly::availableCores())</code>, or equivalently, <code>plan(multisession)</code>, will fall back to sequential processing, because there is no point in running in parallel on a single core. If you still want to <em>prototype</em> parallel processing in a single-core environment, say with two cores, you can set option <code>parallelly.availableCores.min = 2</code>. This makes <code>availableCores()</code> return two (2).</p> <h2 id="makeclusterpsock-gained-more-skills">makeClusterPSOCK() gained more skills</h2> <p>Since <strong>parallelly</strong> 1.29.0, <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> has gained arguments <code>default_packages</code> and <code>rscript_sh</code>.</p> <h3 id="new-argument-default-packages">New argument <code>default_packages</code></h3> <p>Argument <code>default_packages</code> controls which R packages are attached on each worker during startup. Previously, it was only possible, via the logical argument <code>methods</code>, to control whether or not the <strong>methods</strong> package should be attached - an argument that stems from <code>parallel::makePSOCKcluster()</code>. With the new <code>default_packages</code> argument, we have full control over which packages are attached.
For instance, if we want to go minimal, we can do:</p> <pre><code class="language-r">cl &lt;- parallelly::makeClusterPSOCK(1, default_packages = &quot;base&quot;) </code></pre> <p>This will result in one R worker with only the <strong>base</strong> package <em>attached</em>;</p> <pre><code class="language-r">&gt; parallel::clusterEvalQ(cl, { search() }) [[1]] [1] &quot;.GlobalEnv&quot; &quot;Autoloads&quot; &quot;package:base&quot; </code></pre> <p>Having said that, note that more packages are <em>loaded</em>;</p> <pre><code class="language-r">&gt; parallel::clusterEvalQ(cl, { loadedNamespaces() }) [[1]] [1] &quot;compiler&quot; &quot;parallel&quot; &quot;utils&quot; &quot;base&quot; </code></pre> <p>Like <strong>base</strong>, <strong>compiler</strong> is a package that R always loads. The <strong>parallel</strong> package is loaded because it provides the code for running the background R workers. The <strong>utils</strong> package is loaded because <code>makeClusterPSOCK()</code> validates that the workers are functional by collecting extra information from the R workers that may later be useful when reporting on errors. To skip this, pass argument <code>validate = FALSE</code>.</p> <h3 id="new-argument-rscript-sh">New argument <code>rscript_sh</code></h3> <p>The new argument <code>rscript_sh</code> can be used in the rare case where one launches remote R workers on non-Unix machines from a Unix-like machine. For example, if we, from a Linux machine, launch remote MS Windows workers, we need to use <code>rscript_sh = &quot;cmd&quot;</code>.</p> <p>That covers the most important additions to <strong>parallelly</strong>.
For bug fixes and minor updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> future 1.24.0: Forwarding RNG State also for Stand-Alone Futures https://www.jottr.org/2022/02/22/future-1.24.0-forwarding-rng-state-also-for-stand-alone-futures/ Tue, 22 Feb 2022 13:00:00 -0800 https://www.jottr.org/2022/02/22/future-1.24.0-forwarding-rng-state-also-for-stand-alone-futures/ <p><strong><a href="https://future.futureverse.org">future</a></strong> 1.24.0 is on CRAN. It comes with one significant update related to random number generation, further deprecation of legacy future strategies, a slight improvement to <code>plan()</code> and <code>tweak()</code>, and some bug fixes. Below are the most important changes.</p> <figure style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/xkcd_221-random_number.png" alt="A one-box XKCD comic with the following handwritten code: int getRandomNumber() { return 4; // chosen by fair dice roll. // guaranteed to be random. } "/> </center> <figcaption style="font-size: small; font-style: italic;">One of many possible random number generators. This one was carefully designed by <a href="https://xkcd.com/221/">XKCD</a> [CC BY-NC 2.5].
</figcaption> </figure> <h2 id="future-seed-true-updates-rng-state">future(&hellip;, seed = TRUE) updates RNG state</h2> <p>In <strong>future</strong> (&lt; 1.24.0), using <a href="https://future.futureverse.org/reference/future.html"><code>future(..., seed = TRUE)</code></a> would <em>not</em> forward the state of the random number generator (RNG). For example, if we generated random numbers in individual futures this way, they would become <em>identical</em>, e.g.</p> <pre><code class="language-r">f &lt;- future(rnorm(n = 1L), seed = TRUE) value(f) #&gt; [1] -1.424997 f &lt;- future(rnorm(n = 1L), seed = TRUE) value(f) #&gt; [1] -1.424997 </code></pre> <p>This was a deliberate, conservative design, because it is not obvious exactly how the RNG state should be forwarded in this case, especially if we consider that random numbers may also be generated in the main R session. The more I dug into the problem, the further down I ended up in a rabbit hole. Because of this, I held back on addressing this problem, leaving it to the developer to solve, i.e. they had to roll their own RNG streams designed for parallel processing and populate each future with a unique seed from those streams, e.g. <code>future(..., seed = &lt;seed&gt;)</code>. This is how <strong><a href="https://future.apply.futureverse.org">future.apply</a></strong> and <strong><a href="https://furrr.futureverse.org">furrr</a></strong> already do it internally.</p> <p>However, I understand that design was confusing, and if not understood, it could silently lead to RNG mistakes and correlated, or even identical, random numbers. I also sometimes got confused about this when I needed to do something quickly with individual futures and random numbers.
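</p>

<p>To illustrate the tedium, rolling your own L&rsquo;Ecuyer-CMRG streams amounts to something like the following sketch, where <code>parallel::nextRNGStream()</code> is used to pre-generate one seed per future:</p>

<pre><code class="language-r">library(future)
plan(multisession)

## Pre-generate three L'Ecuyer-CMRG RNG substream seeds, one per future
RNGkind(&quot;L'Ecuyer-CMRG&quot;)
set.seed(42)
seeds &lt;- list(.Random.seed)
for (ii in 2:3) seeds[[ii]] &lt;- parallel::nextRNGStream(seeds[[ii - 1]])

## Populate each future with its own unique seed
fs &lt;- lapply(seeds, function(seed) future(rnorm(n = 1L), seed = seed))
vs &lt;- lapply(fs, value)
</code></pre>

<p>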
I even considered making <code>seed = TRUE</code> an error until resolved, and, looking back, maybe I should have done so.</p> <p>Anyway, because it is rather tedious to roll your own L&rsquo;Ecuyer-CMRG RNG streams, I decided to update <code>future(..., seed = TRUE)</code> to provide a good-enough solution internally, where it forwards the RNG state and then provides the future with an RNG substream based on the updated RNG state. In <strong>future</strong> (&gt;= 1.24.0), we now get:</p> <pre><code class="language-r">f &lt;- future(rnorm(n = 1L), seed = TRUE) v &lt;- value(f) print(v) #&gt; [1] -1.424997 f &lt;- future(rnorm(n = 1L), seed = TRUE) v &lt;- value(f) print(v) #&gt; [1] -1.985136 </code></pre> <p>This update only affects code that currently uses <code>future(..., seed = TRUE)</code>. It does <em>not</em> affect code that relies on <strong>future.apply</strong> or <strong>furrr</strong>, which already worked correctly. That is, you can keep using <code>y &lt;- future_lapply(..., future.seed = TRUE)</code> and <code>y &lt;- future_map(..., .options = furrr_options(seed = TRUE))</code>.</p> <h2 id="deprecating-future-strategies-transparent-and-remote">Deprecating future strategies &lsquo;transparent&rsquo; and &lsquo;remote&rsquo;</h2> <p>It&rsquo;s on the <a href="https://futureverse.org/roadmap.html">roadmap</a> to provide mechanisms for the developer to declare what resources a particular future needs and for the end-user to specify multiple parallel-backend alternatives, so that the future can be processed on a worker that best can meet its resource requirements. In order to support this, we need to restrict the future backend API further, which has been in the works over the last couple of years in collaboration with existing package developers.</p> <p>In this release, I am formally deprecating future strategies <code>transparent</code> and <code>remote</code>. When used, they now produce an informative warning. 
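</p>

<p>In practice, the migration is mechanical (a sketch; the worker hostname is made up):</p>

<pre><code class="language-r">library(future)

## plan(transparent) becomes:
plan(sequential, split = TRUE)

## plan(remote, workers = &quot;n1.example.org&quot;) becomes:
plan(cluster, workers = &quot;n1.example.org&quot;)
</code></pre>

<p>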
The <code>transparent</code> strategy is deprecated in favor of using <code>sequential</code> with argument <code>split = TRUE</code> set. If you still use <code>remote</code>, please migrate to <code>cluster</code>, which has long been able to do everything that <code>remote</code> can do.</p> <p>On a related note, if you are still using <code>multiprocess</code>, which has been deprecated since <strong>future</strong> 1.20.0 (2020-11-03), please migrate to <code>multisession</code> so you won&rsquo;t be surprised when <code>multiprocess</code> becomes defunct.</p> <p>For the other updates, please see the <a href="https://future.futureverse.org/news/index.html">NEWS</a>.</p> <p>Happy futuring!</p> <p>Henrik</p> <h2 id="other-posts-on-random-numbers-in-parallel-processing">Other posts on random numbers in parallel processing</h2> <ul> <li><p><a href="https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/">future 1.19.1 - Making Sure Proper Random Numbers are Produced in Parallel Processing</a>, 2020-09-22</p></li> <li><p><a href="https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/">Detect When the Random Number Generator Was Used</a>, 2020-09-21</p></li> <li><p><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future_lapply() and More</a>, 2017-02-19</p></li> </ul> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a>, <a href="https://future.apply.futureverse.org">pkgdown</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a 
href="https://github.com/HenrikBengtsson/furrr">GitHub</a>, <a href="https://furrr.futureverse.org">pkgdown</a></li> </ul> Future Improvements During 2021 https://www.jottr.org/2022/01/07/future-during-2021/ Fri, 07 Jan 2022 14:00:00 -0800 https://www.jottr.org/2022/01/07/future-during-2021/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/paragliding_mount_tamalpais_20220101.jpg" alt="First person view while paragliding during a sunny day with blue skies. The pilot's left hand with a glove can be seen pulling the left brake with lines going up to the white, left wing tip above. The pilot is in a left turn high above the mountainside with open patches of grass among the trees. Two other paragliders further down can be seen in the distance. Down below, to the left, there is a long ocean beach slowly curving up towards a point on the horizon. Inside the beach, there is a lagoon. Part of the mountain ridge can be seen to the right."/> </center> </div> <p>Happy New Year! I made some updates to the future framework during 2021 that involve overall improvements and essential preparations to go forward with some exciting new features that I&rsquo;m keen to work on during 2022.</p> <p>The <a href="https://futureverse.org">future framework</a> makes it easy to parallelize existing R code - often with only a minor change of code. The goal is to lower the barriers so that anyone can quickly and safely speed up their existing R code in a worry-free manner.</p> <p><strong><a href="https://future.futureverse.org">future</a></strong> 1.22.1 was released in August 2021, followed by <strong>future</strong> 1.23.0 at the end of October 2021.
Below, I summarize the updates that came with those two releases:</p> <ul> <li><a href="#new-features">New features</a></li> <li><a href="#performance-improvements">Performance improvements</a></li> <li><a href="#cleanups-to-make-room-for-new-features">Cleanups to make room for new features</a></li> <li><a href="#significant-changes-preparing-for-the-future">Significant changes preparing for the future</a></li> <li><a href="#roadmap-ahead">Roadmap ahead</a></li> </ul> <p>There were also several updates to the related <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> and <strong><a href="https://progressr.futureverse.org">progressr</a></strong> packages, which you can read about in earlier blog posts under the <a href="https://www.jottr.org/tags/#parallelly-list">#parallelly</a> and <a href="https://www.jottr.org/tags/#progressr-list">#progressr</a> blog tags.</p> <h2 id="new-features">New features</h2> <h3 id="futuresessioninfo-for-troubleshooting-and-issue-reporting">futureSessionInfo() for troubleshooting and issue reporting</h3> <p>Function <a href="https://future.futureverse.org/reference/futureSessionInfo.html"><code>futureSessionInfo()</code></a> was added to <strong>future</strong> 1.22.0. It outputs information useful for troubleshooting problems related to the future framework. It also runs some basic tests to validate that the current future backend works as expected. If you have problems getting futures to work on your machine, please run this function before reporting issues at <a href="https://github.com/HenrikBengtsson/future/discussions">Future Discussions</a>. 
Here&rsquo;s an example:</p> <pre><code class="language-r">&gt; library(future) &gt; plan(multisession, workers = 2) &gt; futureSessionInfo() *** Package versions future 1.23.0, parallelly 1.30.0, parallel 4.1.2, globals 0.14.0, listenv 0.8.0 *** Allocations availableCores(): system nproc 8 8 availableWorkers(): $system [1] &quot;localhost&quot; &quot;localhost&quot; &quot;localhost&quot; [4] &quot;localhost&quot; &quot;localhost&quot; &quot;localhost&quot; [7] &quot;localhost&quot; &quot;localhost&quot; *** Settings - future.plan=&lt;not set&gt; - future.fork.multithreading.enable=&lt;not set&gt; - future.globals.maxSize=&lt;not set&gt; - future.globals.onReference=&lt;not set&gt; - future.resolve.recursive=&lt;not set&gt; - future.rng.onMisuse='warning' - future.wait.timeout=&lt;not set&gt; - future.wait.interval=&lt;not set&gt; - future.wait.alpha=&lt;not set&gt; - future.startup.script=&lt;not set&gt; *** Backends Number of workers: 2 List of future strategies: 1. multisession: - args: function (..., workers = 2, envir = parent.frame()) - tweaked: TRUE - call: plan(multisession, workers = 2) *** Basic tests worker pid r sysname release 1 1 19291 4.1.2 Linux 5.4.0-91-generic 2 2 19290 4.1.2 Linux 5.4.0-91-generic version 1 #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021 2 #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021 nodename machine login user effective_user 1 my-laptop x86_64 alice alice alice 2 my-laptop x86_64 alice alice alice Number of unique PIDs: 2 (as expected) </code></pre> <h3 id="working-around-utf-8-escaping-on-ms-windows">Working around UTF-8 escaping on MS Windows</h3> <p>Because of limitations in R itself, UTF-8 symbols outputted on MS Windows parallel workers would be <a href="https://github.com/HenrikBengtsson/future/issues/473">relayed as escaped symbols</a> when using futures. 
Now, the future framework, and, more specifically, <a href="https://future.futureverse.org/reference/value.html"><code>value()</code></a>, attempts to recover such MS Windows output to UTF-8 before outputting it.</p> <p>For example, in <strong>future</strong> (&lt; 1.23.0) you would get the following:</p> <pre><code class="language-r">f &lt;- future({ cat(&quot;\u2713 Everything is OK&quot;) ; 42 }) v &lt;- value(f) #&gt; &lt;U+2713&gt; Everything is OK </code></pre> <p>when, and only when, those futures are resolved on an MS Windows machine. In <strong>future</strong> (&gt;= 1.23.0), we work around this problem by looking for <code>&lt;U+NNNN&gt;</code>-like patterns in the output and decoding them as UTF-8 symbols;</p> <pre><code class="language-r">f &lt;- future({ cat(&quot;\u2713 Everything is OK&quot;) ; 42 }) v &lt;- value(f) #&gt; ✓ Everything is OK </code></pre> <p><em>Comment</em>: From <a href="https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html">R 4.2.0, R will have native support for UTF-8 also on MS Windows</a>. More testing and validation is needed to confirm this will work out of the box in R (&gt;= 4.2.0) when running R in the terminal, in the R GUI, in the RStudio Console, and so on. If so, <strong>future</strong> will be updated to only apply this workaround for R (&lt; 4.2.0).</p> <h3 id="harmonization-of-future-futureassign-and-futurecall">Harmonization of future(), futureAssign(), and futureCall()</h3> <p>Prior to <strong>future</strong> 1.22.0, argument <code>seed</code> for <a href="https://future.futureverse.org/reference/future.html"><code>futureAssign()</code></a> and <a href="https://future.futureverse.org/reference/future.html"><code>futureCall()</code></a> defaulted to <code>TRUE</code>, whereas it defaulted to <code>FALSE</code> for <a href="https://future.futureverse.org/reference/future.html"><code>future()</code></a>. This was an oversight.
In <strong>future</strong> (&gt;= 1.22.0), <code>seed = FALSE</code> is the default for all these functions.</p> <h3 id="protecting-against-non-exportable-results">Protecting against non-exportable results</h3> <p>Analogously to how globals may be scanned for <a href="https://future.futureverse.org/articles/future-4-non-exportable-objects.html">&ldquo;non-exportable&rdquo; objects</a> when option <code>future.globals.onReference</code> is set to <code>&quot;error&quot;</code> or <code>&quot;warning&quot;</code>, <code>value()</code> will now check for similar problems in the value returned from parallel workers. For example, in <strong>future</strong> (&lt; 1.23.0) we would get:</p> <pre><code class="language-r">library(future) plan(multisession, workers = 2) options(future.globals.onReference = &quot;error&quot;) f &lt;- future(xml2::read_xml(&quot;&lt;body&gt;&lt;/body&gt;&quot;)) v &lt;- value(f) print(v) #&gt; Error in doc_type(x) : external pointer is not valid </code></pre> <p>whereas in <strong>future</strong> (&gt;= 1.23.0) we get:</p> <pre><code class="language-r">library(future) plan(multisession, workers = 2) options(future.globals.onReference = &quot;error&quot;) f &lt;- future(xml2::read_xml(&quot;&lt;body&gt;&lt;/body&gt;&quot;)) v &lt;- value(f) #&gt; Error: Detected a non-exportable reference ('externalptr') in the value #&gt; (of class 'xml_document') of the resolved future </code></pre> <h3 id="finer-control-of-what-type-of-conditions-are-captured-and-replayed">Finer control of what type of conditions are captured and replayed</h3> <p>Besides specifying which condition classes should be captured and relayed, in <strong>future</strong> (&gt;= 1.22.0), it is also possible to specify condition classes to be ignored. For example,</p> <pre><code class="language-r">f &lt;- future(..., conditions = structure(&quot;condition&quot;, exclude = &quot;message&quot;)) </code></pre> <p>captures all conditions but message conditions.
The default is <code>conditions = &quot;condition&quot;</code>, which captures and relays any type of condition.</p> <h2 id="performance-improvements">Performance improvements</h2> <p>I always prioritize correctness over performance in the <strong>future</strong> framework. So, whenever optimizing for performance, I always have to make sure that nothing breaks somewhere else. Thankfully, there are now <a href="https://www.futureverse.org/statistics.html">over 200 reverse-dependency packages on CRAN</a> and Bioconductor that I can validate against. They provide another comfy cushion against mistakes, beyond what we already get from package unit tests and the <strong><a href="https://future.tests.futureverse.org">future.tests</a></strong> test suite. Below are some of the recent performance improvements.</p> <h3 id="less-latency-for-multicore-multisession-and-cluster-futures">Less latency for multicore, multisession, and cluster futures</h3> <p>In <strong>future</strong> 1.22.0, the default timeout of <a href="https://future.futureverse.org/reference/resolved.html"><code>resolved()</code></a> was decreased from 0.20 seconds to 0.01 seconds for multicore, multisession, and cluster futures. This means that less time is now spent on checking for results from these future backends when they are not yet available. After making sure it is safe to do so, we might decrease the default timeout to zero in a later release.</p> <h3 id="less-overhead-when-initiating-futures">Less overhead when initiating futures</h3> <p>The overhead of initiating futures was significantly reduced in <strong>future</strong> 1.22.0. For example, the round trip for <code>value(future(NULL))</code> is about twice as fast for sequential, cluster, and multisession futures.
For multicore futures the round-trip speedup is about 20%.</p> <p>The speedup comes from pre-compiling the future&rsquo;s R expression into an R expression template, which can then quickly be re-compiled into the final expression to be evaluated. Specifically, instead of calling <code>expr &lt;- base::bquote(tmpl)</code> for each future, which is computationally expensive, we take a two-step approach where we first call <code>tmpl_cmp &lt;- bquote_compile(tmpl)</code> once per session such that we only have to call the much faster <code>expr &lt;- bquote_apply(tmpl_cmp)</code> for each future.(*) This new pre-compile approach speeds up the construction of the final future expression from the original future expression by ~10 times.</p> <p>(*) These are <a href="https://github.com/HenrikBengtsson/future/blob/1064c4ec2c37a70fa8fff8887d0030a5f03c46da/R/000.bquote.R#L56-L131">internal functions</a> of the <strong>future</strong> package.</p> <h3 id="environment-variables-are-only-used-when-package-is-loaded">Environment variables are only used when package is loaded</h3> <p>All R <a href="https://future.futureverse.org/reference/future.options.html">options specific to the future framework</a> have defaults that fall back to corresponding environment variables. For example, the default for option <code>future.rng.onMisuse</code> can be set by environment variable <code>R_FUTURE_RNG_ONMISUSE</code>.</p> <p>The purpose of the environment variables is to make it possible to configure the future framework before launching R, e.g. in shell startup scripts, or in shell scripts submitted to job schedulers in high-performance compute (HPC) environments.
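</p>

<p>For example, to enable the strictest RNG check before launching R from the shell (a sketch; the script name is made up):</p>

<pre><code class="language-sh">$ R_FUTURE_RNG_ONMISUSE=error Rscript my_analysis.R
</code></pre>

<p>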
When R is already running, the best practice is to use the R options to configure the future framework.</p> <p>In order to avoid the overhead from querying and parsing environment variables at runtime, but also to clarify how and when environment variables should be set, starting with <strong>future</strong> 1.22.0, <em><code>R_FUTURE_*</code> environment variables are only used when the <strong>future</strong> package is loaded</em>. Then, if set, they are used for setting the corresponding <code>future.*</code> option.</p> <h2 id="cleanups-to-make-room-for-new-features">Cleanups to make room for new features</h2> <p>The <code>values()</code> function has been defunct since <strong>future</strong> 1.23.0 in favor of <code>value()</code>. All CRAN and Bioconductor packages that depend on <strong>future</strong> were updated long ago. If you get the error:</p> <pre><code class="language-r">Error: values() is defunct in future (&gt;= 1.20.0). Use value() instead. </code></pre> <p>make sure to update your R packages. A few users of <strong><a href="https://furrr.futureverse.org">furrr</a></strong> have run into this error - updating to <strong>furrr</strong> (&gt;= 0.2.0) solved the problem.</p> <p>Continuing, to further harmonize how developers use the Future API, we are moving away from odds-and-ends features, especially the ones that are holding us back from adding new features. The goal is to ensure that more code using futures can truly run anywhere, not just on a particular parallel backend that the developer works with.</p> <p>In this spirit, we are slowly moving away from &ldquo;persistent&rdquo; workers. For example, in <strong>future</strong> (&gt;= 1.23.0), <code>plan(multisession, persistent = TRUE)</code> is no longer supported and will produce an error if attempted.
The same will eventually happen also for <code>plan(cluster, persistent = TRUE)</code>, but not until we have <a href="https://www.futureverse.org/roadmap.html">support for caching &ldquo;sticky&rdquo; globals</a>, which is the main use case for persistent workers.</p> <p>Another example is transparent futures, which are prepared for deprecation in <strong>future</strong> (&gt;= 1.23.0). If used, <code>plan(transparent)</code> produces a warning, which will soon be upgraded to a formal deprecation warning. In a later release, it will produce an error. Transparent futures were added during the early days in order to simplify troubleshooting of futures. A better approach these days is to use <code>plan(sequential, split = TRUE)</code>, which makes interactive troubleshooting tools such as <code>browser()</code> and <code>debug()</code> work.</p> <h2 id="significant-changes-preparing-for-the-future">Significant changes preparing for the future</h2> <p>Prior to <strong>future</strong> 1.22.0, lazy futures were assigned to the currently set future backend immediately when created. For example, if we do:</p> <pre><code class="language-r">library(future) plan(multisession, workers = 2) f &lt;- future(42, lazy = TRUE) </code></pre> <p>with <strong>future</strong> (&lt; 1.22.0), we would get:</p> <pre><code class="language-r">class(f) #&gt; [1] &quot;MultisessionFuture&quot; &quot;ClusterFuture&quot; &quot;MultiprocessFuture&quot; #&gt; [4] &quot;Future&quot; &quot;environment&quot; </code></pre> <p>Starting with <strong>future</strong> 1.22.0, lazy futures remain generic futures until they are launched, which means they are not assigned a backend class until they have to be. Now, the above example gives:</p> <pre><code class="language-r">class(f) #&gt; [1] &quot;Future&quot; &quot;environment&quot; </code></pre> <p>This change opens up the door for storing futures themselves to file and sending them elsewhere.
More precisely, this means we can start working towards a <em>queue of futures</em>, which then can be processed on whatever compute resources we have access to at the moment, e.g. some futures might be resolved on the local computer, others on machines on a local cluster, and when those fill up, we can burst out to cloud resources, or maybe process them via a community-driven peer-to-peer cluster.</p> <h2 id="roadmap-ahead">Roadmap ahead</h2> <p>There are lots of new features on the roadmap related to the above and other things. I hope to make progress on several of them during 2022. If you&rsquo;re curious about what&rsquo;s coming up, see the <a href="https://www.futureverse.org/roadmap.html">Project Roadmap</a>, stay tuned on this blog (<a href="https://www.jottr.org/index.xml">feed</a>), or follow <a href="https://twitter.com/henrikbengtsson/">me on Twitter</a>.</p> <p>Happy futuring!</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> </ul> parallelly 1.29.0: New Skills and Less Communication Latency on Linux https://www.jottr.org/2021/11/22/parallelly-1.29.0/ Mon, 22 Nov 2021 21:00:00 -0800 https://www.jottr.org/2021/11/22/parallelly-1.29.0/ <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.29.0 is on CRAN. 
The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <p>Since my <a href="https://www.jottr.org/2021/06/10/parallelly-1.26.0/">previous post on <strong>parallelly</strong></a> five months ago, the <strong>parallelly</strong> package had some bugs fixed, and it gained a few new features;</p> <ul> <li><p>new <code>isForkedChild()</code> to test if R runs in a forked process,</p></li> <li><p>new <code>isNodeAlive()</code> to test if one or more cluster-node processes are running,</p></li> <li><p><code>availableCores()</code> now also respects Bioconductor settings,</p></li> <li><p><code>makeClusterPSOCK(..., rscript = &quot;*&quot;)</code> automatically expands to the proper Rscript executable,</p></li> <li><p><code>makeClusterPSOCK(…, rscript_envs = c(UNSET_ME = NA_character_))</code> unsets environment variables on cluster nodes, and</p></li> <li><p><code>makeClusterPSOCK()</code> sets up clusters with less communication latency on Unix.</p></li> </ul> <p>Below is a detailed description of these new features.</p> <h2 id="new-function-isforkedchild">New function isForkedChild()</h2> <p>If you run R on Unix or macOS, you can parallelize code using so-called <em>forked</em> parallel processing. It is a very convenient way of parallelizing code, especially since forking is implemented at the core of the operating system and there is very little extra you have to do at the R level to get it to work.
Compared with other parallelization solutions, forked processing often has less overhead, resulting in shorter turnaround times. To date, the most famous method for parallelizing using forks is <code>mclapply()</code> of the <strong>parallel</strong> package. For example,</p> <pre><code class="language-r">library(parallel) y &lt;- mclapply(X, some_slow_fcn, mc.cores = 4) </code></pre> <p>works just like <code>lapply(X, some_slow_fcn)</code> but will perform the same tasks in parallel using four (4) CPU cores. MS Windows does not support <a href="https://en.wikipedia.org/wiki/Fork_(system_call)">forked processing</a>; any attempt to use <code>mclapply()</code> there will cause it to silently fall back to a sequential <code>lapply()</code> call.</p> <p>In the <strong>future</strong> ecosystem, you get forked parallelization with the <code>multicore</code> backend, e.g.</p> <pre><code class="language-r">library(future.apply) plan(multicore, workers = 4) y &lt;- future_lapply(X, some_slow_fcn) </code></pre> <p>Unfortunately, we cannot parallelize all types of code using forks. If we try, we might get an error, but in the worst case the R process crashes (segmentation fault). For example, some graphical user interfaces (GUIs) do not play well with forked processing, e.g. the RStudio Console, but also other GUIs. Multi-threaded parallelization has also been reported to cause problems when run within forked parallelization.
We sometimes talk about <em>non-fork-safe code</em>, in contrast to <em>fork-safe</em> code, to refer to code that risks crashing the software if run in forked processes.</p> <p>Here is what R-core developer Simon Urbanek, author of <code>mclapply()</code>, wrote in the R-devel thread <a href="https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html">&lsquo;mclapply returns NULLs on MacOS when running GAM&rsquo;</a> on 2020-04-28:</p> <blockquote> <p>Do NOT use <code>mcparallel()</code> in packages except as a non-default option that user can set for the reasons &hellip; explained [above]. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don&rsquo;t know the resource available so only the user can tell you when it is safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.</p> </blockquote> <p>It is not always obvious whether a certain function call in R is fork safe, especially if we haven&rsquo;t written all the code ourselves. Because of this, it is more a matter of trial and error to see if it works. However, when we know that a certain function call is <em>not</em> fork safe, it is useful to protect against using it in forked parallelization. In <strong>parallelly</strong> (&gt;= 1.28.0), we can use function <a href="https://parallelly.futureverse.org/reference/isForkedChild.html"><code>isForkedChild()</code></a> to test whether or not R runs in a forked child process. For example, the author of <code>some_slow_fcn()</code> above could protect against mistakes by:</p> <pre><code class="language-r">some_slow_fcn &lt;- function(x) { if (parallelly::isForkedChild()) { stop(&quot;This function must not be used in *forked* parallel processing&quot;) } y &lt;- non_fork_safe_code(x) ... 
} </code></pre> <p>or, if they have an alternative, less preferred, fork-safe implementation, they could run that conditionally on R being executed in a forked child process:</p> <pre><code class="language-r">some_slow_fcn &lt;- function(x) { if (parallelly::isForkedChild()) { y &lt;- fork_safe_code(x) } else { y &lt;- non_fork_safe_code(x) } ... } </code></pre> <h2 id="new-function-isnodealive">New function isNodeAlive()</h2> <p>The new function <a href="https://parallelly.futureverse.org/reference/isNodeAlive.html"><code>isNodeAlive()</code></a> checks whether one or more nodes are alive. For instance,</p> <pre><code class="language-r">library(parallelly) cl &lt;- makeClusterPSOCK(3) isNodeAlive(cl) #&gt; [1] TRUE TRUE TRUE </code></pre> <p>Imagine the second parallel worker crashes, which we can emulate with</p> <pre><code class="language-r">clusterEvalQ(cl[2], tools::pskill(Sys.getpid())) #&gt; Error in unserialize(node$con) : error reading from connection </code></pre> <p>then we get:</p> <pre><code class="language-r">isNodeAlive(cl) #&gt; [1] TRUE FALSE TRUE </code></pre> <p>The <code>isNodeAlive()</code> function works by querying the operating system to see if those processes are still running, based on their process IDs (PIDs) recorded by <code>makeClusterPSOCK()</code> when launched. If the workers&rsquo; PIDs are unknown, then <code>NA</code> is returned instead. For instance, contrary to <code>parallelly::makeClusterPSOCK()</code>, <code>parallel::makeCluster()</code> does not record the PIDs and we get missing values as the result:</p> <pre><code class="language-r">library(parallelly) cl &lt;- parallel::makeCluster(3) isNodeAlive(cl) #&gt; [1] NA NA NA </code></pre> <p>Similarly, if one of the parallel workers runs on a remote machine, we cannot easily query the remote machine to check whether the PID exists. In such cases, <code>NA</code> is returned. 
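<p>As a sketch of how this might be used in practice, a script could drop nodes that are no longer known to be alive before dispatching more work. This is an illustration, not an official recovery mechanism; here <code>NA</code> results are conservatively treated as dead:</p>

```r
library(parallelly)

cl <- makeClusterPSOCK(3)
alive <- isNodeAlive(cl)  # TRUE, FALSE, or NA per node

## Keep only nodes known to be alive; a real recovery strategy
## might also launch replacement workers for the dropped ones
if (!all(alive %in% TRUE)) {
  cl <- cl[which(alive %in% TRUE)]
}

parallel::stopCluster(cl)
```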
Perhaps a future version of <strong>parallelly</strong> will be able to query also remote machines, but for now, it is not possible.</p> <h2 id="availablecores-respects-bioconductor-settings">availableCores() respects Bioconductor settings</h2> <p>Function <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> queries the hardware and the system environment to find out how many CPU cores it may run on. It does this by checking system settings, environment variables, and R options that may be set by the end-user, the system administrator, the parent R process, the operating system, a job scheduler, and so on. When you use <code>availableCores()</code>, you don&rsquo;t have to worry about using more CPU resources than you were assigned, which helps guarantee that your code runs nicely together with everything else on the same machine.</p> <p>In <strong>parallelly</strong> (&gt;= 1.29.0), <code>availableCores()</code> is now also agile to Bioconductor-specific settings. For example, <strong><a href="https://bioconductor.org/packages/BiocParallel">BiocParallel</a></strong> 1.27.2 introduced environment variable <code>BIOCPARALLEL_WORKER_NUMBER</code>, which sets the default number of parallel workers when using <strong>BiocParallel</strong> for parallelization. Similarly, on Bioconductor check servers, they set environment variable <code>BBS_HOME</code>, which <strong>BiocParallel</strong> uses to limit the number of cores to four (4). 
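<p>To see the effect, assuming <strong>parallelly</strong> (&gt;= 1.29.0) is installed, we can mimic such an environment by setting the <strong>BiocParallel</strong> variable by hand; on a real system it would be set by <strong>BiocParallel</strong> or the check environment, not manually:</p>

```r
## Pretend BiocParallel capped the number of workers at two
Sys.setenv(BIOCPARALLEL_WORKER_NUMBER = "2")
n <- parallelly::availableCores()
n  # a named integer; the limit above now caps the value
Sys.unsetenv("BIOCPARALLEL_WORKER_NUMBER")
```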
Now <code>availableCores()</code> also reflects those settings, which, in turn, means that <strong>future</strong> settings like <code>plan(multisession)</code> will also automatically respect the Bioconductor settings.</p> <p>Function <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a>, which relies on <code>availableCores()</code> as a fallback, is therefore also agile to these Bioconductor environment variables.</p> <h2 id="makeclusterpsock-rscript">makeClusterPSOCK(&hellip;, rscript = &ldquo;*&rdquo;)</h2> <p>Argument <code>rscript</code> of <code>makeClusterPSOCK()</code> can be used to control exactly which <code>Rscript</code> executable is used to launch the parallel workers, and also how that executable is launched. The default settings are often sufficient, but if you want to launch a worker, say, within a Linux container, you can do so by adjusting <code>rscript</code>. The help page for <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> has several examples of this. It may also be used for other setups. 
For example, to launch two parallel workers on a remote Linux machine, such that their CPU priority is lower than that of other processes running on that machine, we can use (*):</p> <pre><code class="language-r">workers &lt;- rep(&quot;remote.example.org&quot;, times = 2) cl &lt;- makeClusterPSOCK(workers, rscript = c(&quot;nice&quot;, &quot;Rscript&quot;)) </code></pre> <p>This causes the two R workers to be launched using <code>nice Rscript ...</code>. The Unix command <code>nice</code> is what makes <code>Rscript</code> run with a lower CPU priority. By running at a lower priority, we decrease the risk that our parallel tasks have a negative impact on other software running on that machine, e.g. someone might be using that machine for interactive work without us knowing. We can do the same thing on our local machine via:</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2L, rscript = c(&quot;nice&quot;, file.path(R.home(&quot;bin&quot;), &quot;Rscript&quot;))) </code></pre> <p>Here we specified the absolute path to <code>Rscript</code> to make sure we run the same version of R as the main R session, and not another <code>Rscript</code> that may be on the system <code>PATH</code>.</p> <p>Starting with <strong>parallelly</strong> 1.29.0, we can replace the Rscript specification in the above two examples with <code>&quot;*&quot;</code>, as in:</p> <pre><code class="language-r">workers &lt;- rep(&quot;remote-machine.example.org&quot;, times = 2L) cl &lt;- makeClusterPSOCK(workers, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>and</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2L, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>When used, <code>makeClusterPSOCK()</code> will expand <code>&quot;*&quot;</code> to the proper Rscript specification depending on whether the workers run remotely or locally. 
To further emphasize the convenience of this, consider:</p> <pre><code class="language-r">workers &lt;- c(&quot;localhost&quot;, &quot;remote-machine.example.org&quot;) cl &lt;- makeClusterPSOCK(workers, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>which launches two parallel workers - one running on the local machine and one running on the remote machine.</p> <p>Note that, when using <strong><a href="https://future.futureverse.org">future</a></strong>, we can pass <code>rscript</code> to <code>plan(multisession)</code> and <code>plan(cluster)</code> to achieve the same thing, as in</p> <pre><code class="language-r">plan(cluster, workers = workers, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>and</p> <pre><code class="language-r">plan(multisession, workers = 2L, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>(*) Here we use <code>nice</code> as an example, because it is a simple way to illustrate how <code>rscript</code> can be used. As a matter of fact, there is already an <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html">argument <code>renice</code></a>, which we can use to achieve the same thing without using the <code>rscript</code> argument.</p> <h2 id="makeclusterpsock-rscript-envs-c-unset-me-na-character">makeClusterPSOCK(&hellip;, rscript_envs = c(UNSET_ME = NA_character_))</h2> <p>Argument <code>rscript_envs</code> of <code>makeClusterPSOCK()</code> can be used to set environment variables on cluster nodes, or copy existing ones from the main R session to the cluster nodes. For example,</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2, rscript_envs = c(PI = &quot;3.14&quot;, &quot;MY_EMAIL&quot;)) </code></pre> <p>will, during startup, set environment variable <code>PI</code> on each of the two cluster nodes to have value <code>3.14</code>. 
It will also set <code>MY_EMAIL</code> on them to the value of <code>Sys.getenv(&quot;MY_EMAIL&quot;)</code> in the current R session.</p> <p>Starting with <strong>parallelly</strong> 1.29.0, we can now also <em>unset</em> environment variables, in case they are set on the cluster nodes. Any named element with a missing value causes the corresponding environment variable to be unset, e.g.</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2, rscript_envs = c(&quot;_R_CHECK_LENGTH_1_CONDITION_&quot; = NA_character_)) </code></pre> <p>This results in passing <code>-e 'Sys.unsetenv(&quot;_R_CHECK_LENGTH_1_CONDITION_&quot;)'</code> to <code>Rscript</code> when launching each worker.</p> <h2 id="makeclusterpsock-sets-up-clusters-with-less-communication-latency-on-unix">makeClusterPSOCK() sets up clusters with less communication latency on Unix</h2> <p>It turns out that, in R <em>on Unix</em>, there is <a href="https://stat.ethz.ch/pipermail/r-devel/2020-November/080060.html">a significant <em>latency</em> in the communication between the parallel workers and the main R session</a> (**). Starting in R (&gt;= 4.1.0), it is possible to decrease this latency by setting a dedicated R option <em>on each of the workers</em>, e.g.</p> <pre><code class="language-r">rscript_args &lt;- c(&quot;-e&quot;, shQuote(&quot;options(socketOptions = 'no-delay')&quot;)) cl &lt;- parallel::makeCluster(workers, rscript_args = rscript_args) </code></pre> <p>This is quite verbose, so I&rsquo;ve made this the new default in <strong>parallelly</strong> (&gt;= 1.29.0), i.e. you can keep using:</p> <pre><code class="language-r">cl &lt;- parallelly::makeClusterPSOCK(workers) </code></pre> <p>to benefit from the above. 
See help for <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> for options on how to change this new default.</p> <p>Here is an example that illustrates the difference in latency with and without the new settings:</p> <pre><code class="language-r">cl_parallel &lt;- parallel::makeCluster(1) cl_parallelly &lt;- parallelly::makeClusterPSOCK(1) res &lt;- bench::mark(iterations = 1000L, parallel = parallel::clusterEvalQ(cl_parallel, iris), parallelly = parallel::clusterEvalQ(cl_parallelly, iris) ) res[, c(1:4,9)] #&gt; # A tibble: 2 × 5 #&gt; expression min median `itr/sec` total_time #&gt; &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:tm&gt; #&gt; 1 parallel 277µs 44ms 22.5 44.4s #&gt; 2 parallelly 380µs 582µs 1670. 598.3ms </code></pre> <p>From this, we see that the total latency overhead for 1,000 parallel tasks went from 44 seconds down to 0.60 seconds, which is ~75 times less on average. Does this mean your parallel code will run faster? No, it is just the communication <em>latency</em> that has decreased. But why waste time <em>waiting</em> for your results when you don&rsquo;t have to? This is why I changed the defaults in <strong>parallelly</strong>. It also brings the experience on Unix on par with that on MS Windows and macOS.</p> <p>Note that the relatively high latency affects only Unix. MS Windows and macOS do not suffer from this extra latency. For example, on MS Windows 10 running in a virtual machine on the same Linux computer as above, I get:</p> <pre><code class="language-r">#&gt; # A tibble: 2 × 5 #&gt; expression min median `itr/sec` total_time #&gt; &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:tm&gt; #&gt; 1 parallel 191us 314us 2993. 333ms #&gt; 2 parallelly 164us 311us 3227. 
310ms </code></pre> <p>If you&rsquo;re using <strong><a href="https://future.futureverse.org">future</a></strong> with <code>plan(multisession)</code> or <code>plan(cluster)</code>, you&rsquo;re already benefitting from the performance gain, because those rely on <code>parallelly::makeClusterPSOCK()</code> internally.</p> <p>(**) <em>Technical details</em>: Option <code>socketOptions</code> sets the default value of argument <code>options</code> of <code>base::socketConnection()</code>. The default is <code>NULL</code>, but if we set it to <code>&quot;no-delay&quot;</code>, the created TCP socket connections are configured to use the <code>TCP_NODELAY</code> flag. When using <code>TCP_NODELAY</code>, a TCP connection will no longer use the so-called <a href="https://www.wikipedia.org/wiki/Nagle%27s_algorithm">Nagle&rsquo;s algorithm</a>, which otherwise is used to reduce the number of TCP packets needed to be sent over the network by making sure TCP fills up each packet before sending it off. When using the new <code>&quot;no-delay&quot;</code>, this buffering is disabled and packets are sent as soon as data come in. 
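<p>For reference, the mechanism itself is small: in R (&gt;= 4.1.0) the option can be set session-wide, and socket connections created afterwards pick it up. A minimal sketch:</p>

```r
## Session-wide default for new socket connections (R >= 4.1.0);
## this is what parallelly (>= 1.29.0) arranges on each worker
options(socketOptions = "no-delay")
getOption("socketOptions")
#> [1] "no-delay"
```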
Credits for this improvement should go to Jeff Keller, who identified and <a href="https://stat.ethz.ch/pipermail/r-devel/2020-November/080060.html">reported the problem to R-devel</a>, to Iñaki Úcar who pitched in, and to Simon Urbanek, who implemented <a href="https://github.com/wch/r-source/commit/82369f73fc297981e64cac8c9a696d05116f0797">support for <code>socketConnection(..., options = &quot;no-delay&quot;)</code></a> for R 4.1.0.</p> <h2 id="bug-fixes">Bug fixes</h2> <p>Finally, the most important bug fixes since <strong>parallelly</strong> 1.26.0 are:</p> <ul> <li><p><code>availableCores()</code> would produce an error on Linux systems without <code>nproc</code> installed.</p></li> <li><p><code>makeClusterPSOCK()</code> failed with &ldquo;Error in freePort(port) : Unknown value on argument ‘port’: &lsquo;auto&rsquo;&rdquo; if environment variable <code>R_PARALLEL_PORT</code> was set to a port number.</p></li> <li><p>In R environments not supporting <code>setup_strategy = &quot;parallel&quot;</code>, <code>makeClusterPSOCK()</code> failed to fall back to <code>setup_strategy = &quot;sequential&quot;</code>.</p></li> </ul> <p>For all other bug fixes and updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a>, <a href="https://future.apply.futureverse.org">pkgdown</a></li> </ul> matrixStats: Consistent Support for Name Attributes via GSoC Project https://www.jottr.org/2021/08/23/matrixstats-gsoc-2021/ Mon, 23 Aug 2021 00:10:00 +0200 https://www.jottr.org/2021/08/23/matrixstats-gsoc-2021/ <p><em>Author: Angelina Panagopoulou, GSoC student developer, undergraduate in the Department of Informatics &amp; Telecommunications (DIT), University of Athens, Greece</em></p> <p><center> <img src="https://www.jottr.org/post/2048px-GSoC_logo.svg.png" alt="Google Summer of Code logo" style="width: 40%"/> <!-- Image source: 
https://commons.wikimedia.org/wiki/File:GSoC_logo.svg --> </center></p> <p>We are glad to announce recent CRAN releases of <strong><a href="https://cran.r-project.org/package=matrixStats">matrixStats</a></strong> with support for handling and returning name attributes. This feature was added to make <strong>matrixStats</strong> functions handle names in the same manner as the corresponding base R functions. In particular, the behavior of <strong>matrixStats</strong> functions is now the same as that of the <code>apply()</code> function in R, resolving the previously missing or inconsistent handling of row and column names. The added support for <code>names</code> and <code>dimnames</code> attributes has already reached a wide, active user base, while at the same time we expect to attract users and developers who previously could not use the <strong>matrixStats</strong> package because it lacked this feature.</p> <p>The <strong>matrixStats</strong> package provides high-performing functions operating on rows and columns of matrices. These functions are optimized such that both memory use and processing time are minimized. In order to minimize the overhead of handling name attributes, the naming support is implemented in native (C) code, where possible. In <strong>matrixStats</strong> (&gt;= 0.60.0), handling of row and column names is optional. This is done to allow for maximum performance where needed. In addition, in order to avoid breaking some scripts and packages that rely on the previous semi-inconsistent behavior of functions, special care has been taken to ensure backward compatibility by default for the time being. 
We have validated the correctness of these newly implemented features by extending existing package tests to check name attributes, measuring the code coverage with the <strong><a href="https://cran.r-project.org/package=covr">covr</a></strong> package, and checking all 358 reverse-dependency packages using the <strong><a href="https://github.com/r-lib/revdepcheck">revdepcheck</a></strong> package.</p> <h2 id="example">Example</h2> <p><code>useNames</code> is an argument added to each of the <strong>matrixStats</strong> functions that gained naming support. It takes values <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. For backward-compatibility reasons, the default value of <code>useNames</code> is <code>NA</code>, meaning the default behavior from earlier versions of <strong>matrixStats</strong> is preserved. If <code>TRUE</code>, the <code>names</code> or <code>dimnames</code> attribute of the result is set; if <code>FALSE</code>, the result does not have name attributes set. 
For example, consider the following 5-by-3 matrix with row and column names:</p> <pre><code class="language-r">&gt; x &lt;- matrix(rnorm(5 * 3), nrow = 5, ncol = 3, dimnames = list(letters[1:5], LETTERS[1:3])) &gt; x A B C a 0.30292612 1.3825644 -0.2125219 b 0.15812229 2.7719647 1.6237263 c -0.09881700 -0.6468119 -0.6481911 d 0.38520941 -0.8466505 -0.4779964 e -0.01599926 -0.8907434 0.6334347 </code></pre> <p>If we use the base R method to calculate row medians, we see that the names attribute of the result reflects the row names of the input matrix:</p> <pre><code class="language-r">&gt; library(stats) &gt; apply(x, MARGIN = 1, FUN = median) a b c d e 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <p>If we use <strong>matrixStats</strong> function <code>rowMedians()</code> with argument <code>useNames = TRUE</code> set, we get the same result as above:</p> <pre><code class="language-r">&gt; library(matrixStats) &gt; rowMedians(x, useNames = TRUE) a b c d e 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <p>If the name attributes are not of interest, we can use <code>useNames = FALSE</code> as in:</p> <pre><code class="language-r">&gt; rowMedians(x, useNames = FALSE) [1] 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <p>Doing so will also avoid the overhead, in time and memory, that otherwise comes from processing name attributes.</p> <p>If we don&rsquo;t specify <code>useNames</code> explicitly, the default is currently <code>useNames = NA</code>, which corresponds to the non-documented behavior that existed in <strong>matrixStats</strong> (&lt; 0.60.0). For several functions, that corresponds to setting <code>useNames = FALSE</code>; for other functions, it corresponds to setting <code>useNames = TRUE</code>; and for yet others, it might have set, say, row names but not column names. 
In our example, the default happens to be the same as <code>useNames = FALSE</code>:</p> <pre><code class="language-r">&gt; rowMedians(x) # default as in matrixStats (&lt; 0.60.0) [1] 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <h2 id="future-plan">Future Plan</h2> <p>The future plan is to change the default value of <code>useNames</code> to <code>TRUE</code> or <code>FALSE</code> and eventually deprecate the backward-compatible behavior of <code>useNames = NA</code>. The default value of <code>useNames</code> is a design choice that requires further investigation. On the one hand, <code>useNames = TRUE</code> as the default is more convenient, but creates an additional performance and memory overhead when name attributes are not needed. On the other hand, making <code>FALSE</code> the default is appropriate for users and packages that rely on maximum performance. Whatever the new default becomes, we will make sure to work with package maintainers to minimize the risk of breaking existing code.</p> <h2 id="google-summer-of-code-2021">Google Summer of Code 2021</h2> <p>The project that introduces the consistent support for name attributes in the <strong>matrixStats</strong> package is a part of the <a href="https://github.com/rstats-gsoc/gsoc2021/wiki">R Project&rsquo;s participation in the Google Summer of Code 2021</a>.</p> <h3 id="links">Links</h3> <ul> <li><a href="https://github.com/rstats-gsoc/gsoc2021/wiki/matrixStats">The matrixStats GSoC 2021 project</a></li> <li><a href="https://cran.r-project.org/web/packages/matrixStats/index.html">matrixStats CRAN page</a></li> <li><a href="https://github.com/HenrikBengtsson/matrixStats">matrixStats GitHub page</a></li> <li><a href="https://github.com/HenrikBengtsson/matrixStats/commits?author=AngelPn">All commits during GSoC 2021 - author Angelina Panagopoulou</a></li> </ul> <h3 id="authors">Authors</h3> <ul> <li><a href="https://github.com/AngelPn">Angelina Panagopoulou</a> - 
<em>Student Developer</em>: I am an undergraduate in the Department of Informatics &amp; Telecommunications (DIT) at the University of Athens.</li> <li><a href="https://github.com/yaccos">Jakob Peder Pettersen</a> - <em>Mentor</em>: PhD Student, Department of Biotechnology and Food Science, Norwegian University of Science and Technology (NTNU). Jakob is a part of the <a href="https://almaaslab.nt.ntnu.no/">Almaas Lab</a> and does research on genome-scale metabolic modeling and behavior of microbial communities.</li> <li><a href="https://github.com/HenrikBengtsson/">Henrik Bengtsson</a> - <em>Co-Mentor</em>: Associate Professor, Department of Epidemiology and Biostatistics, University of California San Francisco (UCSF). He is the author and maintainer of a large number of CRAN and Bioconductor packages including <strong>matrixStats</strong>.</li> </ul> <h3 id="contributions">Contributions</h3> <p><strong>Phase I</strong></p> <ul> <li>All functions implement <code>useNames = NA/FALSE/TRUE</code> in R code, with tests written.</li> <li>Identified reverse-dependency packages that rely on <code>useNames = NA/FALSE/TRUE</code>.</li> <li>New release on CRAN with <code>useNames = NA</code>. This allows useRs and package maintainers to complain if anything breaks.</li> </ul> <p><strong>Phase II</strong></p> <ul> <li>Changed C code structure such that <code>validateIndices()</code> always returns <code>R_xlen_t*</code>. Cleaned up unnecessary macros. <ul> <li>Outcome: shorter compile times, smaller compiled package/library, fewer exported symbols.</li> </ul></li> <li>Simplified the C API for <code>setNames()/setDimnames()</code>.</li> <li>Implemented <code>useNames = NA/FALSE/TRUE</code> in C code where possible, along with related cleanup work.</li> </ul> <h3 id="summary">Summary</h3> <p>We have completed all goals that we had initially planned. 
The release 0.60.0 of <strong>matrixStats</strong> on CRAN included the contributions of GSoC Phase I (&ldquo;implementation in R&rdquo;) and a new release of version 0.60.1 includes the contributions of Phase II (&ldquo;implementation in C&rdquo;).</p> <h3 id="experience">Experience</h3> <p>When I first heard about the Google Summer of Code, I really wanted to participate in it, but I thought that maybe I did not have the prerequisite knowledge yet. And it was true: it was difficult for me to find a project for which I had at least half of the mentioned prerequisites. So, I started looking for a project based on what I would be interested in doing during the summer. This project was an opportunity for me to learn a new programming language, the R language, and also to get acquainted with advanced R. I am grateful for all the learning opportunities: programming in R, developing an R package, using a variety of tools that make developing R packages easier and more productive, working with GitHub tools, and interacting with the open-source community. My mentors understood my lack of experience and really helped me achieve this. Participating in Google Summer of Code 2021 as a student developer is definitely worth it, and I recommend that every student who wants to contribute to open source give it a try.</p> <h2 id="acknowledgements">Acknowledgements</h2> <ul> <li>The Google Summer of Code program for bringing more student developers into open source software development.</li> <li>Jakob Pettersen for being a great project leader and for providing guidance and willingness to impart his knowledge. Henrik Bengtsson whose insight and knowledge into the subject matter steered me through R package development. 
I am very grateful for the immense amount of useful discussions and valuable feedback.</li> <li>The members of the R community for building this welcoming community.</li> </ul> progressr 0.8.0: RStudio's Progress Bar, Shiny Progress Updates, and Absolute Progress https://www.jottr.org/2021/06/11/progressr-0.8.0/ Fri, 11 Jun 2021 19:00:00 -0700 https://www.jottr.org/2021/06/11/progressr-0.8.0/ <p><strong><a href="https://progressr.futureverse.org">progressr</a></strong> 0.8.0 is on CRAN. It comes with some new features:</p> <ul> <li>A new &lsquo;rstudio&rsquo; handler that reports on progress via the RStudio job interface</li> <li><code>withProgressShiny()</code> now updates the <code>detail</code> part, instead of the <code>message</code> part</li> <li>In addition to signalling relative amounts of progress, it&rsquo;s now also possible to signal total amounts</li> </ul> <p>If you&rsquo;re curious what <strong>progressr</strong> is about, have a look at my <a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 presentation</a>.</p> <h2 id="progress-updates-in-rstudio-s-job-interface">Progress updates in RStudio&rsquo;s job interface</h2> <p>If you&rsquo;re using the RStudio Console, you can now report on progress in RStudio&rsquo;s job interface as long as the progress originates from a <strong>progressr</strong>-signalling function. I’ve shown an example of this in Figure&nbsp;1.</p> <figure style="margin-top: 3ex;"> <img src="https://www.jottr.org/post/progressr-rstudio.png" alt="A screenshot of the upper part of the RStudio Console panel. Below the title bar, which says 'R 4.1.0 ~/', there is a row with the text 'Console 05:50:51 PM' left of a green progress bar at 30% followed by the text '0:03'. Below these two lines are the R commands called so far, which are the same as in the below example. 
Following the commands is the output 'M: Added value 1', 'M: Added value 2', and 'M: Added value 3', from the first steps that have completed so far."/> <figcaption> Figure 1: The RStudio job interface can show progress bars and we can use it with **progressr**. The progress bar title - "Console 05:50:51 PM" - shows at what time the progress began. The '0:03' shows for how long the progress has been running - here 3 seconds. </figcaption> </figure> <p>To try this yourself, run the code below in the RStudio Console.</p> <pre><code class="language-r">library(progressr) handlers(global = TRUE) handlers(&quot;rstudio&quot;) y &lt;- slow_sum(1:10) </code></pre> <p>The progress bar disappears when the calculation completes.</p> <h2 id="tweaks-to-withprogressshiny">Tweaks to withProgressShiny()</h2> <p>The <code>withProgressShiny()</code> function, which is a <strong>progressr</strong>-aware version of <code>withProgress()</code>, gained argument <code>inputs</code>. It defaults to <code>inputs = list(message = NULL, detail = &quot;message&quot;)</code>, which says that a progress message should update the &lsquo;detail&rsquo; part of the Shiny progress panel. For example,</p> <pre><code class="language-r">X &lt;- 1:10 withProgressShiny(message = &quot;Calculation in progress&quot;, detail = &quot;Starting ...&quot;, value = 0, { p &lt;- progressor(along = X) y &lt;- lapply(X, FUN=function(x) { Sys.sleep(0.25) p(sprintf(&quot;x=%d&quot;, x)) }) }) </code></pre> <p>will start out as in the left panel of Figure&nbsp;2, and, as soon as the first progress signal is received, the &lsquo;detail&rsquo; part is updated with <code>x=1</code> as shown in the right panel.</p> <figure style="margin-top: 3ex;"> <table style="margin: 1ex;"> <tr style="margin: 1ex;"> <td> <img src="https://www.jottr.org/post/withProgressShiny_A_x=0.png" alt="A Shiny progress bar panel with a progress bar at 0% on top, with 'Calculation in progress' written in a bold large font, with 'Starting ...' 
written in a normal small font below."/> </td> <td> <img src="https://www.jottr.org/post/withProgressShiny_A_x=1.png" alt="A Shiny progress bar panel with a progress bar at 10% on top, with 'Calculation in progress' written in a bold large font, with 'x=1' written in a normal small font below."/> </td> </tr> </table> <figcaption> Figure 2: A Shiny progress panel that starts out with the 'message' part displaying "Calculation in progress" and the 'detail' part displaying "Starting ..." (left), and whose 'detail' part is updated to "x=1" (right) as soon as the first progress update comes in. </figcaption> </figure> <p>Prior to this new release, the default behavior was to update the &lsquo;message&rsquo; part of the Shiny progress panel. To revert to the old behavior, set argument <code>inputs</code> as in:</p> <pre><code class="language-r">X &lt;- 1:10 withProgressShiny(message = &quot;Starting ...&quot;, detail = &quot;Calculation in progress&quot;, value = 0, { p &lt;- progressor(along = X) y &lt;- lapply(X, FUN=function(x) { Sys.sleep(0.25) p(sprintf(&quot;x=%d&quot;, x)) }) }, inputs = list(message = &quot;message&quot;, detail = NULL)) </code></pre> <p>This results in what you see in Figure&nbsp;3. I think that the new behavior, as shown in Figure&nbsp;2, looks better and makes more sense.</p> <figure style="margin-top: 3ex;"> <table style="margin: 1ex;"> <tr style="margin: 1ex;"> <td> <img src="https://www.jottr.org/post/withProgressShiny_B_x=0.png" alt="A Shiny progress bar panel with a progress bar at 0% on top, with 'Starting ...' 
written in a bold large font, with 'Calculation in progress' written to the right of it and wrapping onto the next row."/> </td> <td> <img src="https://www.jottr.org/post/withProgressShiny_B_x=1.png" alt="A Shiny progress bar panel with a progress bar at 10% on top, with 'x=1' written in a bold large font, with 'Calculation in progress' written to the right of it."/> </td> </tr> </table> <figcaption> Figure 3: A Shiny progress panel that starts out with the 'message' part displaying "Starting ..." and the 'detail' part displaying "Calculation in progress" (left), and whose 'message' part is updated to "x=1" (right) as soon as the first progress update comes in. </figcaption> </figure> <h2 id="update-to-a-specific-amount-of-total-progress">Update to a specific amount of total progress</h2> <p>When using <strong>progressr</strong>, we start out by creating a progressor function that we then call to signal progress. For example, if we do:</p> <pre><code class="language-r">my_slow_fun &lt;- function() { p &lt;- progressr::progressor(steps = 10) count &lt;- 0 for (i in 1:10) { count &lt;- count + 1 Sys.sleep(1) p(sprintf(&quot;count=%d&quot;, count)) } count } </code></pre> <p>each call to <code>p()</code> corresponds to <code>p(amount = 1)</code>, which signals that our function has moved <code>amount = 1</code> steps closer to the total amount <code>steps = 10</code>. We can take smaller or bigger steps by specifying another <code>amount</code>.</p> <p>In this new version, I&rsquo;ve introduced a new beta feature that allows us to signal progress that says where we are in <em>absolute terms</em>. 
With this, we can do things like:</p> <pre><code class="language-r">my_slow_fun &lt;- function() { p &lt;- progressr::progressor(steps = 10) count &lt;- 0 for (i in 1:5) { count &lt;- count + 1 Sys.sleep(1) if (runif(1) &lt; 0.5) break p(sprintf(&quot;count=%d&quot;, count)) } ## In case we broke out of the loop early, ## make sure to update to 5/10 progress p(step = 5) for (i in 1:5) { count &lt;- count + 1 Sys.sleep(1) p(sprintf(&quot;count=%d&quot;, count)) } count } </code></pre> <p>When calling <code>my_slow_fun()</code>, we might see progress being reported as:</p> <pre><code>- [------------------------------------------------] 0% \ [===&gt;-------------------------------------] 10% count=1 | [=======&gt;---------------------------------] 20% count=2 \ [===================&gt;---------------------] 50% count=3 ... </code></pre> <p>Note how it took a leap from 20% to 50% when <code>count == 2</code>. If we run it again, the move to 50% might happen at another iteration.</p> <h2 id="wrapping-up">Wrapping up</h2> <p>There are also a few bug fixes, which you can read about in <a href="https://progressr.futureverse.org/news/index.html">NEWS</a>. And as usual, all of this also works when you run in parallel using the <a href="https://futureverse.org">future framework</a>.</p> <p>Make progress!</p> <h2 id="links">Links</h2> <ul> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> </ul> parallelly 1.26.0: Fast, Concurrent Setup of Parallel Workers (Finally) https://www.jottr.org/2021/06/10/parallelly-1.26.0/ Thu, 10 Jun 2021 15:00:00 -0700 https://www.jottr.org/2021/06/10/parallelly-1.26.0/ <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.26.0 is on CRAN. 
It comes with one major improvement and one new function:</p> <ul> <li><p>The setup of parallel workers is now <em>much faster</em>, which comes from using a concurrent, instead of sequential, setup strategy</p></li> <li><p>The new <code>freePort()</code> can be used to find a TCP port that is currently available</p></li> </ul> <h2 id="faster-setup-of-local-parallel-workers">Faster setup of local, parallel workers</h2> <p>In R 4.0.0, which was released in May 2020, <code>parallel::makeCluster(n)</code> gained the power of setting up the <code>n</code> local cluster nodes all at the same time, which greatly reduces the total setup time. Previously, the workers were set up one after the other, which involved a lot of waiting for each worker to get ready. You can read about the details in the <a href="https://developer.r-project.org/Blog/public/2020/03/17/socket-connections-update/index.html">Socket Connections Update</a> blog post by Tomas Kalibera and Luke Tierney on 2020-03-17.</p> <p><center> <img src="https://www.jottr.org/post/parallelly_faster_setup_of_cluster.png" alt="An X-Y graph with 'Total setup time (s)' on the vertical axis ranging from 0 to 55, and 'Number of cores' on the horizontal axis ranging from 0 to 128. Two smooth curves, which look very linear with intersection at the origin and unnoticeable variance, are drawn for the two setup strategies 'sequential' and 'parallel'. The 'sequential' line is much steeper." style="width: 65%;"/><br/> </center> <small><em>Figure: The total setup time versus the number of local cluster workers for the &ldquo;sequential&rdquo; setup strategy (red) and the new &ldquo;parallel&rdquo; strategy (turquoise). Data were collected on a 128-core Linux machine.<br/></em></small></p> <p>With this release of <strong>parallelly</strong>, <code>parallelly::makeClusterPSOCK(n)</code> gained the same skills. 
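</p>

<p>As a minimal sketch of what this looks like in code - assuming that <code>makeClusterPSOCK()</code> accepts a <code>setup_strategy</code> argument (passed on to <code>makeNodePSOCK()</code>); since <code>&quot;parallel&quot;</code> is the new default for local workers, spelling it out is optional:</p>

<pre><code class="language-r">library(parallelly)
## Launch eight local workers, set up concurrently (the new default)
cl &lt;- makeClusterPSOCK(8, setup_strategy = &quot;parallel&quot;)
print(cl)
parallel::stopCluster(cl)
</code></pre>

<p>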
I benchmarked the new, default &ldquo;parallel&rdquo; setup strategy against the previous &ldquo;sequential&rdquo; strategy on a CentOS 7 Linux machine with 128 CPU cores and 512 GiB RAM while the machine was idle. I ran these benchmarks five times, which are summarized as smooth curves in the above figure. The variance between the replicate runs is tiny and the smooth curves appear almost linear. Assuming a linear relationship between setup time and number of cluster workers, a linear fit gives a speedup of approximately 50 times on this machine. It took 52 seconds to set up 122 (sic!) workers when using the &ldquo;sequential&rdquo; approach, whereas it took only 1.1 seconds with the &ldquo;parallel&rdquo; approach. Not surprisingly, rerunning these benchmarks with <code>parallel::makePSOCKcluster()</code> instead gives nearly identical results.</p> <p>Importantly, the &ldquo;parallel&rdquo; setup strategy, which is the new default, can only be used when setting up parallel workers running on the local machine. When setting up workers on external or remote machines, the &ldquo;sequential&rdquo; setup strategy will still be used.</p> <p>If you&rsquo;re using <strong><a href="https://future.futureverse.org">future</a></strong> and use</p> <pre><code class="language-r">plan(multisession) </code></pre> <p>you&rsquo;ll immediately benefit from this performance gain, because it relies on <code>parallelly::makeClusterPSOCK()</code> internally.</p> <p>All credit for this improvement in <strong>parallelly</strong> and <code>parallelly::makeClusterPSOCK()</code> should go to Tomas Kalibera and Luke Tierney, who implemented support for this in R 4.0.0.</p> <p><em>Edit 2021-06-11 and 2021-07-01</em>: There&rsquo;s a bug in R (&gt;= 4.0.0 &amp;&amp; &lt;= 4.1.0) causing the new <code>setup_strategy = &quot;parallel&quot;</code> to fail in the RStudio Console on some systems. 
If you&rsquo;re running <em>RStudio Console</em> and get &ldquo;Error in makeClusterPSOCK(workers, &hellip;) : Cluster setup failed. 8 of 8 workers failed to connect.&rdquo;, update to <strong>parallelly</strong> 1.26.1 released on 2021-06-30:</p> <pre><code class="language-r">install.packages(&quot;parallelly&quot;) </code></pre> <p>which will work around this problem. Alternatively, you can manually set:</p> <pre><code class="language-r">options(parallelly.makeNodePSOCK.setup_strategy = &quot;sequential&quot;) </code></pre> <p><em>Comment</em>: Note that I could only test with up to 122 parallel workers, and not 128, which is the number of CPU cores available on the test machine. The reason for this is that each worker consumes one R connection in the main R session, and R has a limit in the number of connections it can have open at any time. The typical R installation can only have 128 connections open, and three are always occupied by the standard input (stdin), the standard output (stdout), and the standard error (stderr). Thus, the absolute maximum number of workers I could use was 125. However, because I used the <strong><a href="https://progressr.futureverse.org">progressr</a></strong> package to report on progress, and a few other things that consumed a few more connections, I could only test up to 122 workers. You can read more about this limit in <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>?parallelly::freeConnections</code></a>, which also gives a reference for how to increase this limit by recompiling R from source.</p> <h2 id="find-an-available-tcp-port">Find an available TCP port</h2> <p>I&rsquo;ve also added <code>freePort()</code>, which will find a random port in [1024,65535] that is currently not occupied by another process on the machine. 
For example,</p> <pre><code class="language-r">&gt; freePort() [1] 30386 &gt; freePort() [1] 37882 </code></pre> <p>Using this function to pick a TCP port at random lowers the risk of trying to use an already occupied port, compared to picking one with just <code>sample(1024:65535, size=1)</code>.</p> <p>Just like <code>parallel::makePSOCKcluster()</code>, <code>parallelly::makeClusterPSOCK()</code> still uses <code>sample(11000:11999, size=1)</code> to find a random port. I want <code>freePort()</code> to get some more mileage and CRAN validation before switching over, but the plan is to use <code>freePort()</code> by default in the next release of <strong>parallelly</strong>.</p> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> </ul> parallelly 1.25.0: availableCores(omit=n) and, Finally, Built-in SSH Support for MS Windows 10 Users https://www.jottr.org/2021/04/30/parallelly-1.25.0/ Fri, 30 Apr 2021 15:00:00 -0700 https://www.jottr.org/2021/04/30/parallelly-1.25.0/ <p><center> <img src="https://www.jottr.org/post/nasa-climate-ice-core-small.jpg" alt="A 25-cm long ice core is held in front of the camera on a sunny day. The background is an endless snow-covered flat landscape and a bright blue sky." 
style="width: 65%;"/><br/> <small><em>A piece of an ice core - more pleasing to look at than yet another illustration of a CPU core<br/> <small>(Image credit: Ludovic Brucker, NASA&rsquo;s Goddard Space Flight Center)</small> </em></small> </center></p> <p><strong><a href="https://cran.r-project.org/package=parallelly">parallelly</a></strong> 1.25.0 is on CRAN. It comes with two major improvements:</p> <ul> <li><p>You can now use <code>availableCores(omit = n)</code> to ask for all but <code>n</code> CPU cores</p></li> <li><p><code>makeClusterPSOCK()</code> can finally use the built-in SSH client on MS Windows 10 to set up remote workers</p></li> </ul> <h1 id="availablecores-omit-n-is-your-new-friend">availableCores(omit = n) is your new friend</h1> <p>When running R code in parallel, many choose to parallelize on as many CPU cores as possible, e.g.</p> <pre><code class="language-r">ncores &lt;- parallel::detectCores() </code></pre> <p>It&rsquo;s also common to leave out a few cores so that we can still use the computer for other basic tasks, e.g. checking email, editing files, and browsing the web. This is often done by something like:</p> <pre><code class="language-r">ncores &lt;- parallel::detectCores() - 1 </code></pre> <p>which will return seven on a machine with eight CPU cores. If you look around, you also find that some leave two cores aside for other tasks;</p> <pre><code class="language-r">ncores &lt;- parallel::detectCores() - 2 </code></pre> <p>I&rsquo;m sorry to be the party killer, but <em>none of the above is guaranteed to work everywhere</em>. It might work on your computer but not on your collaborator&rsquo;s computer, or in the cloud, or on continuous integration (CI) services, etc. There are two problems with the above approaches. 
The help page of <code>parallel::detectCores()</code> describes the first problem:</p> <blockquote> <p><strong>Value</strong><br /> An integer, <code>NA</code> if the answer is unknown.</p> </blockquote> <p>Yup, <code>detectCores()</code> might return <code>NA</code>. Ouf!</p> <p>The second problem is that your code might run on a machine that has only one or two CPU cores. That means that <code>parallel::detectCores() - 1</code> may return zero, and <code>parallel::detectCores() - 2</code> may even return a negative value. You might think such machines no longer exist, but they do. The most common cases these days are virtual machines (VMs) running in the cloud. Note, if you&rsquo;re a package developer, GitHub Actions, Travis CI, and AppVeyor CI are all running in VMs with two cores.</p> <p>So, to make sure your code will run everywhere, you need to do something like:</p> <pre><code class="language-r">ncores &lt;- max(parallel::detectCores() - 1, 1, na.rm = TRUE) </code></pre> <p>With that approach, we know that <code>ncores</code> is at least one and never a missing value. I don&rsquo;t know about you, but I often do thinkos where I mix up <code>min()</code> and <code>max()</code>, which I&rsquo;m sure we don&rsquo;t want. So, let me introduce you to your new friend:</p> <pre><code class="language-r">ncores &lt;- parallelly::availableCores(omit = 1) </code></pre> <p>Just use that and you&rsquo;ll be fine everywhere - it&rsquo;ll always give you a value of one or greater. It&rsquo;s neater and less error prone. 
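</p>

<p>As a minimal sketch of how this plugs into a parallel backend - the backend choice here is just an example:</p>

<pre><code class="language-r">library(parallelly)
ncores &lt;- availableCores(omit = 1)  ## always at least one, never NA
future::plan(future::multisession, workers = ncores)
</code></pre>

<p>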
Also, in contrast to <code>parallel::detectCores()</code>, <code>parallelly::availableCores()</code> respects various CPU settings and configurations that the system wants you to follow.</p> <h1 id="makeclusterpsock-to-remote-machines-works-out-of-the-box-also-ms-windows-10">makeClusterPSOCK() to remote machines works out-of-the-box also on MS Windows 10</h1> <p>If you&rsquo;re into parallelizing across multiple machines, either on your local network, or remotely, say in the cloud, you can use:</p> <pre><code class="language-r">workers &lt;- parallelly::makeClusterPSOCK(c(&quot;n1.example.org&quot;, &quot;n2.example.org&quot;)) </code></pre> <p>to spawn two R workers running in the background on those two machines. We can use these workers with different R parallel backends, e.g. with bare-bones <strong>parallel</strong></p> <pre><code class="language-r">y &lt;- parallel::parLapply(workers, X, slow_fcn) </code></pre> <p>with <strong>foreach</strong> and the classical <strong>doParallel</strong> adapter,</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(workers) y &lt;- foreach(x = X) %dopar% slow_fcn(x) </code></pre> <p>and, obviously, my favorite, the <strong>future</strong> framework, which comes with lots of alternatives, e.g.</p> <pre><code class="language-r">library(future) plan(cluster, workers = workers) y &lt;- future.apply::future_lapply(X, slow_fcn) y &lt;- furrr::future_map(X, slow_fcn) library(foreach) doFuture::registerDoFuture() y &lt;- foreach(x = X) %dopar% slow_fcn(x) y &lt;- BiocParallel::bplapply(X, slow_fcn) </code></pre> <p>Now, in order to set up remote workers out of the box as shown above, you need to make sure you can do the following from the terminal:</p> <pre><code class="language-sh">{local}$ ssh n1.example.org Rscript --version R scripting front-end version 4.0.4 (2021-02-15) </code></pre> <p>If you can get to that point, you can also use those two remote machines to parallelize from your local computer, which, 
at least I think, is pretty cool. To get to that point, you basically need to configure SSH locally and remotely so that you can log in without having to enter a password, which you do by using SSH keys. It does <em>not</em> require admin rights, and it&rsquo;s not that hard to do when you know how to do it. Search the web for &ldquo;SSH key authentication&rdquo; for instructions, but the gist is that you create a public-private key pair locally and you copy the public one to the remote machine. The setup is the same for Linux, macOS, and MS Windows 10.</p> <p>What&rsquo;s new in <strong>parallelly</strong> 1.25.0 is that <em>MS Windows 10 users no longer have to install the PuTTY SSH client</em> - the Unix-compatible <code>ssh</code> client that comes with all MS Windows 10 installations works out of the box.</p> <p>The reason why we couldn&rsquo;t use the built-in Windows 10 client before is that it has a <a href="https://github.com/PowerShell/Win32-OpenSSH/issues/1265">bug preventing us from using it for reverse tunneling</a>, which is needed for remote, parallel processing. However, someone found a workaround, so that bug is no longer a blocker. 
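</p>

<p>For reference, the key-based setup described above boils down to roughly the following - the host name is the example one from before, and the exact commands may vary by system:</p>

<pre><code class="language-sh">ssh-keygen -t ed25519                  ## create a public-private key pair locally
ssh-copy-id n1.example.org             ## copy the public key to the remote machine
ssh n1.example.org Rscript --version   ## should no longer prompt for a password
</code></pre>

<p>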
Thus, now <code>makeClusterPSOCK()</code> works as we always wanted it to.</p> <h2 id="take-homes">Take-homes</h2> <ul> <li><p>Use <code>parallelly::availableCores()</code></p></li> <li><p>Remote parallelization from MS Windows 10 is now as easy as from Linux and macOS</p></li> </ul> <p>For all updates, including what bugs have been fixed, see the <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a> of <strong>parallelly</strong>.</p> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><p><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></p></li> <li><p><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></p></li> <li><p><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a>, <a href="https://future.apply.futureverse.org">pkgdown</a></p></li> <li><p><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/furrr">GitHub</a>, <a href="https://furrr.futureverse.org">pkgdown</a></p></li> </ul> <p>PS. 
If you&rsquo;re interested in learning more about ice cores and how they are used to track changes in our atmosphere and climate, see <a href="https://climate.nasa.gov/news/2616/core-questions-an-introduction-to-ice-cores/">Core questions: An introduction to ice cores</a> by Jessica Stoller-Conrad, NASA&rsquo;s Jet Propulsion Laboratory.</p> Using Kubernetes and the Future Package to Easily Parallelize R in the Cloud https://www.jottr.org/2021/04/08/future-and-kubernetes/ Thu, 08 Apr 2021 19:00:00 -0700 https://www.jottr.org/2021/04/08/future-and-kubernetes/ <p><em>This is a guest post by <a href="https://www.stat.berkeley.edu/~paciorek">Chris Paciorek</a>, Department of Statistics, University of California at Berkeley.</em></p> <p>In this post, I&rsquo;ll demonstrate that you can easily use the <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package in R on a cluster of machines running in the cloud, specifically on a Kubernetes cluster.</p> <p>This allows you to easily do parallel computing in R in the cloud. One advantage of doing this in the cloud is the ability to easily scale the number and type of (virtual) machines across which you run your parallel computation.</p> <h2 id="why-use-kubernetes-to-start-a-cluster-in-the-cloud">Why use Kubernetes to start a cluster in the cloud?</h2> <p>Kubernetes is a platform for managing containers. You can think of the containers as lightweight Linux machines on which you can do your computation. By using the Kubernetes service of a cloud provider such as Google Cloud Platform (GCP) or Amazon Web Services (AWS), you can easily start up a cluster of (virtual) machines.</p> <p>There have been (and are) approaches to starting up a cluster of machines on AWS easily from the command line on your laptop. Some tools that are no longer actively maintained are <a href="http://star.mit.edu/cluster">StarCluster</a> and <a href="https://cfncluster.readthedocs.io/en/latest">CfnCluster</a>. 
And there is now something called <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/getting_started.html">AWS ParallelCluster</a>. But doing it via Kubernetes allows you to build upon an industry standard platform that can be used on various cloud providers. A similar effort (which I heavily borrowed from in developing the setup described here) allows one to run a <a href="https://docs.dask.org/en/latest/setup/kubernetes-helm.html">Python Dask cluster</a> accessed via a Jupyter notebook.</p> <p>Many of the cloud providers have Kubernetes services (and it&rsquo;s also possible you&rsquo;d have access to a Kubernetes service running at your institution or company). In particular, I&rsquo;ve experimented with <a href="https://cloud.google.com/kubernetes-engine">Google Kubernetes Engine (GKE)</a> and <a href="https://aws.amazon.com/eks">Amazon&rsquo;s Elastic Kubernetes Service (EKS)</a>. This post will demonstrate setting up your cluster using Google&rsquo;s GKE, but see my GitHub <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository for details on doing it on Amazon&rsquo;s EKS. Note that while I&rsquo;ve gotten things to work on EKS, there have been <a href="https://github.com/paciorek/future-kubernetes#AWS-troubleshooting">various headaches</a> that I haven&rsquo;t encountered on GKE.</p> <p>I&rsquo;m not a Kubernetes expert, nor a GCP or AWS expert (that might explain the headaches I just mentioned), but one upside is that hopefully I&rsquo;ll go through all the details at a level someone who is not an expert can follow along. 
In fact, part of my goal in setting this up has been to learn more about Kubernetes, which I&rsquo;ve done, but note that there&rsquo;s <em>a lot</em> to it.</p> <p>More details about the setup, including how it was developed and troubleshooting tips can be found in my <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p> <h2 id="how-it-works-briefly">How it works (briefly)</h2> <p>This diagram in Figure 1 outlines the pieces of the setup.</p> <figure> <img src="https://www.jottr.org/post/k8s.png" alt="Overview of using future on a Kubernetes cluster" width="700"/> <figcaption style="font-style: italic;">Figure 1. Overview of using future on a Kubernetes cluster</figcaption> </figure> <p>Work on a Kubernetes cluster is divided amongst <em>pods</em>, which carry out the components of your work and can communicate with each other. A pod is basically a Linux container. (Strictly speaking a pod can contain multiple containers and shared resources for those containers, but for our purposes, it&rsquo;s simplest just to think of a pod as being a Linux container.) The pods run on the nodes in the Kubernetes cluster, where each Kubernetes node runs on a compute instance of the cloud provider. These instances are themselves virtual machines running on the cloud provider&rsquo;s actual hardware. (I.e., somewhere out there, behind all the layers of abstraction, there are actual real computers running on endless aisles of computer racks in some windowless warehouse!) 
One of the nice things about Kubernetes is that if a pod dies, Kubernetes will automatically restart it.</p> <p>The basic steps are:</p> <ol> <li>Start your Kubernetes cluster on the cloud provider&rsquo;s Kubernetes service</li> <li>Start the pods using Helm, the Kubernetes package manager</li> <li>Connect to the RStudio Server session running on the cluster from your browser</li> <li>Run your future-based computation</li> <li>Terminate the Kubernetes cluster</li> </ol> <p>We use the Kubernetes package manager, Helm, to run the pods of interest:</p> <ul> <li>one (scheduler) pod for a main process that runs RStudio Server and communicates with the workers</li> <li>multiple (worker) pods, each with one R worker process to act as the workers managed by the <strong>future</strong> package</li> </ul> <p>Helm manages the pods and related <em>services</em>. An example of a service is to open a port on the scheduler pod so the R worker processes can connect to that port, allowing the scheduler pod RStudio Server process to communicate with the worker R processes. I have a <a href="https://github.com/paciorek/future-helm-chart">Helm chart</a> that does this; it borrows heavily from the <a href="https://github.com/dask/helm-chart">Dask Helm chart</a> for the Dask package for Python.</p> <p>Each pod runs a Docker container. 
I use my own <a href="https://github.com/paciorek/future-kubernetes-docker">Docker container</a> that layers a bit on top of the <a href="https://rocker-project.org">Rocker</a> container that contains R and RStudio Server.</p> <h2 id="step-1-start-the-kubernetes-cluster">Step 1: Start the Kubernetes cluster</h2> <p>Here I assume you have already installed:</p> <ul> <li>the command line interface to Google Cloud,</li> <li>the <code>kubectl</code> interface for interacting with Kubernetes, and</li> <li><code>helm</code> for installing Helm charts (i.e., Kubernetes packages).</li> </ul> <p>Installation details can be found in the <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p> <p>First we&rsquo;ll start our cluster (the first part of Step 1 in Figure 1):</p> <pre><code class="language-sh">gcloud container clusters create \ --machine-type n1-standard-1 \ --num-nodes 4 \ --zone us-west1-a \ --cluster-version latest \ my-cluster </code></pre> <p>I&rsquo;ve asked for four virtual machines (nodes), using the basic (and cheap) <code>n1-standard-1</code> instance type (which has a single CPU per virtual machine) from Google Cloud Platform.</p> <p>You&rsquo;ll want to specify the total number of cores on the virtual machines to be equal to the number of R workers that you want to start and that you specify in the Helm chart (as discussed below). Here we ask for four one-cpu nodes, and our Helm chart starts four workers, so all is well. 
See the <a href="#modifications">Modifications section</a> below on how to start up a different number of workers.</p> <p>Since the RStudio Server process that you interact with wouldn&rsquo;t generally be doing heavy computation at the same time as the workers, it&rsquo;s OK that the RStudio scheduler pod and a worker pod would end up using the same virtual machine.</p> <h2 id="step-2-install-the-helm-chart-to-set-up-your-pods">Step 2: Install the Helm chart to set up your pods</h2> <p>Next we need to get our pods going by installing the Helm chart (i.e., package) on the cluster; the installed chart is called a <em>release</em>. As discussed above, the Helm chart tells Kubernetes what pods to start and how they are configured.</p> <p>First we need to give our account permissions to perform administrative actions:</p> <pre><code class="language-sh">kubectl create clusterrolebinding cluster-admin-binding \ --clusterrole=cluster-admin </code></pre> <p>Now let&rsquo;s install the release. This code assumes the use of Helm version 3 or greater (for older versions <a href="https://github.com/paciorek/future-kubernetes">see my full instructions</a>).</p> <pre><code class="language-sh">git clone https://github.com/paciorek/future-helm-chart # download the materials tar -czf future-helm.tgz -C future-helm-chart . # create a zipped archive (tarball) that `helm install` needs helm install --wait test ./future-helm.tgz # install (start the pods) </code></pre> <p>You&rsquo;ll need to name your release; I&rsquo;ve used &lsquo;test&rsquo; above.</p> <p>The <code>--wait</code> flag tells helm to wait until all the pods have started. 
Once that happens, you&rsquo;ll see a message about the release and how to connect to the RStudio interface, which we&rsquo;ll discuss further in the next section.</p> <p>We can check the pods are running:</p> <pre><code class="language-sh">kubectl get pods </code></pre> <p>You should see something like this (the alphanumeric characters at the ends of the names will differ in your case):</p> <pre><code>NAME READY STATUS RESTARTS AGE future-scheduler-6476fd9c44-mvmz6 1/1 Running 0 116s future-worker-54db85cb7b-47qsd 1/1 Running 0 115s future-worker-54db85cb7b-4xf4x 1/1 Running 0 115s future-worker-54db85cb7b-rj6bj 1/1 Running 0 116s future-worker-54db85cb7b-wvp4n 1/1 Running 0 115s </code></pre> <p>As expected, we have one scheduler and four workers.</p> <h2 id="step-3-connect-to-rstudio-server-running-in-the-cluster">Step 3: Connect to RStudio Server running in the cluster</h2> <p>Next we&rsquo;ll connect to the RStudio instance running via RStudio Server on our main (scheduler) pod, using the browser on our laptop (Step 3 in Figure 1).</p> <p>After installing the Helm chart, you should have seen a printout with some instructions on how to do this. First you need to connect a port on your laptop to the RStudio port on the main pod (running of course in the cloud):</p> <pre><code class="language-sh">export RSTUDIO_SERVER_IP=&quot;127.0.0.1&quot; export RSTUDIO_SERVER_PORT=8787 kubectl port-forward --namespace default svc/future-scheduler $RSTUDIO_SERVER_PORT:8787 &amp; </code></pre> <p>You can now connect from your browser to the RStudio Server instance by going to the URL: <a href="http://127.0.0.1:8787">http://127.0.0.1:8787</a>.</p> <p>Enter <code>rstudio</code> as the username and <code>future</code> as the password to login to RStudio.</p> <p>What&rsquo;s happening is that port 8787 on your laptop is forwarding to the port on the main pod on which RStudio Server is listening (which is also port 8787). 
So you can just act as if RStudio Server is accessible directly on your laptop.</p> <p>One nice thing about this is that there is no public IP address for someone to maliciously use to connect to your cluster. Instead the access is handled securely entirely through <code>kubectl</code> running on your laptop. However, it also means that you couldn&rsquo;t easily share your cluster with a collaborator. For details on configuring things so there is a public IP, please see <a href="https://github.com/paciorek/future-kubernetes#connecting-to-the-rstudio-instance-when-starting-the-cluster-from-a-remote-machine">my repository</a>.</p> <p>Note that there is nothing magical about running your computation via RStudio. You could <a href="#connect-to-a-pod">connect to the main pod</a> and simply run R in it and then use the <strong>future</strong> package.</p> <h2 id="step-4-run-your-future-based-parallel-r-code">Step 4: Run your future-based parallel R code</h2> <p>Now we&rsquo;ll start up our future cluster and run our computation (Step 4 in Figure 1):</p> <pre><code class="language-r">library(future) plan(cluster, manual = TRUE, quiet = TRUE) </code></pre> <p>The key thing is that we set <code>manual = TRUE</code> above. This ensures that the functions from the <strong>future</strong> package don&rsquo;t try to start R processes on the workers, as those R processes have already been started by Kubernetes and are waiting to connect to the main (RStudio Server) process.</p> <p>Note that we don&rsquo;t need to say how many future workers we want. This is because the Helm chart sets an environment variable in the scheduler pod&rsquo;s <code>Renviron</code> file based on the number of worker pod replicas. Since that variable is used by the <strong>future</strong> package (via <code>parallelly::availableCores()</code>) as the default number of future workers, this ensures that there are only as many future workers as you have worker pods. 
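</p>

<p>As a quick sanity check, you can ask the <strong>future</strong> package how many workers it will use - here assuming the four worker pods from the Helm chart above:</p>

<pre><code class="language-r">library(future)
plan(cluster, manual = TRUE, quiet = TRUE)
nbrOfWorkers()  ## should equal the number of worker pods, e.g. 4
</code></pre>

<p>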
However, if you modify the number of worker pods after installing the Helm chart, you may need to set the <code>workers</code> argument to <code>plan()</code> manually. (Note that if you were to specify more future workers than R worker processes (i.e., pods), you would get an error; if you were to specify fewer, you wouldn&rsquo;t be using all the resources that you are paying for.)</p> <p>Now we can use the various tools in the <strong>future</strong> package as we would if we were on our own machine or working on a Linux cluster.</p> <p>Let&rsquo;s run our parallelized operations. I&rsquo;m going to do the world&rsquo;s least interesting calculation: computing the mean of many (10 million) random numbers, forty separate times in parallel. Not interesting, but presumably if you&rsquo;re reading this you have your own interesting computation in mind and hopefully know how to do it using future&rsquo;s tools such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> and <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>.</p> <pre><code class="language-r">library(future.apply)
output &lt;- future_sapply(1:40, function(i) mean(rnorm(1e7)), future.seed = TRUE)
</code></pre> <p>Note that all of this assumes you&rsquo;re working interactively, but you can always reconnect to the RStudio Server instance after closing the browser, and any long-running code should continue running even if you close the browser.</p> <p>Figure 2 shows a screenshot of the RStudio interface.</p> <figure> <img src="https://www.jottr.org/post/rstudio.png" alt="RStudio interface, demonstrating use of future commands" width="700"/> <figcaption style="font-style: italic;">Figure 2.
Screenshot of the RStudio interface</figcaption> </figure> <h3 id="working-with-files">Working with files</h3> <p>Note that <code>/home/rstudio</code> will be your default working directory in RStudio and the RStudio Server process will be running as the user <code>rstudio</code>.</p> <p>You can use <code>/tmp</code> and <code>/home/rstudio</code> for files, both within RStudio and within code running on the workers, but note that files (even in <code>/home/rstudio</code>) are not shared between workers nor between the workers and the RStudio Server pod.</p> <p>To make data available to your RStudio process or get output data back to your laptop, you can use <code>kubectl cp</code> to copy files between your laptop and the RStudio Server pod. Here&rsquo;s an example of copying to/from <code>/home/rstudio</code>:</p> <pre><code class="language-sh">## create a variable with the name of the scheduler pod
export SCHEDULER=$(kubectl get pod --namespace default -o jsonpath='{.items[?(@.metadata.labels.component==&quot;scheduler&quot;)].metadata.name}')

## copy a file to the scheduler pod
kubectl cp my_laptop_file ${SCHEDULER}:home/rstudio/

## copy a file from the scheduler pod
kubectl cp ${SCHEDULER}:home/rstudio/my_output_file .
</code></pre> <p>Of course you can also interact with the web from your RStudio process, so you could download data to the RStudio process from the internet.</p> <h2 id="step-5-cleaning-up">Step 5: Cleaning up</h2> <p>Make sure to shut down your Kubernetes cluster, so you don&rsquo;t keep getting charged.</p> <pre><code class="language-sh">gcloud container clusters delete my-cluster --zone=us-west1-a
</code></pre> <h2 id="modifications">Modifications</h2> <p>You can modify the Helm chart in advance, before installing it.
For example you might want to install other R packages for use in your parallel code or change the number of workers.</p> <p>To add additional R packages, go into the <code>future-helm-chart</code> directory (which you created using the directions above in Step 2) and edit the <a href="https://github.com/paciorek/future-helm-chart/blob/master/values.yaml">values.yaml</a> file. Simply modify the lines that look like this:</p> <pre><code class="language-yaml">  env:
    # - name: EXTRA_R_PACKAGES
    #   value: data.table
</code></pre> <p>by removing the &ldquo;#&rdquo; comment characters and putting the R packages you want installed in place of <code>data.table</code>, with the names of the packages separated by spaces, e.g.,</p> <pre><code class="language-yaml">  env:
    - name: EXTRA_R_PACKAGES
      value: foreach doFuture
</code></pre> <p>In many cases you may want these packages installed on both the scheduler pod (where RStudio Server runs) and on the workers. If so, make sure to modify the lines above in both the <code>scheduler</code> and <code>worker</code> stanzas.</p> <p>To modify the number of workers, modify the <code>replicas</code> line in the <code>worker</code> stanza of the <a href="https://github.com/paciorek/future-helm-chart/blob/master/values.yaml">values.yaml</a> file.</p> <p>Then rebuild the Helm chart:</p> <pre><code class="language-sh">cd future-helm-chart   ## ensure you are in the directory containing `values.yaml`
tar -czf ../future-helm.tgz .
</code></pre> <p>and install as done previously.</p> <p>Note that doing the above to increase the number of workers would probably only make sense if you also modify the number of virtual machines you start your Kubernetes cluster with such that the total number of cores across the cloud provider compute instances matches the number of worker replicas.</p> <p>You may also be able to modify a running cluster. For example you could use <code>gcloud container clusters resize</code>.
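<p>For reference, here is a sketch of what such a resize might look like, assuming the cluster name and zone used earlier in this post; the <code>future-worker</code> deployment name is inferred from the pod names shown above, so verify it with <code>kubectl get deployments</code> before relying on it:</p> <pre><code class="language-sh">## grow the pool of VMs backing the Kubernetes cluster
gcloud container clusters resize my-cluster --zone=us-west1-a --num-nodes=8

## then scale the number of worker pods to match
kubectl scale deployment future-worker --replicas=8
</code></pre>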
I haven&rsquo;t experimented with this.</p> <p>To make changes once your Helm chart is already installed (i.e., your release is running), one simple option is to reinstall the Helm chart as discussed below. You may also need to kill the <code>port-forward</code> process discussed in Step 3.</p> <p>For some changes, you can also update a running release without uninstalling it by &ldquo;patching&rdquo; the running release or scaling resources. I won&rsquo;t go into details here.</p> <h2 id="troubleshooting">Troubleshooting</h2> <p>Things can definitely go wrong in getting all the pods to start up and communicate with each other. Here are some suggestions for monitoring what is going on and troubleshooting.</p> <p>First, you can use <code>kubectl</code> to check that the pods are running:</p> <pre><code class="language-sh">kubectl get pods
</code></pre> <h3 id="connect-to-a-pod">Connect to a pod</h3> <p>To connect to a pod, which allows you to check on installed software, check on what the pod is doing, and do other troubleshooting, you can do the following:</p> <pre><code class="language-sh">export SCHEDULER=$(kubectl get pod --namespace default -o jsonpath='{.items[?(@.metadata.labels.component==&quot;scheduler&quot;)].metadata.name}')
export WORKERS=$(kubectl get pod --namespace default -o jsonpath='{.items[?(@.metadata.labels.component==&quot;worker&quot;)].metadata.name}')

## access the scheduler pod:
kubectl exec -it ${SCHEDULER} -- /bin/bash

## access a worker pod:
echo $WORKERS
kubectl exec -it &lt;insert_name_of_a_worker&gt; -- /bin/bash
</code></pre> <p>Alternatively, just determine the name of the pod with <code>kubectl get pods</code> and then run the <code>kubectl exec -it ...</code> invocation above.</p> <p>Note that once you are in a pod, you can install software in the usual fashion of a Linux machine (in this case using <code>apt</code> commands such as <code>apt-get install</code>).</p> <h3 id="connect-to-a-virtual-machine">Connect to a virtual machine</h3>
<p>Or to connect directly to an underlying VM, you can first determine the name of the VM and then use the <code>gcloud</code> tools to connect to it.</p> <pre><code class="language-sh">kubectl get nodes
## now, connect to one of the nodes, 'gke-my-cluster-default-pool-8b490768-2q9v' in this case:
gcloud compute ssh gke-my-cluster-default-pool-8b490768-2q9v --zone us-west1-a
</code></pre> <h3 id="check-your-running-code">Check your running code</h3> <p>To check that your code is actually running in parallel, one can run the following test and check that the result contains the names of distinct worker pods.</p> <pre><code class="language-r">library(future.apply)
future_sapply(seq_len(nbrOfWorkers()), function(i) Sys.info()[[&quot;nodename&quot;]])
</code></pre> <p>You should see something like this:</p> <pre><code>[1] future-worker-54db85cb7b-47qsd future-worker-54db85cb7b-4xf4x
[3] future-worker-54db85cb7b-rj6bj future-worker-54db85cb7b-wvp4n
</code></pre> <p>One can also connect to the pods or to the underlying virtual nodes (as discussed above) and run Unix commands such as <code>top</code> and <code>free</code> to understand CPU and memory usage.</p> <h3 id="reinstall-the-helm-release">Reinstall the Helm release</h3> <p>You can restart your release (i.e., restarting the pods, without restarting the whole Kubernetes cluster):</p> <pre><code class="language-sh">helm uninstall test
helm install --wait test ./future-helm.tgz
</code></pre> <p>Note that you may need to restart the entire Kubernetes cluster if you&rsquo;re having difficulties that reinstalling the release doesn&rsquo;t fix.</p> <h2 id="how-does-it-work">How does it work?</h2> <p>I&rsquo;ve provided many of the details of how it works in my <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p> <p>The key pieces are:</p> <ol> <li>The <a href="https://github.com/paciorek/future-helm-chart">Helm chart</a> with the instructions for how to start the pods and any
associated services.</li> <li>The <a href="https://github.com/paciorek/future-kubernetes-docker">Rocker-based Docker container(s)</a> that the pods run.</li> </ol> <p>That&rsquo;s all there is to it &hellip; plus <a href="https://github.com/paciorek/future-kubernetes">these instructions</a>.</p> <p>Briefly:</p> <ol> <li>Based on the Helm chart, Kubernetes starts up the &lsquo;main&rsquo; or &lsquo;scheduler&rsquo; pod running RStudio Server and multiple worker pods each running an R process. All of the pods are running the Rocker-based Docker container</li> <li>The RStudio Server main process and the workers use socket connections (via the R function <code>socketConnection()</code>) to communicate: <ul> <li>the worker processes start R processes that are instructed to regularly make a socket connection using a particular port on the main scheduler pod</li> <li>when you run <code>future::plan()</code> (which calls <code>makeClusterPSOCK()</code>) in RStudio, the RStudio Server process attempts to make socket connections to the workers using that same port</li> </ul></li> <li>Once the socket connections are established, command of the RStudio session returns to you and you can run your future-based parallel R code.</li> </ol> <p>One thing I haven&rsquo;t had time to work through is how to easily scale the number of workers after the Kubernetes cluster is running and the Helm chart installed, or even how to auto-scale &ndash; starting up workers as needed based on the number of workers requested via <code>plan()</code>.</p> <h2 id="wrap-up">Wrap up</h2> <p>If you&rsquo;re interested in extending or improving this or collaborating in some fashion, please feel free to get in touch with me via the <a href="https://github.com/paciorek/future-kubernetes/issues">&lsquo;future-kubernetes&rsquo; issue tracker</a> or by email.</p> <p>And if you&rsquo;re interested in using R with Kubernetes, note that RStudio provides an integration of RStudio Server Pro with Kubernetes that 
should allow one to run future-based workflows in parallel.</p> <p>/Chris</p> <h2 id="links">Links</h2> <ul> <li><p>future-kubernetes repository:</p> <ul> <li>GitHub page: <a href="https://github.com/paciorek/future-kubernetes">https://github.com/paciorek/future-kubernetes</a></li> </ul></li> <li><p>future-kubernetes Helm chart:</p> <ul> <li>GitHub page: <a href="https://github.com/paciorek/future-helm-chart">https://github.com/paciorek/future-helm-chart</a></li> </ul></li> <li><p>future-kubernetes Docker container:</p> <ul> <li>GitHub page: <a href="https://github.com/paciorek/future-kubernetes-docker">https://github.com/paciorek/future-kubernetes-docker</a></li> </ul></li> <li><p>future package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> </ul> future.BatchJobs - End-of-Life Announcement https://www.jottr.org/2021/01/08/future.batchjobs-end-of-life-announcement/ Fri, 08 Jan 2021 09:00:00 -0800 https://www.jottr.org/2021/01/08/future.batchjobs-end-of-life-announcement/ <div style="width: 40%; margin: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/sign_out_of_service_do_not_use.png" alt="Sign: Out of Service - Do not use!"/> </center> </div> <p>This is an announcement that <strong><a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a></strong> - <em>A Future API for Parallel and Distributed Processing using BatchJobs</em> has been archived on CRAN. The package has been deprecated for years with a recommendation of using <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> instead. 
The latter has been on CRAN since June 2017 and builds upon the <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong> package, which itself supersedes the <strong><a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a></strong> package.</p> <p>To wrap up the three-and-a-half-year-long life of <strong><a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a></strong>, the very last version, 0.17.0, reached CRAN on 2021-01-04 and passed all CRAN checks as of 2021-01-08, when the package was requested to be formally archived. All versions ever existing on CRAN can be found at <a href="https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/">https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/</a>.</p> <p>Archiving the <strong>future.BatchJobs</strong> package will speed up new releases of the <strong>future</strong> package. In the past, some of the <strong>future</strong> releases required internal updates to reverse package dependencies such as <strong>future.BatchJobs</strong> to be rolled out on CRAN first in order for <strong>future</strong> to pass the CRAN incoming checks.</p> <h2 id="postscript">Postscript</h2> <p>The <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a> page mentions:</p> <blockquote> <p>Archived on 2021-01-08 at the request of the maintainer.</p> <p>Consider using package ‘<a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a>’ instead.</p> </blockquote> <p>I&rsquo;m happy to see that we can suggest another package on our archived package pages. All I did to get this was to mention it in my email to CRAN:</p> <blockquote> <p>Hi,</p> <p>please archive the &lsquo;future.BatchJobs&rsquo; package. It has zero reverse dependencies.
The package has been labelled deprecated for a long time now and has been superseded by the &lsquo;future.batchtools&rsquo; package.</p> <p>Thank you,<br /> Henrik</p> </blockquote> <h2 id="links">Links</h2> <ul> <li><p>future package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li><p>future.BatchJobs package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>All CRAN versions: <a href="https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/">https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li><p>future.batchtools package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> </ul> My Keynote 'Future' Presentation at the European Bioconductor Meeting 2020 https://www.jottr.org/2020/12/19/future-eurobioc2020-slides/ Sat, 19 Dec 2020 10:00:00 -0800 https://www.jottr.org/2020/12/19/future-eurobioc2020-slides/ <div style="width: 40%; margin: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/LukeZapia_20201218-EuroBioc2020-future_mindmap.jpg" alt="A hand-drawn summary of Henrik Bengtsson's future talk at the European Bioconductor Meeting 2020 in the form of a mindmap on a whiteboard" style="border: 1px solid #666;"/> <span style="font-size: 80%; font-style: italic;"><a href="https://twitter.com/_lazappi_">Luke
Zappia</a>'s summary of the talk</span> </center> </div> <p>I presented <em>Future: A Simple, Extendable, Generic Framework for Parallel Processing in R</em> at the <a href="https://eurobioc2020.bioconductor.org/">European Bioconductor Meeting 2020</a>, which took place online during the week of December 14-18, 2020.</p> <p>You&rsquo;ll find my slides (39 slides + Q&amp;A slides; 35 minutes) below:</p> <ul> <li><a href="https://www.jottr.org/presentations/EuroBioc2020/BengtssonH_20201218-futures-EuroBioc2020.abstract.txt">Title &amp; Abstract</a></li> <li><a href="https://docs.google.com/presentation/d/e/2PACX-1vTVyeaWRH251Pm8BfrlH1yK4Bd_YojEmo1I0VFxkoehnoxYJXglLdDf5T6_bTDv7lFJjwrXNYFBtfHT/pub?start=false&amp;loop=false&amp;delayms=10000">HTML</a> (Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/EuroBioc2020/BengtssonH_20201218-futures-EuroBioc2020.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=Ph8jItU7Dlo">Video</a> (YouTube)</li> </ul> <p>I want to thank the organizers for inviting me to this Bioconductor conference. The <a href="http://bioconductor.org/">Bioconductor Project</a> provides a powerful and important technical and social environment for developing and conducting computational research in bioinformatics and genomics. It has a great, world-wide community and engaging leadership which effortlessly keep delivering great tools (~2,000 R packages as of December 2020) and <a href="http://bioconductor.org/help/course-materials/">training</a> year after year.
I am honored for the opportunity to give a keynote presentation to this community.</p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li><p>Relevant packages mentioned in this talk:</p> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/DavisVaughan/furrr">GitHub</a></li> <li><strong>foreach</strong> package: <a href="https://cran.r-project.org/package=foreach">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/foreach">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a></li> <li><strong>doParallel</strong> package: <a href="https://cran.r-project.org/package=doParallel">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/doParallel">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>clustermq</strong> package: <a href="https://cran.r-project.org/package=clustermq">CRAN</a>, <a href="https://github.com/mschubert/clustermq">GitHub</a></li> <li><strong>BiocParallel</strong> package: <a href="https://cran.r-project.org/package=BiocParallel">CRAN</a>, <a href="https://github.com/Bioconductor/BiocParallel">GitHub</a></li> </ul></li> </ul> NYC R 
Meetup: Slides on Future https://www.jottr.org/2020/11/12/future-nycmeetup-slides/ Thu, 12 Nov 2020 19:30:00 -0800 https://www.jottr.org/2020/11/12/future-nycmeetup-slides/ <div style="width: 35%; margin: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/poster-for-nycmeetup2020-talk.png" alt="The official poster for this New York Open Statistical Programming Meetup"/> </center> </div> <p>I presented <em>Future: Simple, Friendly Parallel Processing for R</em> (67 minutes; 59 slides + Q&amp;A slides) at <a href="https://nyhackr.org/">New York Open Statistical Programming Meetup</a>, on November 9, 2020:</p> <ul> <li><a href="https://docs.google.com/presentation/d/1E2Gcm33_uMrhQL7jLzodlMXUefnSshHUdYsoXWAkFYE/edit?usp=sharing">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/NYCMeetup2020/BengtssonH_20191109-futures-NYC.pdf">PDF</a> (flat slides)</li> <li><a href="https://youtu.be/2ZlpFkFMy7E?t=630">Video</a> (presentation starts at 0h10m30s, Q&amp;A starts at 1h17m40s)</li> </ul> <p>I&rsquo;d like to thank everyone who attended and everyone who asked lots of brilliant questions during the Q&amp;A. I also want to express my gratitude to Amada, Jared, and Noam for the invitation and making this event possible.
It was great fun.</p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li><p>Relevant packages mentioned in this talk:</p> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/DavisVaughan/furrr">GitHub</a></li> <li><strong>foreach</strong> package: <a href="https://cran.r-project.org/package=foreach">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/foreach">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a></li> <li><strong>doParallel</strong> package: <a href="https://cran.r-project.org/package=doParallel">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/doParallel">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>future.tests</strong> package: <a href="https://cran.r-project.org/package=future.tests">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> <li><strong>clustermq</strong> package: <a href="https://cran.r-project.org/package=clustermq">CRAN</a>, <a href="https://github.com/mschubert/clustermq">GitHub</a></li> </ul></li> </ul> future 1.20.1 - The Future Just Got a Bit Brighter 
https://www.jottr.org/2020/11/06/future-1.20.1-the-future-just-got-a-bit-brighter/ Fri, 06 Nov 2020 13:00:00 -0800 https://www.jottr.org/2020/11/06/future-1.20.1-the-future-just-got-a-bit-brighter/ <p><center> <img src="https://www.jottr.org/post/sparkles-through-space.gif" alt="&quot;Short-loop artsy animation: Flying through colorful, sparkling lights positioned in circles with star-like lights on a black background in the distance&quot;" /> </center></p> <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.20.1 is on CRAN. It adds some new features, deprecates old and unwanted behaviors, adds a couple of vignettes, and fixes a few bugs.</p> <h1 id="interactive-debugging">Interactive debugging</h1> <p>First out among the new features, and a long-standing feature request, is the addition of argument <code>split</code> to <code>plan()</code>, which allows us to split, or &ldquo;tee&rdquo;, any output produced by futures.</p> <p>The default is <code>split = FALSE</code>, for which standard output and conditions are captured by the future and only relayed after the future has been resolved, i.e. the captured output is displayed and re-signaled on the main R session when the value of the future is queried. This emulates what we experience in R when not using futures, e.g. we can add temporary <code>print()</code> and <code>message()</code> statements to our code for quick troubleshooting. You can read more about this in blog post &lsquo;<a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>&rsquo;.</p> <p>However, if we want to use <code>debug()</code> or <code>browser()</code> for interactive debugging, we quickly realize they&rsquo;re not very useful because no output is visible, because their output, too, is captured by the future. This is where the new &ldquo;split&rdquo; feature comes to the rescue.
By using <code>split = TRUE</code>, the standard output and all non-error conditions are split (&ldquo;tee:d&rdquo;) on the worker&rsquo;s end, while still being captured by the future to be relayed back to the main R session at a later time. This means that we can debug &lsquo;sequential&rsquo; futures interactively. Here is an illustration of using <code>browser()</code> for debugging a future:</p> <pre><code class="language-r">&gt; library(future)
&gt; plan(sequential, split = TRUE)
&gt; mysqrt &lt;- function(x) { browser(); y &lt;- sqrt(x); y }
&gt; f &lt;- future(mysqrt(1:3))
Called from: mysqrt(1:3)
Browse[1]&gt; str(x)
 int [1:3] 1 2 3
Browse[1]&gt;
debug at #1: y &lt;- sqrt(x)
Browse[2]&gt;
debug at #1: y
Browse[2]&gt; str(y)
 num [1:3] 1 1.41 1.73
Browse[2]&gt; y[1] &lt;- 0
Browse[2]&gt; cont
&gt; v &lt;- value(f)
Called from: mysqrt(1:3)
 int [1:3] 1 2 3
debug at #1: y &lt;- sqrt(x)
debug at #1: y
 num [1:3] 1 1.41 1.73
&gt; v
[1] 0.000000 1.414214 1.732051
</code></pre> <p><em>Comment</em>: Note how the output produced while debugging is relayed also when <code>value()</code> is called. This is a somewhat unfortunate side effect from futures capturing <em>all</em> output produced while they are active.</p> <h1 id="preserved-logging-on-workers-e-g-future-batchtools">Preserved logging on workers (e.g. future.batchtools)</h1> <p>The added support for <code>split = TRUE</code> also means that we can now preserve all output in any log files that might be produced on parallel workers. For example, if you use <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> on a Slurm scheduler, you can use <code>plan(future.batchtools::batchtools_slurm, split = TRUE)</code> to make sure standard output, messages, warnings, etc. end up in the <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong> log files while still being relayed to the main R session at the end.
This way we can inspect cluster jobs while they still run, among other things. Here is a proof-of-concept example using a &lsquo;batchtools_local&rsquo; future:</p> <pre><code class="language-r">&gt; library(future.batchtools)
&gt; plan(batchtools_local, split = TRUE)
&gt; f &lt;- future({ message(&quot;Hello world&quot;); y &lt;- 42; print(y); sqrt(y) })
&gt; v &lt;- value(f)
[1] 42
Hello world
&gt; v
[1] 6.480741
&gt; loggedOutput(f)
 [1] &quot;### [bt]: This is batchtools v0.9.14&quot;
 [2] &quot;### [bt]: Starting calculation of 1 jobs&quot;
 [3] &quot;### [bt]: Setting working directory to '/home/alice/repositories/future'&quot;
 [4] &quot;### [bt]: Memory measurement disabled&quot;
 [5] &quot;### [bt]: Starting job [batchtools job.id=1]&quot;
 [6] &quot;### [bt]: Setting seed to 15794 ...&quot;
 [7] &quot;Hello world&quot;
 [8] &quot;[1] 42&quot;
 [9] &quot;&quot;
[10] &quot;### [bt]: Job terminated successfully [batchtools job.id=1]&quot;
[11] &quot;### [bt]: Calculation finished!&quot;
</code></pre> <p>Without <code>split = TRUE</code>, we would not get lines 7 and 8 in the <strong>batchtools</strong> logs.</p> <h1 id="near-live-progress-updates-also-from-multicore-futures">Near-live progress updates also from &lsquo;multicore&rsquo; futures</h1> <p>Second out among the new features is &lsquo;multicore&rsquo; futures, which now join &lsquo;sequential&rsquo;, &lsquo;multisession&rsquo;, and (local and remote) &lsquo;cluster&rsquo; futures in the ability to relay progress updates from <strong><a href="https://cran.r-project.org/package=progressr">progressr</a></strong> in a near-live fashion. This means that all of our most common parallelization backends support near-live progress updates.
If this is the first time you hear of <strong>progressr</strong>, here&rsquo;s an example of how it can be used in parallel processing:</p> <pre><code class="language-r">library(future.apply)
plan(multicore)

library(progressr)
handlers(&quot;progress&quot;)

xs &lt;- 1:5
with_progress({
  p &lt;- progressor(along = xs)
  y &lt;- future_lapply(xs, function(x, ...) {
    Sys.sleep(6.0-x)
    p(sprintf(&quot;x=%g&quot;, x))
    sqrt(x)
  })
})
# [=================&gt;------------------------------] 40% x=2
</code></pre> <p>Note that the progress updates signaled by <code>p()</code> update the progress bar almost instantly, even if the parallel workers run on a remote machine.</p> <h1 id="multisession-futures-agile-to-changes-in-r-s-library-path">Multisession futures agile to changes in R&rsquo;s library path</h1> <p>Third out is &lsquo;multisession&rsquo; futures, which now automatically inherit the package library path from the main R session. For instance, if you use <code>.libPaths()</code> to adjust your library path and <em>then</em> call <code>plan(multisession)</code>, the multisession workers will see the same packages as the parent session. This change is based on a feature request related to RStudio Connect.
With this update, it no longer matters which type of local futures you use - &lsquo;sequential&rsquo;, &lsquo;multisession&rsquo;, or &lsquo;multicore&rsquo; - your future code has access to the same set of installed packages.</p> <p>As a proof of concept, assume that we add <code>tempdir()</code> as a new folder to R&rsquo;s library path:</p> <pre><code class="language-r">&gt; .libPaths(c(tempdir(), .libPaths()))
&gt; .libPaths()
[1] &quot;/tmp/alice/RtmpwLKdrG&quot;
[2] &quot;/home/alice/R/x86_64-pc-linux-gnu-library/4.0-custom&quot;
[3] &quot;/home/alice/software/R-devel/tags/R-4-0-3/lib/R/library&quot;
</code></pre> <p>If we then launch a &lsquo;multisession&rsquo; future, we find that it uses the same library path:</p> <pre><code class="language-r">&gt; library(future)
&gt; plan(multisession)
&gt; f &lt;- future(.libPaths())
&gt; value(f)
[1] &quot;/tmp/alice/RtmpwLKdrG&quot;
[2] &quot;/home/alice/R/x86_64-pc-linux-gnu-library/4.0-custom&quot;
[3] &quot;/home/alice/software/R-devel/tags/R-4-0-3/lib/R/library&quot;
</code></pre> <h1 id="best-practices-for-package-developers">Best practices for package developers</h1> <p>I&rsquo;ve added a vignette &lsquo;<a href="https://cran.r-project.org/web/packages/future/vignettes/future-7-for-package-developers.html">Best Practices for Package Developers</a>&rsquo;, which hopefully provides some useful guidelines on how to write and validate future code so it will work on as many parallel backends as possible.</p> <h1 id="saying-goodbye-to-multiprocess-but-don-t-worry">Saying goodbye to &lsquo;multiprocess&rsquo; - but don&rsquo;t worry &hellip;</h1> <p>OK, let&rsquo;s discuss what is being removed. Using <code>plan(multiprocess)</code>, which was just an alias for &ldquo;<code>plan(multicore)</code> on Linux and macOS and <code>plan(multisession)</code> on MS Windows&rdquo;, is now deprecated.
If used, you will get a one-time warning:</p> <pre><code class="language-r">&gt; plan(multiprocess)
Warning message:
Strategy 'multiprocess' is deprecated in future (&gt;= 1.20.0). Instead, explicitly specify either 'multisession' or 'multicore'. In the current R session, 'multiprocess' equals 'multicore'.
</code></pre> <p>I recommend that you use <code>plan(multisession)</code> as a replacement for <code>plan(multiprocess)</code>. If you are on Linux or macOS, and are 100% sure that your code and all its dependencies are fork-safe, then you can also use <code>plan(multicore)</code>.</p> <p>Although &lsquo;multiprocess&rsquo; was neat to use in documentation and examples, it was at the same time ambiguous, and it risked introducing a platform-dependent behavior to those examples. For instance, it could be that the parallel code worked only for users on Linux and macOS because some non-exportable globals were used. If a user on MS Windows tried the same code, they might have gotten run-time errors. Vice versa, it could also be that code worked on MS Windows but not on Linux or macOS. Moreover, in <strong>future</strong> 1.13.0 (2019-05-08), support for &lsquo;multicore&rsquo; futures was disabled when running R via RStudio. This was done because forked parallel processing was deemed unstable in RStudio. This meant that a user on macOS who used <code>plan(multiprocess)</code> would end up getting &lsquo;multicore&rsquo; futures when running in the terminal while getting &lsquo;multisession&rsquo; futures when running in RStudio. 
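</p> <p>Tying back to the recommendation above, here is a minimal, hedged sketch of an explicit replacement for <code>plan(multiprocess)</code>, picking &lsquo;multicore&rsquo; only where forked processing is supported and falling back to &lsquo;multisession&rsquo; everywhere else:</p> <pre><code class="language-r">library(future)

## Instead of the deprecated plan(multiprocess), choose a backend explicitly;
## supportsMulticore() is typically FALSE on MS Windows and in RStudio
if (supportsMulticore()) {
  plan(multicore)     ## forked R processes
} else {
  plan(multisession)  ## background R sessions
}
</code></pre> <p>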
These types of platform-specific, environment-specific user experiences were confusing and complicated troubleshooting and communication, which is why it was decided to move away from &lsquo;multiprocess&rsquo; in favor of explicitly specifying &lsquo;multisession&rsquo; or &lsquo;multicore&rsquo;.</p> <h1 id="saying-goodbye-to-local-false-a-good-thing">Saying goodbye to &lsquo;local = FALSE&rsquo; - a good thing</h1> <p>In an effort to refine the Future API, the use of <code>future(..., local = FALSE)</code> is now deprecated. The only place where it is still supported, for backward-compatibility reasons, is when using &lsquo;cluster&rsquo; futures that are persistent, i.e. <code>plan(cluster, ..., persistent = TRUE)</code>. If you use the latter, I recommend that you start thinking about moving away from using <code>local = FALSE</code> also in those cases. Although <code>persistent = TRUE</code> is rarely used, I am aware that some of you have use cases that require objects to remain on the parallel workers also after a future has been resolved. If you have such needs, please see <a href="https://github.com/HenrikBengtsson/future/issues/433">future Issue #433</a>, particularly the parts on &ldquo;sticky globals&rdquo;. Feel free to add your comments and suggestions for how we best could move forward on this. The long-term goal is to get rid of both <code>local</code> and <code>persistent</code> in order to harmonize the Future API across <em>all</em> future backends.</p> <p>For recent bug fixes, please see the package <a href="https://cran.r-project.org/web/packages/future/NEWS">NEWS</a>.</p> <h1 id="what-s-on-the-horizon">What&rsquo;s on the horizon?</h1> <p>There are still lots of things on the roadmap. In no specific order, here are a few things in the works:</p> <ul> <li><p>Sticky globals for caching globals on workers. This will decrease the number of globals that need to be exported when launching futures. 
It addresses several related feature requests, e.g. future Issues <a href="https://github.com/HenrikBengtsson/future/issues/273">#273</a>, <a href="https://github.com/HenrikBengtsson/future/issues/339">#339</a>, <a href="https://github.com/HenrikBengtsson/future/issues/346">#346</a>, <a href="https://github.com/HenrikBengtsson/future/issues/431">#431</a>, and <a href="https://github.com/HenrikBengtsson/future/issues/437">#437</a>.</p></li> <li><p>Ability to terminate futures (for backends supporting it), which opens up the possibility of restarting failed futures, and more. This is a frequently requested feature, e.g. Issues <a href="https://github.com/HenrikBengtsson/future/issues/93">#93</a>, <a href="https://github.com/HenrikBengtsson/future/issues/188">#188</a>, <a href="https://github.com/HenrikBengtsson/future/issues/205">#205</a>, <a href="https://github.com/HenrikBengtsson/future/issues/213">#213</a>, and <a href="https://github.com/HenrikBengtsson/future/issues/236">#236</a>.</p></li> <li><p>Optional, zero-cost generic hook functions. Having them in place opens up for adding a framework for doing time-and-memory profiling/benchmarking of futures and their backends. Being able to profile futures and their backends will help identify bottlenecks and improve the performance of some of our parallel backends, e.g. Issues <a href="https://github.com/HenrikBengtsson/future/issues/49">#59</a>, <a href="https://github.com/HenrikBengtsson/future/issues/142">#142</a>, <a href="https://github.com/HenrikBengtsson/future/issues/239">#239</a>, and <a href="https://github.com/HenrikBengtsson/future/issues/437">#437</a>.</p></li> <li><p>Add support for global calling handlers in <strong>progressr</strong>. This is not specific to the future framework but since it&rsquo;s closely related, I figured I&rsquo;d mention it here too. A global calling handler for progress updates would remove the need for having to use <code>with_progress()</code> when monitoring progress. 
This would also help resolve the common problem where package developers want to provide progress updates without having to ask the user to use <code>with_progress()</code>, e.g. <strong>progressr</strong> Issues <a href="https://github.com/HenrikBengtsson/progressr/issues/78">#78</a>, <a href="https://github.com/HenrikBengtsson/progressr/issues/83">#83</a>, and <a href="https://github.com/HenrikBengtsson/progressr/issues/85">#85</a>.</p></li> </ul> <p>That&rsquo;s all for now - Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>, 2018-07-23</li> <li><a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 Slides on Progressr</a>, 2020-07-04</li> </ul> parallelly, future - Cleaning Up Around the House https://www.jottr.org/2020/11/04/parallelly-future-cleaning-up-around-the-house/ Wed, 04 Nov 2020 18:00:00 -0800 https://www.jottr.org/2020/11/04/parallelly-future-cleaning-up-around-the-house/ <blockquote cite="https://www.merriam-webster.com/dictionary/parallelly" style="font-size: 150%"> <strong>parallelly</strong> adverb<br> par·​al·​lel·​ly | \ ˈpa-rə-le(l)li \ <br> Definition: in a parallel manner </blockquote> <blockquote 
cite="https://www.merriam-webster.com/dictionary/future" style="font-size: 150%"> <strong>future</strong> noun<br> fu·​ture | \ ˈfyü-chər \ <br> Definition: existing or occurring at a later time </blockquote> <p>I&rsquo;ve cleaned up around the house - with the recent release of <strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.20.1, the package gained a dependency on the new <strong><a href="https://cran.r-project.org/package=parallelly">parallelly</a></strong> package. Now, if you&rsquo;re like me and concerned about bloating package dependencies, I&rsquo;m sure you immediately wondered why I chose to introduce a new dependency. I&rsquo;ll try to explain this below, but let me start by clarifying a few things:</p> <ul> <li><p>The functions in the <strong>parallelly</strong> package used to be part of the <strong>future</strong> package</p></li> <li><p>The functions have been removed from the <strong>future</strong> package, making that package smaller, while the total installation &ldquo;weight&rdquo; remains about the same when adding the <strong>parallelly</strong> package</p></li> <li><p>The <strong>future</strong> package re-exports these functions, i.e. 
for the time being, everything works as before</p></li> </ul> <p>Specifically, I’ve moved the following functions from the <strong>future</strong> package to the <strong>parallelly</strong> package:</p> <ul> <li><code>as.cluster()</code> - Coerce an object to a &lsquo;cluster&rsquo; object</li> <li><code>c(...)</code> - Combine multiple &lsquo;cluster&rsquo; objects into a single, large cluster</li> <li><code>autoStopCluster()</code> - Automatically stop a &lsquo;cluster&rsquo; when garbage collected</li> <li><code>availableCores()</code> - Get number of available cores on the current machine; a better, safer alternative to <code>parallel::detectCores()</code></li> <li><code>availableWorkers()</code> - Get set of available workers</li> <li><code>makeClusterPSOCK()</code> - Create a PSOCK cluster of R workers for parallel processing; a more powerful alternative to <code>parallel::makePSOCKcluster()</code></li> <li><code>makeClusterMPI()</code> - Create a message passing interface (MPI) cluster of R workers for parallel processing; a tweaked version of <code>parallel::makeMPIcluster()</code></li> <li><code>supportsMulticore()</code> - Check if forked processing (&ldquo;multicore&rdquo;) is supported</li> </ul> <p>Because these are re-exported as-is, you can still use them as if they were part of the <strong>future</strong> package. For example, you may now use <code>availableCores()</code> as</p> <pre><code class="language-r">ncores &lt;- parallelly::availableCores() </code></pre> <p>or keep using it as</p> <pre><code class="language-r">ncores &lt;- future::availableCores() </code></pre> <p>One reason for moving these functions to a separate package is to make them readily available also outside of the future framework. 
For instance, using <code>parallelly::availableCores()</code> for deciding on the number of parallel workers is a <em>much</em> better and safer alternative than using <code>parallel::detectCores()</code> - see <code>help(&quot;availableCores&quot;, package = &quot;parallelly&quot;)</code> for why. Making these functions available in a lightweight package will attract additional users and developers that are not using futures. More users means more real-world validation, more vetting, and more feedback, which will improve these functions further and indirectly also the future framework.</p> <p>Another reason is that several of the functions in <strong>parallelly</strong> are bug fixes and improvements to functions in the <strong>parallel</strong> package. By extracting these functions from the <strong>future</strong> package and putting them in a standalone package, it should be clearer what these improvements are. At the same time, it should lower the threshold of getting these improvements into the <strong>parallel</strong> package, where I hope they will end up one day. <em>The <strong>parallelly</strong> package comes with an open invitation to the R Core to incorporate <strong>parallelly</strong>&rsquo;s implementation or ideas into <strong>parallel</strong>.</em></p> <p>For users of the future framework, maybe the most important reason for this migration is <em>speedier implementation of improvements and feature requests for the <strong>future</strong> package and the future ecosystem</em>. Over the years, many discussions around enhancing <strong>future</strong> came down to enhancing the functions that are now part of the <strong>parallelly</strong> package, especially for adding new features to <code>makeClusterPSOCK()</code>, which is the internal work horse for setting up &lsquo;multisession&rsquo; parallel workers but also used explicitly by many when setting up other types of &lsquo;cluster&rsquo; workers. 
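</p> <p>As a minimal sketch (with two local workers assumed purely for illustration), such an explicitly created cluster can be used as a future backend like this:</p> <pre><code class="language-r">library(future)

## Set up two local R workers and use them as the future backend
cl &lt;- parallelly::makeClusterPSOCK(2)
plan(cluster, workers = cl)

f &lt;- future(Sys.getpid())
value(f)  ## PID of one of the two workers

parallel::stopCluster(cl)
</code></pre> <p>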
The roles and responsibilities of the <strong>parallelly</strong> and <strong>future</strong> packages are well separated, which should make it straightforward to further improve on these functions. For example, if we want to introduce a new argument to <code>makeClusterPSOCK()</code>, or change one of its defaults (e.g. use the faster <code>useXDR = FALSE</code>), we can now discuss and test them more quickly and often without having to bring futures into the discussion. Don&rsquo;t worry - <strong>parallelly</strong> will undergo the same, <a href="https://www.jottr.org/2020/11/04/trust-the-future/">strict validation process as the <strong>future</strong> package</a> does to avoid introducing breaking changes to the future framework. For example, reverse-dependency checks will be run on first-generation (e.g. <strong>future</strong>) and second-generation (e.g. <strong>future.apply</strong>, <strong>furrr</strong>, <strong>doFuture</strong>, <strong>drake</strong>, <strong>mlr3</strong>, <strong>plumber</strong>, <strong>promises</strong>, and <strong>Seurat</strong>) dependencies.</p> <p>Happy parallelly futuring!</p> <p><small> <sup>*</sup> I&rsquo;ll try to make another post in a couple of days covering the new features that come with <strong>future</strong> 1.20.1. Stay tuned. 
</small></p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a></li> </ul> Trust the Future https://www.jottr.org/2020/11/04/trust-the-future/ Wed, 04 Nov 2020 14:00:00 -0800 https://www.jottr.org/2020/11/04/trust-the-future/ <p><center> <img src="https://www.jottr.org/post/you_dont_have_to_worry_about_your_future.jpg" alt="A fortune cookie that reads 'You do not have to worry about your future'" style="border: solid 1px; max-width: 70%"/> </center></p> <p>Each time we use R to analyze data, we rely on the assumption that the functions used produce correct results. If we can&rsquo;t make this assumption, we have to spend a lot of time validating every nitty-gritty detail. Luckily, we don&rsquo;t have to do this. There are many reasons why we can comfortably use R for our analyses and some of them are unique to R. 
Here are some I could think of while writing this blog post - I&rsquo;m sure I forgot something:</p> <ul> <li><p>R is a functional language with few side effects (&ldquo;just like mathematical functions&rdquo;)</p></li> <li><p>R, and its predecessor S, have undergone lots of real-world validation over the last two to three decades</p></li> <li><p>Millions of users and developers use and vet R regularly, which increases the chances for detecting mistakes and bugs</p></li> <li><p>R has one established, agreed-upon framework for validating an R package: <code>R CMD check</code></p></li> <li><p>The majority of R packages are distributed through a single repository (CRAN)</p></li> <li><p>CRAN requires that all R packages pass checks on past, current, and upcoming R versions, across operating systems (MS Windows, Linux, macOS, and Solaris), and on different compilers</p></li> <li><p>New checks are continuously added to <code>R CMD check</code>, causing the quality of new and existing R packages to improve over time</p></li> <li><p>CRAN asserts that package updates do not break reverse package dependencies</p></li> <li><p>R developers spend a substantial amount of time validating their packages</p></li> <li><p>R has users and developers with various backgrounds and areas of expertise</p></li> <li><p>R has a community that actively engages in discussions on best practices, troubleshooting, bug fixes, testing, and language development</p></li> <li><p>There are many third-party contributed tools for developing and testing R packages</p></li> </ul> <p>I think <a href="https://twitter.com/j_v_66">Jan Vitek</a> summarized it well in the &lsquo;Why R?&rsquo; panel discussion on <a href="https://youtu.be/uiEhmKN1RJo?t=1917">&lsquo;Performance in R&rsquo;</a> on 2020-09-26:</p> <blockquote> <p>R is an ecosystem. It is not a language. The language is the little bit on top. 
You come for the ecosystem - the books, all of the questions and answers, the snippets of code, the quality of CRAN. &hellip; The quality assurance that CRAN brings &hellip; we don&rsquo;t have that in any other language that I know of.</p> </blockquote> <p>Without the above technical and social ecosystem, I believe the quality of my own R packages would have been substantially lower. Regardless of how many unit tests I would write, I could never achieve the same amount of validation that the full R ecosystem brings to the table.</p> <p>When you use the <a href="https://cran.r-project.org/package=future">future framework for parallel and distributed processing</a>, it is essential that it delivers a level of correctness and reproducibility corresponding to what you get when implementing the same task sequentially. Because of this, validation is a <em>top priority</em> and part of the design and implementation throughout the future ecosystem. Below, I summarize how it is validated:</p> <ul> <li><p>All the essential core packages that are part of the future framework, <strong><a href="https://cran.r-project.org/package=future">future</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=globals">globals</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=listenv">listenv</a></strong>, and <strong><a href="https://cran.r-project.org/package=parallelly">parallelly</a></strong>, implement a rich set of package tests. 
These are validated regularly across the wide range of operating systems (Linux, Solaris, macOS, and MS Windows) and R versions available on CRAN, on continuous integration (CI) services (<a href="https://github.com/features/actions">GitHub Actions</a>, <a href="https://travis-ci.org/">Travis CI</a>, and <a href="https://www.appveyor.com/">AppVeyor CI</a>), and on <a href="https://builder.r-hub.io/">R-hub</a>.</p></li> <li><p>For each new release, these packages undergo full reverse-package dependency checks using <strong><a href="https://github.com/r-lib/revdepcheck">revdepcheck</a></strong>. As of October 2020, the <strong>future</strong> package is tested against more than 140 direct reverse-package dependencies available on CRAN and Bioconductor, including packages <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>, <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>, <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, <strong><a href="https://cran.r-project.org/package=drake">drake</a></strong>, <strong><a href="https://cran.r-project.org/package=googleComputeEngineR">googleComputeEngineR</a></strong>, <strong><a href="https://cran.r-project.org/package=mlr3">mlr3</a></strong>, <strong><a href="https://cran.r-project.org/package=plumber">plumber</a></strong>, <strong><a href="https://cran.r-project.org/package=promises">promises</a></strong> (used by <strong><a href="https://cran.r-project.org/package=shiny">shiny</a></strong>), and <strong><a href="https://cran.r-project.org/package=Seurat">Seurat</a></strong>. 
These checks are performed on Linux with both the default settings and when forcing tests to use multisession workers (SOCK clusters), which further validates that globals and packages are identified correctly.</p></li> <li><p>A suite of <em>Future API conformance tests</em> available in the <strong><a href="https://cran.r-project.org/package=future.tests">future.tests</a></strong> package validates the correctness of all future backends. Any new future backend developed must pass these tests to comply with the <em>Future API</em>. By conforming to this API, the end-user can trust that the backend will produce the same correct and reproducible results as any other backend, including the ones that the developer has tested on. Also, by making it the responsibility of the developer to assert that their new future backend conforms to the <em>Future API</em>, we relieve other developers from having to test that their future-based software works on all backends. It would be a daunting task for a developer to validate the correctness of their software with all existing backends. Even if they achieved that, there may be additional third-party future backends that they are not aware of, that they do not have the possibility to test with, or that are yet to be developed. The <strong>future.tests</strong> framework was sponsored by an <a href="https://www.r-consortium.org/projects/awarded-projects">R Consortium ISC grant</a>.</p></li> <li><p>Since <strong><a href="https://CRAN.R-Project.org/package=foreach">foreach</a></strong> is used by a large number of essential CRAN packages, it provides an excellent opportunity for supplementary validation. 
Specifically, I dynamically tweak the examples of <strong><a href="https://CRAN.R-Project.org/package=foreach">foreach</a></strong> and popular CRAN packages <strong><a href="https://CRAN.R-Project.org/package=caret">caret</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=glmnet">glmnet</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=NMF">NMF</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=plyr">plyr</a></strong>, and <strong><a href="https://CRAN.R-Project.org/package=TSP">TSP</a></strong> to use the <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> adaptor. This allows me to run these examples with a variety of future backends to validate that the examples produce no run-time errors, which indirectly validates the backends and the <em>Future API</em>. In the past, these types of tests helped to identify and resolve corner cases where automatic identification of global variables would fail. As a side note, several of these foreach-based examples fail when using a parallel foreach adaptor because they do not properly export globals or declare package dependencies. 
The exception is when using the sequential <em>doSEQ</em> adaptor (default), fork-based ones such as <strong><a href="https://CRAN.R-Project.org/package=doMC">doMC</a></strong>, or the generic <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, which supports any future backend and relies on the future framework for handling globals and packages.</p></li> <li><p>Analogously to the above reverse-dependency checks of each new release, CRAN and Bioconductor continuously run checks on all these direct, but also indirect, reverse dependencies, which further increases the validation of the <em>Future API</em> and the future ecosystem at large.</p></li> </ul> <p>May the future be with you!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.tests</strong> package: <a href="https://cran.r-project.org/package=future.tests">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> </ul> future 1.19.1 - Making Sure Proper Random Numbers are Produced in Parallel Processing https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/ Tue, 22 Sep 2020 19:00:00 -0700 https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/ <p><center> <img src="https://www.jottr.org/post/Digital_rain_animation_medium_letters_clear.gif" alt="&quot;Animation of green Japanese Kana symbols raining down in parallel on a black background inspired by The Matrix movie&quot;" /> <small><em>Parallel &lsquo;Digital Rain&rsquo; by <a href="https://commons.wikimedia.org/w/index.php?curid=63377054">Jahobr</a></em></small> </center></p> <p>After two-and-a-half months, <strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.19.1 is now on CRAN. 
As usual, there are some bug fixes and minor improvements here and there (<a href="https://cran.r-project.org/web/packages/future/NEWS">NEWS</a>), including things needed by the next version of <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>. For those of you who use Slurm or LSF/OpenLava as a scheduler on your high-performance compute (HPC) cluster, <code>future::availableCores()</code> will now do a better job respecting the CPU resources that those schedulers allocate for your R jobs.</p> <p>With all that said, the most significant update is that <strong>an informative warning is now given if random numbers were produced unexpectedly</strong>. Here &ldquo;unexpectedly&rdquo; means that the developer did not declare that their code needs random numbers.</p> <p>If you are just interested in the updates regarding random numbers and how to make sure your code is compliant, skip down to the section on &lsquo;<a href="#random-number-generation-in-the-future-framework">Random Number Generation in the Future Framework</a>&rsquo;. If you are curious how R generates random numbers and how that matters when we use parallel processing, keep on reading.</p> <p><em>Disclaimer</em>: I should clarify that, although I understand some algorithms and statistical aspects behind random number generation, my knowledge is limited. If you find mistakes below, please let me know so I can correct them. If you have ideas on how to improve this blog post, or parallel random number generation, I am grateful for such suggestions.</p> <h2 id="random-number-generation-in-r">Random Number Generation in R</h2> <p>Being able to generate high-quality random numbers is essential in many areas. For example, we use random number generation in cryptography to produce public-private key pairs. If there is a correlation in the random numbers produced, there is a risk that someone can reverse engineer the private key. 
In statistics, we need random numbers in simulation studies, bootstrap, and permutation tests. The correctness of these methods relies on the assumption that the random numbers drawn are &ldquo;as random as possible&rdquo;. What we mean by &ldquo;as random as possible&rdquo; depends on context and there are several ways to measure &ldquo;amount of randomness&rdquo;, e.g. amount of autocorrelation in the sequence of numbers produced.</p> <p>As developers, statisticians, and data scientists, we often have better things to do than to validate the quality of random numbers. Instead, we just want to rely on the computer to produce random numbers that are &ldquo;good enough.&rdquo; This is often safe to do because most programming languages produce high-quality random numbers out of the box. However, <strong>when we run our algorithms in parallel, random number generation becomes more complicated</strong> and we have to make efforts to get it right.</p> <p>In software, a so-called <em>random number generator</em> (RNG) produces all random numbers. Although hardware RNGs exist (e.g. thermal noise), by far the most common way to produce random numbers is through a pseudo RNG. A pseudo RNG uses an algorithm that produces a sequence of numbers that appear to be random but is fully deterministic given its initial state. 
For example, in R, we can draw one or more (pseudo) random numbers in $[0,1]$ using <code>runif()</code>, e.g.</p> <pre><code class="language-r">&gt; runif(n = 5)
[1] 0.9400145 0.9782264 0.1174874 0.4749971 0.5603327
</code></pre> <p>We can control the RNG state via <code>set.seed()</code>, e.g.</p> <pre><code class="language-r">&gt; set.seed(42)
&gt; runif(n = 5)
[1] 0.9148060 0.9370754 0.2861395 0.8304476 0.6417455
</code></pre> <p>If we use this technique, we can regenerate the same pseudo random numbers at a later time if we reset to the same initial RNG state, i.e.</p> <pre><code class="language-r">&gt; set.seed(42)
&gt; runif(n = 5)
[1] 0.9148060 0.9370754 0.2861395 0.8304476 0.6417455
</code></pre> <p>This works also after restarting R, on other computers, and on other operating systems. Being able to set the initial RNG state this way allows us to produce numerically reproducible results even when the methods involved rely on randomness.</p> <p>Normally, there is no need to set the RNG state, which is also referred to as &ldquo;the random seed&rdquo;. If not set, R uses a “random” initial RNG state based on various “random” properties such as the current timestamp and the process ID of the current R session. Because of this, we rarely have to set the random seed and things just work.</p> <h2 id="random-number-generation-for-parallel-processing">Random Number Generation for Parallel Processing</h2> <p>R does a superb job of taking care of us when it comes to random number generation - as long as we run our analysis sequentially in a single R process. Formally, R uses the Mersenne Twister RNG algorithm [1] by default, which we can set explicitly using <code>RNGkind(&quot;Mersenne-Twister&quot;)</code>. However, like many other RNG algorithms, the authors designed this one for generating random numbers sequentially but not in parallel. 
If we use it in parallel code, there is a risk that there will be a correlation between the random numbers generated in parallel, and, when taken together, they may no longer be &ldquo;random enough&rdquo; for our needs.</p> <p>A not-so-uncommon, ad hoc attempt to overcome this problem is to set a unique random seed for each parallel iteration, e.g.</p> <pre><code class="language-r">library(parallel)
cl &lt;- makeCluster(4)
y &lt;- parLapply(cl, 1:10, function(i) {
  set.seed(i)
  runif(n = 5)
})
stopCluster(cl)
</code></pre> <p>The idea is that although <code>i</code> and <code>i+1</code> are deterministic, <code>set.seed(i)</code> and <code>set.seed(i+1)</code> will set two different RNG states that are &ldquo;non-deterministic&rdquo; compared to each other, e.g. if we know one of them, we cannot predict the other. We can also find other variants of this approach. For instance, we can pre-generate a set of &ldquo;random&rdquo; random seeds and use them one-by-one in each iteration;</p> <pre><code class="language-r">library(parallel)
cl &lt;- makeCluster(4)
set.seed(42)
seeds &lt;- sample.int(n = 10)
y &lt;- parLapply(cl, seeds, function(seed) {
  set.seed(seed)
  runif(n = 5)
})
stopCluster(cl)
</code></pre> <p><strong>However, these approaches do <em>not</em> guarantee high-quality random numbers</strong>. Although not parallel-safe by itself, the latter approach resembles the gist of RNG algorithms designed for parallel processing.</p> <p>The L&rsquo;Ecuyer Combined Multiple Recursive random number Generators (CMRG) method [2,3] provides an RNG algorithm that works also for parallel processing. R has built-in support for this method via the <strong>parallel</strong> package. See <code>help(&quot;nextRNGStream&quot;, package = &quot;parallel&quot;)</code> for additional information. 
One way to use this is:</p> <pre><code class="language-r">library(parallel)
cl &lt;- makeCluster(4)
RNGkind(&quot;L'Ecuyer-CMRG&quot;)
set.seed(42)
seeds &lt;- list(.Random.seed)
for (i in 2:10) seeds[[i]] &lt;- nextRNGStream(seeds[[i - 1]])
y &lt;- parLapply(cl, seeds, function(seed) {
  ## Set the worker's RNG state (must go in the global environment)
  assign(&quot;.Random.seed&quot;, seed, envir = globalenv())
  runif(n = 5)
})
stopCluster(cl)
</code></pre> <p>Note the similarity to the previous attempt above. For convenience, R provides <code>parallel::clusterSetRNGStream()</code>, which allows us to do:</p> <pre><code class="language-r">library(parallel)
cl &lt;- makeCluster(4)
clusterSetRNGStream(cl, iseed = 42)
y &lt;- parLapply(cl, 1:10, function(i) {
  runif(n = 5)
})
stopCluster(cl)
</code></pre> <p><em>Comment</em>: Contrary to the manual approach, <code>clusterSetRNGStream()</code> does not create one RNG seed per iteration (here ten) but one per worker (here four). Because of this, the two examples will <em>not</em> produce the same random numbers despite using the same initial seed (42). When using <code>clusterSetRNGStream()</code>, the sequence of random numbers produced will depend on the number of parallel workers used, meaning the results will not be numerically identical unless we use the same number of parallel workers. Having said this, we are using a parallel-safe RNG algorithm here, so we still get high-quality random numbers without the risk of compromising our statistical analysis, if that is what we are running.</p> <h2 id="random-number-generation-in-the-future-framework">Random Number Generation in the Future Framework</h2> <p>The <strong><a href="https://cran.r-project.org/package=future">future</a></strong> framework, which provides a unifying approach to parallel processing in R, uses the L&rsquo;Ecuyer CMRG algorithm to generate all random numbers. There is no need to specify <code>RNGkind(&quot;L'Ecuyer-CMRG&quot;)</code> - if not already set, the future framework will still use it internally. 
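</p> <p>We can see this internal switch with a small sketch; here the main R session keeps its default RNG kind, while a future created with <code>seed = TRUE</code> evaluates with a parallel-safe L&rsquo;Ecuyer-CMRG RNG state (the expected kinds are noted as comments, assuming default settings):</p> <pre><code class="language-r">library(future)
plan(multisession)

RNGkind()[1]
## Typically &quot;Mersenne-Twister&quot; - the default in the main R session

f &lt;- future(RNGkind()[1], seed = TRUE)
value(f)
## Expected to be &quot;L'Ecuyer-CMRG&quot;, because seed = TRUE sets up
## a parallel-safe RNG stream for the future
</code></pre> <p>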
At the lowest level, the Future API supports specifying the random seed for each individual future. However, most developers and end-users use the higher-level map-reduce APIs provided by the <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> and <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong> packages, which provide &ldquo;seed&rdquo; arguments for controlling the RNG behavior. Importantly, generating L&rsquo;Ecuyer-CMRG RNG streams comes with a significant overhead. Because of this, the default is to <em>not</em> generate them. If we intend to produce random numbers, we need to specify that via the &ldquo;seed&rdquo; argument, e.g.</p> <pre><code class="language-r">library(future.apply) y &lt;- future_lapply(1:10, function(i) { runif(n = 5) }, future.seed = TRUE) </code></pre> <p>and</p> <pre><code class="language-r">library(furrr) y &lt;- future_map(1:10, function(i) { runif(n = 5) }, .options = future_options(seed = TRUE)) </code></pre> <p>Contrary to generating RNG streams, checking if a future has used random numbers is quick. All we have to do is keep track of the RNG state and check if it is still the same afterward (after the future has been resolved). Starting with <strong>future</strong> 1.19.0, <strong>the future framework will warn us whenever we use the RNG without declaring it</strong>. For instance,</p> <pre><code class="language-r">&gt; y &lt;- future_lapply(1:10, function(i) { + runif(n = 5) + }) Warning message: UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. 
To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to &quot;ignore&quot;. </code></pre> <p>Although technically unnecessary, this warning will also be produced when running sequentially. This is to make sure that all future-based code will produce correct results when switching to a parallel backend.</p> <p>When using <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> the best practice is to use the <strong><a href="https://cran.r-project.org/package=doRNG">doRNG</a></strong> package to produce parallel-safe random numbers. This is true regardless of foreach adaptor and parallel backend used. Specifically, instead of using <code>%dopar%</code> we want to use <code>%dorng%</code>. For example, here is what it looks like if we use the <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> adaptor;</p> <pre><code class="language-r">library(foreach) library(doRNG) doFuture::registerDoFuture() future::plan(&quot;multisession&quot;) y &lt;- foreach(i = 1:10) %dorng% { runif(n = 5) } </code></pre> <p>The benefit of using the <strong>doFuture</strong> adaptor is that it will also detect when we, or packages that use <strong>foreach</strong>, forget to declare that the RNG is needed, e.g.</p> <pre><code class="language-r">y &lt;- foreach(i = 1:10) %dopar% { runif(n = 5) } Warning messages: 1: UNRELIABLE VALUE: Future ('doFuture-1') unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to &quot;ignore&quot;. ... 
</code></pre> <p>Note that there will be one warning per future, which, in the above examples, means one warning per parallel worker.</p> <p>If you are an end-user of a package that uses futures internally and you get these warnings, please report them to the maintainer of that package. You might have to use <code>options(warn = 2)</code> to upgrade the warning to an error and then <code>traceback()</code> to track down where the warning originates. It is not unlikely that they have forgotten about, or are not aware of, the problem of using a proper RNG for parallel processing. Regardless, the fix is for them to declare <code>future.seed = TRUE</code>. If these warnings are irrelevant and the maintainer does not believe there is an RNG issue, then they can declare that using <code>future.seed = NULL</code>, e.g.</p> <pre><code class="language-r">y &lt;- future_lapply(X, function(x) { ... }, future.seed = NULL) </code></pre> <p>The default is <code>future.seed = FALSE</code>, which means &ldquo;no random numbers will be produced, and if there are, then it is a mistake.&rdquo;</p> <p>Until the maintainer has corrected this, as an end-user you can silence these warnings by setting:</p> <pre><code class="language-r">options(future.rng.onMisuse = &quot;ignore&quot;) </code></pre> <p>which was the default until <strong>future</strong> 1.19.0. 
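<p>A minimal sketch of how this option behaves (assuming the <strong>future.apply</strong> package is installed); undeclared RNG use can be silenced entirely, or escalated to a hard error:</p>

```r
library(future.apply)
plan(sequential)

# Undeclared RNG use can be silenced entirely ...
options(future.rng.onMisuse = "ignore")
y <- future_lapply(1:2, function(i) runif(n = 1))  # no warning

# ... or escalated to a run-time error
options(future.rng.onMisuse = "error")
res <- tryCatch(future_lapply(1:2, function(i) runif(n = 1)),
                error = identity)
stopifnot(inherits(res, "error"))

options(future.rng.onMisuse = "warning")  # restore the default
```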
If you want to be conservative, you can even upgrade the warning to a run-time error by setting this option to <code>&quot;error&quot;</code>.</p> <p>If you are a developer and struggle to narrow down exactly which part of your code uses random number generation, see my blog post &lsquo;<a href="https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/">Detect When the Random Number Generator Was Used</a>&rsquo; for an example of how you can track the RNG state at the R prompt and get a notification whenever a function call used the RNG internally.</p> <h2 id="what-s-next-regarding-rng-and-futures">What&rsquo;s next regarding RNG and futures?</h2> <ul> <li><p>The higher-level map-reduce APIs in the future framework support perfectly reproducible random numbers regardless of future backend and number of parallel workers being used. This is convenient because it allows us to get identical results when we, for instance, move from a notebook to an HPC environment. The downside is that this RNG strategy requires that one RNG stream is created per iteration, which is expensive when there are many elements to iterate over. If one does not need numerically reproducible random numbers, then it would be sufficient and valid to produce one RNG stream per chunk, where we often have one chunk per worker, similar to what <code>parallel::clusterSetRNGStream()</code> does. It has been on the roadmap for a while to <a href="https://github.com/HenrikBengtsson/future.apply/issues/20">add support for per-chunk RNG streams</a> as well. The remaining thing we need to resolve is to decide on exactly how to specify that type of strategy, e.g. <code>future_lapply(..., future.seed = &quot;per-chunk&quot;)</code> versus <code>future_lapply(..., future.seed = &quot;per-element&quot;)</code>, where the latter is an alternative to today&rsquo;s <code>future.seed = TRUE</code>. 
I will probably address this in a new utility package <strong>future.mapreduce</strong> that can serve <strong>future.apply</strong>, <strong>furrr</strong>, and the like, so that they do not have to re-implement this locally, which is error prone and is how it works at the moment.</p></li> <li><p>L&rsquo;Ecuyer CMRG is not the only RNG algorithm designed for parallel processing, and some developers might want to use another method. There are already many CRAN packages that provide alternatives, e.g. <strong><a href="https://cran.r-project.org/package=dqrng">dqrng</a></strong>, <strong><a href="https://cran.r-project.org/package=qrandom">qrandom</a></strong>, <strong><a href="https://cran.r-project.org/package=random">random</a></strong>, <strong><a href="https://cran.r-project.org/package=randtoolbox">randtoolbox</a></strong>, <strong><a href="https://cran.r-project.org/package=rlecuyer">rlecuyer</a></strong>, <strong><a href="https://cran.r-project.org/package=rngtools">rngtools</a></strong>, <strong><a href="https://cran.r-project.org/package=rngwell19937">rngwell19937</a></strong>, <strong><a href="https://cran.r-project.org/package=rstream">rstream</a></strong>, <strong><a href="https://cran.r-project.org/package=rTRNG">rTRNG</a></strong>, and <strong><a href="https://cran.r-project.org/package=sitmo">sitmo</a></strong>. It is on the long-term road map to support other types of parallel RNG methods. It will require a fair bit of work to come up with a unifying API for this and then a substantial amount of testing and validation to make sure it is correct.</p></li> </ul> <p>Happy random futuring!</p> <h2 id="references">References</h2> <ol> <li><p>Matsumoto, M. and Nishimura, T. (1998). Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator, <em>ACM Transactions on Modeling and Computer Simulation</em>, 8, 3–30.</p></li> <li><p>L&rsquo;Ecuyer, P. (1999). 
Good parameters and implementations for combined multiple recursive random number generators. <em>Operations Research</em>, 47, 159–164. doi: <a href="https://doi.org/10.1287/opre.47.1.159">10.1287/opre.47.1.159</a>.</p></li> <li><p>L&rsquo;Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002). An object-oriented random-number package with many long streams and substreams. <em>Operations Research</em>, 50, 1073–1075. doi: <a href="https://doi.org/10.1287/opre.50.6.1073.358">10.1287/opre.50.6.1073.358</a>.</p></li> </ol> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> adapter)</li> </ul> Detect When the Random Number Generator Was Used https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/ Mon, 21 Sep 2020 18:45:00 -0700 https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/ <p><center> <img src="https://www.jottr.org/post/DistortedRecentEland_50pct.gif" alt="&quot;An animated close-up of a spinning roulette wheel&quot;" /> </center></p> <p>If you ever need to figure out if a function call in R generated a random number or not, here is a simple trick that you can use in an interactive R session. Add the following to your <code>~/.Rprofile</code>(*):</p> <pre><code class="language-r">if (interactive()) { invisible(addTaskCallback(local({ last &lt;- .GlobalEnv$.Random.seed function(...) 
{ curr &lt;- .GlobalEnv$.Random.seed if (!identical(curr, last)) { msg &lt;- &quot;TRACKER: .Random.seed changed&quot; if (requireNamespace(&quot;crayon&quot;, quietly=TRUE)) msg &lt;- crayon::blurred(msg) message(msg) last &lt;&lt;- curr } TRUE } }), name = &quot;RNG tracker&quot;)) } </code></pre> <p>It works by checking whether or not the state of the random number generator (RNG), that is, <code>.Random.seed</code> in the global environment, was changed. If it has changed, a note is produced. For example,</p> <pre><code class="language-r">&gt; sum(1:100) [1] 5050 &gt; runif(1) [1] 0.280737 TRACKER: .Random.seed changed &gt; </code></pre> <p>It is not always obvious that a function generates random numbers internally. For instance, the <code>rank()</code> function may or may not update the RNG state depending on the argument <code>ties.method</code>, as illustrated in the following example:</p> <pre><code class="language-r">&gt; x &lt;- c(1, 4, 3, 2) &gt; rank(x) [1] 1.0 2.5 2.5 4.0 &gt; rank(x, ties.method = &quot;random&quot;) [1] 1 3 2 4 TRACKER: .Random.seed changed &gt; </code></pre> <p>For some functions, it may even depend on the input data whether or not random numbers are generated, e.g.</p> <pre><code class="language-r">&gt; y &lt;- matrixStats::rowRanks(matrix(c(1,2,2), nrow=2, ncol=3), ties.method = &quot;random&quot;) TRACKER: .Random.seed changed &gt; y &lt;- matrixStats::rowRanks(matrix(c(1,2,3), nrow=2, ncol=3), ties.method = &quot;random&quot;) &gt; </code></pre> <p>I have this RNG tracker enabled all the time to learn about functions that unexpectedly draw random numbers internally, which can be important to know when you run statistical analyses in parallel.</p> <p>As a bonus, if you have the <strong><a href="https://cran.r-project.org/package=crayon">crayon</a></strong> package installed, the RNG tracker will output the note with a style that is less intrusive.</p> <p>(*) If you use the <strong><a 
href="https://cran.r-project.org/package=startup">startup</a></strong> package, you can add it to a new file <code>~/.Rprofile.d/interactive=TRUE/rng_tracker.R</code>. To learn more about the <strong>startup</strong> package, have a look at the <a href="https://www.jottr.org/tags/startup/">blog posts on <strong>startup</strong></a>.</p> <p>EDIT 2020-09-23: Changed the message prefix from &lsquo;NOTE:&rsquo; to &lsquo;TRACKER:&lsquo;.</p> future and future.apply - Some Recent Improvements https://www.jottr.org/2020/07/11/future-future.apply-recent-improvements/ Sat, 11 Jul 2020 22:15:00 -0700 https://www.jottr.org/2020/07/11/future-future.apply-recent-improvements/ <p>There are new versions of <strong><a href="https://cran.r-project.org/package=future">future</a></strong> and <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> - your friends in the parallelization business - on CRAN. These updates are mostly maintenance updates with bug fixes, some improvements, and preparations for upcoming changes. 
It&rsquo;s been some time since I blogged about these packages, so here is a summary of the main updates thus far since early 2020:</p> <ul> <li><p><strong>future</strong>:</p> <ul> <li><p><code>values()</code> for lists and other containers was renamed to <code>value()</code> to simplify the API [future 1.17.0]</p></li> <li><p>When a future results in an evaluation error, the <code>result()</code> object of the future also holds the session information from when the error occurred [future 1.17.0]</p></li> <li><p><code>value()</code> can now detect and warn if a <code>future(..., seed=FALSE)</code> call generated random numbers, which then might give unreliable results because non-parallel-safe, non-statistically-sound random number generation (RNG) was used [future 1.16.0]</p></li> <li><p>Progress updates by <strong><a href="https://github.com/HenrikBengtsson/progressr">progressr</a></strong> are relayed in a near-live fashion for multisession and cluster futures [future 1.16.0]</p></li> <li><p><code>makeClusterPSOCK()</code> gained argument <code>rscript_envs</code> for setting or copying environment variables <em>during</em> the startup of each worker, e.g. <code>rscript_envs=c(FOO=&quot;hello world&quot;, &quot;BAR&quot;)</code> [future 1.17.0]. In addition, on Linux and macOS, it is also possible to set environment variables <em>prior</em> to launching the workers, e.g. 
<code>rscript=c(&quot;TMPDIR=/tmp/foo&quot;, &quot;FOO='hello world'&quot;, &quot;Rscript&quot;)</code> [future 1.18.0]</p></li> <li><p>Error messages of severe cluster future failures are more informative and include details on the affected worker, such as hostname and R version [future 1.17.0 and 1.18.0]</p></li> </ul></li> <li><p><strong>future.apply</strong>:</p> <ul> <li><p><code>future_apply()</code> gained argument <code>simplify</code>, which has been added to <code>base::apply()</code> in R-devel (to become R 4.1.0) [future.apply 1.6.0]</p></li> <li><p>Added <code>future_.mapply()</code> corresponding to <code>base::.mapply()</code> [future.apply 1.5.0]</p></li> <li><p><code>future_lapply()</code> and friends set a label on each future that reflects the name of the function and the index of the chunk, e.g. &lsquo;future_lapply-3&rsquo; [future.apply 1.4.0]</p></li> <li><p>The assertion of the maximum size of globals per chunk is significantly faster for <code>future_apply()</code> [future.apply 1.4.0]</p></li> </ul></li> </ul> <p>There have also been updates to <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> and <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong>. Please see their NEWS files for the details.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>I&rsquo;m working on cleaning up and harmonizing the Future API even further. This is necessary so I can add some powerful features later on. One example of this cleanup is making sure that all types of futures are resolved in a local environment, which means that the <code>local</code> argument can be deprecated and eventually removed. Another example is to deprecate argument <code>persistent</code> for cluster futures, which is an &ldquo;outlier&rdquo; and a remnant from the past. 
I&rsquo;m aware that some of you use <code>plan(cluster, persistent=TRUE)</code>, which, as far as I understand, is because you need to keep persistent variables around throughout the lifetime of the workers. I&rsquo;ve got a prototype of &ldquo;sticky globals&rdquo; that solves this problem differently, without the need for <code>persistent=TRUE</code>. I&rsquo;ll try my best to make sure everyone&rsquo;s needs are met. If you&rsquo;ve got questions, feedback, or a special use case, please reach out on <a href="https://github.com/HenrikBengtsson/future/issues/382">https://github.com/HenrikBengtsson/future/issues/382</a>.</p> <p>I&rsquo;ve also worked with the maintainers of <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> to harmonize the end-user and developer experience of <strong>foreach</strong> with that of the <strong>future</strong> framework. For example, in <code>y &lt;- foreach(...) %dopar% { ... }</code>, the <code>{ ... }</code> expression is now always evaluated in a local environment, just like futures. This helps avoid some quite common beginner mistakes that happen when moving from sequential to parallel processing. You can read about this change in the <a href="https://blog.revolutionanalytics.com/2020/03/foreach-150-released.html">&lsquo;foreach 1.5.0 now available on CRAN&rsquo;</a> blog post by Hong Ooi. 
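<p>A minimal sketch of the kind of beginner mistake this prevents (assuming <strong>foreach</strong> &gt;= 1.5.0 and <strong>doFuture</strong> are installed); assignments made inside the loop body stay local to each iteration and never leak back to the calling environment:</p>

```r
library(foreach)
library(doFuture)
registerDoFuture()
future::plan("sequential")

x <- 0
y <- foreach(i = 1:3) %dopar% {
  x <- x + i  # assignment is local; each iteration sees the original x
  x
}
stopifnot(identical(unlist(y), c(1, 2, 3)))  # 0 + i, not a running sum
stopifnot(x == 0)                            # the caller's x is untouched
```

A beginner expecting `x` to accumulate across iterations, or to be updated in the calling environment, would be surprised here - and that surprise now happens consistently, whether the code runs sequentially or in parallel.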
There is also <a href="https://github.com/RevolutionAnalytics/foreach/issues/2">a discussion</a> on updating how <strong>foreach</strong> identifies global variables and packages so that it works the same as in the <strong>future</strong> framework.</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> adapter)</li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>future.tests</strong> package: <a href="https://cran.r-project.org/package=future.tests">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> <li><strong>progressr</strong> package: <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> </ul> <p>UPDATE: Added link to GitHub issue to discuss deprecation of <code>local</code> and <code>persistent</code> /2020-07-16</p> e-Rum 2020 Slides on Progressr https://www.jottr.org/2020/07/04/progressr-erum2020-slides/ Sat, 04 Jul 2020 17:30:00 -0700 https://www.jottr.org/2020/07/04/progressr-erum2020-slides/ <div style="width: 25%; margin: 2ex; float: right;"/> <center> <img 
src="https://www.jottr.org/post/three_in_chinese.gif" alt="Animated strokes for writing three in Chinese; one, two, three strokes"/> <span style="font-size: 80%; font-style: italic;">Source: <a href="https://en.wiktionary.org/wiki/File:%E4%B8%89-order.gif">Wiktionary.org</a></span> </center> </div> <p>I presented <em>Progressr: An Inclusive, Unifying API for Progress Updates</em> (15 minutes; 20 slides) at <a href="https://2020.erum.io/">e-Rum 2020</a>, on June 17, 2020:</p> <ul> <li><a href="https://www.jottr.org/presentations/eRum2020/BengtssonH_20200617-progressr-An_Inclusive,_Unifying_API_for_Progress_Updates.abstract.txt">Abstract</a></li> <li><a href="https://docs.google.com/presentation/d/11RymPwL90rPc0dQwpNCnw5KQC_76tuDK7uB7rq26oIg/present#slide=id.g88962cfdb7_0_0">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/eRum2020/BengtssonH_20200617-progressr-An_Inclusive,_Unifying_API_for_Progress_Updates.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=NwVOvfpGq4o&amp;t=3001s">Video</a> (starts at 00h49m58s)</li> </ul> <p>I am grateful for everyone involved who made e-Rum 2020 possible. I cannot imagine having to cancel the on-site Milano conference that had been planned for more than a year and then start over to re-organize and create a fabulous online experience for ~1,500 participants on such short notice. Your contribution to the R Community in these times is invaluable - thank you so much.</p> <p>As a speaker, I found it a bit of a challenge. It was my first presentation at an all-online conference, so I wasn&rsquo;t sure what to expect and how it would go. As others have said, it is indeed a bit unusual to present to an audience you know is there but that you cannot see or interact with during the talk. 
I gave my presentation a bit before seven o&rsquo;clock in the morning my time, and halfway through, my mind tried to convince me that it would be ok to get up and pour myself another cup of coffee - hehe - I certainly did not expect that one.</p> <p>Now, let&rsquo;s make some progress in this world!</p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li>e-Rum 2020: <ul> <li>Conference site: <a href="https://2020.erum.io/">https://2020.erum.io/</a></li> </ul></li> <li>Packages useful for understanding this talk (in order of appearance): <ul> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> <li><strong>progress</strong> package: <a href="https://cran.r-project.org/package=progress">CRAN</a>, <a href="https://github.com/r-lib/progress">GitHub</a></li> <li><strong>beepr</strong> package: <a href="https://cran.r-project.org/package=beepr">CRAN</a>, <a href="https://github.com/rasmusab/beepr">GitHub</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> </ul></li> </ul> rstudio::conf 2020 Slides on Futures https://www.jottr.org/2020/02/01/future-rstudioconf2020-slides/ Sat, 01 Feb 2020 19:30:00 -0800 https://www.jottr.org/2020/02/01/future-rstudioconf2020-slides/ <div style="width: 25%; margin: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/future-logo.png" alt="The future logo"/> <span style="font-size: 80%; font-style: italic;">Design: <a href="https://twitter.com/embiggenData">Dan LaBar</a></span> </center> </div> <p>I presented <em>Future: Simple Async, Parallel &amp; Distributed Processing in R Why and What’s New?</em> at <a href="https://rstudio.com/conference/">rstudio::conf 2020</a> in San Francisco, USA, on January 29, 2020. 
Below are the slides for my talk (17 slides; ~18+2 minutes):</p> <ul> <li><a href="https://docs.google.com/presentation/d/1Wn5S91UGIOrc4IyXoV074ij5vGF8I0Km0tCfintyIa4/present?includes_info_params=1&amp;eisi=CM2mhIXwsecCFQyuJgodBQAJ8A#slide=id.p">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/rstudioconf2020/BengtssonH_20200129-future-rstudioconf2020.pdf">PDF</a> (flat slides)</li> <li><a href="https://resources.rstudio.com/rstudio-conf-2020/future-simple-async-parallel-amp-distributed-processing-in-r-whats-next-henrik-bengtsson">Video</a> with closed captions (official rstudio::conf recording)</li> </ul> <p>First of all, a big thank you goes out to Dan LaBar (<a href="https://twitter.com/embiggenData">@embiggenData</a>) for proposing and contributing the original design of the future hex sticker. All credit to Dan. (You can blame me for the tweaked background.)</p> <p>This was my first rstudio::conf and it was such a pleasure to be part of it. I&rsquo;d like to thank <a href="https://blog.rstudio.com/2020/01/29/rstudio-pbc">RStudio, PBC</a> for the invitation to speak and everyone who contributed to the conference - organizers, staff, speakers, poster presenters, and last but not least, all the wonderful participants. 
Each one of you makes our R community what it is today.</p> <p><em>Happy futuring!</em></p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li>rstudio::conf 2020: <ul> <li>Conference site: <a href="https://rstudio.com/conference/">https://rstudio.com/conference/</a></li> </ul></li> <li>Packages essential to the understanding of this talk (in order of appearance): <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>purrr</strong> package: <a href="https://cran.r-project.org/package=purrr">CRAN</a>, <a href="https://github.com/tidyverse/purrr">GitHub</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/DavisVaughan/furrr">GitHub</a></li> <li><strong>foreach</strong> package: <a href="https://cran.r-project.org/package=foreach">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/foreach">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>batchtools</strong> package: <a href="https://cran.r-project.org/package=batchtools">CRAN</a>, <a href="https://github.com/mllg/batchtools">GitHub</a></li> <li><strong>shiny</strong> package: <a href="https://cran.r-project.org/package=shiny">CRAN</a>, <a href="https://github.com/rstudio/shiny/issues">GitHub</a></li> <li><strong>future.tests</strong> package: <del>CRAN</del>, <a 
href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> <li><strong>progress</strong> package: <a href="https://cran.r-project.org/package=progress">CRAN</a>, <a href="https://github.com/r-lib/progress">GitHub</a></li> <li><strong>beepr</strong> package: <a href="https://cran.r-project.org/package=beepr">CRAN</a>, <a href="https://github.com/rasmusab/beepr">GitHub</a></li> </ul></li> </ul> future 1.15.0 - Lazy Futures are Now Launched if Queried https://www.jottr.org/2019/11/09/resolved-launches-lazy-futures/ Sat, 09 Nov 2019 11:00:00 -0800 https://www.jottr.org/2019/11/09/resolved-launches-lazy-futures/ <p><img src="https://www.jottr.org/post/lazy_dog_in_park.gif" alt="&quot;Lazy dog does not want to leave park&quot;" /> <small><em>No dogs were harmed while making this release</em></small></p> <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.15.0 is now on CRAN, accompanied by a recent, related update of <strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> 0.5.0. The main update is a change to the Future API:</p> <p><center> <code>resolved()</code> will now also launch lazy futures </center></p> <p>Although this change does not look like much to the world, I&rsquo;d like to think of it as part of a young person slowly finding themselves. This change in behavior helps us in cases where we create lazy futures upfront;</p> <pre><code class="language-r">fs &lt;- lapply(X, future, lazy = TRUE) </code></pre> <p>Such futures remain dormant until we call <code>value()</code> on them, or, as of this release, when we call <code>resolved()</code> on them. 
Contrary to <code>value()</code>, <code>resolved()</code> is a non-blocking function that allows us to check in on one or more futures to see if they are resolved or not. So, we can now do:</p> <pre><code class="language-r">while (!all(resolved(fs))) { do_something_else() } </code></pre> <p>to run that loop until all futures are resolved. Any lazy future that is still dormant will be launched when queried the first time. Previously, we would have had to write specialized code for the <code>lazy=TRUE</code> case to trigger lazy futures to launch. If not, the above loop would have run forever. This change means that the above design pattern works the same regardless of whether we use <code>lazy=TRUE</code> or <code>lazy=FALSE</code> (default). There is now one less thing to worry about when working with futures. Less mental friction should be good.</p> <h2 id="what-else">What else?</h2> <p>The Future API now guarantees that <code>value()</code> relays the &ldquo;visibility&rdquo; of a future&rsquo;s value. For example,</p> <pre><code class="language-r">&gt; f &lt;- future(invisible(42)) &gt; value(f) &gt; v &lt;- value(f) &gt; v [1] 42 </code></pre> <p>Other than that, I have fixed several non-critical bugs and improved some documentation. See <code>news(package=&quot;future&quot;)</code> or <a href="https://cran.r-project.org/web/packages/future/NEWS">NEWS</a> for all updates.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <ul> <li><p>I&rsquo;ll be talking about futures at <a href="https://rstudio.com/conference/">rstudio::conf 2020</a> (San Francisco, CA, USA) at the end of January 2020. Please come and say hi - I am keen to hear your R story.</p></li> <li><p>I will wrap up the deliverables for the project <a href="https://github.com/HenrikBengtsson/future.tests">Future Minimal API: Specification with Backend Conformance Test Suite</a> sponsored by the R Consortium. 
This project helps to robustify the future ecosystem and validate that all backends fulfill the Future API specification. It also serves to refine the Future API specifications. For example, the above change to <code>resolved()</code> resulted from this project.</p></li> <li><p>The maintainers of <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> plan to harmonize how <code>foreach()</code> identifies global variables with how the <strong>future</strong> framework identifies them. The idea is to migrate <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> to use the same approach as <strong>future</strong>, which relies on the <strong><a href="https://cran.r-project.org/package=globals">globals</a></strong> package. If you&rsquo;re curious, you can find out more about this over at the <a href="https://github.com/RevolutionAnalytics/foreach/issues">foreach issue tracker</a>. Yeah, the foreach issue tracker is a fairly recent thing - it&rsquo;s a great addition.</p></li> <li><p>The <strong><a href="https://github.com/HenrikBengtsson/progressr">progressr</a></strong> package (GitHub only) is a proof-of-concept and a working <em>prototype</em> showing how to signal progress updates when doing parallel processing. It works out of the box with the core Future API and higher-level Future APIs such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>, <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>, and <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> - regardless of what parallel backend is being used. 
It should also work with all known non-parallel map-reduce frameworks, including <strong>base</strong> <code>lapply()</code> and <strong><a href="https://cran.r-project.org/package=purrr">purrr</a></strong>. For parallel processing, the &ldquo;granularity&rdquo; of progress updates varies with the type of parallel worker used. Right now, you will get live updates for sequential processing, whereas for parallel processing the updates will come in chunks along with the value whenever it is collected for a particular future. I&rsquo;m working on adding support for &ldquo;live&rdquo; progress updates also for some parallel backends including when running on local and remote workers.</p></li> </ul> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> adapter)</li> <li><strong>progressr</strong> package: <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> <li><a href="https://www.videoman.gr/en/70385" target="_blank">&ldquo;So, what happened to the dog?&rdquo;</a></li> </ul> useR! 
2019 Slides on Futures https://www.jottr.org/2019/07/12/future-user2019-slides/ Fri, 12 Jul 2019 16:00:00 +0200 https://www.jottr.org/2019/07/12/future-user2019-slides/ <p><img src="https://www.jottr.org/post/useR2019-logo_400x400.jpg" alt="The useR 2019 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>Below are the slides for my talk <em>Future: Simple Parallel and Distributed Processing in R</em> that I presented at the <a href="https://user2019.r-project.org/">useR! 2019</a> conference in Toulouse, France on July 9-12, 2019.</p> <p>My talk (25 slides; ~15+3 minutes):</p> <ul> <li>Title: <em>Future: Simple Parallel and Distributed Processing in R</em></li> <li><a href="https://docs.google.com/presentation/d/e/2PACX-1vQDLsnzhfp03zAf-BG69mnwO6nqGyLP9Zuj5ShW0gbewY955wop6KO5bidbWxtrIydFj7lznwi1op__/pub?start=false&amp;loop=false&amp;delayms=60000">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/useR2019/BengtssonH_20190712-future-useR2019.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=4B3wPFL_Syo&amp;list=PL4IzsxWztPdliwImi5JLBC4BrvqxG-vcA&amp;index=69">Video</a> (official recording)</li> </ul> <p>I want to send out a big thank you to everyone making the useR! conference such a wonderful experience.</p> <h2 id="links">Links</h2> <ul> <li>useR! 
2019: <ul> <li>Conference site: <a href="https://user2019.r-project.org/">https://user2019.r-project.org/</a></li> </ul></li> <li><strong>future</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li><strong>future.apply</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li><strong>progressr</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=progressr">https://cran.r-project.org/package=progressr</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/progressr">https://github.com/HenrikBengtsson/progressr</a></li> </ul></li> </ul> <p>Edits 2020-02-01: Added link to video recording of presentation and link to the CRAN package page of the progressr package (submitted to CRAN on 2020-01-23).</p> startup - run R startup files once per hour, day, week, ... https://www.jottr.org/2019/05/26/startup-sometimes/ Sun, 26 May 2019 21:00:00 -0700 https://www.jottr.org/2019/05/26/startup-sometimes/ <p>New release: <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> 0.12.0 is now on CRAN. This version introduces support for processing some of the R startup files with a certain frequency, e.g. once per day, once per week, or once per month. 
See below for two examples.</p> <p><img src="https://www.jottr.org/post/startup_0.10.0-zxspectrum.gif" alt="ZX Spectrum animation" /> <em>startup::startup() is cross platform.</em></p> <p>The <a href="https://cran.r-project.org/package=startup">startup</a> package makes it easy to split up a long, complicated <code>.Rprofile</code> startup file into multiple, smaller files in a <code>.Rprofile.d/</code> folder. For instance, setting R option <code>repos</code> in a separate file <code>~/.Rprofile.d/repos.R</code> makes it easy to find and update the option. Analogously, environment variables can be configured by using multiple <code>.Renviron.d/</code> files. To make use of this, install the <strong>startup</strong> package, and then call <code>startup::install()</code> once, which will tweak your <code>~/.Rprofile</code> file and create <code>~/.Renviron.d/</code> and <code>~/.Rprofile.d/</code> folders, if missing. For an introduction, see <a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a>.</p> <h2 id="example-show-a-fortune-once-per-hour">Example: Show a fortune once per hour</h2> <p>The <a href="https://cran.r-project.org/package=fortunes"><strong>fortunes</strong></a> package is a collection of quotes and wisdom related to the R language. By adding</p> <pre><code class="language-r">if (interactive()) print(fortunes::fortune()) </code></pre> <p>to our <code>~/.Rprofile</code> file, a random fortune will be displayed each time we start R, e.g.</p> <pre><code>$ R --quiet I think, therefore I R. -- William B. King (in his R tutorials) http://ww2.coastal.edu/kingw/statistics/R-tutorials/ (July 2010) &gt; </code></pre> <p>Now, if we&rsquo;re launching R frequently, it might be too much to see a new fortune each time R is started. With <strong>startup</strong> (&gt;= 0.12.0), we can limit how often a certain startup file should be processed via <code>when=&lt;frequency&gt;</code> declarations. 
Currently supported values are <code>when=once</code>, <code>when=hourly</code>, <code>when=daily</code>, <code>when=weekly</code>, <code>when=fortnightly</code>, and <code>when=monthly</code>. See the package vignette for more details.</p> <p>For instance, we can limit ourselves to one fortune per hour, by creating a file <code>~/.Rprofile.d/interactive=TRUE/when=hourly/package=fortunes.R</code> containing:</p> <pre><code class="language-r">print(fortunes::fortune()) </code></pre> <p>The <code>interactive=TRUE</code> part declares that the file should only be processed in an interactive session, the <code>when=hourly</code> part that it should be processed at most once per hour, and the <code>package=fortunes</code> part that it should be processed only if the <strong>fortunes</strong> package is installed. If not all of these declarations are fulfilled, then the file will <em>not</em> be processed.</p> <h2 id="example-check-the-status-of-your-cran-packages-once-per-day">Example: Check the status of your CRAN packages once per day</h2> <p>If you are a developer with one or more packages on CRAN, the <a href="https://cran.r-project.org/package=foghorn"><strong>foghorn</strong></a> package provides <code>foghorn::summary_cran_results()</code> which is a neat way to get a summary of the CRAN statuses of your packages. 
I use the following two files to display the summary of my CRAN packages once per day:</p> <p>File <code>~/.Rprofile.d/interactive=TRUE/when=daily/package=foghorn.R</code>:</p> <pre><code class="language-r">try(local({ if (nzchar(email &lt;- Sys.getenv(&quot;MY_CRAN_EMAIL&quot;))) { foghorn::summary_cran_results(email) } }), silent = TRUE) </code></pre> <p>File <code>~/.Renviron.d/private/me</code>:</p> <pre><code>[email protected] </code></pre> <h2 id="links">Links</h2> <ul> <li><strong>startup</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=startup">https://cran.r-project.org/package=startup</a> (<a href="https://cran.r-project.org/web/packages/startup/NEWS">NEWS</a>, <a href="https://cran.r-project.org/web/packages/startup/vignettes/startup-intro.html">vignette</a>)</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/startup">https://github.com/HenrikBengtsson/startup</a></li> </ul></li> </ul> <h2 id="related">Related</h2> <ul> <li><a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a> on 2016-12-22.</li> <li><a href="https://www.jottr.org/2018/03/30/startup-secrets/">Startup with Secrets - A Poor Man&rsquo;s Approach</a> on 2018-03-30.</li> </ul> SatRday LA 2019 Slides on Futures https://www.jottr.org/2019/05/16/future-satrdayla2019-slides/ Thu, 16 May 2019 12:00:00 -0800 https://www.jottr.org/2019/05/16/future-satrdayla2019-slides/ <p><img src="https://www.jottr.org/post/SatRdayLA2019-Logo.png" alt="The satRday LA 2019 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>A bit late, but here are my slides on <em>Future: Friendly Parallel Processing in R for Everyone</em> that I presented at the <a href="https://losangeles2019.satrdays.org/">satRday LA 2019</a> conference in Los Angeles, CA, USA on April 6, 2019.</p> <p>My talk (33 slides; ~45 minutes):</p> <ul> <li>Title: <em>Future: Friendly Parallel and Distributed Processing in R for Everyone</em></li> <li><a 
href="https://www.jottr.org/presentations/satRdayLA2019/BengtssonH_20190406-SatRdayLA2019,flat.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/satRdayLA2019/BengtssonH_20190406-SatRdayLA2019,flat.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=KP3pgLfKr00&amp;list=PLQRHxIa9tfRvXYyaVS77zshvD0i17Y60s">Video</a> (44 min; YouTube; sorry, different page numbers)</li> </ul> <p>Thank you all for making this a stellar satRday event. I enjoyed it very much!</p> <h2 id="links">Links</h2> <ul> <li>satRday LA 2019: <ul> <li>Conference site: <a href="https://losangeles2019.satrdays.org/">https://losangeles2019.satrdays.org/</a></li> <li>Conference material: <a href="https://github.com/satRdays/losangeles/tree/master/2019">https://github.com/satRdays/losangeles/tree/master/2019</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> SatRday Paris 2019 Slides on Futures https://www.jottr.org/2019/03/07/future-satrdayparis2019-slides/ Thu, 07 Mar 2019 12:00:00 -0800 https://www.jottr.org/2019/03/07/future-satrdayparis2019-slides/ <p><img 
src="https://www.jottr.org/post/satRdayParis2019-logo.png" alt="The satRday Paris 2019 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>Below are links to my slides from my talk on <em>Future: Friendly Parallel Processing in R for Everyone</em> that I presented last month at the <a href="https://paris2019.satrdays.org/">satRday Paris 2019</a> conference in Paris, France (February 23, 2019).</p> <p>My talk (32 slides; ~40 minutes):</p> <ul> <li>Title: <em>Future: Friendly Parallel Processing in R for Everyone</em></li> <li><a href="https://www.jottr.org/presentations/satRdayParis2019/BengtssonH_20190223-SatRdayParis2019.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/satRdayParis2019/BengtssonH_20190223-SatRdayParis2019.pdf">PDF</a> (flat slides)</li> </ul> <p>A big shout out to the organizers, all the volunteers, and everyone else for making it a great satRday.</p> <h2 id="links">Links</h2> <ul> <li>satRday Paris 2019: <ul> <li>Conference site: <a href="https://paris2019.satrdays.org/">https://paris2019.satrdays.org/</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> 
Parallelize a For-Loop by Rewriting it as an Lapply Call https://www.jottr.org/2019/01/11/parallelize-a-for-loop-by-rewriting-it-as-an-lapply-call/ Fri, 11 Jan 2019 12:00:00 -0800 https://www.jottr.org/2019/01/11/parallelize-a-for-loop-by-rewriting-it-as-an-lapply-call/ <p>A commonly asked question in the R community is:</p> <blockquote> <p>How can I parallelize the following for-loop?</p> </blockquote> <p>The answer almost always involves rewriting the <code>for (...) { ... }</code> loop into something that looks like a <code>y &lt;- lapply(...)</code> call. If you can achieve that, you can parallelize it via, for instance, <code>y &lt;- future.apply::future_lapply(...)</code> or <code>y &lt;- foreach::foreach() %dopar% { ... }</code>.</p> <p>For some for-loops it is straightforward to rewrite the code to make use of <code>lapply()</code> instead, whereas in other cases it can be a bit more complicated, especially if the for-loop updates multiple variables in each iteration. However, as long as the algorithm behind the for-loop is <em><a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a></em>, it can be done. Whether it should be parallelized in the first place, or whether it&rsquo;s worth the effort, is a whole other discussion.</p> <p>Below are a few walk-through examples on how to transform a for-loop into an lapply call.</p> <p><img src="https://www.jottr.org/post/Honolulu_IFSS_Teletype1964.jpg" alt="Paper tape relay operation at US FAA's Honolulu flight service station in 1964 showing a large number of punch tapes" /> <em>Run your loops in parallel.</em></p> <h1 id="example-1-a-well-behaving-for-loop">Example 1: A well-behaving for-loop</h1> <p>I will use very simple function calls throughout the examples, e.g. <code>sqrt(x)</code>. 
For these code snippets to make sense, let us pretend that those functions take a long time to finish and by parallelizing multiple such calls we will shorten the overall processing time.</p> <p>First, consider the following example:</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] tmp &lt;- sqrt(x) ## Assume this takes a long time y[[ii]] &lt;- tmp } </code></pre> <p>When run, this will give us the following result:</p> <pre><code class="language-r">&gt; str(y) List of 5 $ : num 1 $ : num 1.41 $ : num 1.73 $ : num 2 $ : num 2.24 </code></pre> <p>Because the result of each iteration in the for-loop is a single value (variable <code>tmp</code>), it is straightforward to turn this for-loop into an lapply call. I&rsquo;ll first show a version that resembles the original for-loop as far as possible, with one minor but important change. I&rsquo;ll wrap up the &ldquo;iteration&rdquo; code inside <code>local()</code> to make sure it is evaluated in a <em>local environment</em> in order to prevent it from assigning values to the global environment. It is only the &ldquo;result&rdquo; of the <code>local()</code> call that I will allow to update <code>y</code>. Here we go:</p> <pre><code class="language-r">y &lt;- list() for (ii in seq_along(X)) { y[[ii]] &lt;- local({ x &lt;- X[[ii]] tmp &lt;- sqrt(x) tmp ## same as return(tmp) }) } </code></pre> <p>By making these seemingly small adjustments, we lower the risk of missing critical side effects that some for-loops rely on. If such side effects exist and we fail to adjust for them, the rewritten code is likely to give the wrong results.</p> <p>If this syntax is unfamiliar to you, run it first to convince yourself that it works. How does it work? The code inside <code>local()</code> will be evaluated in a local environment and it is only its last value (here <code>tmp</code>) that will be returned. 
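</p> <p>To see this scoping behavior in isolation, here is a tiny, self-contained illustration (assuming no variable named <code>tmp</code> already exists in the global environment):</p> <pre><code class="language-r">res <- local({
  tmp <- sqrt(2)   ## only visible inside local()
  tmp              ## the last value becomes the result of local()
})
res                ## 1.414214
exists("tmp")      ## FALSE - nothing leaked into the global environment
</code></pre> <p>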
It is also neat that <code>x</code>, <code>tmp</code>, and any other created variables will <em>not</em> clutter up the global environment. Instead, they will vanish after each iteration just like local variables used inside functions. Retry the above after <code>rm(x, tmp)</code> to see that this is really the case.</p> <p>Now we&rsquo;re in a really good position to turn the for-loop into an lapply call. To share my train of thought, I&rsquo;ll start by showing how to do it in a way that best resembles the latter for-loop;</p> <pre><code class="language-r">y &lt;- lapply(seq_along(X), function(ii) { x &lt;- X[[ii]] tmp &lt;- sqrt(x) tmp }) </code></pre> <p>Just like the for-loop with <code>local()</code>, it is the last value (here <code>tmp</code>) that is returned, and everything is evaluated in a local environment, e.g. variable <code>tmp</code> will <em>not</em> show up in our global environment.</p> <p>There is one more update that we can do, namely instead of passing the index <code>ii</code> as an argument and then extracting element <code>x &lt;- X[[ii]]</code> inside the function, we can pass that element directly using:</p> <pre><code class="language-r">y &lt;- lapply(X, function(x) { tmp &lt;- sqrt(x) tmp }) </code></pre> <p>If we get this far and have <strong>confirmed that we get the expected results</strong>, then we&rsquo;re home.</p> <p>From here, there are a few ways to parallelize the lapply call. The <strong>parallel</strong> package provides the commonly known <code>mclapply()</code> and <code>parLapply()</code> functions, which are found in many examples and inside several R packages. As the author of the <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package, I claim that your life as a developer will be a bit easier if you instead use the future framework. It will also bring more power and options to the end user. 
Below are a few options for parallelization.</p> <h2 id="future-apply-future-lapply">future.apply::future_lapply()</h2> <p>The parallelization update that requires the fewest changes is provided by the <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> package. All we have to do is to replace <code>lapply()</code> with <code>future_lapply()</code>:</p> <pre><code class="language-r">library(future.apply) plan(multisession) ## =&gt; parallelize on your local computer X &lt;- 1:5 y &lt;- future_lapply(X, function(x) { tmp &lt;- sqrt(x) tmp }) </code></pre> <p>and we&rsquo;re done.</p> <h2 id="foreach-foreach-dopar">foreach::foreach() %dopar% { &hellip; }</h2> <p>If we wish to use the <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> framework, we can do:</p> <pre><code class="language-r">library(doFuture) registerDoFuture() plan(multisession) X &lt;- 1:5 y &lt;- foreach(x = X) %dopar% { tmp &lt;- sqrt(x) tmp } </code></pre> <p>Here I choose the <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> adaptor because it provides us with access to the future framework and the full range of parallel backends that come with it (controlled via <code>plan()</code>).</p> <p>If there is only one thing you should remember from this post, it is the following:</p> <p><strong>It is a common misconception that <code>foreach()</code> works like a regular for-loop. It doesn&rsquo;t! Instead, think of it as a version of <code>lapply()</code> with a few bells and whistles, and always make sure to use it as <code>y &lt;- foreach(...) %dopar% { ... 
}</code>.</strong></p> <p>To clarify further, the following is <em>not</em> (I repeat: <em>not</em>) a working solution:</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() foreach(x = X) %dopar% { tmp &lt;- sqrt(x) y[[ii]] &lt;- tmp } </code></pre> <p>No, it isn&rsquo;t.</p> <h2 id="additional-parallelization-options">Additional parallelization options</h2> <p>There are several more options available, which are conceptually very similar to the above lapply-like approaches, e.g. <code>y &lt;- furrr::future_map(X, ...)</code>, <code>y &lt;- plyr::llply(X, ..., .parallel = TRUE)</code> or <code>y &lt;- BiocParallel::bplapply(X, ..., BPPARAM = DoparParam())</code>. For the latter two to also parallelize via one of the many future backends, we need to call <code>doFuture::registerDoFuture()</code>. See also my blog post <a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>.</p> <h1 id="example-2-a-slightly-complicated-for-loop">Example 2: A slightly complicated for-loop</h1> <p>Now, what do we do if the for-loop writes multiple results in each iteration? 
For example,</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() z &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] tmp1 &lt;- sqrt(x) y[[ii]] &lt;- tmp1 tmp2 &lt;- x^2 z[[ii]] &lt;- tmp2 } </code></pre> <p>The way to turn this into an lapply call is to rewrite the code by gathering all the results at the very end of the iteration and then putting them into a list;</p> <pre><code class="language-r">X &lt;- 1:5 yz &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] tmp1 &lt;- sqrt(x) tmp2 &lt;- x^2 yz[[ii]] &lt;- list(y = tmp1, z = tmp2) } </code></pre> <p>This one we know how to rewrite;</p> <pre><code class="language-r">yz &lt;- lapply(X, function(x) { tmp1 &lt;- sqrt(x) tmp2 &lt;- x^2 list(y = tmp1, z = tmp2) }) </code></pre> <p>which we in turn can parallelize with one of the above approaches.</p> <p>The only difference from the original for-loop is that the &lsquo;y&rsquo; and &lsquo;z&rsquo; results are no longer in two separate lists. This makes it a bit harder to get a hold of the two elements. In some cases, downstream code can work with the new <code>yz</code> format as is, but if not, we can always do:</p> <pre><code class="language-r">y &lt;- lapply(yz, function(t) t$y) z &lt;- lapply(yz, function(t) t$z) rm(yz) </code></pre> <h1 id="example-3-a-somewhat-complicated-for-loop">Example 3: A somewhat complicated for-loop</h1> <p>Another, somewhat complicated, for-loop is when, say, one column of a matrix is updated per iteration. 
For example,</p> <pre><code class="language-r">X &lt;- 1:5 Y &lt;- matrix(0, nrow = 2, ncol = length(X)) rownames(Y) &lt;- c(&quot;sqrt&quot;, &quot;square&quot;) for (ii in seq_along(X)) { x &lt;- X[[ii]] Y[, ii] &lt;- c(sqrt(x), x^2) ## assume this takes a long time } </code></pre> <p>which gives</p> <pre><code class="language-r">&gt; Y [,1] [,2] [,3] [,4] [,5] sqrt 1 1.414214 1.732051 2 2.236068 square 1 4.000000 9.000000 16 25.000000 </code></pre> <p>To turn this into an lapply call, the approach is the same as in Example 2 - we rewrite the for-loop to assign to a list and only afterward do we worry about putting those values into a matrix. To keep it simple, this can be done using something like:</p> <pre><code class="language-r">X &lt;- 1:5 tmp &lt;- lapply(X, function(x) { c(sqrt(x), x^2) ## assume this takes a long time }) Y &lt;- matrix(0, nrow = 2, ncol = length(X)) rownames(Y) &lt;- c(&quot;sqrt&quot;, &quot;square&quot;) for (ii in seq_along(tmp)) { Y[, ii] &lt;- tmp[[ii]] } rm(tmp) </code></pre> <p>To parallelize this, all we have to do is to rewrite the lapply call as:</p> <pre><code class="language-r">tmp &lt;- future_lapply(X, function(x) { c(sqrt(x), x^2) }) </code></pre> <h1 id="example-4-a-non-embarrassingly-parallel-for-loop">Example 4: A non-embarrassingly parallel for-loop</h1> <p>Now, if our for-loop is such that one iteration depends on the previous iterations, things become much more complicated. For example,</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() y[[1]] &lt;- 1 for (ii in 2:length(X)) { x &lt;- X[[ii]] tmp &lt;- sqrt(x) y[[ii]] &lt;- y[[ii - 1]] + tmp } </code></pre> <p>does <em>not</em> use an embarrassingly parallel for-loop. 
This code cannot be rewritten as an lapply call and therefore it cannot be parallelized.</p> <h1 id="summary">Summary</h1> <p>To parallelize a for-loop:</p> <ol> <li>Rewrite your for-loop such that each iteration is done inside a <code>local()</code> call (most of the work is done here)</li> <li>Rewrite this new for-loop as an lapply call (straightforward)</li> <li>Replace the lapply call with a parallel implementation of your choice (straightforward)</li> </ol> <p><em>Happy futuring!</em></p> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2019/01/07/maintenance-updates-of-future-backends-and-dofuture/">Maintenance Updates of Future Backends and doFuture</a>, 2019-01-07</li> <li><a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>, 2018-07-23</li> <li><a href="https://www.jottr.org/2018/06/23/future.apply_1.0.0/">future.apply - Parallelize Any Base R Apply Function</a>, 2018-06-23</li> <li><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future (Slides from eRum 2018)</a>, 2018-06-19</li> <li><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</li> <li><a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</li> <li><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</li> <li><a href="https://www.jottr.org/2016/07/02/future-user2016-slides/">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> <h1 id="appendix">Appendix</h1> <h2 id="a-regular-for-loop-with-future-future">A regular for-loop with future::future()</h2> <p>In 
order to lower the risk of mistakes, and because I think the for-loop-to-lapply approach is the one that works out of the box in most cases, I decided to not mention the following approach in the main text above, but if you&rsquo;re interested, here it is. With the core building blocks of the Future API, we can actually do parallel processing using a regular for-loop. Have a look at that second code snippet in Example 1 where we use a for-loop together with <code>local()</code>. All we need to do is to replace <code>local()</code> with <code>future()</code> and make sure to &ldquo;collect&rdquo; the values after the for-loop;</p> <pre><code class="language-r">library(future) plan(multisession) X &lt;- 1:5 y &lt;- list() for (ii in seq_along(X)) { y[[ii]] &lt;- future({ x &lt;- X[[ii]] tmp &lt;- sqrt(x) tmp }) } y &lt;- values(y) ## collect values </code></pre> <p>Note that this approach does <em>not</em> perform load balancing*. That is, contrary to the above-mentioned lapply-like options, it will not chunk up the elements in <code>X</code> into equally-sized portions for each parallel worker to process. Instead, it will call each worker multiple times, which can bring some significant overhead, especially if there are many elements to iterate over.</p> <p>However, one neat feature of this bare-bones approach is that we have full control of the iteration. For instance, we can initiate each iteration using a bit of sequential code before we use parallel code. This can be particularly useful for subsetting large objects to avoid passing them to each worker, which otherwise can be costly. For example, we can rewrite the above as:</p> <pre><code class="language-r">library(future) plan(multisession) X &lt;- 1:5 y &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] y[[ii]] &lt;- future({ tmp &lt;- sqrt(x) tmp }) } y &lt;- values(y) </code></pre> <p>This is just one example. 
I&rsquo;ve run into several other use cases in my large-scale genomics research, where I found it extremely useful to be able to perform the beginning of an iteration sequentially in the main process before passing on the remaining part to be processed in parallel by the workers.</p> <p>(*) I do have some ideas on how to get the above code snippet to do automatic workload balancing &ldquo;under the hood&rdquo;, but that is quite far into the future of the future framework.</p> <p>UPDATE 2022-12-11: Updated examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> Maintenance Updates of Future Backends and doFuture https://www.jottr.org/2019/01/07/maintenance-updates-of-future-backends-and-dofuture/ Mon, 07 Jan 2019 00:00:00 +0000 https://www.jottr.org/2019/01/07/maintenance-updates-of-future-backends-and-dofuture/ <p>New versions of the following future backends are available on CRAN:</p> <ul> <li><strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> - parallelization via <strong><a href="https://cran.r-project.org/package=callr">callr</a></strong>, i.e. on the local machine</li> <li><strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> - parallelization via <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong>, i.e. on a compute cluster with job schedulers (SLURM, SGE, Torque/PBS, etc.) 
but also on the local machine</li> <li><strong><a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a></strong> - (maintained for legacy reasons) parallelization via <strong><a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a></strong>, which is the predecessor of batchtools</li> </ul> <p>These releases fix a few small bugs and inconsistencies that were identified with help of the <strong><a href="https://github.com/HenrikBengtsson/future.tests">future.tests</a></strong> framework that is being developed with <a href="https://www.r-consortium.org/projects/awarded-projects">support from the R Consortium</a>.</p> <p>I also released a new version of:</p> <ul> <li><strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> - use <em>any</em> future backend for <code>foreach()</code> parallelization</li> </ul> <p>which comes with a few improvements and bug fixes.</p> <p><img src="https://www.jottr.org/post/the-future-is-now.gif" alt="An old TV screen struggling to display the text &quot;THE FUTURE IS NOW&quot;" /> <em>The future is now.</em></p> <h2 id="the-future-is-what">The future is &hellip; what?</h2> <p>If you never heard of the future framework before, here is a simple example. Assume that you want to run</p> <pre><code class="language-r">y &lt;- lapply(X, FUN = my_slow_function) </code></pre> <p>in parallel on your local computer. 
The most straightforward way to achieve this is to use:</p> <pre><code class="language-r">library(future.apply) plan(multisession) y &lt;- future_lapply(X, FUN = my_slow_function) </code></pre> <p>If you have SSH access to a few machines here and there with R installed, you can use:</p> <pre><code class="language-r">library(future.apply) plan(cluster, workers = c(&quot;localhost&quot;, &quot;gandalf.remote.edu&quot;, &quot;server.cloud.org&quot;)) y &lt;- future_lapply(X, FUN = my_slow_function) </code></pre> <p>Even better, if you have access to a compute cluster with an SGE job scheduler, you could use:</p> <pre><code class="language-r">library(future.apply) plan(future.batchtools::batchtools_sge) y &lt;- future_lapply(X, FUN = my_slow_function) </code></pre> <h2 id="the-future-is-why">The future is &hellip; why?</h2> <p>The <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package provides a simple, cross-platform, and lightweight API for parallel processing in R. At its core, there are three building blocks for doing parallel processing - <code>future()</code>, <code>resolved()</code> and <code>value()</code> - which are used for creating the asynchronous evaluation of an R expression, querying whether it&rsquo;s done or not, and collecting the results. 
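As a minimal sketch of these three building blocks in action - here with the built-in sequential backend, so it runs in any R session:

```r
library(future)
plan(sequential)  ## any backend works; sequential keeps the sketch portable

f <- future({ sqrt(99) })  ## create: (a)synchronously evaluate the expression
resolved(f)                ## query: non-blocking check whether it is done yet
v <- value(f)              ## collect: block if needed and fetch the result
v                          ## [1] 9.949874
```

Swap in <code>plan(multisession)</code> and the exact same three calls run the expression in a background R session instead.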
With these fundamental building blocks, a large variety of parallel tasks can be performed, either by using these functions directly or indirectly via more feature-rich higher-level parallelization APIs such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>, <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong>, <strong><a href="https://bioconductor.org/packages/release/bioc/html/BiocParallel.html">BiocParallel</a></strong> or <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, and <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>. In all cases, how and where future R expressions are evaluated, that is, how and where the parallelization is performed, depends solely on which <em>future backend</em> is currently used, which is controlled by the <code>plan()</code> function.</p> <p>One advantage of the Future API, whether it is used directly as is or via one of the higher-level APIs, is that it encapsulates the details on <em>how</em> and <em>where</em> the code is parallelized, allowing the developer to instead focus on <em>what</em> to parallelize. Another advantage is that the end user will have control over which future backend to use. For instance, one user may choose to run an analysis in parallel on their notebook or in the cloud, whereas another may want to run it via a job scheduler in a high-performance compute (HPC) environment.</p> <h2 id="what-s-next">What’s next?</h2> <p>I&rsquo;ve spent a fair bit of time working on <strong><a href="https://github.com/HenrikBengtsson/future.tests">future.tests</a></strong>, which is a single framework for testing future backends. It will allow developers of future backends to validate that they fully conform to the Future API. This will lower the barrier for creating a new backend (e.g. 
<a href="https://github.com/HenrikBengtsson/future/issues/204">future.clustermq</a> on top of <strong><a href="https://cran.r-project.org/package=clustermq">clustermq</a></strong> or <a href="https://github.com/HenrikBengtsson/future/issues/151">one on top Redis</a>) and it will add trust for existing ones such that end users can reliably switch between backends without having to worry about the results being different or even corrupted. So, backed by <strong><a href="https://github.com/HenrikBengtsson/future.tests">future.tests</a></strong>, I feel more comfortable attacking some of the feature requests - and there are <a href="https://github.com/HenrikBengtsson/future/issues?q=is%3Aissue+is%3Aopen+label%3A%22feature+request%22">quite a few of them</a>. Indeed, I&rsquo;ve already implemented one of them. More news coming soon &hellip;</p> <p><em>Happy futuring!</em></p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>, 2018-07-23</li> <li><a href="https://www.jottr.org/2018/06/23/future.apply_1.0.0/">future.apply - Parallelize Any Base R Apply Function</a>, 2018-06-23</li> <li><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future(Slides from eRum 2018)</a>, 2018-06-19</li> <li><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</li> <li><a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</li> <li><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0 Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a 
href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</li> <li><a href="http://127.0.0.1:4321/2016/07/02/future-user2016-slides/">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> future 1.9.0 - Output from The Future https://www.jottr.org/2018/07/23/output-from-the-future/ Mon, 23 Jul 2018 00:00:00 +0000 https://www.jottr.org/2018/07/23/output-from-the-future/ <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.9.0 - <em>Unified Parallel and Distributed Processing in R for Everyone</em> - is on CRAN. This is a milestone release:</p> <p><strong>Standard output is now relayed from futures back to the master R session - regardless of where the futures are processed!</strong></p> <p><em>Disclaimer:</em> A future&rsquo;s output is relayed only after it is resolved and when its value is retrieved by the master R process. In other words, the output is not streamed back in a &ldquo;live&rdquo; fashion as it is produced. Also, it is only the standard output that is relayed. See below, for why the standard error cannot be relayed.</p> <p><img src="https://www.jottr.org/post/Signaling_by_Napoleonic_semaphore_line.jpg" alt="Illustration of communication by mechanical semaphore in 1800s France. Lines of towers supporting semaphore masts were built within visual distance of each other. The arms of the semaphore were moved to different positions, to spell out text messages. The operators in the next tower would read the message and pass it on. Invented by Claude Chappee in 1792, semaphore was a popular communication technology in the early 19th century until the telegraph replaced it. (source: wikipedia.org)" /> <em>Relaying standard output from far away</em></p> <h2 id="examples">Examples</h2> <p>Assume we have access to three machines with R installed on our local network. 
We can distribute our R processing to these machines using futures by:</p> <pre><code class="language-r">&gt; library(future) &gt; plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;)) &gt; nbrOfWorkers() [1] 3 </code></pre> <p>With the above, future expressions will now be processed across those three machines. To see which machine a future ends up being resolved by, we can output the hostname, e.g.</p> <pre><code class="language-r">&gt; printf &lt;- function(...) cat(sprintf(...)) &gt; f &lt;- future({ + printf(&quot;Hostname: %s\n&quot;, Sys.info()[[&quot;nodename&quot;]]) + 42 + }) &gt; v &lt;- value(f) Hostname: n1 &gt; v [1] 42 </code></pre> <p>We see that this particular future was resolved on the <em>n1</em> machine. Note how <em>the output is relayed when we call <code>value()</code></em>. This means that if we call <code>value()</code> multiple times, the output will also be relayed multiple times, e.g.</p> <pre><code class="language-r">&gt; v &lt;- value(f) Hostname: n1 &gt; value(f) Hostname: n1 [1] 42 </code></pre> <p>This is intended and by design. In case you are new to futures, note that <em>a future is only evaluated once</em>. In other words, calling <code>value()</code> multiple times will not re-evaluate the future expression.</p> <p>The output is also relayed when using future assignments (<code>%&lt;-%</code>). For example,</p> <pre><code class="language-r">&gt; v %&lt;-% { + printf(&quot;Hostname: %s\n&quot;, Sys.info()[[&quot;nodename&quot;]]) + 42 + } &gt; v Hostname: n1 [1] 42 &gt; v [1] 42 </code></pre> <p>In this case, the output is only relayed the first time we print <code>v</code>. The reason is that, when first set up, <code>v</code> is a promise (delayed assignment), and as soon as we &ldquo;touch&rdquo; (here print) it, it will internally call <code>value()</code> on the underlying future and then be resolved to a regular variable <code>v</code>. 
This is also intended and by design.</p> <p>In the spirit of the Future API, any <em>output behaves exactly the same way regardless of future backend used</em>. In the above, we see that output can be relayed from three external machines back to our local R session. We would get the exact same behavior if we ran our futures in parallel, or sequentially, on our local machine, e.g.</p> <pre><code class="language-r">&gt; plan(sequential) v %&lt;-% { printf(&quot;Hostname: %s\n&quot;, Sys.info()[[&quot;nodename&quot;]]) 42 } &gt; v Hostname: my-laptop [1] 42 </code></pre> <p>This also works when we use nested futures wherever the workers are located (local or remote), e.g.</p> <pre><code class="language-r">&gt; plan(list(sequential, multisession)) &gt; a %&lt;-% { + printf(&quot;PID: %d\n&quot;, Sys.getpid()) + b %&lt;-% { + printf(&quot;PID: %d\n&quot;, Sys.getpid()) + 42 + } + b + } &gt; a PID: 360547 PID: 484252 [1] 42 </code></pre> <h2 id="higher-level-future-frontends">Higher-Level Future Frontends</h2> <p>The core Future API, that is, the explicit <code>future()</code>-<code>value()</code> functions and the implicit future-assignment operator <code>%&lt;-%</code> function, provides the foundation for all of the future ecosystem. Because of this, <em>relaying of output will work out of the box wherever futures are used</em>. For example, when using <strong>future.apply</strong> we get:</p> <pre><code>&gt; library(future.apply) &gt; plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;)) &gt; printf &lt;- function(...) 
cat(sprintf(...)) &gt; y &lt;- future_lapply(1:5, FUN = function(x) { + printf(&quot;Hostname: %s (x = %g)\n&quot;, Sys.info()[[&quot;nodename&quot;]], x) + sqrt(x) + }) Hostname: n1 (x = 1) Hostname: n1 (x = 2) Hostname: n2 (x = 3) Hostname: n3 (x = 4) Hostname: n3 (x = 5) &gt; unlist(y) [1] 1.000000 1.414214 1.732051 2.000000 2.236068 </code></pre> <p>and similarly when, for example, using <strong>foreach</strong>:</p> <pre><code class="language-r">&gt; library(doFuture) &gt; registerDoFuture() &gt; plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;)) &gt; printf &lt;- function(...) cat(sprintf(...)) &gt; y &lt;- foreach(x = 1:5) %dopar% { + printf(&quot;Hostname: %s (x = %g)\n&quot;, Sys.info()[[&quot;nodename&quot;]], x) + sqrt(x) + } Hostname: n1 (x = 1) Hostname: n1 (x = 2) Hostname: n2 (x = 3) Hostname: n3 (x = 4) Hostname: n3 (x = 5) &gt; unlist(y) [1] 1.000000 1.414214 1.732051 2.000000 2.236068 </code></pre> <h2 id="what-about-standard-error">What about standard error?</h2> <p>Unfortunately, it is <em>not possible</em> to relay output sent to the standard error (stderr); that is, output produced by <code>message()</code>, <code>cat(..., file = stderr())</code>, and so on, is not relayed. This is due to a <a href="https://github.com/HenrikBengtsson/Wishlist-for-R/issues/55">limitation in R</a>, preventing us from capturing stderr in a reliable way. The gist of the problem is that, contrary to stdout (&ldquo;output&rdquo;), there can only be a single stderr (&ldquo;message&rdquo;) sink active in R at any time. The real show stopper is that if we allocate such a message sink, it will be stolen from us the moment other code/functions request the message sink. In other words, message sinks cannot be used reliably in R unless one fully controls the whole software stack. As long as this is the case, it is not possible to collect and relay stderr in a consistent fashion across <em>all</em> future backends (*). 
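As a minimal illustration of the problem (not a workaround), note that only one message sink can be active at a time, and any other code can remove or replace it at will:

```r
## R allows only a single active "message" (stderr) sink at a time
tf <- tempfile()
con <- file(tf, open = "wt")
sink(con, type = "message")
message("diverted")     ## goes to the file, not the console
## ...but any code running after us can simply take the sink away:
sink(type = "message")  ## removes the current message sink
close(con)
readLines(tf)           ## [1] "diverted"
```

Our capture works only for as long as no other code touches the message sink, which is exactly what cannot be guaranteed across arbitrary backends.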
But, of course, I&rsquo;ll keep on trying to find a solution to this problem. If anyone has a suggestion for a workaround or a patch to R, please let me know.</p> <p>(*) The <strong><a href="https://cran.r-project.org/package=callr">callr</a></strong> package captures stdout and stderr in a consistent manner, so for the <strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> backend, we could indeed already now relay stderr. We could probably also find a solution for <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> backends, which target HPC job schedulers by utilizing the <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong> package. However, if code becomes dependent on using specific future backends, it will limit the end users&rsquo; options - we want to avoid that as far as possible. Having said this, it is possible that we&rsquo;ll start out supporting stderr by making it an <a href="https://github.com/HenrikBengtsson/future/issues/172">optional feature of the Future API</a>.</p> <h2 id="poor-man-s-debugging">Poor Man&rsquo;s debugging</h2> <p>Because the output is also relayed when there is an error, e.g.</p> <pre><code class="language-r">&gt; x &lt;- &quot;42&quot; &gt; f &lt;- future({ + str(list(x = x)) + log(x) + }) &gt; value(f) List of 1 $ x: chr &quot;42&quot; Error in log(x) : non-numeric argument to mathematical function </code></pre> <p>it can be used for simple troubleshooting and narrowing down errors. 
For example,</p> <pre><code class="language-r">&gt; library(doFuture) &gt; registerDoFuture() &gt; plan(multisession) &gt; nbrOfWorkers() [1] 2 &gt; x &lt;- list(1, &quot;2&quot;, 3, 4, 5) &gt; y &lt;- foreach(x = x) %dopar% { + str(list(x = x)) + log(x) + } List of 1 $ x: num 1 List of 1 $ x: chr &quot;2&quot; List of 1 $ x: num 3 List of 1 $ x: num 4 List of 1 $ x: num 5 Error in { : task 2 failed - &quot;non-numeric argument to mathematical function&quot; &gt; </code></pre> <p>From the error message, we see that there was a &ldquo;non-numeric argument&rdquo; (element) passed to a function. By adding the <code>str()</code>, we can also see that it is of type character and what its value is. This will help us go back to the data source (<code>x</code>) and continue the troubleshooting there.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>Progress bar information is one of several frequently <a href="https://github.com/HenrikBengtsson/future/labels/feature%20request">requested features</a> in the future framework. I hope to attack the problem of progress bars and progress messages in higher-level future frontends such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>. Ideally, this can be done in a uniform and generic fashion to meet all needs. A possible implementation that has been discussed is to provide a set of basic hook functions (e.g. on-start, on-resolved, on-value) that any ProgressBar API (e.g. <strong><a href="https://github.com/ropenscilabs/jobstatus">jobstatus</a></strong>) can build upon. This could help avoid tie-in to a particular progress-bar implementation.</p> <p>Another feature I&rsquo;d like to get going is (optional) <a href="https://github.com/HenrikBengtsson/future/issues/59">benchmarking of processing time and memory consumption</a>. 
This type of information will help optimize parallel and distributed processing by identifying and understanding the various sources of overhead involved in parallelizing a particular piece of code in a particular compute environment. This information will also help any efforts trying to automate load balancing. It may even be used for progress bars that try to estimate the remaining processing time (&ldquo;ETA&rdquo;).</p> <p>So, lots of work ahead. Oh well &hellip;</p> <p><em>Happy futuring!</em></p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="see-also">See also</h2> <ul> <li><p>About <a href="https://www.wikipedia.org/wiki/Semaphore_line">Semaphore Telegraphs</a>, Wikipedia</p></li> <li><p><a href="https://www.jottr.org/2018/06/23/future.apply_1.0.0/">future.apply - Parallelize Any Base R Apply Function</a>, 2018-06-23</p></li> <li><p><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future (Slides from eRum 2018)</a>, 2018-06-19</p></li> <li><p><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</p></li> <li><p><a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</p></li> <li><p><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</p></li> <li><p><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</p></li> <li><p><a href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</p></li> </ul> <h2 id="links">Links</h2> <ul> <li>future - <em>Unified Parallel and Distributed Processing in R for Everyone</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> 
<li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.apply - <em>Apply Function to Elements in Parallel using Futures</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>doFuture - <em>A Universal Foreach Parallel Adaptor using the Future API of the &lsquo;future&rsquo; Package</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> <li>future.batchtools - <em>A Future API for Parallel and Distributed Processing using &lsquo;batchtools&rsquo;</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>future.callr - <em>A Future API for Parallel Processing using &lsquo;callr&rsquo;</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.callr">https://cran.r-project.org/package=future.callr</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.callr">https://github.com/HenrikBengtsson/future.callr</a></li> </ul></li> </ul> R.devices - Into the Void https://www.jottr.org/2018/07/21/suppressgraphics/ Sat, 21 Jul 2018 00:00:00 +0000 https://www.jottr.org/2018/07/21/suppressgraphics/ <p><strong><a href="https://cran.r-project.org/package=R.devices">R.devices</a></strong> 2.16.0 - <em>Unified Handling of Graphics Devices</em> - is on CRAN. 
With this release, you can now easily <strong>suppress unwanted graphics</strong>, e.g. graphics produced by one of those do-everything-in-one-call functions that we all bump into once in a while. To suppress graphics, the <strong>R.devices</strong> package provides the graphics device <code>nulldev()</code> and the function <code>suppressGraphics()</code>, both of which send any produced graphics into the void. This works on all operating systems, including Windows.</p> <p><img src="https://www.jottr.org/post/guillaume_nery_into_the_void_2.gif" alt="&quot;Into the void&quot;" /> <small><em><a href="https://www.youtube.com/watch?v=uQITWbAaDx0">Guillaume Nery base jumping at Dean&rsquo;s Blue Hole, filmed on breath hold by Julie Gautier</a></em></small> <!-- GIF from https://blog.francetvinfo.fr/l-instit-humeurs/2013/09/01/vis-ma-vie-dinstit-en-gif-anime-9.html --></p> <h2 id="examples">Examples</h2> <pre><code class="language-r">library(R.devices) nulldev() plot(1:100, main = &quot;Some Ignored Graphics&quot;) dev.off() </code></pre> <pre><code class="language-r">R.devices::suppressGraphics({ plot(1:100, main = &quot;Some Ignored Graphics&quot;) }) </code></pre> <h2 id="other-features">Other Features</h2> <p>Some other reasons for using the <strong>R.devices</strong> package:</p> <ul> <li><p><strong>No need to call dev.off()</strong> - Did you ever forget to call <code>dev.off()</code>, or did a function call produce an error causing <code>dev.off()</code> not to be reached, leaving a graphics device open? By using one of the <code>toPDF()</code>, <code>toPNG()</code>, &hellip; functions, or the more general <code>devEval()</code> function, <code>dev.off()</code> is automatically taken care of.</p></li> <li><p><strong>No need to specify filename extension</strong> - Did you ever switch from using <code>png()</code> to, say, <code>pdf()</code>, and forget to update the filename, resulting in a <code>my_plot.png</code> file that is actually a PDF file? 
By using one of the <code>toPDF()</code>, <code>toPNG()</code>, &hellip; functions, or the more general <code>devEval()</code> function, filename extensions are automatically taken care of - just specify the part without the extension.</p></li> <li><p><strong>Specify the aspect ratio</strong> - rather than having to manually calculate device-specific arguments <code>width</code> or <code>height</code>, e.g. <code>toPNG(&quot;my_plot&quot;, { plot(1:10) }, aspectRatio = 2/3)</code>. This is particularly useful when switching between device types, or when outputting to multiple ones at the same time.</p></li> <li><p><strong>Unified API for graphics options</strong> - conveniently set (most) graphics options including those that can otherwise only be controlled via arguments, e.g. <code>devOptions(&quot;png&quot;, width = 1024)</code>.</p></li> <li><p><strong>Control where figure files are saved</strong> - the default is folder <code>figures/</code> but can be set per device type or globally, e.g. 
<code>devOptions(&quot;*&quot;, path = &quot;figures/col/&quot;)</code>.</p></li> <li><p><strong>Easily produce EPS and favicons</strong> - <code>toEPS()</code> and <code>toFavicon()</code> are friendly wrappers for producing EPS and favicon graphics.</p></li> <li><p><strong>Capture and replay graphics</strong> - for instance, use <code>future::plan(remote, workers = &quot;remote.server.org&quot;); p %&lt;-% capturePlot({ plot(1:10) })</code> to produce graphics on a remote machine, and then display it locally by printing <code>p</code>.</p></li> </ul> <h3 id="some-more-examples">Some more examples</h3> <pre><code class="language-r">R.devices::toPDF(&quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }) ### [1] &quot;figures/my_plot.pdf&quot; </code></pre> <pre><code class="language-r">R.devices::toPNG(&quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }) ### [1] &quot;figures/my_plot.png&quot; </code></pre> <pre><code class="language-r">R.devices::toEPS(&quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }) ### [1] &quot;figures/my_plot.eps&quot; </code></pre> <pre><code class="language-r">R.devices::devEval(c(&quot;png&quot;, &quot;pdf&quot;, &quot;eps&quot;), name = &quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }, aspectRatio = 1.3) ### $png ### [1] &quot;figures/my_plot.png&quot; ### ### $pdf ### [1] &quot;figures/my_plot.pdf&quot; ### ### $eps ### [1] &quot;figures/my_plot.eps&quot; </code></pre> <h2 id="links">Links</h2> <ul> <li>R.devices package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=R.devices">https://cran.r-project.org/package=R.devices</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/R.devices">https://github.com/HenrikBengtsson/R.devices</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/02/future-user2016-slides/">A Future for R: Slides from useR 2016</a>, 2016-07-02 
<ul> <li>See Slide 17 for an example of using <code>capturePlot()</code> remotely and plotting locally</li> </ul></li> </ul> future.apply - Parallelize Any Base R Apply Function https://www.jottr.org/2018/06/23/future.apply_1.0.0/ Sat, 23 Jun 2018 00:00:00 +0000 https://www.jottr.org/2018/06/23/future.apply_1.0.0/ <p><img src="https://www.jottr.org/post/future.apply_1.0.0-htop_32cores.png" alt="0% to 100% utilization" /> <em>Got compute?</em></p> <p><a href="https://cran.r-project.org/package=future.apply">future.apply</a> 1.0.0 - <em>Apply Function to Elements in Parallel using Futures</em> - is on CRAN. With this milestone release, all<sup>*</sup> base R apply functions now have corresponding futurized implementations. This makes it easier than ever before to parallelize your existing <code>apply()</code>, <code>lapply()</code>, <code>mapply()</code>, &hellip; code - just prepend <code>future_</code> to an apply call that takes a long time to complete. That&rsquo;s it! The default is sequential processing but by using <code>plan(multisession)</code> it&rsquo;ll run in parallel.</p> <p><br> <em>Table: All future_nnn() functions in the <strong>future.apply</strong> package. 
Each function takes the same arguments as the corresponding <strong>base</strong> function does.</em><br></p> <table> <thead> <tr> <th>Function</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>future_<strong>apply()</strong></code></td> <td>Apply Functions Over Array Margins</td> </tr> <tr> <td><code>future_<strong>lapply()</strong></code></td> <td>Apply a Function over a List or Vector</td> </tr> <tr> <td><code>future_<strong>sapply()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>vapply()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>replicate()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>mapply()</strong></code></td> <td>Apply a Function to Multiple List or Vector Arguments</td> </tr> <tr> <td><code>future_<strong>Map()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>eapply()</strong></code></td> <td>Apply a Function Over Values in an Environment</td> </tr> <tr> <td><code>future_<strong>tapply()</strong></code></td> <td>Apply a Function Over a Ragged Array</td> </tr> </tbody> </table> <p><sup>*</sup> <code>future_<strong>rapply()</strong></code> - Recursively Apply a Function to a List - is yet to be implemented.</p> <h2 id="a-motivating-example">A Motivating Example</h2> <p>In the <strong>parallel</strong> package there is an example - in <code>?clusterApply</code> - showing how to perform bootstrap simulations in parallel. After some small modifications to clarify the steps, it looks like the following:</p> <pre><code class="language-r">library(parallel) library(boot) run1 &lt;- function(...) 
{ library(boot) cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 500, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } cl &lt;- makeCluster(4) ## Parallelize using four cores clusterSetRNGStream(cl, 123) cd4.boot &lt;- do.call(c, parLapply(cl, 1:4, run1)) boot.ci(cd4.boot, type = c(&quot;norm&quot;, &quot;basic&quot;, &quot;perc&quot;), conf = 0.9, h = atanh, hinv = tanh) stopCluster(cl) </code></pre> <p>The script defines a function <code>run1()</code> that produces 500 bootstrap samples, and then it calls this function four times, combines the four replicated samples into one <code>cd4.boot</code>, and at the end it uses <code>boot.ci()</code> to summarize the results.</p> <p>The corresponding sequential implementation would look something like:</p> <pre><code class="language-r">library(boot) run1 &lt;- function(...) { cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 500, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } set.seed(123) cd4.boot &lt;- do.call(c, lapply(1:4, run1)) boot.ci(cd4.boot, type = c(&quot;norm&quot;, &quot;basic&quot;, &quot;perc&quot;), conf = 0.9, h = atanh, hinv = tanh) </code></pre> <p>We notice a few things about these two code snippets. First of all, in the parallel code, there are two <code>library(boot)</code> calls; one in the main code and one inside the <code>run1()</code> function. The reason for this is to make sure that the <strong>boot</strong> package is also attached in the parallel, background R session when <code>run1()</code> is called there. The <strong>boot</strong> package defines the <code>boot.ci()</code> function, as well as the <code>boot()</code> function and the <code>cd4</code> data.frame - both used inside <code>run1()</code>. 
If <strong>boot</strong> is not attached inside the function, we would get an error on <code>&quot;object 'cd4' not found&quot;</code> when running the parallel code. In contrast, we do not need to do this in the sequential code. Also, if we later would turn our parallel script into a package, then <code>R CMD check</code> would complain if we kept the <code>library(boot)</code> call inside the <code>run1()</code> function.</p> <p>Second, the example uses <code>MASS::mvrnorm()</code> in <code>run1()</code>. The reason for this is related to the above - if we use only <code>mvrnorm()</code>, we need to attach the <strong>MASS</strong> package using <code>library(MASS)</code> and also do so inside <code>run1()</code>. Since there is only one <strong>MASS</strong> function called, it&rsquo;s easier and neater to use the form <code>MASS::mvrnorm()</code>.</p> <p>Third, the random-seed setup differs between the sequential and the parallel approach.</p> <p>In summary, in order to turn the sequential script into a script that parallelizes using the <strong>parallel</strong> package, we would have to not only rewrite parts of the code but also be aware of important differences in order to avoid getting run-time errors due to missing packages or global variables.</p> <p>One of the objectives of the <strong>future.apply</strong> package, and the <strong>future</strong> ecosystem in general, is to make transitions from writing sequential code to writing parallel code as simple and frictionless as possible.</p> <p>Here is the same example parallelized using the <strong>future.apply</strong> package:</p> <pre><code class="language-r">library(future.apply) plan(multisession, workers = 4) ## Parallelize using four cores library(boot) run1 &lt;- function(...) 
{ cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 500, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } set.seed(123) cd4.boot &lt;- do.call(c, future_lapply(1:4, run1, future.seed = TRUE)) boot.ci(cd4.boot, type = c(&quot;norm&quot;, &quot;basic&quot;, &quot;perc&quot;), conf = 0.9, h = atanh, hinv = tanh) </code></pre> <p>The difference between the sequential base-R implementation and the <strong>future.apply</strong> implementation is minimal. The <strong>future.apply</strong> package is attached, the parallel plan of four workers is set up, and the <code>lapply()</code> function is replaced by <code>future_lapply()</code>, where we specify <code>future.seed = TRUE</code> to get statistically sound and numerically reproducible parallel random number generation (RNG). More importantly, notice how there is no need to worry about which packages need to be attached on the workers and which global variables need to be exported. That is all taken care of automatically by the <strong>future</strong> framework.</p> <h2 id="q-a">Q&amp;A</h2> <p>Q. <em>What are my options for parallelization?</em><br> A. Everything in <strong>future.apply</strong> is processed through the <a href="https://cran.r-project.org/package=future">future</a> framework. This means that all parallelization backends supported by the <strong>parallel</strong> package are supported out of the box, e.g. on your <strong>local machine</strong>, and on <strong>local</strong> or <strong>remote</strong> ad-hoc <strong>compute clusters</strong> (also in the <strong>cloud</strong>). 
Additional parallelization and distribution schemas are provided by backends such as <strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> (parallelization on your local machine) and <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> (large-scale parallelization via <strong>HPC job schedulers</strong>). For other alternatives, see the CRAN page for the <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package and the <a href="https://cran.r-project.org/web/views/HighPerformanceComputing.html">High-Performance and Parallel Computing with R</a> CRAN Task View.</p> <p>Q. <em>Righty-oh, so how do I specify which parallelization backend to use?</em><br> A. A fundamental design pattern of the future framework is that <em>the end user decides <strong>how and where</strong> to parallelize</em> while <em>the developer decides <strong>what</strong> to parallelize</em>. This means that you do <em>not</em> specify the backend via some argument to the <code>future_nnn()</code> functions. Instead, the backend is specified by the <code>plan()</code> function - you can almost think of it as a global option that the end user controls. For example, <code>plan(multisession)</code> will parallelize on the local machine, as will <code>plan(future.callr::callr)</code>, whereas <code>plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;remote.server.org&quot;))</code> will parallelize on two local machines and one remote machine. Using <code>plan(future.batchtools::batchtools_sge)</code> will distribute the processing on your SGE-supported compute cluster. BTW, you can also have <a href="https://cran.r-project.org/web/packages/future/vignettes/future-3-topologies.html">nested parallelization strategies</a>, e.g. <code>plan(list(tweak(cluster, workers = nodes), multisession))</code> where <code>nodes = c(&quot;n1&quot;, &quot;n2&quot;, &quot;remote.server.org&quot;)</code>.</p> <p>Q. <em>What about load balancing?</em><br> A. The default behavior of all functions is to distribute <strong>equally-sized chunks</strong> of elements to each available background worker - such that each worker processes exactly one chunk (= one future). If the processing times vary significantly across chunks, you can increase the average number of chunks processed by each worker, e.g. to have them process two chunks on average, specify <code>future.scheduling = 2.0</code>. Alternatively, you can specify the number of elements processed per chunk, e.g. <code>future.chunk.size = 10L</code> (an analog to the <code>chunk.size</code> argument added to the <strong>parallel</strong> package in R 3.5.0).</p> <p>Q. <em>What about random number generation (RNG)? I&rsquo;ve heard it&rsquo;s tricky to get right when running in parallel.</em><br> A. Just add <code>future.seed = TRUE</code> and you&rsquo;re good. This will use <strong>parallel-safe</strong> and <strong>statistically sound</strong> <strong>L&rsquo;Ecuyer-CMRG RNG</strong>, a well-established parallel RNG algorithm that is also used by the <strong>parallel</strong> package. The <strong>future.apply</strong> functions use this in a way that is also <strong>invariant to</strong> the future backend and the amount of &ldquo;chunking&rdquo; used. To produce numerically reproducible results, call <code>set.seed(123)</code> beforehand (as in the above example), or simply use <code>future.seed = 123</code>.</p> <p>Q. <em>What about global variables? Whenever I&rsquo;ve tried to parallelize code before, I often ran into errors on &ldquo;this or that variable is not found&rdquo;.</em><br> A. This is very rarely a problem when using the <a href="https://cran.r-project.org/package=future">future</a> framework - things work out of the box.
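For example, here is a minimal sketch (with a made-up global <code>a</code>) showing that a variable defined only in the main R session is found by the workers without any extra setup:

```r
library(future.apply)
plan(multisession, workers = 2)

a <- 42  ## a global variable, defined only in the main R session

## 'a' is automatically identified as a global and exported to the workers
y <- future_lapply(1:2, function(x) a * x)
stopifnot(identical(y, list(42, 84)))

plan(sequential)  ## shut down the background workers
```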
<strong>Global variables and packages</strong> needed are <strong>automatically identified</strong> from static code inspection and passed on to the workers - even when the workers run on remote computers or in the cloud.</p> <p><em>Happy futuring!</em></p> <p>UPDATE 2022-12-11: Updated examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future (Slides from eRum 2018)</a>, 2018-06-19</li> <li><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</li> <li><a
href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</li> <li><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</li> </ul> Delayed Future (Slides from eRum 2018) https://www.jottr.org/2018/06/18/future-erum2018-slides/ Mon, 18 Jun 2018 00:00:00 +0000 https://www.jottr.org/2018/06/18/future-erum2018-slides/ <p><img src="https://www.jottr.org/post/erum2018--hexlogo.jpg" alt="The eRum 2018 hex sticker" /></p> <p>As promised - though a bit delayed - below are links to my slides and the video of my talk on <em>Future: Parallel &amp; Distributed Processing in R for Everyone</em> that I presented last month at the <a href="https://2018.erum.io/">eRum 2018</a> conference in Budapest, Hungary (May 14-16, 2018).</p> <p>The conference was very well organized (thank you everyone involved) with a great lineup of several brilliant workshop sessions, talks, and poster presentations (thanks all). It was such a pleasure to attend this conference and to connect and reconnect with so many of the lovely people that we are fortunate to have in the R Community.
I&rsquo;m looking forward to meeting you all again.</p> <p>My talk (22 slides plus several appendix slides):</p> <ul> <li>Title: <em>Future: Parallel &amp; Distributed Processing in R for Everyone</em></li> <li><a href="https://www.jottr.org/presentations/eRum2018/BengtssonH_20180516-eRum2018.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/eRum2018/BengtssonH_20180516-eRum2018.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=doa7avxbptQ">Video</a> (22 mins)</li> </ul> <p>May the future be with you!</p> <h2 id="links">Links</h2> <ul> <li>eRum 2018: <ul> <li>Conference site: <a href="https://2018.erum.io/">https://2018.erum.io/</a></li> <li>All talks (slides &amp; videos): <a href="https://2018.erum.io/#talk-abstracts">https://2018.erum.io/#talk-abstracts</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a 
href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> </ul> future 1.8.0: Preparing for a Shiny Future https://www.jottr.org/2018/04/12/future-results/ Thu, 12 Apr 2018 00:00:00 +0000 https://www.jottr.org/2018/04/12/future-results/ <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.8.0 is available on CRAN.</p> <p>This release lays the foundation for being able to capture outputs from futures, perform automated timing and memory benchmarking (profiling) on futures, and more. These features are <em>not</em> yet available out of the box, but thanks to this release we will be able to make some headway on many of <a href="https://github.com/HenrikBengtsson/future/issues/172">the feature requests related to this</a> - hopefully already by the next release.</p> <p><img src="https://www.jottr.org/post/retro-shiny-future-small.png" alt="&quot;A Shiny Future&quot;" /></p> <p>For <strong>shiny</strong> users following Joe Cheng&rsquo;s efforts on extending <a href="https://rstudio.github.io/promises/articles/shiny.html">Shiny with asynchronous processing using futures</a>, <strong>future</strong> 1.8.0 comes with some <a href="https://github.com/HenrikBengtsson/future/issues/200">important updates/bug fixes</a> that allow for consistent error handling regardless of whether Shiny runs with or without futures and regardless of the future backend used. With previous versions of the <strong>future</strong> package, you would receive errors of different classes depending on which future backend was used.</p> <p>The <code>future_lapply()</code> function was moved to the <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> package back in January 2018. Please use that one instead, especially since the one in the <strong>future</strong> package is now formally deprecated (and produces a warning if used).
In <strong>future.apply</strong> there is also a <code>future_sapply()</code> function and hopefully, in a not too far future, we&rsquo;ll see additional futurized versions of other base R apply functions, e.g. <code>future_vapply()</code> and <code>future_apply()</code>.</p> <p>Finally, with this release, there was a bug fix related to <em>nested futures</em> (where you call <code>future()</code> within a <code>future()</code> - or use <code>%&lt;-%</code> within another <code>%&lt;-%</code>). When using non-standard evaluation (NSE) such as <strong>dplyr</strong> expressions in a nested future, you could get a false error that complained about not being able to identify a global variable when it actually was a column in a data.frame.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <ul> <li><p>I&rsquo;m giving a presentation on futures at the <a href="https://2018.erum.io/">eRum 2018 conference taking place on May 14-16, 2018 in Budapest</a>. I&rsquo;m excited about this opportunity and to meet more folks in the European R community.</p></li> <li><p>I&rsquo;m happy to announce that The Infrastructure Steering Committee of The R Consortium is funding the project <a href="https://www.r-consortium.org/projects/awarded-projects">Future Minimal API: Specification with Backend Conformance Test Suite</a>. I&rsquo;m grateful for their support. The aim is to formalize the Future API further and to provide a standardized test suite that packages implementing future backends can validate their implementations against. This will benefit the quality of higher-level parallel frameworks that utilize futures internally, e.g. <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> and <strong>foreach</strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>.
It will also help us move forward on several of <a href="https://github.com/HenrikBengtsson/future/issues/172">the feature requests received from the community</a>.</p></li> </ul> <h2 id="help-shape-the-future">Help shape the future</h2> <p>If you find futures useful in your R-related work, please consider sharing your stories, e.g. by blogging, on <a href="https://twitter.com/henrikbengtsson">Twitter</a>, or on <a href="https://github.com/HenrikBengtsson/future">GitHub</a>. It is always exciting to hear how people are using them, or how they&rsquo;d like to. I know there are so many great ideas out there!</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li>future package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li>future.batchtools package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li>future.callr package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li>doFuture package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <a href="https://cran.r-project.org/package=foreach">foreach</a> adaptor)</li> </ul> Performance: Avoid Coercing Indices To Doubles https://www.jottr.org/2018/04/02/coercion-of-indices/ Mon, 02 Apr 2018 00:00:00 +0000 https://www.jottr.org/2018/04/02/coercion-of-indices/ <p><img src="https://www.jottr.org/post/1or1L.png" alt="&quot;1 or 1L?&quot;" /></p> <p><code>x[idxs + 1]</code> or <code>x[idxs + 1L]</code>?
That is the question.</p> <p>Assume that we have a vector $x$ of $n = 100,000$ random values, e.g.</p> <pre><code class="language-r">&gt; n &lt;- 100000
&gt; x &lt;- rnorm(n)
</code></pre> <p>and that we wish to calculate the $n-1$ first-order differences $y=(y_1, y_2, &hellip;, y_{n-1})$ where $y_i=x_{i+1} - x_i$. In R, we can calculate this using the following vectorized form:</p> <pre><code class="language-r">&gt; idxs &lt;- seq_len(n - 1)
&gt; y &lt;- x[idxs + 1] - x[idxs]
</code></pre> <p>We can certainly do better if we turn to native code, but is there a more efficient way to implement this using plain R code? It turns out there is (*). The following <strong>calculation is ~15-20% faster</strong>:</p> <pre><code class="language-r">&gt; y &lt;- x[idxs + 1L] - x[idxs]
</code></pre> <p>The reason is that the index calculation:</p> <pre><code class="language-r">idxs + 1
</code></pre> <p>is <strong>inefficient due to a coercion of integers to doubles</strong>. We have that <code>idxs</code> is an integer vector, but <code>idxs + 1</code> becomes a double vector because <code>1</code> is a double:</p> <pre><code class="language-r">&gt; typeof(idxs)
[1] &quot;integer&quot;
&gt; typeof(idxs + 1)
[1] &quot;double&quot;
&gt; typeof(1)
[1] &quot;double&quot;
</code></pre> <p>Note also that doubles (aka &ldquo;numerics&rdquo; in R) take up <strong>twice the amount of memory</strong>:</p> <pre><code class="language-r">&gt; object.size(idxs)
400040 bytes
&gt; object.size(idxs + 1)
800032 bytes
</code></pre> <p>which is because integers are stored as 4 bytes and doubles as 8 bytes.</p> <p>By using <code>1L</code> instead, we can avoid this coercion from integers to doubles:</p> <pre><code class="language-r">&gt; typeof(idxs)
[1] &quot;integer&quot;
&gt; typeof(idxs + 1L)
[1] &quot;integer&quot;
&gt; typeof(1L)
[1] &quot;integer&quot;
</code></pre> <p>and we save some, otherwise wasted, memory:</p> <pre><code class="language-r">&gt; object.size(idxs + 1L)
400040 bytes
</code></pre> <p><strong>Does it really matter for the overall performance?</strong> It should, because <strong>less memory is allocated</strong>, and memory allocation always comes with some overhead. Possibly more importantly, the smaller the objects are in memory, the more likely it is that elements can be found in the memory cache rather than in the RAM itself, i.e. the <strong>chance for <em>cache hits</em> increases</strong>. Accessing data in the cache is orders of magnitude faster than in RAM. Furthermore, we also <strong>avoid the coercion</strong> of integers to doubles when R adds one to each element, which may add some extra CPU overhead.</p> <p>The performance gain is confirmed by running <strong><a href="https://cran.r-project.org/package=microbenchmark">microbenchmark</a></strong> on the two alternatives:</p> <pre><code class="language-r">&gt; microbenchmark::microbenchmark(
+   y &lt;- x[idxs + 1 ] - x[idxs],
+   y &lt;- x[idxs + 1L] - x[idxs]
+ )
Unit: milliseconds
                        expr  min   lq mean median   uq  max neval cld
  y &lt;- x[idxs + 1] - x[idxs] 1.27 1.58 3.71   2.27 2.62 80.6   100   a
 y &lt;- x[idxs + 1L] - x[idxs] 1.04 1.25 2.38   1.34 2.20 76.5   100   a
</code></pre> <p>From the median (which is the most informative here), we see that using <code>idxs + 1L</code> is ~15-20% faster than <code>idxs + 1</code> in this case (it depends on $n$ and the overall calculation performed).</p> <p><strong>Is it worth it?</strong> Although it is &ldquo;only&rdquo; an absolute difference of ~1 ms, it adds up if we do these calculations a large number of times, e.g. in a bootstrap algorithm. And if there are many places in the code that result in coercions from index calculations like these, that also adds up. Some may argue it&rsquo;s not worth it, but at least now you know that it does indeed improve the performance a bit if you specify index constants as integers, i.e.
by appending an <code>L</code>.</p> <p>To wrap it up, here is a look at the cost of subsetting all of the $1,000,000$ elements in a vector using various types of integer and double index vectors:</p> <pre><code class="language-r">&gt; n &lt;- 1000000
&gt; x &lt;- rnorm(n)
&gt; idxs &lt;- seq_len(n)          ## integer indices
&gt; idxs_dbl &lt;- as.double(idxs) ## double indices
&gt; microbenchmark::microbenchmark(unit = &quot;ms&quot;,
+   x[],
+   x[idxs],
+   x[idxs + 0L],
+   x[idxs_dbl],
+   x[idxs_dbl + 0],
+   x[idxs_dbl + 0L],
+   x[idxs + 0]
+ )
Unit: milliseconds
             expr    min     lq   mean median     uq    max neval cld
              x[] 0.7056 0.7481 1.6563 0.7632 0.8351 74.682   100 a
          x[idxs] 3.9647 4.0638 5.1735 4.2020 4.7311 78.038   100 b
     x[idxs + 0L] 5.7553 5.8724 6.2694 6.0810 6.6447  7.845   100 bc
      x[idxs_dbl] 6.6355 6.7799 7.9916 7.1305 7.6349 77.696   100 cd
  x[idxs_dbl + 0] 7.7081 7.9441 8.6044 8.3321 8.9432 12.171   100 d
 x[idxs_dbl + 0L] 8.0770 8.3050 8.8973 8.7669 9.1682 12.578   100 d
      x[idxs + 0] 7.9980 8.2586 8.8544 8.8924 9.2197 12.345   100 d
</code></pre> <p>(I ordered the entries by their &lsquo;median&rsquo; processing times.)</p> <p>In all cases, we are extracting the complete vector of <code>x</code>. We see that</p> <ol> <li>subsetting using an integer vector is faster than using a double vector,</li> <li><code>x[idxs + 0L]</code> is faster than <code>x[idxs + 0]</code> (as seen previously),</li> <li><code>x[idxs + 0L]</code> is still faster than <code>x[idxs_dbl]</code> despite also involving an addition, and</li> <li><code>x[]</code> is whoppingly fast (probably because it does not have to iterate over an index vector) and serves as a lower-bound reference for the best we can hope for.</li> </ol> <p>(*): There already exists a highly efficient implementation for calculating the first-order differences, namely <code>y &lt;- diff(x)</code>.
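(As a quick sanity check, here is a small sketch confirming that the manual indexing above and <code>diff()</code> produce the same result:)

```r
set.seed(42)
x <- rnorm(100)
idxs <- seq_len(length(x) - 1L)

## manual first-order differences, with integer index constants
y <- x[idxs + 1L] - x[idxs]

## diff() computes the very same elementwise differences
stopifnot(identical(y, diff(x)))
```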
But for the sake of the take-home message of this blog post, let&rsquo;s ignore that.</p> <p><strong>Bonus</strong>: Did you know that <code>sd(y) / sqrt(2)</code> is an estimator of the standard deviation of the above <code>x</code> values (von Neumann et al., 1941)? It&rsquo;s actually not too hard to derive this - give it a try by deriving the variance when the elements of <code>x</code> are independent, identically distributed Gaussian random variables. This property is useful in cases where we are interested in the noise level of <code>x</code> and <code>x</code> has a piecewise constant mean level which changes at a small number of locations, e.g. a DNA copy-number profile of a tumor. In such cases we cannot use <code>sd(x)</code>, because the estimate would be biased due to the different mean levels. Instead, by taking the first-order differences <code>y</code>, changes in mean levels of <code>x</code> become sporadic outliers in <code>y</code>. If we could trim off these outliers, <code>sd(y) / sqrt(2)</code> would be a good estimate of the standard deviation of <code>x</code> after subtracting the mean levels. Even better, by using a robust estimator, such as the median absolute deviation (MAD) - <code>mad(y) / sqrt(2)</code> - we do not have to worry about having to identify the outliers. Efficient implementations of <code>sd(diff(x)) / sqrt(2)</code> and <code>mad(diff(x)) / sqrt(2)</code> are <code>sdDiff(x)</code> and <code>madDiff(x)</code> of the <strong><a href="https://cran.r-project.org/package=matrixStats">matrixStats</a></strong> package.</p> <h1 id="references">References</h1> <p>J. von Neumann et al., The mean square successive difference.
<em>Annals of Mathematical Statistics</em>, 1941, 12, 153-162.</p> <h1 id="session-information">Session information</h1> <p><details></p> <pre><code class="language-r">&gt; sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.4
</code></pre> <p></details></p> Startup with Secrets - A Poor Man's Approach https://www.jottr.org/2018/03/30/startup-secrets/ Fri, 30 Mar 2018 00:00:00 +0000 https://www.jottr.org/2018/03/30/startup-secrets/ <p>New release: <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> 0.10.0 is now on CRAN.</p> <p>If your R startup files (<code>.Renviron</code> and <code>.Rprofile</code>) get long and windy, or if you want to make parts of them public and other parts private, then you can use the <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> package to split them up into separate files and directories under <code>.Renviron.d/</code> and <code>.Rprofile.d/</code>. For instance, the <code>.Rprofile.d/repos.R</code> file can be solely dedicated to setting the <code>repos</code> option, which specifies which web servers R packages are installed from. This makes it easy to find and easy to share with others (e.g. on GitHub). To make use of <strong>startup</strong>, install the package and then call <code>startup::install()</code> once.
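As a sketch (pick whatever repositories you actually use - the URL below is just an example), such a <code>~/.Rprofile.d/repos.R</code> file could contain nothing but:

```r
## ~/.Rprofile.d/repos.R: dedicate this file to the 'repos' option,
## which controls where install.packages() downloads packages from
local({
  repos <- getOption("repos")
  repos["CRAN"] <- "https://cloud.r-project.org"
  options(repos = repos)
})
```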
For an introduction, see <a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a>.</p> <p><img src="https://www.jottr.org/post/startup_0.10.0-zxspectrum.gif" alt="ZX Spectrum animation" /> <em>startup::startup() is cross platform.</em></p> <p>Several R packages provide APIs for easier access to online services such as GitHub, GitLab, Twitter, Amazon AWS, Google GCE, etc. These packages often rely on R options or environment variables to hold your secret credentials or tokens in order to provide more or less automatic, batch-friendly access to those services. For convenience, it is common to set these secret options in <code>~/.Rprofile</code> or secret environment variables in <code>~/.Renviron</code> - or if you use the <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> package, in separate files. For instance, by adding a file <code>~/.Renviron.d/private/github</code> containing:</p> <pre><code>## GitHub token used by devtools GITHUB_PAT=db80a925a60ee5b57f323c7b3719bbaaf9f96b26 </code></pre> <p>then, when you start R, environment variable <code>GITHUB_PAT</code> will be accessible from within R as:</p> <pre><code class="language-r">&gt; Sys.getenv(&quot;GITHUB_PAT&quot;) [1] &quot;db80a925a60ee5b57f323c7b3719bbaaf9f96b26&quot; </code></pre> <p>which means that also <strong>devtools</strong> can make use of it.</p> <p><strong>IMPORTANT</strong>: If you&rsquo;re on a shared file system or a computer with multiple users, you want to make sure no one else can access your files holding &ldquo;secrets&rdquo;. If you&rsquo;re on Linux or macOS, this can be done by:</p> <pre><code class="language-sh">$ chmod -R go-rwx ~/.Renviron.d/private/ </code></pre> <p>Also, <em>keeping &ldquo;secrets&rdquo; in options or environment variables is <strong>not</strong> super secure</em>. 
For instance, <em>if your script or a third-party package dumps <code>Sys.getenv()</code> to a log file, that log file will contain your &ldquo;secrets&rdquo; too</em>. Depending on your default settings on the machine / file system, that log file might be readable by others in your group or even by anyone on the file system. And if you&rsquo;re not careful, you might even end up sharing that file with the public, e.g. on GitHub.</p> <p>Having said this, with the above setup we at least know that the secret token is only loaded when we run R and only when we run R as ourselves. <strong>Starting with startup 0.10.0</strong> (*), we can customize the startup further such that secrets are only loaded conditionally on a certain environment variable. For instance, we can instead put our secret files in a folder named:</p> <pre><code>~/.Renviron.d/private/SECRET=develop/
</code></pre> <p>Then (i) that folder will still not be visible to anyone else, because we already restricted access to <code>~/.Renviron.d/private/</code>, and (ii) the secrets defined by files in that folder will be loaded during the R startup <em>if and only if</em> environment variable <code>SECRET</code> has the value <code>develop</code>. For example,</p> <pre><code class="language-r">$ SECRET=develop Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot;
[1] &quot;db80a925a60ee5b57f323c7b3719bbaaf9f96b26&quot;
</code></pre> <p>will load the secrets, but none of:</p> <pre><code class="language-r">$ Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot;
[1] &quot;&quot;

$ SECRET=runtime Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot;
[1] &quot;&quot;
</code></pre> <p>In other words, with the above approach, you can avoid loading secrets by default and only load them when you really need them. This lowers the risk of exposing them by mistake in log files or to R code you&rsquo;re not in control of.
Furthermore, if you only need <code>GITHUB_PAT</code> in <em>interactive</em> devtools sessions, name the folder:</p> <pre><code>~/.Renviron.d/private/interactive=TRUE,SECRET=develop/
</code></pre> <p>and it will only be loaded in an interactive session, e.g.</p> <pre><code class="language-r">$ SECRET=develop Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot;
[1] &quot;&quot;
</code></pre> <p>and</p> <pre><code class="language-r">$ SECRET=develop R --quiet
&gt; Sys.getenv('GITHUB_PAT')
[1] &quot;db80a925a60ee5b57f323c7b3719bbaaf9f96b26&quot;
</code></pre> <p>To repeat what has already been said above, <em>storing secrets in environment variables or R variables provides only very limited security</em>. The above approach is meant to provide you with a bit more control if you are already storing credentials in <code>~/.Renviron</code> or <code>~/.Rprofile</code>. For a more secure approach to storing secrets, see the <strong><a href="https://cran.r-project.org/package=keyring">keyring</a></strong> package, which makes it easy to &ldquo;access the system credential store from R&rdquo; in a cross-platform fashion.</p> <h2 id="what-s-new-in-startup-0-10-0">What&rsquo;s new in startup 0.10.0?</h2> <ul> <li><p>Renviron and Rprofile startup files that use <code>&lt;key&gt;=&lt;value&gt;</code> filters with non-declared keys are now(*) skipped (which makes the above possible).</p></li> <li><p><code>startup(debug = TRUE)</code> reports on more details.</p></li> <li><p>A startup script can use <code>startup::is_debug_on()</code> to output messages during the startup process conditionally on whether the user chooses to display debug messages or not.</p></li> <li><p>Added <code>sysinfo()</code> flags <code>microsoftr</code>, <code>pqr</code>, <code>rstudioterm</code>, and <code>rtichoke</code>, which can be used in directory and file names to process them depending on the environment in which R is running.</p></li> <li><p><code>restart()</code> works also in the
RStudio Terminal.</p></li> </ul> <h2 id="links">Links</h2> <ul> <li><p><strong>startup</strong> package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=startup">https://cran.r-project.org/package=startup</a> (<a href="https://cran.r-project.org/web/packages/startup/NEWS">NEWS</a>, <a href="https://cran.r-project.org/web/packages/startup/vignettes/startup-intro.html">vignette</a>)</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/startup">https://github.com/HenrikBengtsson/startup</a></li> </ul></li> <li><p>Blog post <a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a> on 2016-12-22.</p></li> </ul> <p>(*) In <strong>startup</strong> (&lt; 0.10.0), <code>~/.Renviron.d/private/SECRET=develop/</code> would be processed not only when <code>SECRET</code> had value <code>develop</code> but also when it was <em>undefined</em>. In <strong>startup</strong> (&gt;= 0.10.0), files with such <code>&lt;key&gt;=&lt;value&gt;</code> tags will now be skipped when that key variable is undefined.</p> The Many-Faced Future https://www.jottr.org/2017/06/05/many-faced-future/ Mon, 05 Jun 2017 00:00:00 +0000 https://www.jottr.org/2017/06/05/many-faced-future/ <p>The <a href="https://cran.r-project.org/package=future">future</a> package defines the Future API, which is a unified, generic, friendly API for parallel processing. The Future API follows the principle of <strong>write code once and run anywhere</strong> - the developer chooses what to parallelize and the user how and where.</p> <p>The nature of a future is such that it lends itself to being used with several of the existing map-reduce frameworks already available in R.
In this post, I&rsquo;ll give an example of how to apply a function over a set of elements concurrently using plain sequential R, the parallel package, the <a href="https://cran.r-project.org/package=future">future</a> package alone, as well as future in combination with the <a href="https://cran.r-project.org/package=foreach">foreach</a>, the <a href="https://cran.r-project.org/package=plyr">plyr</a>, and the <a href="https://cran.r-project.org/package=purrr">purrr</a> packages.</p> <p><img src="https://www.jottr.org/post/julia_sets.gif" alt="Julia Set animation" /> <em>You can choose your own future and what you want to do with it.</em></p> <h2 id="example-multiple-julia-sets">Example: Multiple Julia sets</h2> <p>The <a href="https://cran.r-project.org/package=Julia">Julia</a> package provides the <code>JuliaImage()</code> function for generating a <a href="https://en.wikipedia.org/wiki/Julia_set">Julia set</a> for a given set of start parameters <code>(centre, L, C)</code>, where <code>centre</code> specifies the center point in the complex plane, <code>L</code> specifies the width and height of the square region around this location, and <code>C</code> is a complex coefficient controlling the &ldquo;shape&rdquo; of the generated Julia set.
For example, to generate one of the above Julia set images (1000-by-1000 pixels), you can use:</p> <pre><code class="language-r">library(&quot;Julia&quot;) set &lt;- JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = -0.4 + 0.6i) plot_julia(set) </code></pre> <p>with</p> <pre><code class="language-r">plot_julia &lt;- function(img, col = topo.colors(16)) { par(mar = c(0, 0, 0, 0)) image(img, col = col, axes = FALSE) } </code></pre> <p>For the purpose of illustrating how to calculate different Julia sets in parallel, I will use the same <code>(centre, L) = (0 + 0i, 3.5)</code> region as above with the following ten complex coefficients (from <a href="https://en.wikipedia.org/wiki/Julia_set">Julia set</a>):</p> <pre><code class="language-r">Cs &lt;- c( a = -0.618, b = -0.4 + 0.6i, c = 0.285 + 0i, d = 0.285 + 0.01i, e = 0.45 + 0.1428i, f = -0.70176 - 0.3842i, g = 0.835 - 0.2321i, h = -0.8 + 0.156i, i = -0.7269 + 0.1889i, j = - 0.8i ) </code></pre> <p>Now we&rsquo;re ready to see how we can use futures in combination with different map-reduce implementations in R for generating these ten sets in parallel. Note that all approaches will generate the exact same ten Julia sets. 
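</p>

<p>Once the ten sets have been computed with any of the approaches below, they can all be rendered with the same <code>plot_julia()</code> helper from above. A small sketch, assuming <code>sets</code> is the named list produced by one of the approaches:</p>

<pre><code class="language-r">## Plot each of the ten Julia sets, labeled by its coefficient name
for (name in names(sets)) {
  plot_julia(sets[[name]])
  title(main = name, line = -2)
}
</code></pre>

<p>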
So, feel free to pick your favorite approach.</p> <h2 id="sequential">Sequential</h2> <p>To process the above ten regions sequentially, we can use the <code>lapply()</code> function:</p> <pre><code class="language-r">library(&quot;Julia&quot;) sets &lt;- lapply(Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }) </code></pre> <h2 id="parallel">Parallel</h2> <pre><code class="language-r">library(&quot;parallel&quot;) ncores &lt;- future::availableCores() ## a friendly version of detectCores() cl &lt;- makeCluster(ncores) clusterEvalQ(cl, library(&quot;Julia&quot;)) sets &lt;- parLapply(cl, Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }) </code></pre> <h2 id="futures-in-parallel">Futures (in parallel)</h2> <pre><code class="language-r">library(&quot;future&quot;) plan(multisession) ## defaults to availableCores() workers library(&quot;Julia&quot;) sets &lt;- future_lapply(Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }) </code></pre> <p>We could also have used the more explicit setup <code>plan(cluster, workers = makeCluster(availableCores()))</code>, which is identical to <code>plan(multisession)</code>.</p> <h2 id="futures-with-foreach">Futures with foreach</h2> <pre><code class="language-r">library(&quot;doFuture&quot;) registerDoFuture() ## tells foreach futures should be used plan(multisession) ## specifies what type of futures sets &lt;- foreach(C = Cs) %dopar% { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) } </code></pre> <p>Note that I didn&rsquo;t pass <code>.packages = &quot;Julia&quot;</code> to <code>foreach()</code> because the doFuture backend will do that automatically for us - that&rsquo;s one of the treats of using futures. 
Had we used <code>doParallel::registerDoParallel(cl)</code> or similar, we would have had to worry about that.</p> <h2 id="futures-with-plyr">Futures with plyr</h2> <p>The plyr package will utilize foreach internally if we pass <code>.parallel = TRUE</code>. Because of this, we can use <code>plyr::llply()</code> to parallelize via futures as follows:</p> <pre><code class="language-r">library(&quot;plyr&quot;) library(&quot;doFuture&quot;) registerDoFuture() ## tells foreach futures should be used plan(multisession) ## specifies what type of futures library(&quot;Julia&quot;) sets &lt;- llply(Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }, .parallel = TRUE) </code></pre> <p>For the same reason as above, here, too, we don&rsquo;t have to worry about global variables or about making sure that needed packages are attached; that&rsquo;s all handled by the future package.</p> <h2 id="futures-with-purrr-furrr">Futures with purrr (= furrr)</h2> <p>As a final example, here is how you can use futures to parallelize your <code>purrr::map()</code> calls:</p> <pre><code class="language-r">library(&quot;purrr&quot;) library(&quot;future&quot;) plan(multisession) library(&quot;Julia&quot;) sets &lt;- Cs %&gt;% map(~ future(JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = .x))) %&gt;% values </code></pre> <p><em>Comment:</em> This latter approach will not perform load balancing (&ldquo;scheduling&rdquo;) across backend workers; that&rsquo;s a feature that ideally would be taken care of by purrr itself. However, I have some ideas for future versions of future (pun&hellip;) that may achieve this without having to modify the purrr package.</p> <h1 id="got-compute">Got compute?</h1> <p>If you have access to one or more machines with R installed (e.g. 
a local or remote cluster, or a <a href="https://cran.r-project.org/package=googleComputeEngineR">Google Compute Engine cluster</a>), and you&rsquo;ve got direct SSH access to those machines, you can have those machines calculate the above Julia sets; just change the future plan, e.g.</p> <pre><code class="language-r">plan(cluster, workers = c(&quot;machine1&quot;, &quot;machine2&quot;, &quot;machine3.remote.org&quot;)) </code></pre> <p>If you have access to a high-performance compute (HPC) cluster with an HPC scheduler (e.g. Slurm, TORQUE / PBS, LSF, or SGE), then you can harness its power by switching to:</p> <pre><code class="language-r">library(&quot;future.batchtools&quot;) plan(batchtools_sge) </code></pre> <p>For more details, see the vignettes of the <a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a> and <a href="https://cran.r-project.org/package=batchtools">batchtools</a> packages.</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package (a <a href="https://cran.r-project.org/package=foreach">foreach</a> adaptor): <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> 
<ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-21</li> <li><a href="https://www.jottr.org/2017/02/future-reproducible-rngs-futurelapply.html">future: Reproducible RNGs, future_lapply() and more</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2017/03/dofuture-universal-foreach-adapator.html">doFuture: A universal foreach adaptor ready to be used by 1,000+ packages</a>, 2017-03-18</li> </ul> The R-help Community was Started on This Day 20 Years Ago https://www.jottr.org/2017/04/01/history-r-help-20-years/ Sat, 01 Apr 2017 00:00:00 +0000 https://www.jottr.org/2017/04/01/history-r-help-20-years/ <p>Today, it&rsquo;s been 20 years since Martin Mächler started the <a href="https://stat.ethz.ch/pipermail/r-help/">R-help community list</a>. The <a href="https://stat.ethz.ch/pipermail/r-help/1997-April/001488.html">first post</a> was written by Ross Ihaka on 1997-04-01:</p> <p><img src="https://www.jottr.org/post/r-help_first_post.png" alt="Subject: R-alpha: R-testers: pmin heisenbug From: Ross Ihaka &lt;ihaka at stat.auckland.ac.nz&gt; When: Tue Apr 1 10:35:48 CEST 1997" /> <em>Screenshot of the very first post to the R-help mailing list.</em></p> <p>This is a post about R&rsquo;s memory model. We&rsquo;re talking <a href="https://cran.r-project.org/src/base/R-0/">R v0.50 beta</a>. I think that the paragraph at the end provides a nice anecdote on the importance of not being overwhelmed by the problems ahead:</p> <blockquote> <p>&ldquo;(The consumption of one cell per string is perhaps the major memory problem in R - we didn&rsquo;t design it with large problems in mind. 
It is probably fixable, but it will mean a lot of work).&rdquo;</p> </blockquote> <p>We all know the story; countless hours have been put in by many contributors throughout the years, making The R Project and its community the great experience it is today.</p> <p>Thank you!</p> <p>PS. This is a blog version of my <a href="https://stat.ethz.ch/pipermail/r-help/2017-April/445921.html">R-help post</a> with the same content.</p> doFuture: A Universal Foreach Adaptor Ready to be Used by 1,000+ Packages https://www.jottr.org/2017/03/18/dofuture/ Sat, 18 Mar 2017 00:00:00 +0000 https://www.jottr.org/2017/03/18/dofuture/ <p><a href="https://cran.r-project.org/package=doFuture">doFuture</a> 0.4.0 is available on CRAN. The doFuture package provides a <em>universal</em> <a href="https://cran.r-project.org/package=foreach">foreach</a> adaptor enabling <em>any</em> <a href="https://cran.r-project.org/package=future">future</a> backend to be used with the <code>foreach() %dopar% { ... }</code> construct. As shown below, this will allow <code>foreach()</code> to parallelize not only on multiple cores, multiple background R sessions, and ad-hoc clusters, but also on cloud-based clusters and high performance compute (HPC) environments.</p> <p>1,300+ R packages on CRAN and Bioconductor depend, directly or indirectly, on foreach for their parallel processing. By using doFuture, a user has the option to parallelize those computations on more compute environments than previously supported, especially HPC clusters. 
Notably, all <a href="https://cran.r-project.org/package=plyr">plyr</a> code with <code>.parallel = TRUE</code> will be able to take advantage of this without the need for modifications - this is possible because internally plyr makes use of foreach for its parallelization.</p> <p><img src="https://www.jottr.org/post/programmer_next_to_62500_punch_cards_SAGE.jpg" alt=" Programmer standing beside punched cards" /> <em>With doFuture, foreach can process your code in more places than ever before. Alright, it may not be able to process <a href="http://www.computerhistory.org/revolution/memory-storage/8/326/924">this programmer&rsquo;s 62,500 punched cards</a>.</em></p> <h2 id="what-is-new-in-dofuture-0-4-0">What is new in doFuture 0.4.0?</h2> <ul> <li><p><strong>Load balancing</strong>: The doFuture <code>%dopar%</code> backend will now partition all iterations (elements) and distribute them uniformly such that each backend worker receives exactly one partition, equal in size to those sent to the other workers. This approach speeds up the processing significantly when iterating over a large set of elements, each with a relatively small processing time.</p></li> <li><p><strong>Globals</strong>: Global variables and packages needed in order for external R workers to evaluate the foreach expression are now identified by the same algorithm as used for regular future constructs and <code>future::future_lapply()</code>.</p></li> </ul> <p>For full details on updates, please see the <a href="https://cran.r-project.org/package=doFuture">NEWS</a> file. <strong>The doFuture package installs out-of-the-box on all operating systems</strong>.</p> <h2 id="a-quick-example">A quick example</h2> <p>Here is a bootstrap example using foreach adapted from <code>help(&quot;clusterApply&quot;, package = &quot;parallel&quot;)</code>. 
I use this example to illustrate how to perform <code>foreach()</code> iterations in parallel on a variety of backends.</p> <pre><code>library(&quot;boot&quot;) run &lt;- function(...) { cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 10000, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } ## Attach doFuture (and foreach), and tell foreach to use futures library(&quot;doFuture&quot;) registerDoFuture() ## Sequentially on the local machine plan(sequential) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 298.728 0.601 304.242 # In parallel on local machine (with 8 cores) plan(multisession) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 452.241 1.635 68.740 # In parallel on the ad-hoc cluster machine (5 machines with 4 workers each) nodes &lt;- rep(c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;, &quot;n4&quot;, &quot;n5&quot;), each = 4L) plan(cluster, workers = nodes) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 2.046 0.188 22.227 # In parallel on Google Compute Engine (10 r-base Docker containers) vms &lt;- lapply(paste0(&quot;node&quot;, 1:10), FUN = googleComputeEngineR::gce_vm, template = &quot;r-base&quot;) vms &lt;- lapply(vms, FUN = gce_ssh_setup) vms &lt;- as.cluster(vms, docker_image = &quot;henrikbengtsson/r-base-future&quot;) plan(cluster, workers = vms) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 0.952 0.040 26.269 # In parallel on a HPC cluster with a TORQUE / PBS scheduler # (Note, the below timing includes waiting time on job queue) plan(future.BatchJobs::batchjobs_torque, workers = 10) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) 
%dopar% { run() }) ## user system elapsed ## 15.568 6.778 52.024 </code></pre> <h2 id="about-export-and-packages">About <code>.export</code> and <code>.packages</code></h2> <p>When using <code>doFuture::registerDoFuture()</code>, there is no need to manually specify which global variables (argument <code>.export</code>) to export. By default, the doFuture backend automatically identifies and exports all globals needed. This is done using recursive static-code inspection. The same is true for packages that need to be attached; those will also be handled automatically and there is no need to specify them manually via argument <code>.packages</code>. This is in line with how it works for regular future constructs, e.g. <code>y %&lt;-% { a * sum(x) }</code>.</p> <p>Having said this, you may still want to specify arguments <code>.export</code> and <code>.packages</code> because of the risk that your <code>foreach()</code> statement may not work with other foreach adaptors, e.g. <a href="https://cran.r-project.org/package=doParallel">doParallel</a> and <a href="https://cran.r-project.org/package=doSNOW">doSNOW</a>. Exactly when and where a failure may occur depends on the nestedness of your code and the location of your global variables. 
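</p>

<p>For example, a portable variant of the bootstrap call above, which should also work under doParallel and doSNOW, spells everything out explicitly (a sketch, assuming the <code>run()</code> helper defined earlier):</p>

<pre><code class="language-r">## Explicit globals and packages; redundant under doFuture,
## but required by several other foreach adaptors
boot &lt;- foreach(i = 1:100, .export = &quot;run&quot;, .packages = &quot;boot&quot;) %dopar% {
  run()
}
</code></pre>

<p>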
Specifying <code>.export</code> and <code>.packages</code> manually skips such automatic identification.</p> <p>Finally, I recommend that you as a developer always try to write your code in such a way that users can choose their own futures: The developer decides <em>what</em> should be parallelized - the user chooses <em>how</em>.</p> <p>Happy futuring!</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package (enhancing <a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a>): <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>future.batchtools package (enhancing <a href="https://cran.r-project.org/package=batchtools">batchtools</a>): <ul> <li>CRAN page: coming soon</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>googleComputeEngineR package: <ul> <li>CRAN page: <a 
href="https://cran.r-project.org/package=googleComputeEngineR">https://cran.r-project.org/package=googleComputeEngineR</a></li> <li>GitHub page: <a href="https://cloudyr.github.io/googleComputeEngineR">https://cloudyr.github.io/googleComputeEngineR</a> <br /></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2017/02/future-reproducible-rngs-futurelapply.html">future: Reproducible RNGs, future_lapply() and more</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-21</li> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> future 1.3.0: Reproducible RNGs, future_lapply() and More https://www.jottr.org/2017/02/19/future-rng/ Sun, 19 Feb 2017 00:00:00 +0000 https://www.jottr.org/2017/02/19/future-rng/ <p><a href="https://cran.r-project.org/package=future">future</a> 1.3.0 is available on CRAN. With futures, it is easy to <strong>write R code once</strong>, which the user can choose to evaluate in parallel using whatever resources s/he has available, e.g. a local machine, a set of local machines, a set of remote machines, a high-end compute cluster (via <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> and soon also <a href="https://github.com/HenrikBengtsson/future.batchtools">future.batchtools</a>), or in the cloud (e.g. 
via <a href="https://cran.r-project.org/package=googleComputeEngineR">googleComputeEngineR</a>).</p> <p><img src="https://www.jottr.org/post/funny_car_magnet_animated.gif" alt="Silent movie clip of man in a cart catching a ride with a car passing by using a giant magnet" /> <em>Futures make it easy to harness any resources at hand.</em></p> <p>Thanks to great feedback from the community, this new version provides:</p> <ul> <li><p><strong>A convenient lapply() function</strong></p> <ul> <li>Added <code>future_lapply()</code> that works like <code>lapply()</code> and gives identical results with the difference that futures are used internally. Depending on the user&rsquo;s choice of <code>plan()</code>, these calculations may be processed sequentially, in parallel, or distributed on multiple machines.</li> <li>Load balancing can be controlled by argument <code>future.scheduling</code>, which is a scalar adjusting how many futures each worker should process.</li> <li>Perfect reproducible random number generation (RNG) is guaranteed given the same initial seed, regardless of the type of futures used and choice of load balancing. Argument <code>future.seed = TRUE</code> (default) will use a random initial seed, which may also be specified as <code>future.seed = &lt;integer&gt;</code>. L&rsquo;Ecuyer-CMRG RNG streams are used internally.</li> </ul></li> <li><p><strong>Clarifies distinction between developer and end user</strong></p> <ul> <li>The end user controls what future strategy to use by default, e.g. <code>plan(multisession)</code> or <code>plan(cluster, workers = c(&quot;machine1&quot;, &quot;machine2&quot;, &quot;remote.server.org&quot;))</code>.</li> <li>The developer controls whether futures should be resolved eagerly (default) or lazily, e.g. <code>f &lt;- future(..., lazy = TRUE)</code>. 
Because of this, <code>plan(lazy)</code> is now deprecated.</li> </ul></li> <li><p><strong>Is even more friendly to multi-tenant compute environments</strong></p> <ul> <li><code>availableCores()</code> returns the number of cores available to the current R process. On a regular machine, this typically corresponds to the number of cores on the machine (<code>parallel::detectCores()</code>). If option <code>mc.cores</code> or environment variable <code>MC_CORES</code> is set, then that will be returned. However, on compute clusters using schedulers such as SGE, Slurm, and TORQUE / PBS, the function detects the number of cores allotted to the job by the scheduler and returns that instead. <strong>This way developers don&rsquo;t have to adjust their code to match a certain compute environment; the default works everywhere</strong>.</li> <li>With the new version, it is possible to override the fallback value used when nothing else is specified, via option <code>future.availableCores.fallback</code> or environment variable <code>R_FUTURE_AVAILABLE_FALLBACK</code>, so that it no longer has to be the number of cores on the machine. For instance, by using <code>R_FUTURE_AVAILABLE_FALLBACK=1</code> system-wide in HPC environments, any user running outside of the scheduler will automatically use single-core processing unless explicitly requesting more cores. This lowers the risk of overloading the CPU by mistake.</li> <li>Analogously to how <code>availableCores()</code> returns the number of cores, the new function <code>availableWorkers()</code> returns the host names available to the R process. The default is <code>rep(&quot;localhost&quot;, times = availableCores())</code>, but when using HPC schedulers it may be the host names of other compute nodes allocated to the job. <br /></li> </ul></li> </ul> <p>For full details on updates, please see the <a href="https://cran.r-project.org/package=future">NEWS</a> file. 
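</p>

<p>To see what <code>availableCores()</code> and <code>availableWorkers()</code> report in your own environment, a quick check is (the output naturally depends on your machine and scheduler settings):</p>

<pre><code class="language-r">library(&quot;future&quot;)

## Number of cores available to this R process; respects option
## 'mc.cores', environment variable 'MC_CORES', and HPC scheduler
## allotments (SGE, Slurm, TORQUE / PBS)
availableCores()

## Host names available to this R process; defaults to
## rep(&quot;localhost&quot;, times = availableCores())
availableWorkers()
</code></pre>

<p>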
<strong>The future package installs out-of-the-box on all operating systems</strong>.</p> <h2 id="a-quick-example">A quick example</h2> <p>The bootstrap example of <code>help(&quot;clusterApply&quot;, package = &quot;parallel&quot;)</code> adapted to make use of futures.</p> <pre><code class="language-r">library(&quot;future&quot;) library(&quot;boot&quot;) run &lt;- function(...) { cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 5000, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } # base::lapply() system.time(boot &lt;- lapply(1:100, FUN = run)) ### user system elapsed ### 133.637 0.000 133.744 # Sequentially on the local machine plan(sequential) system.time(boot0 &lt;- future_lapply(1:100, FUN = run, future.seed = 0xBEEF)) ### user system elapsed ### 134.916 0.003 135.039 # In parallel on the local machine (with 8 cores) plan(multisession) system.time(boot1 &lt;- future_lapply(1:100, FUN = run, future.seed = 0xBEEF)) ### user system elapsed ### 0.960 0.041 29.527 stopifnot(all.equal(boot1, boot0)) </code></pre> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>The <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> package, which builds on top of <a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a>, provides future strategies for various HPC schedulers, e.g. SGE, Slurm, and TORQUE / PBS. For example, by using <code>plan(batchjobs_torque)</code> instead of <code>plan(multisession)</code> your futures will be resolved distributed on a compute cluster instead of parallel on your local machine. That&rsquo;s it! However, since last year, the BatchJobs package has been decommissioned and the authors recommend everyone to use their new <a href="https://cran.r-project.org/package=batchtools">batchtools</a> package instead. 
Just like BatchJobs, it is a very well written package, but at the same time it is more robust against cluster problems and it also supports more types of HPC schedulers. Because of this, I&rsquo;ve been working on <a href="https://github.com/HenrikBengtsson/future.batchtools">future.batchtools</a> which I hope to be able to release soon.</p> <p>Finally, I&rsquo;m really keen on looking into how futures can be used with Shaun Jackman&rsquo;s <a href="https://github.com/sjackman/lambdar">lambdar</a>, which is a proof-of-concept that allows you to execute R code on Amazon&rsquo;s &ldquo;serverless&rdquo; <a href="https://aws.amazon.com/lambda/">AWS Lambda</a> framework. My hope is that, in a not too far future (pun not intended*), we&rsquo;ll be able to resolve our futures on AWS Lambda using <code>plan(aws_lambda)</code>.</p> <p>Happy futuring!</p> <p>(*) Alright, I admit, it was intended.</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: N/A</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package (a <a href="https://cran.r-project.org/package=foreach">foreach</a> 
adaptor): <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-21</li> </ul> Start Me Up https://www.jottr.org/2016/12/22/startup/ Thu, 22 Dec 2016 00:00:00 +0000 https://www.jottr.org/2016/12/22/startup/ <p>The <a href="https://cran.r-project.org/package=startup">startup</a> package makes it easy to control your R startup processes and to share part of your startup settings with others (e.g. as a public Git repository) while keeping secret parts to yourself. Instead of having long and windy <code>.Renviron</code> and <code>.Rprofile</code> startup files, you can split them up into short specific files under corresponding <code>.Renviron.d/</code> and <code>.Rprofile.d/</code> directories. For example,</p> <pre><code># Environment variables # (one name=value per line) .Renviron.d/ +- lang # language settings +- libs # library settings +- r_cmd_check # R CMD check settings +- secrets # secret access keys (don't share!) 
# Configuration scripts # (regular R scripts) .Rprofile.d/ +- interactive=TRUE/ # Used in interactive-mode only: | +- help.start.R # - launch the help server on fixed port | +- misc.R # - TAB completions and more | +- package=fortunes.R # - show a random fortune (iff installed) +- package=devtools.R # devtools-specific options +- os=windows.R # Windows-specific settings +- repos.R # set up the CRAN repository </code></pre> <p>All you need for this to work is to have the line:</p> <pre><code class="language-r">startup::startup() </code></pre> <p>in your <code>~/.Rprofile</code> file (you may use it in any of the other locations that R supports). As an alternative to editing this file manually, just call <code>startup::install()</code>; the line will be appended if missing, and if the file does not exist, it will be created. Don&rsquo;t worry, your old file will be backed up with a timestamp.</p> <p>The startup package is extremely lightweight, has no external dependencies, and depends only on the &lsquo;base&rsquo; R package. It can be installed from CRAN using <code>install.packages(&quot;startup&quot;)</code>. 
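</p>

<p>Putting the pieces together, a minimal first-time setup could look like this (a sketch; <code>startup::install()</code> edits <code>~/.Rprofile</code> as described above):</p>

<pre><code class="language-r">install.packages(&quot;startup&quot;)

## Appends startup::startup() to ~/.Rprofile (backing up the old file)
startup::install()
</code></pre>

<p>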
<em>Note, startup 0.4.0 was released on CRAN on 2016-12-22 - until macOS and Windows binaries are available you can install it via <code>install.packages(&quot;startup&quot;, type = &quot;source&quot;)</code>.</em></p> <p>For more information on what&rsquo;s possible to do with the startup package, see the <a href="https://cran.r-project.org/web/packages/startup/README.html">README</a> file of the package.</p> <h2 id="links">Links</h2> <ul> <li>startup package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=startup">https://cran.r-project.org/package=startup</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/startup">https://github.com/HenrikBengtsson/startup</a></li> </ul></li> </ul> High-Performance Compute in R Using Futures https://www.jottr.org/2016/10/22/future-hpc/ Sat, 22 Oct 2016 00:00:00 +0000 https://www.jottr.org/2016/10/22/future-hpc/ <p>A new version of the <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> package has been released and is available on CRAN. With a single change of settings, it allows you to switch from running an analysis sequentially on a local machine to running it in parallel on a compute cluster.</p> <p><img src="https://www.jottr.org/post/future_mainframe_red.jpg" alt="A room with a classical mainframe computer and work desks" /> <em>Our different futures can easily be resolved on high-performance compute clusters.</em></p> <h2 id="requirements">Requirements</h2> <p>The future.BatchJobs package implements the Future API, as defined by the <a href="https://cran.r-project.org/package=future">future</a> package, on top of the API provided by the <a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a> package. These packages and their dependencies install out-of-the-box on all operating systems.</p> <p>Installing the package is all that is needed in order to give it a test ride. 
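</p>

<p>For example, a first test ride can be as simple as the following sketch, which uses the local backend described below, so no scheduler is required:</p>

<pre><code class="language-r">library(&quot;future.BatchJobs&quot;)
plan(batchjobs_local)

## A single future resolved via the BatchJobs machinery
f &lt;- future({
  Sys.info()[[&quot;nodename&quot;]]
})
value(f)
</code></pre>

<p>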
If you have access to a compute cluster that uses one of the common job schedulers, such as <a href="https://en.wikipedia.org/wiki/TORQUE">TORQUE (PBS)</a>, <a href="https://en.wikipedia.org/wiki/Slurm_Workload_Manager">Slurm</a>, <a href="https://en.wikipedia.org/wiki/Oracle_Grid_Engine">Sun/Oracle Grid Engine (SGE)</a>, <a href="https://en.wikipedia.org/wiki/Platform_LSF">Load Sharing Facility (LSF)</a> or <a href="https://en.wikipedia.org/wiki/OpenLava">OpenLava</a>, then you&rsquo;re ready to take it for a serious ride. If your cluster uses another type of scheduler, it is possible to configure it to work there as well. If you don&rsquo;t have access to a compute cluster right now, you can still try future.BatchJobs by simply using <code>plan(batchjobs_local)</code> in the below example - all futures (&ldquo;jobs&rdquo;) will then be processed sequentially on your local machine (*).</p> <p><small> (*) For those of you who are already familiar with the <a href="https://cran.r-project.org/package=future">future</a> package - yes, if you&rsquo;re only going to run locally, then you can equally well use <code>plan(sequential)</code> or <code>plan(multisession)</code>, but for the sake of demonstrating future.BatchJobs per se, I suggest using <code>plan(batchjobs_local)</code> because it will use the BatchJobs machinery underneath. </small></p> <h2 id="example-extracting-text-and-generating-images-from-pdfs">Example: Extracting text and generating images from PDFs</h2> <p>Imagine we have a large set of PDF documents from which we would like to extract the text and also generate PNG images for each of the pages. Below, I will show how this can be easily done in R thanks to the <a href="https://cran.r-project.org/package=pdftools">pdftools</a> package written by <a href="https://github.com/jeroenooms">Jeroen Ooms</a>. 
I will also show how we can speed up the processing by using futures that are resolved in parallel either on the local machine or, as shown here, distributed on a compute cluster.</p> <pre><code class="language-r">library(&quot;pdftools&quot;) library(&quot;future.BatchJobs&quot;) library(&quot;listenv&quot;) ## Process all PDFs on local TORQUE cluster plan(batchjobs_torque) ## PDF documents to process pdfs &lt;- dir(path = rev(.libPaths())[1], recursive = TRUE, pattern = &quot;[.]pdf$&quot;, full.names = TRUE) pdfs &lt;- pdfs[basename(dirname(pdfs)) == &quot;doc&quot;] print(pdfs) ## For each PDF ... docs &lt;- listenv() for (ii in seq_along(pdfs)) { pdf &lt;- pdfs[ii] message(sprintf(&quot;%d. Processing %s&quot;, ii, pdf)) name &lt;- tools::file_path_sans_ext(basename(pdf)) docs[[name]] %&lt;-% { path &lt;- file.path(&quot;output&quot;, name) dir.create(path, recursive = TRUE, showWarnings = FALSE) ## (a) Extract the text and write to file content &lt;- pdf_text(pdf) txt &lt;- file.path(path, sprintf(&quot;%s.txt&quot;, name)) cat(content, file = txt) ## (b) Create a PNG file per page pngs &lt;- listenv() for (jj in seq_along(content)) { pngs[[jj]] %&lt;-% { img &lt;- pdf_render_page(pdf, page = jj) png &lt;- file.path(path, sprintf(&quot;%s_p%03d.png&quot;, name, jj)) png::writePNG(img, png) png } } list(pdf = pdf, txt = txt, pngs = unlist(pngs)) } } ## Resolve everything if not already done docs &lt;- as.list(docs) str(docs) </code></pre> <p>As is true for all code using the Future API, the user always has full control over how futures are resolved. For instance, you can choose to run the above on your local machine, still via the BatchJobs framework, by using <code>plan(batchjobs_local)</code>. You could even skip the future.BatchJobs package and use what is available in the future package alone, e.g. 
<code>library(&quot;future&quot;)</code> and <code>plan(multisession)</code>.</p> <p>As emphasized in, for instance, the <a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a> blog post and in the vignettes of the <a href="https://cran.r-project.org/package=future">future</a> package, there is no need to manually identify and export variables and functions that need to be available to the external R processes resolving the futures. Such global variables are automatically identified by the future package and exported when necessary.</p> <h2 id="futures-may-be-nested">Futures may be nested</h2> <p>Note how we used nested futures in the above example, where we create one future per PDF and for each PDF we, in turn, create one future per PNG. The design of the Future API is such that the user should have full control over how each level of futures is resolved. In other words, it is the user and not the developer who should decide what is specified in <code>plan()</code>.</p> <p>If nothing is specified, futures are always resolved sequentially. In the above example, we specified <code>plan(batchjobs_torque)</code>, which means that the outer loop of futures is processed as individual jobs on the cluster. Each of these futures will be resolved in a separate R process. Next, since we didn&rsquo;t specify how the inner loop of futures should be processed, these will be resolved sequentially as part of these individual R processes.</p> <p>However, we could also choose to have the futures in the inner loop be resolved as individual jobs on the scheduler, which can be done as:</p> <pre><code class="language-r">plan(list(batchjobs_torque, batchjobs_torque)) </code></pre> <p>This would cause each PDF to be submitted as an individual job, which, when launched on a compute node by the scheduler, will start by extracting the plain text of the document and writing it to file. 
When this is done, the job continues by generating a PNG image file for each page, which is done via individual jobs on the scheduler.</p> <p>Exactly what strategies to use for resolving the different levels of futures depends on how long they take to process. If a future takes very long to process, it makes sense to submit it to the scheduler, whereas if it is really quick, it probably makes more sense to process it on the current machine, either using parallel futures or no futures at all. For instance, in our example, we could also have chosen to generate the PNGs in parallel on the same compute node that extracted the text. Such a configuration could look like:</p> <pre><code class="language-r">plan(list( tweak(batchjobs_torque, resources = &quot;nodes=1:ppn=12&quot;), multisession )) </code></pre> <p>This setup tells the scheduler that each job should be allocated 12 cores that the individual R processes then may use in parallel. The future package and the <code>multisession</code> configuration will automatically detect how many cores it was allocated by the scheduler.</p> <p>There are numerous other ways to control how and where futures are resolved. See the vignettes of the <a href="https://cran.r-project.org/package=future">future</a> and the <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> packages for more details. Also, if you read the above and thought that this may result in an explosion of futures created recursively that will bring down your computer or your cluster, don&rsquo;t worry. It&rsquo;s built into the core of the future package to prevent this from happening.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>The future.BatchJobs package simply implements the Future API (as defined by the future package) on top of the API provided by the awesome BatchJobs package. 
The creators of that package are working on the next generation of their tool - the <a href="https://github.com/mllg/batchtools">batchtools</a> package. I&rsquo;ve already started on the corresponding future.batchtools package so that you and your users can switch over to using <code>plan(batchtools_torque)</code> - it&rsquo;ll be as simple as that.</p> <p>Happy futuring!</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-11</li> </ul> <p>Keywords: R, future, future.BatchJobs, BatchJobs, package, CRAN, asynchronous, parallel processing, distributed processing, high-performance compute, HPC, compute cluster, TORQUE, PBS, Slurm, SGE, LSF, OpenLava</p> 
Remote Processing Using Futures https://www.jottr.org/2016/10/11/future-remotes/ Tue, 11 Oct 2016 00:00:00 +0000 https://www.jottr.org/2016/10/11/future-remotes/ <p>A new version of the <a href="https://cran.r-project.org/package=future">future</a> package has been released and is available on CRAN. With futures, it is easy to <em>write R code once</em>, which later <em>the user can choose</em> to parallelize using whatever resources s/he has available, e.g. a local machine, a set of local notebooks, a set of remote machines, or a high-end compute cluster.</p> <p><img src="https://www.jottr.org/post/early_days_video_call.jpg" alt="Postcard from 1900 showing how people in the year 2000 will communicate using audio and projected video" /> <em>The future provides comfortable and friendly long-distance interactions.</em></p> <p>The new version, future 1.1.1, provides:</p> <ul> <li><p><strong>Much easier usage of remote computers / clusters</strong></p> <ul> <li>If you can SSH to the machine, then you can also use it to resolve R expressions remotely.</li> <li>Firewall configuration and port forwarding are no longer needed.</li> </ul></li> <li><p><strong>Improved identification of global variables</strong></p> <ul> <li>Corner cases where the package previously failed to identify and export global variables are now also handled. For instance, variable <code>x</code> is now properly identified as a global variable in expressions such as <code>x$a &lt;- 3</code> and <code>x[1, 2, 4] &lt;- 3</code> as well as in formulas such as <code>y ~ x | z</code>.</li> <li>Global variables are by default identified automatically, but can now also be specified manually, either by their names (as a character vector) or by their names and values (as a named list). <br /></li> </ul></li> </ul> <p>For full details on updates, please see the <a href="https://cran.r-project.org/package=future">NEWS</a> file. 
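</p> <p>As a short sketch of the manual option (using the <code>globals</code> argument of <code>future()</code> as in the current future API; the variables here are made up for illustration, and the built-in sequential backend is used so it runs anywhere):</p> <pre><code class="language-r">library(&quot;future&quot;)
plan(sequential)

x &lt;- 40
y &lt;- 2

## Globals specified by their names (character vector) ...
f1 &lt;- future(x + y, globals = c(&quot;x&quot;, &quot;y&quot;))

## ... or by their names and values (named list)
f2 &lt;- future(x + y, globals = list(x = 40, y = 2))

value(f1)  ## 42
value(f2)  ## 42
</code></pre> <p>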
The future package installs out-of-the-box on all operating systems.</p> <h2 id="example-remote-graphics-rendered-locally">Example: Remote graphics rendered locally</h2> <p>To illustrate how simple and powerful remote futures can be, I will show how to (i) set up locally stored data, (ii) generate <a href="https://cran.r-project.org/package=plotly">plotly</a>-enhanced <a href="https://cran.r-project.org/package=ggplot2">ggplot2</a> graphics based on these data using a remote machine, and then (iii) render these plotly graphics in the local web browser for interactive exploration of data.</p> <p>Before starting, all we need to do is to verify that we have SSH access to the remote machine, let&rsquo;s call it <code>remote.server.org</code>, and that it has R installed:</p> <pre><code class="language-sh">{local}: ssh remote.server.org {remote}: Rscript --version R scripting front-end version 3.3.1 (2016-06-21) {remote}: exit {local}: exit </code></pre> <p>Note, it is highly recommended to use <a href="https://en.wikipedia.org/wiki/Secure_Shell#Key_management">SSH-key pair authentication</a> so that login credentials do not have to be entered manually.</p> <p>After having made sure that the above works, we are ready for our remote future demo. 
The following code is based on an online <a href="https://plot.ly/ggplot2/">plotly example</a> with only a few minor modifications:</p> <pre><code class="language-r">library(&quot;plotly&quot;) library(&quot;future&quot;) ## %&lt;-% assignments will be resolved remotely plan(remote, workers = &quot;remote.server.org&quot;) ## Set up data (locally) set.seed(100) d &lt;- diamonds[sample(nrow(diamonds), 1000), ] ## Generate ggplot2 graphics and plotly-fy (remotely) gg %&lt;-% { p &lt;- ggplot(data = d, aes(x = carat, y = price)) + geom_point(aes(text = paste(&quot;Clarity:&quot;, clarity)), size = 4) + geom_smooth(aes(colour = cut, fill = cut)) + facet_wrap(~ cut) ggplotly(p) } ## Display graphics in browser (locally) gg </code></pre> <p>The above renders the plotly-compiled ggplot2 graphics in our local browser. See the screenshot below for an example.</p> <p>This might sound like magic, but all that is going on behind the scenes is a carefully engineered utilization of the <a href="https://cran.r-project.org/package=globals">globals</a> and the parallel packages, which is then encapsulated in the unified API provided by the future package. First, a future assignment (<code>%&lt;-%</code>) is used for <code>gg</code>, instead of a regular assignment (<code>&lt;-</code>). That tells R to use a future to evaluate the expression on the right-hand side (everything within <code>{ ... }</code>). Second, since we specified that we want to use the remote machine <code>remote.server.org</code> to resolve our futures, that is where the future expression is evaluated. Third, necessary data is automatically communicated between our local and remote machines. That is, any global variables (<code>d</code>) and functions are automatically identified and exported to the remote machine and required packages (<code>ggplot2</code> and <code>plotly</code>) are loaded remotely. 
When resolved, the value of the expression is automatically transferred back to our local machine and is available as the value of future variable <code>gg</code>, which was formally set up as a promise.</p> <p><img src="https://www.jottr.org/post/future_1.1.1-example_plotly.png" alt="Screenshot of a plotly-rendered panel of ggplot2 graphs" /> <em>An example of remote futures: This ggplot2 + plotly figure was generated on a remote machine and then rendered in the local web browser where it can be interacted with dynamically.</em></p> <p><em>What&rsquo;s next?</em> Over the summer, I have received tremendous feedback from several people, such as (in no particular order) <a href="https://github.com/krlmlr">Kirill Müller</a>, <a href="https://github.com/gdevailly">Guillaume Devailly</a>, <a href="https://github.com/clarkfitzg">Clark Fitzgerald</a>, <a href="https://github.com/michaelsbradleyjr">Michael Bradley</a>, <a href="https://github.com/thomasp85">Thomas Lin Pedersen</a>, <a href="https://github.com/alexvorobiev">Alex Vorobiev</a>, <a href="https://github.com/hrbrmstr">Bob Rudis</a>, <a href="https://github.com/RebelionTheGrey">RebelionTheGrey</a>, <a href="https://github.com/wrathematics">Drew Schmidt</a> and <a href="https://github.com/gaborcsardi">Gábor Csárdi</a> (sorry if I missed anyone, please let me know). This feedback contributed to some of the new features found in future 1.1.1. However, there are many great <a href="https://github.com/HenrikBengtsson/future/issues">suggestions and wishes</a> that didn&rsquo;t make it in for this release - I hope to be able to work on those next. 
Thank you all.</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> A Future for R: Slides from useR 2016 https://www.jottr.org/2016/07/02/future-user2016-slides/ Sat, 02 Jul 2016 00:00:00 +0000 https://www.jottr.org/2016/07/02/future-user2016-slides/ <p>Unless you count DSC 2003 in Vienna, last week&rsquo;s <a href="http://user2016.org/">useR</a> conference at Stanford was my very first time at useR. 
It was a great event, it was awesome to meet our lovely and vibrant R community in real life, which we otherwise only get to know from online interactions, and of course it was very nice to meet old friends and make new ones.</p> <p><img src="https://www.jottr.org/post/hover_craft_car_photo_picture.jpg" alt="Classical illustration of a hover car above the tree taking off from a yard with a house" /> <em>The future is promising.</em></p> <p>At the end of the second day, I presented <em>A Future for R</em> (18 min talk; slides below) on how you can use the <a href="https://cran.r-project.org/package=future">future</a> package for asynchronous (parallel and distributed) processing using a single unified API regardless of what backend you have available, e.g. multicore, multisession, ad hoc cluster, and job schedulers. I ended with a teaser on how futures can be used for much more than speeding up your code, e.g. generating graphics remotely and displaying them locally.</p> <p>Here&rsquo;s an example using two futures that process data in parallel:</p> <pre><code class="language-r">&gt; library(&quot;future&quot;) &gt; plan(multisession) ## Parallel processing &gt; a %&lt;-% slow_sum(1:50) ## These two assignments are &gt; b %&lt;-% slow_sum(51:100) ## non-blocking and in parallel &gt; y &lt;- a + b ## Waits for a and b to be resolved &gt; y [1] 5050 </code></pre> <p>Below are different formats of my talk (18 slides + 9 appendix slides) on 2016-06-28:</p> <ul> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016,flat.html">HTML</a> (non-incremental slides; requires online access)</li> <li><a 
href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016.pdf">PDF</a> (incremental slides)</li> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016,flat.pdf">PDF</a> (non-incremental slides)</li> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016,pure.md">Markdown</a> (screen reader friendly)</li> <li><a href="https://www.youtube.com/watch?v=K8KYi9AFRlk">YouTube</a> (video recording)</li> </ul> <p>May the future be with you!</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>useR 2016: <ul> <li>Conference site: <a href="https://user2016.r-project.org/">https://user2016.r-project.org/</a></li> <li>Talk abstract: <a href="https://user2016.sched.org/event/7BZK/a-future-for-r">https://user2016.sched.org/event/7BZK/a-future-for-r</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a 
href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> matrixStats: Optimized Subsetted Matrix Calculations https://www.jottr.org/2015/12/16/matrixstats-subsetting/ Wed, 16 Dec 2015 00:00:00 +0000 https://www.jottr.org/2015/12/16/matrixstats-subsetting/ <p>The <a href="http://cran.r-project.org/package=matrixStats">matrixStats</a> package provides highly optimized functions for computing <a href="https://cran.r-project.org/web/packages/matrixStats/vignettes/matrixStats-methods.html">common summaries</a> over rows and columns of matrices. In a <a href="https://www.jottr.org/2015/01/matrixStats-0.13.1.html">previous blog post</a>, I showed that, instead of using <code>apply(X, MARGIN = 2, FUN = median)</code>, we can speed up calculations dramatically by using <code>colMedians(X)</code>. In the most recent release (version 0.50.0), matrixStats has been extended to perform <strong>optimized calculations also on a subset of rows and/or columns</strong> specified via new arguments <code>rows</code> and <code>cols</code>, e.g. <code>colMedians(X, cols = 1:50)</code>.</p> <p><img src="https://www.jottr.org/post/DragsterLeavingTeamBehind.gif" alt="Dragster leaving team behind" /></p> <p>For instance, assume we wish to find the median value of the first 50 columns of matrix <code>X</code> with 1,000,000 rows and 100 columns. For simplicity, assume</p> <pre><code class="language-r">&gt; X &lt;- matrix(rnorm(1e6 * 100), nrow = 1e6, ncol = 100) </code></pre> <p>To get the median values without matrixStats, we would do</p> <pre><code class="language-r">&gt; y &lt;- apply(X[, 1:50], MARGIN = 2, FUN = median) &gt; str(y) num [1:50] -0.001059 0.00059 0.001316 0.00103 0.000814 ... 
</code></pre> <p>As in the past, we could use matrixStats to do</p> <pre><code class="language-r">&gt; y &lt;- colMedians(X[, 1:50]) </code></pre> <p>which is <a href="https://www.jottr.org/2015/01/matrixStats-0.13.1.html">much faster</a> than <code>apply()</code> with <code>median()</code>.</p> <p>However, both approaches require that <code>X</code> is subsetted before the actual calculations can be performed, i.e. the temporary object <code>X[, 1:50]</code> is created. In this example, the size of the original matrix is ~760 MiB and the subsetted one is ~380 MiB:</p> <pre><code class="language-r">&gt; object.size(X) 800000200 bytes &gt; object.size(X[, 1:50]) 400000100 bytes </code></pre> <p>This temporary object is created by (i) R first allocating memory for it and then (ii) copying all its values over from <code>X</code>. After the medians have been calculated, this temporary object is automatically discarded and eventually (iii) R&rsquo;s garbage collector will deallocate its memory. This introduces overhead in the form of extra memory usage as well as processing time.</p> <p>Starting with matrixStats 0.50.0, we can avoid this overhead by instead using</p> <pre><code class="language-r">&gt; y &lt;- colMedians(X, cols = 1:50) </code></pre> <p><strong>This uses less memory</strong>, because no internal copy of <code>X[, 1:50]</code> has to be created. Instead all calculations are performed directly on the source object <code>X</code>. Because of this, the latter approach of subsetting is <strong>also faster</strong>.</p> <h2 id="bootstrapping-example">Bootstrapping example</h2> <p>Subsetted calculations occur naturally in bootstrap analysis. Assume we want to calculate the median for each column of a 100-by-10,000 matrix <code>X</code> where <strong>the rows are resampled with replacement</strong> 1,000 times. 
Without matrixStats, this can be done as</p> <pre><code class="language-r">B &lt;- 1000 Y &lt;- matrix(NA_real_, nrow = B, ncol = ncol(X)) for (b in seq_len(B)) { rows &lt;- sample(seq_len(nrow(X)), replace = TRUE) Y[b, ] &lt;- apply(X[rows, ], MARGIN = 2, FUN = median) } </code></pre> <p>However, powered by the new matrixStats, we can do</p> <pre><code class="language-r">B &lt;- 1000 Y &lt;- matrix(NA_real_, nrow = B, ncol = ncol(X)) for (b in seq_len(B)) { rows &lt;- sample(seq_len(nrow(X)), replace = TRUE) Y[b, ] &lt;- colMedians(X, rows = rows) } </code></pre> <p>In the first approach, with explicit subsetting (<code>X[rows, ]</code>), we are creating a large number of temporary objects - each of size <code>object.size(X[rows, ]) == object.size(X)</code> - that all need to be allocated, copied and deallocated. Thus, if <code>X</code> is a 100-by-10,000 double matrix of size 8,000,200 bytes = 7.6 MiB, we are allocating and deallocating a total of 7.5 GiB worth of RAM when using 1,000 bootstrap samples. With a million bootstrap samples, we&rsquo;re consuming a total of 7.3 TiB RAM. In other words, we are wasting lots of compute resources on memory allocation, copying, deallocation and garbage collection. 
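</p> <p>These numbers are straightforward to verify from R itself - a quick sketch (the exact size includes a small attribute overhead on top of the 8 bytes per double):</p> <pre><code class="language-r">X &lt;- matrix(0, nrow = 100, ncol = 10000)
size &lt;- as.numeric(object.size(X))  ## ~8,000,200 bytes (~7.6 MiB)

B &lt;- 1000
B * size / 1024^3   ## ~7.5 GiB allocated, copied and deallocated

B &lt;- 1e6
B * size / 1024^4   ## ~7.3 TiB for one million bootstrap samples
</code></pre> <p>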
Instead, by using the optimized subsetted calculations available in matrixStats (&gt;= 0.50.0), which is used in the second approach, we spare the computer all that overhead.</p> <p>Not only does the peak memory requirement go down by roughly a half, but <strong>the overall speedup is also substantial</strong>; using a regular notebook the above 1,000 bootstrap samples took 660 seconds (= 11 minutes) to complete using <code>apply(X[rows, ])</code>, 85 seconds (8x speedup) using <code>colMedians(X[rows, ])</code> and 45 seconds (<strong>15x speedup</strong>) using <code>colMedians(X, rows = rows)</code>.</p> <h2 id="availability">Availability</h2> <p>The matrixStats package can be installed on all common operating systems as</p> <pre><code class="language-r">&gt; install.packages(&quot;matrixStats&quot;) </code></pre> <p>The source code is available on <a href="https://github.com/HenrikBengtsson/matrixStats/">GitHub</a>.</p> <h2 id="credits">Credits</h2> <p>Support for optimized calculations on subsets was implemented by <a href="https://www.linkedin.com/in/dongcanjiang">Dongcan Jiang</a>. Dongcan is a Master&rsquo;s student in Computer Science at Peking University and worked on <a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">this project</a> from April to August 2015 through support by the <a href="https://developers.google.com/open-source/gsoc/">Google Summer of Code</a> 2015 program. This GSoC project was mentored jointly by me and Hector Corrada Bravo at University of Maryland. We would like to thank Dongcan again for this valuable addition to the package and the community. 
We would also like to thank Google and the <a href="https://github.com/rstats-gsoc/">R Project in GSoC</a> for making this possible.</p> <p>Any type of feedback, including <a href="https://github.com/HenrikBengtsson/matrixStats/issues/">bug reports</a>, is always appreciated!</p> <h2 id="links">Links</h2> <ul> <li>CRAN package: <a href="http://cran.r-project.org/package=matrixStats">http://cran.r-project.org/package=matrixStats</a></li> <li>Source code and bug reports: <a href="https://github.com/HenrikBengtsson/matrixStats">https://github.com/HenrikBengtsson/matrixStats</a></li> <li>Google Summer of Code (GSoC): <a href="https://developers.google.com/open-source/gsoc/">https://developers.google.com/open-source/gsoc/</a></li> <li>R Project in GSoC (R-GSoC): <a href="https://github.com/rstats-gsoc">https://github.com/rstats-gsoc</a></li> <li>matrixStats in R-GSoC 2015: <a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats</a></li> </ul> Milestone: 7000 Packages on CRAN https://www.jottr.org/2015/08/12/milestone-cran-7000/ Wed, 12 Aug 2015 00:00:00 +0000 https://www.jottr.org/2015/08/12/milestone-cran-7000/ <p>Another 1,000 packages were added to CRAN, which took less than 9 months. Today (August 12, 2015), the Comprehensive R Archive Network (CRAN) package page reports:</p> <blockquote> <p>&ldquo;Currently, the CRAN package repository features 7002 available packages.&rdquo;</p> </blockquote> <p>While the previous 1,000 packages took 355 days, going from 6,000 to 7,000 packages took 286 days - which means that now a new CRAN package is born on average every 6.9 hours (or 3.5 packages per day). Since the start of CRAN 18.3 years ago on April 23, 1997, there has been on average one new package appearing on CRAN every 22.9 hours. It is actually more frequent than that because dropped/archived packages are not accounted for. 
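</p> <p>The quoted rates are simple arithmetic, e.g. in R:</p> <pre><code class="language-r">286 * 24 / 1000            ## ~6.9 hours per new package
1000 / 286                 ## ~3.5 new packages per day
18.3 * 365.25 * 24 / 7002  ## ~22.9 hours per package since 1997
</code></pre> <p>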
The 7,000 packages on CRAN are maintained by ~4,130 people.</p> <p>Thanks to the CRAN team and to all package developers. You can give back by carefully reporting bugs to the maintainers and properly citing any packages you use in your publications (see <code>citation(&quot;pkg name&quot;)</code>).</p> <p>Milestones:</p> <ul> <li>2015-08-12: <a href="https://stat.ethz.ch/pipermail/r-package-devel/2015q3/000393.html">7000 packages</a></li> <li>2014-10-29: <a href="https://mailman.stat.ethz.ch/pipermail/r-devel/2014-October/069997.html">6000 packages</a></li> <li>2013-11-08: <a href="https://stat.ethz.ch/pipermail/r-devel/2013-November/067935.html">5000 packages</a></li> <li>2012-08-23: <a href="https://stat.ethz.ch/pipermail/r-devel/2012-August/064675.html">4000 packages</a></li> <li>2011-05-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2011-May/061002.html">3000 packages</a></li> <li>2009-10-04: <a href="https://stat.ethz.ch/pipermail/r-devel/2009-October/055049.html">2000 packages</a></li> <li>2007-04-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2007-April/045359.html">1000 packages</a></li> <li>2004-10-01: 500 packages</li> <li>2003-04-01: 250 packages</li> </ul> <p>These data are for CRAN only. There are many more packages elsewhere, e.g. <a href="http://bioconductor.org/">Bioconductor</a>, <a href="http://r-forge.r-project.org/">R-Forge</a> (sic!), <a href="http://rforge.net/">RForge</a> (sic!), <a href="http://github.com/">Github</a> etc.</p> Performance: Calling R_CheckUserInterrupt() Every 256 Iteration is Actually Faster than Every 1,000,000 Iteration https://www.jottr.org/2015/06/05/checkuserinterrupt/ Fri, 05 Jun 2015 00:00:00 +0000 https://www.jottr.org/2015/06/05/checkuserinterrupt/ <p>If your native code takes more than a few seconds to finish, it is a nice courtesy to the user to check for user interrupts (Ctrl-C) once in a while, say, every 1,000 or 1,000,000 iteration. 
The C-level API of R provides <code>R_CheckUserInterrupt()</code> for this (see &lsquo;Writing R Extensions&rsquo; for more information on this function). Here&rsquo;s what the code would typically look like:</p> <pre><code class="language-c">for (int ii = 0; ii &lt; n; ii++) { /* Some computationally expensive code */ if (ii % 1000 == 0) R_CheckUserInterrupt(); } </code></pre> <p>This uses the modulo operator <code>%</code> and tests whether the result is zero, which happens every 1,000 iteration. When this occurs, it calls <code>R_CheckUserInterrupt()</code>, which will interrupt the processing and &ldquo;return to R&rdquo; whenever an interrupt is detected.</p> <p>Interestingly, it turns out that it is <em>significantly faster to do this check every $k=2^m$ iteration</em>, e.g. instead of doing it every 1,000 iteration, it is faster to do it every 1,024 iteration. Similarly, instead of, say, doing it every 1,000,000 iteration, do it every 1,048,576 - not one less (1,048,575) or one more (1,048,577). The difference is so large that it is even 2-3 times faster to call <code>R_CheckUserInterrupt()</code> every 256 iteration rather than, say, every 1,000,000 iteration, which at least to me was a bit counterintuitive the first time I observed it.</p> <p>Below are some benchmark statistics supporting the claim that testing / calculating <code>ii % k == 0</code> is faster for $k=2^m$ (blue) than for other choices of $k$ (red).</p> <p><img src="https://www.jottr.org/post/boxplot.png" alt="Boxplot showing that testing every 2^k:th iteration is faster" /></p> <p>Note that the times are on the log scale (the results are also tabulated at the end of this post). Now, will it make a big difference to the overall performance of your code if you choose, say, 1,048,576 instead of 1,000,000? Probably not, but on the other hand, it does not hurt to pick an interval that is a $2^m$ integer. 
This observation may also be useful in algorithms that make lots of use of the modulo operator.</p> <p>So why is <code>ii % k == 0</code> a faster test when $k=2^m$? <del>I can only speculate. For instance, the integer $2^m$ is a binary number with all bits but one set to zero. It might be that this is faster to test for than other bit patterns, but I don&rsquo;t know if this is because of how the native code is optimized by the compiler and/or if it goes down to the hardware/CPU level. I&rsquo;d be interested in feedback and hear your thoughts on this.</del></p> <p><strong>UPDATE 2015-06-15</strong>: Thomas Lumley kindly <a href="https://twitter.com/tslumley/status/610627555545083904">replied</a> and pointed me to fact that <a href="https://en.wikipedia.org/wiki/Modulo_operation#Performance_issues">&ldquo;the modulo of powers of 2 can alternatively be expressed as a bitwise AND operation&rdquo;</a>, which in C terms means that <code>ii % 2^m</code> is identical to <code>ii &amp; (2^m - 1)</code> (at least for positive integers), and this is <a href="http://stackoverflow.com/questions/22446425/do-c-c-compilers-such-as-gcc-generally-optimize-modulo-by-a-constant-power-of">an optimization that the GCC compiler does by default</a>. The bitwise AND operator is extremely fast, because the CPU can take the AND on all bits at the same time (think 64 electronic AND gates for a 64-bit integer). After this, comparing to zero is also very fast. The optimization cannot be done for integers that are not powers of two. 
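</p> <p>The equivalence is easy to check from R using base R&rsquo;s <code>bitwAnd()</code> - a small sketch (it holds for non-negative integers whenever the divisor is a power of two):</p> <pre><code class="language-r">k &lt;- 256L  ## 2^8
ii &lt;- 0:100000L
stopifnot(all((ii %% k == 0L) == (bitwAnd(ii, k - 1L) == 0L)))
</code></pre> <p>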
So, in our case, when the compiler sees <code>ii % 256 == 0</code> it optimizes it to become <code>(ii &amp; 255) == 0</code>, which is much faster to calculate than a true modulo operation, as needed for <code>ii % 257 == 0</code>, <code>ii % 1000000 == 0</code>, and so on.</p> <h2 id="details-on-how-the-benchmarking-was-done">Details on how the benchmarking was done</h2> <p>I used the <a href="http://cran.r-project.org/package=inline">inline</a> package to generate a set of C-level functions with varying interrupt intervals ($k$). I&rsquo;m not passing $k$ as a parameter to these functions. Instead, I use it as a constant value so that the compiler can optimize as far as possible, but also in order to imitate how most code is written. This is why I generate multiple C functions. I benchmarked across a wide range of interval choices using the <a href="http://cran.r-project.org/package=microbenchmark">microbenchmark</a> package. The C functions (with corresponding R functions calling them) and the corresponding benchmark expressions to be called were generated as follows:</p> <pre><code class="language-r">## The interrupt intervals to benchmark ## (a) Classical values ks &lt;- c(1, 10, 100, 1000, 10e3, 100e3, 1e6) ## (b) 2^k values and the ones before and after ms &lt;- c(2, 5, 8, 10, 16, 20) as &lt;- c(-1, 0, +1) + rep(2^ms, each = 3) ## List of unevaluated expressions to benchmark mbexpr &lt;- list() for (k in sort(c(ks, as))) { name &lt;- sprintf(&quot;every_%d&quot;, k) ## The C function assign(name, inline::cfunction(c(length = &quot;integer&quot;), body = sprintf(&quot; int i, n = asInteger(length); for (i=0; i &lt; n; i++) { if (i %% %d == 0) R_CheckUserInterrupt(); } return ScalarInteger(n); &quot;, k))) ## The corresponding expression to benchmark mbexpr &lt;- c(mbexpr, substitute(every(n), list(every = as.symbol(name)))) } </code></pre> <p>The actual benchmarking of the 25 cases was then done by calling:</p> <pre><code class="language-r">n 
&lt;- 10e6 ## Number of iterations stats &lt;- microbenchmark::microbenchmark(list = mbexpr) </code></pre> <table> <thead> <tr> <th align="left">expr</th> <th align="right">min</th> <th align="right">lq</th> <th align="right">mean</th> <th align="right">median</th> <th align="right">uq</th> <th align="right">max</th> </tr> </thead> <tbody> <tr> <td align="left">every_1(n)</td> <td align="right">479.19</td> <td align="right">485.08</td> <td align="right">511.45</td> <td align="right">492.91</td> <td align="right">521.50</td> <td align="right">839.50</td> </tr> <tr> <td align="left">every_3(n)</td> <td align="right">184.08</td> <td align="right">185.74</td> <td align="right">197.86</td> <td align="right">189.10</td> <td align="right">197.31</td> <td align="right">321.69</td> </tr> <tr> <td align="left">every_4(n)</td> <td align="right">148.99</td> <td align="right">150.80</td> <td align="right">160.92</td> <td align="right">152.73</td> <td align="right">158.55</td> <td align="right">245.72</td> </tr> <tr> <td align="left">every_5(n)</td> <td align="right">127.42</td> <td align="right">129.25</td> <td align="right">134.18</td> <td align="right">131.26</td> <td align="right">134.69</td> <td align="right">190.88</td> </tr> <tr> <td align="left">every_10(n)</td> <td align="right">91.96</td> <td align="right">93.12</td> <td align="right">99.75</td> <td align="right">94.48</td> <td align="right">98.10</td> <td align="right">194.98</td> </tr> <tr> <td align="left">every_31(n)</td> <td align="right">65.78</td> <td align="right">67.15</td> <td align="right">71.18</td> <td align="right">68.33</td> <td align="right">70.52</td> <td align="right">113.55</td> </tr> <tr> <td align="left">every_32(n)</td> <td align="right">49.12</td> <td align="right">49.49</td> <td align="right">51.72</td> <td align="right">50.24</td> <td align="right">51.38</td> <td align="right">91.28</td> </tr> <tr> <td align="left">every_33(n)</td> <td align="right">63.29</td> <td align="right">64.01</td> <td 
align="right">67.96</td> <td align="right">64.76</td> <td align="right">68.79</td> <td align="right">112.26</td> </tr> <tr> <td align="left">every_100(n)</td> <td align="right">50.85</td> <td align="right">51.46</td> <td align="right">54.81</td> <td align="right">52.37</td> <td align="right">55.01</td> <td align="right">89.83</td> </tr> <tr> <td align="left">every_255(n)</td> <td align="right">56.05</td> <td align="right">56.48</td> <td align="right">59.81</td> <td align="right">57.21</td> <td align="right">59.25</td> <td align="right">119.47</td> </tr> <tr> <td align="left">every_256(n)</td> <td align="right">19.46</td> <td align="right">19.62</td> <td align="right">21.03</td> <td align="right">19.88</td> <td align="right">20.71</td> <td align="right">41.98</td> </tr> <tr> <td align="left">every_257(n)</td> <td align="right">53.32</td> <td align="right">53.70</td> <td align="right">57.16</td> <td align="right">54.54</td> <td align="right">56.34</td> <td align="right">96.61</td> </tr> <tr> <td align="left">every_1000(n)</td> <td align="right">44.76</td> <td align="right">46.68</td> <td align="right">50.40</td> <td align="right">47.50</td> <td align="right">50.19</td> <td align="right">121.97</td> </tr> <tr> <td align="left">every_1023(n)</td> <td align="right">53.68</td> <td align="right">54.89</td> <td align="right">57.64</td> <td align="right">55.57</td> <td align="right">57.71</td> <td align="right">111.59</td> </tr> <tr> <td align="left">every_1024(n)</td> <td align="right">17.41</td> <td align="right">17.55</td> <td align="right">18.86</td> <td align="right">17.80</td> <td align="right">18.78</td> <td align="right">43.54</td> </tr> <tr> <td align="left">every_1025(n)</td> <td align="right">51.19</td> <td align="right">51.72</td> <td align="right">54.09</td> <td align="right">52.28</td> <td align="right">53.29</td> <td align="right">101.97</td> </tr> <tr> <td align="left">every_10000(n)</td> <td align="right">42.82</td> <td align="right">45.65</td> <td 
align="right">48.09</td> <td align="right">46.20</td> <td align="right">47.83</td> <td align="right">82.92</td> </tr> <tr> <td align="left">every_65535(n)</td> <td align="right">51.51</td> <td align="right">53.45</td> <td align="right">55.68</td> <td align="right">54.00</td> <td align="right">55.04</td> <td align="right">87.36</td> </tr> <tr> <td align="left">every_65536(n)</td> <td align="right">16.74</td> <td align="right">16.84</td> <td align="right">17.91</td> <td align="right">16.99</td> <td align="right">17.37</td> <td align="right">47.82</td> </tr> <tr> <td align="left">every_65537(n)</td> <td align="right">60.62</td> <td align="right">61.44</td> <td align="right">65.16</td> <td align="right">62.56</td> <td align="right">64.93</td> <td align="right">104.71</td> </tr> <tr> <td align="left">every_100000(n)</td> <td align="right">43.68</td> <td align="right">44.48</td> <td align="right">46.81</td> <td align="right">44.98</td> <td align="right">46.51</td> <td align="right">83.33</td> </tr> <tr> <td align="left">every_1000000(n)</td> <td align="right">41.61</td> <td align="right">44.21</td> <td align="right">46.99</td> <td align="right">44.86</td> <td align="right">47.11</td> <td align="right">87.90</td> </tr> <tr> <td align="left">every_1048575(n)</td> <td align="right">50.98</td> <td align="right">52.80</td> <td align="right">54.92</td> <td align="right">53.55</td> <td align="right">55.36</td> <td align="right">72.44</td> </tr> <tr> <td align="left">every_1048576(n)</td> <td align="right">16.73</td> <td align="right">16.83</td> <td align="right">17.92</td> <td align="right">17.05</td> <td align="right">17.89</td> <td align="right">35.52</td> </tr> <tr> <td align="left">every_1048577(n)</td> <td align="right">60.28</td> <td align="right">62.58</td> <td align="right">65.43</td> <td align="right">63.92</td> <td align="right">65.91</td> <td align="right">87.58</td> </tr> </tbody> </table> <p>I get similar results across various operating systems (Windows, OS X and 
Linux) all using GNU Compiler Collection (GCC).</p> <p>Feedback and comments are appreciated!</p> <p>To reproduce these results, do:</p> <pre><code class="language-r">&gt; path &lt;- 'https://raw.githubusercontent.com/HenrikBengtsson/jottr.org/master/blog/20150604%2CR_CheckUserInterrupt' &gt; html &lt;- R.rsp::rfile('R_CheckUserInterrupt.md.rsp', path = path) &gt; !html ## Open in browser </code></pre> To Students: matrixStats for Google Summer of Code https://www.jottr.org/2015/03/12/matrixstats-gsoc/ Thu, 12 Mar 2015 00:00:00 +0000 https://www.jottr.org/2015/03/12/matrixstats-gsoc/ <p>We are pleased to announce our proposal &lsquo;<strong><a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">Subsetted and parallel computations in matrixStats</a></strong>&rsquo; for Google Summer of Code. The project is aimed at a student with experience in R and C, it runs for three months, and the student gets paid 5500 USD by Google. Students from (almost) all over the world can apply. Application deadline is <strong>March 27, 2015</strong>. I, Henrik Bengtsson, and Héctor Corrada Bravo will be joint mentors. Communication and mentoring will occur online. 
We&rsquo;re looking forward to your application.</p> <p><img src="https://www.jottr.org/post/banner-gsoc2015.png" alt="Google Summer of Code 2015 banner" /></p> <h2 id="links">Links</h2> <ul> <li>The matrixStats GSoC project: <a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">Subsetted and parallel computations in matrixStats</a></li> <li>CRAN page: <a href="http://cran.r-project.org/package=matrixStats">http://cran.r-project.org/package=matrixStats</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/matrixStats">https://github.com/HenrikBengtsson/matrixStats</a></li> <li>R Project GSoC wiki: <a href="https://github.com/rstats-gsoc/gsoc2015">https://github.com/rstats-gsoc/gsoc2015</a></li> <li>Google Summer of Code (GSoC) page: <a href="http://www.google-melange.com/gsoc/homepage/google/gsoc2015">http://www.google-melange.com/gsoc/homepage/google/gsoc2015</a></li> </ul> <h2 id="related-posts">Related posts</h2> <ul> <li><a href="https://www.jottr.org/2015/01/matrixStats-0.13.1.html">PACKAGE: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors)</a></li> <li><a href="http://www.r-bloggers.com/?s=Google+Summer+of+Code">R Blogger posts on GSoC</a></li> </ul> How to: Package Vignettes in Plain LaTeX https://www.jottr.org/2015/02/21/how-to-plain-latex-vignettes/ Sat, 21 Feb 2015 00:00:00 +0000 https://www.jottr.org/2015/02/21/how-to-plain-latex-vignettes/ <p>Ever wanted to include a plain-LaTeX vignette in your package and have it compiled into a PDF? The <a href="http://cran.r-project.org/package=R.rsp">R.rsp</a> package provides a four-line solution for this.</p> <p><em>But, first, what&rsquo;s R.rsp?</em> R.rsp is an R package that implements a compiler for the RSP markup language. RSP can be used to embed dynamic R code in <em>any</em> text-based source document to be compiled into a final document, e.g. 
RSP-embedded LaTeX into PDF, RSP-embedded Markdown into HTML, RSP-embedded HTML into HTML and so on. The package provides a set of <em>vignette engines</em> making it straightforward to use RSP in vignettes, and there are also other vignette engines that, for instance, include static PDF vignettes. Starting with R.rsp v0.20.0 (on CRAN), a vignette engine for including plain LaTeX-based vignettes is also available. The R.rsp package installs out-of-the-box on all common operating systems, including Linux, OS X and Windows. Its source code is available on <a href="https://github.com/HenrikBengtsson/R.rsp">GitHub</a>.</p> <p><img src="https://www.jottr.org/post/Writing_ball_keyboard_3.jpg" alt="A Hansen writing ball - a keyboard invented by Rasmus Malling-Hansen in 1865" /></p> <h2 id="steps-to-include-a-latex-vignettes-in-your-package">Steps to include a LaTeX vignette in your package</h2> <ol> <li><p>Place your LaTeX file in the <code>vignettes/</code> directory of your package. If it needs other files such as image files, place those under this directory too.</p></li> <li><p>Rename the file to have filename extension <code>*.ltx</code>, e.g. <code>vignettes/UsingYadayada.ltx</code> (*)</p></li> <li><p>Add the following meta directives at the top of the LaTeX file:<br /> <code>%\VignetteIndexEntry{Using Yadayada}</code><br /> <code>%\VignetteEngine{R.rsp::tex}</code></p></li> <li><p>Add the following to your <code>DESCRIPTION</code> file:<br /> <code>Suggests: R.rsp</code><br /> <code>VignetteBuilder: R.rsp</code></p></li> </ol> <p>That&rsquo;s all!</p> <p>When you run <code>R CMD build</code>, the <code>R.rsp::tex</code> vignette engine will compile your LaTeX vignette into a PDF and make it part of your package&rsquo;s *.tar.gz file. As for any vignette engine, the PDF will be placed in the <code>inst/doc/</code> directory of the *.tar.gz file, ready to be installed together with your package. 
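To make the steps concrete, here is what a minimal <code>vignettes/UsingYadayada.ltx</code> might look like (a hypothetical sketch; the title, author, and body text are made up):

```latex
%\VignetteIndexEntry{Using Yadayada}
%\VignetteEngine{R.rsp::tex}
\documentclass{article}
\title{Using Yadayada}
\author{Jane Doe}
\begin{document}
\maketitle
This vignette is plain \LaTeX{}; no R code is evaluated when the
\texttt{R.rsp::tex} engine compiles it into a PDF.
\end{document}
```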
Users installing your package will <em>not</em> have to install R.rsp.</p> <p>If this is your first package vignette ever, you should know that you are now only baby steps away from writing your first &ldquo;dynamic&rdquo; vignette using Sweave, <a href="http://cran.r-project.org/package=knitr">knitr</a> or RSP. For RSP-embedded LaTeX vignettes, change the engine to <code>R.rsp::rsp</code>, rename the file to <code>*.ltx.rsp</code> (or <code>*.tex.rsp</code>) and start embedding R code in the LaTeX file, e.g. &lsquo;The p-value is &lt;%= signif(p, 2) %&gt;&rsquo;.</p> <p><em>Footnote:</em> (*) If one uses filename extension <code>*.tex</code>, then <code>R CMD check</code> will give a <em>false</em> NOTE about the file &ldquo;should probably not be installed&rdquo;. Using extension <code>*.ltx</code>, which is an official LaTeX extension, avoids this issue.</p> <h3 id="why-not-use-sweave">Why not use Sweave?</h3> <p>It has always been possible to &ldquo;hijack&rdquo; the Sweave vignette engine to achieve the same thing by renaming the filename extension to <code>*.Rnw</code> and including the proper <code>\VignetteIndexEntry</code> markup. This would trick R into compiling it as a Sweave vignette (without Sweave markup) resulting in a PDF, which in practice would work as a plain LaTeX-to-PDF compiler. The <code>R.rsp::tex</code> engine achieves the same without the &ldquo;hack&rdquo; and without the Sweave machinery.</p> <h3 id="static-pdfs">Static PDFs?</h3> <p>If you want to use a &ldquo;static&rdquo; pre-generated PDF as a package vignette, that can also be achieved in a few steps using the <code>R.rsp::asis</code> vignette engine. There is an R.rsp <a href="http://cran.r-project.org/package=R.rsp">vignette</a> explaining how to do this, but please consider alternatives that compile from source before doing this. Also, vignettes without full source may not be accepted by CRAN. 
A LaTeX vignette does not have this problem.</p> <h2 id="links">Links</h2> <ul> <li>CRAN page: <a href="http://cran.r-project.org/package=R.rsp">http://cran.r-project.org/package=R.rsp</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/R.rsp">https://github.com/HenrikBengtsson/R.rsp</a></li> </ul> Package: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors) https://www.jottr.org/2015/01/25/matrixstats-0.13.1/ Sun, 25 Jan 2015 00:00:00 +0000 https://www.jottr.org/2015/01/25/matrixstats-0.13.1/ <p>A new release 0.13.1 of <a href="http://cran.r-project.org/package=matrixStats">matrixStats</a> is now on CRAN. The source code is available on <a href="https://github.com/HenrikBengtsson/matrixStats">GitHub</a>.</p> <h2 id="what-does-it-do">What does it do?</h2> <p>The matrixStats package provides highly optimized functions for computing common summaries over rows and columns of matrices, e.g. <code>rowQuantiles()</code>. There are also functions that operate on vectors, e.g. <code>logSumExp()</code>. Their implementations strive to minimize both memory usage and processing time. They are often remarkably faster than good old <code>apply()</code> solutions. The calculations are mostly implemented in C, which allows us to optimize(*) beyond what is possible to do in plain R. 
The package installs out-of-the-box on all common operating systems, including Linux, OS X and Windows.</p> <p>The following example computes the median of the columns in a 20-by-500 matrix</p> <pre><code class="language-r">&gt; library(&quot;matrixStats&quot;) &gt; X &lt;- matrix(rnorm(20 * 500), nrow = 20, ncol = 500) &gt; stats &lt;- microbenchmark::microbenchmark(colMedians = colMedians(X), + `apply+median` = apply(X, MARGIN = 2, FUN = median), unit = &quot;ms&quot;) &gt; stats Unit: milliseconds expr min lq mean median uq max neval cld colMedians 0.41 0.45 0.49 0.47 0.5 0.75 100 a apply+median 21.50 22.77 25.59 23.86 26.2 107.12 100 b </code></pre> <p><img src="https://www.jottr.org/post/colMedians.png" alt="Graph showing that colMedians is significantly faster than apply+median over 100 test runs" /></p> <p>It shows that <code>colMedians()</code> is ~51 times faster than <code>apply(..., MARGIN = 2, FUN = median)</code> in this particular case. The relative gain varies with matrix shape, so you should benchmark with your configurations. You can also play around with the benchmark reports that are under development, e.g. <code>html &lt;- matrixStats:::benchmark(&quot;colRowMedians&quot;); !html</code>.</p> <h2 id="what-is-new">What is new?</h2> <p>With this release, all <em>the functions run faster than ever before and at the same time use less memory than ever before</em>, which in turn means that now even larger data matrices can be processed without having to upgrade the RAM. A few small bugs have also been fixed and some &ldquo;missing&rdquo; <a href="http://cran.r-project.org/web/packages/matrixStats/vignettes/matrixStats-methods.html">functions</a> have been added to the R API. This update is part of a long-term tune-up that started back in June 2014. Most of the major groundwork has already been done, but there is still room for improvements. 
If you&rsquo;re already using matrixStats functions in your package, you should see some notable speedups for those function calls, especially compared to what was available back in June. For instance, <code>rowMins()</code> is now <a href="http://stackoverflow.com/questions/13676878/fastest-way-to-get-min-from-every-column-in-a-matrix">5-20 times faster</a> than functions such as <code>base::pmin.int()</code> whereas in the past they performed roughly the same.</p> <p>I&rsquo;ve also added a large number of new package tests; the R and C source code coverage has recently gone up from 59% to <a href="https://coveralls.io/r/HenrikBengtsson/matrixStats?branch=develop">96%</a> (&hellip; and counting). Some of the bugs were discovered as part of this effort. Here, a special thanks should go out to Jim Hester for his great work on <a href="https://github.com/jimhester/covr">covr</a>, which provides me with on-the-fly coverage reports via Coveralls. (You can run covr locally or via GitHub + Travis CI, which is very easy if you&rsquo;re already up and running there. <em>Try it!</em>) I would also like to thank the R core team and the CRAN team for their continuous efforts on improving the package tests that we get via <code>R CMD check</code> but also via the CRAN farm (which occasionally catches code issues that I&rsquo;m not always seeing on my end).</p> <p><em>Footnote: (*) One strategy for keeping the memory footprint at a minimum is to optimize the implementations for the integer and the numeric (double) data types separately. Because of this, a great number of data-type coercions are avoided, coercions that otherwise would consume precious memory due to temporarily allocated copies, but also precious processing time because the garbage collector later would have to spend time cleaning up the mess. 
The new <code>weightedMean()</code> function, which is many times faster than <code>stats::weighted.mean()</code>, is one of several cases where this strategy is particularly helpful.</em></p> <h2 id="links">Links</h2> <ul> <li>CRAN page: <a href="http://cran.r-project.org/package=matrixStats">http://cran.r-project.org/package=matrixStats</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/matrixStats">https://github.com/HenrikBengtsson/matrixStats</a></li> <li>Coveralls page: <a href="https://coveralls.io/r/HenrikBengtsson/matrixStats?branch=develop">https://coveralls.io/r/HenrikBengtsson/matrixStats?branch=develop</a></li> <li>Bug reports: <a href="https://github.com/HenrikBengtsson/matrixStats/issues">https://github.com/HenrikBengtsson/matrixStats/issues</a></li> <li>covr: <a href="https://github.com/jimhester/covr">https://github.com/jimhester/covr</a></li> </ul> Milestone: 6000 Packages on CRAN https://www.jottr.org/2014/10/29/milestone-cran-6000/ Wed, 29 Oct 2014 00:00:00 +0000 https://www.jottr.org/2014/10/29/milestone-cran-6000/ <p>Another 1,000 packages were added to CRAN and this time in less than 12 months. Today (2014-10-29) on The Comprehensive R Archive Network (CRAN) package page:</p> <blockquote> <p>&ldquo;Currently, the CRAN package repository features 6000 available packages.&rdquo;</p> </blockquote> <p>Going from 5,000 to 6,000 packages took 355 days, which means that, on average, only ~8.5 hours passed between each new package added. The actual rate is even higher, since dropped packages are not accounted for. The 6,000 packages on CRAN are maintained by 3,444 people. Thanks to all package developers and to the CRAN Team for handling all this!</p> <p>You can give back by carefully reporting bugs to the maintainers and properly citing any packages you use in your publications, cf. 
<code>citation(&quot;pkg name&quot;)</code>.</p> <p>Milestones:</p> <ul> <li>2014-10-29: <a href="https://mailman.stat.ethz.ch/pipermail/r-devel/2014-October/069997.html">6000 packages</a></li> <li>2013-11-08: <a href="https://stat.ethz.ch/pipermail/r-devel/2013-November/067935.html">5000 packages</a></li> <li>2012-08-23: <a href="https://stat.ethz.ch/pipermail/r-devel/2012-August/064675.html">4000 packages</a></li> <li>2011-05-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2011-May/061002.html">3000 packages</a></li> <li>2009-10-04: <a href="https://stat.ethz.ch/pipermail/r-devel/2009-October/055049.html">2000 packages</a></li> <li>2007-04-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2007-April/045359.html">1000 packages</a></li> <li>2004-10-01: 500 packages</li> <li>2003-04-01: 250 packages</li> </ul> <p>These data are for CRAN only. There are many more packages elsewhere, e.g. <a href="http://bioconductor.org/">Bioconductor</a>, <a href="http://r-forge.r-project.org/">R-Forge</a> (sic!), <a href="http://rforge.net/">RForge</a> (sic!), <a href="http://github.com/">Github</a> etc.</p> Pitfall: Did You Really Mean to Use matrix(nrow, ncol)? https://www.jottr.org/2014/06/17/matrixna-wrong-way/ Tue, 17 Jun 2014 00:00:00 +0000 https://www.jottr.org/2014/06/17/matrixna-wrong-way/ <p><img src="https://www.jottr.org/post/wrong_way_035.jpg" alt="Road sign reading &quot;Wrong Way&quot;" /></p> <p>Are you a good R citizen who preallocates your matrices? <strong>If you are allocating a numeric matrix in one of the following two ways, then you are doing it the wrong way!</strong></p> <pre><code class="language-r">x &lt;- matrix(nrow = 500, ncol = 100) </code></pre> <p>or</p> <pre><code class="language-r">x &lt;- matrix(NA, nrow = 500, ncol = 100) </code></pre> <p>Why? Because it is counterproductive. And why is that? In the above, <code>x</code> becomes a <strong>logical</strong> matrix, and <strong>not a numeric</strong> matrix as intended. 
This is because the default value of the <code>data</code> argument of <code>matrix()</code> is <code>NA</code>, which is a <strong>logical</strong> value, i.e.</p> <pre><code class="language-r">&gt; x &lt;- matrix(nrow = 500, ncol = 100) &gt; mode(x) [1] &quot;logical&quot; &gt; str(x) logi [1:500, 1:100] NA NA NA NA NA NA ... </code></pre> <p>Why is that bad? Because, as soon as you assign a numeric value to any of the cells in <code>x</code>, the matrix will first have to be coerced to numeric when the new value is assigned. <strong>The original logical matrix was allocated in vain and just adds an unnecessary memory footprint and extra work for the garbage collector</strong>.</p> <p>Instead, allocate it using <code>NA_real_</code> (or <code>NA_integer_</code> for integers):</p> <pre><code class="language-r">x &lt;- matrix(NA_real_, nrow = 500, ncol = 100) </code></pre> <p>Of course, if you wish to allocate a matrix with all zeros, use <code>0</code> instead of <code>NA_real_</code> (or <code>0L</code> for integers).</p> <p>The exact same thing happens with <code>array()</code>, also because the default value is <code>NA</code>, e.g.</p> <pre><code class="language-r">&gt; x &lt;- array(dim = c(500, 100)) &gt; mode(x) [1] &quot;logical&quot; </code></pre> <p>Similarly, be careful when you set up vectors using <code>rep()</code>, e.g. compare</p> <pre><code class="language-r">x &lt;- rep(NA, times = 500) </code></pre> <p>to</p> <pre><code class="language-r">x &lt;- rep(NA_real_, times = 500) </code></pre> <p>Note, if all you want is a vector of all zeros, you may as well use</p> <pre><code class="language-r">x &lt;- double(500) </code></pre> <p>for doubles and</p> <pre><code class="language-r">x &lt;- integer(500) </code></pre> <p>for integers.</p> <h2 id="details">Details</h2> <p>In the &lsquo;base&rsquo; package there is a neat little function called <code>tracemem()</code> that can be used to trace the internal copying of objects. 
We can use it to show how the two cases differ. Let&rsquo;s start by doing it the wrong way:</p> <pre><code class="language-r">&gt; x &lt;- matrix(nrow = 500, ncol = 100) &gt; tracemem(x) [1] &quot;&lt;0x00000000100a0040&gt;&quot; &gt; x[1,1] &lt;- 3.14 tracemem[0x00000000100a0040 -&gt; 0x000007ffffba0010]: &gt; x[1,2] &lt;- 2.71 &gt; </code></pre> <p>That &lsquo;tracemem&rsquo; output message basically tells us that <code>x</code> is copied, or more precisely that a new internal object (0x000007ffffba0010) is allocated and that <code>x</code> now refers to that instead of the original one (0x00000000100a0040). This happens because <code>x</code> needs to be coerced from logical to numerical before assigning cell (1,1) the (numerical) value 3.14. Note that there is no need for R to create a copy in the second assignment to <code>x</code>, because at this point it is already of a numeric type.</p> <p>To avoid the above, let&rsquo;s make sure to allocate a numeric matrix from the start, so that no extra copies are created:</p> <pre><code class="language-r">&gt; x &lt;- matrix(NA_real_, nrow = 500, ncol = 100) &gt; tracemem(x) [1] &quot;&lt;0x000007ffffd70010&gt;&quot; &gt; x[1,1] &lt;- 3.14 &gt; x[1,2] &lt;- 2.71 &gt; </code></pre> <h2 id="appendix">Appendix</h2> <h3 id="session-information">Session information</h3> <pre><code class="language-r">R version 3.1.0 Patched (2014-06-11 r65921) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] R.utils_1.32.5 R.oo_1.18.2 R.methodsS3_1.6.2 loaded via a namespace (and not attached): [1] R.cache_0.10.0 R.rsp_0.19.0 tools_3.1.0 </code></pre> <h3 id="reproducibility">Reproducibility</h3> <p>This report was generated from an RSP-embedded Markdown <a 
href="https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/matrixNA.md.rsp">document</a> using <a href="http://cran.r-project.org/package=R.rsp">R.rsp</a> v0.19.0. <!-- It can be recompiled as `R.rsp::rfile("https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/matrixNA.md.rsp")`. --></p> Performance: captureOutput() is Much Faster than capture.output() https://www.jottr.org/2014/05/26/captureoutput/ Mon, 26 May 2014 00:00:00 +0000 https://www.jottr.org/2014/05/26/captureoutput/ <p>The R function <code>capture.output()</code> can be used to &ldquo;collect&rdquo; the output of functions such as <code>cat()</code> and <code>print()</code> to strings. For example,</p> <pre><code class="language-r">&gt; s &lt;- capture.output({ + cat(&quot;Hello\nworld!\n&quot;) + print(pi) + }) &gt; s [1] &quot;Hello&quot; &quot;world!&quot; &quot;[1] 3.141593&quot; </code></pre> <p>More precisely, it captures all output sent to the <a href="http://www.wikipedia.org/wiki/Standard_streams">standard output</a> and returns a character vector where each element corresponds to a line of output. By the way, it does not capture the output sent to the standard error, e.g. <code>cat(&quot;Hello\nworld!\n&quot;, file = stderr())</code> and <code>message(&quot;Hello\nworld!\n&quot;)</code>.</p> <p>However, as currently implemented (R 3.1.0), this function is <a href="https://stat.ethz.ch/pipermail/r-devel/2014-February/068349.html">very slow</a> in capturing a large number of lines. Its processing time is approximately <em>quadratic (= $O(n^2)$)</em>, <del>exponential (= O(e^n))</del> in the number of lines captured, e.g. on my notebook 10,000 lines take 0.7 seconds to capture, whereas 50,000 take 12 seconds, and 100,000 take 42 seconds. The culprit is <code>textConnection()</code> which <code>capture.output()</code> utilizes. 
Without going into the <a href="https://github.com/wch/r-source/blob/R-3-1-branch/src/main/connections.c#L2920-2960">details</a>, it turns out that <code>textConnection()</code> copies lines one by one internally, which is extremely inefficient.</p> <p><strong>The <code>captureOutput()</code> function of <a href="http://cran.r-project.org/package=R.utils">R.utils</a> does not have this problem.</strong> Its processing time is <em>linear</em> in the number of lines and characters, because it relies on <code>rawConnection()</code> instead of <code>textConnection()</code>. For instance, 100,000 lines take 0.2 seconds and 1,000,000 lines take 2.5 seconds to capture when the lines are 100 characters long. For 100,000 lines with 1,000 characters it takes 2.4 seconds.</p> <h2 id="benchmarking">Benchmarking</h2> <p>The above benchmark results were obtained as follows. We first create a function that generates a string with a large number of lines:</p> <pre><code class="language-r">&gt; lineBuffer &lt;- function(n, len) { + line &lt;- paste(c(rep(letters, length.out = len), &quot;\n&quot;), collapse = &quot;&quot;) + line &lt;- charToRaw(line) + lines &lt;- rep(line, times = n) + rawToChar(lines, multiple = FALSE) + } </code></pre> <p>For example,</p> <pre><code class="language-r">&gt; cat(lineBuffer(n = 2, len = 10)) abcdefghij abcdefghij </code></pre> <p>For very long character vectors <code>paste()</code> becomes very slow, which is why <code>rawToChar()</code> is used above.</p> <p>Next, let&rsquo;s create a function that measures the processing time for a capture function to capture the output of a given number of lines:</p> <pre><code class="language-r">&gt; benchmark &lt;- function(fcn, n, len) { + x &lt;- lineBuffer(n, len) + system.time({ + fcn(cat(x)) + }, gcFirst = TRUE)[[3]] + } </code></pre> <p>Note that the measured processing time includes neither the creation of the line buffer string nor the garbage collection.</p> <p>The functions to be benchmarked are:</p> <pre><code 
class="language-r">&gt; fcns &lt;- list(capture.output = capture.output, captureOutput = captureOutput) </code></pre> <p>and we choose to benchmark for outputs with a variety number of lines:</p> <pre><code class="language-r">&gt; ns &lt;- c(1, 10, 100, 1000, 10000, 25000, 50000, 75000, 1e+05) </code></pre> <p>Finally, lets benchmark all of the above with lines of length 100 and 1,000 characters:</p> <pre><code class="language-r">&gt; benchmarkAll &lt;- function(ns, len) { + stats &lt;- lapply(ns, FUN = function(n) { + message(sprintf(&quot;n=%d&quot;, n)) + t &lt;- sapply(fcns, FUN = benchmark, n = n, len = len) + data.frame(name = names(t), n = n, time = unname(t)) + }) + Reduce(rbind, stats) + } &gt; stats_100 &lt;- benchmarkAll(ns, len = 100L) &gt; stats_1000 &lt;- benchmarkAll(ns, len = 1000L) </code></pre> <p>The results are:</p> <table> <thead> <tr> <th align="right">n</th> <th align="right">capture.output(100)</th> <th align="right">captureOutput(100)</th> <th align="right">capture.output(1000)</th> <th align="right">captureOutput(1000)</th> </tr> </thead> <tbody> <tr> <td align="right">1</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> </tr> <tr> <td align="right">10</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> </tr> <tr> <td align="right">100</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.01</td> <td align="right">0.00</td> </tr> <tr> <td align="right">1000</td> <td align="right">0.00</td> <td align="right">0.02</td> <td align="right">0.02</td> <td align="right">0.01</td> </tr> <tr> <td align="right">10000</td> <td align="right">0.69</td> <td align="right">0.02</td> <td align="right">0.80</td> <td align="right">0.21</td> </tr> <tr> <td align="right">25000</td> <td align="right">3.18</td> <td align="right">0.05</td> <td align="right">2.99</td> <td align="right">0.57</td> </tr> <tr> <td 
align="right">50000</td> <td align="right">11.88</td> <td align="right">0.15</td> <td align="right">10.33</td> <td align="right">1.17</td> </tr> <tr> <td align="right">75000</td> <td align="right">25.01</td> <td align="right">0.19</td> <td align="right">25.43</td> <td align="right">1.80</td> </tr> <tr> <td align="right">100000</td> <td align="right">41.73</td> <td align="right">0.24</td> <td align="right">46.34</td> <td align="right">2.41</td> </tr> </tbody> </table> <p><em>Table: Benchmarking of <code>captureOutput()</code> and <code>capture.output()</code> for n lines of length 100 and 1,000 characters. All times are in seconds.</em></p> <p><img src="https://www.jottr.org/post/captureOutput_vs_capture.output,67760e64d0951ca2124886cd8c257b6c,len=100.png" alt="captureOutput_vs_capture.output" /> <em>Figure: <code>captureOutput()</code> captures standard output much faster than <code>capture.output()</code>. The processing time for the latter grows exponentially in the number of lines captured whereas for the former it only grows linearly.</em></p> <p>These results will vary a little bit from run to run, particularly since we only benchmark once per setting. This also explains why for some settings the processing time for lines with 1,000 characters appears faster than the corresponding setting with 100 characters. 
Averaging over multiple runs would remove this artifact.</p> <p><strong>UPDATE:</strong><br /> 2015-02-06: Thanks to Kevin Van Horn for pointing out that the growth of the <code>capture.output()</code> processing time is probably not as extreme as <em>exponential</em> and for suggesting <em>quadratic</em> growth.</p> <h2 id="appendix">Appendix</h2> <h3 id="session-information">Session information</h3> <pre><code class="language-r">R version 3.1.0 Patched (2014-05-21 r65711) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] markdown_0.7 plyr_1.8.1 R.cache_0.9.5 knitr_1.5.26 [5] ggplot2_1.0.0 R.devices_2.9.2 R.utils_1.32.5 R.oo_1.18.2 [9] R.methodsS3_1.6.2 loaded via a namespace (and not attached): [1] base64enc_0.1-1 colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 [5] formatR_0.10 grid_3.1.0 gtable_0.1.2 labeling_0.2 [9] MASS_7.3-33 mime_0.1.1 munsell_0.4.2 proto_0.3-10 [13] R.rsp_0.18.2 Rcpp_0.11.1 reshape2_1.4 scales_0.2.4 [17] stringr_0.6.2 tools_3.1.0 </code></pre> <p>Tables were generated using <a href="http://cran.r-project.org/package=plyr">plyr</a> and <a href="http://cran.r-project.org/package=knitr">knitr</a>, and graphics using <a href="http://cran.r-project.org/package=ggplot2">ggplot2</a>.</p> <h3 id="reproducibility">Reproducibility</h3> <p>This report was generated from an RSP-embedded Markdown <a href="https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/captureOutput.md.rsp">document</a> using <a href="http://cran.r-project.org/package=R.rsp">R.rsp</a> v0.18.2. <!-- It can be recompiled as `R.rsp::rfile("https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/captureOutput.md.rsp")`. --></p> Speed Trick: Assigning Large Object NULL is Much Faster than using rm()! 
https://www.jottr.org/2013/05/25/trick-fast-rm/ Sat, 25 May 2013 00:00:00 +0000 https://www.jottr.org/2013/05/25/trick-fast-rm/ <p>When processing large data sets in R you often also end up creating large temporary objects. In order to keep the memory footprint small, it is always good to remove those temporary objects as soon as possible. When done, removed objects will be deallocated from memory (RAM) the next time the garbage collection runs.</p> <h2 id="better-use-rm-list-x-instead-of-rm-x-if-using-rm">Better: Use <code>rm(list = &quot;x&quot;)</code> instead of <code>rm(x)</code>, if using <code>rm()</code></h2> <p>To remove an object in R, one can use the <code>rm()</code> function (with alias <code>remove()</code>). However, it turns out that that function has quite a bit of internal overhead (look at its R code), particularly if you call it as <code>rm(x)</code> rather than <code>rm(list = &quot;x&quot;)</code>. The former takes about three times longer to complete. Example:</p> <pre><code class="language-r">&gt; t1 &lt;- system.time(for (k in 1:1e5) { a &lt;- 1; rm(a) }) &gt; t2 &lt;- system.time(for (k in 1:1e5) { a &lt;- 1; rm(list = &quot;a&quot;) }) &gt; t1 user system elapsed 10.45 0.00 10.50 &gt; t2 user system elapsed 2.93 0.00 2.94 &gt; t1/t2 user system elapsed 3.566553 NaN 3.571429 </code></pre> <p>Note: In order to minimize the impact of the memory allocation on the benchmark, I use <code>a &lt;- 1</code> to represent the &ldquo;large&rdquo; object.</p> <h2 id="best-use-x-null-instead-of-rm">Best: Use x &lt;- NULL instead of rm()</h2> <p>Instead of using <code>rm(list = &quot;x&quot;)</code>, which still has a fair amount of overhead, one can remove a large active object by assigning the corresponding variable a new value (a small object), e.g. <code>x &lt;- NULL</code>. Whenever doing this, the previously assigned value (the large object) will become available for garbage collection. 
Example:</p> <pre><code class="language-r">&gt; t3 &lt;- system.time(for (k in 1:1e5) { a &lt;- 1; a &lt;- NULL }) &gt; t3 user system elapsed 0.05 0.00 0.05 &gt; t1/t3 user system elapsed 209 NaN 210 </code></pre> <p>That&rsquo;s a <strong>200 times speedup</strong>!</p> <h2 id="background">Background</h2> <p>I &ldquo;accidentally&rdquo; discovered this when profiling <code>readMat()</code> in my <a href="http://cran.r-project.org/web/packages/R.matlab/">R.matlab</a> package. In particular, there was one <code>rm(x)</code> call inside a local function that was called thousands of times when parsing modestly large MAT files. Together with some additional optimizations, R.matlab v2.0.0 (to appear) is now 10-20 times faster. Now I&rsquo;m going to review all my other packages for expensive <code>rm()</code> calls.</p> This Day in History (1997-04-01) https://www.jottr.org/2013/04/01/history-r-help/ Mon, 01 Apr 2013 00:00:00 +0000 https://www.jottr.org/2013/04/01/history-r-help/ <p>Today it&rsquo;s 16 years - and 367,496 messages - since Martin Mächler started the R-help (321,119 msgs), R-devel (45,830 msgs) and R-announce (547 msgs) mailing lists [1] - a great benefit to all of us. Special thanks to Martin and also thanks to everyone else contributing to these forums.</p> <p><img src="https://www.jottr.org/post/r-help,r-devel.png" alt="Number of messages on R-help and R-devel from 1997 to 2013" /></p> <p>[1] <a href="https://stat.ethz.ch/pipermail/r-help/1997-April/001490.html">https://stat.ethz.ch/pipermail/r-help/1997-April/001490.html</a></p> Speed Trick: unlist(..., use.names=FALSE) is Heaps Faster! https://www.jottr.org/2013/01/07/trick-unlist/ Mon, 07 Jan 2013 00:00:00 +0000 https://www.jottr.org/2013/01/07/trick-unlist/ <p>Sometimes a minor change to your R code can make a big difference in processing time. 
Here is an example showing that if you don&rsquo;t care about the names attribute when <code>unlist()</code>:ing a list, specifying argument <code>use.names = FALSE</code> can speed up the processing a lot!</p> <pre><code class="language-r">&gt; x &lt;- split(sample(1000, size = 1e6, rep = TRUE), rep(1:1e5, times = 10)) &gt; t1 &lt;- system.time(y1 &lt;- unlist(x)) &gt; t2 &lt;- system.time(y2 &lt;- unlist(x, use.names = FALSE)) &gt; stopifnot(identical(y2, unname(y1))) &gt; t1/t2 user system elapsed 103 NaN 104 </code></pre> <p>That&rsquo;s more than a 100 times speedup.</p> <p>So, check your code to see to which <code>unlist()</code> statements you can add a <code>use.names = FALSE</code>.</p> Force R Help HTML Server to Always Use the Same URL Port https://www.jottr.org/2012/10/22/config-help-start/ Mon, 22 Oct 2012 00:00:00 +0000 https://www.jottr.org/2012/10/22/config-help-start/ <p>The below code shows how to configure the <code>help.ports</code> option in R such that the built-in R help server always uses the same URL port. Just add it to the <code>.Rprofile</code> file in your home directory (iff missing, create it). For more details, see <code>help(&quot;Startup&quot;)</code>.</p> <pre><code class="language-r"># Force the URL of the help to http://127.0.0.1:21510 options(help.ports = 21510) </code></pre> <p>A slightly fancier version is to use an environment variable to set the port(s):</p> <pre><code class="language-r">local({ ports &lt;- Sys.getenv(&quot;R_HELP_PORTS&quot;, 21510) ports &lt;- as.integer(unlist(strsplit(ports, &quot;,&quot;))) options(help.ports = ports) }) </code></pre> <p>However, if you launch multiple R sessions in parallel, this means that they will all try to use the same port, but it&rsquo;s only the first one that will succeed and all others will fail. An alternative is then to provide R with a set of ports to choose from (see <code>help(&quot;startDynamicHelp&quot;, package = &quot;tools&quot;)</code>). 
To set the ports to 21510-21519 if you run R v2.15.1, to 21520-21529 if you run R v2.15.2, to 21600-21609 if you run R v2.16.0 (&ldquo;devel&rdquo;) and so on, do:</p> <pre><code class="language-r">local({ port &lt;- sum(c(1e4, 100) * as.double(unlist(R.version[c(&quot;major&quot;, &quot;minor&quot;)]))) options(help.ports = port + 0:9) }) </code></pre> <p>With this, it will be easy to identify from the URL which version of R the displayed help is for. Finally, if you wish the R help server to start automatically in the background when you start R, add:</p> <pre><code class="language-r"># Try to start HTML help server if (interactive()) { try(tools::startDynamicHelp()) } </code></pre> Set Package Repositories at Startup https://www.jottr.org/2012/09/27/config-repos/ Thu, 27 Sep 2012 00:00:00 +0000 https://www.jottr.org/2012/09/27/config-repos/ <p>The below code shows how to configure the <code>repos</code> option in R such that <code>install.packages()</code> etc. will locate the packages without having to explicitly specify the repository. Just add it to the <code>.Rprofile</code> file in your home directory (iff missing, create it). For more details, see <code>help(&quot;Startup&quot;)</code>.</p> <pre><code class="language-r">local({ repos &lt;- getOption(&quot;repos&quot;) # http://cran.r-project.org/ # For a list of CRAN mirrors, see getCRANmirrors(). repos[&quot;CRAN&quot;] &lt;- &quot;http://cran.stat.ucla.edu&quot; # http://www.stats.ox.ac.uk/pub/RWin/ReadMe if (.Platform$OS.type == &quot;windows&quot;) { repos[&quot;CRANextra&quot;] &lt;- &quot;http://www.stats.ox.ac.uk/pub/RWin&quot; } # http://r-forge.r-project.org/ repos[&quot;R-Forge&quot;] &lt;- &quot;http://R-Forge.R-project.org&quot; # http://www.omegahat.org/ repos[&quot;Omegahat&quot;] &lt;- &quot;http://www.omegahat.org/R&quot; options(repos = repos) }) </code></pre>
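<p>As a quick sanity check - assuming the <code>.Rprofile</code> snippet above has been added - one can inspect the resulting option in a fresh R session:</p> <pre><code class="language-r">&gt; getOption(&quot;repos&quot;) </code></pre> <p>which should list the CRAN, R-Forge, and Omegahat repositories configured above (plus CRANextra on Windows).</p>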
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>JottR on R</title>
<link>https://www.jottr.org/categories/r/</link>
<description>Recent content in R on JottR</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Wed, 25 Jun 2025 00:00:00 +0000</lastBuildDate>
<atom:link href="https://www.jottr.org/categories/r/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>Setting Future Plans in R Functions — and Why You Probably Shouldn't</title>
<link>https://www.jottr.org/2025/06/25/with-plan/</link>
<pubDate>Wed, 25 Jun 2025 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2025/06/25/with-plan/</guid>
<description> <p><a href="https://www.jottr.org/2025/06/19/futureverse-10-years/"><img src="https://www.jottr.org/post/future-logo-balloons.png" alt="The 'future' hexlogo balloon wall" style="width: 20%; padding-left: 2ex; padding-bottom: 2ex; float: right;"/></a></p> <p>The <strong>future</strong> package <a href="https://www.jottr.org/2025/06/19/futureverse-10-years/">celebrates ten years on CRAN</a> as of June 19, 2025. This is the second in a series of blog posts highlighting recent improvements to the <strong><a href="https://www.futureverse.org">futureverse</a></strong> ecosystem.</p> <h2 id="tl-dr">TL;DR</h2> <p>You can now use</p> <pre><code class="language-r">my_fcn &lt;- function(...) { with(plan(multisession), local = TRUE) ... } </code></pre> <p>to <em>temporarily</em> set a future backend for use in your function. This guarantees that any changes are undone when the function exits, even if there is an error or an interrupt.</p> <p>But, I really recommend <em>not</em> doing any of that, as I&rsquo;ll try to explain below.</p> <h2 id="decoupling-of-intent-to-parallelize-and-how-to-execute-it">Decoupling of intent to parallelize and how to execute it</h2> <p>The core design philosophy of <strong>futureverse</strong> is:</p> <blockquote> <p>&ldquo;The developer decides what to parallelize, the user decides where and how.&rdquo;</p> </blockquote> <p>This decoupling of <em>intent</em> (what to parallelize) and <em>execution</em> (how to do it) makes code written using futureverse flexible, portable, and easy to maintain.</p> <p>Specifically, the developer <em>controls what to parallelize</em> by using <code>future()</code> or higher-level abstractions like <code>future_lapply()</code> and <code>future_map()</code> to mark code regions that may run concurrently. 
The code makes no assumptions about the compute environment and is therefore agnostic to which future backend is being used, e.g.</p> <pre><code class="language-r">y &lt;- future_lapply(X, slow_fcn) </code></pre> <p>and</p> <pre><code class="language-r">y &lt;- future_map(X, slow_fcn) </code></pre> <p>Note how there is nothing in those two function calls that specifies how they are parallelized, if at all. Instead, the end user (e.g., data analyst, HPC user, or script runner) <em>controls the execution strategy</em> by setting the <a href="https://www.futureverse.org/backends.html">future backend</a> via <code>plan()</code>, e.g., built-in sequential, built-in multisession, <strong><a href="https://future.callr.futureverse.org">future.callr</a></strong>, and <strong><a href="https://future.mirai.futureverse.org">future.mirai</a></strong> backends. This allows the user to scale the same code from a notebook to an HPC cluster or cloud environment without changing the original code.</p> <p>This design of <em>decoupling intent and execution</em> can also be found in traditional R parallelization frameworks. In the <strong>parallel</strong> package we have <code>setDefaultCluster()</code>, which the user can set to control the default cluster type when none is explicitly specified. 
For that to be used, the developer needs to make sure to use the default <code>cl = NULL</code>, either explicitly as in:</p> <pre><code class="language-r">y &lt;- parLapply(cl = NULL, X, slow_fcn) </code></pre> <p>or implicitly<sup class="footnote-ref" id="fnref:1"><a href="#fn:1">1</a></sup>, by making sure all arguments are named, as in:</p> <pre><code class="language-r">y &lt;- parLapply(X = X, FUN = slow_fcn) </code></pre> <p>Unfortunately, this is rarely used - instead <code>parLapply(cl, X, FUN)</code> is by far the most common way of using the <strong>parallel</strong> package, resulting in little to no control for the end user.</p> <p>The <strong>foreach</strong> package had greater success with this design philosophy. There the developer writes:</p> <pre><code class="language-r">y &lt;- foreach(x = X) %dopar% { slow_fcn(x) } </code></pre> <p>with no option in that call to specify which parallel backend to use. Instead, the user typically controls the parallel backend via the so-called &ldquo;dopar&rdquo; foreach adapter, e.g. <code>doParallel::registerDoParallel()</code>, <code>doMC::registerDoMC()</code>, and <code>doFuture::registerDoFuture()</code>. Unfortunately, there are ways for the developer to write <code>foreach()</code> with <code>%dopar%</code> statements such that the code works only with a specific parallel backend<sup class="footnote-ref" id="fnref:2"><a href="#fn:2">2</a></sup>. Regardless, it is clear from their designs that both of these packages shared the same fundamental design philosophy of <em>decoupling intent and execution</em> as is used in the <strong>futureverse</strong>. You can read more about this in the introduction of my <a href="https://journal.r-project.org/archive/2021/RJ-2021-048/index.html">H. Bengtsson (2021)</a> article.</p> <p>When writing scripts or Rmarkdown documents, I recommend putting code that controls the execution (e.g. 
<code>plan()</code>, <code>registerDoNnn()</code>, and <code>setDefaultCluster()</code>) at the very top, immediately after any <code>library()</code> statements. This is also where I, like many others, prefer to put global settings such as <code>options()</code> statements. This makes it easier for anyone to identify which settings are available and used by the script. It also avoids cluttering up the rest of the code with such details.</p> <h2 id="straying-away-from-the-core-design-philosophy">Straying away from the core design philosophy</h2> <p>One practical advantage of the above decoupling design is that there is only one place where parallelization is controlled, instead of it being scattered throughout the code, e.g. as special parallel arguments to different function calls. This makes it easier for the end user, but also for the package developer who does not have to worry about what their APIs should look like and what arguments they should take.</p> <p>That said, some package developers prefer to expose control of parallelization via special function arguments. If we search CRAN packages, we find arguments like <code>parallel = FALSE</code>, <code>ncores = 1</code>, and <code>cluster = NULL</code> that then are used internally to set up the parallel backend. 
If you write functions that take this approach, it is <em>critical</em> that you remember to set the backend only temporarily, which can be done via <code>on.exit()</code>, e.g.</p> <pre><code class="language-r">my_fcn &lt;- function(xs, ncores = 1) { if (ncores &gt; 1) { cl &lt;- parallel::makeCluster(ncores) on.exit(parallel::stopCluster(cl)) y &lt;- parallel::parLapply(cl = cl, xs, slow_fcn) } else { y &lt;- lapply(xs, slow_fcn) } y } </code></pre> <p>If you use futureverse, you can use:</p> <pre><code class="language-r">my_fcn &lt;- function(xs, ncores = 1) { old_plan &lt;- plan(multisession, workers = ncores) on.exit(plan(old_plan)) y &lt;- future_lapply(xs, slow_fcn) y } </code></pre> <p>And, since <strong>future</strong> 1.40.0 (2025-04-10), you can achieve the same with a single line of code<sup class="footnote-ref" id="fnref:3"><a href="#fn:3">3</a></sup>:</p> <pre><code class="language-r">my_fcn &lt;- function(xs, ncores = 1) { with(plan(multisession, workers = ncores), local = TRUE) y &lt;- future_lapply(xs, slow_fcn) y } </code></pre> <p>I hope that this addition lowers the risk of forgetting to undo any changes done by <code>plan()</code> inside functions. If you forget, then you may override what the user intends to use elsewhere. For instance, they might have set <code>plan(batchtools_slurm)</code> to run their R code across a Slurm high-performance-compute (HPC) cluster, but if you change the <code>plan()</code> inside your package function without undoing your changes, then the user is in for a surprise and maybe also hours of troubleshooting.</p> <h2 id="but-please-avoid-switching-future-backends-if-you-can">But, please avoid switching future backends if you can</h2> <p>I still want to plead with package developers to avoid setting the future backend, even temporarily, inside their functions. There are other reasons for not doing this. 
For instance, if you provide users with an <code>ncores</code> argument for controlling the amount of parallelization, you risk locking the user into a specific parallel backend. A common pattern is to use <code>plan(multisession, workers = ncores)</code> as in the above examples. However, this prevents the user from taking advantage of other closely related parallel backends, e.g. <code>plan(callr, workers = ncores)</code> and <code>plan(mirai_multisession, workers = ncores)</code>. The <strong>future.callr</strong> backend runs each parallel task in a fresh R session that is shut down immediately afterward, which is beneficial when memory is the limiting factor. The <strong>future.mirai</strong> backend is optimized for low latency, meaning it can also parallelize shorter tasks, which might otherwise not be worth parallelizing. Also, contrary to <code>multisession</code>, these alternative backends can make use of all CPU cores available on modern hardware, e.g. 192- and 256-core machines. The <code>multisession</code> backend, which builds upon <strong>parallel</strong> PSOCK clusters, is limited to a maximum of 125 parallel workers, because each parallel worker consumes one R connection, and R can only have 125 connections open at any time. There are ways to increase this limit, but it still requires work. See <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>parallelly::availableConnections()</code></a> for more details on this problem and how to increase the maximum number of connections.</p> <p>You can of course add another &ldquo;parallel&rdquo; argument to allow your users to also control which future backend to use, e.g. <code>backend = multisession</code> and <code>ncores = 1</code>. But, this might not be sufficient - there are backends that take additional arguments, which you then also need to support in each of your functions. 
Finally, new backends will be implemented by others in the future (pun intended and not), and we can&rsquo;t predict what they will require.</p> <p>Related to this, I am working on ways for (i) futureverse to choose among a set of parallel backends - not just one, (ii) based on resource specifications (e.g. memory needs and maximum run times) for specific future statements. This will give back some control to the developer over how and where execution happens and more options for the end user to scale out to different type of compute resources. For instance, a <code>future_map()</code> call with a 192-GiB memory requirement may only be sent to &ldquo;large-memory&rdquo; backends and, if not available, throw an instant error. Another example is a <code>future_map()</code> call with a 256-MiB memory and 5-minute runtime requirement - that is small enough to be sent to an AWS Lambda or GCS Cloud Functions backend, if the user has specified such a backend.</p> <p>In summary, I argue that it&rsquo;s better to let the user be in full control of the future backend, by letting them set it via <code>plan()</code>, preferably at the top of their scripts. If not possible, please make sure to use <code>with(plan(...), local = TRUE)</code>.</p> <p><em>May the future be with you!</em></p> <p>Henrik</p> <h2 id="reference">Reference</h2> <ul> <li>H. 
Bengtsson, A Unifying Framework for Parallel and Distributed Processing in R using Futures, The R Journal (2021) 13:2, pages 208-227 [<a href="https://journal.r-project.org/archive/2021/RJ-2021-048/index.html">abstract</a>, <a href="https://journal.r-project.org/archive/2021/RJ-2021-048/RJ-2021-048.pdf">PDF</a>]</li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>If the argument <code>cl = NULL</code> of <a href="https://rdrr.io/r/parallel/clusterApply.html"><code>parLapply()</code></a> had been the last argument instead of the first, then <code>parLapply(X, slow_fcn)</code>, which resembles <code>lapply(X, slow_fcn)</code>, would have also resulted in the default cluster being used.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2"><p><code>foreach()</code> takes backend-specific options (e.g. <code>.options.multicore</code>, <code>.options.parallel</code>, <code>.options.mpi</code>, and <code>.options.future</code>). The developer can use these to adjust the default behavior of a given foreach adapter. Unfortunately, when used - or rather, when needed - the code is no longer agnostic to the backend - what will happen if a foreach adapter is used that the developer did not anticipate?</p> <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3"><p>The <strong><a href="https://cran.r-project.org/package=withr">withr</a></strong> package has <code>with_nnn()</code> and <code>local_nnn()</code> functions for evaluating code with various settings temporarily changed. Following this lead, I was very close to adding <code>with_plan()</code> and <code>local_plan()</code> to <strong>future</strong> 1.40.0, but then I noticed that <strong><a href="https://cran.r-project.org/package=mirai">mirai</a></strong> supports <code>with(daemons(ncores), { ... })</code>. This works because <code>with()</code> is an S3 generic function. 
I like this approach, especially since it avoids adding more functions to the API. I added similar support for <code>with(plan(multisession, workers = ncores), { ... })</code>. More importantly, this allowed me to also add the <code>with(..., local = TRUE)</code> variant to be used inside functions, which makes it very easy to safely switch to a temporary future backend inside a function.</p> <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> </ol> </div> </description>
</item>
<item>
<title>Future Got Better at Finding Global Variables</title>
<link>https://www.jottr.org/2025/06/23/future-got-better-at-finding-global-variables/</link>
<pubDate>Mon, 23 Jun 2025 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2025/06/23/future-got-better-at-finding-global-variables/</guid>
<description><p><a href="https://www.jottr.org/2025/06/19/futureverse-10-years/"><img src="https://www.jottr.org/post/future-logo-balloons.png" alt="The 'future' hexlogo balloon wall" style="width: 20%; padding-left: 2ex; padding-bottom: 2ex; float: right;"/></a></p> <p>The <strong>future</strong> package <a href="https://www.jottr.org/2025/06/19/futureverse-10-years/">celebrates ten years on CRAN</a> as of June 19, 2025. This is the first in a series of blog posts highlighting recent improvements to the <strong><a href="https://www.futureverse.org">futureverse</a></strong> ecosystem.</p> <p>The <strong><a href="https://globals.futureverse.org">globals</a></strong> package is part of the futureverse and has had two recent releases on 2025-04-15 and 2025-05-08. These updates address a few corner cases that would otherwise lead to unexpected errors. They also resulted in several long, outstanding issues reported on the <strong><a href="https://future.futureverse.org">future</a></strong>, <strong><a href="https://future.apply.futureverse.org">future.apply</a></strong>, <strong><a href="https://furrr.futureverse.org">furrr</a></strong>, and <strong><a href="https://doFuture.futureverse.org">doFuture</a></strong> package issue trackers, and elsewhere, could be closed.</p> <p>The significant update is that <a href="https://globals.futureverse.org/reference/globalsOf.html"><code>findGlobals()</code></a> gained argument <code>method = &quot;dfs&quot;</code>, which finds globals in R expressions by walking its abstract syntax tree (AST) using a <em>depth-first-search</em> algorithm. <strong>This new approach does a better job of emulating how the R engine identifies global variables, which results in an even smoother ride for anyone using futureverse for parallel and distributed processing.</strong> Previously, a tweaked search algorithm adopted from <code>codetools::findGlobals()</code> was used. 
The <strong><a href="https://cran.r-project.org/package=codetools">codetools</a></strong> search algorithm is mainly designed for <code>R CMD check</code> to detect undefined variables being used in package code. To limit the number of false positives reported by <code>R CMD check</code>, such algorithms tend to be &ldquo;conservative&rdquo; by nature, so that we can trust what is reported. This strategy is not always sufficient for automatically detecting globals needed in parallel processing. As an example, in</p> <pre><code class="language-r">fcn &lt;- function() { a &lt;- b b &lt;- 1 } </code></pre> <p>variable <code>b</code> is a global variable, but if we ask <strong>codetools</strong>, it does not pick up <code>b</code> as a global;</p> <pre><code class="language-r">codetools::findGlobals(fun) #&gt; [1] &quot;{&quot; &quot;&lt;-&quot; </code></pre> <p>This false negative is alright for <code>R CMD check</code>, but, in contrast, for parallel processing, we need to use a &ldquo;liberal&rdquo; search algorithm. In parallel processing it is okay to pick up and export too many variables to the parallel worker. If a variable is not used, little harm is done, but if we fail to export a needed variable, we&rsquo;ll end up with an object-not-found error. Futureverse has since the early days (December 2015) used a modified version of the <strong>codetools</strong> algorithm that is liberal, but not too liberal. It detects <code>b</code> as a global variable;</p> <pre><code class="language-r">globals::findGlobals(fun) #&gt; [1] &quot;{&quot; &quot;&lt;-&quot; &quot;b&quot; </code></pre> <p>This liberal search strategy turns out to work surprisingly well for detecting globals needed in parallel processing, but there were corner cases where it failed. 
For example, <strong>futureverse</strong> struggled to identify global variables in cases such as:</p> <pre><code class="language-r">library(future) plan(multisession, workers = 2) x &lt;- 2 f &lt;- future(local({ h &lt;- function(x) -x h(x) })) value(f) </code></pre> <p>which resulted in</p> <pre><code>Error in eval(quote({ : object 'x' not found </code></pre> <p>This is because there are several different variables named <code>x</code>, and the one in the calling environment is &ldquo;masked&rdquo; by argument <code>x</code>, which results in <code>x</code> never being picked up and exported to the parallel worker.</p> <p>It might look as if this type of code was carefully curated to fail, and would rarely, if at all, be spotted in real code. As a matter of fact, this is a distilled version of a large real-world scenario reported by at least one person. It&rsquo;s thanks to such feedback that we together can make improvements to the <strong>futureverse</strong> ecosystem 🙏 I cannot know for sure, but I&rsquo;d suspect this has impacted several R developers already - the <strong>future</strong> package is after all among the 0.6% most downloaded packages and there are <a href="https://r-universe.dev/search?q=needs%3Afuture">1,300 packages that &ldquo;need&rdquo; it</a> as of May 2025. The above problem was fixed in <strong>globals</strong> 0.18.0 (2025-05-08) and <strong>future</strong> 1.49.0 (2025-05-09), which now make use of the new <code>findGlobals(..., method = &quot;dfs&quot;)</code> search strategy internally. 
After updating these packages, the above code snippet gives us</p> <pre><code class="language-r">value(f) #&gt; [1] -2 </code></pre> <p>as we&rsquo;d expect.</p> <p>Another corner-case bug fix is where</p> <pre><code class="language-r">library(future) library(magrittr) x &lt;- list() f &lt;- future ({ x %&gt;% `$&lt;-`(&quot;a&quot;, 42) }) </code></pre> <p>would result in the rather obscure error</p> <pre><code class="language-r">Error in e[[4]] : subscript out of bounds </code></pre> <p>This is due to <a href="https://gitlab.com/luke-tierney/codetools/-/issues/16">a bug</a> in the <strong>codetools</strong> package, which <strong>globals</strong> (&gt;= 0.17.0) [2025-04-15] works around. After updating, things work as expected;</p> <pre><code class="language-r">f &lt;- future ({ x %&gt;% `$&lt;-`(&quot;a&quot;, 42) }) value(f) #&gt; $a #&gt; [1] 42 </code></pre> <p>Yet another fix in <strong>globals</strong> (&gt;= 0.17.0) is that previous versions would throw an error if they ran into an S7 object. The S7 object class was introduced in 2023.</p> <p><em>May the future be with you!</em></p> <p>Henrik</p> <p>PS. Did you know that the <strong>codetools</strong> package is <a href="https://gitlab.com/luke-tierney/codetools/-/blob/master/noweb/codetools.nw?ref_type=heads">written using literate programming</a> following the vision of Donald Knuth? Neat, eh? And, it&rsquo;s almost like it was vibe coded, but with the large-language model (LLM) part being replaced by human knowledge and expertise 🤓</p> </description>
</item>
<item>
<title>Futureverse – Ten-Year Anniversary</title>
<link>https://www.jottr.org/2025/06/19/futureverse-10-years/</link>
<pubDate>Thu, 19 Jun 2025 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2025/06/19/futureverse-10-years/</guid>
<description><figure style="margin-top: 3ex;"> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/future-logo-balloons.png" alt="The 'future' hexlogo balloon wall" style="width: 80%;"/> </center> </div> <figcaption style="font-style: italic"> The future package turns ten on CRAN today – June 19, 2025. <small>(Image credits: Dan LaBar for the future logo; Hadley Wickham and Greg Swinehart for the ggplot2 logo and balloon wall; The future balloon wall was inspired by ggplot2’s recent real-world version and generated with ChatGPT.)</small> </figcaption> </figure> <p>The <strong><a href="https://future.futureverse.org">future</a></strong> package turns ten years old today. I released version 0.6.0 to CRAN on June 19, 2015, before I presented the package and shared my visions at <a href="https://www.jottr.org/2016/07/02/future-user2016-slides/">useR! 2016</a>. I had no idea adoption would snowball the way it has. It&rsquo;s been an exciting, fun journey, and the best part has been you - the users and developers who shaped the futureverse through questions, discussions, bug reports, and feature requests. Thank you!</p> <p>To celebrate, I’m kicking off a series of posts over the next few weeks covering the latest improvements that make it easier than ever to scale existing code up or out on a parallel or distributed backend of your choice - and eventually in ways that are neater than what our trusty workhorses <strong><a href="https://future.apply.futureverse.org">future.apply</a></strong> and <strong><a href="https://furrr.futureverse.org">furrr</a></strong> offer.</p> <p>These gains come from a slow, steady, multi-year process of remodelling: internal redesigns, working with package maintainers to retire use of deprecated functions, releasing, fixing regressions, and repeating - all of it going largely unnoticed by end-users and most developers, except for a few. 
The first CRAN release where this work could be noticed was <strong>future</strong> 1.40.0 (April 10), followed by regression fixes and additional features in 1.49.0 (May 9), and most recently 1.57.0 (June 5, 2025). More polishing and features are coming before we hit <strong>future</strong> 2.0.0 – in the near future (pun firmly intended). Thanks for helping make future a cornerstone of scalable R programming.</p> <p>Posts in this series thus far:</p> <ul> <li>2025-06-23: <a href="https://www.jottr.org/2025/06/23/future-got-better-at-finding-global-variables/">Future Got Better at Finding Global Variables</a></li> <li>2025-06-25: <a href="https://www.jottr.org/2025/06/25/with-plan/">Setting Future Plans in R Functions — and Why You Probably Shouldn&rsquo;t</a></li> </ul> <p><em>Stay tuned and may the future be with you!</em></p> <p>Henrik</p> </description>
</item>
<item>
<title>parallelly: Querying, Killing and Cloning Parallel Workers Running Locally or Remotely</title>
<link>https://www.jottr.org/2023/07/01/parallelly-managing-workers/</link>
<pubDate>Sat, 01 Jul 2023 18:00:00 +0200</pubDate>
<guid>https://www.jottr.org/2023/07/01/parallelly-managing-workers/</guid>
<description> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.36.0 has been on CRAN since May 2023. The <strong>parallelly</strong> package is part of the <a href="https://www.futureverse.org">Futureverse</a> and enhances the <strong>parallel</strong> package of base R, e.g. it adds several features you&rsquo;d otherwise expect to see in <strong>parallel</strong>. The <strong>parallelly</strong> package is one of the internal workhorses for the <strong><a href="https://future.futureverse.org">future</a></strong> package, but it can also be used outside of the future ecosystem.</p> <p>In this most recent release, <strong>parallelly</strong> gained several new skills in how cluster nodes (a.k.a. parallel workers) can be managed. Most notably,</p> <ul> <li><p>the <a href="https://parallelly.futureverse.org/reference/isNodeAlive.html"><code>isNodeAlive()</code></a> function can now also query parallel workers running on remote machines. Previously, this was only possible for workers running on the same machine.</p></li> <li><p>the <a href="https://parallelly.futureverse.org/reference/killNode.html"><code>killNode()</code></a> function can now also terminate parallel workers running on remote machines.</p></li> <li><p>the new function <a href="https://parallelly.futureverse.org/reference/cloneNode.html"><code>cloneNode()</code></a> can be used to &ldquo;restart&rdquo; a cluster node, e.g. 
if a node was determined to no longer be alive by <code>isNodeAlive()</code>, then <code>cloneNode()</code> can be called to launch a new parallel worker on the same machine as the previous worker.</p></li> <li><p>The <code>print()</code> functions for PSOCK clusters and PSOCK nodes report on the status of the parallel workers.</p></li> </ul> <h2 id="examples">Examples</h2> <p>Assume we&rsquo;re running a PSOCK cluster of two parallel workers - one running on the local machine and the other on a remote machine that we connect to over SSH. Here is how we can set up such a cluster using <strong>parallelly</strong>:</p> <pre><code class="language-r">library(parallelly) cl &lt;- makeClusterPSOCK(c(&quot;localhost&quot;, &quot;server.remote.org&quot;)) print(cl) # Socket cluster with 2 nodes where 1 node is on host 'server.remote.org' (R # version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu), 1 node is on host # 'localhost' (R version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu) </code></pre> <p>We can check whether these two parallel workers are running, even while they are busy processing parallel tasks. The way <code>isNodeAlive()</code> works is that it checks whether the <em>process</em> is running on the worker&rsquo;s machine, which is something that can be done even when the worker is busy. 
For example, let&rsquo;s check the first worker process that runs on the current machine:</p> <pre><code class="language-r">print(cl[[1]]) ## RichSOCKnode of a socket cluster on local host 'localhost' with pid 2457339 ## (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) using socket connection ## #3 ('&lt;-localhost:11436') isNodeAlive(cl[[1]]) ## [1] TRUE </code></pre> <p>In <strong>parallelly</strong> (&gt;= 1.36.0), we can now also query the remote machine:</p> <pre><code class="language-r">print(cl[[2]]) ## RichSOCKnode of a socket cluster on remote host 'server.remote.org' with ## pid 7731 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) using socket ## connection #4 ('&lt;-localhost:11436') isNodeAlive(cl[[2]]) ## [1] TRUE </code></pre> <p>We can also query <em>all</em> parallel workers of the cluster at once, e.g.</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] TRUE TRUE </code></pre> <p>Now, imagine if, say, the remote parallel process terminates for some unknown reason. For example, the code running in parallel calls some code that causes the parallel R process to crash and terminate. Although this &ldquo;should not&rdquo; happen, we all experience it once in a while. Another example is that the machine runs out of memory, for instance due to other misbehaving processes on the same machine. When that happens, the operating system might start killing processes in order not to completely crash the machine.</p> <p>When one of our parallel workers has crashed, it will obviously not respond to requests for processing our R tasks. 
Instead, we will get obscure errors like:</p> <pre><code class="language-r">y &lt;- parallel::parLapply(cl, X = X, fun = slow_fcn) ## Error in summary.connection(connection) : invalid connection </code></pre> <p>We can see that the second parallel worker in our cluster is no longer alive by:</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] TRUE FALSE </code></pre> <p>We can also see that there is something wrong with one of our workers if we call <code>print()</code> on our <code>RichSOCKcluster</code> and <code>RichSOCKnode</code> objects, e.g.</p> <pre><code class="language-r">print(cl) ## Socket cluster with 2 nodes where 1 node is on host 'server.remote.org' ## (R version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu), 1 node is ## on host 'localhost' (R version 4.3.1 (2023-06-16), platform ## x86_64-pc-linux-gnu). 1 node (#2) has a broken connection (ERROR: ## invalid connection) </code></pre> <p>and</p> <pre><code class="language-r">print(cl[[1]]) ## RichSOCKnode of a socket cluster on local host 'localhost' with pid ## 2457339 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) using ## socket connection #3 ('&lt;-localhost:11436') print(cl[[2]]) ## RichSOCKnode of a socket cluster on remote host 'server.remote.org' ## with pid 7731 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) ## using socket connection #4 ('ERROR: invalid connection') </code></pre> <p>If we end up with a broken parallel worker like this, we can, since <strong>parallelly</strong> 1.36.0, use <code>cloneNode()</code> to re-create the original worker. 
In our example, we can do:</p> <pre><code class="language-r">cl[[2]] &lt;- cloneNode(cl[[2]]) print(cl[[2]]) ## RichSOCKnode of a socket cluster on remote host 'server.remote.org' ## with pid 19808 (R version 4.3.1 (2023-06-16), x86_64-pc-linux-gnu) ## using socket connection #4 ('&lt;-localhost:11436') </code></pre> <p>to get a working parallel cluster, e.g.</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] TRUE TRUE </code></pre> <p>and</p> <pre><code class="language-r">y &lt;- parallel::parLapply(cl, X = X, fun = slow_fcn) str(y) ## List of 8 ## $ : num 1 ## $ : num 1.41 ## $ : num 1.73 </code></pre> <p>We can also use <code>cloneNode()</code> to launch <em>additional</em> workers of the same kind. For example, say we want to launch two more local workers and one more remote worker, and append them to the current cluster. One way to achieve that is:</p> <pre><code class="language-r">cl &lt;- c(cl, cloneNode(cl[c(1,1,2)])) print(cl) ## Socket cluster with 5 nodes where 3 nodes are on host 'localhost' ## (R version 4.3.1 (2023-06-16), platform x86_64-pc-linux-gnu), 2 ## nodes are on host 'server.remote.org' (R version 4.3.1 (2023-06-16), ## platform x86_64-pc-linux-gnu) </code></pre> <p>Now, imagine we launch many heavy parallel tasks, some of which run on remote machines. However, after some time, we realize that we have launched tasks that will take much longer to resolve than we first anticipated. If we don&rsquo;t want to wait for this to resolve by itself, we can choose to terminate some or all of the workers using <code>killNode()</code>. For example,</p> <pre><code class="language-r">killNode(cl) ## [1] TRUE TRUE TRUE TRUE TRUE </code></pre> <p>will kill all parallel workers in our cluster, even if they are busy running tasks. 
We can confirm that these worker processes are no longer alive by calling:</p> <pre><code class="language-r">isNodeAlive(cl) ## [1] FALSE FALSE FALSE FALSE FALSE </code></pre> <p>If we attempted to use the cluster, we&rsquo;d get another obscure error, e.g. &ldquo;Error in unserialize(node$con) : error reading from connection&rdquo;. After having killed our cluster, we can re-launch it using <code>cloneNode()</code>, e.g.</p> <pre><code class="language-r">cl &lt;- cloneNode(cl) isNodeAlive(cl) ## [1] TRUE TRUE TRUE TRUE TRUE </code></pre> <h2 id="the-new-cluster-managing-skills-enhances-the-future-ecosystem">The new cluster-managing skills enhance the future ecosystem</h2> <p>When we use the <a href="https://future.futureverse.org/reference/cluster.html"><code>cluster</code></a> and <a href="https://future.futureverse.org/reference/multisession.html"><code>multisession</code></a> parallel backends of the <strong>future</strong> package, we rely on the <strong>parallelly</strong> package internally. Thanks to these new abilities, the Futureverse can now give more informative error messages whenever we fail to launch a future or when we fail to retrieve the results of one. For example, if a parallel worker has terminated, we might get:</p> <pre><code class="language-r">f &lt;- future(slow_fcn(42)) ## Error: ClusterFuture (&lt;none&gt;) failed to call grmall() on cluster ## RichSOCKnode #1 (PID 29701 on 'server.remote.org'). The reason reported ## was 'error reading from connection'. Post-mortem diagnostic: No process ## exists with this PID on the remote host, i.e. the remote worker is no ## longer alive </code></pre> <p>That post-mortem diagnostic is often enough to realize something quite exceptional has happened. It also gives us enough information to troubleshoot the problem further, e.g. 
if we keep seeing the same problem occurring over and over for a particular machine, it might suggest that there is an issue on that machine and that we want to exclude it from further processing.</p> <p>We could imagine that the <strong>future</strong> package would not only give us information on why things went wrong, but it could theoretically also try to fix the problem automatically. For instance, it could automatically re-create the crashed worker using <code>cloneNode()</code>, and re-launch the future. It is on the roadmap to add such robustness to the future ecosystem later on. However, there are several things to consider when doing so. For instance, what should happen if it was not a glitch, but there is one parallel task that keeps crashing the parallel workers over and over? Most certainly, we want to retry only a fixed number of times before giving up; otherwise we might get stuck in a never-ending procedure. But even so, what if the problematic parallel code brings down the machine where it runs? If we have automatic restart of workers and parallel tasks, we might end up bringing down multiple machines before we notice the problem. So, although it appears fairly straightforward to handle crashed workers automatically, we need to come up with a robust, well-behaving strategy for doing so before we can implement it.</p> <p>I hope you find this useful. 
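<p>To make the retry idea above a bit more concrete, such a strategy could be sketched along these lines, where <code>revive_cluster()</code> is a hypothetical helper (not part of <strong>parallelly</strong>) built on top of <code>isNodeAlive()</code> and <code>cloneNode()</code>:</p> <pre><code class="language-r">library(parallelly) ## Hypothetical helper: re-create dead workers, retrying at most 'max_tries' times revive_cluster &lt;- function(cl, max_tries = 3) { for (kk in seq_len(max_tries)) { ## Treat only a definite FALSE as dead; NA means &quot;could not determine&quot; dead &lt;- isNodeAlive(cl) %in% FALSE if (!any(dead)) return(cl) ## Re-create only the nodes whose processes are gone cl[dead] &lt;- cloneNode(cl[dead]) } cl } </code></pre> <p>Again, this is only a sketch; as discussed above, a production-grade strategy would also need to guard against parallel tasks that repeatedly crash the workers they run on.</p>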
If you have questions or comments on <strong>parallelly</strong>, or the Futureverse in general, please use the <a href="https://github.com/HenrikBengtsson/future/discussions/">Futureverse Discussion forum</a>.</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>Futureverse</strong>: <a href="https://www.futureverse.org">https://www.futureverse.org</a></li> </ul> </description>
</item>
<item>
<title>%dofuture% - a Better foreach() Parallelization Operator than %dopar%</title>
<link>https://www.jottr.org/2023/06/26/dofuture/</link>
<pubDate>Mon, 26 Jun 2023 19:00:00 +0200</pubDate>
<guid>https://www.jottr.org/2023/06/26/dofuture/</guid>
<description> <div style="margin: 2ex; width: 100%;"> <center> <img src="https://www.jottr.org/post/dopar-to-dofuture.png" alt="Two lines of code, where the first line shows 'y <- foreach(...) %dopar% { ... }'. The second line 'y <- foreach(...) %dofuture% { ... }'. The %dopar% operator is crossed out and there is a line down to %dofuture% directly below." style="width: 80%; border: 1px solid black;"/> </center> </div> <p><strong><a href="https://doFuture.futureverse.org">doFuture</a></strong> 1.0.0 has been on CRAN since March 2023. It introduces a new <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> operator <code>%dofuture%</code>, which makes it even easier to use <code>foreach()</code> to parallelize via the <strong>future</strong> ecosystem. This new operator is designed to be an alternative to the existing <code>%dopar%</code> operator for <code>foreach()</code> - an alternative that works in a similar way, but better. If you already use <code>foreach()</code> together with futures, or plan on doing so, I recommend using <code>%dofuture%</code> instead of <code>%dopar%</code>. I&rsquo;ll explain why I think so below.</p> <h2 id="introduction">Introduction</h2> <p>The traditional way to parallelize with <code>foreach()</code> is to use the <code>%dopar%</code> infix operator together with a registered foreach adaptor. The popular <strong><a href="https://cran.r-project.org/package=doParallel">doParallel</a></strong> package provides <code>%dopar%</code> backends for parallelizing on the local machine. 
Here is an example that uses four local workers:</p> <pre><code class="language-r">library(foreach) workers &lt;- parallel::makeCluster(4) doParallel::registerDoParallel(cl = workers) xs &lt;- rnorm(1000) y &lt;- foreach(x = xs, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>I highly recommend the Futureverse for parallelization because of its advantages, such as relaying standard output, messages, warnings, and errors generated on the parallel workers to the main R process, support for near-live progress updates, and more descriptive error messages when a backend fails. Almost from the very beginning of the Futureverse, you have been able to use futures with <code>foreach()</code> and <code>%dopar%</code> via the <strong>doFuture</strong> package. For instance, we can rewrite the above example to use futures as:</p> <pre><code class="language-r">library(foreach) doFuture::registerDoFuture() future::plan(multisession, workers = 4) xs &lt;- rnorm(1000) y &lt;- foreach(x = xs, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>In this blog post, I am proposing to move to</p> <pre><code class="language-r">library(foreach) future::plan(multisession, workers = 4) xs &lt;- rnorm(1000) y &lt;- foreach(x = xs, .export = &quot;slow_fcn&quot;) %dofuture% { slow_fcn(x) } </code></pre> <p>instead. So, why is that better? It is because:</p> <ol> <li><p><code>%dofuture%</code> removes the need to register a foreach backend, i.e. no more <code>registerDoMC()</code>, <code>registerDoParallel()</code>, <code>registerDoFuture()</code>, etc.</p></li> <li><p><code>%dofuture%</code> is unaffected by any foreach backends that the end-user has registered.</p></li> <li><p><code>%dofuture%</code> uses a consistent <code>foreach()</code> &ldquo;options&rdquo; argument, regardless of parallel backend used, and <em>not</em> different ones for different backends, e.g. 
<code>.options.multicore</code>, <code>.options.snow</code>, and <code>.options.mpi</code>.</p></li> <li><p><code>%dofuture%</code> is guaranteed to always parallelize via the Futureverse, using whatever <code>plan()</code> the end-user has specified. It also means that you, as a developer, have full control of the parallelization code.</p></li> <li><p><code>%dofuture%</code> provides proper parallel random number generation (RNG). There is no longer a need to use <code>%dorng%</code> of the <strong><a href="https://cran.r-project.org/package=doRNG">doRNG</a></strong> package.</p></li> <li><p><code>%dofuture%</code> automatically identifies global variables and packages that are needed by the parallel workers.</p></li> <li><p><code>%dofuture%</code> relays errors generated in parallel as-is such that they can be handled using standard R methods, e.g. <code>tryCatch()</code>.</p></li> <li><p><code>%dofuture%</code> relays standard output, messages, warnings, and other types of conditions generated in parallel as-is such that they can be handled using standard R methods, e.g. <code>capture.output()</code> and <code>withCallingHandlers()</code>.</p></li> <li><p><code>%dofuture%</code> supports near-live progress updates via the <strong><a href="https://progressr.futureverse.org">progressr</a></strong> package.</p></li> <li><p><code>%dofuture%</code> gives more informative error messages, which helps troubleshooting, if a parallel worker crashes.</p></li> </ol> <p>Below are the details.</p> <h2 id="problems-of-dopar-that-dofuture-addresses">Problems of <code>%dopar%</code> that <code>%dofuture%</code> addresses</h2> <p>Let me discuss a few of the unfortunate drawbacks that come with <code>%dopar%</code>. Most of these stem from a slightly too lax design. 
Although convenient, the flexible design prevents us from having full control and writing code that can parallelize on any parallel backend.</p> <h3 id="problem-1-dopar-requires-registering-a-foreach-adaptor">Problem 1. <code>%dopar%</code> requires registering a foreach adaptor</h3> <p>If we write code that others will use, say, an R package, then we can never know what compute resources the user has, or will have in the future. Traditionally, this means that one user might want to use <strong>doParallel</strong> for parallelization, another <strong>doMC</strong>, and yet another, maybe, <strong>doRedis</strong>. Because of this, we must not have any calls to one of the many <code>registerDoNnn()</code> functions in our code. If we do, we lock users into a specific parallel backend. We could of course support a few different backends, but we are still locking users into a small set of parallel backends. If someone develops a new backend in the future, our code has to be updated before users can take advantage of the new backends.</p> <p>One can argue that <code>doFuture::registerDoFuture()</code> somewhat addresses this problem. On one hand, when used, it does lock the user into the future framework. On the other hand, the user has many parallel backends to choose from in the Futureverse, including backends that will be developed in the future. In this sense, the lock-in is less severe, especially since we do not have to update our code for new backends to be supported. 
Also, to avoid destructive side effects, <code>registerDoFuture()</code> allows you to change the foreach backend used inside your functions temporarily, e.g.</p> <pre><code class="language-r">## Temporarily use futures oldDoPar &lt;- registerDoFuture() on.exit(with(oldDoPar, foreach::setDoPar(fun=fun, data=data, info=info)), add = TRUE) </code></pre> <p>This avoids changing the foreach backend that the user might already have set elsewhere.</p> <p>That said, I never wanted to say that people <em>should use</em> <code>registerDoFuture()</code> whenever using <code>%dopar%</code>, because I think that would be against the philosophy behind the <strong>foreach</strong> framework. The <strong>foreach</strong> ecosystem is designed to separate the <code>foreach()</code> + <code>%dopar%</code> code, describing what to parallelize, from the <code>registerDoNnn()</code> call, describing how and where to parallelize.</p> <p>Using <code>%dofuture%</code>, instead of <code>%dopar%</code> with a user-controlled foreach backend, avoids this dilemma. With <code>%dofuture%</code> the developer is in full control of the parallelization code.</p> <h3 id="problem-2-chunking-and-load-balancing-differ-among-foreach-backends">Problem 2. Chunking and load-balancing differ among foreach backends</h3> <p>When using parallel map-reduce functions such as <code>mclapply()</code>, <code>parLapply()</code> of the <strong>parallel</strong> package, or <code>foreach()</code> with <code>%dopar%</code>, the tasks are partitioned into subsets and distributed to the parallel workers for processing. This partitioning is often referred to as &ldquo;chunking&rdquo;, because we chunk up the elements into smaller chunks, and then each chunk is processed by one parallel worker. There are different strategies to chunk up the elements. One approach is to use uniformly sized chunks and have each worker process one chunk. 
Another approach is to use chunks with a single element, and have each worker process one or more chunks.</p> <p>The chunks may be pre-assigned (&ldquo;prescheduled&rdquo;) to the parallel workers up-front, which is referred to as <em>static load balancing</em>. An alternative is to assign chunks to workers on-the-fly as the workers become available, which is referred to as <em>dynamic load balancing</em>.</p> <p>If the processing times differ a lot between elements, it is beneficial to use dynamic load balancing together with small chunk sizes.</p> <p>However, if we dig into the documentation and source code of the different foreach backends, we will find that they use different chunking and load-balancing strategies. For example, assume we are running on a Linux machine, which supports forked processing. Then, if we use</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(cores = 8) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>the data will be processed by eight fork-based parallel workers using <em>dynamic load balancing with single-element chunks</em>. However, if we use PSOCK clusters:</p> <pre><code class="language-r">library(foreach) cl &lt;- parallel::makeCluster(8) doParallel::registerDoParallel(cl = cl) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;) %dopar% { slow_fcn(x) } </code></pre> <p>the data will be processed by eight PSOCK-based parallel workers using <em>static load balancing with uniformly sized chunks</em>.</p> <p>Which of these two chunking and load-balancing strategies is the most efficient one depends on how much the processing time of <code>slow_fcn(x)</code> varies with different values of <code>x</code>. 
For example, and without going into details, if the processing times differ a lot, dynamic load balancing often makes better use of the parallel workers and results in a shorter overall processing time.</p> <p>Regardless of which is faster, the problem with different foreach backends using different strategies is that, as a developer with little control over the registered foreach backend, you have equally poor control over the chunking and load-balancing strategies.</p> <p>Using <code>%dofuture%</code> avoids this problem. If you use <code>%dofuture%</code>, then dynamic load balancing will always be used for processing the data, regardless of which parallel future backend is in place, with the option to control the chunk size. As a side note, <code>%dopar%</code> with <code>registerDoFuture()</code> will also do this.</p> <h3 id="problem-3-different-foreach-backends-use-different-foreach-options">Problem 3. Different foreach backends use different <code>foreach()</code> options</h3> <p>In the previous section, I did not mention that for some foreach backends it is indeed possible to control whether static or dynamic load balancing should be used, and what the chunk sizes should be. This can be controlled by special <code>.options.*</code> arguments for <code>foreach()</code>. However, each foreach backend has its own <code>.options.*</code> argument, e.g. you might find that some use <code>.options.multicore</code>, others <code>.options.snow</code>, or something else. 
Because they are different, we cannot write code that works with any type of foreach backend.</p> <p>To give two examples, when using <strong>doParallel</strong> and <code>registerDoParallel(cores = 8)</code>, we can replace the default dynamic load balancing with static load balancing as:</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(cores = 8) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;, .options.multicore = list(preschedule = TRUE)) %dopar% { slow_fcn(x) } </code></pre> <p>This change will also switch from chunks with a single element to (eight) chunks with similar size.</p> <p>If we instead would use <code>registerDoParallel(cl)</code>, which gives us the reverse situation, we can replace the static load balancing with dynamic load balancing by using:</p> <pre><code class="language-r">library(foreach) cl &lt;- parallel::makeCluster(8) doParallel::registerDoParallel(cl = cl) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;, .options.snow = list(preschedule = FALSE)) %dopar% { slow_fcn(x) } </code></pre> <p>This will also switch from uniformly sized chunks to single-element chunks.</p> <p>As we can see, the fact that we have to use different <code>foreach()</code> &ldquo;options&rdquo; arguments (here <code>.options.multicore</code> and <code>.options.snow</code>) for different foreach backends prevents us from writing code that works with any foreach backend.</p> <p>Of course, we could specify &ldquo;options&rdquo; arguments for known foreach backends and hope we haven&rsquo;t missed any and that no new ones show up later, e.g.</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(cores = 8) y &lt;- foreach(x = X, .export = &quot;slow_fcn&quot;, .options.multicore = list(preschedule = TRUE), .options.snow = list(preschedule = TRUE), .options.future = list(preschedule = TRUE), .options.mpi = list(chunkSize = 1) ) %dopar% { slow_fcn(x) } </code></pre> <p>Regardless, this 
still limits the end-user to a set of commonly used foreach backends, and our code can never be agile to foreach backends that are developed at a later time.</p> <p>Using <code>%dofuture%</code> avoids these problems. It supports argument <code>.options.future</code> in a consistent way across all future backends, which means that your code will be the same regardless of parallel backend. By the core design of the Futureverse, any new future backends developed later on will automatically work with your <strong>foreach</strong> code if you use <code>%dofuture%</code>.</p> <h3 id="problem-4-global-variables-are-not-always-identified-by-foreach">Problem 4. Global variables are not always identified by <code>foreach()</code></h3> <p>When parallelizing code, the parallel workers must have access to all functions and variables required to evaluate the parallel code. As we have seen in the above examples, you can use the <code>.export</code> argument to help <code>foreach()</code> export the necessary objects to each of the parallel workers.</p> <p>However, a developer who uses <code>doMC::registerDoMC()</code>, or equivalently <code>doParallel::registerDoParallel(cores)</code>, might forget to specify the <code>.export</code> argument. This can happen because the mechanics of forked processing make all objects available to the parallel workers. If they test their code using only these foreach backends, they will not notice that <code>.export</code> is not declared. The same may happen if the developer assumes <code>doFuture::registerDoFuture()</code> is used. However, without specifying <code>.export</code>, the code will <em>not</em> work on other types of foreach backends, e.g. <code>doParallel::registerDoParallel(cl)</code> and <code>doMPI::registerDoMPI()</code>. 
If an R package forgets to specify the <code>.export</code> argument, and is not comprehensively tested, then it will be the end-user, for instance on MS Windows, that runs into the bug.</p> <p>When using <code>%dofuture%</code>, global variables and required packages are by default automatically identified and exported to the parallel workers by the future framework. This is done the same way regardless of parallel backend.</p> <h3 id="problem-5-easy-to-forget-parallel-random-number-generation">Problem 5. Easy to forget parallel random number generation</h3> <p>The <strong>foreach</strong> package and <code>%dopar%</code> do not have built-in support for parallel random number generation (RNG). Statistically sound parallel RNG is critical for many statistical analyses. If not used, then the results can be biased and incorrect conclusions might be drawn. The <strong><a href="https://cran.r-project.org/package=doRNG">doRNG</a></strong> package comes to the rescue when using <code>%dopar%</code>. It provides the operator <code>%dorng%</code>, which will use <code>%dopar%</code> internally while automatically setting up parallel RNG. Whenever you use <code>%dopar%</code> and find yourself needing parallel RNG, I recommend simply replacing <code>%dopar%</code> with <code>%dorng%</code>. The <strong>doRNG</strong> package also provides <code>registerDoRNG()</code>, which I do not recommend, because as a developer you do not have full control over whether that is registered or not.</p> <p>Because <strong>foreach</strong> does not have built-in support for parallel RNG, it is easy to forget that it should be used. A developer who is aware of the importance of using proper parallel RNG will find out about <strong>doRNG</strong> and how to best use it, but a developer who is not aware of the problem can easily miss it and publish an R package that produces potentially incorrect results.</p> <p>The future framework, however, will detect if we forget to use parallel RNG. 
When this happens, a warning will alert us to the problem and suggest how to fix it. This is the case if you use <code>doFuture::registerDoFuture()</code>, and it&rsquo;s also the case when using <code>%dofuture%</code>. For example,</p> <pre><code class="language-r">library(doFuture) plan(multisession, workers = 3) y &lt;- foreach(ii = 1:4) %dofuture% { runif(ii) } </code></pre> <p>produces</p> <pre><code>Warning messages: 1: UNRELIABLE VALUE: Iteration 1 of the foreach() %dofuture% { ... }, part of chunk #1 ('doFuture2-1'), unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify foreach() argument '.options.future = list(seed = TRUE)'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, set option 'doFuture.rng.onMisuse' to &quot;ignore&quot;. </code></pre> <p>To fix this, we can specify <code>foreach()</code> argument <code>.options.future = list(seed = TRUE)</code> to declare that we need to draw random numbers in parallel, i.e.</p> <pre><code class="language-r">library(doFuture) plan(multisession, workers = 3) y &lt;- foreach(ii = 1:4, .options.future = list(seed = TRUE)) %dofuture% { runif(ii) } </code></pre> <p>This makes sure that statistically sound random numbers are generated.</p> <h2 id="migrating-from-dopar-to-dofuture-is-straightforward">Migrating from %dopar% to %dofuture% is straightforward</h2> <p>If you already have code that uses <code>%dopar%</code> and want to start using <code>%dofuture%</code> instead, then it only takes a few changes, which are all straightforward and quick:</p> <ol> <li><p>Replace <code>%dopar%</code> with <code>%dofuture%</code>.</p></li> <li><p>Replace <code>%dorng%</code> with <code>%dofuture%</code> and set <code>.options.future = list(seed = TRUE)</code>.</p></li> <li><p>Replace <code>.export = &lt;character vector of 
global variables&gt;</code> with <code>.options.future = list(globals = &lt;character vector of global variables&gt;)</code>.</p></li> <li><p>Drop any other <code>registerDoNnn()</code> calls inside your function, if you use them.</p></li> <li><p>Update your documentation to mention that the parallel backend should be set using <code>future::plan()</code> and no longer via different <code>registerDoNnn()</code> calls.</p></li> </ol> <p>In brief, if you use <code>%dofuture%</code> instead of <code>%dopar%</code>, your life as a developer will be easier, and so will the end-user&rsquo;s.</p> <p>If you have questions or comments on <strong>doFuture</strong> and <code>%dofuture%</code>, or the Futureverse in general, please use the <a href="https://github.com/HenrikBengtsson/future/discussions/">Futureverse Discussion forum</a>.</p> <p>Happy futuring!</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a>, <a href="https://doFuture.futureverse.org">pkgdown</a></li> <li><strong>Futureverse</strong>: <a href="https://www.futureverse.org">https://www.futureverse.org</a></li> </ul> </description>
</item>
<item>
<title>Edmonton R User Group Meetup: Futureverse - A Unifying Parallelization Framework in R for Everyone</title>
<link>https://www.jottr.org/2023/05/22/future-yegrug-2023-slides/</link>
<pubDate>Mon, 22 May 2023 18:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2023/05/22/future-yegrug-2023-slides/</guid>
<description> <div style="margin: 2ex; width: 100%;"> <center> <img src="https://www.jottr.org/post/YEGRUG_20230522.jpeg" alt="The YEGRUG poster slide for the Futureverse presentation on 2023-05-22" style="width: 80%; border: 1px solid black;"/> </center> </div> <p>Below are the slides from my presentation at the <a href="https://www.meetup.com/edmonton-r-user-group-yegrug/events/fxvdbtyfchbhc/">Edmonton R User Group Meetup (YEGRUG)</a> on May 22, 2023:</p> <p>Title: Futureverse - A Unifying Parallelization Framework in R for Everyone<br /> Speaker: Henrik Bengtsson<br /> Slides: <a href="https://docs.google.com/presentation/d/e/2PACX-1vQfbnVRHZhIkEAd3_pNG14N5JQqE0jqCohSq-m-uWAcA7StF-BuHdOz0IGDhcRI3K681DxoXoqA7pwp/pub?start=true&amp;loop=false&amp;delayms=60000">HTML</a>, <a href="https://www.jottr.org/presentations/yegrug2023/BengtssonH_20230522-Futureverse-YEGRUG.pdf">PDF</a> (46 slides)<br /> Video: <a href="https://www.youtube.com/watch?v=6Dp6zMelrmg">official recording</a> (~60 minutes)</p> <p>Thank you Péter Sólymos and the YEGRUG for the invitation and the opportunity!</p> <p>/Henrik</p> <h2 id="links">Links</h2> <ul> <li>YEGRUG: <a href="https://yegrug.github.io/">https://yegrug.github.io/</a></li> <li><strong>Futureverse</strong> website: <a href="https://www.futureverse.org/">https://www.futureverse.org/</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org/">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>parallelly 1.34.0: Support for CGroups v2, Killing Parallel Workers, and more</title>
<link>https://www.jottr.org/2023/01/18/parallelly-1.34.0-support-for-cgroups-v2-killing-parallel-workers-and-more/</link>
<pubDate>Wed, 18 Jan 2023 14:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2023/01/18/parallelly-1.34.0-support-for-cgroups-v2-killing-parallel-workers-and-more/</guid>
<description> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p>With the recent releases of <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.33.0 (2022-12-13) and 1.34.0 (2023-01-13), <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> and <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a> gained better support for Linux CGroups, options for avoiding running out of R connections when setting up <strong>parallel</strong>-style clusters, and <code>killNode()</code> for forcefully terminating one or more parallel workers. I summarize these updates below. For other updates, please see the <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <h2 id="added-support-for-cgroups-v2">Added support for CGroups v2</h2> <p><a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> and <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a> gained support for Linux Control Groups v2 (CGroups v2), besides CGroups v1, which has been supported since <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.31.0 (2022-04-07) and partially since 1.22.0 (2020-12-12). This means that if you use <code>availableCores()</code> and <code>availableWorkers()</code> in your R code, they will better respect the number of CPU cores that the Linux system has made available to you. 
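For example, inside a Linux container that has been restricted to two CPU cores (the numbers below are illustrative), you might see:</p> <pre><code class="language-r">parallel::detectCores()      ## e.g. 64 - all cores on the host machine
parallelly::availableCores() ## e.g. 2  - respects the CGroups CPU limit
</code></pre> <p>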
Not all systems use CGroups, but it is becoming more popular, so if the Linux system you run on does not use it right now, it is likely it will at some point.</p> <h2 id="avoid-running-out-of-r-connections">Avoid running out of R connections</h2> <p>If you run parallel code on a machine with many CPU cores, there&rsquo;s a risk that you run out of available R connections, which are needed when setting up <strong>parallel</strong> cluster nodes. This is because R has a limit of at most 125 connections being open at the same time(*) and each cluster node consumes one R connection. If you try to set up more parallel workers than this, you will get an error. The <strong>parallelly</strong> package already has built-in protection against this, e.g.</p> <pre><code class="language-r">&gt; cl &lt;- parallelly::makeClusterPSOCK(192) Error: Cannot create 192 parallel PSOCK nodes. Each node needs one connection, but there are only 124 connections left out of the maximum 128 available on this R installation </code></pre> <p>This error is <em>instant</em>, with no parallel workers being launched. In contrast, if you use <strong>parallel</strong>, you will only get an error after R has launched the first 124 cluster nodes and fails to launch the 125th one, e.g.</p> <pre><code class="language-r">&gt; cl &lt;- parallel::makePSOCKcluster(192) Error in socketAccept(socket = socket, blocking = TRUE, open = &quot;a+b&quot;, : all connections are in use </code></pre> <p>Now, assume you use:</p> <pre><code class="language-r">&gt; library(parallelly) &gt; nworkers &lt;- availableCores() &gt; cl &lt;- makeClusterPSOCK(nworkers) </code></pre> <p>to set up a maximum-sized cluster on the current machine. This works as long as <code>availableCores()</code> returns something less than 125. However, if you are on a machine with, say, 192 CPU cores, you will get the above error. 
You could do something like:</p> <pre><code class="language-r">&gt; nworkers &lt;- availableCores() &gt; nworkers &lt;- min(nworkers, 125L) </code></pre> <p>to work around this problem. Or, if you want to be more agile to what R supports, you could do:</p> <pre><code class="language-r">&gt; nworkers &lt;- availableCores() &gt; nworkers &lt;- min(nworkers, freeConnections()) </code></pre> <p>With the latest versions of <strong>parallelly</strong>, you can simplify this to:</p> <pre><code class="language-r">&gt; nworkers &lt;- availableCores(constraints = &quot;connections&quot;) </code></pre> <p>The <code>availableWorkers()</code> function also supports <code>constraints = &quot;connections&quot;</code>.</p> <p>(*) The only way to increase this limit is to change the R source code and build R from source, cf. <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>freeConnections()</code></a>.</p> <h2 id="forcefully-terminate-psock-cluster-nodes">Forcefully terminate PSOCK cluster nodes</h2> <p>The <code>parallel::stopCluster()</code> function should be used to stop a parallel cluster. This works by asking the cluster nodes to shut themselves down. However, a parallel worker will only shut down this way when it receives the message, which can only happen when the worker is done processing any parallel tasks. So, if a worker runs a very long-running task, which can take minutes, hours, or even days, it will not shut down until after that completes.</p> <p>Until now, we had to turn to special operating-system tools to kill the R process for that cluster worker. With <strong>parallelly</strong> 1.33.0, you can now use <code>killNode()</code> to kill any parallel worker that runs on the local machine and that was launched by <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a>. 
For example,</p> <pre><code class="language-r">&gt; library(parallelly) &gt; cl &lt;- makeClusterPSOCK(10) &gt; cl Socket cluster with 10 nodes where 10 nodes are on host 'localhost' (R version 4.2.2 (2022-10-31), platform x86_64-pc-linux-gnu) &gt; which(isNodeAlive(cl)) [1] 1 2 3 4 5 6 7 8 9 10 &gt; success &lt;- killNode(cl[1:3]) &gt; success [1] TRUE TRUE TRUE &gt; which(isNodeAlive(cl)) [1] 4 5 6 7 8 9 10 &gt; cl &lt;- cl[isNodeAlive(cl)] &gt; cl Socket cluster with 7 nodes where 7 nodes are on host 'localhost' (R version 4.2.2 (2022-10-31), platform x86_64-pc-linux-gnu) </code></pre> <p>Over and out,</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>progressr 0.13.0: cli + progressr = ♥</title>
<link>https://www.jottr.org/2023/01/10/progressr-0.13.0/</link>
<pubDate>Tue, 10 Jan 2023 19:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2023/01/10/progressr-0.13.0/</guid>
<description> <p><strong><a href="https://progressr.futureverse.org">progressr</a></strong> 0.13.0 is on CRAN. In the recent releases, <strong>progressr</strong> gained support for using <strong><a href="https://cli.r-lib.org/">cli</a></strong> to generate progress bars. Vice versa, <strong>cli</strong> can now report on progress via the <strong>progressr</strong> framework. Here are the details. For other updates to <strong>progressr</strong>, see <a href="https://progressr.futureverse.org/news/index.html">NEWS</a>.</p> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/three_in_chinese.gif" alt="Three strokes writing three in Chinese"/> </center> </div> <p>The <strong>progressr</strong> package, part of the <a href="https://www.futureverse.org">futureverse</a>, provides a minimal API for reporting progress updates in R. The design is to separate the representation of progress updates from how they are presented. What type of progress to signal is controlled by the developer. How these progress updates are rendered is controlled by the end user. For instance, some users may prefer visual feedback, such as a horizontal progress bar in the terminal, whereas others may prefer auditory feedback. The <strong>progressr</strong> package also works when R code runs in parallel or distributed using the <strong><a href="https://future.futureverse.org">future</a></strong> framework.</p> <h2 id="use-cli-progress-bars-for-progressr-reporting">Use &lsquo;cli&rsquo; progress bars for &lsquo;progressr&rsquo; reporting</h2> <p>In <strong>progressr</strong> (&gt;= 0.12.0) [2022-12-13], you can report on progress using <strong>cli</strong> progress bars. To do this, just set:</p> <pre><code class="language-r">progressr::handlers(global = TRUE) ## automatically report on progress progressr::handlers(&quot;cli&quot;) ## ... using a 'cli' progress bar </code></pre> <p>With these global settings (e.g. 
in your <code>~/.Rprofile</code> file; see below), R reports progress as:</p> <pre><code class="language-r">library(progressr) y &lt;- slow_sum(1:10) </code></pre> <p><img src="https://www.jottr.org/post/handler_cli-default-slow_sum.svg" alt="Animation of a one-line, green-blocks cli progress bar in the terminal growing from 0% to 100% with an ETA estimate at the end" /></p> <p>You can customize these just like regular <strong>cli</strong> progress bars. For instance, if you use the following from one of the <strong>cli</strong> examples:</p> <pre><code class="language-r">options(cli.progress_bar_style = list( complete = cli::col_yellow(&quot;\u2605&quot;), incomplete = cli::col_grey(&quot;\u00b7&quot;) )) </code></pre> <p>you&rsquo;ll get:</p> <p><img src="https://www.jottr.org/post/handler_cli-default-slow_sum-yellow-starts.svg" alt="Animation of a one-line, yellow-stars cli progress bar in the terminal growing from 0% to 100% with an ETA estimate at the end" /></p> <h2 id="configure-cli-to-report-progress-via-progressr">Configure &lsquo;cli&rsquo; to Report Progress via &lsquo;progressr&rsquo;</h2> <p>You might have heard that <strong><a href="https://purrr.tidyverse.org/">purrr</a></strong> recently gained support for reporting on progress. If you didn&rsquo;t, you can read about it in the tidyverse blog post &lsquo;<a href="https://www.tidyverse.org/blog/2022/12/purrr-1-0-0/#progress-bars">purrr 1.0.0</a>&rsquo; on 2022-12-20. The gist is to pass <code>.progress = TRUE</code> to the <strong>purrr</strong> function of interest, and it&rsquo;ll show a progress bar while it runs. 
For example, assume we have the following slow function for calculating the square root:</p> <pre><code class="language-r">slow_sqrt &lt;- function(x) { Sys.sleep(0.1); sqrt(x) } </code></pre> <p>If we call</p> <pre><code class="language-r">y &lt;- purrr::map(1:30, slow_sqrt, .progress = TRUE) </code></pre> <p>we&rsquo;ll see a progress bar appearing after about two seconds:</p> <p><img src="https://www.jottr.org/post/handler_cli-default.svg" alt="Animation of a one-line, green-blocks cli progress bar in the terminal growing from 0% to 100% with an ETA estimate at the end" /></p> <p>This progress bar is produced by the <strong>cli</strong> package. Now, the neat thing with the <strong>cli</strong> package is that you can tell it to pass on the progress reporting to another progress framework, including that of the <strong>progressr</strong> package. To do this, set the R option:</p> <pre><code class="language-r">options(cli.progress_handlers = &quot;progressr&quot;) </code></pre> <p>This causes <em>all</em> <strong>cli</strong> progress updates to be reported via <strong>progressr</strong>, so if you, for instance, already have set:</p> <pre><code class="language-r">progressr::handlers(global = TRUE) red_heart &lt;- cli::col_red(cli::symbol$heart) handlers(handler_txtprogressbar(char = red_heart)) </code></pre> <p>the above <code>purrr::map()</code> call will report on progress in the terminal using a classical R progress bar tweaked to use red hearts to fill the bar:</p> <p><img src="https://www.jottr.org/post/handler_txtprogressbar-custom-hearts.svg" alt="Animation of a one-line, text-based red-hearts progress bar in the terminal growing from 0% to 100%" /></p> <p>As another example, if you set:</p> <pre><code class="language-r">progressr::handlers(global = TRUE) progressr::handlers(c(&quot;beepr&quot;, &quot;cli&quot;, &quot;rstudio&quot;)) </code></pre> <p>R will report progress <em>concurrently</em> via audio using different <strong><a 
href="https://cran.r-project.org/package=beepr">beepr</a></strong> sounds, via the terminal as a <strong>cli</strong> progress bar, and the RStudio&rsquo;s built-in progress bar - whenever progress is reported via the <strong>progressr</strong> framework <em>or</em> the <strong>cli</strong> framework.</p> <h2 id="customize-progress-reporting-when-r-starts">Customize progress reporting when R starts</h2> <p>To safely configure the above for all your <em>interactive</em> R sessions, I recommend adding something like the following to your <code>~/.Rprofile</code> file (or in a standalone file using the <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> package):</p> <pre><code class="language-r">if (interactive() &amp;&amp; requireNamespace(&quot;progressr&quot;, quietly = TRUE)) { ## progressr reporting without need for with_progress() progressr::handlers(global = TRUE) ## Use 'cli', if installed ... if (requireNamespace(&quot;cli&quot;, quietly = TRUE)) { progressr::handlers(&quot;cli&quot;) ## Hand over all 'cli' progress reporting to 'progressr' options(cli.progress_handlers = &quot;progressr&quot;) } else { ## ... otherwise use the one that comes with R progressr::handlers(&quot;txtprogressbar&quot;) } ## Use 'beepr', if installed ... 
if (requireNamespace(&quot;beepr&quot;, quietly = TRUE)) { progressr::handlers(&quot;beepr&quot;, append = TRUE) } ## Reporting via RStudio, if running in the RStudio Console, ## but not the terminal if ((Sys.getenv(&quot;RSTUDIO&quot;) == &quot;1&quot;) &amp;&amp; !nzchar(Sys.getenv(&quot;RSTUDIO_TERM&quot;))) { progressr::handlers(&quot;rstudio&quot;, append = TRUE) } } </code></pre> <p>See the <strong><a href="https://progressr.futureverse.org">progressr</a></strong> website for other, additional ways of reporting on progress.</p> <p>Now, go make some progress!</p> <h2 id="other-posts-on-progressr-reporting">Other posts on progressr reporting</h2> <ul> <li><a href="https://www.jottr.org/2022/06/03/progressr-0.10.1/">progressr 0.10.1: Plyr Now Supports Progress Updates also in Parallel</a>, 2022-06-03</li> <li><a href="https://www.jottr.org/2021/06/11/progressr-0.8.0/">progressr 0.8.0 - RStudio&rsquo;s Progress Bar, Shiny Progress Updates, and Absolute Progress</a>, 2021-06-11</li> <li><a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 Slides on Progressr</a>, 2020-07-04</li> <li>See also <a href="https://www.jottr.org/tags/#progressr-list">&lsquo;progressr&rsquo;</a> tag.</li> </ul> <h2 id="links">Links</h2> <ul> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> <li><strong>cli</strong> package: <a href="https://cran.r-project.org/package=cli">CRAN</a>, <a href="https://github.com/r-lib/cli">GitHub</a>, <a href="https://cli.r-lib.org/">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>Please Avoid detectCores() in your R Packages</title>
<link>https://www.jottr.org/2022/12/05/avoid-detectcores/</link>
<pubDate>Mon, 05 Dec 2022 21:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2022/12/05/avoid-detectcores/</guid>
<description> <p>The <code>detectCores()</code> function of the <strong>parallel</strong> package is probably one of the most used functions when it comes to setting the number of parallel workers to use in R. In this blog post, I&rsquo;ll try to explain why using it is not always a good idea. Already now, I am going to make a bold request and ask you to:</p> <blockquote> <p>Please <em>avoid</em> using <code>parallel::detectCores()</code> in your package!</p> </blockquote> <p>By reading this blog post, I hope you become more aware of the different problems that arise from using <code>detectCores()</code> and how they might affect you and the users of your code.</p> <figure style="margin-top: 3ex;"> <img src="https://www.jottr.org/post/detectCores_bad_vs_good.png" alt="Screenshots of two terminal-based, colored graphs each showing near 100% load on all 24 CPU cores. The load bars to the left are mostly red, whereas the ones to the right are mostly green. There is a shrug emoji, with the text &quot;do you want this?&quot; pointing to the left and the text &quot;or that?&quot; pointing to the right, located in between the two graphs." style="width: 100%; margin: 0; margin-bottom: 2ex;"/> <figcaption style="font-style: italic"> Figure&nbsp;1: Using <code>detectCores()</code> risks overloading the machine where R runs, even more so if there are other things already running. The machine seen at the left is heavily loaded, because too many parallel processes compete for the 24 CPU cores available, which results in an extensive amount of kernel context switching (red), which wastes precious CPU cycles. The machine to the right is near-perfectly loaded at 100%, where no process uses more CPU than it is allowed to (mostly green). 
</figcaption> </figure> <h2 id="tl-dr">TL;DR</h2> <p>If you don&rsquo;t have time to read everything, but will take my word that we should avoid <code>detectCores()</code>, then the quick summary is that you basically have two choices for the number of parallel workers to use by default:</p> <ol> <li><p>Have your code run with a single core by default (i.e. sequentially), or</p></li> <li><p>replace all <code>parallel::detectCores()</code> with <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>parallelly::availableCores()</code></a>.</p></li> </ol> <p>I&rsquo;m in the conservative camp and recommend the first alternative. Using sequential processing by default, where the user has to make an explicit choice to run in parallel, significantly lowers the risk of clogging up the CPUs (left panel in Figure&nbsp;1), especially when there are other things running on the same machine.</p> <p>The second alternative is useful if you&rsquo;re not ready to make the move to run sequentially by default. The <code>availableCores()</code> function of the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package is fully backward compatible with <code>detectCores()</code>, while it avoids the most common problems that come with <code>detectCores()</code>, plus it is agile to a lot more CPU-related settings, including settings that the end-user, the systems administrator, job schedulers and Linux containers control. It is designed to take care of common overuse issues so that you do not have to spend time worrying about them.</p> <h2 id="background">Background</h2> <p>There are several problems with using <code>detectCores()</code> from the <strong>parallel</strong> package for deciding how many parallel workers to use. But before we get there, I want you to know that we find this function commonly used in R scripts and R packages, and frequently suggested in tutorials. 
So, do not feel ashamed if you use it.</p> <p>If we scan the code of the R packages on CRAN (e.g. by <a href="https://github.com/search?q=org%3Acran+language%3Ar+%22detectCores%28%29%22&amp;type=code">searching GitHub</a><sup>1</sup>), or on Bioconductor (e.g. by <a href="https://code.bioconductor.org/search/search?q=detectCores%28%29)">searching Bioc::CodeSearch</a>) we find many cases where <code>detectCores()</code> is used. Here are some variants we see in the wild:</p> <pre><code class="language-r">cl &lt;- makeCluster(detectCores()) cl &lt;- makeCluster(detectCores() - 1) y &lt;- mclapply(..., mc.cores = detectCores()) registerDoParallel(detectCores()) </code></pre> <p>We also find functions that let the user choose the number of workers via some argument, which defaults to <code>detectCores()</code>. Sometimes the default is explicit, as in:</p> <pre><code class="language-r">fast_fcn &lt;- function(x, ncores = parallel::detectCores()) { if (ncores &gt; 1) { cl &lt;- makeCluster(ncores) ... } } </code></pre> <p>and sometimes it&rsquo;s implicit, as in:</p> <pre><code class="language-r">fast_fcn &lt;- function(x, ncores = NULL) { if (is.null(ncores)) ncores &lt;- parallel::detectCores() - 1 if (ncores &gt; 1) { cl &lt;- makeCluster(ncores) ... 
} } </code></pre> <p>As we will see next, all the above examples are potentially buggy and might result in run-time errors.</p> <h2 id="common-mistakes-when-using-detectcores">Common mistakes when using detectCores()</h2> <h3 id="issue-1-detectcores-may-return-a-missing-value">Issue 1: detectCores() may return a missing value</h3> <p>A small, but important detail about <code>detectCores()</code> that is often missed is the following section in <code>help(&quot;detectCores&quot;, package = &quot;parallel&quot;)</code>:</p> <blockquote> <p><strong>Value</strong></p> <p>An integer, <strong>NA if the answer is unknown</strong>.</p> </blockquote> <p>Because of this, we cannot rely on:</p> <pre><code class="language-r">ncores &lt;- detectCores() </code></pre> <p>to always work, i.e. we might end up with errors like:</p> <pre><code class="language-r">ncores &lt;- detectCores() workers &lt;- parallel::makeCluster(ncores) Error in makePSOCKcluster(names = spec, ...) : numeric 'names' must be &gt;= 1 </code></pre> <p>We need to account for this, especially as package developers. 
One way to handle it is simply by using:</p> <pre><code class="language-r">ncores &lt;- detectCores() if (is.na(ncores)) ncores &lt;- 1L </code></pre> <p>or, by using the following shorter, but also harder to understand, one-liner:</p> <pre><code class="language-r">ncores &lt;- max(1L, detectCores(), na.rm = TRUE) </code></pre> <p>This construct is guaranteed to always return at least one core.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast to <code>detectCores()</code>, <code>parallelly::availableCores()</code> handles the above case automatically, and it guarantees to always return at least one core.</p> <h3 id="issue-2-detectcores-may-return-one">Issue 2: detectCores() may return one</h3> <p>Although it&rsquo;s rare to run into hardware with single-core CPUs these days, you might run into a virtual machine (VM) configured to have a single core. Because of this, you cannot reliably use:</p> <pre><code class="language-r">ncores &lt;- detectCores() - 1L </code></pre> <p>or</p> <pre><code class="language-r">ncores &lt;- detectCores() - 2L </code></pre> <p>in your code. If you use these constructs, a user of your code might end up with zero or a negative number of cores, which is another way we can end up with an error downstream. A real-world example of this problem can be found in continuous integration (CI) services, e.g. <code>detectCores()</code> returns 2 in GitHub Actions jobs. 
So, we need to account also for this case, which we can do by using the above <code>max()</code> solution, e.g.</p> <pre><code class="language-r">ncores &lt;- max(1L, detectCores() - 2L, na.rm = TRUE) </code></pre> <p>This is guaranteed to always return at least one.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast, <code>parallelly::availableCores()</code> handles this case via argument <code>omit</code>, which makes it easier to understand the code, e.g.</p> <pre><code class="language-r">ncores &lt;- availableCores(omit = 2) </code></pre> <p>This construct is guaranteed to return at least one core, e.g. if there are one, two, or three CPU cores on this machine, <code>ncores</code> will be one in all three cases.</p> <h3 id="issue-3-detectcores-may-return-too-many-cores">Issue 3: detectCores() may return too many cores</h3> <p>When we use PSOCK, SOCK, or MPI clusters as defined by the <strong>parallel</strong> package, the communication between the main R session and the parallel workers is done via R socket connections. Low-level functions <code>parallel::makeCluster()</code>, <code>parallelly::makeClusterPSOCK()</code>, and legacy <code>snow::makeCluster()</code> create these types of clusters. In turn, there are higher-level functions that rely on these low-level functions, e.g. <code>doParallel::registerDoParallel()</code> uses <code>parallel::makeCluster()</code> if you are on MS Windows, <code>BiocParallel::SnowParam()</code> uses <code>snow::makeCluster()</code>, and <code>plan(multisession)</code> and <code>plan(cluster)</code> of the <strong><a href="https://future.futureverse.org">future</a></strong> package use <code>parallelly::makeClusterPSOCK()</code>.</p> <p>R has a limit in the number of connections it can have open at any time. As of R 4.2.2, <a href="https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28">the limit is 125 open connections</a>. 
Because of this, we can use at most 125 parallel PSOCK, SOCK, or MPI workers. In practice, this limit is lower, because some connections may already be in use elsewhere. To find the current number of free connections, we can use <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>parallelly::freeConnections()</code></a>. If we try to launch a cluster with too many workers, there will not be enough connections available for the communication and the setup of the cluster will fail. For example, a user running on a 192-core machine will get errors such as:</p> <pre><code class="language-r">&gt; cl &lt;- parallel::makeCluster(detectCores()) Error in socketAccept(socket = socket, blocking = TRUE, open = &quot;a+b&quot;, : all connections are in use </code></pre> <p>and</p> <pre><code class="language-r">&gt; cl &lt;- parallelly::makeClusterPSOCK(detectCores()) Error: Cannot create 192 parallel PSOCK nodes. Each node needs one connection, but there are only 124 connections left out of the maximum 128 available on this R installation </code></pre> <p>Thus, if we use <code>detectCores()</code>, our R code will not work on larger, modern machines. This is a problem that will become more and more common as more users get access to more powerful computers. Hopefully, R will increase this connection limit in a future release, but until then, you as the developer are responsible for handling this case as well. To make your code agile to this limit, even if R increases it later, you can use:</p> <pre><code class="language-r">ncores &lt;- max(1L, detectCores(), na.rm = TRUE) ncores &lt;- min(parallelly::freeConnections(), ncores) </code></pre> <p>This is guaranteed to return at least zero (sic!) 
and never more than what is required to create a PSOCK, SOCK, or MPI cluster with that many parallel workers.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In the upcoming <strong>parallelly</strong> 1.33.0 version, you can use <code>parallelly::availableCores(constraints = &quot;connections&quot;)</code> to limit the result to the current number of available R connections. In addition, you can control the maximum number of cores that <code>availableCores()</code> returns by setting R option <code>parallelly.availableCores.system</code>, or environment variable <code>R_PARALLELLY_AVAILABLECORES_SYSTEM</code>, e.g. <code>R_PARALLELLY_AVAILABLECORES_SYSTEM=120</code>.</p> <h2 id="issue-4-detectcores-does-not-give-the-number-of-allowed-cores">Issue 4: detectCores() does not give the number of &ldquo;allowed&rdquo; cores</h2> <p>There&rsquo;s a note in <code>help(&quot;detectCores&quot;, package = &quot;parallel&quot;)</code> that touches on the above problems, but also on other important limitations that we should know of:</p> <blockquote> <p><strong>Note</strong></p> <p>This [= <code>detectCores()</code>] is not suitable for use directly for the <code>mc.cores</code> argument of <code>mclapply</code> nor specifying the number of cores in <code>makeCluster</code>. First because it may return <code>NA</code>, second because it does not give the number of <em>allowed</em> cores, and third because on Sparc Solaris and some Windows boxes it is not reasonable to try to use all the logical CPUs at once.</p> </blockquote> <p><strong>When is this relevant? The answer is: Always!</strong> This is because, as package developers, we cannot really know when this occurs, because we never know what type of hardware and system our code will run on. 
So, we have to account for these unknowns too.</p> <p>Let&rsquo;s look at some real-world cases where using <code>detectCores()</code> can become a real issue.</p> <h3 id="4a-a-personal-computer">4a. A personal computer</h3> <p>A user might want to run other software tools at the same time while running the R analysis. A very common pattern we find in R code is to save one core for other purposes, say, browsing the web, e.g.</p> <pre><code class="language-r">ncores &lt;- detectCores() - 1L </code></pre> <p>This is a good start. It is the first step toward your software tool acknowledging that there might be other things running on the same machine. However, unlike end-users, we as package developers cannot know how many cores the user needs, or wishes, to set aside. Because of this, it is better to let the user make this decision.</p> <p>A related scenario is when the user wants to run two concurrent R sessions on the same machine, both using your code. If your code assumes it can use all cores on the machine (i.e. <code>detectCores()</code> cores), the user will end up running the machine at 200% of its capacity. Whenever we use over 100% of the available CPU resources, we get penalized and waste our computational cycles on overhead from context switching, sub-optimal memory access, and more. This is where we end up with the situation illustrated in the left part of Figure&nbsp;1.</p> <p>Note also that users might not know that they use an R function that runs on all cores by default. They might not even be aware that this is a problem. Now, imagine if the user runs three or four such R sessions, resulting in a 300-400% CPU load. This is when things start to run slowly. The computer will be sluggish, maybe unresponsive, and most likely going to get very hot (&ldquo;we&rsquo;re frying the computer&rdquo;). 
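</p>

<p>One way to honor the &ldquo;save one core&rdquo; idea while still leaving the final say to the user is to compute a default that the user can override. A sketch; the <code>parallelly</code> alternative is commented out because it assumes that package is installed:</p>

```r
# Common but brittle: hard-codes the trade-off and inherits all of
# detectCores()'s caveats (may return NA, ignores schedulers and cgroups)
ncores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)

# More robust alternative (assumes the 'parallelly' package), which users
# and sysadmins can override via R options or environment variables:
# ncores <- parallelly::availableCores(omit = 1)
```

<p>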
By the time the four concurrent R processes complete, the user might have been able to finish six to eight similar processes had they not been fighting each other for the limited CPU resources.</p> <!-- If this happens on a shared system, the user might get an email from the systems administrator asking why they are "trying to fry the computer". The user gets blamed for something that is our fault - it is us who decided to run on `detectCores()` CPU cores by default. This leads us to another scenario where a user might run into a case where the CPUs are overwhelmed because a software tool assumes it has exclusive right to all cores. --> <h3 id="4b-a-shared-computer">4b. A shared computer</h3> <p>In both academia and industry, it is common that several users share the same compute server or set of compute nodes. It might be as simple as users SSHing into a shared machine with many cores and large amounts of memory to run their analyses there. On such setups, load balancing between users is often based on an honor system, where each user checks how many resources are available before launching an analysis. This helps to make sure they don’t end up using too many cores, or too much memory, slowing down the computer for everyone else.</p> <div style="width: 38%; float: right;"> <figure style="margin-top: 1ex;"> <img src="https://www.jottr.org/post/detectCores_bad.png" alt="The left-hand side graph of Figure 1, which shows mostly red bars at near 100% load for 24 CPU cores." style="width: 100%; margin: 0; margin-bottom: 2ex;"/> <figcaption> Figure 2: Overusing the CPU cores brings everything to a halt. </figcaption> </figure> </div> <p>Now, imagine they run a software tool that uses all CPU cores by default. In that case, there is a significant risk they will step on the other users&rsquo; processes, slowing everything down for everyone, especially if there is already a big load on the machine. From my experience in academia, this happens frequently. 
The user causing the problem is often not aware, because they just launch the problematic software with the default settings, leave it running, with a plan to come back to it a few hours or a few days later. In the meantime, other users might wonder why their command-line prompts become sluggish or even non-responsive, and their analyses suddenly take forever to complete. Eventually, someone or something alerts the systems administrators to the problem, who end up having to drop everything else and start troubleshooting. This often results in them terminating the wild-running processes and reaching out to the user who runs the problematic software, which leads to a large amount of time and resources being wasted among users and administrators. All this is only because we designed our R package to use all cores by default. This is not a made-up toy story; it is a very likely scenario that happens on shared servers if you make <code>detectCores()</code> the default in your R code.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast to <code>detectCores()</code>, if you use <code>parallelly::availableCores()</code> the user, or the systems administrator, can limit the default number of CPU cores returned by setting environment variable <code>R_PARALLELLY_AVAILABLECORES_FALLBACK</code>. For instance, by setting it to <code>R_PARALLELLY_AVAILABLECORES_FALLBACK=2</code> centrally, <code>availableCores()</code> will, unless there are other settings that allow the process to use more, return two cores regardless of how many CPU cores the machine has. This will lower the damage any single process can inflict on the system. It would take many such processes running at the same time for them to have an overall negative impact. 
The risk for that to happen by mistake is much lower than when using <code>detectCores()</code> by default.</p> <h3 id="4c-a-shared-compute-cluster-with-many-machines">4c. A shared compute cluster with many machines</h3> <p>Other, larger compute systems, often referred to as high-performance compute (HPC) clusters, have a job scheduler for running scripts in batches distributed across multiple machines. When users submit their scripts to the scheduler&rsquo;s job queue, they request how many cores and how much memory each job requires. For example, a user on a Slurm cluster can request that their <code>run_my_rscript.sh</code> script gets to run with 48 CPU cores and 256 GiB of RAM by submitting it to the scheduler as:</p> <pre><code class="language-sh">sbatch --cpus-per-task=48 --mem=256G run_my_rscript.sh </code></pre> <p>The scheduler keeps track of all running and queued jobs, and when enough compute slots are freed up, it will launch the next job in the queue, giving it the compute resources it requested. This is a very convenient and efficient way to batch process a large number of analyses coming from many users.</p> <p>However, just like with a shared server, it is important that the software tools running this way respect the compute resources that the job scheduler allotted to the job. The <code>detectCores()</code> function does <em>not</em> know about job schedulers - all it does is return the number of CPU cores on the current machine, regardless of how many cores the job has been allotted by the scheduler. So, if your R package uses <code>detectCores()</code> cores by default, then it will overuse the CPUs and slow things down for everyone running on the same compute node. 
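</p>

<p>The difference is easy to demonstrate. Below, a Slurm allocation is simulated by setting the environment variable that Slurm would set for a job submitted with <code>--cpus-per-task=48</code>; the <code>parallelly</code> call is commented out since it assumes that package is installed:</p>

```r
# Simulate, for illustration, the environment inside a Slurm job
# submitted with: sbatch --cpus-per-task=48 ...
Sys.setenv(SLURM_CPUS_PER_TASK = "48")

# detectCores() ignores the scheduler and reports the machine's cores
parallel::detectCores()

# availableCores() recognizes SLURM_CPUS_PER_TASK and would report
# the 48 allotted cores instead:
# parallelly::availableCores()
```

<p>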
Again, when this happens, it often slows everything down and triggers lots of wasted user and admin effort spent on troubleshooting and communication back and forth.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast, <code>parallelly::availableCores()</code> respects the number of CPU slots that the job scheduler has given to the job. It recognizes environment variables set by our most common HPC schedulers, including Fujitsu Technical Computing Suite (PJM), Grid Engine (SGE), Load Sharing Facility (LSF), PBS/Torque, and Simple Linux Utility for Resource Management (Slurm).</p> <h3 id="4d-running-r-via-cgroups-on-in-a-linux-container">4d. Running R via cgroups or in a Linux container</h3> <p>So far, we have been concerned about the overuse of CPU cores affecting other processes and other users running on the same machine. Some systems are configured to prevent misbehaving software from affecting other users. In Linux, this can be done with so-called control groups (&ldquo;cgroups&rdquo;), where a process gets allotted a certain number of CPU cores. If the process uses too many parallel workers, it cannot break out from the sandbox set up by cgroups. From the outside, it will look like the process uses at most its allocated CPU cores. Some HPC job schedulers have this feature enabled, but not all of them. The same feature exists for Linux containers, e.g. we can limit the number of CPU cores, or throttle the CPU load, using command-line options when launching a Docker container, e.g. <code>docker run --cpuset-cpus=0-2,8 …</code> or <code>docker run --cpus=3.4 …</code>.</p> <p>So, if you are a user on a system where compute resources are compartmentalized this way, you run a much lower risk of wreaking havoc on a shared system. 
That is good news, but if you run too many parallel workers, that is, try to use more cores than are available to you, then you will clog up your own analysis. The behavior would be the same as if you request 96 parallel workers on your local eight-core notebook (the scenario in the left panel of Figure&nbsp;1), with the exception that you will not overheat the computer.</p> <p>The problem with <code>detectCores()</code> is that it returns the number of CPU cores on the hardware, regardless of the cgroups settings. So, if your R process is limited to eight cores by cgroups, and you use <code>ncores = detectCores()</code> on a 96-core machine, you will end up running 96 parallel workers fighting for the resources on eight cores. A real-world example of this happens for those of you who have a free account on RStudio Cloud. In that case, you are given only a single CPU core to run your R code on, but the underlying machine typically has 16 cores. If you use <code>detectCores()</code> there, you will end up creating 16 parallel workers, running on the same CPU core, which is a very inefficient way to run the code.</p> <p><em>Shameless advertisement for the <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> package</em>: In contrast to <code>detectCores()</code>, <code>parallelly::availableCores()</code> respects cgroups, and will return eight cores instead of 96 in the above example, and a single core on a free RStudio Cloud account.</p> <h2 id="my-opinionated-recommendation">My opinionated recommendation</h2> <div style="width: 38%; float: right;"> <figure style="margin-top: 1ex;"> <img src="https://www.jottr.org/post/detectCores_good.png" alt="The right-hand side graph of Figure 1, which shows mostly green bars at near 100% load for 24 CPU cores." style="width: 100%; margin: 0; margin-bottom: 2ex;"/> <figcaption> Figure 3: If we avoid overusing the CPU cores, then everything will run much smoother and much faster. 
</figcaption> </figure> </div> <p>As developers, I think we should at least be aware of these problems, and acknowledge that they exist and are indeed real problems that people run into &ldquo;out there&rdquo;. We should also accept that we cannot predict what type of compute environment our R code will run on. Unfortunately, I don&rsquo;t have a magic solution that addresses all the problems reported here. That said, I think the best we can do is to be conservative and not make hard-coded decisions on parallelization in our R packages and R scripts.</p> <p>Because of this, I argue that <strong>the safest is to design your R package to run sequentially by default (e.g. <code>ncores = 1L</code>), and leave it to the user to decide on the number of parallel workers to use.</strong></p> <p>The <strong>second-best alternative</strong> that I can come up with is to replace <code>detectCores()</code> with <code>availableCores()</code>, e.g. <code>ncores = parallelly::availableCores()</code>. It is designed to respect common system and R settings that control the number of allowed CPU cores. It also respects R options and environment variables commonly used to limit CPU usage, including those set by our most common HPC job schedulers. In addition, it is possible to control the <em>fallback</em> behavior so that it uses only a few cores when nothing else is set. For example, if the environment variable <code>R_PARALLELLY_AVAILABLECORES_FALLBACK</code> is set to <code>2</code>, then <code>availableCores()</code> returns two cores by default, unless other settings allow more. A conservative systems administrator may want to set <code>export R_PARALLELLY_AVAILABLECORES_FALLBACK=1</code> in <code>/etc/profile.d/single-core-by-default.sh</code>. 
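</p>

<p>The sequential-by-default recommendation can be sketched as a small, hypothetical package function, where parallelism only happens when the end user explicitly asks for it:</p>

```r
# Hypothetical package function: sequential by default (ncores = 1L);
# the end user decides whether, and how much, to parallelize
slow_sqrt <- function(X, ncores = 1L) {
  if (ncores > 1L) {
    cl <- parallel::makeCluster(ncores)
    on.exit(parallel::stopCluster(cl), add = TRUE)
    parallel::parLapply(cl, X, sqrt)
  } else {
    lapply(X, sqrt)
  }
}

y <- slow_sqrt(1:4)                                   # sequential default
# y <- slow_sqrt(1:4, ncores = parallelly::availableCores())  # user opts in
```

<p>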
To see other benefits from using <code>availableCores()</code>, see <a href="https://parallelly.futureverse.org">https://parallelly.futureverse.org</a>.</p> <p>Believe it or not, there&rsquo;s actually more to be said on this topic, but I think this is already more than a mouthful, so I will save that for another blog post. If you made it this far, I applaud you and I thank you for your interest. If you agree, or disagree, or have additional thoughts around this, please feel free to reach out on the <a href="https://github.com/HenrikBengtsson/future/discussions/">Future Discussions Forum</a>.</p> <p>Over and out,</p> <p>Henrik</p> <p><small><sup>1</sup> Searching code on GitHub requires you to log in to GitHub.</small></p> <p>UPDATE 2022-12-06: <a href="https://github.com/HenrikBengtsson/future/discussions/656">Alex Chubaty pointed out another problem</a>, where <code>detectCores()</code> can be too large on modern machines, e.g. machines with 128 or 192 CPU cores. I&rsquo;ve added Section &lsquo;Issue 3: detectCores() may return too many cores&rsquo; explaining and addressing this problem.</p> <p>UPDATE 2022-12-11: Mention upcoming <code>parallelly::availableCores(constraints = &quot;connections&quot;)</code>.</p> </description>
</item>
<item>
<title>useR! 2022: My 'Futureverse: Profile Parallel Code' Slides</title>
<link>https://www.jottr.org/2022/06/23/future-user2022-slides/</link>
<pubDate>Thu, 23 Jun 2022 17:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2022/06/23/future-user2022-slides/</guid>
<description> <figure style="margin-top: 3ex;"> <img src="https://www.jottr.org/post/BengtssonH_20220622-Future-useR2022_slide18.png" alt="Screenshot of Slide #18 in my presentation. A graphical time-chart representation of the events that take place when calling the following code in R: plan(cluster, workers = 2); fs <- lapply(1:2, function(x) future(slow(x))); vs <- value(fs); There are two futures displayed in the time chart. Each future is represented by a blue, horizontal 'lifespan' bar. The second future starts slightly after the first one. Each future is evaluated in a separate worker, which is represented as a pink horizontal 'evaluate' bar. The two 'lifespan' and the two 'evaluation' bars are overlapping, indicating they run in parallel." style="width: 100%; margin: 0;"/> <figcaption> Figure 1: A time chart of logged events for two futures resolved by two parallel workers. This is a screenshot of Slide #18 in my talk. </figcaption> </figure> <p><img src="https://www.jottr.org/post/user2022-logo_450x300.webp" alt="The useR 2022 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>Below are the slides for my <em>Futureverse: Profile Parallel Code</em> talk that I presented at the <a href="https://user2022.r-project.org/">useR! 
2022</a> conference online and hosted by the Department of Biostatistics at Vanderbilt University Medical Center.</p> <p>Title: Futureverse: Profile Parallel Code<br /> Speaker: Henrik Bengtsson<br /> Session: <a href="https://user2022.r-project.org/program/talks/#session-21-parallel-computing">#21: Parallel Computing</a>, chaired by Ilias Moutsopoulos<br /> Slides: <a href="https://docs.google.com/presentation/d/e/2PACX-1vTnpyj7qvyKr-COHaJAYjoGveoOJPYrstTmvC4farFk2vdwWb8O79kA5tn7klTS67_uoJJdKFPgKNql/pub?start=true&amp;loop=false&amp;delayms=60000&amp;slide=id.gf778290f24_0_165">HTML</a>, <a href="https://www.jottr.org/presentations/useR2022/BengtssonH_20220622-Future-useR2022.pdf">PDF</a> (24 slides)<br /> Video: <a href="https://www.youtube.com/watch?v=_lrPgNqT3SM&amp;t=2528s">official recording</a> (27m30s long starting at 42m10s)</p> <p>Abstract:</p> <p>&ldquo;In this presentation, I share recent enhancements that allow developers and end-users to profile R code running in parallel via the future framework. With these new, frequently requested features, we can study how and where our computational resources are used. With the help of visualization (e.g., ggplot2 and Shiny), we can identify bottlenecks in our code and parallel setup. For example, if we find that some parallel workers are more idle than expected, we can tweak settings to improve the overall CPU utilization and thereby increase the total throughput and decrease the turnaround time (latency). These new benchmarking tools work out of the box on existing code and packages that build on the future package, including future.apply, furrr, and doFuture.</p> <p>The future framework, available on CRAN since 2016, has been used by hundreds of R packages and is among the top 1% of most downloaded packages. It is designed to unify and leverage common parallelization frameworks in R and to make new and existing R code faster with minimal efforts of the developer. 
The futureverse allows you, the developer, to stay with your favorite programming style, and end-users are free to choose the parallel backend to use (e.g., on a local machine, across multiple machines, in the cloud, or on a high-performance computing (HPC) cluster).&rdquo;</p> <hr /> <p>I want to send out a big thank you to useR! organizers, staff, and volunteers, and everyone else who contributed to this event.</p> <p>/Henrik</p> <h2 id="links">Links</h2> <ul> <li>useR! 2022: <a href="https://user2022.r-project.org/">https://user2022.r-project.org/</a></li> <li><strong>futureverse</strong> website: <a href="https://www.futureverse.org/">https://www.futureverse.org/</a></li> <li><strong>future</strong> package <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org/">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>parallelly: Support for Fujitsu Technical Computing Suite High-Performance Compute (HPC) Environments</title>
<link>https://www.jottr.org/2022/06/09/parallelly-support-for-fujitsu-technical-computing-suite-high-performance-compute-hpc-environments/</link>
<pubDate>Thu, 09 Jun 2022 13:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2022/06/09/parallelly-support-for-fujitsu-technical-computing-suite-high-performance-compute-hpc-environments/</guid>
<description> <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.32.0 is now on CRAN. One of the major updates is that <code>availableCores()</code> and <code>availableWorkers()</code>, and therefore also the <strong>future</strong> framework, gained support for the &lsquo;Fujitsu Technical Computing Suite&rsquo; job scheduler. For other updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <p>The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <h2 id="support-for-the-fujitsu-technical-computing-suite">Support for the Fujitsu Technical Computing Suite</h2> <p>Functions <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> and <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a> now support the Fujitsu Technical Computing Suite. Fujitsu Technical Computing Suite is a high-performance compute (HPC) job scheduler, which is popular in Japan among other places, e.g. at RIKEN and Kyushu University.</p> <p>Specifically, these functions now recognize environment variables <code>PJM_VNODE_CORE</code>, <code>PJM_PROC_BY_NODE</code>, and <code>PJM_O_NODEINF</code> set by the Fujitsu Technical Computing Suite scheduler. 
For example, if we submit a job script with:</p> <pre><code class="language-sh">$ pjsub -L vnode=4 -L vnode-core=10 script.sh </code></pre> <p>the scheduler will allocate four slots with ten cores each on one or more compute nodes. For example, we might get:</p> <pre><code class="language-r">parallelly::availableCores() #&gt; [1] 10 parallelly::availableWorkers() #&gt; [1] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; #&gt; [6] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; #&gt; [11] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; #&gt; [16] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; #&gt; [21] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; #&gt; [26] &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; &quot;node032&quot; #&gt; [31] &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; #&gt; [36] &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; &quot;node109&quot; </code></pre> <p>In this example, the scheduler allocated three 10-core slots on compute node <code>node032</code> and one 10-core slot on compute node <code>node109</code>, totalling 40 CPU cores, as requested. 
Because of this, users on these systems can now use <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> to set up a parallel PSOCK cluster as:</p> <pre><code class="language-r">library(parallelly) cl &lt;- makeClusterPSOCK(availableWorkers(), rshcmd = &quot;pjrsh&quot;) </code></pre> <p>As shown above, this code picks up whatever <code>vnode</code> and <code>vnode-core</code> configuration was requested via the <code>pjsub</code> submission, and launches 40 parallel R workers via the <code>pjrsh</code> tool, which is part of the Fujitsu Technical Computing Suite.</p> <p>This also means that we can use:</p> <pre><code class="language-r">library(future) plan(cluster, rshcmd = &quot;pjrsh&quot;) </code></pre> <p>when using the <strong>future</strong> framework, which uses <code>makeClusterPSOCK()</code> and <code>availableWorkers()</code> internally.</p> <h2 id="avoid-having-to-specify-rshcmd-pjrsh">Avoid having to specify rshcmd = &ldquo;pjrsh&rdquo;</h2> <p>To avoid having to specify argument <code>rshcmd = &quot;pjrsh&quot;</code> manually, we can set it via environment variable <a href="https://parallelly.futureverse.org/reference/parallelly.options.html"><code>R_PARALLELLY_MAKENODEPSOCK_RSHCMD</code></a> (sic!) before launching R, e.g.</p> <pre><code class="language-sh">export R_PARALLELLY_MAKENODEPSOCK_RSHCMD=pjrsh </code></pre> <p>To make this persistent, the user can add this line to their <code>~/.bashrc</code> shell startup script. 
Alternatively, the system administrator can add it to a <code>/etc/profile.d/*.sh</code> file of their choice.</p> <p>With this environment variable set, it&rsquo;s sufficient to do:</p> <pre><code>library(parallelly) cl &lt;- makeClusterPSOCK(availableWorkers()) </code></pre> <p>and</p> <pre><code class="language-r">library(future) plan(cluster) </code></pre> <p>In addition to not having to remember to use <code>rshcmd = &quot;pjrsh&quot;</code>, a major advantage of this approach is that the same R script also works on other systems, including the user&rsquo;s local machine and HPC environments such as Slurm and SGE.</p> <p>Over and out, and welcome to all Fujitsu Technical Computing Suite users!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>parallelly 1.32.0: makeClusterPSOCK() Didn't Work with Chinese and Korean Locales</title>
<link>https://www.jottr.org/2022/06/08/parallelly-1.32.0-makeclusterpsock-didnt-work-with-chinese-and-korean-locales/</link>
<pubDate>Wed, 08 Jun 2022 14:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2022/06/08/parallelly-1.32.0-makeclusterpsock-didnt-work-with-chinese-and-korean-locales/</guid>
<description> <div style="padding: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.32.0 is on CRAN. This release fixes an important bug that affected users running with the Simplified Chinese, Traditional Chinese (Taiwan), or Korean locale. The bug caused <code>makeClusterPSOCK()</code>, and therefore also <code>future::plan(&quot;multisession&quot;)</code>, to fail with an error. For other updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <p>The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <h2 id="important-bug-fix-for-chinese-and-korean-users">Important bug fix for Chinese and Korean users</h2> <p>It turns out that <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> has never<sup>[1]</sup> worked for users that have their computers set to use a Korean (<code>LANGUAGE=ko</code>), a Simplified Chinese (<code>LANGUAGE=zh_CN</code>), or a Traditional Chinese (Taiwan) (<code>LANGUAGE=zh_TW</code>) locale. 
For example,</p> <pre><code class="language-r">Sys.setLanguage(&quot;zh_CN&quot;) library(parallelly) cl &lt;- parallelly::makeClusterPSOCK(2) #&gt; 错误: ‘node$session_info$process$pid == pid’ is not TRUE #&gt; 此外: Warning message: #&gt; In add_cluster_session_info(cl[ii]) : 强制改变过程中产生了NA </code></pre> <p>The workaround was to pass <code>validate = FALSE</code>, e.g.</p> <pre><code class="language-r">cl &lt;- parallelly::makeClusterPSOCK(2, validate = FALSE) </code></pre> <p>This bug was caused by an internal assertion that made incorrect assumptions about what <code>print()</code> would output for <code>SOCK0node</code> and <code>SOCKnode</code> objects. It worked with most locales, but not with the above three. I have fixed this in the most recent release of <strong>parallelly</strong>.</p> <p>Since the &lsquo;multisession&rsquo; strategy of the <strong><a href="https://future.futureverse.org">future</a></strong> framework relies on <code>makeClusterPSOCK()</code>, this bug also affected the <strong>future</strong> package, e.g.</p> <pre><code class="language-r">Sys.setLanguage(&quot;ko&quot;) library(future) plan(multisession) #&gt; 에러: 'node$session_info$process$pid == pid' is not TRUE #&gt; 추가정보: 경고메시지(들): #&gt; add_cluster_session_info(cl[ii])에서: 강제형변환에 의해 생성된 NA 입니다 </code></pre> <p>So, if you run into these errors, upgrade to the latest version of <strong>parallelly</strong>, e.g. <code>update.packages()</code>, restart R, and it will work as you would expect.</p> <!-- Source: https://chinesefor.us/lessons/say-sorry-chinese-apologize-duibuqi/ and https://www.wikihow.com/Apologize-in-Korean --> <p>To prevent this from happening again, I am now making sure to always check the package with these locales too, in addition to English. CRAN already checks packages <a href="https://cran.r-project.org/web/checks/check_flavors.html">with different English and German locales</a>.</p> <p>I am sorry, 对不起, 미안해요, about this. 
Hopefully, it&rsquo;ll work smoother from now on.</p> <p>Happy parallelization!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> <p><sup>[1]</sup> The last time it worked was with <strong>future</strong> 1.4.0 (2017-03-13), when this function was still part of the <strong>future</strong> package.</p> </description>
</item>
<item>
<title>progressr 0.10.1: Plyr Now Supports Progress Updates also in Parallel</title>
<link>https://www.jottr.org/2022/06/03/progressr-0.10.1/</link>
<pubDate>Fri, 03 Jun 2022 13:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2022/06/03/progressr-0.10.1/</guid>
<description> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/three_in_chinese.gif" alt="Three strokes writing three in Chinese"/> </center> </div> <p><strong><a href="https://progressr.futureverse.org">progressr</a></strong> 0.10.1 is on CRAN. I dedicate this release to all <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> users and developers out there.</p> <p>The <strong>progressr</strong> package provides a minimal API for reporting progress updates in R. The design is to separate the representation of progress updates from how they are presented. What type of progress to signal is controlled by the developer. How these progress updates are rendered is controlled by the end user. For instance, some users may prefer visual feedback, such as a horizontal progress bar in the terminal, whereas others may prefer auditory feedback. The <strong>progressr</strong> package also works when R code is processed in parallel or distributed using the <strong><a href="https://future.futureverse.org">future</a></strong> framework.</p> <h2 id="plyr-future-progressr-parallel-progress-reporting"><strong>plyr</strong> + <strong>future</strong> + <strong>progressr</strong> ⇒ parallel progress reporting</h2> <p>The major update in this release is that <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> (&gt;= 1.8.7) now has built-in support for the <strong>progressr</strong> package when running in parallel. For example,</p> <pre><code class="language-r">library(plyr)

## Parallelize on the local machine
future::plan(&quot;multisession&quot;)
doFuture::registerDoFuture()

library(progressr)
handlers(global = TRUE)

y &lt;- llply(1:100, function(x) {
  Sys.sleep(1)
  sqrt(x)
}, .progress = &quot;progressr&quot;, .parallel = TRUE)
#&gt; |============ | 28%
</code></pre> <p>Previously, <strong>plyr</strong> only had built-in support for progress reporting when running sequentially.
Note that <strong>progressr</strong> is the only package that supports progress reporting when using <code>.parallel = TRUE</code> in <strong>plyr</strong>.</p> <p>Also, whenever using <strong>progressr</strong>, the user has plenty of options for where and how progress is reported. For example, <code>handlers(&quot;rstudio&quot;)</code> uses the progress bar in the RStudio job interface, <code>handlers(&quot;progress&quot;)</code> uses terminal progress bars of the <strong>progress</strong> package, and <code>handlers(&quot;beep&quot;)</code> reports on progress using sounds. It&rsquo;s also possible to report progress in Shiny apps. See my blog post <a href="https://www.jottr.org/2021/06/11/progressr-0.8.0/">&lsquo;progressr 0.8.0 - RStudio’s Progress Bar, Shiny Progress Updates, and Absolute Progress&rsquo;</a> for more information.</p> <h2 id="there-s-actually-a-better-way">There&rsquo;s actually a better way</h2> <p>I actually recommend another way of reporting on progress with <strong>plyr</strong> map-reduce functions, which is more in line with the design philosophy of <strong>progressr</strong>:</p> <blockquote> <p>The developer is responsible for providing progress updates, but it’s only the end user who decides if, when, and how progress should be presented. No exceptions will be allowed.</p> </blockquote> <p>Please see Section &lsquo;plyr::llply(…, .parallel = TRUE) with doFuture&rsquo; in the <a href="https://progressr.futureverse.org/articles/progressr-intro.html">&lsquo;progressr: An Introduction&rsquo;</a> vignette for this alternative approach, which has worked for a long time already.
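</p> <p>For reference, a minimal sketch of that alternative approach could look as follows. This is not verbatim from the vignette; it assumes <strong>plyr</strong> (&gt;= 1.8.7), <strong>doFuture</strong>, and <strong>progressr</strong> are installed, and the helper function name is made up:</p> <pre><code class="language-r">library(progressr)
handlers(global = TRUE)            ## let the end user's handlers render progress
doFuture::registerDoFuture()       ## use futures as the foreach/plyr backend
future::plan(&quot;multisession&quot;)

slow_sqrt &lt;- function(xs) {
  p &lt;- progressor(along = xs)      ## developer declares progress updates
  plyr::llply(xs, function(x) {
    p()                            ## signal progress for each element
    Sys.sleep(1)
    sqrt(x)
  }, .parallel = TRUE)
}

y &lt;- slow_sqrt(1:100)
</code></pre> <p>Here the function only <em>signals</em> progress; whether and how it is rendered is entirely up to the end user&rsquo;s <code>handlers()</code> settings.</p> <p>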
But, of course, adding <code>.progress = &quot;progressr&quot;</code> to your already existing <strong>plyr</strong> <code>.parallel = TRUE</code> code is as simple as it gets.</p> <p>Now, make some progress!</p> <h2 id="other-posts-on-progress-reporting">Other posts on progress reporting</h2> <ul> <li><a href="https://www.jottr.org/2021/06/11/progressr-0.8.0/">progressr 0.8.0 - RStudio&rsquo;s Progress Bar, Shiny Progress Updates, and Absolute Progress</a>, 2021-06-11</li> <li><a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 Slides on Progressr</a>, 2020-07-04</li> <li>See also <a href="https://www.jottr.org/tags/#progressr-list">&lsquo;progressr&rsquo;</a> tag.</li> </ul> <h2 id="links">Links</h2> <ul> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> <li><strong>plyr</strong> package: <a href="https://cran.r-project.org/package=plyr">CRAN</a>, <a href="https://github.com/hadley/plyr">GitHub</a>, <a href="http://plyr.had.co.nz/">pkgdown-ish</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>parallelly 1.31.1: Better at Inferring Number of CPU Cores with Cgroups and Linux Containers</title>
<link>https://www.jottr.org/2022/04/22/parallelly-1.31.1/</link>
<pubDate>Fri, 22 Apr 2022 11:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2022/04/22/parallelly-1.31.1/</guid>
<description> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.31.1 is on CRAN. The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <p>Since my <a href="https://www.jottr.org/2021/11/22/parallelly-1.29.0/">previous post on <strong>parallelly</strong></a> in November 2021, I&rsquo;ve fixed a few bugs and added some new features to the package:</p> <ul> <li><p><code>availableCores()</code> detects more cgroups settings, e.g. it now detects the number of CPUs available to your RStudio Cloud session</p></li> <li><p><code>makeClusterPSOCK()</code> gained argument <code>default_packages</code> to control which packages to attach at startup on the R workers</p></li> <li><p><code>makeClusterPSOCK()</code> gained <code>rscript_sh</code> to explicitly control what type of shell quotes to use on the R workers</p></li> </ul> <p>Below is a detailed description of these new features. Some of them, and some of the bug fixes, were added in version 1.30.0, while others in versions 1.31.0 and 1.31.1.</p> <h2 id="availablecores-detects-more-cgroups-settings">availableCores() detects more cgroups settings</h2> <p><em><a href="https://www.wikipedia.org/wiki/Cgroups">Cgroups</a></em>, short for control groups, is a low-level feature in Linux for controlling which resources, and how much of them, a process may use.
This prevents individual processes from taking up all resources. For example, an R process can be limited to use at most four CPU cores, even if the underlying hardware has 48 CPU cores. Imagine we parallelize with <code>parallel::detectCores()</code> background workers, e.g.</p> <pre><code class="language-r">library(future) plan(multisession, workers = parallel::detectCores()) </code></pre> <p>This will spawn 48 background R processes. Without cgroups, these 48 parallel R workers will run across all 48 CPU cores on the machine, competing with all other software and all other users running on the same machine. With cgroups limiting us to, say, four CPU cores, there will still be 48 parallel R workers running, but they will now run isolated on only four CPU cores, leaving the other 44 CPU cores alone.</p> <p>Of course, running 48 parallel workers on four CPU cores is not very efficient. There will be a lot of wasted CPU cycles due to context switching. The problem is that we use <code>parallel::detectCores()</code> here, which is what gives us 48 workers. If we instead use <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> of <strong>parallelly</strong>;</p> <pre><code class="language-r">library(future) plan(multisession, workers = parallelly::availableCores()) </code></pre> <p>we get four parallel workers, which reflects the four CPU cores that cgroups gives us. Basic support for this was introduced in <strong>parallelly</strong> 1.22.0 (2020-12-12), by querying <code>nproc</code>. This required that <code>nproc</code> was installed on the system, and although it worked in many cases, it did not work for all cgroups configurations. Specifically, it would not work when cgroups was <em>throttling</em> the CPU usage rather than limiting the process to a specific set of CPU cores. 
To illustrate this, assume we run R via Docker using <a href="https://www.rocker-project.org/">Rocker</a>:</p> <pre><code class="language-sh">$ docker run --cpuset-cpus=0-2,8 rocker/r-base
</code></pre> <p>then cgroups will isolate the Linux container to run on CPU cores 0, 1, 2, and 8 of the host. In this case <code>nproc</code>, e.g. <code>system(&quot;nproc&quot;)</code> from within R, returns four (4), and therefore so does <code>parallelly::availableCores()</code>. Starting with <strong>parallelly</strong> 1.31.0, <code>parallelly::availableCores()</code> detects this even when <code>nproc</code> is not installed on the system. An alternative way to limit CPU resources is to throttle the average CPU load. Using Docker, this can be done as:</p> <pre><code class="language-sh">$ docker run --cpus=3.5 rocker/r-base
</code></pre> <p>In this case, cgroups will throttle our R process to consume at most 350% worth of CPU on the host, where 100% corresponds to a single CPU. Here, <code>nproc</code> is of no use and simply gives the number of CPUs on the host (e.g. 48). Starting with <strong>parallelly</strong> 1.31.0, <code>parallelly::availableCores()</code> can detect that cgroups throttles R to an average load of 3.5 CPUs. Since we cannot run 3.5 parallel workers, <code>parallelly::availableCores()</code> rounds down to the nearest integer and returns three (3). <a href="https://rstudio.cloud/">RStudio Cloud</a> is one example where CPU throttling is used, so if you work in RStudio Cloud, use <code>parallelly::availableCores()</code> and you will be good.</p> <p>While talking about RStudio Cloud, if you use a free account, you have access to only a single CPU core (&ldquo;nCPUs = 1&rdquo;). In this case, <code>plan(multisession, workers = parallelly::availableCores())</code>, or equivalently, <code>plan(multisession)</code>, will fall back to sequential processing, because there is no point in running in parallel on a single core.
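</p> <p>A quick way to see the difference from within such a container is to compare the two functions directly. This is an illustrative sketch; the numbers depend entirely on the hardware and the cgroups configuration:</p> <pre><code class="language-r">parallel::detectCores()        ## CPUs on the host hardware, e.g. 48
parallelly::availableCores()   ## what cgroups actually gives us, e.g. 3
</code></pre> <p>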
If you still want to <em>prototype</em> parallel processing in a single-core environment, say with two cores, you can set option <code>parallelly.availableCores.min = 2</code>. This makes <code>availableCores()</code> return two (2).</p> <h2 id="makeclusterpsock-gained-more-skills">makeClusterPSOCK() gained more skills</h2> <p>Since <strong>parallelly</strong> 1.29.0, <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> has gained arguments <code>default_packages</code> and <code>rscript_sh</code>.</p> <h3 id="new-argument-default-packages">New argument <code>default_packages</code></h3> <p>Argument <code>default_packages</code> controls which R packages are attached on each worker during startup. Previously, it was only possible, via the logical argument <code>methods</code>, to control whether or not the <strong>methods</strong> package should be attached - an argument that stems from <code>parallel::makePSOCKcluster()</code>. With the new <code>default_packages</code> argument, we have full control of which packages are attached. For instance, if we want to go minimal, we can do:</p> <pre><code class="language-r">cl &lt;- parallelly::makeClusterPSOCK(1, default_packages = &quot;base&quot;)
</code></pre> <p>This will result in one R worker with only the <strong>base</strong> package <em>attached</em>;</p> <pre><code class="language-r">&gt; parallel::clusterEvalQ(cl, { search() })
[[1]]
[1] &quot;.GlobalEnv&quot;   &quot;Autoloads&quot;    &quot;package:base&quot;
</code></pre> <p>Having said that, note that more packages are <em>loaded</em>;</p> <pre><code class="language-r">&gt; parallel::clusterEvalQ(cl, { loadedNamespaces() })
[[1]]
[1] &quot;compiler&quot; &quot;parallel&quot; &quot;utils&quot;    &quot;base&quot;
</code></pre> <p>Like <strong>base</strong>, <strong>compiler</strong> is a package that R always loads.
The <strong>parallel</strong> package is loaded because it provides the code for running the background R workers. The <strong>utils</strong> package is loaded because <code>makeClusterPSOCK()</code> validates that the workers are functional by collecting extra information from the R workers that later may be useful when reporting on errors. To skip this, pass argument <code>validate = FALSE</code>.</p> <h3 id="new-argument-rscript-sh">New argument <code>rscript_sh</code></h3> <p>The new argument <code>rscript_sh</code> can be used in the rare case where one launches remote R workers on non-Unix machines from a Unix-like machine. For example, if we, from a Linux machine, launch remote MS Windows workers, we need to use <code>rscript_sh = &quot;cmd&quot;</code>.</p> <p>That covers the most important additions to <strong>parallelly</strong>. For bug fixes and minor updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>future 1.24.0: Forwarding RNG State also for Stand-Alone Futures</title>
<link>https://www.jottr.org/2022/02/22/future-1.24.0-forwarding-rng-state-also-for-stand-alone-futures/</link>
<pubDate>Tue, 22 Feb 2022 13:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2022/02/22/future-1.24.0-forwarding-rng-state-also-for-stand-alone-futures/</guid>
<description> <p><strong><a href="https://future.futureverse.org">future</a></strong> 1.24.0 is on CRAN. It comes with one significant update related to random number generation, further deprecation of legacy future strategies, a slight improvement to <code>plan()</code> and <code>tweak()</code>, and some bug fixes. Below are the most important changes.</p> <figure style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/xkcd_221-random_number.png" alt="A one-box XKCD comic with the following handwritten code: int getRandomNumber() { return 4; // chosen by fair dice roll. // guaranteed to be random. } "/> </center> <figcaption style="font-size: small; font-style: italic;">One of many possible random number generators. This one was carefully designed by <a href="https://xkcd.com/221/">XKCD</a> [CC BY-NC 2.5]. </figcaption> </figure> <h2 id="future-seed-true-updates-rng-state">future(&hellip;, seed = TRUE) updates RNG state</h2> <p>In <strong>future</strong> (&lt; 1.24.0), using <a href="https://future.futureverse.org/reference/future.html"><code>future(..., seed = TRUE)</code></a> would <em>not</em> forward the state of the random number generator (RNG). For example, if we generated random numbers in individual futures this way, they would become <em>identical</em>, e.g.</p> <pre><code class="language-r">f &lt;- future(rnorm(n = 1L), seed = TRUE)
value(f)
#&gt; [1] -1.424997
f &lt;- future(rnorm(n = 1L), seed = TRUE)
value(f)
#&gt; [1] -1.424997
</code></pre> <p>This was a deliberate, conservative design, because it is not obvious exactly how the RNG state should be forwarded in this case, especially if we consider that random numbers may also be generated in the main R session. The more I dug into the problem, the further down I ended up in a rabbit hole. Because of this, I have held back on addressing this problem, leaving it to the developer to solve it, i.e.
they had to roll their own RNG streams designed for parallel processing, and populate each future with a unique seed from those RNG streams, i.e. <code>future(..., seed = &lt;seed&gt;)</code>. This is how <strong><a href="https://future.apply.futureverse.org">future.apply</a></strong> and <strong><a href="https://furrr.futureverse.org">furrr</a></strong> already do it internally.</p> <p>However, I understand that design was confusing, and if not understood, it could silently lead to RNG mistakes and correlated, or even identical, random numbers. I also sometimes got confused about this when I needed to do something quickly with individual futures and random numbers. I even considered making <code>seed = TRUE</code> an error until resolved, and, looking back, maybe I should have done so.</p> <p>Anyway, because it is rather tedious to roll your own L&rsquo;Ecuyer-CMRG RNG streams, I decided to update <code>future(..., seed = TRUE)</code> to provide a good-enough solution internally, where it forwards the RNG state and then provides the future with an RNG substream based on the updated RNG state. In <strong>future</strong> (&gt;= 1.24.0), we now get:</p> <pre><code class="language-r">f &lt;- future(rnorm(n = 1L), seed = TRUE)
v &lt;- value(f)
print(v)
#&gt; [1] -1.424997
f &lt;- future(rnorm(n = 1L), seed = TRUE)
v &lt;- value(f)
print(v)
#&gt; [1] -1.985136
</code></pre> <p>This update only affects code that currently uses <code>future(..., seed = TRUE)</code>. It does <em>not</em> affect code that relies on <strong>future.apply</strong> or <strong>furrr</strong>, which already worked correctly.
That is, you can keep using <code>y &lt;- future_lapply(..., future.seed = TRUE)</code> and <code>y &lt;- future_map(..., .options = furrr_options(seed = TRUE))</code>.</p> <h2 id="deprecating-future-strategies-transparent-and-remote">Deprecating future strategies &lsquo;transparent&rsquo; and &lsquo;remote&rsquo;</h2> <p>It&rsquo;s on the <a href="https://futureverse.org/roadmap.html">roadmap</a> to provide mechanisms for the developer to declare what resources a particular future needs and for the end user to specify multiple parallel-backend alternatives, so that the future can be processed on a worker that can best meet its resource requirements. In order to support this, we need to restrict the future backend API further, which has been in the works over the last couple of years in collaboration with existing package developers.</p> <p>In this release, I am formally deprecating future strategies <code>transparent</code> and <code>remote</code>. When used, they now produce an informative warning. The <code>transparent</code> strategy is deprecated in favor of using <code>sequential</code> with argument <code>split = TRUE</code> set.
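</p> <p>For instance, code that used <code>transparent</code> for troubleshooting can be migrated as in the following sketch:</p> <pre><code class="language-r">library(future)
## Previously: plan(transparent)
plan(sequential, split = TRUE)  ## show stdout and conditions directly, as 'transparent' did
f &lt;- future(print(&quot;hello&quot;))
v &lt;- value(f)
</code></pre> <p>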
If you still use <code>remote</code>, please migrate to <code>cluster</code>, which has long been able to do everything that <code>remote</code> can.</p> <p>On a related note, if you are still using <code>multiprocess</code>, which has been deprecated since <strong>future</strong> 1.20.0 (2020-11-03), please migrate to <code>multisession</code> so you won&rsquo;t be surprised when <code>multiprocess</code> becomes defunct.</p> <p>For the other updates, please see the <a href="https://future.futureverse.org/news/index.html">NEWS</a>.</p> <p>Happy futuring!</p> <p>Henrik</p> <h2 id="other-posts-on-random-numbers-in-parallel-processing">Other posts on random numbers in parallel processing</h2> <ul> <li><p><a href="https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/">future 1.19.1 - Making Sure Proper Random Numbers are Produced in Parallel Processing</a>, 2020-09-22</p></li> <li><p><a href="https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/">Detect When the Random Number Generator Was Used</a>, 2020-09-21</p></li> <li><p><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future_lapply() and More</a>, 2017-02-19</p></li> </ul> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a>, <a href="https://future.apply.futureverse.org">pkgdown</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/furrr">GitHub</a>, <a href="https://furrr.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>Future Improvements During 2021</title>
<link>https://www.jottr.org/2022/01/07/future-during-2021/</link>
<pubDate>Fri, 07 Jan 2022 14:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2022/01/07/future-during-2021/</guid>
<description> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/paragliding_mount_tamalpais_20220101.jpg" alt="First person view while paragliding during a sunny day with blue skies. The pilot's left hand with a glove can be seen pulling the left brake with lines going up to the white, left wing tip above. The pilot is in a left turn high above the mountain side with open patches of grass among the trees. Two other paragliders further down can be seen in the distance. Down below, to the left, there is a long ocean beach slowly curving up towards a point in the horizon. Behind the beach, there is a lagoon. Part of the mountain ridge can be seen to the right."/> </center> </div> <p>Happy New Year! I made some updates to the future framework during 2021 that involve overall improvements and essential preparations to go forward with some exciting new features that I&rsquo;m keen to work on during 2022.</p> <p>The <a href="https://futureverse.org">future framework</a> makes it easy to parallelize existing R code - often with only a minor change of code. The goal is to lower the barriers so that anyone can quickly and safely speed up their existing R code in a worry-free manner.</p> <p><strong><a href="https://future.futureverse.org">future</a></strong> 1.22.1 was released in August 2021, followed by <strong>future</strong> 1.23.0 at the end of October 2021.
Below, I summarize the updates that came with those two releases:</p> <ul> <li><a href="#new-features">New features</a></li> <li><a href="#performance-improvements">Performance improvements</a></li> <li><a href="#cleanups-to-make-room-for-new-features">Cleanups to make room for new features</a></li> <li><a href="#significant-changes-preparing-for-the-future">Significant changes preparing for the future</a></li> <li><a href="#roadmap-ahead">Roadmap ahead</a></li> </ul> <p>There were also several updates to the related <strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> and <strong><a href="https://progressr.futureverse.org">progressr</a></strong> packages, which you can read about in earlier blog posts under the <a href="https://www.jottr.org/tags/#parallelly-list">#parallelly</a> and <a href="https://www.jottr.org/tags/#progressr-list">#progressr</a> blog tags.</p> <h2 id="new-features">New features</h2> <h3 id="futuresessioninfo-for-troubleshooting-and-issue-reporting">futureSessionInfo() for troubleshooting and issue reporting</h3> <p>Function <a href="https://future.futureverse.org/reference/futureSessionInfo.html"><code>futureSessionInfo()</code></a> was added to <strong>future</strong> 1.22.0. It outputs information useful for troubleshooting problems related to the future framework. It also runs some basic tests to validate that the current future backend works as expected. If you have problems getting futures to work on your machine, please run this function before reporting issues at <a href="https://github.com/HenrikBengtsson/future/discussions">Future Discussions</a>. 
Here&rsquo;s an example:</p> <pre><code class="language-r">&gt; library(future)
&gt; plan(multisession, workers = 2)
&gt; futureSessionInfo()
*** Package versions
future 1.23.0, parallelly 1.30.0, parallel 4.1.2, globals 0.14.0, listenv 0.8.0

*** Allocations
availableCores():
system  nproc 
     8      8 

availableWorkers():
$system
[1] &quot;localhost&quot; &quot;localhost&quot; &quot;localhost&quot;
[4] &quot;localhost&quot; &quot;localhost&quot; &quot;localhost&quot;
[7] &quot;localhost&quot; &quot;localhost&quot;

*** Settings
- future.plan=&lt;not set&gt;
- future.fork.multithreading.enable=&lt;not set&gt;
- future.globals.maxSize=&lt;not set&gt;
- future.globals.onReference=&lt;not set&gt;
- future.resolve.recursive=&lt;not set&gt;
- future.rng.onMisuse='warning'
- future.wait.timeout=&lt;not set&gt;
- future.wait.interval=&lt;not set&gt;
- future.wait.alpha=&lt;not set&gt;
- future.startup.script=&lt;not set&gt;

*** Backends
Number of workers: 2
List of future strategies:
1. multisession:
   - args: function (..., workers = 2, envir = parent.frame())
   - tweaked: TRUE
   - call: plan(multisession, workers = 2)

*** Basic tests
  worker   pid     r sysname           release
1      1 19291 4.1.2   Linux 5.4.0-91-generic
2      2 19290 4.1.2   Linux 5.4.0-91-generic
                                               version
1 #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021
2 #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021
   nodename machine login  user effective_user
1 my-laptop  x86_64 alice alice          alice
2 my-laptop  x86_64 alice alice          alice
Number of unique PIDs: 2 (as expected)
</code></pre> <h3 id="working-around-utf-8-escaping-on-ms-windows">Working around UTF-8 escaping on MS Windows</h3> <p>Because of limitations in R itself, UTF-8 symbols outputted on MS Windows parallel workers would be <a href="https://github.com/HenrikBengtsson/future/issues/473">relayed as escaped symbols</a> when using futures.
Now, the future framework, and, more specifically, <a href="https://future.futureverse.org/reference/value.html"><code>value()</code></a>, attempts to recover such MS Windows output to UTF-8 before outputting it.</p> <p>For example, in <strong>future</strong> (&lt; 1.23.0) you would get the following:</p> <pre><code class="language-r">f &lt;- future({ cat(&quot;\u2713 Everything is OK&quot;) ; 42 })
v &lt;- value(f)
#&gt; &lt;U+2713&gt; Everything is OK
</code></pre> <p>when, and only when, those futures are resolved on an MS Windows machine. In <strong>future</strong> (&gt;= 1.23.0), we work around this problem by looking for <code>&lt;U+NNNN&gt;</code>-like patterns in the output and decoding them as UTF-8 symbols;</p> <pre><code class="language-r">f &lt;- future({ cat(&quot;\u2713 Everything is OK&quot;) ; 42 })
v &lt;- value(f)
#&gt; ✓ Everything is OK
</code></pre> <p><em>Comment</em>: From <a href="https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html">R 4.2.0, R will have native support for UTF-8 also on MS Windows</a>. More testing and validation is needed to confirm this will work out of the box in R (&gt;= 4.2.0) when running R in the terminal, in the R GUI, in the RStudio Console, and so on. If so, <strong>future</strong> will be updated to only apply this workaround for R (&lt; 4.2.0).</p> <h3 id="harmonization-of-future-futureassign-and-futurecall">Harmonization of future(), futureAssign(), and futureCall()</h3> <p>Prior to <strong>future</strong> 1.22.0, argument <code>seed</code> for <a href="https://future.futureverse.org/reference/future.html"><code>futureAssign()</code></a> and <a href="https://future.futureverse.org/reference/future.html"><code>futureCall()</code></a> defaulted to <code>TRUE</code>, whereas it defaulted to <code>FALSE</code> for <a href="https://future.futureverse.org/reference/future.html"><code>future()</code></a>. This was an oversight.
In <strong>future</strong> (&gt;= 1.22.0), <code>seed = FALSE</code> is the default for all these functions.</p> <h3 id="protecting-against-non-exportable-results">Protecting against non-exportable results</h3> <p>Analogously to how globals may be scanned for <a href="https://future.futureverse.org/articles/future-4-non-exportable-objects.html">&ldquo;non-exportable&rdquo; objects</a> when option <code>future.globals.onReference</code> is set to <code>&quot;error&quot;</code> or <code>&quot;warning&quot;</code>, <code>value()</code> will now check for similar problems in the value returned from parallel workers. For example, in <strong>future</strong> (&lt; 1.23.0) we would get:</p> <pre><code class="language-r">library(future)
plan(multisession, workers = 2)
options(future.globals.onReference = &quot;error&quot;)
f &lt;- future(xml2::read_xml(&quot;&lt;body&gt;&lt;/body&gt;&quot;))
v &lt;- value(f)
print(v)
#&gt; Error in doc_type(x) : external pointer is not valid
</code></pre> <p>whereas in <strong>future</strong> (&gt;= 1.23.0) we get:</p> <pre><code class="language-r">library(future)
plan(multisession, workers = 2)
options(future.globals.onReference = &quot;error&quot;)
f &lt;- future(xml2::read_xml(&quot;&lt;body&gt;&lt;/body&gt;&quot;))
v &lt;- value(f)
#&gt; Error: Detected a non-exportable reference ('externalptr') in the value
#&gt; (of class 'xml_document') of the resolved future
</code></pre> <h3 id="finer-control-of-what-type-of-conditions-are-captured-and-replayed">Finer control of what type of conditions are captured and replayed</h3> <p>Besides specifying which condition classes are to be captured and relayed, in <strong>future</strong> (&gt;= 1.22.0), it is also possible to specify condition classes to be ignored. For example,</p> <pre><code class="language-r">f &lt;- future(..., conditions = structure(&quot;condition&quot;, exclude = &quot;message&quot;))
</code></pre> <p>captures all conditions but message conditions.
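</p> <p>A small sketch of what that means in practice (the signaled messages are made up for illustration):</p> <pre><code class="language-r">library(future)
plan(multisession, workers = 2)
f &lt;- future({
  message(&quot;details for the log&quot;)  ## a 'message' condition - excluded, not captured
  warning(&quot;something looks odd&quot;)  ## a 'warning' condition - captured and relayed
  42
}, conditions = structure(&quot;condition&quot;, exclude = &quot;message&quot;))
v &lt;- value(f)  ## relays the warning, but not the message
</code></pre> <p>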
The default is <code>conditions = &quot;condition&quot;</code>, which captures and relays any type of condition.</p> <h2 id="performance-improvements">Performance improvements</h2> <p>I always prioritize correctness over performance in the <strong>future</strong> framework. So, whenever optimizing for performance, one always has to make sure not to break things somewhere else. Thankfully, there are now <a href="https://www.futureverse.org/statistics.html">over 200 reverse-dependency packages on CRAN</a> and Bioconductor that I can validate against. They provide another comfy cushion against mistakes, beyond what we already get from package unit tests and the <strong><a href="https://future.tests.futureverse.org">future.tests</a></strong> test suite. Below are some of the recent performance improvements.</p> <h3 id="less-latency-for-multicore-multisession-and-cluster-futures">Less latency for multicore, multisession, and cluster futures</h3> <p>In <strong>future</strong> 1.22.0, the default timeout of <a href="https://future.futureverse.org/reference/resolved.html"><code>resolved()</code></a> was decreased from 0.20 seconds to 0.01 seconds for multicore, multisession, and cluster futures. This means that less time is now spent on checking for results from these future backends when they are not yet available. After making sure it is safe to do so, we might decrease the default timeout to zero in a later release.</p> <h3 id="less-overhead-when-initiating-futures">Less overhead when initiating futures</h3> <p>The overhead of initiating futures was significantly reduced in <strong>future</strong> 1.22.0. For example, the round-trip time for <code>value(future(NULL))</code> is about twice as fast for sequential, cluster, and multisession futures.
For multicore futures the round-trip speedup is about 20%.</p> <p>The speedup comes from pre-compiling the future&rsquo;s R expression into an R expression template, which can then quickly be re-compiled into the final expression to be evaluated. Specifically, instead of calling <code>expr &lt;- base::bquote(tmpl)</code> for each future, which is computationally expensive, we take a two-step approach where we first call <code>tmpl_cmp &lt;- bquote_compile(tmpl)</code> once per session, such that we only have to call the much faster <code>expr &lt;- bquote_apply(tmpl_cmp)</code> for each future.(*) This new pre-compile approach speeds up the construction of the final future expression from the original future expression ~10 times.</p> <p>(*) These are <a href="https://github.com/HenrikBengtsson/future/blob/1064c4ec2c37a70fa8fff8887d0030a5f03c46da/R/000.bquote.R#L56-L131">internal functions</a> of the <strong>future</strong> package.</p> <h3 id="environment-variables-are-only-used-when-package-is-loaded">Environment variables are only used when package is loaded</h3> <p>All R <a href="https://future.futureverse.org/reference/future.options.html">options specific to the future framework</a> have defaults that fall back to corresponding environment variables. For example, the default for option <code>future.rng.onMisuse</code> can be set by environment variable <code>R_FUTURE_RNG_ONMISUSE</code>.</p> <p>The purpose of the environment variables is to make it possible to configure the future framework before launching R, e.g. in shell startup scripts, or in shell scripts submitted to job schedulers in high-performance compute (HPC) environments.
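</p> <p>For example, a shell script submitted to a job scheduler might export the variable before launching R; a minimal sketch, using the option and environment variable named above:</p> <pre><code class="language-r">## In the shell, before launching R (e.g. in an HPC job script):
##   export R_FUTURE_RNG_ONMISUSE=error
## When the future package is loaded in that R session, this
## corresponds to setting:
options(future.rng.onMisuse = &quot;error&quot;)
</code></pre> <p>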
When R is already running, the best practice is to use the R options to configure the future framework.</p> <p>In order to avoid the overhead from querying and parsing environment variables at runtime, but also to clarify how and when environment variables should be set, starting with <strong>future</strong> 1.22.0, <em><code>R_FUTURE_*</code> environment variables are only used when the <strong>future</strong> package is loaded</em>. Then, if set, they are used for setting the corresponding <code>future.*</code> option.</p> <h2 id="cleanups-to-make-room-for-new-features">Cleanups to make room for new features</h2> <p>The <code>values()</code> function has been defunct since <strong>future</strong> 1.23.0, in favor of <code>value()</code>. All CRAN and Bioconductor packages that depend on <strong>future</strong> were updated long ago. If you get the error:</p> <pre><code class="language-r">Error: values() is defunct in future (&gt;= 1.20.0). Use value() instead. </code></pre> <p>make sure to update your R packages. A few users of <strong><a href="https://furrr.futureverse.org">furrr</a></strong> have run into this error - updating to <strong>furrr</strong> (&gt;= 0.2.0) solved the problem.</p> <p>Continuing, to further harmonize how developers use the Future API, we are moving away from odds-and-ends features, especially the ones that are holding us back from adding new features. The goal is to ensure that more code using futures can truly run anywhere, not just on the particular parallel backend that the developer works with.</p> <p>In this spirit, we are slowly moving away from &ldquo;persistent&rdquo; workers. For example, in <strong>future</strong> (&gt;= 1.23.0), <code>plan(multisession, persistent = TRUE)</code> is no longer supported and will produce an error if attempted.
The same will eventually happen for <code>plan(cluster, persistent = TRUE)</code>, but not until we have <a href="https://www.futureverse.org/roadmap.html">support for caching &ldquo;sticky&rdquo; globals</a>, which is the main use case for persistent workers.</p> <p>Another example is transparent futures, which are being prepared for deprecation in <strong>future</strong> (&gt;= 1.23.0). If used, <code>plan(transparent)</code> produces a warning, which will soon be upgraded to a formal deprecation warning. In a later release, it will produce an error. Transparent futures were added in the early days in order to simplify troubleshooting of futures. A better approach these days is to use <code>plan(sequential, split = TRUE)</code>, which makes interactive troubleshooting tools such as <code>browser()</code> and <code>debug()</code> work.</p> <h2 id="significant-changes-preparing-for-the-future">Significant changes preparing for the future</h2> <p>Prior to <strong>future</strong> 1.22.0, lazy futures were assigned to the currently set future backend immediately when created. For example, if we do:</p> <pre><code class="language-r">library(future) plan(multisession, workers = 2) f &lt;- future(42, lazy = TRUE) </code></pre> <p>with <strong>future</strong> (&lt; 1.22.0), we would get:</p> <pre><code class="language-r">class(f) #&gt; [1] &quot;MultisessionFuture&quot; &quot;ClusterFuture&quot; &quot;MultiprocessFuture&quot; #&gt; [4] &quot;Future&quot; &quot;environment&quot; </code></pre> <p>Starting with <strong>future</strong> 1.22.0, lazy futures remain generic futures until they are launched, which means they are not assigned a backend class until they have to be. Now, the above example gives:</p> <pre><code class="language-r">class(f) #&gt; [1] &quot;Future&quot; &quot;environment&quot; </code></pre> <p>This change opens up the door for storing futures themselves to file and sending them elsewhere.
More precisely, this means we can start working towards a <em>queue of futures</em>, which then can be processed on whatever compute resources we have access to at the moment, e.g. some futures might be resolved on the local computer, others on machines on a local cluster, and when those fill up, we can burst out to cloud resources, or maybe process them via a community-driven peer-to-peer cluster.</p> <h2 id="roadmap-ahead">Roadmap ahead</h2> <p>There are lots of new features on the roadmap related to the above and other things. I hope to make progress on several of them during 2022. If you&rsquo;re curious about what&rsquo;s coming up, see the <a href="https://www.futureverse.org/roadmap.html">Project Roadmap</a>, stay tuned on this blog (<a href="https://www.jottr.org/index.xml">feed</a>), or follow <a href="https://twitter.com/henrikbengtsson/">me on Twitter</a>.</p> <p>Happy futuring!</p> <p>Henrik</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>parallelly 1.29.0: New Skills and Less Communication Latency on Linux</title>
<link>https://www.jottr.org/2021/11/22/parallelly-1.29.0/</link>
<pubDate>Mon, 22 Nov 2021 21:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2021/11/22/parallelly-1.29.0/</guid>
<description> <div style="padding: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/parallelly-logo.png" alt="The 'parallelly' hexlogo"/> </center> </div> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.29.0 is on CRAN. The <strong>parallelly</strong> package enhances the <strong>parallel</strong> package - our built-in R package for parallel processing - by improving on existing features and by adding new ones. Somewhat simplified, <strong>parallelly</strong> provides the things that you would otherwise expect to find in the <strong>parallel</strong> package. The <strong><a href="https://future.futureverse.org">future</a></strong> package relies on the <strong>parallelly</strong> package internally for local and remote parallelization.</p> <p>Since my <a href="https://www.jottr.org/2021/06/10/parallelly-1.26.0/">previous post on <strong>parallelly</strong></a> five months ago, the <strong>parallelly</strong> package has had some bugs fixed and has gained a few new features:</p> <ul> <li><p>new <code>isForkedChild()</code> to test if R runs in a forked process,</p></li> <li><p>new <code>isNodeAlive()</code> to test if one or more cluster-node processes are running,</p></li> <li><p><code>availableCores()</code> now also respects Bioconductor settings,</p></li> <li><p><code>makeClusterPSOCK(..., rscript = &quot;*&quot;)</code> automatically expands to the proper Rscript executable,</p></li> <li><p><code>makeClusterPSOCK(…, rscript_envs = c(UNSET_ME = NA_character_))</code> unsets environment variables on cluster nodes, and</p></li> <li><p><code>makeClusterPSOCK()</code> sets up clusters with less communication latency on Unix.</p></li> </ul> <p>Below is a detailed description of these new features.</p> <h2 id="new-function-isforkedchild">New function isForkedChild()</h2> <p>If you run R on Unix or macOS, you can parallelize code using so-called <em>forked</em> parallel processing.
It is a very convenient way of parallelizing code, especially since forking is implemented at the core of the operating system and there is very little extra you have to do at the R level to get it to work. Compared with other parallelization solutions, forked processing often has less overhead, resulting in shorter turnaround times. To date, the most famous method for parallelizing using forks is <code>mclapply()</code> of the <strong>parallel</strong> package. For example,</p> <pre><code class="language-r">library(parallel) y &lt;- mclapply(X, some_slow_fcn, mc.cores = 4) </code></pre> <p>works just like <code>lapply(X, some_slow_fcn)</code> but performs the same tasks in parallel using four (4) CPU cores. MS Windows does not support <a href="https://en.wikipedia.org/wiki/Fork_(system_call)">forked processing</a>; any attempt to use <code>mclapply()</code> there will cause it to silently fall back to a sequential <code>lapply()</code> call.</p> <p>In the <strong>future</strong> ecosystem, you get forked parallelization with the <code>multicore</code> backend, e.g.</p> <pre><code class="language-r">library(future.apply) plan(multicore, workers = 4) y &lt;- future_lapply(X, some_slow_fcn) </code></pre> <p>Unfortunately, we cannot parallelize all types of code using forks. If we try, we might get an error, but in the worst case the R process crashes (segmentation fault). For example, some graphical user interfaces (GUIs) do not play well with forked processing, e.g. the RStudio Console, but also other GUIs. Multi-threaded parallelization has also been reported to cause problems when run within forked parallelization.
We sometimes talk about <em>non-fork-safe code</em>, in contrast to <em>fork-safe</em> code, to refer to code that risks crashing the software if run in forked processes.</p> <p>Here is what R-core developer Simon Urbanek, the author of <code>mclapply()</code>, wrote in the R-devel thread <a href="https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html">&lsquo;mclapply returns NULLs on MacOS when running GAM&rsquo;</a> on 2020-04-28:</p> <blockquote> <p>Do NOT use <code>mcparallel()</code> in packages except as a non-default option that user can set for the reasons &hellip; explained [above]. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don&rsquo;t know the resource available so only the user can tell you when it is safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.</p> </blockquote> <p>It is not always obvious whether a certain function call in R is fork safe, especially not if we haven&rsquo;t written all the code ourselves. Because of this, it is more a matter of trial and error to see if it works. However, when we know that a certain function call is <em>not</em> fork safe, it is useful to protect against using it in forked parallelization. In <strong>parallelly</strong> (&gt;= 1.28.0), we can use the function <a href="https://parallelly.futureverse.org/reference/isForkedChild.html"><code>isForkedChild()</code></a> to test whether or not R runs in a forked child process. For example, the author of <code>some_slow_fcn()</code> above could protect against mistakes by:</p> <pre><code class="language-r">some_slow_fcn &lt;- function(x) { if (parallelly::isForkedChild()) { stop(&quot;This function must not be used in *forked* parallel processing&quot;) } y &lt;- non_fork_safe_code(x) ... 
} </code></pre> <p>or, if they have an alternative, less preferred, <em>fork-safe</em> implementation, they could run that conditionally on R being executed in a forked child process:</p> <pre><code class="language-r">some_slow_fcn &lt;- function(x) { if (parallelly::isForkedChild()) { y &lt;- fork_safe_code(x) } else { y &lt;- alternative_code(x) } ... } </code></pre> <h2 id="new-function-isnodealive">New function isNodeAlive()</h2> <p>The new function <a href="https://parallelly.futureverse.org/reference/isNodeAlive.html"><code>isNodeAlive()</code></a> checks whether one or more nodes are alive. For instance,</p> <pre><code class="language-r">library(parallelly) cl &lt;- makeClusterPSOCK(3) isNodeAlive(cl) #&gt; [1] TRUE TRUE TRUE </code></pre> <p>Imagine the second parallel worker crashes, which we can emulate with</p> <pre><code class="language-r">clusterEvalQ(cl[2], tools::pskill(Sys.getpid())) #&gt; Error in unserialize(node$con) : error reading from connection </code></pre> <p>then we get:</p> <pre><code class="language-r">isNodeAlive(cl) #&gt; [1] TRUE FALSE TRUE </code></pre> <p>The <code>isNodeAlive()</code> function works by querying the operating system to see if those processes are still running, based on their process IDs (PIDs) recorded by <code>makeClusterPSOCK()</code> when launched. If the workers&rsquo; PIDs are unknown, then <code>NA</code> is returned instead. For instance, contrary to <code>parallelly::makeClusterPSOCK()</code>, <code>parallel::makeCluster()</code> does not record the PIDs and we get missing values as the result:</p> <pre><code class="language-r">library(parallelly) cl &lt;- parallel::makeCluster(3) isNodeAlive(cl) #&gt; [1] NA NA NA </code></pre> <p>Similarly, if one of the parallel workers runs on a remote machine, we cannot easily query the remote machine for whether the PID exists or not. In such cases, <code>NA</code> is returned.
Maybe we will also be able to query remote machines in a future version of <strong>parallelly</strong>, but for now, it is not possible.</p> <h2 id="availablecores-respects-bioconductor-settings">availableCores() respects Bioconductor settings</h2> <p>Function <a href="https://parallelly.futureverse.org/reference/availableCores.html"><code>availableCores()</code></a> queries the hardware and the system environment to find out how many CPU cores it may run on. It does this by checking system settings, environment variables, and R options that may be set by the end-user, the system administrator, the parent R process, the operating system, a job scheduler, and so on. When you use <code>availableCores()</code>, you don&rsquo;t have to worry about using more CPU resources than you were assigned, which helps guarantee that your code runs nicely together with everything else on the same machine.</p> <p>In <strong>parallelly</strong> (&gt;= 1.29.0), <code>availableCores()</code> is now also agile to Bioconductor-specific settings. For example, <strong><a href="https://bioconductor.org/packages/BiocParallel">BiocParallel</a></strong> 1.27.2 introduced environment variable <code>BIOCPARALLEL_WORKER_NUMBER</code>, which sets the default number of parallel workers when using <strong>BiocParallel</strong> for parallelization. Similarly, on Bioconductor check servers, they set environment variable <code>BBS_HOME</code>, which <strong>BiocParallel</strong> uses to limit the number of cores to four (4).
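</p> <p>As a hypothetical illustration of the effect (the exact number returned depends on your machine and on any other settings that apply):</p> <pre><code class="language-r">## Pretend BiocParallel was told to use at most two workers
Sys.setenv(BIOCPARALLEL_WORKER_NUMBER = &quot;2&quot;)
## In parallelly (&gt;= 1.29.0), this is now capped at 2
parallelly::availableCores()
</code></pre> <p>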
Now <code>availableCores()</code> also reflects those settings, which, in turn, means that <strong>future</strong> settings like <code>plan(multisession)</code> will also automatically respect the Bioconductor settings.</p> <p>Function <a href="https://parallelly.futureverse.org/reference/availableWorkers.html"><code>availableWorkers()</code></a>, which relies on <code>availableCores()</code> as a fallback, is therefore also agile to these Bioconductor environment variables.</p> <!-- ## Improvements to makeClusterPSOCK() arguments 'rscript' and 'rscript_envs' Three improvements to [`makeClusterPSOCK()`] have been made: * A `*` value in argument `rscript` to `makeClusterPSOCK()` expands to the correct `Rscript` executable * Argument `rscript_envs` of `makeClusterPSOCK()` can be used to unset environment variables on the parallel workers * On Unix, the _communication latency_ between the main R session and the parallel workers is now much smaller when using `makeClusterPSOCK()` --> <h2 id="makeclusterpsock-rscript">makeClusterPSOCK(&hellip;, rscript = &ldquo;*&ldquo;)</h2> <p>Argument <code>rscript</code> of <code>makeClusterPSOCK()</code> can be used to control exactly which <code>Rscript</code> executable is used to launch the parallel workers, and also how that executable is launched. The default settings are often sufficient, but if you want to launch a worker, say, within a Linux container, you can do so by adjusting <code>rscript</code>. The help page for <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> has several examples of this. It may also be used for other setups.
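</p> <p>For instance, here is a rough sketch of how <code>rscript</code> could be used to launch a worker inside a Docker container; the image name here is only for illustration, and the <code>makeClusterPSOCK()</code> help page has maintained, tested variants of this:</p> <pre><code class="language-r">cl &lt;- parallelly::makeClusterPSOCK(
  1L,
  ## Launch the worker's Rscript inside a Linux container
  rscript = c(&quot;docker&quot;, &quot;run&quot;, &quot;--net=host&quot;, &quot;rocker/r-base&quot;, &quot;Rscript&quot;)
)
</code></pre> <p>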
For example, to launch two parallel workers on a remote Linux machine, such that their CPU priority is lower than that of other processes running on that machine, we can use (*):</p> <pre><code class="language-r">workers &lt;- rep(&quot;remote.example.org&quot;, times = 2) cl &lt;- makeClusterPSOCK(workers, rscript = c(&quot;nice&quot;, &quot;Rscript&quot;)) </code></pre> <p>This causes the two R workers to be launched using <code>nice Rscript ...</code>. The Unix command <code>nice</code> is what makes <code>Rscript</code> run with a lower CPU priority. By running at a lower priority, we decrease the risk that our parallel tasks have a negative impact on other software running on that machine, e.g. someone might use that machine for interactive work without us knowing. We can do the same thing on our local machine via:</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2L, rscript = c(&quot;nice&quot;, file.path(R.home(&quot;bin&quot;), &quot;Rscript&quot;))) </code></pre> <p>Here we specified the absolute path to <code>Rscript</code> to make sure we run the same version of R as the main R session, and not another <code>Rscript</code> that may be on the system <code>PATH</code>.</p> <p>Starting with <strong>parallelly</strong> 1.29.0, we can replace the Rscript specification in the above two examples with <code>&quot;*&quot;</code>, as in:</p> <pre><code class="language-r">workers &lt;- rep(&quot;remote-machine.example.org&quot;, times = 2L) cl &lt;- makeClusterPSOCK(workers, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>and</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2L, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>When used, <code>makeClusterPSOCK()</code> will expand <code>&quot;*&quot;</code> to the proper Rscript specification, depending on whether the worker runs remotely or not.
To further emphasize the convenience of this, consider:</p> <pre><code class="language-r">workers &lt;- c(&quot;localhost&quot;, &quot;remote-machine.example.org&quot;) cl &lt;- makeClusterPSOCK(workers, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>which launches two parallel workers - one running on the local machine and one running on the remote machine.</p> <p>Note that, when using <strong><a href="https://future.futureverse.org">future</a></strong>, we can pass <code>rscript</code> to <code>plan(multisession)</code> and <code>plan(cluster)</code> to achieve the same thing, as in</p> <pre><code class="language-r">plan(cluster, workers = workers, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>and</p> <pre><code class="language-r">plan(multisession, workers = 2L, rscript = c(&quot;nice&quot;, &quot;*&quot;)) </code></pre> <p>(*) Here we use <code>nice</code> as an example, because it is a simple way to illustrate how <code>rscript</code> can be used. As a matter of fact, there is already an <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html">argument <code>renice</code></a>, which we can use to achieve the same thing without using the <code>rscript</code> argument.</p> <h2 id="makeclusterpsock-rscript-envs-c-unset-me-na-character">makeClusterPSOCK(&hellip;, rscript_envs = c(UNSET_ME = NA_character_))</h2> <p>Argument <code>rscript_envs</code> of <code>makeClusterPSOCK()</code> can be used to set environment variables on cluster nodes, or copy existing ones from the main R session to the cluster nodes. For example,</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2, rscript_envs = c(PI = &quot;3.14&quot;, &quot;MY_EMAIL&quot;)) </code></pre> <p>will, during startup, set environment variable <code>PI</code> on each of the two cluster nodes to have value <code>3.14</code>.
It will also set <code>MY_EMAIL</code> on them to the value of <code>Sys.getenv(&quot;MY_EMAIL&quot;)</code> in the current R session.</p> <p>Starting with <strong>parallelly</strong> 1.29.0, we can now also <em>unset</em> environment variables, in case they are set on the cluster nodes. Any named element with a missing value causes the corresponding environment variable to be unset, e.g.</p> <pre><code class="language-r">cl &lt;- makeClusterPSOCK(2, rscript_envs = c(&quot;_R_CHECK_LENGTH_1_CONDITION_&quot; = NA_character_)) </code></pre> <p>This results in passing <code>-e 'Sys.unsetenv(&quot;_R_CHECK_LENGTH_1_CONDITION_&quot;)'</code> to <code>Rscript</code> when launching each worker. Note that the variable name has to be quoted here, because it is not a syntactically valid R name.</p> <h2 id="makeclusterpsock-sets-up-clusters-with-less-communication-latency-on-unix">makeClusterPSOCK() sets up clusters with less communication latency on Unix</h2> <p>It turns out that, in R <em>on Unix</em>, there is <a href="https://stat.ethz.ch/pipermail/r-devel/2020-November/080060.html">a significant <em>latency</em> in the communication between the parallel workers and the main R session</a> (**). Starting in R (&gt;= 4.1.0), it is possible to decrease this latency by setting a dedicated R option <em>on each of the workers</em>, e.g.</p> <pre><code class="language-r">rscript_args &lt;- c(&quot;-e&quot;, shQuote(&quot;options(socketOptions = 'no-delay')&quot;)) cl &lt;- parallel::makeCluster(workers, rscript_args = rscript_args) </code></pre> <p>This is quite verbose, so I&rsquo;ve made this the new default in <strong>parallelly</strong> (&gt;= 1.29.0), i.e. you can keep using:</p> <pre><code class="language-r">cl &lt;- parallelly::makeClusterPSOCK(workers) </code></pre> <p>to benefit from the above.
See the help for <a href="https://parallelly.futureverse.org/reference/makeClusterPSOCK.html"><code>makeClusterPSOCK()</code></a> for options on how to change this new default.</p> <p>Here is an example that illustrates the difference in latency with and without the new settings:</p> <pre><code class="language-r">cl_parallel &lt;- parallel::makeCluster(1) cl_parallelly &lt;- parallelly::makeClusterPSOCK(1) res &lt;- bench::mark(iterations = 1000L, parallel = parallel::clusterEvalQ(cl_parallel, iris), parallelly = parallel::clusterEvalQ(cl_parallelly, iris) ) res[, c(1:4,9)] #&gt; # A tibble: 2 × 5 #&gt; expression min median `itr/sec` total_time #&gt; &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:tm&gt; #&gt; 1 parallel 277µs 44ms 22.5 44.4s #&gt; 2 parallelly 380µs 582µs 1670. 598.3ms </code></pre> <p>From this, we see that the total latency overhead for 1,000 parallel tasks went from 44 seconds down to 0.60 seconds, which is ~75 times less on average. Does this mean your parallel code will run faster? No, it is just the communication <em>latency</em> that has decreased. But why waste time <em>waiting</em> for your results when you don&rsquo;t have to? This is why I changed the defaults in <strong>parallelly</strong>. It also brings the experience on Unix on par with that on MS Windows and macOS.</p> <p>Note that the relatively high latency affects only Unix; MS Windows and macOS do not suffer from this extra latency. For example, on MS Windows 10 running in a virtual machine on the same Linux computer as above, I get:</p> <pre><code class="language-r">#&gt; # A tibble: 2 × 5 #&gt; expression min median `itr/sec` total_time #&gt; &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:tm&gt; #&gt; 1 parallel 191us 314us 2993. 333ms #&gt; 2 parallelly 164us 311us 3227. 
310ms </code></pre> <p>If you&rsquo;re using <strong><a href="https://future.futureverse.org">future</a></strong> with <code>plan(multisession)</code> or <code>plan(cluster)</code>, you&rsquo;re already benefitting from the performance gain, because those rely on <code>parallelly::makeClusterPSOCK()</code> internally.</p> <!-- avoid a quite large latency in the communication between parallel workers and the main R session ```r gg <- plot(res) + labs(x = element_blank()) + theme(text = element_text(size = 20)) + theme(legend.position = "none") ggsave("parallelly_faster_turnarounds-figure.png", plot = gg, width = 7.0, height = 5.0) ``` <center> <img src="https://www.jottr.org/post/parallelly_faster_turnarounds-figure.png" alt="..." style="width: 65%;"/><br/> </center> <small><em>Figure: ...<br/></em></small> --> <p>(**) <em>Technical details</em>: Option <code>socketOptions</code> sets the default value of argument <code>options</code> of <code>base::socketConnection()</code>. The default is <code>NULL</code>, but if we set it to <code>&quot;no-delay&quot;</code>, the created TCP socket connections are configured to use the <code>TCP_NODELAY</code> flag. With <code>TCP_NODELAY</code>, a TCP connection no longer uses the so-called <a href="https://www.wikipedia.org/wiki/Nagle%27s_algorithm">Nagle&rsquo;s algorithm</a>, which otherwise is used to reduce the number of TCP packets that need to be sent over the network, by making sure TCP fills up each packet before sending it off. With the new <code>&quot;no-delay&quot;</code>, this buffering is disabled and packets are sent as soon as data come in.
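</p> <p>If the old behavior is ever needed, the new default can be reverted per cluster; a sketch, assuming the <code>socketOptions</code> argument documented on the <code>makeClusterPSOCK()</code> help page for <strong>parallelly</strong> (&gt;= 1.29.0):</p> <pre><code class="language-r">## Opt out of the &quot;no-delay&quot; default when setting up the workers;
## see ?parallelly::makeClusterPSOCK for the accepted values
cl &lt;- parallelly::makeClusterPSOCK(2, socketOptions = NULL)
</code></pre> <p>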
Credits for this improvement should go to Jeff Keller, who identified and <a href="https://stat.ethz.ch/pipermail/r-devel/2020-November/080060.html">reported the problem to R-devel</a>, to Iñaki Úcar who pitched in, and to Simon Urbanek, who implemented <a href="https://github.com/wch/r-source/commit/82369f73fc297981e64cac8c9a696d05116f0797">support for <code>socketConnection(..., options = &quot;no-delay&quot;)</code></a> for R 4.1.0.</p> <h2 id="bug-fixes">Bug fixes</h2> <p>Finally, the most important bug fixes since <strong>parallelly</strong> 1.26.0 are:</p> <ul> <li><p><code>availableCores()</code> would produce an error on Linux systems without <code>nproc</code> installed.</p></li> <li><p><code>makeClusterPSOCK()</code> failed with &ldquo;Error in freePort(port) : Unknown value on argument ‘port’: &lsquo;auto&rsquo;&rdquo; if environment variable <code>R_PARALLEL_PORT</code> was set to a port number.</p></li> <li><p>In R environments not supporting <code>setup_strategy = &quot;parallel&quot;</code>, <code>makeClusterPSOCK()</code> failed to fall back to <code>setup_strategy = &quot;sequential&quot;</code>.</p></li> </ul> <p>For all other bug fixes and updates, please see <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a>.</p> <!-- <center> <img src="https://www.jottr.org/post/parallelly_faster_turnarounds.png" alt="..." 
style="width: 65%;"/><br/> </center> <small><em>Figure: Our parallel results are now turned around much faster on Linux than before.<br/></em></small> --> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a>, <a href="https://future.apply.futureverse.org">pkgdown</a></li> </ul> <!-- nworkers <- 4L cl_parallel <- parallel::makeCluster(nworkers) cl_parallelly <- parallelly::makeClusterPSOCK(nworkers) plan(cluster, workers = cl_parallel) stats <- bench::mark(iterations = 100L, parallel = { f <- cluster(iris, workers = cl_parallel); value(f) }, parallelly = { f <- cluster(iris, workers = cl_parallelly); value(f) } ) plan(cluster, workers = cl_parallelly) stats2 <- bench::mark(iterations = 10L, parallelly = { f <- future(iris); value(f) } ) stats <- rbind(stats1, stats2) --> </description>
</item>
<item>
<title>matrixStats: Consistent Support for Name Attributes via GSoC Project</title>
<link>https://www.jottr.org/2021/08/23/matrixstats-gsoc-2021/</link>
<pubDate>Mon, 23 Aug 2021 00:10:00 +0200</pubDate>
<guid>https://www.jottr.org/2021/08/23/matrixstats-gsoc-2021/</guid>
<description> <p><em>Author: Angelina Panagopoulou, GSoC student developer, undergraduate in the Department of Informatics &amp; Telecommunications (DIT), University of Athens, Greece</em></p> <p><center> <img src="https://www.jottr.org/post/2048px-GSoC_logo.svg.png" alt="Google Summer of Code logo" style="width: 40%"/> <!-- Image source: https://commons.wikimedia.org/wiki/File:GSoC_logo.svg --> </center></p> <p>We are glad to announce recent CRAN releases of <strong><a href="https://cran.r-project.org/package=matrixStats">matrixStats</a></strong> with support for handling and returning name attributes. This feature was added to make <strong>matrixStats</strong> functions handle names in the same manner as the corresponding base R functions. In particular, the behavior of <strong>matrixStats</strong> functions is now the same as that of the <code>apply()</code> function in R, resolving the previous lack of, or inconsistent, handling of row and column names. The added support for <code>names</code> and <code>dimnames</code> attributes has already reached a wide, active user base, while at the same time we expect to attract users and developers who previously lacked this feature and therefore could not use the <strong>matrixStats</strong> package for their needs.</p> <p>The <strong>matrixStats</strong> package provides high-performing functions operating on rows and columns of matrices. These functions are optimized such that both memory use and processing time are minimized. In order to minimize the overhead of handling name attributes, the naming support is implemented in native (C) code, where possible. In <strong>matrixStats</strong> (&gt;= 0.60.0), handling of row and column names is optional. This is done to allow for maximum performance where needed. In addition, in order to avoid breaking some scripts and packages that rely on the previous semi-inconsistent behavior of functions, special care has been taken to ensure backward compatibility by default for the time being.
We have validated the correctness of these newly implemented features by extending existing package tests to check name attributes, measuring the code coverage with the <strong><a href="https://cran.r-project.org/package=covr">covr</a></strong> package, and checking all 358 reverse-dependency packages using the <strong><a href="https://github.com/r-lib/revdepcheck">revdepcheck</a></strong> package.</p> <h2 id="example">Example</h2> <p><code>useNames</code> is an argument added to each of the <strong>matrixStats</strong> functions that gained naming support. It takes values <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. For backward-compatibility reasons, the default value of <code>useNames</code> is <code>NA</code>, meaning the default behavior from earlier versions of <strong>matrixStats</strong> is preserved. If <code>TRUE</code>, the <code>names</code> or <code>dimnames</code> attribute of the result is set; otherwise, if <code>FALSE</code>, the result does not have name attributes set.
For example, consider the following 5-by-3 matrix with row and column names:</p> <pre><code class="language-r">&gt; x &lt;- matrix(rnorm(5 * 3), nrow = 5, ncol = 3, dimnames = list(letters[1:5], LETTERS[1:3])) &gt; x A B C a 0.30292612 1.3825644 -0.2125219 b 0.15812229 2.7719647 1.6237263 c -0.09881700 -0.6468119 -0.6481911 d 0.38520941 -0.8466505 -0.4779964 e -0.01599926 -0.8907434 0.6334347 </code></pre> <p>If we use the base R method to calculate row medians, we see that the names attribute of the result reflects the row names of the input matrix:</p> <pre><code class="language-r">&gt; library(stats) &gt; apply(x, MARGIN = 1, FUN = median) a b c d e 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <p>If we use the <strong>matrixStats</strong> function <code>rowMedians()</code> with argument <code>useNames = TRUE</code> set, we get the same result as above:</p> <pre><code class="language-r">&gt; library(matrixStats) &gt; rowMedians(x, useNames = TRUE) a b c d e 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <p>If the name attributes are not of interest, we can use <code>useNames = FALSE</code> as in:</p> <pre><code class="language-r">&gt; rowMedians(x, useNames = FALSE) [1] 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <p>Doing so will also avoid the overhead, in time and memory, that otherwise comes from processing name attributes.</p> <p>If we don&rsquo;t specify <code>useNames</code> explicitly, the default is currently <code>useNames = NA</code>, which corresponds to the undocumented behavior that existed in <strong>matrixStats</strong> (&lt; 0.60.0). For several functions, that corresponds to setting <code>useNames = FALSE</code>; for other functions, it corresponds to setting <code>useNames = TRUE</code>; and for yet others, it might set, say, row names but not column names. 
In our example, the default happens to be the same as <code>useNames = FALSE</code>:</p> <pre><code class="language-r">&gt; rowMedians(x) # default as in matrixStats (&lt; 0.60.0) [1] 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926 </code></pre> <h2 id="future-plan">Future Plan</h2> <p>The future plan is to change the default value of <code>useNames</code> to <code>TRUE</code> or <code>FALSE</code> and eventually deprecate the backward-compatible behavior of <code>useNames = NA</code>. The default value of <code>useNames</code> is a design choice that requires further investigation. On the one hand, <code>useNames = TRUE</code> as the default is more convenient, but it creates additional performance and memory overhead when name attributes are not needed. On the other hand, making <code>FALSE</code> the default is appropriate for users and packages that rely on maximum performance. Whatever the new default becomes, we will make sure to work with package maintainers to minimize the risk of breaking existing code.</p> <h2 id="google-summer-of-code-2021">Google Summer of Code 2021</h2> <p>The project that introduced consistent support for name attributes in the <strong>matrixStats</strong> package is part of the <a href="https://github.com/rstats-gsoc/gsoc2021/wiki">R Project&rsquo;s participation in the Google Summer of Code 2021</a>.</p> <h3 id="links">Links</h3> <ul> <li><a href="https://github.com/rstats-gsoc/gsoc2021/wiki/matrixStats">The matrixStats GSoC 2021 project</a></li> <li><a href="https://cran.r-project.org/web/packages/matrixStats/index.html">matrixStats CRAN page</a></li> <li><a href="https://github.com/HenrikBengtsson/matrixStats">matrixStats GitHub page</a></li> <li><a href="https://github.com/HenrikBengtsson/matrixStats/commits?author=AngelPn">All commits during GSoC 2021 - author Angelina Panagopoulou</a></li> </ul> <h3 id="authors">Authors</h3> <ul> <li><a href="https://github.com/AngelPn">Angelina Panagopoulou</a> - 
<em>Student Developer</em>: I am an undergraduate in the Department of Informatics &amp; Telecommunications (DIT) at the University of Athens.</li> <li><a href="https://github.com/yaccos">Jakob Peder Pettersen</a> - <em>Mentor</em>: PhD Student, Department of Biotechnology and Food Science, Norwegian University of Science and Technology (NTNU). Jakob is part of the <a href="https://almaaslab.nt.ntnu.no/">Almaas Lab</a> and does research on genome-scale metabolic modeling and the behavior of microbial communities.</li> <li><a href="https://github.com/HenrikBengtsson/">Henrik Bengtsson</a> - <em>Co-Mentor</em>: Associate Professor, Department of Epidemiology and Biostatistics, University of California San Francisco (UCSF). He is the author and maintainer of a large number of CRAN and Bioconductor packages including <strong>matrixStats</strong>.</li> </ul> <h3 id="contributions">Contributions</h3> <p><strong>Phase I</strong></p> <ul> <li>All functions implement <code>useNames = NA/FALSE/TRUE</code> in R code, with tests written.</li> <li>Identified reverse-dependency packages that rely on <code>useNames = NA/FALSE/TRUE</code>.</li> <li>New release on CRAN with <code>useNames = NA</code>. This allows useRs and package maintainers to complain if anything breaks.</li> </ul> <p><strong>Phase II</strong></p> <ul> <li>Changed the C code structure such that <code>validateIndices()</code> always returns <code>R_xlen_t*</code>. Cleaned up unnecessary macros. <ul> <li>Outcome: shorter compile times, smaller compiled package/library, fewer exported symbols.</li> </ul></li> <li>Simplified the C API for <code>setNames()/setDimnames()</code>.</li> <li>Implemented <code>useNames = NA/FALSE/TRUE</code> in C code where possible, along with related cleanup work.</li> </ul> <h3 id="summary">Summary</h3> <p>We have completed all goals that we had initially planned. 
The release 0.60.0 of <strong>matrixStats</strong> on CRAN included the contributions of GSoC Phase I (&ldquo;implementation in R&rdquo;) and the new release, version 0.60.1, includes the contributions of Phase II (&ldquo;implementation in C&rdquo;).</p> <h3 id="experience">Experience</h3> <p>When I first heard about the Google Summer of Code, I really wanted to participate in it, but I thought that maybe I did not have the prerequisite knowledge yet. And it was true. It was difficult for me to find a project for which I had at least half of the mentioned prerequisites. So, I started looking for a project based on what I would be interested in doing during the summer. This project was an opportunity for me to learn a new programming language, the R language, and also to get in touch with advanced R. I am grateful for all the learning opportunities: programming in R, developing an R package, using a variety of tools that make developing R packages easier and more productive, working with GitHub tools, and interacting with the open source community. My mentors understood my lack of experience and really helped me achieve this. Participating in Google Summer of Code 2021 as a student developer is definitely worth it and I recommend every student who wants to contribute to open source to give it a try.</p> <h2 id="acknowledgements">Acknowledgements</h2> <ul> <li>The Google Summer of Code program for bringing more student developers into open source software development.</li> <li>Jakob Pettersen for being a great project leader and for providing guidance and willingness to impart his knowledge. Henrik Bengtsson, whose insight into and knowledge of the subject matter steered me through R package development. I am very grateful for the immense amount of useful discussions and valuable feedback.</li> <li>The members of the R community for building this welcoming community.</li> </ul> </description>
</item>
<item>
<title>progressr 0.8.0: RStudio's Progress Bar, Shiny Progress Updates, and Absolute Progress</title>
<link>https://www.jottr.org/2021/06/11/progressr-0.8.0/</link>
<pubDate>Fri, 11 Jun 2021 19:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2021/06/11/progressr-0.8.0/</guid>
<description> <p><strong><a href="https://progressr.futureverse.org">progressr</a></strong> 0.8.0 is on CRAN. It comes with some new features:</p> <ul> <li>A new &lsquo;rstudio&rsquo; handler that reports on progress via the RStudio job interface</li> <li><code>withProgressShiny()</code> now updates the <code>detail</code> part, instead of the <code>message</code> part</li> <li>In addition to signalling relative amounts of progress, it&rsquo;s now also possible to signal total amounts</li> </ul> <p>If you&rsquo;re curious what <strong>progressr</strong> is about, have a look at my <a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 presentation</a>.</p> <h2 id="progress-updates-in-rstudio-s-job-interface">Progress updates in RStudio&rsquo;s job interface</h2> <p>If you&rsquo;re using the RStudio Console, you can now report on progress in RStudio&rsquo;s job interface, as long as the progress originates from a <strong>progressr</strong>-signalling function. I&rsquo;ve shown an example of this in Figure&nbsp;1.</p> <figure style="margin-top: 3ex;"> <img src="https://www.jottr.org/post/progressr-rstudio.png" alt="A screenshot of the upper part of the RStudio Console panel. Below the title bar, which says 'R 4.1.0 ~/', there is a row with the text 'Console 05:50:51 PM' left of a green progress bar at 30% followed by the text '0:03'. Below these two lines are the R commands called so far, which are the same as in the below example. Following the commands is the output 'M: Added value 1', 'M: Added value 2', and 'M: Added value 3', from the first steps that have completed so far."/> <figcaption> Figure 1: The RStudio job interface can show progress bars and we can use it with <strong>progressr</strong>. The progress bar title - "Console 05:50:51 PM" - shows at what time the progress began. The '0:03' shows for how long the progress has been running - here 3 seconds. 
</figcaption> </figure> <p>To try this yourself, run the below in the RStudio Console.</p> <pre><code class="language-r">library(progressr) handlers(global = TRUE) handlers(&quot;rstudio&quot;) y &lt;- slow_sum(1:10) </code></pre> <p>The progress bar disappears when the calculation completes.</p> <h2 id="tweaks-to-withprogressshiny">Tweaks to withProgressShiny()</h2> <p>The <code>withProgressShiny()</code> function, which is a <strong>progressr</strong>-aware version of <code>withProgress()</code>, gained argument <code>inputs</code>. It defaults to <code>inputs = list(message = NULL, detail = &quot;message&quot;)</code>, which says that a progress message should update the &lsquo;detail&rsquo; part of the Shiny progress panel. For example,</p> <pre><code class="language-r">X &lt;- 1:10 withProgressShiny(message = &quot;Calculation in progress&quot;, detail = &quot;Starting ...&quot;, value = 0, { p &lt;- progressor(along = X) y &lt;- lapply(X, FUN=function(x) { Sys.sleep(0.25) p(sprintf(&quot;x=%d&quot;, x)) }) }) </code></pre> <p>will start out as in the left panel of Figure&nbsp;2, and, as soon as the first progress signal is received, the &lsquo;detail&rsquo; part is updated with <code>x=1</code> as shown in the right panel.</p> <figure style="margin-top: 3ex;"> <table style="margin: 1ex;"> <tr style="margin: 1ex;"> <td> <img src="https://www.jottr.org/post/withProgressShiny_A_x=0.png" alt="A Shiny progress bar panel with a progress bar at 0% on top, with 'Calculation in progress' written in a bold large font, with 'Starting ...' 
written in a normal small font below."/> </td> <td> <img src="https://www.jottr.org/post/withProgressShiny_A_x=1.png" alt="A Shiny progress bar panel with a progress bar at 10% on top, with 'Calculation in progress' written in a bold large font, with 'x=1' written in a normal small font below."/> </td> </tr> </table> <figcaption> Figure 2: A Shiny progress panel that starts out with the 'message' part displaying "Calculation in progress" and the 'detail' part displaying "Starting ..." (left), and whose 'detail' part is updated to "x=1" (right) as soon as the first progress update comes in. </figcaption> </figure> <p>Prior to this new release, the default behavior was to update the &lsquo;message&rsquo; part of the Shiny progress panel. To revert to the old behavior, set argument <code>inputs</code> as in:</p> <pre><code class="language-r">X &lt;- 1:10 withProgressShiny(message = &quot;Starting ...&quot;, detail = &quot;Calculation in progress&quot;, value = 0, { p &lt;- progressor(along = X) y &lt;- lapply(X, FUN=function(x) { Sys.sleep(0.25) p(sprintf(&quot;x=%d&quot;, x)) }) }, inputs = list(message = &quot;message&quot;, detail = NULL)) </code></pre> <p>This results in what you see in Figure&nbsp;3. I think that the new behavior, as shown in Figure&nbsp;2, looks better and makes more sense.</p> <figure style="margin-top: 3ex;"> <table style="margin: 1ex;"> <tr style="margin: 1ex;"> <td> <img src="https://www.jottr.org/post/withProgressShiny_B_x=0.png" alt="A Shiny progress bar panel with a progress bar at 0% on top, with 'Starting ...' 
written in a bold large font, with 'Calculation in progress' written to the right of it and wrapping onto the next row."/> </td> <td> <img src="https://www.jottr.org/post/withProgressShiny_B_x=1.png" alt="A Shiny progress bar panel with a progress bar at 10% on top, with 'x=1' written in a bold large font, with 'Calculation in progress' written to the right of it."/> </td> </tr> </table> <figcaption> Figure 3: A Shiny progress panel that starts out with the 'message' part displaying "Starting ..." and the 'detail' part displaying "Calculation in progress" (left), and whose 'message' part is updated to "x=1" (right) as soon as the first progress update comes in. </figcaption> </figure> <h2 id="update-to-a-specific-amount-of-total-progress">Update to a specific amount of total progress</h2> <p>When using <strong>progressr</strong>, we start out by creating a progressor function that we then call to signal progress. For example, if we do:</p> <pre><code class="language-r">my_slow_fun &lt;- function() { p &lt;- progressr::progressor(steps = 10) count &lt;- 0 for (i in 1:10) { count &lt;- count + 1 Sys.sleep(1) p(sprintf(&quot;count=%d&quot;, count)) } count } </code></pre> <p>each call to <code>p()</code> corresponds to <code>p(amount = 1)</code>, which signals that our function has moved <code>amount = 1</code> steps closer to the total amount <code>steps = 10</code>. We can take smaller or bigger steps by specifying another <code>amount</code>.</p> <p>In this new version, I&rsquo;ve introduced a new beta feature that allows us to signal progress that says where we are in <em>absolute terms</em>. 
With this, we can do things like:</p> <pre><code class="language-r">my_slow_fun &lt;- function() { p &lt;- progressr::progressor(steps = 10) count &lt;- 0 for (i in 1:5) { count &lt;- count + 1 Sys.sleep(1) if (runif(1) &lt; 0.5) break p(sprintf(&quot;count=%d&quot;, count)) } ## In case we broke out of the loop early, ## make sure to update to 5/10 progress p(step = 5) for (i in 1:5) { count &lt;- count + 1 Sys.sleep(1) p(sprintf(&quot;count=%d&quot;, count)) } count } </code></pre> <p>When calling <code>my_slow_fun()</code>, we might see progress being reported as:</p> <pre><code>- [------------------------------------------------] 0% \ [===&gt;-------------------------------------] 10% count=1 | [=======&gt;---------------------------------] 20% count=2 \ [===================&gt;---------------------] 50% count=3 ... </code></pre> <p>Note how it took a leap from 20% to 50% when <code>count == 2</code>. If we run it again, the move to 50% might happen at another iteration.</p> <h2 id="wrapping-up">Wrapping up</h2> <p>There are also a few bug fixes, which you can read about in <a href="https://progressr.futureverse.org/news/index.html">NEWS</a>. And as usual, all of this also works when you run in parallel using the <a href="https://futureverse.org">future framework</a>.</p> <p>Make progress!</p> <h2 id="links">Links</h2> <ul> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>parallelly 1.26.0: Fast, Concurrent Setup of Parallel Workers (Finally)</title>
<link>https://www.jottr.org/2021/06/10/parallelly-1.26.0/</link>
<pubDate>Thu, 10 Jun 2021 15:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2021/06/10/parallelly-1.26.0/</guid>
<description> <p><strong><a href="https://parallelly.futureverse.org">parallelly</a></strong> 1.26.0 is on CRAN. It comes with one major improvement and one new function:</p> <ul> <li><p>The setup of parallel workers is now <em>much faster</em>, which comes from using a concurrent, instead of sequential, setup strategy</p></li> <li><p>The new <code>freePort()</code> can be used to find a TCP port that is currently available</p></li> </ul> <h2 id="faster-setup-of-local-parallel-workers">Faster setup of local, parallel workers</h2> <p>In R 4.0.0, which was released in May 2020, <code>parallel::makeCluster(n)</code> gained the power of setting up the <code>n</code> local cluster nodes all at the same time, which greatly reduces the total setup time. Previously, the workers were set up one after the other, which involved a lot of waiting for each worker to get ready. You can read about the details in the <a href="https://developer.r-project.org/Blog/public/2020/03/17/socket-connections-update/index.html">Socket Connections Update</a> blog post by Tomas Kalibera and Luke Tierney on 2020-03-17.</p> <p><center> <img src="https://www.jottr.org/post/parallelly_faster_setup_of_cluster.png" alt="An X-Y graph with 'Total setup time (s)' on the vertical axis ranging from 0 to 55, and 'Number of cores' on the horizontal axis ranging from 0 to 128. Two smooth curves, which look very linear with intersection at the origin and unnoticeable variance, are drawn for the two setup strategies 'sequential' and 'parallel'. The 'sequential' line is much steeper." style="width: 65%;"/><br/> </center> <small><em>Figure: The total setup time versus the number of local cluster workers for the &ldquo;sequential&rdquo; setup strategy (red) and the new &ldquo;parallel&rdquo; strategy (turquoise). 
Data were collected on a 128-core Linux machine.<br/></em></small></p> <p>With this release of <strong>parallelly</strong>, <code>parallelly::makeClusterPSOCK(n)</code> gained the same skills. I benchmarked the new, default &ldquo;parallel&rdquo; setup strategy against the previous &ldquo;sequential&rdquo; strategy on a CentOS 7 Linux machine with 128 CPU cores and 512 GiB RAM while the machine was idle. I ran these benchmarks five times, which are summarized as smooth curves in the above figure. The variance between the replicate runs is tiny and the smooth curves appear almost linear. Assuming a linear relationship between setup time and number of cluster workers, a linear fit gives a speedup of approximately 50 times on this machine. It took 52 seconds to set up 122 (sic!) workers when using the &ldquo;sequential&rdquo; approach, whereas it took only 1.1 seconds with the &ldquo;parallel&rdquo; approach. Not surprisingly, rerunning these benchmarks with <code>parallel::makePSOCKcluster()</code> instead gives nearly identical results.</p> <p>Importantly, the &ldquo;parallel&rdquo; setup strategy, which is the new default, can only be used when setting up parallel workers running on the local machine. 
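If you want to reproduce a smaller version of this comparison on your own machine, here is a hedged sketch. It assumes that `setup_strategy` can be passed to `makeClusterPSOCK()` (it is forwarded to the node-setup function), and the timings will of course vary by machine:

```r
library(parallelly)

## Time the old "sequential" setup strategy for eight local workers
t_seq <- system.time({
  cl <- makeClusterPSOCK(8, setup_strategy = "sequential")
})[["elapsed"]]
parallel::stopCluster(cl)

## Time the new "parallel" setup strategy (the default for local workers)
t_par <- system.time({
  cl <- makeClusterPSOCK(8, setup_strategy = "parallel")
})[["elapsed"]]
parallel::stopCluster(cl)

c(sequential = t_seq, parallel = t_par)
```

On a machine with enough CPU cores, the "parallel" timing should come out noticeably smaller, in line with the benchmark above.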
When setting up workers on external or remote machines, the &ldquo;sequential&rdquo; setup strategy will still be used.</p> <p>If you&rsquo;re using <strong><a href="https://future.futureverse.org">future</a></strong> and use</p> <pre><code class="language-r">plan(multisession) </code></pre> <p>you&rsquo;ll immediately benefit from this performance gain, because it relies on <code>parallelly::makeClusterPSOCK()</code> internally.</p> <p>All credit for this improvement in <strong>parallelly</strong> and <code>parallelly::makeClusterPSOCK()</code> should go to Tomas Kalibera and Luke Tierney, who implemented support for this in R 4.0.0.</p> <p><em>Edit 2021-06-11 and 2021-07-01</em>: There&rsquo;s a bug in R (&gt;= 4.0.0 &amp;&amp; &lt;= 4.1.0) causing the new <code>setup_strategy = &quot;parallel&quot;</code> to fail in the RStudio Console on some systems. If you&rsquo;re running the <em>RStudio Console</em> and get &ldquo;Error in makeClusterPSOCK(workers, &hellip;) : Cluster setup failed. 8 of 8 workers failed to connect.&rdquo;, update to <strong>parallelly</strong> 1.26.1 released on 2021-06-30:</p> <pre><code class="language-r">install.packages(&quot;parallelly&quot;) </code></pre> <p>which will work around this problem. Alternatively, you can manually set:</p> <pre><code class="language-r">options(parallelly.makeNodePSOCK.setup_strategy = &quot;sequential&quot;) </code></pre> <p><em>Comment</em>: Note that I could only test with up to 122 parallel workers, and not 128, which is the number of CPU cores available on the test machine. The reason for this is that each worker consumes one R connection in the main R session, and R has a limit on the number of connections it can have open at any time. The typical R installation can only have 128 connections open, and three are always occupied by the standard input (stdin), the standard output (stdout), and the standard error (stderr). Thus, the absolute maximum number of workers I could use is 125. 
However, because I used the <strong><a href="https://progressr.futureverse.org">progressr</a></strong> package to report on progress, and a few other things that consumed a few more connections, I could only test up to 122 workers. You can read more about this limit in <a href="https://parallelly.futureverse.org/reference/availableConnections.html"><code>?parallelly::freeConnections</code></a>, which also gives a reference for how to increase this limit by recompiling R from source.</p> <h2 id="find-an-available-tcp-port">Find an available TCP port</h2> <p>I&rsquo;ve also added <code>freePort()</code>, which will find a random port in [1024,65535] that is currently not occupied by another process on the machine. For example,</p> <pre><code class="language-r">&gt; freePort() [1] 30386 &gt; freePort() [1] 37882 </code></pre> <p>Using this function to pick a TCP port at random lowers the risk of trying to use a port that is already occupied, compared with just using <code>sample(1024:65535, size=1)</code>.</p> <p>Just like <code>parallel::makePSOCKcluster()</code>, <code>parallelly::makeClusterPSOCK()</code> still uses <code>sample(11000:11999, size=1)</code> to find a random port. 
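In the meantime, nothing stops us from opting in manually. The following is a hedged sketch, assuming the `port` argument of `makeClusterPSOCK()`:

```r
library(parallelly)

## Pick a currently available TCP port at random and pass it
## explicitly, instead of relying on the default
## sample(11000:11999, size = 1) behavior
cl <- makeClusterPSOCK(2, port = freePort())
parallel::stopCluster(cl)
```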
I want <code>freePort()</code> to get some more mileage and CRAN validation before switching over, but the plan is to use <code>freePort()</code> by default in the next release of <strong>parallelly</strong>.</p> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></li> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a>, <a href="https://progressr.futureverse.org">pkgdown</a></li> </ul> </description>
</item>
<item>
<title>parallelly 1.25.0: availableCores(omit=n) and, Finally, Built-in SSH Support for MS Windows 10 Users</title>
<link>https://www.jottr.org/2021/04/30/parallelly-1.25.0/</link>
<pubDate>Fri, 30 Apr 2021 15:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2021/04/30/parallelly-1.25.0/</guid>
<description> <p><center> <img src="https://www.jottr.org/post/nasa-climate-ice-core-small.jpg" alt="A 25-cm long ice core is held in front of the camera on a sunny day. The background is an endless snow-covered flat landscape and a bright blue sky." style="width: 65%;"/><br/> <small><em>A piece of an ice core - more pleasing to look at than yet another illustration of a CPU core<br/> <small>(Image credit: Ludovic Brucker, NASA&rsquo;s Goddard Space Flight Center)</small> </em></small> </center></p> <p><strong><a href="https://cran.r-project.org/package=parallelly">parallelly</a></strong> 1.25.0 is on CRAN. It comes with two major improvements:</p> <ul> <li><p>You can now use <code>availableCores(omit = n)</code> to ask for all but <code>n</code> CPU cores</p></li> <li><p><code>makeClusterPSOCK()</code> can finally use the built-in SSH client on MS Windows 10 to set up remote workers</p></li> </ul> <h1 id="availablecores-omit-n-is-your-new-friend">availableCores(omit = n) is your new friend</h1> <p>When running R code in parallel, many choose to parallelize on as many CPU cores as possible, e.g.</p> <pre><code class="language-r">ncores &lt;- parallel::detectCores() </code></pre> <p>It&rsquo;s also common to leave out a few cores so that we can still use the computer for other basic tasks, e.g. checking email, editing files, and browsing the web. This is often done by something like:</p> <pre><code class="language-r">ncores &lt;- parallel::detectCores() - 1 </code></pre> <p>which will return seven on a machine with eight CPU cores. If you look around, you also find that some leave two cores aside for other tasks;</p> <pre><code class="language-r">ncores &lt;- parallel::detectCores() - 2 </code></pre> <p>I&rsquo;m sorry to be the party killer, but <em>none of the above is guaranteed to work everywhere</em>. It might work on your computer but not on your collaborator&rsquo;s computer, or in the cloud, or on continuous integration (CI) services, etc. 
There are two problems with the above approaches. The help page of <code>parallel::detectCores()</code> describes the first problem:</p> <blockquote> <p><strong>Value</strong><br /> An integer, <code>NA</code> if the answer is unknown.</p> </blockquote> <p>Yup, <code>detectCores()</code> might return <code>NA</code>. Ouf!</p> <p>The second problem is that your code might run on a machine that has only one or two CPU cores. That means that <code>parallel::detectCores() - 1</code> may return zero, and <code>parallel::detectCores() - 2</code> may even return minus one. You might think such machines no longer exist, but they do. The most common cases these days are virtual machines (VMs) running in the cloud. Note that if you&rsquo;re a package developer, GitHub Actions, Travis CI, and AppVeyor CI are all running in VMs with two cores.</p> <p>So, to make sure your code will run everywhere, you need to do something like:</p> <pre><code class="language-r">ncores &lt;- max(parallel::detectCores() - 1, 1, na.rm = TRUE) </code></pre> <p>With that approach, we know that <code>ncores</code> is at least one and never a missing value. I don&rsquo;t know about you, but I often do thinkos where I mix up <code>min()</code> and <code>max()</code>, which I&rsquo;m sure we don&rsquo;t want. So, let me introduce you to your new friend:</p> <pre><code class="language-r">ncores &lt;- parallelly::availableCores(omit = 1) </code></pre> <p>Just use that and you&rsquo;ll be fine everywhere - it&rsquo;ll always give you a value of one or greater. It&rsquo;s neater and less error prone. 
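As a hedged illustration of why this matters beyond safety, `availableCores()` also honors settings such as the standard R option `mc.cores`, which `detectCores()` ignores (see `?parallelly::availableCores` for the full list of sources consulted):

```r
library(parallelly)

## detectCores() reports the hardware count, regardless of settings ...
options(mc.cores = 2L)
parallel::detectCores()

## ... whereas availableCores() picks up the mc.cores setting
## (typically reporting 2 here on a multi-core machine)
availableCores()
```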
Also, in contrast to <code>parallel::detectCores()</code>, <code>parallelly::availableCores()</code> respects various CPU settings and configurations that the system wants you to follow.</p> <h1 id="makeclusterpsock-to-remote-machines-works-out-of-the-box-also-ms-windows-10">makeClusterPSOCK() to remote machines works out of the box also on MS Windows 10</h1> <p>If you&rsquo;re into parallelizing across multiple machines, either on your local network, or remotely, say in the cloud, you can use:</p> <pre><code class="language-r">workers &lt;- parallelly::makeClusterPSOCK(c(&quot;n1.example.org&quot;, &quot;n2.example.org&quot;)) </code></pre> <p>to spawn two R workers running in the background on those two machines. We can use these workers with different R parallel backends, e.g. with bare-bones <strong>parallel</strong></p> <pre><code class="language-r">y &lt;- parallel::parLapply(workers, X, slow_fcn) </code></pre> <p>with <strong>foreach</strong> and the classical <strong>doParallel</strong> adapter,</p> <pre><code class="language-r">library(foreach) doParallel::registerDoParallel(workers) y &lt;- foreach(x = X) %dopar% slow_fcn(x) </code></pre> <p>and, obviously, my favorite, the <strong>future</strong> framework, which comes with lots of alternatives, e.g.</p> <pre><code class="language-r">library(future) plan(cluster, workers = workers) y &lt;- future.apply::future_lapply(X, slow_fcn) y &lt;- furrr::future_map(X, slow_fcn) library(foreach) doFuture::registerDoFuture() y &lt;- foreach(x = X) %dopar% slow_fcn(x) y &lt;- BiocParallel::bplapply(X, slow_fcn) </code></pre> <p>Now, in order to set up remote workers out of the box as shown above, you need to make sure you can do the following from the terminal:</p> <pre><code class="language-r">{local}$ ssh n1.example.org Rscript --version R scripting front-end version 4.0.4 (2021-02-15) </code></pre> <p>If you can get to that point, you can also use those two remote machines to parallelize from your local computer, which, 
at least I think, is pretty cool. To get to that point, you basically need to configure SSH locally and remotely so that you can log in without having to enter a password, which you do by using SSH keys. It does <em>not</em> require admin rights, and it&rsquo;s not that hard to do when you know how to do it. Search the web for &ldquo;SSH key authentication&rdquo; for instructions, but the gist is that you create a public-private key pair locally and you copy the public one to the remote machine. The setup is the same for Linux, macOS, and MS Windows 10.</p> <p>What&rsquo;s new in <strong>parallelly</strong> 1.25.0 is that <em>MS Windows 10 users no longer have to install the PuTTY SSH client</em> - the Unix-compatible <code>ssh</code> client that comes with all MS Windows 10 installations works out of the box.</p> <p>The reason why we couldn&rsquo;t use the built-in Windows 10 client before is that it has a <a href="https://github.com/PowerShell/Win32-OpenSSH/issues/1265">bug preventing us from using it for reverse tunneling</a>, which is needed for remote, parallel processing. However, someone found a workaround, so that bug is no longer a blocker. 
Thus, now <code>makeClusterPSOCK()</code> works as we always wanted it to.</p> <h2 id="take-homes">Take-homes</h2> <ul> <li><p>Use <code>parallelly::availableCores()</code></p></li> <li><p>Remote parallelization from MS Windows 10 is now as easy as from Linux and macOS</p></li> </ul> <p>For all updates, including what bugs have been fixed, see the <a href="https://parallelly.futureverse.org/news/index.html">NEWS</a> of <strong>parallelly</strong>.</p> <p>Over and out!</p> <h2 id="links">Links</h2> <ul> <li><p><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a>, <a href="https://parallelly.futureverse.org">pkgdown</a></p></li> <li><p><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a>, <a href="https://future.futureverse.org">pkgdown</a></p></li> <li><p><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a>, <a href="https://future.apply.futureverse.org">pkgdown</a></p></li> <li><p><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/furrr">GitHub</a>, <a href="https://furrr.futureverse.org">pkgdown</a></p></li> </ul> <p>PS. If you&rsquo;re interested in learning more about ice cores and how they are used to track changes in our atmosphere and climate, see <a href="https://climate.nasa.gov/news/2616/core-questions-an-introduction-to-ice-cores/">Core questions: An introduction to ice cores</a> by Jessica Stoller-Conrad, NASA&rsquo;s Jet Propulsion Laboratory.</p> </description>
</item>
<item>
<title>Using Kubernetes and the Future Package to Easily Parallelize R in the Cloud</title>
<link>https://www.jottr.org/2021/04/08/future-and-kubernetes/</link>
<pubDate>Thu, 08 Apr 2021 19:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2021/04/08/future-and-kubernetes/</guid>
<description> <p><em>This is a guest post by <a href="https://www.stat.berkeley.edu/~paciorek">Chris Paciorek</a>, Department of Statistics, University of California at Berkeley.</em></p> <p>In this post, I&rsquo;ll demonstrate that you can easily use the <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package in R on a cluster of machines running in the cloud, specifically on a Kubernetes cluster.</p> <p>This allows you to easily do parallel computing in R in the cloud. One advantage of doing this in the cloud is the ability to easily scale the number and type of (virtual) machines across which you run your parallel computation.</p> <h2 id="why-use-kubernetes-to-start-a-cluster-in-the-cloud">Why use Kubernetes to start a cluster in the cloud?</h2> <p>Kubernetes is a platform for managing containers. You can think of the containers as lightweight Linux machines on which you can do your computation. By using the Kubernetes service of a cloud provider such as Google Cloud Platform (GCP) or Amazon Web Services (AWS), you can easily start up a cluster of (virtual) machines.</p> <p>There have been (and are) approaches to starting up a cluster of machines on AWS easily from the command line on your laptop. Some tools that are no longer actively maintained are <a href="http://star.mit.edu/cluster">StarCluster</a> and <a href="https://cfncluster.readthedocs.io/en/latest">CfnCluster</a>. And there is now something called <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/getting_started.html">AWS ParallelCluster</a>. But doing it via Kubernetes allows you to build upon an industry standard platform that can be used on various cloud providers. 
A similar effort (which I heavily borrowed from in developing the setup described here) allows one to run a <a href="https://docs.dask.org/en/latest/setup/kubernetes-helm.html">Python Dask cluster</a> accessed via a Jupyter notebook.</p> <p>Many of the cloud providers have Kubernetes services (and it&rsquo;s also possible you&rsquo;d have access to a Kubernetes service running at your institution or company). In particular, I&rsquo;ve experimented with <a href="https://cloud.google.com/kubernetes-engine">Google Kubernetes Engine (GKE)</a> and <a href="https://aws.amazon.com/eks">Amazon&rsquo;s Elastic Kubernetes Service (EKS)</a>. This post will demonstrate setting up your cluster using Google&rsquo;s GKE, but see my GitHub <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository for details on doing it on Amazon&rsquo;s EKS. Note that while I&rsquo;ve gotten things to work on EKS, there have been <a href="https://github.com/paciorek/future-kubernetes#AWS-troubleshooting">various headaches</a> that I haven&rsquo;t encountered on GKE.</p> <p>I&rsquo;m not a Kubernetes expert, nor a GCP or AWS expert (that might explain the headaches I just mentioned), but one upside is that hopefully I&rsquo;ll go through all the details at a level that someone who is not an expert can follow. In fact, part of my goal in setting this up has been to learn more about Kubernetes, which I&rsquo;ve done, but note that there&rsquo;s <em>a lot</em> to it.</p> <p>More details about the setup, including how it was developed and troubleshooting tips, can be found in my <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p> <h2 id="how-it-works-briefly">How it works (briefly)</h2> <p>The diagram in Figure 1 outlines the pieces of the setup.</p> <figure> <img src="https://www.jottr.org/post/k8s.png" alt="Overview of using future on a Kubernetes cluster" width="700"/> <figcaption style="font-style: italic;">Figure 1. 
Overview of using future on a Kubernetes cluster</figcaption> </figure> <p>Work on a Kubernetes cluster is divided amongst <em>pods</em>, which carry out the components of your work and can communicate with each other. A pod is basically a Linux container. (Strictly speaking a pod can contain multiple containers and shared resources for those containers, but for our purposes, it&rsquo;s simplest just to think of a pod as being a Linux container.) The pods run on the nodes in the Kubernetes cluster, where each Kubernetes node runs on a compute instance of the cloud provider. These instances are themselves virtual machines running on the cloud provider&rsquo;s actual hardware. (I.e., somewhere out there, behind all the layers of abstraction, there are actual real computers running on endless aisles of computer racks in some windowless warehouse!) One of the nice things about Kubernetes is that if a pod dies, Kubernetes will automatically restart it.</p> <p>The basic steps are:</p> <ol> <li>Start your Kubernetes cluster on the cloud provider&rsquo;s Kubernetes service</li> <li>Start the pods using Helm, the Kubernetes package manager</li> <li>Connect to the RStudio Server session running on the cluster from your browser</li> <li>Run your future-based computation</li> <li>Terminate the Kubernetes cluster</li> </ol> <p>We use the Kubernetes package manager, Helm, to run the pods of interest:</p> <ul> <li>one (scheduler) pod for a main process that runs RStudio Server and communicates with the workers</li> <li>multiple (worker) pods, each with one R worker process to act as the workers managed by the <strong>future</strong> package</li> </ul> <p>Helm manages the pods and related <em>services</em>. An example of a service is to open a port on the scheduler pod so the R worker processes can connect to that port, allowing the scheduler pod RStudio Server process to communicate with the worker R processes. 
I have a <a href="https://github.com/paciorek/future-helm-chart">Helm chart</a> that does this; it borrows heavily from the <a href="https://github.com/dask/helm-chart">Dask Helm chart</a> for the Dask package for Python.</p> <p>Each pod runs a Docker container. I use my own <a href="https://github.com/paciorek/future-kubernetes-docker">Docker container</a> that layers a bit on top of the <a href="https://rocker-project.org">Rocker</a> container that contains R and RStudio Server.</p> <h2 id="step-1-start-the-kubernetes-cluster">Step 1: Start the Kubernetes cluster</h2> <p>Here I assume you have already installed:</p> <ul> <li>the command line interface to Google Cloud,</li> <li>the <code>kubectl</code> interface for interacting with Kubernetes, and</li> <li><code>helm</code> for installing Helm charts (i.e., Kubernetes packages).</li> </ul> <p>Installation details can be found in the <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p> <p>First we&rsquo;ll start our cluster (the first part of Step 1 in Figure 1):</p> <pre><code class="language-sh">gcloud container clusters create \
  --machine-type n1-standard-1 \
  --num-nodes 4 \
  --zone us-west1-a \
  --cluster-version latest \
  my-cluster
</code></pre> <p>I&rsquo;ve asked for four virtual machines (nodes), using the basic (and cheap) <code>n1-standard-1</code> instance type (which has a single CPU per virtual machine) from Google Cloud Platform.</p> <p>You&rsquo;ll want to specify the total number of cores on the virtual machines to be equal to the number of R workers that you want to start and that you specify in the Helm chart (as discussed below). Here we ask for four one-CPU nodes, and our Helm chart starts four workers, so all is well. 
See the <a href="#modifications">Modifications section</a> below on how to start up a different number of workers.</p> <p>Since the RStudio Server process that you interact with wouldn&rsquo;t generally be doing heavy computation at the same time as the workers, it&rsquo;s OK that the RStudio scheduler pod and a worker pod would end up using the same virtual machine.</p> <h2 id="step-2-install-the-helm-chart-to-set-up-your-pods">Step 2: Install the Helm chart to set up your pods</h2> <p>Next we need to get our pods going by installing the Helm chart (i.e., package) on the cluster; the installed chart is called a <em>release</em>. As discussed above, the Helm chart tells Kubernetes what pods to start and how they are configured.</p> <p>First we need to give our account permissions to perform administrative actions:</p> <pre><code class="language-sh">kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole=cluster-admin
</code></pre> <p>Now let&rsquo;s install the release. This code assumes the use of Helm version 3 or greater (for older versions <a href="https://github.com/paciorek/future-kubernetes">see my full instructions</a>).</p> <pre><code class="language-sh">git clone https://github.com/paciorek/future-helm-chart   # download the materials
tar -czf future-helm.tgz -C future-helm-chart .           # create a zipped archive (tarball) that `helm install` needs
helm install --wait test ./future-helm.tgz                # install (start the pods)
</code></pre> <p>You&rsquo;ll need to name your release; I&rsquo;ve used &lsquo;test&rsquo; above.</p> <p>The <code>--wait</code> flag tells helm to wait until all the pods have started. 
Once that happens, you&rsquo;ll see a message about the release and how to connect to the RStudio interface, which we&rsquo;ll discuss further in the next section.</p> <p>We can check that the pods are running:</p> <pre><code class="language-sh">kubectl get pods
</code></pre> <p>You should see something like this (the alphanumeric characters at the ends of the names will differ in your case):</p> <pre><code>NAME                                READY   STATUS    RESTARTS   AGE
future-scheduler-6476fd9c44-mvmz6   1/1     Running   0          116s
future-worker-54db85cb7b-47qsd      1/1     Running   0          115s
future-worker-54db85cb7b-4xf4x      1/1     Running   0          115s
future-worker-54db85cb7b-rj6bj      1/1     Running   0          116s
future-worker-54db85cb7b-wvp4n      1/1     Running   0          115s
</code></pre> <p>As expected, we have one scheduler and four workers.</p> <h2 id="step-3-connect-to-rstudio-server-running-in-the-cluster">Step 3: Connect to RStudio Server running in the cluster</h2> <p>Next we&rsquo;ll connect to the RStudio instance running via RStudio Server on our main (scheduler) pod, using the browser on our laptop (Step 3 in Figure 1).</p> <p>After installing the Helm chart, you should have seen a printout with some instructions on how to do this. First you need to connect a port on your laptop to the RStudio port on the main pod (running of course in the cloud):</p> <pre><code class="language-sh">export RSTUDIO_SERVER_IP=&quot;127.0.0.1&quot;
export RSTUDIO_SERVER_PORT=8787
kubectl port-forward --namespace default svc/future-scheduler $RSTUDIO_SERVER_PORT:8787 &amp;
</code></pre> <p>You can now connect from your browser to the RStudio Server instance by going to the URL: <a href="http://127.0.0.1:8787">http://127.0.0.1:8787</a>.</p> <p>Enter <code>rstudio</code> as the username and <code>future</code> as the password to log in to RStudio.</p> <p>What&rsquo;s happening is that port 8787 on your laptop is forwarding to the port on the main pod on which RStudio Server is listening (which is also port 8787). 
So you can just act as if RStudio Server is accessible directly on your laptop.</p> <p>One nice thing about this is that there is no public IP address for someone to maliciously use to connect to your cluster. Instead the access is handled securely entirely through <code>kubectl</code> running on your laptop. However, it also means that you couldn&rsquo;t easily share your cluster with a collaborator. For details on configuring things so there is a public IP, please see <a href="https://github.com/paciorek/future-kubernetes#connecting-to-the-rstudio-instance-when-starting-the-cluster-from-a-remote-machine">my repository</a>.</p> <p>Note that there is nothing magical about running your computation via RStudio. You could <a href="#connect-to-a-pod">connect to the main pod</a> and simply run R in it and then use the <strong>future</strong> package.</p> <h2 id="step-4-run-your-future-based-parallel-r-code">Step 4: Run your future-based parallel R code</h2> <p>Now we&rsquo;ll start up our future cluster and run our computation (Step 4 in Figure 1):</p> <pre><code class="language-r">library(future)
plan(cluster, manual = TRUE, quiet = TRUE)
</code></pre> <p>The key thing is that we set <code>manual = TRUE</code> above. This ensures that the functions from the <strong>future</strong> package don&rsquo;t try to start R processes on the workers, as those R processes have already been started by Kubernetes and are waiting to connect to the main (RStudio Server) process.</p> <p>Note that we don&rsquo;t need to say how many future workers we want. This is because the Helm chart sets an environment variable in the scheduler pod&rsquo;s <code>Renviron</code> file based on the number of worker pod replicas. Since that variable is used by the <strong>future</strong> package (via <code>parallelly::availableCores()</code>) as the default number of future workers, this ensures that there are only as many future workers as you have worker pods. 
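You can inspect what this resolves to from the R session; a small sketch (the values it prints depend on your setup):

```r
library(future)

## Number of workers the current plan provides; on the scheduler pod
## this reflects the environment variable set by the Helm chart,
## picked up via parallelly::availableCores()
parallelly::availableCores()
nbrOfWorkers()
```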
However, if you modify the number of worker pods after installing the Helm chart, you may need to set the <code>workers</code> argument to <code>plan()</code> manually. (And note that if you were to specify more future workers than R worker processes (i.e., pods), you would get an error, and if you were to specify fewer, you wouldn&rsquo;t be using all the resources that you are paying for.)</p> <p>Now we can use the various tools in the <strong>future</strong> package as we would if on our own machine or working on a Linux cluster.</p> <p>Let&rsquo;s run our parallelized operations. I&rsquo;m going to do the world&rsquo;s least interesting calculation: computing the mean of many (10 million) random numbers forty separate times in parallel. Not interesting, but presumably if you&rsquo;re reading this you have your own interesting computation in mind and hopefully know how to do it using future&rsquo;s tools such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> and <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>.</p> <pre><code class="language-r">library(future.apply)
output &lt;- future_sapply(1:40, function(i) mean(rnorm(1e7)), future.seed = TRUE)
</code></pre> <p>Note that all of this assumes you&rsquo;re working interactively, but you can always reconnect to the RStudio Server instance after closing the browser, and any long-running code should continue running even if you close the browser.</p> <p>Figure 2 shows a screenshot of the RStudio interface.</p> <figure> <img src="https://www.jottr.org/post/rstudio.png" alt="RStudio interface, demonstrating use of future commands" width="700"/> <figcaption style="font-style: italic;">Figure 2. 
Screenshot of the RStudio interface</figcaption> </figure> <h3 id="working-with-files">Working with files</h3> <p>Note that <code>/home/rstudio</code> will be your default working directory in RStudio and the RStudio Server process will be running as the user <code>rstudio</code>.</p> <p>You can use <code>/tmp</code> and <code>/home/rstudio</code> for files, both within RStudio and within code running on the workers, but note that files (even in <code>/home/rstudio</code>) are not shared between workers nor between the workers and the RStudio Server pod.</p> <p>To make data available to your RStudio process or get output data back to your laptop, you can use <code>kubectl cp</code> to copy files between your laptop and the RStudio Server pod. Here&rsquo;s an example of copying to/from <code>/home/rstudio</code>:</p> <pre><code class="language-sh">## create a variable with the name of the scheduler pod
export SCHEDULER=$(kubectl get pod --namespace default -o jsonpath='{.items[?(@.metadata.labels.component==&quot;scheduler&quot;)].metadata.name}')

## copy a file to the scheduler pod
kubectl cp my_laptop_file ${SCHEDULER}:home/rstudio/

## copy a file from the scheduler pod
kubectl cp ${SCHEDULER}:home/rstudio/my_output_file .
</code></pre> <p>Of course you can also interact with the web from your RStudio process, so you could download data to the RStudio process from the internet.</p> <h2 id="step-5-cleaning-up">Step 5: Cleaning up</h2> <p>Make sure to shut down your Kubernetes cluster, so you don&rsquo;t keep getting charged.</p> <pre><code class="language-sh">gcloud container clusters delete my-cluster --zone=us-west1-a
</code></pre> <h2 id="modifications">Modifications</h2> <p>You can modify the Helm chart in advance, before installing it. 
For example, you might want to install other R packages for use in your parallel code or change the number of workers.</p> <p>To add additional R packages, go into the <code>future-helm-chart</code> directory (which you created using the directions above in Step 2) and edit the <a href="https://github.com/paciorek/future-helm-chart/blob/master/values.yaml">values.yaml</a> file. Simply modify the lines that look like this:</p> <pre><code class="language-yaml">  env:
  # - name: EXTRA_R_PACKAGES
  #   value: data.table
</code></pre> <p>by removing the &ldquo;#&rdquo; comment characters and putting the R packages you want installed in place of <code>data.table</code>, with the names of the packages separated by spaces, e.g.,</p> <pre><code class="language-yaml">  env:
    - name: EXTRA_R_PACKAGES
      value: foreach doFuture
</code></pre> <p>In many cases you may want these packages installed on both the scheduler pod (where RStudio Server runs) and on the workers. If so, make sure to modify the lines above in both the <code>scheduler</code> and <code>worker</code> stanzas.</p> <p>To modify the number of workers, modify the <code>replicas</code> line in the <code>worker</code> stanza of the <a href="https://github.com/paciorek/future-helm-chart/blob/master/values.yaml">values.yaml</a> file.</p> <p>Then rebuild the Helm chart:</p> <pre><code class="language-sh">cd future-helm-chart  ## ensure you are in the directory containing `values.yaml`
tar -czf ../future-helm.tgz .
</code></pre> <p>and install as done previously.</p> <p>Note that doing the above to increase the number of workers would probably only make sense if you also modify the number of virtual machines you start your Kubernetes cluster with such that the total number of cores across the cloud provider compute instances matches the number of worker replicas.</p> <p>You may also be able to modify a running cluster. For example, you could use <code>gcloud container clusters resize</code>. 
I haven&rsquo;t experimented with this.</p> <p>To make modifications when your Helm chart is already installed (i.e., your release is running), one simple option is to reinstall the Helm chart as discussed below. You may also need to kill the <code>port-forward</code> process discussed in Step 3.</p> <p>For some changes, you can also update a running release without uninstalling it by &ldquo;patching&rdquo; the running release or scaling resources. I won&rsquo;t go into details here.</p> <h2 id="troubleshooting">Troubleshooting</h2> <p>Things can definitely go wrong in getting all the pods to start up and communicate with each other. Here are some suggestions for monitoring what is going on and troubleshooting.</p> <p>First, you can use <code>kubectl</code> to check that the pods are running:</p> <pre><code class="language-sh">kubectl get pods
</code></pre> <h3 id="connect-to-a-pod">Connect to a pod</h3> <p>To connect to a pod, which lets you inspect installed software, see what the pod is doing, and troubleshoot in other ways, you can do the following:</p> <pre><code class="language-sh">export SCHEDULER=$(kubectl get pod --namespace default -o jsonpath='{.items[?(@.metadata.labels.component==&quot;scheduler&quot;)].metadata.name}')
export WORKERS=$(kubectl get pod --namespace default -o jsonpath='{.items[?(@.metadata.labels.component==&quot;worker&quot;)].metadata.name}')

## access the scheduler pod:
kubectl exec -it ${SCHEDULER} -- /bin/bash

## access a worker pod:
echo $WORKERS
kubectl exec -it &lt;insert_name_of_a_worker&gt; -- /bin/bash
</code></pre> <p>Alternatively, just determine the name of the pod with <code>kubectl get pods</code> and then run the <code>kubectl exec -it ...</code> invocation above.</p> <p>Note that once you are in a pod, you can install software in the usual fashion of a Linux machine (in this case using <code>apt</code> commands such as <code>apt-get install</code>).</p> <h3 id="connect-to-a-virtual-machine">Connect to a virtual machine</h3> 
<p>Or to connect directly to an underlying VM, you can first determine the name of the VM and then use the <code>gcloud</code> tools to connect to it.</p> <pre><code class="language-sh">kubectl get nodes

## now, connect to one of the nodes, 'gke-my-cluster-default-pool-8b490768-2q9v' in this case:
gcloud compute ssh gke-my-cluster-default-pool-8b490768-2q9v --zone us-west1-a
</code></pre> <h3 id="check-your-running-code">Check your running code</h3> <p>To check that your code is actually running in parallel, one can run the following test and see that the result returns the names of distinct worker pods.</p> <pre><code class="language-r">library(future.apply)
future_sapply(seq_len(nbrOfWorkers()), function(i) Sys.info()[[&quot;nodename&quot;]])
</code></pre> <p>You should see something like this:</p> <pre><code>[1] future-worker-54db85cb7b-47qsd future-worker-54db85cb7b-4xf4x
[3] future-worker-54db85cb7b-rj6bj future-worker-54db85cb7b-wvp4n
</code></pre> <p>One can also connect to the pods or to the underlying virtual nodes (as discussed above) and run Unix commands such as <code>top</code> and <code>free</code> to understand CPU and memory usage.</p> <h3 id="reinstall-the-helm-release">Reinstall the Helm release</h3> <p>You can restart your release (i.e., restarting the pods, without restarting the whole Kubernetes cluster):</p> <pre><code class="language-sh">helm uninstall test
helm install --wait test ./future-helm.tgz
</code></pre> <p>Note that you may need to restart the entire Kubernetes cluster if you&rsquo;re having difficulties that reinstalling the release doesn&rsquo;t fix.</p> <h2 id="how-does-it-work">How does it work?</h2> <p>I&rsquo;ve provided many of the details of how it works in my <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p> <p>The key pieces are:</p> <ol> <li>The <a href="https://github.com/paciorek/future-helm-chart">Helm chart</a> with the instructions for how to start the pods and any 
associated services.</li> <li>The <a href="https://github.com/paciorek/future-kubernetes-docker">Rocker-based Docker container(s)</a> that the pods run.</li> </ol> <p>That&rsquo;s all there is to it &hellip; plus <a href="https://github.com/paciorek/future-kubernetes">these instructions</a>.</p> <p>Briefly:</p> <ol> <li>Based on the Helm chart, Kubernetes starts up the &lsquo;main&rsquo; or &lsquo;scheduler&rsquo; pod running RStudio Server and multiple worker pods each running an R process. All of the pods are running the Rocker-based Docker container</li> <li>The RStudio Server main process and the workers use socket connections (via the R function <code>socketConnection()</code>) to communicate: <ul> <li>the worker processes start R processes that are instructed to regularly make a socket connection using a particular port on the main scheduler pod</li> <li>when you run <code>future::plan()</code> (which calls <code>makeClusterPSOCK()</code>) in RStudio, the RStudio Server process attempts to make socket connections to the workers using that same port</li> </ul></li> <li>Once the socket connections are established, command of the RStudio session returns to you and you can run your future-based parallel R code.</li> </ol> <p>One thing I haven&rsquo;t had time to work through is how to easily scale the number of workers after the Kubernetes cluster is running and the Helm chart installed, or even how to auto-scale &ndash; starting up workers as needed based on the number of workers requested via <code>plan()</code>.</p> <h2 id="wrap-up">Wrap up</h2> <p>If you&rsquo;re interested in extending or improving this or collaborating in some fashion, please feel free to get in touch with me via the <a href="https://github.com/paciorek/future-kubernetes/issues">&lsquo;future-kubernetes&rsquo; issue tracker</a> or by email.</p> <p>And if you&rsquo;re interested in using R with Kubernetes, note that RStudio provides an integration of RStudio Server Pro with Kubernetes that 
should allow one to run future-based workflows in parallel.</p> <p>/Chris</p> <h2 id="links">Links</h2> <ul> <li><p>future-kubernetes repository:</p> <ul> <li>GitHub page: <a href="https://github.com/paciorek/future-kubernetes">https://github.com/paciorek/future-kubernetes</a></li> </ul></li> <li><p>future-kubernetes Helm chart:</p> <ul> <li>GitHub page: <a href="https://github.com/paciorek/future-helm-chart">https://github.com/paciorek/future-helm-chart</a></li> </ul></li> <li><p>future-kubernetes Docker container:</p> <ul> <li>GitHub page: <a href="https://github.com/paciorek/future-kubernetes-docker">https://github.com/paciorek/future-kubernetes-docker</a></li> </ul></li> <li><p>future package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>future.BatchJobs - End-of-Life Announcement</title>
<link>https://www.jottr.org/2021/01/08/future.batchjobs-end-of-life-announcement/</link>
<pubDate>Fri, 08 Jan 2021 09:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2021/01/08/future.batchjobs-end-of-life-announcement/</guid>
<description> <div style="width: 40%; margin: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/sign_out_of_service_do_not_use.png" alt="Sign: Out of Service - Do not use!"/> </center> </div> <p>This is an announcement that <strong><a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a></strong> - <em>A Future API for Parallel and Distributed Processing using BatchJobs</em> has been archived on CRAN. The package has been deprecated for years with a recommendation of using <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> instead. The latter has been on CRAN since June 2017 and builds upon the <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong> package, which itself supersedes the <strong><a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a></strong> package.</p> <p>To wrap up the three-and-a-half year long life of <strong><a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a></strong>, the very last version, 0.17.0, reached CRAN on 2021-01-04 and passed CRAN checks as of 2021-01-08, when the package was requested to be formally archived. All versions ever existing on CRAN can be found at <a href="https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/">https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/</a>.</p> <p>Archiving the <strong>future.BatchJobs</strong> package will speed up new releases of the <strong>future</strong> package. 
In the past, some of the <strong>future</strong> releases required internal updates to reverse package dependencies such as <strong>future.BatchJobs</strong> to be rolled out on CRAN first in order for <strong>future</strong> to pass the CRAN incoming checks.</p> <h2 id="postscript">Postscript</h2> <p>The <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a> page mentions:</p> <blockquote> <p>Archived on 2021-01-08 at the request of the maintainer.</p> <p>Consider using package ‘<a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a>’ instead.</p> </blockquote> <p>I&rsquo;m happy to see that we can suggest another package on our archived package pages. All I did to get this was to mention it in my email to CRAN:</p> <blockquote> <p>Hi,</p> <p>please archive the &lsquo;future.BatchJobs&rsquo; package. It has zero reverse dependencies. The package has been labelled deprecated for a long time now and has been superseded by the &lsquo;future.batchtools&rsquo; package.</p> <p>Thank you,<br /> Henrik</p> </blockquote> <h2 id="links">Links</h2> <ul> <li><p>future package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li><p>future.BatchJobs package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>All CRAN versions: <a href="https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/">https://cran.r-project.org/src/contrib/Archive/future.BatchJobs/</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li><p>future.batchtools package:</p> <ul> <li>CRAN page: <a 
href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>My Keynote 'Future' Presentation at the European Bioconductor Meeting 2020</title>
<link>https://www.jottr.org/2020/12/19/future-eurobioc2020-slides/</link>
<pubDate>Sat, 19 Dec 2020 10:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2020/12/19/future-eurobioc2020-slides/</guid>
<description> <div style="width: 40%; margin: 2ex; float: right;"> <center> <img src="https://www.jottr.org/post/LukeZapia_20201218-EuroBioc2020-future_mindmap.jpg" alt="A hand-drawn summary of Henrik Bengtsson's future talk at the European Bioconductor Meeting 2020 in the form of a mindmap on a whiteboard" style="border: 1px solid #666;"/> <span style="font-size: 80%; font-style: italic;"><a href="https://twitter.com/_lazappi_">Luke Zappia</a>'s summary of the talk</span> </center> </div> <p>I presented <em>Future: A Simple, Extendable, Generic Framework for Parallel Processing in R</em> at the <a href="https://eurobioc2020.bioconductor.org/">European Bioconductor Meeting 2020</a>, which took place online during the week of December 14-18, 2020.</p> <p>You&rsquo;ll find my slides (39 slides + Q&amp;A slides; 35 minutes) below:</p> <ul> <li><a href="https://www.jottr.org/presentations/EuroBioc2020/BengtssonH_20201218-futures-EuroBioc2020.abstract.txt">Title &amp; Abstract</a></li> <li><a href="https://docs.google.com/presentation/d/e/2PACX-1vTVyeaWRH251Pm8BfrlH1yK4Bd_YojEmo1I0VFxkoehnoxYJXglLdDf5T6_bTDv7lFJjwrXNYFBtfHT/pub?start=false&amp;loop=false&amp;delayms=10000">HTML</a> (Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/EuroBioc2020/BengtssonH_20201218-futures-EuroBioc2020.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=Ph8jItU7Dlo">Video</a> (YouTube)</li> </ul> <p>I want to thank the organizers for inviting me to this Bioconductor conference. The <a href="http://bioconductor.org/">Bioconductor Project</a> provides a powerful and important technical and social environment for developing and conducting computational research in bioinformatics and genomics. 
It has a great, world-wide community and engaging leadership which effortlessly keep delivering great tools (~2,000 R packages as of December 2020) and <a href="http://bioconductor.org/help/course-materials/">training</a> year after year. I am honored for the opportunity to give a keynote presentation to this community.</p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li><p>Relevant packages mentioned in this talk:</p> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/DavisVaughan/furrr">GitHub</a></li> <li><strong>foreach</strong> package: <a href="https://cran.r-project.org/package=foreach">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/foreach">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a></li> <li><strong>doParallel</strong> package: <a href="https://cran.r-project.org/package=doParallel">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/doParallel">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>clustermq</strong> package: <a href="https://cran.r-project.org/package=clustermq">CRAN</a>, <a 
href="https://github.com/mschubert/clustermq">GitHub</a></li> <li><strong>BiocParallel</strong> package: <a href="https://cran.r-project.org/package=BiocParallel">CRAN</a>, <a href="https://github.com/Bioconductor/BiocParallel">GitHub</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>NYC R Meetup: Slides on Future</title>
<link>https://www.jottr.org/2020/11/12/future-nycmeetup-slides/</link>
<pubDate>Thu, 12 Nov 2020 19:30:00 -0800</pubDate>
<guid>https://www.jottr.org/2020/11/12/future-nycmeetup-slides/</guid>
<description> <div style="width: 35%; margin: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/poster-for-nycmeetup2020-talk.png" alt="The official poster for this New York Open Statistical Programming Meetup"/> </center> </div> <p>I presented <em>Future: Simple, Friendly Parallel Processing for R</em> (67 minutes; 59 slides + Q&amp;A slides) at the <a href="https://nyhackr.org/">New York Open Statistical Programming Meetup</a>, on November 9, 2020:</p> <ul> <li><a href="https://docs.google.com/presentation/d/1E2Gcm33_uMrhQL7jLzodlMXUefnSshHUdYsoXWAkFYE/edit?usp=sharing">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/NYCMeetup2020/BengtssonH_20191109-futures-NYC.pdf">PDF</a> (flat slides)</li> <li><a href="https://youtu.be/2ZlpFkFMy7E?t=630">Video</a> (presentation starts at 0h10m30s, Q&amp;A starts at 1h17m40s)</li> </ul> <p>I would like to thank everyone who attended and everyone who asked lots of brilliant questions during the Q&amp;A. I&rsquo;d also like to express my gratitude to Amada, Jared, and Noam for the invitation and for making this event possible. 
It was great fun.</p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li><p>Relevant packages mentioned in this talk:</p> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/DavisVaughan/furrr">GitHub</a></li> <li><strong>foreach</strong> package: <a href="https://cran.r-project.org/package=foreach">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/foreach">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a></li> <li><strong>doParallel</strong> package: <a href="https://cran.r-project.org/package=doParallel">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/doParallel">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>future.tests</strong> package: <a href="https://cran.r-project.org/package=future.tests">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> <li><strong>clustermq</strong> package: <a href="https://cran.r-project.org/package=clustermq">CRAN</a>, <a href="https://github.com/mschubert/clustermq">GitHub</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>future 1.20.1 - The Future Just Got a Bit Brighter</title>
<link>https://www.jottr.org/2020/11/06/future-1.20.1-the-future-just-got-a-bit-brighter/</link>
<pubDate>Fri, 06 Nov 2020 13:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2020/11/06/future-1.20.1-the-future-just-got-a-bit-brighter/</guid>
<description> <p><center> <img src="https://www.jottr.org/post/sparkles-through-space.gif" alt="&quot;Short-loop artsy animation: Flying through colorful, sparkling lights positioned in circles with star-like lights on a black background in the distance&quot;" /> </center></p> <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.20.1 is on CRAN. It adds some new features, deprecates old and unwanted behaviors, adds a couple of vignettes, and fixes a few bugs.</p> <h1 id="interactive-debugging">Interactive debugging</h1> <p>First out among the new features, and a long-running feature request, is the addition of argument <code>split</code> to <code>plan()</code>, which allows us to split, or &ldquo;tee&rdquo;, any output produced by futures.</p> <p>The default is <code>split = FALSE</code>, for which standard output and conditions are captured by the future and only relayed after the future has been resolved, i.e. the captured output is displayed and re-signaled on the main R session when the value of the future is queried. This emulates what we experience in R when not using futures, e.g. we can add temporary <code>print()</code> and <code>message()</code> statements to our code for quick troubleshooting. You can read more about this in the blog post &lsquo;<a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>&rsquo;.</p> <p>However, if we want to use <code>debug()</code> or <code>browser()</code> for interactive debugging, we quickly realize they&rsquo;re not very useful, because no output is visible - their output, too, is captured by the future. This is where the new &ldquo;split&rdquo; feature comes to the rescue. By using <code>split = TRUE</code>, the standard output and all non-error conditions are split (&ldquo;tee&rsquo;d&rdquo;) on the worker&rsquo;s end, while still being captured by the future to be relayed back to the main R session at a later time. 
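</p> <p>To see the &ldquo;tee&rdquo; behavior on its own, here is a minimal sketch (a made-up toy example): with <code>split = TRUE</code>, the message is displayed as soon as it is produced <em>and</em> is re-signaled when the value is queried:</p> <pre><code class="language-r">library(future)
plan(sequential, split = TRUE)
f &lt;- future({ message(&quot;hello from the future&quot;); 42 })
## 'hello from the future' is displayed immediately ...
v &lt;- value(f)
## ... and the captured message is re-signaled here as well
</code></pre> <p>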
This means that we can debug &lsquo;sequential&rsquo; futures interactively. Here is an illustration of using <code>browser()</code> for debugging a future:</p> <pre><code class="language-r">&gt; library(future)
&gt; plan(sequential, split = TRUE)
&gt; mysqrt &lt;- function(x) { browser(); y &lt;- sqrt(x); y }
&gt; f &lt;- future(mysqrt(1:3))
Called from: mysqrt(1:3)
Browse[1]&gt; str(x)
 int [1:3] 1 2 3
Browse[1]&gt;
debug at #1: y &lt;- sqrt(x)
Browse[2]&gt;
debug at #1: y
Browse[2]&gt; str(y)
 num [1:3] 1 1.41 1.73
Browse[2]&gt; y[1] &lt;- 0
Browse[2]&gt; cont
&gt; v &lt;- value(f)
Called from: mysqrt(1:3)
 int [1:3] 1 2 3
debug at #1: y &lt;- sqrt(x)
debug at #1: y
 num [1:3] 1 1.41 1.73
&gt; v
[1] 0.000000 1.414214 1.732051
</code></pre> <p><em>Comment</em>: Note how the output produced while debugging is relayed also when <code>value()</code> is called. This is a somewhat unfortunate side effect of futures capturing <em>all</em> output produced while they are active.</p> <h1 id="preserved-logging-on-workers-e-g-future-batchtools">Preserved logging on workers (e.g. future.batchtools)</h1> <p>The added support for <code>split = TRUE</code> also means that we can now preserve all output in any log files that might be produced on parallel workers. For example, if you use <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> on a Slurm scheduler, you can use <code>plan(future.batchtools::batchtools_slurm, split = TRUE)</code> to make sure standard output, messages, warnings, etc. end up in the <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong> log files while still being relayed to the main R session at the end. This way we can inspect cluster jobs while they still run, among other things. 
Here is a proof-of-concept example using a &lsquo;batchtools_local&rsquo; future:</p> <pre><code class="language-r">&gt; library(future.batchtools)
&gt; plan(batchtools_local, split = TRUE)
&gt; f &lt;- future({ message(&quot;Hello world&quot;); y &lt;- 42; print(y); sqrt(y) })
&gt; v &lt;- value(f)
[1] 42
Hello world
&gt; v
[1] 6.480741
&gt; loggedOutput(f)
 [1] &quot;### [bt]: This is batchtools v0.9.14&quot;
 [2] &quot;### [bt]: Starting calculation of 1 jobs&quot;
 [3] &quot;### [bt]: Setting working directory to '/home/alice/repositories/future'&quot;
 [4] &quot;### [bt]: Memory measurement disabled&quot;
 [5] &quot;### [bt]: Starting job [batchtools job.id=1]&quot;
 [6] &quot;### [bt]: Setting seed to 15794 ...&quot;
 [7] &quot;Hello world&quot;
 [8] &quot;[1] 42&quot;
 [9] &quot;&quot;
[10] &quot;### [bt]: Job terminated successfully [batchtools job.id=1]&quot;
[11] &quot;### [bt]: Calculation finished!&quot;
</code></pre> <p>Without <code>split = TRUE</code>, we would not get lines 7 and 8 in the <strong>batchtools</strong> logs.</p> <h1 id="near-live-progress-updates-also-from-multicore-futures">Near-live progress updates also from &lsquo;multicore&rsquo; futures</h1> <p>Second out among the new features are &lsquo;multicore&rsquo; futures, which now join &lsquo;sequential&rsquo;, &lsquo;multisession&rsquo;, and (local and remote) &lsquo;cluster&rsquo; futures in the ability to relay progress updates from <strong><a href="https://cran.r-project.org/package=progressr">progressr</a></strong> in a near-live fashion. This means that all of our most common parallelization backends support near-live progress updates. If this is the first time you hear of <strong>progressr</strong>, here&rsquo;s an example of how it can be used in parallel processing:</p> <pre><code class="language-r">library(future.apply)
plan(multicore)

library(progressr)
handlers(&quot;progress&quot;)

xs &lt;- 1:5
with_progress({
  p &lt;- progressor(along = xs)
  y &lt;- future_lapply(xs, function(x, ...) {
    Sys.sleep(6.0 - x)
    p(sprintf(&quot;x=%g&quot;, x))
    sqrt(x)
  })
})
# [=================&gt;------------------------------] 40% x=2
</code></pre> <p>Note that each progress update signaled by <code>p()</code> updates the progress bar almost instantly, even if the parallel workers run on a remote machine.</p> <h1 id="multisession-futures-agile-to-changes-in-r-s-library-path">Multisession futures agile to changes in R&rsquo;s library path</h1> <p>Third out are &lsquo;multisession&rsquo; futures, which now automatically inherit the package library path from the main R session. For instance, if you use <code>.libPaths()</code> to adjust your library path and <em>then</em> call <code>plan(multisession)</code>, the multisession workers will see the same packages as the parent session. This change is based on a feature request related to RStudio Connect. With this update, it no longer matters which type of local futures you use - &lsquo;sequential&rsquo;, &lsquo;multisession&rsquo;, or &lsquo;multicore&rsquo; - your future code has access to the same set of installed packages.</p> <p>As a proof of concept, assume that we add <code>tempdir()</code> as a new folder to R&rsquo;s library path:</p> <pre><code class="language-r">&gt; .libPaths(c(tempdir(), .libPaths()))
&gt; .libPaths()
[1] &quot;/tmp/alice/RtmpwLKdrG&quot;
[2] &quot;/home/alice/R/x86_64-pc-linux-gnu-library/4.0-custom&quot;
[3] &quot;/home/alice/software/R-devel/tags/R-4-0-3/lib/R/library&quot;
</code></pre> <p>If we then launch a &lsquo;multisession&rsquo; future, we find that it uses the same library path:</p> <pre><code class="language-r">&gt; library(future)
&gt; plan(multisession)
&gt; f &lt;- future(.libPaths())
&gt; value(f)
[1] &quot;/tmp/alice/RtmpwLKdrG&quot;
[2] &quot;/home/alice/R/x86_64-pc-linux-gnu-library/4.0-custom&quot;
[3] &quot;/home/alice/software/R-devel/tags/R-4-0-3/lib/R/library&quot;
</code></pre> <h1 id="best-practices-for-package-developers">Best practices for package developers</h1> 
<p>I&rsquo;ve added a vignette &lsquo;<a href="https://cran.r-project.org/web/packages/future/vignettes/future-7-for-package-developers.html">Best Practices for Package Developers</a>&rsquo;, which hopefully provides some useful guidelines on how to write and validate future code so it will work on as many parallel backends as possible.</p> <h1 id="saying-goodbye-to-multiprocess-but-don-t-worry">Saying goodbye to &lsquo;multiprocess&rsquo; - but don&rsquo;t worry &hellip;</h1> <p>OK, let&rsquo;s discuss what is being removed. Using <code>plan(multiprocess)</code>, which was just an alias for &ldquo;<code>plan(multicore)</code> on Linux and macOS and <code>plan(multisession)</code> on MS Windows&rdquo;, is now deprecated. If used, you will get a one-time warning:</p> <pre><code class="language-r">&gt; plan(multiprocess)
Warning message:
Strategy 'multiprocess' is deprecated in future (&gt;= 1.20.0). Instead, explicitly specify either 'multisession' or 'multicore'. In the current R session, 'multiprocess' equals 'multicore'.
</code></pre> <p>I recommend that you use <code>plan(multisession)</code> as a replacement for <code>plan(multiprocess)</code>. If you are on Linux or macOS, and are 100% sure that your code and all its dependencies are fork-safe, then you can also use <code>plan(multicore)</code>.</p> <p>Although &lsquo;multiprocess&rsquo; was neat to use in documentation and examples, it was at the same time ambiguous, and it risked introducing platform-dependent behavior to those examples. For instance, it could be that the parallel code worked only for users on Linux and macOS because some non-exportable globals were used. If a user on MS Windows tried the same code, they might have gotten run-time errors. Vice versa, it could also be that code worked on MS Windows but not on Linux or macOS. Moreover, in <strong>future</strong> 1.13.0 (2019-05-08), support for &lsquo;multicore&rsquo; futures was disabled when running R via RStudio. 
This was done because forked parallel processing was deemed unstable in RStudio. This meant that a user on macOS who used <code>plan(multiprocess)</code> would end up getting &lsquo;multicore&rsquo; futures when running in the terminal while getting &lsquo;multisession&rsquo; futures when running in RStudio. These types of platform-specific, environment-specific user experiences were confusing and complicated troubleshooting and communication, which is why it was decided to move away from &lsquo;multiprocess&rsquo; in favor of explicitly specifying &lsquo;multisession&rsquo; or &lsquo;multicore&rsquo;.</p> <h1 id="saying-goodbye-to-local-false-a-good-thing">Saying goodbye to &lsquo;local = FALSE&rsquo; - a good thing</h1> <p>In an effort to refine the Future API, the use of <code>future(..., local = FALSE)</code> is now deprecated. The only place where it is still supported, for backward-compatibility reasons, is when using &lsquo;cluster&rsquo; futures that are persistent, i.e. <code>plan(cluster, ..., persistent = TRUE)</code>. If you use the latter, I recommend that you start thinking about moving away from using <code>local = FALSE</code> also in those cases. Although <code>persistent = TRUE</code> is rarely used, I am aware that some of you have use cases that require objects to remain on the parallel workers even after a future has been resolved. If you have such needs, please see <a href="https://github.com/HenrikBengtsson/future/issues/433">future Issue #433</a>, particularly the parts on &ldquo;sticky globals&rdquo;. Feel free to add your comments and suggestions for how we could best move forward on this. 
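</p> <p>As a sketch of the backend-agnostic alternative (a made-up example, not from the original post): rather than relying on objects persisting on a worker, return what you need from one future and pass it explicitly to the next:</p> <pre><code class="language-r">library(future)
plan(multisession)
f1 &lt;- future(lm(dist ~ speed, data = cars))
fit &lt;- value(f1)   # bring the result back to the main session ...
f2 &lt;- future(predict(fit, newdata = data.frame(speed = 10)))
pred &lt;- value(f2)  # ... and use it in a follow-up future
</code></pre> <p>Because no state is assumed to linger on the workers between futures, this pattern works identically on every backend.</p> <p>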
The long-term goal is to get rid of both <code>local</code> and <code>persistent</code> in order to harmonize the Future API across <em>all</em> future backends.</p> <p>For recent bug fixes, please see the package <a href="https://cran.r-project.org/web/packages/future/NEWS">NEWS</a>.</p> <h1 id="what-s-on-the-horizon">What&rsquo;s on the horizon?</h1> <p>There are still lots of things on the roadmap. In no specific order, here are a few of the things in the works:</p> <ul> <li><p>Sticky globals for caching globals on workers. This will decrease the number of globals that need to be exported when launching futures. It addresses several related feature requests, e.g. future Issues <a href="https://github.com/HenrikBengtsson/future/issues/273">#273</a>, <a href="https://github.com/HenrikBengtsson/future/issues/339">#339</a>, <a href="https://github.com/HenrikBengtsson/future/issues/346">#346</a>, <a href="https://github.com/HenrikBengtsson/future/issues/431">#431</a>, and <a href="https://github.com/HenrikBengtsson/future/issues/437">#437</a>.</p></li> <li><p>Ability to terminate futures (for backends supporting it), which opens up the possibility of restarting failed futures and more. This is a frequently requested feature, e.g. Issues <a href="https://github.com/HenrikBengtsson/future/issues/93">#93</a>, <a href="https://github.com/HenrikBengtsson/future/issues/188">#188</a>, <a href="https://github.com/HenrikBengtsson/future/issues/205">#205</a>, <a href="https://github.com/HenrikBengtsson/future/issues/213">#213</a>, and <a href="https://github.com/HenrikBengtsson/future/issues/236">#236</a>.</p></li> <li><p>Optional, zero-cost generic hook functions. Having them in place opens up the possibility of adding a framework for time-and-memory profiling/benchmarking of futures and their backends. Being able to profile futures and their backends will help identify bottlenecks and improve the performance of some of our parallel backends, e.g. 
Issues <a href="https://github.com/HenrikBengtsson/future/issues/49">#59</a>, <a href="https://github.com/HenrikBengtsson/future/issues/142">#142</a>, <a href="https://github.com/HenrikBengtsson/future/issues/239">#239</a>, and <a href="https://github.com/HenrikBengtsson/future/issues/437">#437</a>.</p></li> <li><p>Add support for global calling handlers in <strong>progressr</strong>. This is not specific to the future framework, but since it&rsquo;s closely related, I figured I&rsquo;d mention it here too. A global calling handler for progress updates would remove the need to use <code>with_progress()</code> when monitoring progress. This would also help resolve the common problem where package developers want to provide progress updates without having to ask the user to use <code>with_progress()</code>, e.g. <strong>progressr</strong> Issues <a href="https://github.com/HenrikBengtsson/progressr/issues/78">#78</a>, <a href="https://github.com/HenrikBengtsson/progressr/issues/83">#83</a>, and <a href="https://github.com/HenrikBengtsson/progressr/issues/85">#85</a>.</p></li> </ul> <p>That&rsquo;s all for now - Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>, 
2018-07-23</li> <li><a href="https://www.jottr.org/2020/07/04/progressr-erum2020-slides/">e-Rum 2020 Slides on Progressr</a>, 2020-07-04</li> </ul> </description>
</item>
<item>
<title>parallelly, future - Cleaning Up Around the House</title>
<link>https://www.jottr.org/2020/11/04/parallelly-future-cleaning-up-around-the-house/</link>
<pubDate>Wed, 04 Nov 2020 18:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2020/11/04/parallelly-future-cleaning-up-around-the-house/</guid>
<description> <blockquote cite="https://www.merriam-webster.com/dictionary/parallelly" style="font-size: 150%"> <strong>parallelly</strong> adverb<br> par·al·lel·ly | \ ˈpa-rə-le(l)li \ <br> Definition: in a parallel manner </blockquote> <blockquote cite="https://www.merriam-webster.com/dictionary/future" style="font-size: 150%"> <strong>future</strong> noun<br> fu·ture | \ ˈfyü-chər \ <br> Definition: existing or occurring at a later time </blockquote> <p>I&rsquo;ve cleaned up around the house - with the recent release of <strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.20.1, the package gained a dependency on the new <strong><a href="https://cran.r-project.org/package=parallelly">parallelly</a></strong> package. Now, if you&rsquo;re like me and concerned about bloating package dependencies, I&rsquo;m sure you immediately wondered why I chose to introduce a new dependency. I&rsquo;ll try to explain this below, but let me start by clarifying a few things:</p> <ul> <li><p>The functions in the <strong>parallelly</strong> package used to be part of the <strong>future</strong> package</p></li> <li><p>The functions have been removed from <strong>future</strong>, making that package smaller, while the total installation &ldquo;weight&rdquo; remains about the same once <strong>parallelly</strong> is added</p></li> <li><p>The <strong>future</strong> package re-exports these functions, i.e. 
for the time being, everything works as before</p></li> </ul> <p>Specifically, I’ve moved the following functions from the <strong>future</strong> package to the <strong>parallelly</strong> package:</p> <ul> <li><code>as.cluster()</code> - Coerce an object to a &lsquo;cluster&rsquo; object</li> <li><code>c(...)</code> - Combine multiple &lsquo;cluster&rsquo; objects into a single, large cluster</li> <li><code>autoStopCluster()</code> - Automatically stop a &lsquo;cluster&rsquo; when garbage collected</li> <li><code>availableCores()</code> - Get number of available cores on the current machine; a better, safer alternative to <code>parallel::detectCores()</code></li> <li><code>availableWorkers()</code> - Get set of available workers</li> <li><code>makeClusterPSOCK()</code> - Create a PSOCK cluster of R workers for parallel processing; a more powerful alternative to <code>parallel::makePSOCKcluster()</code></li> <li><code>makeClusterMPI()</code> - Create a message passing interface (MPI) cluster of R workers for parallel processing; a tweaked version of <code>parallel::makeMPIcluster()</code></li> <li><code>supportsMulticore()</code> - Check if forked processing (&ldquo;multicore&rdquo;) is supported</li> </ul> <p>Because these are re-exported as-is, you can still use them as if they were part of the <strong>future</strong> package. For example, you may now use <code>availableCores()</code> as</p> <pre><code class="language-r">ncores &lt;- parallelly::availableCores() </code></pre> <p>or keep using it as</p> <pre><code class="language-r">ncores &lt;- future::availableCores() </code></pre> <p>One reason for moving these functions to a separate package is to make them readily available also outside of the future framework. 
For instance, using <code>parallelly::availableCores()</code> for deciding on the number of parallel workers is a <em>much</em> better and safer alternative than using <code>parallel::detectCores()</code> - see <code>help(&quot;availableCores&quot;, package = &quot;parallelly&quot;)</code> for why. Making these functions available in a lightweight package will attract additional users and developers who are not using futures. More users means more real-world validation, more vetting, and more feedback, which will improve these functions further and, indirectly, also the future framework.</p> <p>Another reason is that several of the functions in <strong>parallelly</strong> are bug fixes and improvements to functions in the <strong>parallel</strong> package. By extracting these functions from the <strong>future</strong> package and putting them in a standalone package, it should be clearer what these improvements are. At the same time, it should lower the threshold for getting these improvements into the <strong>parallel</strong> package, where I hope they will end up one day. <em>The <strong>parallelly</strong> package comes with an open invitation to the R Core to incorporate <strong>parallelly</strong>&rsquo;s implementation or ideas into <strong>parallel</strong>.</em></p> <p>For users of the future framework, maybe the most important reason for this migration is <em>speedier implementation of improvements and feature requests for the <strong>future</strong> package and the future ecosystem</em>. Over the years, many discussions around enhancing <strong>future</strong> came down to enhancing the functions that are now part of the <strong>parallelly</strong> package, especially for adding new features to <code>makeClusterPSOCK()</code>, which is the internal workhorse for setting up &lsquo;multisession&rsquo; parallel workers but is also used explicitly by many when setting up other types of &lsquo;cluster&rsquo; workers. 
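</p> <p>As a minimal sketch (the two-worker count is arbitrary), these functions can also be used entirely on their own, without futures:</p> <pre><code class="language-r">library(parallelly)
## Unlike parallel::detectCores(), availableCores() respects R options,
## environment variables, and job-scheduler settings
ncores &lt;- availableCores()
cl &lt;- makeClusterPSOCK(2)
res &lt;- parallel::parLapply(cl, 1:4, function(x) x^2)
parallel::stopCluster(cl)
</code></pre> <p>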
The roles and responsibilities of the <strong>parallelly</strong> and <strong>future</strong> packages are well separated, which should make it straightforward to further improve on these functions. For example, if we want to introduce a new argument to <code>makeClusterPSOCK()</code>, or change one of its defaults (e.g. use the faster <code>useXDR = FALSE</code>), we can now discuss and test them more quickly, often without having to bring futures into the discussion. Don&rsquo;t worry - <strong>parallelly</strong> will undergo the same <a href="https://www.jottr.org/2020/11/04/trust-the-future/">strict validation process as the <strong>future</strong> package</a> to avoid introducing breaking changes to the future framework. For example, reverse-dependency checks will be run on first-generation (e.g. <strong>future</strong>) and second-generation (e.g. <strong>future.apply</strong>, <strong>furrr</strong>, <strong>doFuture</strong>, <strong>drake</strong>, <strong>mlr3</strong>, <strong>plumber</strong>, <strong>promises</strong>, and <strong>Seurat</strong>) dependencies.</p> <p>Happy parallelly futuring!</p> <p><small> <sup>*</sup> I&rsquo;ll try to write another post in a couple of days covering the new features that come with <strong>future</strong> 1.20.1. Stay tuned. </small></p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>parallelly</strong> package: <a href="https://cran.r-project.org/package=parallelly">CRAN</a>, <a href="https://github.com/HenrikBengtsson/parallelly">GitHub</a></li> </ul> </description>
</item>
<item>
<title>Trust the Future</title>
<link>https://www.jottr.org/2020/11/04/trust-the-future/</link>
<pubDate>Wed, 04 Nov 2020 14:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2020/11/04/trust-the-future/</guid>
<description> <p><center> <img src="https://www.jottr.org/post/you_dont_have_to_worry_about_your_future.jpg" alt="A fortune cookie that reads 'You do not have to worry about your future'" style="border: solid 1px; max-width: 70%"/> </center></p> <p>Each time we use R to analyze data, we rely on the assumption that the functions used produce correct results. If we can&rsquo;t make this assumption, we have to spend a lot of time validating every nitty-gritty detail. Luckily, we don&rsquo;t have to do this. There are many reasons why we can comfortably use R for our analyses, and some of them are unique to R. Here are some I could think of while writing this blog post - I&rsquo;m sure I forgot something:</p> <ul> <li><p>R is a functional language with few side effects (&ldquo;just like mathematical functions&rdquo;)</p></li> <li><p>R, and its predecessor S, has undergone lots of real-world validation over the last two to three decades</p></li> <li><p>Millions of users and developers use and vet R regularly, which increases the chances of detecting mistakes and bugs</p></li> <li><p>R has one established, agreed-upon framework for validating an R package: <code>R CMD check</code></p></li> <li><p>The majority of R packages are distributed through a single repository (CRAN)</p></li> <li><p>CRAN requires that all R packages pass checks on past, current, and upcoming R versions, across operating systems (MS Windows, Linux, macOS, and Solaris), and on different compilers</p></li> <li><p>New checks are continuously added to <code>R CMD check</code>, causing the quality of new and existing R packages to improve over time</p></li> <li><p>CRAN asserts that package updates do not break reverse package dependencies</p></li> <li><p>R developers spend a substantial amount of time validating their packages</p></li> <li><p>R has users and developers with various backgrounds and areas of expertise</p></li> <li><p>R has a community that actively engages in discussions on best practices, 
troubleshooting, bug fixes, testing, and language development</p></li> <li><p>There are many third-party contributed tools for developing and testing R packages</p></li> </ul> <p>I think <a href="https://twitter.com/j_v_66">Jan Vitek</a> summarized it well in the &lsquo;Why R?&rsquo; panel discussion on <a href="https://youtu.be/uiEhmKN1RJo?t=1917">&lsquo;Performance in R&rsquo;</a> on 2020-09-26:</p> <blockquote> <p>R is an ecosystem. It is not a language. The language is the little bit on top. You come for the ecosystem - the books, all of the questions and answers, the snippets of code, the quality of CRAN. &hellip; The quality assurance that CRAN brings &hellip; we don&rsquo;t have that in any other language that I know of.</p> </blockquote> <p>Without the above technical and social ecosystem, I believe the quality of my own R packages would have been substantially lower. Regardless of how many unit tests I would write, I could never achieve the same amount of validation that the full R ecosystem brings to the table.</p> <p>When you use the <a href="https://cran.r-project.org/package=future">future framework for parallel and distributed processing</a>, it is essential that it delivers a level of correctness and reproducibility corresponding to what you get when implementing the same task sequentially. Because of this, validation is a <em>top priority</em> and part of the design and implementation throughout the future ecosystem. Below, I summarize how it is validated:</p> <ul> <li><p>All the essential core packages that are part of the future framework - <strong><a href="https://cran.r-project.org/package=future">future</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=globals">globals</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=listenv">listenv</a></strong>, and <strong><a href="https://cran.r-project.org/package=parallelly">parallelly</a></strong> - implement a rich set of package tests. 
These are validated regularly across the wide range of operating systems (Linux, Solaris, macOS, and MS Windows) and R versions available on CRAN, on continuous integration (CI) services (<a href="https://github.com/features/actions">GitHub Actions</a>, <a href="https://travis-ci.org/">Travis CI</a>, and <a href="https://www.appveyor.com/">AppVeyor CI</a>), and on <a href="https://builder.r-hub.io/">R-hub</a>.</p></li> <li><p>For each new release, these packages undergo full reverse-package dependency checks using <strong><a href="https://github.com/r-lib/revdepcheck">revdepcheck</a></strong>. As of October 2020, the <strong>future</strong> package is tested against more than 140 direct reverse-package dependencies available on CRAN and Bioconductor, including packages <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>, <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>, <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, <strong><a href="https://cran.r-project.org/package=drake">drake</a></strong>, <strong><a href="https://cran.r-project.org/package=googleComputeEngineR">googleComputeEngineR</a></strong>, <strong><a href="https://cran.r-project.org/package=mlr3">mlr3</a></strong>, <strong><a href="https://cran.r-project.org/package=plumber">plumber</a></strong>, <strong><a href="https://cran.r-project.org/package=promises">promises</a></strong> (used by <strong><a href="https://cran.r-project.org/package=shiny">shiny</a></strong>), and <strong><a href="https://cran.r-project.org/package=Seurat">Seurat</a></strong>. 
These checks are performed on Linux with both the default settings and when forcing tests to use multisession workers (SOCK clusters), which further validates that globals and packages are identified correctly.</p></li> <li><p>A suite of <em>Future API conformance tests</em> available in the <strong><a href="https://cran.r-project.org/package=future.tests">future.tests</a></strong> package validates the correctness of all future backends. Any new future backend developed must pass these tests to comply with the <em>Future API</em>. By conforming to this API, the end-user can trust that the backend will produce the same correct and reproducible results as any other backend, including the ones that the developer has tested on. Also, by making it the responsibility of the developer to assert that their new future backend conforms to the <em>Future API</em>, we relieve other developers from having to test that their future-based software works on all backends. It would be a daunting task for a developer to validate the correctness of their software with all existing backends. Even if they achieved that, there may be additional third-party future backends that they are not aware of, that they have no way to test with, or that are yet to be developed. The <strong>future.tests</strong> framework was sponsored by an <a href="https://www.r-consortium.org/projects/awarded-projects">R Consortium ISC grant</a>.</p></li> <li><p>Since <strong><a href="https://CRAN.R-Project.org/package=foreach">foreach</a></strong> is used by a large number of essential CRAN packages, it provides an excellent opportunity for supplementary validation. 
Specifically, I dynamically tweak the examples of <strong><a href="https://CRAN.R-Project.org/package=foreach">foreach</a></strong> and popular CRAN packages <strong><a href="https://CRAN.R-Project.org/package=caret">caret</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=glmnet">glmnet</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=NMF">NMF</a></strong>, <strong><a href="https://CRAN.R-Project.org/package=plyr">plyr</a></strong>, and <strong><a href="https://CRAN.R-Project.org/package=TSP">TSP</a></strong> to use the <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> adaptor. This allows me to run these examples with a variety of future backends to validate that the examples produce no run-time errors, which indirectly validates the backends and the <em>Future API</em>. In the past, these types of tests helped to identify and resolve corner cases where automatic identification of global variables would fail. As a side note, several of these foreach-based examples fail when using a parallel foreach adaptor because they do not properly export globals or declare package dependencies. 
The exception is when using the sequential <em>doSEQ</em> adaptor (default), fork-based ones such as <strong><a href="https://CRAN.R-Project.org/package=doMC">doMC</a></strong>, or the generic <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, which supports any future backend and relies on the future framework for handling globals and packages.</p></li> <li><p>Analogously to the above reverse-dependency checks of each new release, CRAN and Bioconductor continuously run checks on all these direct, but also indirect, reverse dependencies, which further increases the validation of the <em>Future API</em> and the future ecosystem at large.</p></li> </ul> <p>May the future be with you!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.tests</strong> package: <a href="https://cran.r-project.org/package=future.tests">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> </ul> </description>
</item>
<item>
<title>future 1.19.1 - Making Sure Proper Random Numbers are Produced in Parallel Processing</title>
<link>https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/</link>
<pubDate>Tue, 22 Sep 2020 19:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/</guid>
<description> <p><center> <img src="https://www.jottr.org/post/Digital_rain_animation_medium_letters_clear.gif" alt="&quot;Animation of green Japanese Kana symbols raining down in parallel on a black background inspired by The Matrix movie&quot;" /> <small><em>Parallel &lsquo;Digital Rain&rsquo; by <a href="https://commons.wikimedia.org/w/index.php?curid=63377054">Jahobr</a></em></small> </center></p> <p>After two-and-a-half months, <strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.19.1 is now on CRAN. As usual, there are some bug fixes and minor improvements here and there (<a href="https://cran.r-project.org/web/packages/future/NEWS">NEWS</a>), including things needed by the next version of <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>. For those of you who use Slurm or LSF/OpenLava as a scheduler on your high-performance compute (HPC) cluster, <code>future::availableCores()</code> will now do a better job respecting the CPU resources that those schedulers allocate for your R jobs.</p> <p>With all that said, the most significant update is that <strong>an informative warning is now given if random numbers were produced unexpectedly</strong>. Here &ldquo;unexpectedly&rdquo; means that the developer did not declare that their code needs random numbers.</p> <p>If you are just interested in the updates regarding random numbers and how to make sure your code is compliant, skip down to the section on &lsquo;<a href="#random-number-generation-in-the-future-framework">Random Number Generation in the Future Framework</a>&rsquo;. If you are curious how R generates random numbers and how that matters when we use parallel processing, keep on reading.</p> <p><em>Disclaimer</em>: I should clarify that, although I understand some algorithms and statistical aspects behind random number generation, my knowledge is limited. If you find mistakes below, please let me know so I can correct them. 
If you have ideas on how to improve this blog post, or parallel random number generation, I am grateful for such suggestions.</p> <h2 id="random-number-generation-in-r">Random Number Generation in R</h2> <p>Being able to generate high-quality random numbers is essential in many areas. For example, we use random number generation in cryptography to produce public-private key pairs. If there is a correlation in the random numbers produced, there is a risk that someone can reverse engineer the private key. In statistics, we need random numbers in simulation studies, bootstrap, and permutation tests. The correctness of these methods relies on the assumption that the random numbers drawn are &ldquo;as random as possible&rdquo;. What we mean by &ldquo;as random as possible&rdquo; depends on context and there are several ways to measure &ldquo;amount of randomness&rdquo;, e.g. the amount of autocorrelation in the sequence of numbers produced.</p> <p>As developers, statisticians, and data scientists, we often have better things to do than validating the quality of random numbers. Instead, we just want to rely on the computer to produce random numbers that are &ldquo;good enough.&rdquo; This is often safe to do because most programming languages produce high-quality random numbers out of the box. However, <strong>when we run our algorithms in parallel, random number generation becomes more complicated</strong> and we have to make efforts to get it right.</p> <p>In software, a so-called <em>random number generator</em> (RNG) produces all random numbers. Although hardware RNGs exist (e.g. thermal noise), by far the most common way to produce random numbers is through a pseudo RNG. A pseudo RNG uses an algorithm that produces a sequence of numbers that appear to be random but is fully deterministic given its initial state. 
For example, in R, we can draw one or more (pseudo) random numbers in $[0,1]$ using <code>runif()</code>, e.g.</p> <pre><code class="language-r">&gt; runif(n = 5) [1] 0.9400145 0.9782264 0.1174874 0.4749971 0.5603327 </code></pre> <p>We can control the RNG state via <code>set.seed()</code>, e.g.</p> <pre><code class="language-r">&gt; set.seed(42) &gt; runif(n = 5) [1] 0.9148060 0.9370754 0.2861395 0.8304476 0.6417455 </code></pre> <p>If we use this technique, we can regenerate the same pseudo random numbers at a later time if we reset to the same initial RNG state, i.e.</p> <pre><code class="language-r">&gt; set.seed(42) &gt; runif(n = 5) [1] 0.9148060 0.9370754 0.2861395 0.8304476 0.6417455 </code></pre> <p>This also works after restarting R, on other computers, and on other operating systems. Being able to set the initial RNG state this way allows us to produce numerically reproducible results even when the methods involved rely on randomness.</p> <p>We do not have to set the RNG state, which is also referred to as &ldquo;the random seed&rdquo;. If not set, R uses a &ldquo;random&rdquo; initial RNG state based on various &ldquo;random&rdquo; properties such as the current timestamp and the process ID of the current R session. Because of this, we rarely have to set the random seed and things just work.</p> <h2 id="random-number-generation-for-parallel-processing">Random Number Generation for Parallel Processing</h2> <p>R does a superb job of taking care of us when it comes to random number generation - as long as we run our analysis sequentially in a single R process. Formally, R uses the Mersenne Twister RNG algorithm [1] by default, which we can set explicitly using <code>RNGkind(&quot;Mersenne-Twister&quot;)</code>. However, like many other RNG algorithms, the authors designed this one for generating random numbers sequentially, not in parallel. 
If we use it in parallel code, there is a risk that there will be a correlation between the random numbers generated in parallel, and, when taken together, they may no longer be &ldquo;random enough&rdquo; for our needs.</p> <p>A not-so-uncommon, ad hoc attempt to overcome this problem is to set a unique random seed for each parallel iteration, e.g.</p> <pre><code class="language-r">library(parallel) cl &lt;- makeCluster(4) y &lt;- parLapply(cl, 1:10, function(i) { set.seed(i) runif(n = 5) }) stopCluster(cl) </code></pre> <p>The idea is that although <code>i</code> and <code>i+1</code> are deterministic, <code>set.seed(i)</code> and <code>set.seed(i+1)</code> will set two different RNG states that are &ldquo;non-deterministic&rdquo; compared to each other, e.g. if we know one of them, we cannot predict the other. We can also find other variants of this approach. For instance, we can pre-generate a set of &ldquo;random&rdquo; random seeds and use them one-by-one in each iteration;</p> <pre><code class="language-r">library(parallel) cl &lt;- makeCluster(4) set.seed(42) seeds &lt;- sample.int(n = 10) y &lt;- parLapply(cl, seeds, function(seed) { set.seed(seed) runif(n = 5) }) stopCluster(cl) </code></pre> <p><strong>However, these approaches do <em>not</em> guarantee high-quality random numbers</strong>. Although not parallel-safe by itself, the latter approach resembles the gist of RNG algorithms designed for parallel processing.</p> <p>The L&rsquo;Ecuyer Combined Multiple Recursive random number Generators (CMRG) method [2,3] provides an RNG algorithm that also works for parallel processing. R has built-in support for this method via the <strong>parallel</strong> package. See <code>help(&quot;nextRNGStream&quot;, package = &quot;parallel&quot;)</code> for additional information. 
One way to use this is:</p> <pre><code class="language-r">library(parallel) cl &lt;- makeCluster(4) RNGkind(&quot;L'Ecuyer-CMRG&quot;) set.seed(42) seeds &lt;- list(.Random.seed) for (i in 2:10) seeds[[i]] &lt;- nextRNGStream(seeds[[i - 1]]) y &lt;- parLapply(cl, seeds, function(seed) { assign(&quot;.Random.seed&quot;, seed, envir = globalenv()) runif(n = 5) }) stopCluster(cl) </code></pre> <p>Note the similarity to the previous attempt above. Also note that <code>.Random.seed</code> must be assigned in the worker&rsquo;s global environment - a plain local assignment would have no effect on the RNG state. For convenience, R provides <code>parallel::clusterSetRNGStream()</code>, which allows us to do:</p> <pre><code class="language-r">library(parallel) cl &lt;- makeCluster(4) clusterSetRNGStream(cl, iseed = 42) y &lt;- parLapply(cl, 1:10, function(i) { runif(n = 5) }) stopCluster(cl) </code></pre> <p><em>Comment</em>: Contrary to the manual approach, <code>clusterSetRNGStream()</code> does not create one RNG seed per iteration (here ten) but one per worker (here four). Because of this, the two examples will <em>not</em> produce the same random numbers despite using the same initial seed (42). When using <code>clusterSetRNGStream()</code>, the sequence of random numbers produced will depend on the number of parallel workers used, meaning the results will not be numerically identical unless we use the same number of parallel workers. Having said this, we are using a parallel-safe RNG algorithm here, so we still get high-quality random numbers without the risk of compromising our statistical analysis, if that is what we are running.</p> <h2 id="random-number-generation-in-the-future-framework">Random Number Generation in the Future Framework</h2> <p>The <strong><a href="https://cran.r-project.org/package=future">future</a></strong> framework, which provides a unifying approach to parallel processing in R, uses the L&rsquo;Ecuyer CMRG algorithm to generate all random numbers. There is no need to specify <code>RNGkind(&quot;L'Ecuyer-CMRG&quot;)</code> - if not already set, the future framework will still use it internally. 
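For example, here is a minimal sketch (the multisession backend and worker count are just example choices) of how a single future declares its need for parallel-safe random numbers via the <code>seed</code> argument:</p> <pre><code class="language-r">library(future)
plan(multisession, workers = 2)
## seed = TRUE makes the future draw from a parallel-safe L'Ecuyer-CMRG stream
f &lt;- future(runif(n = 5), seed = TRUE)
y &lt;- value(f)
</code></pre> <p>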
At the lowest level, the Future API supports specifying the random seed for each individual future. However, most developers and end-users use the higher-level map-reduce APIs provided by the <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> and <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong> packages, which provide &ldquo;seed&rdquo; arguments for controlling the RNG behavior. Importantly, generating L&rsquo;Ecuyer-CMRG RNG streams comes with a significant overhead. Because of this, the default is to <em>not</em> generate them. If we intend to produce random numbers, we need to specify that via the &ldquo;seed&rdquo; argument, e.g.</p> <pre><code class="language-r">library(future.apply) y &lt;- future_lapply(1:10, function(i) { runif(n = 5) }, future.seed = TRUE) </code></pre> <p>and</p> <pre><code class="language-r">library(furrr) y &lt;- future_map(1:10, function(i) { runif(n = 5) }, .options = future_options(seed = TRUE)) </code></pre> <p>Contrary to generating RNG streams, checking if a future has used random numbers is quick. All we have to do is to keep track of the RNG state and check if it is still the same afterward (after the future has been resolved). Starting with <strong>future</strong> 1.19.0, <strong>the future framework will warn us whenever we use the RNG without declaring it</strong>. For instance, 
To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to &quot;ignore&quot;. </code></pre> <p>Although technically unnecessary, this warning will also be produced when running sequentially. This is to make sure that all future-based code will produce correct results when switching to a parallel backend.</p> <p>When using <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> the best practice is to use the <strong><a href="https://cran.r-project.org/package=doRNG">doRNG</a></strong> package to produce parallel-safe random numbers. This is true regardless of foreach adaptor and parallel backend used. Specifically, instead of using <code>%dopar%</code> we want to use <code>%dorng%</code>. For example, here is what it looks like if we use the <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> adaptor;</p> <pre><code class="language-r">library(foreach) library(doRNG) doFuture::registerDoFuture() future::plan(&quot;multisession&quot;) y &lt;- foreach(i = 1:10) %dorng% { runif(n = 5) } </code></pre> <p>The benefit of using the <strong>doFuture</strong> adaptor is that it will also detect when we, or packages that use <strong>foreach</strong>, forget to declare that the RNG is needed, e.g.</p> <pre><code class="language-r">y &lt;- foreach(i = 1:10) %dopar% { runif(n = 5) } Warning messages: 1: UNRELIABLE VALUE: Future ('doFuture-1') unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to &quot;ignore&quot;. ... 
</code></pre> <p>Note that there will be one warning per future, which, in the above examples, means one warning per parallel worker.</p> <p>If you are an end-user of a package that uses futures internally and you get these warnings, please report them to the maintainer of that package. You might have to use <code>options(warn = 2)</code> to upgrade to an error and then <code>traceback()</code> to track down from where the warning originates. It is not unlikely that they have forgotten about, or are not aware of, the need for a proper RNG in parallel processing. Regardless, the fix is for them to declare <code>future.seed = TRUE</code>. If these warnings are irrelevant and the maintainer does not believe there is an RNG issue, then they can declare that using <code>future.seed = NULL</code>, e.g.</p> <pre><code class="language-r">y &lt;- future_lapply(X, function(x) { ... }, future.seed = NULL) </code></pre> <p>The default is <code>future.seed = FALSE</code>, which means &ldquo;no random numbers will be produced, and if there are, then it is a mistake.&rdquo;</p> <p>Until the maintainer has corrected this, as an end-user you can silence these warnings by setting:</p> <pre><code class="language-r">options(future.rng.onMisuse = &quot;ignore&quot;) </code></pre> <p>which was the default until <strong>future</strong> 1.19.0. 
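As mentioned above, one way to track down where an undeclared-RNG warning originates is to promote warnings to errors and then inspect the call stack. A sketch of that workflow, where <code>slow_fcn</code> stands in for a hypothetical function that draws random numbers without declaring it:</p> <pre><code class="language-r">library(future.apply)
options(warn = 2)                    # promote warnings to errors
y &lt;- future_lapply(1:10, slow_fcn)   # the call that triggers the RNG warning
traceback()                          # locate where the RNG was used
</code></pre> <p>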
If you want to be conservative, you can even upgrade the warning to a run-time error by setting this option to <code>&quot;error&quot;</code>.</p> <p>If you are a developer and struggle to narrow down exactly which part of your code uses random number generation, see my blog post &lsquo;<a href="https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/">Detect When the Random Number Generator Was Used</a>&rsquo; for an example of how you can track the RNG state at the R prompt and get a notification whenever a function call used the RNG internally.</p> <h2 id="what-s-next-regarding-rng-and-futures">What&rsquo;s next regarding RNG and futures?</h2> <ul> <li><p>The higher-level map-reduce APIs in the future framework support perfectly reproducible random numbers regardless of future backend and number of parallel workers being used. This is convenient because it allows us to get identical results when we, for instance, move from a notebook to an HPC environment. The downside is that this RNG strategy requires that one RNG stream is created per iteration, which is expensive when there are many elements to iterate over. If one does not need numerically reproducible random numbers, then it would be sufficient and valid to produce one RNG stream per chunk, where we often have one chunk per worker, similar to what <code>parallel::clusterSetRNGStream()</code> does. It has been on the roadmap for a while to <a href="https://github.com/HenrikBengtsson/future.apply/issues/20">add support for per-chunk RNG streams</a> as well. The remaining thing we need to resolve is to decide on exactly how to specify that type of strategy, e.g. <code>future_lapply(..., future.seed = &quot;per-chunk&quot;)</code> versus <code>future_lapply(..., future.seed = &quot;per-element&quot;)</code>, where the latter is an alternative to today&rsquo;s <code>future.seed = TRUE</code>. 
I will probably address this in a new utility package <strong>future.mapreduce</strong> that can serve <strong>future.apply</strong>, <strong>furrr</strong>, and the like, so that they do not have to re-implement this locally, which is error prone and is how it works at the moment.</p></li> <li><p>L&rsquo;Ecuyer CMRG is not the only RNG algorithm designed for parallel processing, and some developers might want to use another method. There are already many CRAN packages that provide alternatives, e.g. <strong><a href="https://cran.r-project.org/package=dqrng">dqrng</a></strong>, <strong><a href="https://cran.r-project.org/package=qrandom">qrandom</a></strong>, <strong><a href="https://cran.r-project.org/package=random">random</a></strong>, <strong><a href="https://cran.r-project.org/package=randtoolbox">randtoolbox</a></strong>, <strong><a href="https://cran.r-project.org/package=rlecuyer">rlecuyer</a></strong>, <strong><a href="https://cran.r-project.org/package=rngtools">rngtools</a></strong>, <strong><a href="https://cran.r-project.org/package=rngwell19937">rngwell19937</a></strong>, <strong><a href="https://cran.r-project.org/package=rstream">rstream</a></strong>, <strong><a href="https://cran.r-project.org/package=rTRNG">rTRNG</a></strong>, and <strong><a href="https://cran.r-project.org/package=sitmo">sitmo</a></strong>. It is on the long-term roadmap to support other types of parallel RNG methods. It will require a fair bit of work to come up with a unifying API for this and then a substantial amount of testing and validation to make sure it is correct.</p></li> </ul> <p>Happy random futuring!</p> <h2 id="references">References</h2> <ol> <li><p>Matsumoto, M. and Nishimura, T. (1998). Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator, <em>ACM Transactions on Modeling and Computer Simulation</em>, 8, 3–30.</p></li> <li><p>L&rsquo;Ecuyer, P. (1999). 
Good parameters and implementations for combined multiple recursive random number generators. <em>Operations Research</em>, 47, 159–164. doi: <a href="https://doi.org/10.1287/opre.47.1.159">10.1287/opre.47.1.159</a>.</p></li> <li><p>L&rsquo;Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002). An object-oriented random-number package with many long streams and substreams. <em>Operations Research</em>, 50, 1073–1075. doi: <a href="https://doi.org/10.1287/opre.50.6.1073.358">10.1287/opre.50.6.1073.358</a>.</p></li> </ol> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> adapter)</li> </ul> </description>
</item>
<item>
<title>Detect When the Random Number Generator Was Used</title>
<link>https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/</link>
<pubDate>Mon, 21 Sep 2020 18:45:00 -0700</pubDate>
<guid>https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/</guid>
<description><p><center> <img src="https://www.jottr.org/post/DistortedRecentEland_50pct.gif" alt="&quot;An animated close-up of a spinning roulette wheel&quot;" /> </center></p> <p>If you ever need to figure out if a function call in R generated a random number or not, here is a simple trick that you can use in an interactive R session. Add the following to your <code>~/.Rprofile</code>(*):</p> <pre><code class="language-r">if (interactive()) { invisible(addTaskCallback(local({ last &lt;- .GlobalEnv$.Random.seed function(...) { curr &lt;- .GlobalEnv$.Random.seed if (!identical(curr, last)) { msg &lt;- &quot;TRACKER: .Random.seed changed&quot; if (requireNamespace(&quot;crayon&quot;, quietly=TRUE)) msg &lt;- crayon::blurred(msg) message(msg) last &lt;&lt;- curr } TRUE } }), name = &quot;RNG tracker&quot;)) } </code></pre> <p>It works by checking whether or not the state of the random number generator (RNG), that is, <code>.Random.seed</code> in the global environment, was changed. If it has, a note is produced. For example,</p> <pre><code class="language-r">&gt; sum(1:100) [1] 5050 &gt; runif(1) [1] 0.280737 TRACKER: .Random.seed changed &gt; </code></pre> <p>It is not always obvious that a function generates random numbers internally. 
For instance, the <code>rank()</code> function may or may not update the RNG state depending on argument <code>ties.method</code>, as illustrated in the following example:</p> <pre><code class="language-r">&gt; x &lt;- c(1, 4, 3, 2) &gt; rank(x) [1] 1.0 2.5 2.5 4.0 &gt; rank(x, ties.method = &quot;random&quot;) [1] 1 3 2 4 TRACKER: .Random.seed changed &gt; </code></pre> <p>For some functions, it may even depend on the input data whether or not random numbers are generated, e.g.</p> <pre><code class="language-r">&gt; y &lt;- matrixStats::rowRanks(matrix(c(1,2,2), nrow=2, ncol=3), ties.method = &quot;random&quot;) TRACKER: .Random.seed changed &gt; y &lt;- matrixStats::rowRanks(matrix(c(1,2,3), nrow=2, ncol=3), ties.method = &quot;random&quot;) &gt; </code></pre> <p>I have this RNG tracker enabled all the time to learn about functions that unexpectedly draw random numbers internally, which can be important to know when you run statistical analysis in parallel.</p> <p>As a bonus, if you have the <strong><a href="https://cran.r-project.org/package=crayon">crayon</a></strong> package installed, the RNG tracker will output the note with a style that is less intrusive.</p> <p>(*) If you use the <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> package, you can add it to a new file <code>~/.Rprofile.d/interactive=TRUE/rng_tracker.R</code>. To learn more about the <strong>startup</strong> package, have a look at the <a href="https://www.jottr.org/tags/startup/">blog posts on <strong>startup</strong></a>.</p> <p>EDIT 2020-09-23: Changed the message prefix from &lsquo;NOTE:&rsquo; to &lsquo;TRACKER:&rsquo;.</p> </description>
</item>
<item>
<title>future and future.apply - Some Recent Improvements</title>
<link>https://www.jottr.org/2020/07/11/future-future.apply-recent-improvements/</link>
<pubDate>Sat, 11 Jul 2020 22:15:00 -0700</pubDate>
<guid>https://www.jottr.org/2020/07/11/future-future.apply-recent-improvements/</guid>
<description> <p>There are new versions of <strong><a href="https://cran.r-project.org/package=future">future</a></strong> and <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> - your friends in the parallelization business - on CRAN. These updates are mostly maintenance updates with bug fixes, some improvements, and preparations for upcoming changes. It&rsquo;s been some time since I blogged about these packages, so here is the summary of the main updates thus far since early 2020:</p> <ul> <li><p><strong>future</strong>:</p> <ul> <li><p><code>values()</code> for lists and other containers was renamed to <code>value()</code> to simplify the API [future 1.17.0]</p></li> <li><p>When a future results in an evaluation error, the <code>result()</code> object of the future also holds the session information from when the error occurred [future 1.17.0]</p></li> <li><p><code>value()</code> can now detect and warn if a <code>future(..., seed=FALSE)</code> call generated random numbers, which then might give unreliable results because the random number generation (RNG) used was neither parallel-safe nor statistically sound [future 1.16.0]</p></li> <li><p>Progress updates by <strong><a href="https://github.com/HenrikBengtsson/progressr">progressr</a></strong> are relayed in a near-live fashion for multisession and cluster futures [future 1.16.0]</p></li> <li><p><code>makeClusterPSOCK()</code> gained argument <code>rscript_envs</code> for setting or copying environment variables <em>during</em> the startup of each worker, e.g. <code>rscript_envs=c(FOO=&quot;hello world&quot;, &quot;BAR&quot;)</code> [future 1.17.0]. In addition, on Linux and macOS, it is also possible to set environment variables <em>prior</em> to launching the workers, e.g. 
<code>rscript=c(&quot;TMPDIR=/tmp/foo&quot;, &quot;FOO='hello world'&quot;, &quot;Rscript&quot;)</code> [future 1.18.0]</p></li> <li><p>Error messages of severe cluster future failures are more informative and include details on the affected worker, such as hostname and R version [future 1.17.0 and 1.18.0]</p></li> </ul></li> <li><p><strong>future.apply</strong>:</p> <ul> <li><p><code>future_apply()</code> gained argument <code>simplify</code>, which has been added to <code>base::apply()</code> in R-devel (to become R 4.1.0) [future.apply 1.6.0]</p></li> <li><p>Added <code>future_.mapply()</code> corresponding to <code>base::.mapply()</code> [future.apply 1.5.0]</p></li> <li><p><code>future_lapply()</code> and friends set a label on each future that reflects the name of the function and the index of the chunk, e.g. &lsquo;future_lapply-3&rsquo; [future.apply 1.4.0]</p></li> <li><p>The assertion of the maximum size of globals per chunk is significantly faster for <code>future_apply()</code> [future.apply 1.4.0]</p></li> </ul></li> </ul> <p>There have also been updates to <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> and <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong>. Please see their NEWS files for the details.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>I&rsquo;m working on cleaning up and harmonizing the Future API even further. This is necessary so I can add some powerful features later on. One example of this cleanup is making sure that all types of futures are resolved in a local environment, which means that the <code>local</code> argument can be deprecated and eventually removed. Another example is to deprecate argument <code>persistent</code> for cluster futures, which is an &ldquo;outlier&rdquo; and a remnant from the past. 
I&rsquo;m aware that some of you use <code>plan(cluster, persistent=TRUE)</code>, which, as far as I understand, is because you need to keep persistent variables around throughout the lifetime of the workers. I&rsquo;ve got a prototype of &ldquo;sticky globals&rdquo; that solves this problem differently, without the need for <code>persistent=TRUE</code>. I&rsquo;ll try my best to make sure everyone&rsquo;s needs are met. If you&rsquo;ve got questions, feedback, or a special use case, please reach out on <a href="https://github.com/HenrikBengtsson/future/issues/382">https://github.com/HenrikBengtsson/future/issues/382</a>.</p> <p>I&rsquo;ve also worked with the maintainers of <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> to harmonize the end-user and developer experience of <strong>foreach</strong> with that of the <strong>future</strong> framework. For example, in <code>y &lt;- foreach(...) %dopar% { ... }</code>, the <code>{ ... }</code> expression is now always evaluated in a local environment, just like futures. This helps avoid some quite common beginner mistakes that happen when moving from sequential to parallel processing. You can read about this change in the <a href="https://blog.revolutionanalytics.com/2020/03/foreach-150-released.html">&lsquo;foreach 1.5.0 now available on CRAN&rsquo;</a> blog post by Hong Ooi. 
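</p> <p>To make that change concrete, here is a small sketch (assuming the <strong>foreach</strong>, <strong>doFuture</strong>, and <strong>future</strong> packages are installed): with foreach 1.5.0, assignments made inside the <code>%dopar%</code> expression stay local to each iteration, just like with futures:</p> <pre><code class="language-r">library(foreach)
doFuture::registerDoFuture()       ## use futures as the foreach backend
future::plan(future::sequential)

x &lt;- 0
y &lt;- foreach(i = 1:3) %dopar% {
  x &lt;- x + i   ## assigns a local 'x'; the global 'x' is left untouched
  x
}
unlist(y)   ## 1 2 3
x           ## still 0 - no side effects leaked
</code></pre> <p>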
There is also <a href="https://github.com/RevolutionAnalytics/foreach/issues/2">a discussion</a> on updating how <strong>foreach</strong> identifies global variables and packages so that it works the same as in the <strong>future</strong> framework.</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> adapter)</li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>future.tests</strong> package: <a href="https://cran.r-project.org/package=future.tests">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> <li><strong>progressr</strong> package: <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> </ul> <p>UPDATE: Added link to GitHub issue to discuss deprecation of <code>local</code> and <code>persistent</code> /2020-07-16</p> </description>
</item>
<item>
<title>e-Rum 2020 Slides on Progressr</title>
<link>https://www.jottr.org/2020/07/04/progressr-erum2020-slides/</link>
<pubDate>Sat, 04 Jul 2020 17:30:00 -0700</pubDate>
<guid>https://www.jottr.org/2020/07/04/progressr-erum2020-slides/</guid>
<description> <div style="width: 25%; margin: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/three_in_chinese.gif" alt="Animated strokes for writing three in Chinese; one, two, three strokes"/> <span style="font-size: 80%; font-style: italic;">Source: <a href="https://en.wiktionary.org/wiki/File:%E4%B8%89-order.gif">Wiktionary.org</a></span> </center> </div> <p>I presented <em>Progressr: An Inclusive, Unifying API for Progress Updates</em> (15 minutes; 20 slides) at <a href="https://2020.erum.io/">e-Rum 2020</a>, on June 17, 2020:</p> <ul> <li><a href="https://www.jottr.org/presentations/eRum2020/BengtssonH_20200617-progressr-An_Inclusive,_Unifying_API_for_Progress_Updates.abstract.txt">Abstract</a></li> <li><a href="https://docs.google.com/presentation/d/11RymPwL90rPc0dQwpNCnw5KQC_76tuDK7uB7rq26oIg/present#slide=id.g88962cfdb7_0_0">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/eRum2020/BengtssonH_20200617-progressr-An_Inclusive,_Unifying_API_for_Progress_Updates.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=NwVOvfpGq4o&amp;t=3001s">Video</a> (starts at 00h49m58s)</li> </ul> <p>I am grateful to everyone involved who made e-Rum 2020 possible. I cannot imagine having to cancel the on-site Milano conference that had been planned for more than a year and then start over to re-organize and create a fabulous online experience for ~1,500 participants on such short notice. Your contribution to the R community in these times is invaluable - thank you so much.</p> <p>As a speaker, I found it a bit of a challenge. It was my first presentation at an all-online conference, so I wasn&rsquo;t sure what to expect or how it would go. As others have said, it is indeed a bit unusual to present to an audience you know is there but whom you cannot see or interact with during the talk. 
I gave my presentation a bit before seven o&rsquo;clock in the morning my time, and halfway through, my mind tried to convince me that it would be ok to get up and pour myself another cup of coffee - hehe - I certainly did not expect that one.</p> <p>Now, let&rsquo;s make some progress in this world!</p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li>e-Rum 2020: <ul> <li>Conference site: <a href="https://2020.erum.io/">https://2020.erum.io/</a></li> </ul></li> <li>Packages useful for understanding this talk (in order of appearance): <ul> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> <li><strong>progress</strong> package: <a href="https://cran.r-project.org/package=progress">CRAN</a>, <a href="https://github.com/r-lib/progress">GitHub</a></li> <li><strong>beepr</strong> package: <a href="https://cran.r-project.org/package=beepr">CRAN</a>, <a href="https://github.com/rasmusab/beepr">GitHub</a></li> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>rstudio::conf 2020 Slides on Futures</title>
<link>https://www.jottr.org/2020/02/01/future-rstudioconf2020-slides/</link>
<pubDate>Sat, 01 Feb 2020 19:30:00 -0800</pubDate>
<guid>https://www.jottr.org/2020/02/01/future-rstudioconf2020-slides/</guid>
<description> <div style="width: 25%; margin: 2ex; float: right;"/> <center> <img src="https://www.jottr.org/post/future-logo.png" alt="The future logo"/> <span style="font-size: 80%; font-style: italic;">Design: <a href="https://twitter.com/embiggenData">Dan LaBar</a></span> </center> </div> <p>I presented <em>Future: Simple Async, Parallel &amp; Distributed Processing in R - Why and What&rsquo;s New?</em> at <a href="https://rstudio.com/conference/">rstudio::conf 2020</a> in San Francisco, USA, on January 29, 2020. Below are the slides for my talk (17 slides; ~18+2 minutes):</p> <ul> <li><a href="https://docs.google.com/presentation/d/1Wn5S91UGIOrc4IyXoV074ij5vGF8I0Km0tCfintyIa4/present?includes_info_params=1&amp;eisi=CM2mhIXwsecCFQyuJgodBQAJ8A#slide=id.p">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/rstudioconf2020/BengtssonH_20200129-future-rstudioconf2020.pdf">PDF</a> (flat slides)</li> <li><a href="https://resources.rstudio.com/rstudio-conf-2020/future-simple-async-parallel-amp-distributed-processing-in-r-whats-next-henrik-bengtsson">Video</a> with closed captions (official rstudio::conf recording)</li> </ul> <p>First of all, a big thank you goes out to Dan LaBar (<a href="https://twitter.com/embiggenData">@embiggenData</a>) for proposing and contributing the original design of the future hex sticker. All credits to Dan. (You can blame me for the tweaked background.)</p> <p>This was my first rstudio::conf and it was such a pleasure to be part of it. I&rsquo;d like to thank <a href="https://blog.rstudio.com/2020/01/29/rstudio-pbc">RStudio, PBC</a> for the invitation to speak and everyone who contributed to the conference - organizers, staff, speakers, poster presenters, and last but not least, all the wonderful participants. 
Each one of you makes our R community what it is today.</p> <p><em>Happy futuring!</em></p> <p>- Henrik</p> <h2 id="links">Links</h2> <ul> <li>rstudio::conf 2020: <ul> <li>Conference site: <a href="https://rstudio.com/conference/">https://rstudio.com/conference/</a></li> </ul></li> <li>Packages essential to the understanding of this talk (in order of appearance): <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>purrr</strong> package: <a href="https://cran.r-project.org/package=purrr">CRAN</a>, <a href="https://github.com/tidyverse/purrr">GitHub</a></li> <li><strong>furrr</strong> package: <a href="https://cran.r-project.org/package=furrr">CRAN</a>, <a href="https://github.com/DavisVaughan/furrr">GitHub</a></li> <li><strong>foreach</strong> package: <a href="https://cran.r-project.org/package=foreach">CRAN</a>, <a href="https://github.com/RevolutionAnalytics/foreach">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>batchtools</strong> package: <a href="https://cran.r-project.org/package=batchtools">CRAN</a>, <a href="https://github.com/mllg/batchtools">GitHub</a></li> <li><strong>shiny</strong> package: <a href="https://cran.r-project.org/package=shiny">CRAN</a>, <a href="https://github.com/rstudio/shiny/issues">GitHub</a></li> <li><strong>future.tests</strong> package: <del>CRAN</del>, <a 
href="https://github.com/HenrikBengtsson/future.tests">GitHub</a></li> <li><strong>progressr</strong> package: <a href="https://cran.r-project.org/package=progressr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> <li><strong>progress</strong> package: <a href="https://cran.r-project.org/package=progress">CRAN</a>, <a href="https://github.com/r-lib/progress">GitHub</a></li> <li><strong>beepr</strong> package: <a href="https://cran.r-project.org/package=beepr">CRAN</a>, <a href="https://github.com/rasmusab/beepr">GitHub</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>future 1.15.0 - Lazy Futures are Now Launched if Queried</title>
<link>https://www.jottr.org/2019/11/09/resolved-launches-lazy-futures/</link>
<pubDate>Sat, 09 Nov 2019 11:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2019/11/09/resolved-launches-lazy-futures/</guid>
<description> <p><img src="https://www.jottr.org/post/lazy_dog_in_park.gif" alt="&quot;Lazy dog does not want to leave park&quot;" /> <small><em>No dogs were harmed while making this release</em></small></p> <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.15.0 is now on CRAN, accompanied by a recent, related update of <strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> 0.5.0. The main update is a change to the Future API:</p> <p><center> <code>resolved()</code> will now also launch lazy futures </center></p> <p>Although this change may not look like much to the world, I&rsquo;d like to think of it as part of a young person slowly finding themselves. This change in behavior helps us in cases where we create lazy futures upfront:</p> <pre><code class="language-r">fs &lt;- lapply(X, future, lazy = TRUE) </code></pre> <p>Such futures remain dormant until we call <code>value()</code> on them, or, as of this release, when we call <code>resolved()</code> on them. Contrary to <code>value()</code>, <code>resolved()</code> is a non-blocking function that allows us to check in on one or more futures to see if they are resolved or not. So, we can now do:</p> <pre><code class="language-r">while (!all(resolved(fs))) { do_something_else() } </code></pre> <p>to run that loop until all futures are resolved. Any lazy future that is still dormant will be launched when queried the first time. Previously, we would have had to write specialized code for the <code>lazy=TRUE</code> case to trigger lazy futures to launch. If not, the above loop would have run forever. This change means that the above design pattern works the same regardless of whether we use <code>lazy=TRUE</code> or <code>lazy=FALSE</code> (default). There is now one less thing to worry about when working with futures. 
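</p> <p>Putting the pieces together, a minimal sketch of this pattern (assuming a multisession backend and a made-up <code>slow_sqrt()</code> stand-in for a slow function) could look like:</p> <pre><code class="language-r">library(future)
plan(multisession, workers = 2)

slow_sqrt &lt;- function(x) { Sys.sleep(0.5); sqrt(x) }  ## hypothetical slow function

## Create all futures upfront, but lazily - nothing is launched yet
fs &lt;- lapply(1:4, function(x) future(slow_sqrt(x), lazy = TRUE))

## As of future 1.15.0, resolved() also launches dormant futures,
## so this non-blocking poll loop is guaranteed to terminate
while (!all(resolved(fs))) Sys.sleep(0.1)

ys &lt;- lapply(fs, value)  ## sqrt(1), sqrt(2), sqrt(3), sqrt(4)
</code></pre> <p>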
Less mental friction should be good.</p> <h2 id="what-else">What else?</h2> <p>The Future API now guarantees that <code>value()</code> relays the &ldquo;visibility&rdquo; of a future&rsquo;s value. For example,</p> <pre><code class="language-r">&gt; f &lt;- future(invisible(42)) &gt; value(f) &gt; v &lt;- value(f) &gt; v [1] 42 </code></pre> <p>Other than that, I have fixed several non-critical bugs and improved some documentation. See <code>news(package=&quot;future&quot;)</code> or <a href="https://cran.r-project.org/web/packages/future/NEWS">NEWS</a> for all updates.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <ul> <li><p>I&rsquo;ll be talking about futures at <a href="https://rstudio.com/conference/">rstudio::conf 2020</a> (San Francisco, CA, USA) at the end of January 2020. Please come and say hi - I am keen to hear your R story.</p></li> <li><p>I will wrap up the deliverables for the project <a href="https://github.com/HenrikBengtsson/future.tests">Future Minimal API: Specification with Backend Conformance Test Suite</a> sponsored by the R Consortium. This project helps to robustify the future ecosystem and validate that all backends fulfill the Future API specification. It also serves to refine the Future API specifications. For example, the above change to <code>resolved()</code> resulted from this project.</p></li> <li><p>The maintainers of <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> plan to harmonize how <code>foreach()</code> identifies global variables with how the <strong>future</strong> framework identifies them. The idea is to migrate <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> to use the same approach as <strong>future</strong>, which relies on the <strong><a href="https://cran.r-project.org/package=globals">globals</a></strong> package. 
If you&rsquo;re curious, you can find out more about this over at the <a href="https://github.com/RevolutionAnalytics/foreach/issues">foreach issue tracker</a>. Yeah, the foreach issue tracker is a fairly recent thing - it&rsquo;s a great addition.</p></li> <li><p>The <strong><a href="https://github.com/HenrikBengtsson/progressr">progressr</a></strong> package (GitHub only) is a proof-of-concept and a working <em>prototype</em> showing how to signal progress updates when doing parallel processing. It works out of the box with the core Future API and higher-level Future APIs such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>, <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>, and <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> - regardless of what parallel backend is being used. It should also work with all known non-parallel map-reduce frameworks, including <strong>base</strong> <code>lapply()</code> and <strong><a href="https://cran.r-project.org/package=purrr">purrr</a></strong>. For parallel processing, the &ldquo;granularity&rdquo; of progress updates varies with the type of parallel worker used. Right now, you will get live updates for sequential processing, whereas for parallel processing the updates will come in chunks along with the value whenever it is collected for a particular future. 
I&rsquo;m working on adding support for &ldquo;live&rdquo; progress updates also for some parallel backends including when running on local and remote workers.</p></li> </ul> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li><strong>future</strong> package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li><strong>future.batchtools</strong> package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li><strong>future.callr</strong> package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li><strong>future.apply</strong> package: <a href="https://cran.r-project.org/package=future.apply">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.apply">GitHub</a></li> <li><strong>doFuture</strong> package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> adapter)</li> <li><strong>progressr</strong> package: <a href="https://github.com/HenrikBengtsson/progressr">GitHub</a></li> <li><a href="https://www.videoman.gr/en/70385" target="_blank">&ldquo;So, what happened to the dog?&rdquo;</a></li> </ul> </description>
</item>
<item>
<title>useR! 2019 Slides on Futures</title>
<link>https://www.jottr.org/2019/07/12/future-user2019-slides/</link>
<pubDate>Fri, 12 Jul 2019 16:00:00 +0200</pubDate>
<guid>https://www.jottr.org/2019/07/12/future-user2019-slides/</guid>
<description> <p><img src="https://www.jottr.org/post/useR2019-logo_400x400.jpg" alt="The useR 2019 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>Below are the slides for my talk <em>Future: Simple Parallel and Distributed Processing in R</em> that I presented at the <a href="https://user2019.r-project.org/">useR! 2019</a> conference in Toulouse, France on July 9-12, 2019.</p> <p>My talk (25 slides; ~15+3 minutes):</p> <ul> <li>Title: <em>Future: Simple Parallel and Distributed Processing in R</em></li> <li><a href="https://docs.google.com/presentation/d/e/2PACX-1vQDLsnzhfp03zAf-BG69mnwO6nqGyLP9Zuj5ShW0gbewY955wop6KO5bidbWxtrIydFj7lznwi1op__/pub?start=false&amp;loop=false&amp;delayms=60000">HTML</a> (incremental Google Slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/useR2019/BengtssonH_20190712-future-useR2019.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=4B3wPFL_Syo&amp;list=PL4IzsxWztPdliwImi5JLBC4BrvqxG-vcA&amp;index=69">Video</a> (official recording)</li> </ul> <p>I want to send out a big thank you to everyone who made the useR! conference such a wonderful experience.</p> <h2 id="links">Links</h2> <ul> <li>useR! 
2019: <ul> <li>Conference site: <a href="https://user2019.r-project.org/">https://user2019.r-project.org/</a></li> </ul></li> <li><strong>future</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li><strong>future.apply</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li><strong>progressr</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=progressr">https://cran.r-project.org/package=progressr</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/progressr">https://github.com/HenrikBengtsson/progressr</a></li> </ul></li> </ul> <p>Edits 2020-02-01: Added link to video recording of presentation and link to the CRAN package page of the progressr package (submitted to CRAN on 2020-01-23).</p> </description>
</item>
<item>
<title>startup - run R startup files once per hour, day, week, ...</title>
<link>https://www.jottr.org/2019/05/26/startup-sometimes/</link>
<pubDate>Sun, 26 May 2019 21:00:00 -0700</pubDate>
<guid>https://www.jottr.org/2019/05/26/startup-sometimes/</guid>
<description> <p>New release: <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> 0.12.0 is now on CRAN. This version introduces support for processing some of the R startup files with a certain frequency, e.g. once per day, once per week, or once per month. See below for two examples.</p> <p><img src="https://www.jottr.org/post/startup_0.10.0-zxspectrum.gif" alt="ZX Spectrum animation" /> <em>startup::startup() is cross platform.</em></p> <p>The <a href="https://cran.r-project.org/package=startup">startup</a> package makes it easy to split up a long, complicated <code>.Rprofile</code> startup file into multiple, smaller files in a <code>.Rprofile.d/</code> folder. For instance, setting R option <code>repos</code> in a separate file <code>~/.Rprofile.d/repos.R</code> makes it easy to find and update the option. Analogously, environment variables can be configured by using multiple <code>.Renviron.d/</code> files. To make use of this, install the <strong>startup</strong> package, and then call <code>startup::install()</code> once, which will tweak your <code>~/.Rprofile</code> file and create <code>~/.Renviron.d/</code> and <code>~/.Rprofile.d/</code> folders, if missing. For an introduction, see <a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a>.</p> <h2 id="example-show-a-fortune-once-per-hour">Example: Show a fortune once per hour</h2> <p>The <a href="https://cran.r-project.org/package=fortunes"><strong>fortunes</strong></a> package is a collection of quotes and wisdom related to the R language. By adding</p> <pre><code class="language-r">if (interactive()) print(fortunes::fortune()) </code></pre> <p>to our <code>~/.Rprofile</code> file, a random fortune will be displayed each time we start R, e.g.</p> <pre><code>$ R --quiet I think, therefore I R. -- William B. 
King (in his R tutorials) http://ww2.coastal.edu/kingw/statistics/R-tutorials/ (July 2010) &gt; </code></pre> <p>Now, if we&rsquo;re launching R frequently, it might be too much to see a new fortune each time R is started. With <strong>startup</strong> (&gt;= 0.12.0), we can limit how often a certain startup file should be processed via <code>when=&lt;frequency&gt;</code> declarations. Currently supported values are <code>when=once</code>, <code>when=hourly</code>, <code>when=daily</code>, <code>when=weekly</code>, <code>when=fortnightly</code>, and <code>when=monthly</code>. See the package vignette for more details.</p> <p>For instance, we can limit ourselves to one fortune per hour by creating a file <code>~/.Rprofile.d/interactive=TRUE/when=hourly/package=fortunes.R</code> containing:</p> <pre><code class="language-r">print(fortunes::fortune()) </code></pre> <p>The <code>interactive=TRUE</code> part declares that the file should only be processed in an interactive session, the <code>when=hourly</code> part that it should be processed at most once per hour, and the <code>package=fortunes</code> part that it should be processed only if the <strong>fortunes</strong> package is installed. If not all of these declarations are fulfilled, then the file will <em>not</em> be processed.</p> <h2 id="example-check-the-status-of-your-cran-packages-once-per-day">Example: Check the status of your CRAN packages once per day</h2> <p>If you are a developer with one or more packages on CRAN, the <a href="https://cran.r-project.org/package=foghorn"><strong>foghorn</strong></a> package provides <code>foghorn::summary_cran_results()</code>, which is a neat way to get a summary of the CRAN statuses of your packages. 
I use the following two files to display the summary of my CRAN packages once per day:</p> <p>File <code>~/.Rprofile.d/interactive=TRUE/when=daily/package=foghorn.R</code>:</p> <pre><code class="language-r">try(local({ if (nzchar(email &lt;- Sys.getenv(&quot;MY_CRAN_EMAIL&quot;))) { foghorn::summary_cran_results(email) } }), silent = TRUE) </code></pre> <p>File <code>~/.Renviron.d/private/me</code>:</p> <pre><code>[email protected] </code></pre> <h2 id="links">Links</h2> <ul> <li><strong>startup</strong> package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=startup">https://cran.r-project.org/package=startup</a> (<a href="https://cran.r-project.org/web/packages/startup/NEWS">NEWS</a>, <a href="https://cran.r-project.org/web/packages/startup/vignettes/startup-intro.html">vignette</a>)</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/startup">https://github.com/HenrikBengtsson/startup</a></li> </ul></li> </ul> <h2 id="related">Related</h2> <ul> <li><a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a> on 2016-12-22.</li> <li><a href="https://www.jottr.org/2018/03/30/startup-secrets/">Startup with Secrets - A Poor Man&rsquo;s Approach</a> on 2018-03-30.</li> </ul> </description>
</item>
<item>
<title>SatRday LA 2019 Slides on Futures</title>
<link>https://www.jottr.org/2019/05/16/future-satrdayla2019-slides/</link>
<pubDate>Thu, 16 May 2019 12:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2019/05/16/future-satrdayla2019-slides/</guid>
<description> <p><img src="https://www.jottr.org/post/SatRdayLA2019-Logo.png" alt="The satRday LA 2019 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>A bit late, but here are my slides on <em>Future: Friendly Parallel Processing in R for Everyone</em> that I presented at the <a href="https://losangeles2019.satrdays.org/">satRday LA 2019</a> conference in Los Angeles, CA, USA on April 6, 2019.</p> <p>My talk (33 slides; ~45 minutes):</p> <ul> <li>Title: <em>Future: Friendly Parallel and Distributed Processing in R for Everyone</em></li> <li><a href="https://www.jottr.org/presentations/satRdayLA2019/BengtssonH_20190406-SatRdayLA2019,flat.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/satRdayLA2019/BengtssonH_20190406-SatRdayLA2019,flat.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=KP3pgLfKr00&amp;list=PLQRHxIa9tfRvXYyaVS77zshvD0i17Y60s">Video</a> (44 min; YouTube; sorry, different page numbers)</li> </ul> <p>Thank you all for making this a stellar satRday event. 
I enjoyed it very much!</p> <h2 id="links">Links</h2> <ul> <li>satRday LA 2019: <ul> <li>Conference site: <a href="https://losangeles2019.satrdays.org/">https://losangeles2019.satrdays.org/</a></li> <li>Conference material: <a href="https://github.com/satRdays/losangeles/tree/master/2019">https://github.com/satRdays/losangeles/tree/master/2019</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>SatRday Paris 2019 Slides on Futures</title>
<link>https://www.jottr.org/2019/03/07/future-satrdayparis2019-slides/</link>
<pubDate>Thu, 07 Mar 2019 12:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2019/03/07/future-satrdayparis2019-slides/</guid>
<description> <p><img src="https://www.jottr.org/post/satRdayParis2019-logo.png" alt="The satRday Paris 2019 logo" style="width: 30%; float: right; margin: 2ex;"/></p> <p>Below are links to my slides from my talk on <em>Future: Friendly Parallel Processing in R for Everyone</em> that I presented last month at the <a href="https://paris2019.satrdays.org/">satRday Paris 2019</a> conference in Paris, France (February 23, 2019).</p> <p>My talk (32 slides; ~40 minutes):</p> <ul> <li>Title: <em>Future: Friendly Parallel Processing in R for Everyone</em></li> <li><a href="https://www.jottr.org/presentations/satRdayParis2019/BengtssonH_20190223-SatRdayParis2019.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/satRdayParis2019/BengtssonH_20190223-SatRdayParis2019.pdf">PDF</a> (flat slides)</li> </ul> <p>A big shout out to the organizers, all the volunteers, and everyone else for making it a great satRday.</p> <h2 id="links">Links</h2> <ul> <li>satRday Paris 2019: <ul> <li>Conference site: <a href="https://paris2019.satrdays.org/">https://paris2019.satrdays.org/</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a 
href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>Parallelize a For-Loop by Rewriting it as an Lapply Call</title>
<link>https://www.jottr.org/2019/01/11/parallelize-a-for-loop-by-rewriting-it-as-an-lapply-call/</link>
<pubDate>Fri, 11 Jan 2019 12:00:00 -0800</pubDate>
<guid>https://www.jottr.org/2019/01/11/parallelize-a-for-loop-by-rewriting-it-as-an-lapply-call/</guid>
<description> <p>A commonly asked question in the R community is:</p> <blockquote> <p>How can I parallelize the following for-loop?</p> </blockquote> <p>The answer almost always involves rewriting the <code>for (...) { ... }</code> loop into something that looks like a <code>y &lt;- lapply(...)</code> call. If you can achieve that, you can parallelize it via, for instance, <code>y &lt;- future.apply::future_lapply(...)</code> or <code>y &lt;- foreach::foreach() %dopar% { ... }</code>.</p> <p>For some for-loops it is straightforward to rewrite the code to make use of <code>lapply()</code> instead, whereas in other cases it can be a bit more complicated, especially if the for-loop updates multiple variables in each iteration. However, as long as the algorithm behind the for-loop is <em><a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a></em>, it can be done. Whether it should be parallelized in the first place, or whether it is worth the effort, is a whole other discussion.</p> <p>Below are a few walk-through examples of how to transform a for-loop into an lapply call.</p> <p><img src="https://www.jottr.org/post/Honolulu_IFSS_Teletype1964.jpg" alt="Paper tape relay operation at US FAA's Honolulu flight service station in 1964 showing a large number of punch tapes" /> <em>Run your loops in parallel.</em></p> <h1 id="example-1-a-well-behaving-for-loop">Example 1: A well-behaving for-loop</h1> <p>I will use very simple function calls throughout the examples, e.g. <code>sqrt(x)</code>. 
For these code snippets to make sense, let us pretend that those functions take a long time to finish and by parallelizing multiple such calls we will shorten the overall processing time.</p> <p>First, consider the following example:</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] tmp &lt;- sqrt(x) ## Assume this takes a long time y[[ii]] &lt;- tmp } </code></pre> <p>When run, this will give us the following result:</p> <pre><code class="language-r">&gt; str(y) List of 5 $ : num 1 $ : num 1.41 $ : num 1.73 $ : num 2 $ : num 2.24 </code></pre> <p>Because the result of each iteration in the for-loop is a single value (variable <code>tmp</code>) it is straightforward to turn this for-loop into an lapply call. I&rsquo;ll first show a version that resembles the original for-loop as far as possible, with one minor but important change. I&rsquo;ll wrap up the &ldquo;iteration&rdquo; code inside <code>local()</code> to make sure it is evaluated in a <em>local environment</em> in order to prevent it from assigning values to the global environment. It is only the &ldquo;result&rdquo; of the <code>local()</code> call that I will allow to update <code>y</code>. Here we go:</p> <pre><code class="language-r">y &lt;- list() for (ii in seq_along(X)) { y[[ii]] &lt;- local({ x &lt;- X[[ii]] tmp &lt;- sqrt(x) tmp ## same as return(tmp) }) } </code></pre> <p>By making these apparently small adjustments, we lower the risk of missing critical side effects that some for-loops rely on. If such side effects exist and we fail to adjust for them, the rewritten for-loop is likely to give the wrong results.</p> <p>If this syntax is unfamiliar to you, run it first to convince yourself that it works. How does it work? The code inside <code>local()</code> will be evaluated in a local environment and it is only its last value (here <code>tmp</code>) that will be returned. 
It is also neat that <code>x</code>, <code>tmp</code>, and any other created variables, will <em>not</em> clutter up the global environment. Instead, they will vanish after each iteration just like local variables used inside functions. Retry the above after <code>rm(x, tmp)</code> to see that this is really the case.</p> <p>Now we&rsquo;re in a really good position to turn the for-loop into an lapply call. To share my train of thought, I&rsquo;ll start by showing how to do it in a way that best resembles the latter for-loop;</p> <pre><code class="language-r">y &lt;- lapply(seq_along(X), function(ii) { x &lt;- X[[ii]] tmp &lt;- sqrt(x) tmp }) </code></pre> <p>Just like the for-loop with <code>local()</code>, it is the last value (here <code>tmp</code>) that is returned, and everything is evaluated in a local environment, e.g. variable <code>tmp</code> will <em>not</em> show up in our global environment.</p> <p>There is one more update that we can do, namely instead of passing the index <code>ii</code> as an argument and then extracting element <code>x &lt;- X[[ii]]</code> inside the function, we can pass that element directly using:</p> <pre><code class="language-r">y &lt;- lapply(X, function(x) { tmp &lt;- sqrt(x) tmp }) </code></pre> <p>If we get this far and have <strong>confirmed that we get the expected results</strong>, then we&rsquo;re home.</p> <p>From here, there are a few ways to parallelize the lapply call. The <strong>parallel</strong> package provides the commonly known <code>mclapply()</code> and <code>parLapply()</code> functions, which are found in many examples and inside several R packages. As the author of the <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package, I claim that your life as a developer will be a bit easier if you instead use the future framework. It will also bring more power and options to the end user. 
Below are a few options for parallelization.</p> <h2 id="future-apply-future-lapply">future.apply::future_lapply()</h2> <p>The parallelization update that requires the fewest changes is provided by the <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> package. All we have to do is to replace <code>lapply()</code> with <code>future_lapply()</code>:</p> <pre><code class="language-r">library(future.apply) plan(multisession) ## =&gt; parallelize on your local computer X &lt;- 1:5 y &lt;- future_lapply(X, function(x) { tmp &lt;- sqrt(x) tmp }) </code></pre> <p>and we&rsquo;re done.</p> <h2 id="foreach-foreach-dopar">foreach::foreach() %dopar% { &hellip; }</h2> <p>If we wish to use the <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> framework, we can do:</p> <pre><code class="language-r">library(doFuture) registerDoFuture() plan(multisession) X &lt;- 1:5 y &lt;- foreach(x = X) %dopar% { tmp &lt;- sqrt(x) tmp } </code></pre> <p>Here I choose the <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> adaptor because it provides us with access to the future framework and the full range of parallel backends that come with it (controlled via <code>plan()</code>).</p> <p>If there is only one thing you should remember from this post, it is the following:</p> <p><strong>It is a common misconception that <code>foreach()</code> works like a regular for-loop. It doesn&rsquo;t! Instead, think of it as a version of <code>lapply()</code> with a few bells and whistles and always make sure to use it as <code>y &lt;- foreach(...) %dopar% { ... 
}</code>.</strong></p> <p>To clarify further, the following is <em>not</em> (I repeat: <em>not</em>) a working solution:</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() foreach(x = X) %dopar% { tmp &lt;- sqrt(x) y[[ii]] &lt;- tmp } </code></pre> <p>No, it isn&rsquo;t.</p> <h2 id="additional-parallelization-options">Additional parallelization options</h2> <p>There are several more options available, which are conceptually very similar to the above lapply-like approaches, e.g. <code>y &lt;- furrr::future_map(X, ...)</code>, <code>y &lt;- plyr::llply(X, ..., .parallel = TRUE)</code> or <code>y &lt;- BiocParallel::bplapply(X, ..., BPPARAM = DoparParam())</code>. For the latter two to also parallelize via one of the many future backends, we need to call <code>doFuture::registerDoFuture()</code>. See also my blog post <a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>.</p> <h1 id="example-2-a-slightly-complicated-for-loop">Example 2: A slightly complicated for-loop</h1> <p>Now, what do we do if the for-loop writes multiple results in each iteration? 
For example,</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() z &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] tmp1 &lt;- sqrt(x) y[[ii]] &lt;- tmp1 tmp2 &lt;- x^2 z[[ii]] &lt;- tmp2 } </code></pre> <p>The way to turn this into an lapply call is to rewrite the code by gathering all the results at the very end of the iteration and then putting them into a list;</p> <pre><code class="language-r">X &lt;- 1:5 yz &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] tmp1 &lt;- sqrt(x) tmp2 &lt;- x^2 yz[[ii]] &lt;- list(y = tmp1, z = tmp2) } </code></pre> <p>This one we know how to rewrite;</p> <pre><code class="language-r">yz &lt;- lapply(X, function(x) { tmp1 &lt;- sqrt(x) tmp2 &lt;- x^2 list(y = tmp1, z = tmp2) }) </code></pre> <p>which we in turn can parallelize with one of the above approaches.</p> <p>The only difference from the original for-loop is that the &lsquo;y&rsquo; and &lsquo;z&rsquo; results are no longer in two separate lists. This makes it a bit harder to get hold of the two elements. In some cases, downstream code can work with the new <code>yz</code> format as is, but if not, we can always do:</p> <pre><code class="language-r">y &lt;- lapply(yz, function(t) t$y) z &lt;- lapply(yz, function(t) t$z) rm(yz) </code></pre> <h1 id="example-3-a-somewhat-complicated-for-loop">Example 3: A somewhat complicated for-loop</h1> <p>Another, somewhat complicated, for-loop is when, say, one column of a matrix is updated per iteration. 
For example,</p> <pre><code class="language-r">X &lt;- 1:5 Y &lt;- matrix(0, nrow = 2, ncol = length(X)) rownames(Y) &lt;- c(&quot;sqrt&quot;, &quot;square&quot;) for (ii in seq_along(X)) { x &lt;- X[[ii]] Y[, ii] &lt;- c(sqrt(x), x^2) ## assume this takes a long time } </code></pre> <p>which gives</p> <pre><code class="language-r">&gt; Y [,1] [,2] [,3] [,4] [,5] sqrt 1 1.414214 1.732051 2 2.236068 square 1 4.000000 9.000000 16 25.000000 </code></pre> <p>To turn this into an lapply call, the approach is the same as in Example 2 - we rewrite the for-loop to assign to a list and only afterward worry about putting those values into a matrix. To keep it simple, this can be done using something like:</p> <pre><code class="language-r">X &lt;- 1:5 tmp &lt;- lapply(X, function(x) { c(sqrt(x), x^2) ## assume this takes a long time }) Y &lt;- matrix(0, nrow = 2, ncol = length(X)) rownames(Y) &lt;- c(&quot;sqrt&quot;, &quot;square&quot;) for (ii in seq_along(tmp)) { Y[, ii] &lt;- tmp[[ii]] } rm(tmp) </code></pre> <p>To parallelize this, all we have to do is to rewrite the lapply call as:</p> <pre><code class="language-r">tmp &lt;- future_lapply(X, function(x) { c(sqrt(x), x^2) }) </code></pre> <h1 id="example-4-a-non-embarrassingly-parallel-for-loop">Example 4: A non-embarrassingly parallel for-loop</h1> <p>Now, if our for-loop is such that one iteration depends on the previous iterations, things become much more complicated. For example,</p> <pre><code class="language-r">X &lt;- 1:5 y &lt;- list() y[[1]] &lt;- 1 for (ii in 2:length(X)) { x &lt;- X[[ii]] tmp &lt;- sqrt(x) y[[ii]] &lt;- y[[ii - 1]] + tmp } </code></pre> <p>does <em>not</em> use an embarrassingly parallel for-loop. Each iteration needs the result of the previous one (<code>y[[ii - 1]]</code>), so the iterations must run in order. 
This code cannot be rewritten as an lapply call and therefore it cannot be parallelized.</p> <h1 id="summary">Summary</h1> <p>To parallelize a for-loop:</p> <ol> <li>Rewrite your for-loop such that each iteration is done inside a <code>local()</code> call (most of the work is done here)</li> <li>Rewrite this new for-loop as an lapply call (straightforward)</li> <li>Replace the lapply call with a parallel implementation of your choice (straightforward)</li> </ol> <p><em>Happy futuring!</em></p> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2019/01/07/maintenance-updates-of-future-backends-and-dofuture/">Maintenance Updates of Future Backends and doFuture</a>, 2019-01-07</li> <li><a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>, 2018-07-23</li> <li><a href="https://www.jottr.org/2018/06/23/future.apply_1.0.0/">future.apply - Parallelize Any Base R Apply Function</a>, 2018-06-23</li> <li><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future (Slides from eRum 2018)</a>, 2018-06-19</li> <li><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</li> <li><a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</li> <li><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</li> <li><a href="https://www.jottr.org/2016/07/02/future-user2016-slides/">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> <h1 id="appendix">Appendix</h1> <h2 id="a-regular-for-loop-with-future-future">A regular for-loop with future::future()</h2> <p>In 
order to lower the risk of mistakes, and because I think the for-loop-to-lapply approach is the one that works out of the box in most cases, I decided not to mention the following approach in the main text above, but if you&rsquo;re interested, here it is. With the core building blocks of the Future API, we can actually do parallel processing using a regular for-loop. Have a look at the second code snippet in Example 1 where we use a for-loop together with <code>local()</code>. All we need to do is to replace <code>local()</code> with <code>future()</code> and make sure to &ldquo;collect&rdquo; the values after the for-loop;</p> <pre><code class="language-r">library(future) plan(multisession) X &lt;- 1:5 y &lt;- list() for (ii in seq_along(X)) { y[[ii]] &lt;- future({ x &lt;- X[[ii]] tmp &lt;- sqrt(x) tmp }) } y &lt;- values(y) ## collect values </code></pre> <p>Note that this approach does <em>not</em> perform load balancing*. That is, contrary to the above-mentioned lapply-like options, it will not chunk up the elements in <code>X</code> into equally-sized portions for each parallel worker to process. Instead, it will call each worker multiple times, which can bring some significant overhead, especially if there are many elements to iterate over.</p> <p>However, one neat feature of this bare-bones approach is that we have full control of the iteration. For instance, we can initiate each iteration using a bit of sequential code before we use parallel code. This can be particularly useful for subsetting large objects to avoid passing them to each worker, which otherwise can be costly. For example, we can rewrite the above as:</p> <pre><code class="language-r">library(future) plan(multisession) X &lt;- 1:5 y &lt;- list() for (ii in seq_along(X)) { x &lt;- X[[ii]] y[[ii]] &lt;- future({ tmp &lt;- sqrt(x) tmp }) } y &lt;- values(y) </code></pre> <p>This is just one example. 
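To give a hypothetical sketch of such sequential subsetting: if each iteration only needs one column of a large matrix, we can extract that column in the main process so that only the column - not the whole matrix - is sent to the worker;</p> <pre><code class="language-r">library(future)
plan(multisession)

X &lt;- matrix(rnorm(2e6), ncol = 10)  ## pretend this is a large object
y &lt;- list()
for (ii in seq_len(ncol(X))) {
  x &lt;- X[, ii]         ## subset sequentially in the main process
  y[[ii]] &lt;- future({
    sum(sqrt(abs(x)))  ## assume this takes a long time
  })                   ## only 'x' is exported to the worker, not all of 'X'
}
y &lt;- values(y)         ## collect values
</code></pre> <p>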
I&rsquo;ve run into several other use cases in my large-scale genomics research, where I found it extremely useful to be able to perform the beginning of an iteration sequentially in the main process before passing on the remaining part to be processed in parallel by the workers.</p> <p>(*) I do have some ideas on how to get the above code snippet to do automatic workload balancing &ldquo;under the hood&rdquo;, but that is quite far into the future of the future framework.</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> </description>
</item>
<item>
<title>Maintenance Updates of Future Backends and doFuture</title>
<link>https://www.jottr.org/2019/01/07/maintenance-updates-of-future-backends-and-dofuture/</link>
<pubDate>Mon, 07 Jan 2019 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2019/01/07/maintenance-updates-of-future-backends-and-dofuture/</guid>
<description> <p>New versions of the following future backends are available on CRAN:</p> <ul> <li><strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> - parallelization via <strong><a href="https://cran.r-project.org/package=callr">callr</a></strong>, i.e. on the local machine</li> <li><strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> - parallelization via <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong>, i.e. on a compute cluster with job schedulers (SLURM, SGE, Torque/PBS, etc.) but also on the local machine</li> <li><strong><a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a></strong> - (maintained for legacy reasons) parallelization via <strong><a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a></strong>, which is the predecessor of batchtools</li> </ul> <p>These releases fix a few small bugs and inconsistencies that were identified with the help of the <strong><a href="https://github.com/HenrikBengtsson/future.tests">future.tests</a></strong> framework that is being developed with <a href="https://www.r-consortium.org/projects/awarded-projects">support from the R Consortium</a>.</p> <p>I also released a new version of:</p> <ul> <li><strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong> - use <em>any</em> future backend for <code>foreach()</code> parallelization</li> </ul> <p>which comes with a few improvements and bug fixes.</p> <p><img src="https://www.jottr.org/post/the-future-is-now.gif" alt="An old TV screen struggling to display the text &quot;THE FUTURE IS NOW&quot;" /> <em>The future is now.</em></p> <h2 id="the-future-is-what">The future is &hellip; what?</h2> <p>If you have never heard of the future framework before, here is a simple example. 
Assume that you want to run</p> <pre><code class="language-r">y &lt;- lapply(X, FUN = my_slow_function) </code></pre> <p>in parallel on your local computer. The most straightforward way to achieve this is to use:</p> <pre><code class="language-r">library(future.apply) plan(multisession) y &lt;- future_lapply(X, FUN = my_slow_function) </code></pre> <p>If you have SSH access to a few machines here and there with R installed, you can use:</p> <pre><code class="language-r">library(future.apply) plan(cluster, workers = c(&quot;localhost&quot;, &quot;gandalf.remote.edu&quot;, &quot;server.cloud.org&quot;)) y &lt;- future_lapply(X, FUN = my_slow_function) </code></pre> <p>Even better, if you have access to a compute cluster with an SGE job scheduler, you could use:</p> <pre><code class="language-r">library(future.apply) plan(future.batchtools::batchtools_sge) y &lt;- future_lapply(X, FUN = my_slow_function) </code></pre> <h2 id="the-future-is-why">The future is &hellip; why?</h2> <p>The <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package provides a simple, cross-platform, and lightweight API for parallel processing in R. At its core, there are three building blocks for doing parallel processing - <code>future()</code>, <code>resolved()</code>, and <code>value()</code> - which are used for creating the asynchronous evaluation of an R expression, querying whether it&rsquo;s done or not, and collecting the results. 
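For example, a minimal sketch that uses these three building blocks directly (here with the multisession backend) may look like:</p> <pre><code class="language-r">library(future)
plan(multisession)

f &lt;- future({
  sum(sqrt(1:100))      ## pretend this takes a long time
})                      ## evaluated asynchronously on a worker
while (!resolved(f)) {  ## poll until the future is done
  Sys.sleep(0.1)
}
y &lt;- value(f)           ## collect the result
</code></pre> <p>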
With these fundamental building blocks, a large variety of parallel tasks can be performed, either by using these functions directly or indirectly via more feature rich higher-level parallelization APIs such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>, <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong>, <strong><a href="https://bioconductor.org/packages/release/bioc/html/BiocParallel.html">BiocParallel</a></strong> or <strong><a href="https://cran.r-project.org/package=plyr">plyr</a></strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>, and <strong><a href="https://cran.r-project.org/package=furrr">furrr</a></strong>. In all cases, how and where future R expressions are evaluated, that is, how and where the parallelization is performed, depends solely on which <em>future backend</em> is currently used, which is controlled by the <code>plan()</code> function.</p> <p>One advantage of the Future API, whether it is used directly as is or via one of the higher-level APIs, is that it encapsulates the details on <em>how</em> and <em>where</em> the code is parallelized allowing the developer to instead focus on <em>what</em> to parallelize. Another advantage is that the end user will have control over which future backend to use. For instance, one user may choose to run an analysis in parallel on their notebook or in the cloud, whereas another may want to run it via a job scheduler in a high-performance compute (HPC) environment.</p> <h2 id="what-s-next">What’s next?</h2> <p>I&rsquo;ve spent a fair bit of time working on <strong><a href="https://github.com/HenrikBengtsson/future.tests">future.tests</a></strong>, which is a single framework for testing future backends. It will allow developers of future backends to validate that they fully conform to the Future API. This will lower the barrier for creating a new backend (e.g. 
<a href="https://github.com/HenrikBengtsson/future/issues/204">future.clustermq</a> on top of <strong><a href="https://cran.r-project.org/package=clustermq">clustermq</a></strong> or <a href="https://github.com/HenrikBengtsson/future/issues/151">one on top of Redis</a>) and it will build trust in existing ones such that end users can reliably switch between backends without having to worry about the results being different or even corrupted. So, backed by <strong><a href="https://github.com/HenrikBengtsson/future.tests">future.tests</a></strong>, I feel more comfortable attacking some of the feature requests - and there are <a href="https://github.com/HenrikBengtsson/future/issues?q=is%3Aissue+is%3Aopen+label%3A%22feature+request%22">quite a few of them</a>. Indeed, I&rsquo;ve already implemented one of them. More news coming soon &hellip;</p> <p><em>Happy futuring!</em></p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2018/07/23/output-from-the-future/">future 1.9.0 - Output from The Future</a>, 2018-07-23</li> <li><a href="https://www.jottr.org/2018/06/23/future.apply_1.0.0/">future.apply - Parallelize Any Base R Apply Function</a>, 2018-06-23</li> <li><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future (Slides from eRum 2018)</a>, 2018-06-19</li> <li><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</li> <li><a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</li> <li><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a 
href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</li> <li><a href="https://www.jottr.org/2016/07/02/future-user2016-slides/">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> </description>
</item>
<item>
<title>future 1.9.0 - Output from The Future</title>
<link>https://www.jottr.org/2018/07/23/output-from-the-future/</link>
<pubDate>Mon, 23 Jul 2018 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2018/07/23/output-from-the-future/</guid>
<description> <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.9.0 - <em>Unified Parallel and Distributed Processing in R for Everyone</em> - is on CRAN. This is a milestone release:</p> <p><strong>Standard output is now relayed from futures back to the master R session - regardless of where the futures are processed!</strong></p> <p><em>Disclaimer:</em> A future&rsquo;s output is relayed only after it is resolved and when its value is retrieved by the master R process. In other words, the output is not streamed back in a &ldquo;live&rdquo; fashion as it is produced. Also, it is only the standard output that is relayed. See below for why the standard error cannot be relayed.</p> <p><img src="https://www.jottr.org/post/Signaling_by_Napoleonic_semaphore_line.jpg" alt="Illustration of communication by mechanical semaphore in 1800s France. Lines of towers supporting semaphore masts were built within visual distance of each other. The arms of the semaphore were moved to different positions, to spell out text messages. The operators in the next tower would read the message and pass it on. Invented by Claude Chappe in 1792, semaphore was a popular communication technology in the early 19th century until the telegraph replaced it. (source: wikipedia.org)" /> <em>Relaying standard output from far away</em></p> <h2 id="examples">Examples</h2> <p>Assume we have access to three machines with R installed on our local network. We can distribute our R processing to these machines using futures by:</p> <pre><code class="language-r">&gt; library(future) &gt; plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;)) &gt; nbrOfWorkers() [1] 3 </code></pre> <p>With the above, future expressions will now be processed across those three machines. To see which machine a future ends up being resolved by, we can output the hostname, e.g.</p> <pre><code class="language-r">&gt; printf &lt;- function(...) 
cat(sprintf(...)) &gt; f &lt;- future({ + printf(&quot;Hostname: %s\n&quot;, Sys.info()[[&quot;nodename&quot;]]) + 42 + }) &gt; v &lt;- value(f) Hostname: n1 &gt; v [1] 42 </code></pre> <p>We see that this particular future was resolved on the <em>n1</em> machine. Note how <em>the output is relayed when we call <code>value()</code></em>. This means that if we call <code>value()</code> multiple times, the output will also be relayed multiple times, e.g.</p> <pre><code class="language-r">&gt; v &lt;- value(f) Hostname: n1 &gt; value(f) Hostname: n1 [1] 42 </code></pre> <p>This is intended and by design. In case you are new to futures, note that <em>a future is only evaluated once</em>. In other words, calling <code>value()</code> multiple times will not re-evaluate the future expression.</p> <p>The output is also relayed when using future assignments (<code>%&lt;-%</code>). For example,</p> <pre><code class="language-r">&gt; v %&lt;-% { + printf(&quot;Hostname: %s\n&quot;, Sys.info()[[&quot;nodename&quot;]]) + 42 + } &gt; v Hostname: n1 [1] 42 &gt; v [1] 42 </code></pre> <p>In this case, the output is only relayed the first time we print <code>v</code>. The reason is that, when first set up, <code>v</code> is a promise (delayed assignment), and as soon as we &ldquo;touch&rdquo; (here print) it, it will internally call <code>value()</code> on the underlying future and then be resolved to a regular variable <code>v</code>. This is also intended and by design.</p> <p>In the spirit of the Future API, any <em>output behaves exactly the same way regardless of the future backend used</em>. In the above, we see that output can be relayed from three external machines back to our local R session. 
We would get the exact same if we run our futures in parallel, or sequentially, on our local machine, e.g.</p> <pre><code class="language-r">&gt; plan(sequential) v %&lt;-% { printf(&quot;Hostname: %s\n&quot;, Sys.info()[[&quot;nodename&quot;]]) 42 } &gt; v Hostname: my-laptop [1] 42 </code></pre> <p>This also works when we use nested futures wherever the workers are located (local or remote), e.g.</p> <pre><code class="language-r">&gt; plan(list(sequential, multisession)) &gt; a %&lt;-% { + printf(&quot;PID: %d\n&quot;, Sys.getpid()) + b %&lt;-% { + printf(&quot;PID: %d\n&quot;, Sys.getpid()) + 42 + } + b + } &gt; a PID: 360547 PID: 484252 [1] 42 </code></pre> <h2 id="higher-level-future-frontends">Higher-Level Future Frontends</h2> <p>The core Future API, that is, the explicit <code>future()</code>-<code>value()</code> functions and the implicit future-assignment operator <code>%&lt;-%</code> function, provides the foundation for all of the future ecosystem. Because of this, <em>relaying of output will work out of the box wherever futures are used</em>. For example, when using <strong>future.apply</strong> we get:</p> <pre><code>&gt; library(future.apply) &gt; plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;)) &gt; printf &lt;- function(...) cat(sprintf(...)) &gt; y &lt;- future_lapply(1:5, FUN = function(x) { + printf(&quot;Hostname: %s (x = %g)\n&quot;, Sys.info()[[&quot;nodename&quot;]], x) + sqrt(x) + }) Hostname: n1 (x = 1) Hostname: n1 (x = 2) Hostname: n2 (x = 3) Hostname: n3 (x = 4) Hostname: n3 (x = 5) &gt; unlist(y) [1] 1.000000 1.414214 1.732051 2.000000 2.236068 </code></pre> <p>and similarly when, for example, using <strong>foreach</strong>:</p> <pre><code class="language-r">&gt; library(doFuture) &gt; registerDoFuture() &gt; plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;)) &gt; printf &lt;- function(...) 
cat(sprintf(...)) &gt; y &lt;- foreach(x = 1:5) %dopar% { + printf(&quot;Hostname: %s (x = %g)\n&quot;, Sys.info()[[&quot;nodename&quot;]], x) + sqrt(x) + } Hostname: n1 (x = 1) Hostname: n1 (x = 2) Hostname: n2 (x = 3) Hostname: n3 (x = 4) Hostname: n3 (x = 5) &gt; unlist(y) [1] 1.000000 1.414214 1.732051 2.000000 2.236068 </code></pre> <h2 id="what-about-standard-error">What about standard error?</h2> <p>Unfortunately, it is <em>not possible</em> to relay output sent to the standard error (stderr), that is, output by <code>message()</code>, <code>cat(..., file = stderr())</code>, and so on, is not taken care of. This is due to a <a href="https://github.com/HenrikBengtsson/Wishlist-for-R/issues/55">limitation in R</a>, preventing us from capturing stderr in a reliable way. The gist of the problem is that, contrary to stdout (&ldquo;output&rdquo;), there can only be a single stderr (&ldquo;message&rdquo;) sink active in R at any time. The real show stopper is that if we allocate such a message sink, it will be stolen from us the moment other code/functions request the message sink. In other words, message sinks cannot be used reliably in R unless one fully controls the whole software stack. As long as this is the case, it is not possible to collect and relay stderr in a consistent fashion across <em>all</em> future backends (*). But, of course, I&rsquo;ll keep on trying to find a solution to this problem. If anyone has a suggestion for a workaround or a patch to R, please let me know.</p> <p>(*) The <strong><a href="https://cran.r-project.org/package=callr">callr</a></strong> package captures stdout and stderr in a consistent manner, so for the <strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> backend, we could indeed already relay stderr. 
We could probably also find a solution for <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> backends, which target HPC job schedulers by utilizing the <strong><a href="https://cran.r-project.org/package=batchtools">batchtools</a></strong> package. However, if code becomes dependent on using specific future backends, it will limit the end users&rsquo; options - we want to avoid that as far as possible. Having said this, it is possible that we&rsquo;ll start out supporting stderr by making it an <a href="https://github.com/HenrikBengtsson/future/issues/172">optional feature of the Future API</a>.</p> <h2 id="poor-man-s-debugging">Poor Man&rsquo;s debugging</h2> <p>Because the output is also relayed when there is an error, e.g.</p> <pre><code class="language-r">&gt; x &lt;- &quot;42&quot; &gt; f &lt;- future({ + str(list(x = x)) + log(x) + }) &gt; value(f) List of 1 $ x: chr &quot;42&quot; Error in log(x) : non-numeric argument to mathematical function </code></pre> <p>it can be used for simple troubleshooting and narrowing down errors. For example,</p> <pre><code class="language-r">&gt; library(doFuture) &gt; registerDoFuture() &gt; plan(multisession) &gt; nbrOfWorkers() [1] 2 &gt; x &lt;- list(1, &quot;2&quot;, 3, 4, 5) &gt; y &lt;- foreach(x = x) %dopar% { + str(list(x = x)) + log(x) + } List of 1 $ x: num 1 List of 1 $ x: chr &quot;2&quot; List of 1 $ x: num 3 List of 1 $ x: num 4 List of 1 $ x: num 5 Error in { : task 2 failed - &quot;non-numeric argument to mathematical function&quot; &gt; </code></pre> <p>From the error message, we get that there was a &ldquo;non-numeric argument&rdquo; (element) passed to a function. By adding the <code>str()</code>, we can also see that it is of type character and what its value is. 
This will help us go back to the data source (<code>x</code>) and continue the troubleshooting there.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>Progress bar information is one of several frequently <a href="https://github.com/HenrikBengtsson/future/labels/feature%20request">requested features</a> in the future framework. I hope to attack the problem of progress bars and progress messages in higher-level future frontends such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong>. Ideally, this can be done in a uniform and generic fashion to meet all needs. A possible implementation that has been discussed is to provide a set of basic hook functions (e.g. on-start, on-resolved, on-value) that any ProgressBar API (e.g. <strong><a href="https://github.com/ropenscilabs/jobstatus">jobstatus</a></strong>) can build upon. This could help avoid tie-in to a particular progress-bar implementation.</p> <p>Another feature I&rsquo;d like to get going is (optional) <a href="https://github.com/HenrikBengtsson/future/issues/59">benchmarking of processing time and memory consumption</a>. This type of information will help optimize parallel and distributed processing by identifying and understanding the various sources of overhead involved in parallelizing a particular piece of code in a particular compute environment. This information will also help any efforts trying to automate load balancing. It may even be used for progress bars that try to estimate the remaining processing time (&ldquo;ETA&rdquo;).</p> <p>So, lots of work ahead. 
Oh well &hellip;</p> <p><em>Happy futuring!</em></p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="see-also">See also</h2> <ul> <li><p>About <a href="https://www.wikipedia.org/wiki/Semaphore_line">Semaphore Telegraphs</a>, Wikipedia</p></li> <li><p><a href="https://www.jottr.org/2018/06/23/future.apply_1.0.0/">future.apply - Parallelize Any Base R Apply Function</a>, 2018-06-23</p></li> <li><p><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future (Slides from eRum 2018)</a>, 2018-06-19</p></li> <li><p><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</p></li> <li><p><a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</p></li> <li><p><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</p></li> <li><p><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</p></li> <li><p><a href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</p></li> </ul> <h2 id="links">Links</h2> <ul> <li>future - <em>Unified Parallel and Distributed Processing in R for Everyone</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.apply - <em>Apply Function to Elements in Parallel using Futures</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a 
href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>doFuture - <em>A Universal Foreach Parallel Adaptor using the Future API of the &lsquo;future&rsquo; Package</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> <li>future.batchtools - <em>A Future API for Parallel and Distributed Processing using &lsquo;batchtools&rsquo;</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>future.callr - <em>A Future API for Parallel Processing using &lsquo;callr&rsquo;</em> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.callr">https://cran.r-project.org/package=future.callr</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.callr">https://github.com/HenrikBengtsson/future.callr</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>R.devices - Into the Void</title>
<link>https://www.jottr.org/2018/07/21/suppressgraphics/</link>
<pubDate>Sat, 21 Jul 2018 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2018/07/21/suppressgraphics/</guid>
<description> <p><strong><a href="https://cran.r-project.org/package=R.devices">R.devices</a></strong> 2.16.0 - <em>Unified Handling of Graphics Devices</em> - is on CRAN. With this release, you can now easily <strong>suppress unwanted graphics</strong>, e.g. graphics produced by one of those do-everything-in-one-call functions that we all bump into once in a while. To suppress graphics, the <strong>R.devices</strong> package provides the graphics device <code>nulldev()</code> and the function <code>suppressGraphics()</code>, both of which send any produced graphics into the void. This works on all operating systems, including Windows.</p> <p><img src="https://www.jottr.org/post/guillaume_nery_into_the_void_2.gif" alt="&quot;Into the void&quot;" /> <small><em><a href="https://www.youtube.com/watch?v=uQITWbAaDx0">Guillaume Nery base jumping at Dean&rsquo;s Blue Hole, filmed on breath hold by Julie Gautier</a></em></small> <!-- GIF from https://blog.francetvinfo.fr/l-instit-humeurs/2013/09/01/vis-ma-vie-dinstit-en-gif-anime-9.html --></p> <h2 id="examples">Examples</h2> <pre><code class="language-r">library(R.devices) nulldev() plot(1:100, main = &quot;Some Ignored Graphics&quot;) dev.off() </code></pre> <pre><code class="language-r">R.devices::suppressGraphics({ plot(1:100, main = &quot;Some Ignored Graphics&quot;) }) </code></pre> <h2 id="other-features">Other Features</h2> <p>Some other reasons for using the <strong>R.devices</strong> package:</p> <ul> <li><p><strong>No need to call dev.off()</strong> - Did you ever forget to call <code>dev.off()</code>, or did a function call produce an error causing <code>dev.off()</code> not to be reached, leaving a graphics device open? 
By using one of the <code>toPDF()</code>, <code>toPNG()</code>, &hellip; functions, or the more general <code>devEval()</code> function, <code>dev.off()</code> is automatically taken care of.</p></li> <li><p><strong>No need to specify filename extension</strong> - Did you ever switch from using <code>png()</code> to, say, <code>pdf()</code>, and forget to update the filename, resulting in a <code>my_plot.png</code> file that is actually a PDF file? By using one of the <code>toPDF()</code>, <code>toPNG()</code>, &hellip; functions, or the more general <code>devEval()</code> function, filename extensions are automatically taken care of - just specify the part without the extension.</p></li> <li><p><strong>Specify the aspect ratio</strong> - rather than having to manually calculate device-specific arguments <code>width</code> or <code>height</code>, e.g. <code>toPNG(&quot;my_plot&quot;, { plot(1:10) }, aspectRatio = 2/3)</code>. This is particularly useful when switching between device types, or when outputting to multiple ones at the same time.</p></li> <li><p><strong>Unified API for graphics options</strong> - conveniently set (most) graphics options including those that can otherwise only be controlled via arguments, e.g. <code>devOptions(&quot;png&quot;, width = 1024)</code>.</p></li> <li><p><strong>Control where figure files are saved</strong> - the default is folder <code>figures/</code> but can be set per device type or globally, e.g. 
<code>devOptions(&quot;*&quot;, path = &quot;figures/col/&quot;)</code>.</p></li> <li><p><strong>Easily produce EPS and favicons</strong> - <code>toEPS()</code> and <code>toFavicon()</code> are friendly wrappers for producing EPS and favicon graphics.</p></li> <li><p><strong>Capture and replay graphics</strong> - for instance, use <code>future::plan(remote, workers = &quot;remote.server.org&quot;); p %&lt;-% capturePlot({ plot(1:10) })</code> to produce graphics on a remote machine, and then display it locally by printing <code>p</code>.</p></li> </ul> <h3 id="some-more-examples">Some more examples</h3> <pre><code class="language-r">R.devices::toPDF(&quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }) ### [1] &quot;figures/my_plot.pdf&quot; </code></pre> <pre><code class="language-r">R.devices::toPNG(&quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }) ### [1] &quot;figures/my_plot.png&quot; </code></pre> <pre><code class="language-r">R.devices::toEPS(&quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }) ### [1] &quot;figures/my_plot.eps&quot; </code></pre> <pre><code class="language-r">R.devices::devEval(c(&quot;png&quot;, &quot;pdf&quot;, &quot;eps&quot;), name = &quot;my_plot&quot;, { plot(1:100, main = &quot;Amazing Graphics&quot;) }, aspectRatio = 1.3) ### $png ### [1] &quot;figures/my_plot.png&quot; ### ### $pdf ### [1] &quot;figures/my_plot.pdf&quot; ### ### $eps ### [1] &quot;figures/my_plot.eps&quot; </code></pre> <h2 id="links">Links</h2> <ul> <li>R.devices package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=R.devices">https://cran.r-project.org/package=R.devices</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/R.devices">https://github.com/HenrikBengtsson/R.devices</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/02/future-user2016-slides/">A Future for R: Slides from useR 2016</a>, 2016-07-02 
<ul> <li>See Slide 17 for an example of using <code>capturePlot()</code> remotely and plotting locally</li> </ul></li> </ul> </description>
</item>
<item>
<title>future.apply - Parallelize Any Base R Apply Function</title>
<link>https://www.jottr.org/2018/06/23/future.apply_1.0.0/</link>
<pubDate>Sat, 23 Jun 2018 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2018/06/23/future.apply_1.0.0/</guid>
<description> <p><img src="https://www.jottr.org/post/future.apply_1.0.0-htop_32cores.png" alt="0% to 100% utilization" /> <em>Got compute?</em></p> <p><a href="https://cran.r-project.org/package=future.apply">future.apply</a> 1.0.0 - <em>Apply Function to Elements in Parallel using Futures</em> - is on CRAN. With this milestone release, all<sup>*</sup> base R apply functions now have corresponding futurized implementations. This makes it easier than ever before to parallelize your existing <code>apply()</code>, <code>lapply()</code>, <code>mapply()</code>, &hellip; code - just prepend <code>future_</code> to an apply call that takes a long time to complete. That&rsquo;s it! The default is sequential processing but by using <code>plan(multisession)</code> it&rsquo;ll run in parallel.</p> <p><br> <em>Table: All future_nnn() functions in the <strong>future.apply</strong> package. Each function takes the same arguments as the corresponding <strong>base</strong> function does.</em><br></p> <table> <thead> <tr> <th>Function</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>future_<strong>apply()</strong></code></td> <td>Apply Functions Over Array Margins</td> </tr> <tr> <td><code>future_<strong>lapply()</strong></code></td> <td>Apply a Function over a List or Vector</td> </tr> <tr> <td><code>future_<strong>sapply()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>vapply()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>replicate()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>mapply()</strong></code></td> <td>Apply a Function to Multiple List or Vector Arguments</td> </tr> <tr> <td><code>future_<strong>Map()</strong></code></td> <td>- &ldquo; -</td> </tr> <tr> <td><code>future_<strong>eapply()</strong></code></td> <td>Apply a Function Over Values in an Environment</td> </tr> <tr> <td><code>future_<strong>tapply()</strong></code></td> <td>Apply a Function Over a 
Ragged Array</td> </tr> </tbody> </table> <p><sup>*</sup> <code>future_<strong>rapply()</strong></code> - Recursively Apply a Function to a List - is yet to be implemented.</p> <h2 id="a-motivating-example">A Motivating Example</h2> <p>In the <strong>parallel</strong> package there is an example - in <code>?clusterApply</code> - showing how to perform bootstrap simulations in parallel. After some small modifications to clarify the steps, it looks like the following:</p> <pre><code class="language-r">library(parallel) library(boot) run1 &lt;- function(...) { library(boot) cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 500, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } cl &lt;- makeCluster(4) ## Parallelize using four cores clusterSetRNGStream(cl, 123) cd4.boot &lt;- do.call(c, parLapply(cl, 1:4, run1)) boot.ci(cd4.boot, type = c(&quot;norm&quot;, &quot;basic&quot;, &quot;perc&quot;), conf = 0.9, h = atanh, hinv = tanh) stopCluster(cl) </code></pre> <p>The script defines a function <code>run1()</code> that produces 500 bootstrap samples, and then it calls this function four times, combines the four replicated samples into one <code>cd4.boot</code>, and at the end it uses <code>boot.ci()</code> to summarize the results.</p> <p>The corresponding sequential implementation would look something like:</p> <pre><code class="language-r">library(boot) run1 &lt;- function(...) { cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 500, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } set.seed(123) cd4.boot &lt;- do.call(c, lapply(1:4, run1)) boot.ci(cd4.boot, type = c(&quot;norm&quot;, &quot;basic&quot;, &quot;perc&quot;), conf = 0.9, h = atanh, hinv = tanh) </code></pre> <p>We notice a few things about these two code snippets. 
First of all, in the parallel code, there are two <code>library(boot)</code> calls; one in the main code and one inside the <code>run1()</code> function. The reason for this is to make sure that the <strong>boot</strong> package is also attached in the parallel, background R session when <code>run1()</code> is called there. The <strong>boot</strong> package defines the <code>boot.ci()</code> function, as well as the <code>boot()</code> function and the <code>cd4</code> data.frame - both used inside <code>run1()</code>. If <strong>boot</strong> is not attached inside the function, we would get the error <code>&quot;object 'cd4' not found&quot;</code> when running the parallel code. In contrast, we do not need to do this in the sequential code. Also, if we later were to turn our parallel script into a package, then <code>R CMD check</code> would complain if we kept the <code>library(boot)</code> call inside the <code>run1()</code> function.</p> <p>Second, the example uses <code>MASS::mvrnorm()</code> in <code>run1()</code>. The reason for this is related to the above - if we use only <code>mvrnorm()</code>, we need to attach the <strong>MASS</strong> package using <code>library(MASS)</code> and also do so inside <code>run1()</code>. 
Since there is only one <strong>MASS</strong> function called, it&rsquo;s easier and neater to use the form <code>MASS::mvrnorm()</code>.</p> <p>Third, the random-seed setup differs between the sequential and the parallel approach.</p> <p>In summary, in order to turn the sequential script into a script that parallelizes using the <strong>parallel</strong> package, we would have to not only rewrite parts of the code but also be aware of important differences in order to avoid getting run-time errors due to missing packages or global variables.</p> <p>One of the objectives of the <strong>future.apply</strong> package, and the <strong>future</strong> ecosystem in general, is to make transitions from writing sequential code to writing parallel code as simple and frictionless as possible.</p> <p>Here is the same example parallelized using the <strong>future.apply</strong> package:</p> <pre><code class="language-r">library(future.apply) plan(multisession, workers = 4) ## Parallelize using four cores library(boot) run1 &lt;- function(...) { cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 500, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } set.seed(123) cd4.boot &lt;- do.call(c, future_lapply(1:4, run1, future.seed = TRUE)) boot.ci(cd4.boot, type = c(&quot;norm&quot;, &quot;basic&quot;, &quot;perc&quot;), conf = 0.9, h = atanh, hinv = tanh) </code></pre> <p>The difference between the sequential base-R implementation and the <strong>future.apply</strong> implementation is minimal. The <strong>future.apply</strong> package is attached, the parallel plan of four workers is set up, and the <code>lapply()</code> function is replaced by <code>future_lapply()</code>, where we specify <code>future.seed = TRUE</code> to get statistically sound and numerically reproducible parallel random number generation (RNG). 
More importantly, notice how there is no need to worry about which packages need to be attached on the workers and which global variables need to be exported. That is all taken care of automatically by the <strong>future</strong> framework.</p> <h2 id="q-a">Q&amp;A</h2> <p>Q. <em>What are my options for parallelization?</em><br> A. Everything in <strong>future.apply</strong> is processed through the <a href="https://cran.r-project.org/package=future">future</a> framework. This means that all parallelization backends supported by the <strong>parallel</strong> package are supported out of the box, e.g. on your <strong>local machine</strong>, and on <strong>local</strong> or <strong>remote</strong> ad-hoc <strong>compute clusters</strong> (also in the <strong>cloud</strong>). Additional parallelization and distribution schemas are provided by backends such as <strong><a href="https://cran.r-project.org/package=future.callr">future.callr</a></strong> (parallelization on your local machine) and <strong><a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a></strong> (large-scale parallelization via <strong>HPC job schedulers</strong>). For other alternatives, see the CRAN Page for the <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package and the <a href="https://cran.r-project.org/web/views/HighPerformanceComputing.html">High-Performance and Parallel Computing with R</a> CRAN Task View.</p> <p>Q. <em>Righty-oh, so how do I specify which parallelization backend to use?</em><br> A. A fundamental design pattern of the future framework is that <em>the end user decides <strong>how and where</strong> to parallelize</em> while <em>the developer decides <strong>what</strong> to parallelize</em>. This means that you do <em>not</em> specify the backend via some argument to the <code>future_nnn()</code> functions. 
Instead, the backend is specified by the <code>plan()</code> function - you can almost think of it as a global option that the end user controls. For example, <code>plan(multisession)</code> will parallelize on the local machine, as will <code>plan(future.callr::callr)</code>, whereas <code>plan(cluster, workers = c(&quot;n1&quot;, &quot;n2&quot;, &quot;remote.server.org&quot;))</code> will parallelize on two local machines and one remote machine. Using <code>plan(future.batchtools::batchtools_sge)</code> will distribute the processing on your SGE-supported compute cluster. BTW, you can also have <a href="https://cran.r-project.org/web/packages/future/vignettes/future-3-topologies.html">nested parallelization strategies</a>, e.g. <code>plan(list(tweak(cluster, workers = nodes), multisession))</code> where <code>nodes = c(&quot;n1&quot;, &quot;n2&quot;, &quot;remote.server.org&quot;)</code>.</p> <p>Q. <em>What about load balancing?</em><br> A. The default behavior of all functions is to distribute <strong>equally-sized chunks</strong> of elements to each available background worker - such that each worker processes exactly one chunk (= one future). If the processing times vary significantly across chunks, you can increase the average number of chunks processed by each worker, e.g. to have them process two chunks on average, specify <code>future.scheduling = 2.0</code>. Alternatively, you can specify the number of elements processed per chunk, e.g. <code>future.chunk.size = 10L</code> (an analog to the <code>chunk.size</code> argument added to the <strong>parallel</strong> package in R 3.5.0).</p> <p>Q. <em>What about random number generation (RNG)? I&rsquo;ve heard it&rsquo;s tricky to get right when running in parallel.</em><br> A. Just add <code>future.seed = TRUE</code> and you&rsquo;re good. 
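</p>

<p>As a quick sketch of what that looks like in practice (assuming the <strong>future.apply</strong> package and a backend already set via <code>plan()</code>), a fixed <code>future.seed</code> gives reproducible draws:</p>

```r
library(future.apply)
plan(multisession, workers = 2)

## Fixed parallel RNG seed => reproducible random draws
y1 <- future_lapply(1:4, function(i) rnorm(1), future.seed = 123)
y2 <- future_lapply(1:4, function(i) rnorm(1), future.seed = 123)

identical(y1, y2)  # same results, regardless of backend and chunking
#> [1] TRUE
```

<p>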
This will use the <strong>parallel-safe</strong> and <strong>statistically sound</strong> <strong>L&rsquo;Ecuyer-CMRG RNG</strong>, which is a well-established parallel RNG algorithm and is used by the <strong>parallel</strong> package. The <strong>future.apply</strong> functions use this in a way that is also <strong>invariant to</strong> the future backend and the amount of &ldquo;chunking&rdquo; used. To produce numerically reproducible results, call <code>set.seed(123)</code> beforehand (as in the above example), or simply use <code>future.seed = 123</code>.</p> <p>Q. <em>What about global variables? Whenever I&rsquo;ve tried to parallelize code before, I often ran into errors on &ldquo;this or that variable is not found&rdquo;.</em><br> A. This is very rarely a problem when using the <a href="https://cran.r-project.org/package=future">future</a> framework - things work out of the box. <strong>Global variables and packages</strong> needed are <strong>automatically identified</strong> from static code inspection and passed on to the workers - even when the workers run on remote computers or in the cloud.</p> <p><em>Happy futuring!</em></p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: <a 
href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2018/06/18/future-erum2018-slides/">Delayed Future (Slides from eRum 2018)</a>, 2018-06-19</li> <li><a href="https://www.jottr.org/2018/04/12/future-results/">future 1.8.0: Preparing for a Shiny Future</a>, 2018-04-12</li> <li><a href="https://www.jottr.org/2017/06/05/many-faced-future/">The Many-Faced Future</a>, 2017-06-05</li> <li><a href="https://www.jottr.org/2017/02/19/future-rng/">future 1.3.0: Reproducible RNGs, future&#95;lapply() and More</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/22/future-hpc/">High-Performance Compute in R Using Futures</a>, 2016-10-22</li> <li><a href="https://www.jottr.org/2016/10/11/future-remotes/">Remote Processing Using Futures</a>, 2016-10-11</li> </ul> </description>
</item>
<item>
<title>Delayed Future (Slides from eRum 2018)</title>
<link>https://www.jottr.org/2018/06/18/future-erum2018-slides/</link>
<pubDate>Mon, 18 Jun 2018 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2018/06/18/future-erum2018-slides/</guid>
<description> <p><img src="https://www.jottr.org/post/erum2018--hexlogo.jpg" alt="The eRum 2018 hex sticker" /></p> <p>As promised - though a bit delayed - below are links to my slides and the video of my talk on <em>Future: Parallel &amp; Distributed Processing in R for Everyone</em> that I presented last month at the <a href="https://2018.erum.io/">eRum 2018</a> conference in Budapest, Hungary (May 14-16, 2018).</p> <p>The conference was very well organized (thank you everyone involved) with a great lineup of several brilliant workshop sessions, talks, and poster presentations (thanks all). It was such a pleasure to attend this conference and to connect and reconnect with so many of the lovely people that we are fortunate to have in the R Community. I&rsquo;m looking forward to meeting you all again.</p> <p>My talk (22 slides plus several appendix slides):</p> <ul> <li>Title: <em>Future: Parallel &amp; Distributed Processing in R for Everyone</em></li> <li><a href="https://www.jottr.org/presentations/eRum2018/BengtssonH_20180516-eRum2018.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="https://www.jottr.org/presentations/eRum2018/BengtssonH_20180516-eRum2018.pdf">PDF</a> (flat slides)</li> <li><a href="https://www.youtube.com/watch?v=doa7avxbptQ">Video</a> (22 mins)</li> </ul> <p>May the future be with you!</p> <h2 id="links">Links</h2> <ul> <li>eRum 2018: <ul> <li>Conference site: <a href="https://2018.erum.io/">https://2018.erum.io/</a></li> <li>All talks (slides &amp; videos): <a href="https://2018.erum.io/#talk-abstracts">https://2018.erum.io/#talk-abstracts</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: <a 
href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> <li>future.apply package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.apply">https://cran.r-project.org/package=future.apply</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.apply">https://github.com/HenrikBengtsson/future.apply</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>future 1.8.0: Preparing for a Shiny Future</title>
<link>https://www.jottr.org/2018/04/12/future-results/</link>
<pubDate>Thu, 12 Apr 2018 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2018/04/12/future-results/</guid>
<description> <p><strong><a href="https://cran.r-project.org/package=future">future</a></strong> 1.8.0 is available on CRAN.</p> <p>This release lays the foundation for being able to capture outputs from futures, perform automated timing and memory benchmarking (profiling) on futures, and more. These features are <em>not</em> yet available out of the box, but thanks to this release we will be able to make some headway on many of <a href="https://github.com/HenrikBengtsson/future/issues/172">the feature requests related to this</a> - hopefully already by the next release.</p> <p><img src="https://www.jottr.org/post/retro-shiny-future-small.png" alt="&quot;A Shiny Future&quot;" /></p> <p>For <strong>shiny</strong> users following Joe Cheng&rsquo;s efforts on extending <a href="https://rstudio.github.io/promises/articles/shiny.html">Shiny with asynchronous processing using futures</a>, <strong>future</strong> 1.8.0 comes with some <a href="https://github.com/HenrikBengtsson/future/issues/200">important updates/bug fixes</a> that allow for consistent error handling regardless of whether Shiny runs with or without futures and regardless of the future backend used. With previous versions of the <strong>future</strong> package, you would receive errors of different classes depending on which future backend was used.</p> <p>The <code>future_lapply()</code> function was moved to the <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> package back in January 2018. Please use that one instead, especially since the one in the <strong>future</strong> package is now formally deprecated (and produces a warning if used). In <strong>future.apply</strong> there is also a <code>future_sapply()</code> function and hopefully, in a not too far future, we&rsquo;ll see additional futurized versions of other base R apply functions, e.g. 
<code>future_vapply()</code> and <code>future_apply()</code>.</p> <p>Finally, with this release, there was a bug fix related to <em>nested futures</em> (where you call <code>future()</code> within a <code>future()</code> - or use <code>%&lt;-%</code> within another <code>%&lt;-%</code>). When using non-standard evaluation (NSE) such as <strong>dplyr</strong> expressions in a nested future, you could get a false error that complained about not being able to identify a global variable when it actually was a column in a data.frame.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <ul> <li><p>I&rsquo;m giving a presentation on futures at the <a href="https://2018.erum.io/">eRum 2018 conference taking place on May 14-16, 2018 in Budapest</a>. I&rsquo;m excited about this opportunity and to meet more folks in the European R community.</p></li> <li><p>I&rsquo;m happy to announce that The Infrastructure Steering Committee of The R Consortium is funding the project <a href="https://www.r-consortium.org/projects/awarded-projects">Future Minimal API: Specification with Backend Conformance Test Suite</a>. I&rsquo;m grateful for their support. The aim is to formalize the Future API further and to provide a standardized test suite that packages implementing future backends can validate their implementations against. This will benefit the quality of higher-level parallel frameworks that utilize futures internally, e.g. <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> and <strong>foreach</strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>. It will also help us move forward on several of <a href="https://github.com/HenrikBengtsson/future/issues/172">the feature requests received from the community</a>.</p></li> </ul> <h2 id="help-shape-the-future">Help shape the future</h2> <p>If you find futures useful in your R-related work, please consider sharing your stories, e.g. 
by blogging, on <a href="https://twitter.com/henrikbengtsson">Twitter</a>, or on <a href="https://github.com/HenrikBengtsson/future">GitHub</a>. It is always exciting to hear about how people are using them or how they&rsquo;d like to use them. I know there are so many great ideas out there!</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li>future package: <a href="https://cran.r-project.org/package=future">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future">GitHub</a></li> <li>future.batchtools package: <a href="https://cran.r-project.org/package=future.batchtools">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.batchtools">GitHub</a></li> <li>future.callr package: <a href="https://cran.r-project.org/package=future.callr">CRAN</a>, <a href="https://github.com/HenrikBengtsson/future.callr">GitHub</a></li> <li>doFuture package: <a href="https://cran.r-project.org/package=doFuture">CRAN</a>, <a href="https://github.com/HenrikBengtsson/doFuture">GitHub</a> (a <a href="https://cran.r-project.org/package=foreach">foreach</a> adaptor)</li> </ul> </description>
</item>
<item>
<title>Performance: Avoid Coercing Indices To Doubles</title>
<link>https://www.jottr.org/2018/04/02/coercion-of-indices/</link>
<pubDate>Mon, 02 Apr 2018 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2018/04/02/coercion-of-indices/</guid>
<description> <p><img src="https://www.jottr.org/post/1or1L.png" alt="&quot;1 or 1L?&quot;" /></p> <p><code>x[idxs + 1]</code> or <code>x[idxs + 1L]</code>? That is the question.</p> <p>Assume that we have a vector $x$ of $n = 100,000$ random values, e.g.</p> <pre><code class="language-r">&gt; n &lt;- 100000 &gt; x &lt;- rnorm(n) </code></pre> <p>and that we wish to calculate the $n-1$ first-order differences $y=(y_1, y_2, &hellip;, y_{n-1})$ where $y_i=x_{i+1} - x_i$. In R, we can calculate this using the following vectorized form:</p> <pre><code class="language-r">&gt; idxs &lt;- seq_len(n - 1) &gt; y &lt;- x[idxs + 1] - x[idxs] </code></pre> <p>We can certainly do better if we turn to native code, but is there a more efficient way to implement this using plain R code? It turns out there is (*). The following <strong>calculation is ~15-20% faster</strong>:</p> <pre><code class="language-r">&gt; y &lt;- x[idxs + 1L] - x[idxs] </code></pre> <p>The reason is that the index calculation:</p> <pre><code class="language-r">idxs + 1 </code></pre> <p>is <strong>inefficient due to a coercion of integers to doubles</strong>. 
We have that <code>idxs</code> is an integer vector but <code>idxs + 1</code> becomes a double vector because <code>1</code> is a double:</p> <pre><code class="language-r">&gt; typeof(idxs) [1] &quot;integer&quot; &gt; typeof(idxs + 1) [1] &quot;double&quot; &gt; typeof(1) [1] &quot;double&quot; </code></pre> <p>Note also that doubles (aka &ldquo;numerics&rdquo; in R) take up <strong>twice the amount of memory</strong>:</p> <pre><code class="language-r">&gt; object.size(idxs) 400040 bytes &gt; object.size(idxs + 1) 800032 bytes </code></pre> <p>which is because integers are stored as 4 bytes and doubles as 8 bytes.</p> <p>By using <code>1L</code> instead, we can avoid this coercion from integers to doubles:</p> <pre><code class="language-r">&gt; typeof(idxs) [1] &quot;integer&quot; &gt; typeof(idxs + 1L) [1] &quot;integer&quot; &gt; typeof(1L) [1] &quot;integer&quot; </code></pre> <p>and we save some otherwise wasted memory:</p> <pre><code class="language-r">&gt; object.size(idxs + 1L) 400040 bytes </code></pre> <p><strong>Does it really matter for the overall performance?</strong> It should, because <strong>less memory is allocated</strong>, which always comes with some overhead. Possibly more importantly, the smaller objects are in memory, the more likely it is that elements can be found in the memory cache rather than in the RAM itself, i.e. the <strong>chance for <em>cache hits</em> increases</strong>. Accessing data in the cache is orders of magnitude faster than in RAM. 
Furthermore, we also <strong>avoid coercion/casting</strong> of integers to doubles when R adds one to each element, which may add some extra CPU overhead.</p> <p>The performance gain is confirmed by running <strong><a href="https://cran.r-project.org/package=microbenchmark">microbenchmark</a></strong> on the two alternatives:</p> <pre><code class="language-r">&gt; microbenchmark::microbenchmark( + y &lt;- x[idxs + 1 ] - x[idxs], + y &lt;- x[idxs + 1L] - x[idxs] + ) Unit: milliseconds expr min lq mean median uq max neval cld y &lt;- x[idxs + 1] - x[idxs] 1.27 1.58 3.71 2.27 2.62 80.6 100 a y &lt;- x[idxs + 1L] - x[idxs] 1.04 1.25 2.38 1.34 2.20 76.5 100 a </code></pre> <p>From the median (which is the most informative here), we see that using <code>idxs + 1L</code> is ~15-20% faster than <code>idxs + 1</code> in this case (it depends on $n$ and the overall calculation performed).</p> <p><strong>Is it worth it?</strong> Although it is &ldquo;only&rdquo; an absolute difference of ~1 ms, it adds up if we do these calculations a large number of times, e.g. in a bootstrap algorithm. And if there are many places in the code that result in coercions from index calculations like these, that also adds up. Some may argue it&rsquo;s not worth it, but at least now you know it does indeed improve the performance a bit if you specify index constants as integers, i.e. 
by appending an <code>L</code>.</p> <p>To wrap it up, here is a look at the cost of subsetting all of the $1,000,000$ elements in a vector using various types of integer and double index vectors:</p> <pre><code class="language-r">&gt; n &lt;- 1000000 &gt; x &lt;- rnorm(n) &gt; idxs &lt;- seq_len(n) ## integer indices &gt; idxs_dbl &lt;- as.double(idxs) ## double indices &gt; microbenchmark::microbenchmark(unit = &quot;ms&quot;, + x[], + x[idxs], + x[idxs + 0L], + x[idxs_dbl], + x[idxs_dbl + 0], + x[idxs_dbl + 0L], + x[idxs + 0] + ) Unit: milliseconds expr min lq mean median uq max neval cld x[] 0.7056 0.7481 1.6563 0.7632 0.8351 74.682 100 a x[idxs] 3.9647 4.0638 5.1735 4.2020 4.7311 78.038 100 b x[idxs + 0L] 5.7553 5.8724 6.2694 6.0810 6.6447 7.845 100 bc x[idxs_dbl] 6.6355 6.7799 7.9916 7.1305 7.6349 77.696 100 cd x[idxs_dbl + 0] 7.7081 7.9441 8.6044 8.3321 8.9432 12.171 100 d x[idxs_dbl + 0L] 8.0770 8.3050 8.8973 8.7669 9.1682 12.578 100 d x[idxs + 0] 7.9980 8.2586 8.8544 8.8924 9.2197 12.345 100 d </code></pre> <p>(I ordered the entries by their &lsquo;median&rsquo; processing times.)</p> <p>In all cases, we are extracting the complete vector <code>x</code>. We see that</p> <ol> <li>subsetting using an integer vector is faster than using a double vector,</li> <li><code>x[idxs + 0L]</code> is faster than <code>x[idxs + 0]</code> (as seen previously),</li> <li><code>x[idxs + 0L]</code> is still faster than <code>x[idxs_dbl]</code> despite also involving an addition, and</li> <li><code>x[]</code> is whoppingly fast (probably because it does not have to iterate over an index vector) and serves as a lower-bound reference for the best we can hope for.</li> </ol> <p>(*): There already exists a highly efficient implementation for calculating the first-order differences, namely <code>y &lt;- diff(x)</code>. 
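</p> <p>As a quick sanity check, we can confirm that the integer-indexed expression used throughout this post gives the same result as <code>diff()</code>, reusing <code>n</code>, <code>x</code>, and <code>idxs</code> from above:</p> <pre><code class="language-r">&gt; n &lt;- 100000
&gt; x &lt;- rnorm(n)
&gt; idxs &lt;- seq_len(n - 1)
&gt; stopifnot(all.equal(x[idxs + 1L] - x[idxs], diff(x)))
</code></pre> <p>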
But for the sake of the take-home message of this blog post, let&rsquo;s ignore that.</p> <p><strong>Bonus</strong>: Did you know that <code>sd(y) / sqrt(2)</code> is an estimator of the standard deviation of the above <code>x</code>:s (von Neumann et al., 1941)? It&rsquo;s actually not too hard to derive this - give it a try by deriving the variance when the elements of <code>x</code> are independent, identically distributed Gaussian random variables. This property is useful in cases where we are interested in the noise level of <code>x</code> and <code>x</code> has a piecewise constant mean level which changes at a small number of locations, e.g. a DNA copy-number profile of a tumor. In such cases we cannot use <code>sd(x)</code>, because the estimate would be biased due to the different mean levels. Instead, by taking the first-order differences <code>y</code>, changes in mean levels of <code>x</code> become sporadic outliers in <code>y</code>. If we could trim off these outliers, <code>sd(y) / sqrt(2)</code> would be a good estimate of the standard deviation of <code>x</code> after subtracting the mean levels. Even better, by using a robust estimator, such as the median absolute deviation (MAD) - <code>mad(y) / sqrt(2)</code> - we do not have to worry about having to identify the outliers. Efficient implementations of <code>sd(diff(x)) / sqrt(2)</code> and <code>mad(diff(x)) / sqrt(2)</code> are <code>sdDiff(x)</code> and <code>madDiff(x)</code> of the <strong><a href="https://cran.r-project.org/package=matrixStats">matrixStats</a></strong> package.</p> <h1 id="references">References</h1> <p>J. von Neumann et al., The mean square successive difference. 
<em>Annals of Mathematical Statistics</em>, 1941, 12, 153-162.</p> <h1 id="session-information">Session information</h1> <p><details></p> <pre><code class="language-r">&gt; sessionInfo() R version 3.4.4 (2018-03-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.4 LTS Matrix products: default BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0 LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] compiler_3.4.4 </code></pre> <p></details></p> </description>
</item>
<item>
<title>Startup with Secrets - A Poor Man's Approach</title>
<link>https://www.jottr.org/2018/03/30/startup-secrets/</link>
<pubDate>Fri, 30 Mar 2018 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2018/03/30/startup-secrets/</guid>
<description> <p>New release: <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> 0.10.0 is now on CRAN.</p> <p>If your R startup files (<code>.Renviron</code> and <code>.Rprofile</code>) get long and windy, or if you want to make parts of them public and other parts private, then you can use the <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> package to split them up into separate files and directories under <code>.Renviron.d/</code> and <code>.Rprofile.d/</code>. For instance, the <code>.Rprofile.d/repos.R</code> file can be solely dedicated to setting the <code>repos</code> option, which specifies which web servers R packages are installed from. This makes it easy to find and easy to share with others (e.g. on GitHub). To make use of <strong>startup</strong>, install the package and then call <code>startup::install()</code> once. For an introduction, see <a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a>.</p> <p><img src="https://www.jottr.org/post/startup_0.10.0-zxspectrum.gif" alt="ZX Spectrum animation" /> <em>startup::startup() is cross platform.</em></p> <p>Several R packages provide APIs for easier access to online services such as GitHub, GitLab, Twitter, Amazon AWS, Google GCE, etc. These packages often rely on R options or environment variables to hold your secret credentials or tokens in order to provide more or less automatic, batch-friendly access to those services. For convenience, it is common to set these secret options in <code>~/.Rprofile</code> or secret environment variables in <code>~/.Renviron</code> - or if you use the <strong><a href="https://cran.r-project.org/package=startup">startup</a></strong> package, in separate files. 
For instance, add a file <code>~/.Renviron.d/private/github</code> containing:</p> <pre><code>## GitHub token used by devtools GITHUB_PAT=db80a925a60ee5b57f323c7b3719bbaaf9f96b26 </code></pre> <p>Then, when you start R, the environment variable <code>GITHUB_PAT</code> will be accessible from within R as:</p> <pre><code class="language-r">&gt; Sys.getenv(&quot;GITHUB_PAT&quot;) [1] &quot;db80a925a60ee5b57f323c7b3719bbaaf9f96b26&quot; </code></pre> <p>which means that <strong>devtools</strong> can also make use of it.</p> <p><strong>IMPORTANT</strong>: If you&rsquo;re on a shared file system or a computer with multiple users, you want to make sure no one else can access your files holding &ldquo;secrets&rdquo;. If you&rsquo;re on Linux or macOS, this can be done by:</p> <pre><code class="language-sh">$ chmod -R go-rwx ~/.Renviron.d/private/ </code></pre> <p>Also, <em>keeping &ldquo;secrets&rdquo; in options or environment variables is <strong>not</strong> super secure</em>. For instance, <em>if your script or a third-party package dumps <code>Sys.getenv()</code> to a log file, that log file will contain your &ldquo;secrets&rdquo; too</em>. Depending on your default settings on the machine / file system, that log file might be readable by others in your group or even by anyone on the file system. And if you&rsquo;re not careful, you might even end up sharing that file with the public, e.g. on GitHub.</p> <p>Having said this, with the above setup we at least know that the secret token is only loaded when we run R and only when we run R as ourselves. <strong>Starting with startup 0.10.0</strong> (*), we can customize the startup further such that secrets are only loaded conditionally on a certain environment variable. 
For instance, we can instead put our secret files in a folder named:</p> <pre><code>~/.Renviron.d/private/SECRET=develop/ </code></pre> <p>Then (i) that folder will not be visible to anyone else, because we already restricted access to <code>~/.Renviron.d/private/</code>, and (ii) the secrets defined by files in that folder will <em>only be loaded</em> during the R startup <em>if and only if</em> environment variable <code>SECRET</code> has value <code>develop</code>. For example,</p> <pre><code class="language-sh">$ SECRET=develop Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot; [1] &quot;db80a925a60ee5b57f323c7b3719bbaaf9f96b26&quot; </code></pre> <p>will load the secrets, but none of the following will:</p> <pre><code class="language-sh">$ Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot; [1] &quot;&quot; $ SECRET=runtime Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot; [1] &quot;&quot; </code></pre> <p>In other words, with the above approach, you can avoid loading secrets by default and only load them when you really need them. This lowers the risk of exposing them by mistake in log files or to R code you&rsquo;re not in control of. Furthermore, if you only need <code>GITHUB_PAT</code> in <em>interactive</em> devtools sessions, name the folder:</p> <pre><code>~/.Renviron.d/private/interactive=TRUE,SECRET=develop/ </code></pre> <p>and it will only be loaded in an interactive session, e.g.</p> <pre><code class="language-sh">$ SECRET=develop Rscript -e &quot;Sys.getenv('GITHUB_PAT')&quot; [1] &quot;&quot; </code></pre> <p>and</p> <pre><code class="language-sh">$ SECRET=develop R --quiet &gt; Sys.getenv('GITHUB_PAT') [1] &quot;db80a925a60ee5b57f323c7b3719bbaaf9f96b26&quot; </code></pre> <p>To repeat what has already been said above, <em>storing secrets in environment variables or R variables provides only very limited security</em>. 
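</p> <p>A more secure alternative, mentioned below, is to keep the token out of startup files altogether and fetch it on demand from the operating system&rsquo;s credential store. Here is a minimal sketch using the <strong>keyring</strong> package - note that the service name <code>&quot;github_pat&quot;</code> is just an arbitrary label chosen here for illustration:</p> <pre><code class="language-r">library(&quot;keyring&quot;)

## One-time, interactive step that prompts for the secret,
## so it never has to be written to a dotfile:
# key_set(service = &quot;github_pat&quot;)

## Later, fetch the token only at the point of use:
token &lt;- key_get(service = &quot;github_pat&quot;)
Sys.setenv(GITHUB_PAT = token)
</code></pre> <p>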
The above approach is meant to provide you with a bit more control if you are already storing credentials in <code>~/.Renviron</code> or <code>~/.Rprofile</code>. For a more secure approach to storing secrets, see the <strong><a href="https://cran.r-project.org/package=keyring">keyring</a></strong> package, which makes it easy to &ldquo;access the system credential store from R&rdquo; in a cross-platform fashion.</p> <h2 id="what-s-new-in-startup-0-10-0">What&rsquo;s new in startup 0.10.0?</h2> <ul> <li><p>Renviron and Rprofile startup files that use <code>&lt;key&gt;=&lt;value&gt;</code> filters with non-declared keys are now(*) skipped (which makes the above possible).</p></li> <li><p><code>startup(debug = TRUE)</code> reports more details.</p></li> <li><p>A startup script can use <code>startup::is_debug_on()</code> to output messages during the startup process, conditionally on whether the user has chosen to display debug messages or not.</p></li> <li><p>Added <code>sysinfo()</code> flags <code>microsoftr</code>, <code>pqr</code>, <code>rstudioterm</code>, and <code>rtichoke</code>, which can be used in directory and file names to process them depending on in which environment R is running.</p></li> <li><p><code>restart()</code> also works in the RStudio Terminal.</p></li> </ul> <h2 id="links">Links</h2> <ul> <li><p><strong>startup</strong> package:</p> <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=startup">https://cran.r-project.org/package=startup</a> (<a href="https://cran.r-project.org/web/packages/startup/NEWS">NEWS</a>, <a href="https://cran.r-project.org/web/packages/startup/vignettes/startup-intro.html">vignette</a>)</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/startup">https://github.com/HenrikBengtsson/startup</a></li> </ul></li> <li><p>Blog post <a href="https://www.jottr.org/2016/12/22/startup/">Start Me Up</a> on 2016-12-22.</p></li> </ul> <p>(*) In <strong>startup</strong> (&lt; 
0.10.0), <code>~/.Renviron.d/private/SECRET=develop/</code> would be processed not only when <code>SECRET</code> had value <code>develop</code> but also when it was <em>undefined</em>. In <strong>startup</strong> (&gt;= 0.10.0), files with such <code>&lt;key&gt;=&lt;value&gt;</code> tags will now be skipped when that key variable is undefined.</p> </description>
</item>
<item>
<title>The Many-Faced Future</title>
<link>https://www.jottr.org/2017/06/05/many-faced-future/</link>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2017/06/05/many-faced-future/</guid>
<description> <p>The <a href="https://cran.r-project.org/package=future">future</a> package defines the Future API, which is a unified, generic, friendly API for parallel processing. The Future API follows the principle of <strong>write code once and run anywhere</strong> - the developer chooses what to parallelize and the user how and where.</p> <p>The nature of a future is such that it lends itself to be used with several of the existing map-reduce frameworks already available in R. In this post, I&rsquo;ll give an example of how to apply a function over a set of elements concurrently using plain sequential R, the parallel package, the <a href="https://cran.r-project.org/package=future">future</a> package alone, as well as future in combination with the <a href="https://cran.r-project.org/package=foreach">foreach</a>, <a href="https://cran.r-project.org/package=plyr">plyr</a>, and <a href="https://cran.r-project.org/package=purrr">purrr</a> packages.</p> <p><img src="https://www.jottr.org/post/julia_sets.gif" alt="Julia Set animation" /> <em>You can choose your own future and what you want to do with it.</em></p> <h2 id="example-multiple-julia-sets">Example: Multiple Julia sets</h2> <p>The <a href="https://cran.r-project.org/package=Julia">Julia</a> package provides the <code>JuliaImage()</code> function for generating a <a href="https://en.wikipedia.org/wiki/Julia_set">Julia set</a> for a given set of start parameters <code>(centre, L, C)</code>, where <code>centre</code> specifies the center point in the complex plane, <code>L</code> specifies the width and height of the square region around this location, and <code>C</code> is a complex coefficient controlling the &ldquo;shape&rdquo; of the generated Julia set. 
For example, to generate one of the above Julia set images (1000-by-1000 pixels), you can use:</p> <pre><code class="language-r">library(&quot;Julia&quot;) set &lt;- JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = -0.4 + 0.6i) plot_julia(set) </code></pre> <p>with</p> <pre><code class="language-r">plot_julia &lt;- function(img, col = topo.colors(16)) { par(mar = c(0, 0, 0, 0)) image(img, col = col, axes = FALSE) } </code></pre> <p>For the purpose of illustrating how to calculate different Julia sets in parallel, I will use the same <code>(centre, L) = (0 + 0i, 3.5)</code> region as above with the following ten complex coefficients (from <a href="https://en.wikipedia.org/wiki/Julia_set">Julia set</a>):</p> <pre><code class="language-r">Cs &lt;- c( a = -0.618, b = -0.4 + 0.6i, c = 0.285 + 0i, d = 0.285 + 0.01i, e = 0.45 + 0.1428i, f = -0.70176 - 0.3842i, g = 0.835 - 0.2321i, h = -0.8 + 0.156i, i = -0.7269 + 0.1889i, j = - 0.8i ) </code></pre> <p>Now we&rsquo;re ready to see how we can use futures in combination with different map-reduce implementations in R for generating these ten sets in parallel. Note that all approaches will generate the exact same ten Julia sets. 
So, feel free to pick your favorite approach.</p> <h2 id="sequential">Sequential</h2> <p>To process the above ten regions sequentially, we can use the <code>lapply()</code> function:</p> <pre><code class="language-r">library(&quot;Julia&quot;) sets &lt;- lapply(Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }) </code></pre> <h2 id="parallel">Parallel</h2> <pre><code class="language-r">library(&quot;parallel&quot;) ncores &lt;- future::availableCores() ## a friendly version of detectCores() cl &lt;- makeCluster(ncores) clusterEvalQ(cl, library(&quot;Julia&quot;)) sets &lt;- parLapply(cl, Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }) </code></pre> <h2 id="futures-in-parallel">Futures (in parallel)</h2> <pre><code class="language-r">library(&quot;future&quot;) plan(multisession) ## defaults to availableCores() workers library(&quot;Julia&quot;) sets &lt;- future_lapply(Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }) </code></pre> <p>We could also have used the more explicit setup <code>plan(cluster, workers = makeCluster(availableCores()))</code>, which is identical to <code>plan(multisession)</code>.</p> <h2 id="futures-with-foreach">Futures with foreach</h2> <pre><code class="language-r">library(&quot;doFuture&quot;) registerDoFuture() ## tells foreach futures should be used plan(multisession) ## specifies what type of futures sets &lt;- foreach(C = Cs) %dopar% { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) } </code></pre> <p>Note that I didn&rsquo;t pass <code>.packages = &quot;Julia&quot;</code> to <code>foreach()</code> because the doFuture backend will do that automatically for us - that&rsquo;s one of the treats of using futures. 
If we had used <code>doParallel::registerDoParallel(cl)</code> or similar, we would have had to worry about that.</p> <h2 id="futures-with-plyr">Futures with plyr</h2> <p>The plyr package will utilize foreach internally if we pass <code>.parallel = TRUE</code>. Because of this, we can use <code>plyr::llply()</code> to parallelize via futures as follows:</p> <pre><code class="language-r">library(&quot;plyr&quot;) library(&quot;doFuture&quot;) registerDoFuture() ## tells foreach futures should be used plan(multisession) ## specifies what type of futures library(&quot;Julia&quot;) sets &lt;- llply(Cs, function(C) { JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C) }, .parallel = TRUE) </code></pre> <p>For the same reason as above, here we also don&rsquo;t have to worry about global variables or about making sure that needed packages are attached; that&rsquo;s all handled by the future package.</p> <h2 id="futures-with-purrr-furrr">Futures with purrr (= furrr)</h2> <p>As a final example, here is how you can use futures to parallelize your <code>purrr::map()</code> calls:</p> <pre><code class="language-r">library(&quot;purrr&quot;) library(&quot;future&quot;) plan(multisession) library(&quot;Julia&quot;) sets &lt;- Cs %&gt;% map(~ future(JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = .x))) %&gt;% values </code></pre> <p><em>Comment:</em> This latter approach will not perform load balancing (&ldquo;scheduling&rdquo;) across backend workers; that&rsquo;s a feature that ideally would be taken care of by purrr itself. However, I have some ideas for future versions of future (pun&hellip;) that may achieve this without having to modify the purrr package.</p> <h1 id="got-compute">Got compute?</h1> <p>If you have access to one or more machines with R installed (e.g. 
a local or remote cluster, or a <a href="https://cran.r-project.org/package=googleComputeEngineR">Google Compute Engine cluster</a>), and you&rsquo;ve got direct SSH access to those machines, you can have those machines calculate the above Julia sets; just change the future plan, e.g.</p> <pre><code class="language-r">plan(cluster, workers = c(&quot;machine1&quot;, &quot;machine2&quot;, &quot;machine3.remote.org&quot;)) </code></pre> <p>If you have access to a high-performance compute (HPC) cluster with an HPC scheduler (e.g. Slurm, TORQUE / PBS, LSF, or SGE), then you can harness its power by switching to:</p> <pre><code class="language-r">library(&quot;future.batchtools&quot;) plan(batchtools_sge) </code></pre> <p>For more details, see the vignettes of the <a href="https://cran.r-project.org/package=future.batchtools">future.batchtools</a> and <a href="https://cran.r-project.org/package=batchtools">batchtools</a> packages.</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.batchtools">https://cran.r-project.org/package=future.batchtools</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package (a <a href="https://cran.r-project.org/package=foreach">foreach</a> adaptor): <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> 
<ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-21</li> <li><a href="https://www.jottr.org/2017/02/future-reproducible-rngs-futurelapply.html">future: Reproducible RNGs, future_lapply() and more</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2017/03/dofuture-universal-foreach-adapator.html">doFuture: A universal foreach adaptor ready to be used by 1,000+ packages</a>, 2017-03-18</li> </ul> </description>
</item>
<item>
<title>The R-help Community was Started on This Day 20 Years Ago</title>
<link>https://www.jottr.org/2017/04/01/history-r-help-20-years/</link>
<pubDate>Sat, 01 Apr 2017 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2017/04/01/history-r-help-20-years/</guid>
<description><p>Today, it&rsquo;s been 20 years since Martin Mächler started the <a href="https://stat.ethz.ch/pipermail/r-help/">R-help community list</a>. The <a href="https://stat.ethz.ch/pipermail/r-help/1997-April/001488.html">first post</a> was written by Ross Ihaka on 1997-04-01:</p> <p><img src="https://www.jottr.org/post/r-help_first_post.png" alt="Subject: R-alpha: R-testers: pmin heisenbug From: Ross Ihaka &lt;ihaka at stat.auckland.ac.nz&gt; When: Tue Apr 1 10:35:48 CEST 1997" /> <em>Screenshot of the very first post to the R-help mailing list.</em></p> <p>This is a post about R&rsquo;s memory model. We&rsquo;re talking <a href="https://cran.r-project.org/src/base/R-0/">R v0.50 beta</a>. I think that the paragraph at the end provides a nice anecdote on the importance of not being overwhelmed by the problems ahead:</p> <blockquote> <p>&ldquo;(The consumption of one cell per string is perhaps the major memory problem in R - we didn&rsquo;t design it with large problems in mind. It is probably fixable, but it will mean a lot of work).&rdquo;</p> </blockquote> <p>We all know the story; an endless number of hours has been put in by many contributors throughout the years, making The R Project and its community the great experience it is today.</p> <p>Thank you!</p> <p>PS. This is a blog version of my <a href="https://stat.ethz.ch/pipermail/r-help/2017-April/445921.html">R-help post</a> with the same content.</p> </description>
</item>
<item>
<title>doFuture: A Universal Foreach Adaptor Ready to be Used by 1,000+ Packages</title>
<link>https://www.jottr.org/2017/03/18/dofuture/</link>
<pubDate>Sat, 18 Mar 2017 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2017/03/18/dofuture/</guid>
<description> <p><a href="https://cran.r-project.org/package=doFuture">doFuture</a> 0.4.0 is available on CRAN. The doFuture package provides a <em>universal</em> <a href="https://cran.r-project.org/package=foreach">foreach</a> adaptor enabling <em>any</em> <a href="https://cran.r-project.org/package=future">future</a> backend to be used with the <code>foreach() %dopar% { ... }</code> construct. As shown below, this will allow <code>foreach()</code> to parallelize not only on multiple cores, multiple background R sessions, and ad-hoc clusters, but also on cloud-based clusters and high performance compute (HPC) environments.</p> <p>1,300+ R packages on CRAN and Bioconductor depend, directly or indirectly, on foreach for their parallel processing. By using doFuture, a user has the option to parallelize those computations on more compute environments than previously supported, especially HPC clusters. Notably, all <a href="https://cran.r-project.org/package=plyr">plyr</a> code with <code>.parallel = TRUE</code> will be able to take advantage of this without the need for modifications - this is possible because internally plyr makes use of foreach for its parallelization.</p> <p><img src="https://www.jottr.org/post/programmer_next_to_62500_punch_cards_SAGE.jpg" alt=" Programmer standing beside punched cards" /> <em>With doFuture, foreach can process your code in more places than ever before. Alright, it may not be able to process <a href="http://www.computerhistory.org/revolution/memory-storage/8/326/924">this programmer&rsquo;s 62,500 punched cards</a>.</em></p> <h2 id="what-is-new-in-dofuture-0-4-0">What is new in doFuture 0.4.0?</h2> <ul> <li><p><strong>Load balancing</strong>: The doFuture <code>%dopar%</code> backend will now partition all iterations (elements) and distribute them uniformly such that each backend worker will receive exactly one partition, equal in size to those sent to the other workers. 
This approach speeds up the processing significantly when iterating over a large set of elements that each has a relatively small processing time.</p></li> <li><p><strong>Globals</strong>: Global variables and packages needed in order for external R workers to evaluate the foreach expression are now identified by the same algorithm as used for regular future constructs and <code>future::future_lapply()</code>.</p></li> </ul> <p>For full details on updates, please see the <a href="https://cran.r-project.org/package=doFuture">NEWS</a> file. <strong>The doFuture package installs out-of-the-box on all operating systems</strong>.</p> <h2 id="a-quick-example">A quick example</h2> <p>Here is a bootstrap example using foreach adapted from <code>help(&quot;clusterApply&quot;, package = &quot;parallel&quot;)</code>. I use this example to illustrate how to perform <code>foreach()</code> iterations in parallel on a variety of backends.</p> <pre><code>library(&quot;boot&quot;) run &lt;- function(...) 
{ cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v) cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4)) boot(cd4, corr, R = 10000, sim = &quot;parametric&quot;, ran.gen = cd4.rg, mle = cd4.mle) } ## Attach doFuture (and foreach), and tell foreach to use futures library(&quot;doFuture&quot;) registerDoFuture() ## Sequentially on the local machine plan(sequential) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 298.728 0.601 304.242 # In parallel on local machine (with 8 cores) plan(multisession) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 452.241 1.635 68.740 # In parallel on the ad-hoc cluster machine (5 machines with 4 workers each) nodes &lt;- rep(c(&quot;n1&quot;, &quot;n2&quot;, &quot;n3&quot;, &quot;n4&quot;, &quot;n5&quot;), each = 4L) plan(cluster, workers = nodes) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 2.046 0.188 22.227 # In parallel on Google Compute Engine (10 r-base Docker containers) vms &lt;- lapply(paste0(&quot;node&quot;, 1:10), FUN = googleComputeEngineR::gce_vm, template = &quot;r-base&quot;) vms &lt;- lapply(vms, FUN = gce_ssh_setup) vms &lt;- as.cluster(vms, docker_image = &quot;henrikbengtsson/r-base-future&quot;) plan(cluster, workers = vms) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 0.952 0.040 26.269 # In parallel on a HPC cluster with a TORQUE / PBS scheduler # (Note, the below timing includes waiting time on job queue) plan(future.BatchJobs::batchjobs_torque, workers = 10) system.time(boot &lt;- foreach(i = 1:100, .packages = &quot;boot&quot;) %dopar% { run() }) ## user system elapsed ## 15.568 6.778 52.024 </code></pre> <h2 id="about-export-and-packages">About <code>.export</code> and <code>.packages</code></h2> <p>When using 
<code>doFuture::registerDoFuture()</code>, there is no need to manually specify which global variables (argument <code>.export</code>) to export. By default, the doFuture backend automatically identifies and exports all globals needed. This is done using recursive static-code inspection. The same is true for packages that need to be attached; those will also be handled automatically and there is no need to specify them manually via argument <code>.packages</code>. This is in line with how it works for regular future constructs, e.g. <code>y %&lt;-% { a * sum(x) }</code>.</p> <p>Having said this, you may still want to specify arguments <code>.export</code> and <code>.packages</code> because of the risk that your <code>foreach()</code> statement may not work with other foreach adaptors, e.g. <a href="https://cran.r-project.org/package=doParallel">doParallel</a> and <a href="https://cran.r-project.org/package=doSNOW">doSNOW</a>. Exactly when and where a failure may occur depends on the nestedness of your code and the location of your global variables. 
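</p> <p>To make the automatic handling concrete, here is a minimal sketch - the globals <code>a</code> and <code>x</code> are toy variables invented for this illustration - where no <code>.export</code> or <code>.packages</code> arguments are needed:</p> <pre><code class="language-r">library(&quot;doFuture&quot;)
registerDoFuture()
plan(multisession)

a &lt;- 3.14
x &lt;- 1:10

## Globals 'a' and 'x' are identified by static-code inspection and
## exported to the workers automatically; with, say, doSNOW one may
## instead need .export = c(&quot;a&quot;, &quot;x&quot;) here.
y &lt;- foreach(i = 1:2) %dopar% { a * sum(x) + i }
</code></pre> <p>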
Specifying <code>.export</code> and <code>.packages</code> manually skips this automatic identification.</p> <p>Finally, I recommend that you as a developer always try to write your code in such a way that users can choose their own futures: The developer decides <em>what</em> should be parallelized - the user chooses <em>how</em>.</p> <p>Happy futuring!</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package (enhancing <a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a>): <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>future.batchtools package (enhancing <a href="https://cran.r-project.org/package=batchtools">batchtools</a>): <ul> <li>CRAN page: coming soon</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>googleComputeEngineR package: <ul> <li>CRAN page: <a
href="https://cran.r-project.org/package=googleComputeEngineR">https://cran.r-project.org/package=googleComputeEngineR</a></li> <li>GitHub page: <a href="https://cloudyr.github.io/googleComputeEngineR">https://cloudyr.github.io/googleComputeEngineR</a> <br /></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2017/02/future-reproducible-rngs-futurelapply.html">future: Reproducible RNGs, future_lapply() and more</a>, 2017-02-19</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-21</li> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> </description>
</item>
<item>
<title>future 1.3.0: Reproducible RNGs, future_lapply() and More</title>
<link>https://www.jottr.org/2017/02/19/future-rng/</link>
<pubDate>Sun, 19 Feb 2017 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2017/02/19/future-rng/</guid>
<description> <p><a href="https://cran.r-project.org/package=future">future</a> 1.3.0 is available on CRAN. With futures, it is easy to <strong>write R code once</strong>, which the user can choose to evaluate in parallel using whatever resources s/he has available, e.g. a local machine, a set of local machines, a set of remote machines, a high-end compute cluster (via <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> and soon also <a href="https://github.com/HenrikBengtsson/future.batchtools">future.batchtools</a>), or in the cloud (e.g. via <a href="https://cran.r-project.org/package=googleComputeEngineR">googleComputeEngineR</a>).</p> <p><img src="https://www.jottr.org/post/funny_car_magnet_animated.gif" alt="Silent movie clip of man in a cart catching a ride with a car passing by using a giant magnet" /> <em>Futures make it easy to harness any resources at hand.</em></p> <p>Thanks to great feedback from the community, this new version provides:</p> <ul> <li><p><strong>A convenient lapply() function</strong></p> <ul> <li>Added <code>future_lapply()</code> that works like <code>lapply()</code> and gives identical results, with the difference that futures are used internally. Depending on the user&rsquo;s choice of <code>plan()</code>, these calculations may be processed sequentially, in parallel, or distributed on multiple machines.</li> <li>Load balancing can be controlled by argument <code>future.scheduling</code>, which is a scalar adjusting how many futures each worker should process.</li> <li>Perfectly reproducible random number generation (RNG) is guaranteed given the same initial seed, regardless of the type of futures used and choice of load balancing. Argument <code>future.seed = TRUE</code> (default) will use a random initial seed, which may also be specified as <code>future.seed = &lt;integer&gt;</code>.
L&rsquo;Ecuyer-CMRG RNG streams are used internally.</li> </ul></li> <li><p><strong>Clarifies distinction between developer and end user</strong></p> <ul> <li>The end user controls what future strategy to use by default, e.g. <code>plan(multisession)</code> or <code>plan(cluster, workers = c(&quot;machine1&quot;, &quot;machine2&quot;, &quot;remote.server.org&quot;))</code>.</li> <li>The developer controls whether futures should be resolved eagerly (default) or lazily, e.g. <code>f &lt;- future(..., lazy = TRUE)</code>. Because of this, <code>plan(lazy)</code> is now deprecated.</li> </ul></li> <li><p><strong>Is even more friendly to multi-tenant compute environments</strong></p> <ul> <li><code>availableCores()</code> returns the number of cores available to the current R process. On a regular machine, this typically corresponds to the number of cores on the machine (<code>parallel::detectCores()</code>). If option <code>mc.cores</code> or environment variable <code>MC_CORES</code> is set, then that will be returned. However, on compute clusters using schedulers such as SGE, Slurm, and TORQUE / PBS, the function detects the number of cores allotted to the job by the scheduler and returns that instead. <strong>This way developers don&rsquo;t have to adjust their code to match a certain compute environment; the default works everywhere</strong>.</li> <li>With the new version, the fallback value used when nothing else is specified no longer has to be the number of cores on the machine; it can instead be set via option <code>future.availableCores.fallback</code> or environment variable <code>R_FUTURE_AVAILABLE_FALLBACK</code>. For instance, by setting <code>R_FUTURE_AVAILABLE_FALLBACK=1</code> system-wide in HPC environments, any user running outside of the scheduler will automatically use single-core processing unless explicitly requesting more cores.
This lowers the risk of overloading the CPU by mistake.</li> <li>Analogously to how <code>availableCores()</code> returns the number of cores, the new function <code>availableWorkers()</code> returns the host names available to the R process. The default is <code>rep(&quot;localhost&quot;, times = availableCores())</code>, but when using HPC schedulers it may be the host names of other compute nodes allocated to the job. <br /></li> </ul></li> </ul> <p>For full details on updates, please see the <a href="https://cran.r-project.org/package=future">NEWS</a> file. <strong>The future package installs out-of-the-box on all operating systems</strong>.</p> <h2 id="a-quick-example">A quick example</h2> <p>The bootstrap example of <code>help(&quot;clusterApply&quot;, package = &quot;parallel&quot;)</code> adapted to make use of futures.</p> <pre><code class="language-r">library(&quot;future&quot;)
library(&quot;boot&quot;)

run &lt;- function(...) {
  cd4.rg &lt;- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle &lt;- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 5000, sim = &quot;parametric&quot;,
       ran.gen = cd4.rg, mle = cd4.mle)
}

# base::lapply()
system.time(boot &lt;- lapply(1:100, FUN = run))
###    user  system elapsed
### 133.637   0.000 133.744

# Sequentially on the local machine
plan(sequential)
system.time(boot0 &lt;- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
###    user  system elapsed
### 134.916   0.003 135.039

# In parallel on the local machine (with 8 cores)
plan(multisession)
system.time(boot1 &lt;- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
###  user system elapsed
### 0.960  0.041  29.527

stopifnot(all.equal(boot1, boot0))
</code></pre> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>The <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> package, which builds on top of <a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a>, provides future strategies for various HPC
schedulers, e.g. SGE, Slurm, and TORQUE / PBS. For example, by using <code>plan(batchjobs_torque)</code> instead of <code>plan(multisession)</code>, your futures will be resolved as distributed jobs on a compute cluster instead of in parallel on your local machine. That&rsquo;s it! However, since last year, the BatchJobs package has been decommissioned and the authors recommend that everyone use their new <a href="https://cran.r-project.org/package=batchtools">batchtools</a> package instead. Just like BatchJobs, it is a very well-written package, but it is also more robust against cluster problems and supports more types of HPC schedulers. Because of this, I&rsquo;ve been working on <a href="https://github.com/HenrikBengtsson/future.batchtools">future.batchtools</a>, which I hope to be able to release soon.</p> <p>Finally, I&rsquo;m really keen on looking into how futures can be used with Shaun Jackman&rsquo;s <a href="https://github.com/sjackman/lambdar">lambdar</a>, which is a proof-of-concept that allows you to execute R code on Amazon&rsquo;s &ldquo;serverless&rdquo; <a href="https://aws.amazon.com/lambda/">AWS Lambda</a> framework.
My hope is that, in a not too far future (pun not intended*), we&rsquo;ll be able to resolve our futures on AWS Lambda using <code>plan(aws_lambda)</code>.</p> <p>Happy futuring!</p> <p>(*) Alright, I admit, it was intended.</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>future.batchtools package: <ul> <li>CRAN page: N/A</li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.batchtools">https://github.com/HenrikBengtsson/future.batchtools</a></li> </ul></li> <li>doFuture package (a <a href="https://cran.r-project.org/package=foreach">foreach</a> adaptor): <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-21</li> </ul> </description>
</item>
<item>
<title>Start Me Up</title>
<link>https://www.jottr.org/2016/12/22/startup/</link>
<pubDate>Thu, 22 Dec 2016 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2016/12/22/startup/</guid>
<description> <p>The <a href="https://cran.r-project.org/package=startup">startup</a> package makes it easy to control your R startup processes and to share part of your startup settings with others (e.g. as a public Git repository) while keeping secret parts to yourself. Instead of having long and windy <code>.Renviron</code> and <code>.Rprofile</code> startup files, you can split them up into short specific files under corresponding <code>.Renviron.d/</code> and <code>.Rprofile.d/</code> directories. For example,</p> <pre><code># Environment variables
# (one name=value per line)
.Renviron.d/
+- lang                  # language settings
+- libs                  # library settings
+- r_cmd_check           # R CMD check settings
+- secrets               # secret access keys (don't share!)

# Configuration scripts
# (regular R scripts)
.Rprofile.d/
+- interactive=TRUE/     # Used in interactive-mode only:
|  +- help.start.R       #  - launch the help server on fixed port
|  +- misc.R             #  - TAB completions and more
|  +- package=fortunes.R #  - show a random fortune (iff installed)
+- package=devtools.R    # devtools-specific options
+- os=windows.R          # Windows-specific settings
+- repos.R               # set up the CRAN repository
</code></pre> <p>All you need for this to work is to have the line:</p> <pre><code class="language-r">startup::startup()
</code></pre> <p>in your <code>~/.Rprofile</code> file (you may use it in any of the other locations that R supports). As an alternative to editing this file manually, you can call <code>startup::install()</code>; the line will then be appended if missing, and if the file does not exist, it will be created. Don&rsquo;t worry, your old file will be backed up with a timestamp.</p> <p>The startup package is extremely lightweight and has no external dependencies - it depends only on the &lsquo;base&rsquo; R package. It can be installed from CRAN using <code>install.packages(&quot;startup&quot;)</code>.
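</p> <p>For example, a minimal <code>.Rprofile.d/repos.R</code> file (a hypothetical example; the file name and mirror are my choices, not prescribed by the package) could be:</p> <pre><code class="language-r">## .Rprofile.d/repos.R: set a default CRAN mirror
## (the mirror URL here is just an example)
options(repos = c(CRAN = &quot;https://cloud.r-project.org&quot;))
</code></pre> <p>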
<em>Note, startup 0.4.0 was released on CRAN on 2016-12-22 - until macOS and Windows binaries are available you can install it via <code>install.packages(&quot;startup&quot;, type = &quot;source&quot;)</code>.</em></p> <p>For more information on what&rsquo;s possible to do with the startup package, see the <a href="https://cran.r-project.org/web/packages/startup/README.html">README</a> file of the package.</p> <h2 id="links">Links</h2> <ul> <li>startup package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=startup">https://cran.r-project.org/package=startup</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/startup">https://github.com/HenrikBengtsson/startup</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>High-Performance Compute in R Using Futures</title>
<link>https://www.jottr.org/2016/10/22/future-hpc/</link>
<pubDate>Sat, 22 Oct 2016 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2016/10/22/future-hpc/</guid>
<description> <p>A new version of the <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> package has been released and is available on CRAN. With a single change of settings, it allows you to switch from running an analysis sequentially on a local machine to running it in parallel on a compute cluster.</p> <p><img src="https://www.jottr.org/post/future_mainframe_red.jpg" alt="A room with a classical mainframe computer and work desks" /> <em>Our different futures can easily be resolved on high-performance compute clusters.</em></p> <h2 id="requirements">Requirements</h2> <p>The future.BatchJobs package implements the Future API, as defined by the <a href="https://cran.r-project.org/package=future">future</a> package, on top of the API provided by the <a href="https://cran.r-project.org/package=BatchJobs">BatchJobs</a> package. These packages and their dependencies install out-of-the-box on all operating systems.</p> <p>Installing the package is all that is needed in order to give it a test ride. If you have access to a compute cluster that uses one of the common job schedulers, such as <a href="https://en.wikipedia.org/wiki/TORQUE">TORQUE (PBS)</a>, <a href="https://en.wikipedia.org/wiki/Slurm_Workload_Manager">Slurm</a>, <a href="https://en.wikipedia.org/wiki/Oracle_Grid_Engine">Sun/Oracle Grid Engine (SGE)</a>, <a href="https://en.wikipedia.org/wiki/Platform_LSF">Load Sharing Facility (LSF)</a> or <a href="https://en.wikipedia.org/wiki/OpenLava">OpenLava</a>, then you&rsquo;re ready to take it for a serious ride. If your cluster uses another type of scheduler, it is possible to configure it to work there as well.
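</p> <p>For instance, on a TORQUE / PBS cluster, BatchJobs is typically configured via a <code>.BatchJobs.R</code> file in your working or home directory - a minimal sketch (the template file name here is just an example; adapt it to your cluster):</p> <pre><code class="language-r">## .BatchJobs.R: tell BatchJobs to submit jobs via TORQUE / PBS
## (&quot;~/torque.tmpl&quot; is a placeholder for your own job-script template)
cluster.functions &lt;- BatchJobs::makeClusterFunctionsTorque(&quot;~/torque.tmpl&quot;)
</code></pre> <p>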
If you don&rsquo;t have access to a compute cluster right now, you can still try future.BatchJobs by simply using <code>plan(batchjobs_local)</code> in the below example - all futures (&ldquo;jobs&rdquo;) will then be processed sequentially on your local machine (*).</p> <p><small> (*) For those of you who are already familiar with the <a href="https://cran.r-project.org/package=future">future</a> package - yes, if you&rsquo;re only going to run locally, then you can equally well use <code>plan(sequential)</code> or <code>plan(multisession)</code>, but for the sake of demonstrating future.BatchJobs per se, I suggest using <code>plan(batchjobs_local)</code> because it will use the BatchJobs machinery underneath. </small></p> <h2 id="example-extracting-text-and-generating-images-from-pdfs">Example: Extracting text and generating images from PDFs</h2> <p>Imagine we have a large set of PDF documents from which we would like to extract the text and also generate PNG images for each of the pages. Below, I will show how this can be easily done in R thanks to the <a href="https://cran.r-project.org/package=pdftools">pdftools</a> package written by <a href="https://github.com/jeroenooms">Jeroen Ooms</a>. I will also show how we can speed up the processing by using futures that are resolved in parallel, either on the local machine or, as shown here, distributed on a compute cluster.</p> <pre><code class="language-r">library(&quot;pdftools&quot;)
library(&quot;future.BatchJobs&quot;)
library(&quot;listenv&quot;)

## Process all PDFs on local TORQUE cluster
plan(batchjobs_torque)

## PDF documents to process
pdfs &lt;- dir(path = rev(.libPaths())[1], recursive = TRUE,
            pattern = &quot;[.]pdf$&quot;, full.names = TRUE)
pdfs &lt;- pdfs[basename(dirname(pdfs)) == &quot;doc&quot;]
print(pdfs)

## For each PDF ...
docs &lt;- listenv()
for (ii in seq_along(pdfs)) {
  pdf &lt;- pdfs[ii]
  message(sprintf(&quot;%d. Processing %s&quot;, ii, pdf))
  name &lt;- tools::file_path_sans_ext(basename(pdf))
  docs[[name]] %&lt;-% {
    path &lt;- file.path(&quot;output&quot;, name)
    dir.create(path, recursive = TRUE, showWarnings = FALSE)

    ## (a) Extract the text and write to file
    content &lt;- pdf_text(pdf)
    txt &lt;- file.path(path, sprintf(&quot;%s.txt&quot;, name))
    cat(content, file = txt)

    ## (b) Create a PNG file per page
    pngs &lt;- listenv()
    for (jj in seq_along(content)) {
      pngs[[jj]] %&lt;-% {
        img &lt;- pdf_render_page(pdf, page = jj)
        png &lt;- file.path(path, sprintf(&quot;%s_p%03d.png&quot;, name, jj))
        png::writePNG(img, png)
        png
      }
    }

    list(pdf = pdf, txt = txt, pngs = unlist(pngs))
  }
}

## Resolve everything if not already done
docs &lt;- as.list(docs)
str(docs)
</code></pre> <p>As true for all code using the Future API, as a user you always have full control over how futures should be resolved. For instance, you can choose to run the above on your local machine, still via the BatchJobs framework, by using <code>plan(batchjobs_local)</code>. You could even skip the future.BatchJobs package and use what is available in the future package alone, e.g. <code>library(&quot;future&quot;)</code> and <code>plan(multisession)</code>.</p> <p>As emphasized in, for instance, the <a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a> blog post and in the vignettes of the <a href="https://cran.r-project.org/package=future">future</a> package, there is no need to manually identify and export variables and functions that need to be available to the external R processes resolving the futures. Such global variables are automatically identified by the future package and exported when necessary.</p> <h2 id="futures-may-be-nested">Futures may be nested</h2> <p>Note how we used nested futures in the above example, where we create one future per PDF and, for each PDF, we in turn create one future per PNG.
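</p> <p>To make the two levels explicit, one backend can be specified per level of futures - a sketch of the setup used here, with the (default) inner level spelled out:</p> <pre><code class="language-r">## One backend per level of nested futures:
## outer futures (per PDF) become cluster jobs,
## inner futures (per PNG) run sequentially within each job
plan(list(batchjobs_torque, sequential))
</code></pre> <p>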
The design of the Future API is such that the user should have full control over how each level of futures is resolved. In other words, it is the user and not the developer who should decide what is specified in <code>plan()</code>.</p> <p>If nothing is specified, sequential processing is always used for resolving futures. In the above example, we specified <code>plan(batchjobs_torque)</code>, which means that the outer loop of futures is processed as individual jobs on the cluster. Each of these futures will be resolved in a separate R process. Next, since we didn&rsquo;t specify how the inner loop of futures should be processed, these will be resolved sequentially as part of these individual R processes.</p> <p>However, we could also choose to have the futures in the inner loop be resolved as individual jobs on the scheduler, which can be done as:</p> <pre><code class="language-r">plan(list(batchjobs_torque, batchjobs_torque))
</code></pre> <p>This would cause each PDF to be submitted as an individual job, which, when launched on a compute node by the scheduler, will start by extracting the plain text of the document and writing it to file. When this is done, the job continues by generating a PNG image file for each page, which is done via individual jobs on the scheduler.</p> <p>Exactly what strategies to use for resolving the different levels of futures depends on how long they take to process. If the amount of processing needed for a future is really long, then it makes sense to submit it to the scheduler, whereas if it is really quick, it probably makes more sense to process it on the current machine, either using parallel futures or no futures at all. For instance, in our example, we could also have chosen to generate the PNGs in parallel on the same compute node that extracted the text.
Such a configuration could look like:</p> <pre><code class="language-r">plan(list(
  tweak(batchjobs_torque, resources = &quot;nodes=1:ppn=12&quot;),
  multisession
))
</code></pre> <p>This setup tells the scheduler that each job should be allocated 12 cores, which the individual R processes then may use in parallel. The future package and the <code>multisession</code> configuration will automatically detect how many cores the job was allocated by the scheduler.</p> <p>There are numerous other ways to control how and where futures are resolved. See the vignettes of the <a href="https://cran.r-project.org/package=future">future</a> and the <a href="https://cran.r-project.org/package=future.BatchJobs">future.BatchJobs</a> packages for more details. Also, if you read the above and thought that this may result in an explosion of futures created recursively that will bring down your computer or your cluster, don&rsquo;t worry. It&rsquo;s built into the core of the future package to prevent this from happening.</p> <h2 id="what-s-next">What&rsquo;s next?</h2> <p>The future.BatchJobs package simply implements the Future API (as defined by the future package) on top of the API provided by the awesome BatchJobs package. The creators of that package are working on the next generation of their tool - the <a href="https://github.com/mllg/batchtools">batchtools</a> package.
I&rsquo;ve already started on the corresponding future.batchtools package so that you and your users can switch over to using <code>plan(batchtools_torque)</code> - it&rsquo;ll be as simple as that.</p> <p>Happy futuring!</p> <p>UPDATE 2022-12-11: Update examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> <li><a href="https://www.jottr.org/2016/10/remote-processing-using-futures.html">Remote Processing Using Futures</a>, 2016-10-11</li> </ul> <p>Keywords: R, future, future.BatchJobs, BatchJobs, package, CRAN, asynchronous, parallel processing, distributed processing, high-performance compute, HPC, compute cluster, TORQUE, PBS, Slurm, SGE, LSF, OpenLava</p> </description>
</item>
<item>
<title>Remote Processing Using Futures</title>
<link>https://www.jottr.org/2016/10/11/future-remotes/</link>
<pubDate>Tue, 11 Oct 2016 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2016/10/11/future-remotes/</guid>
<description> <p>A new version of the <a href="https://cran.r-project.org/package=future">future</a> package has been released and is available on CRAN. With futures, it is easy to <em>write R code once</em>, which later <em>the user can choose</em> to parallelize using whatever resources s/he has available, e.g. a local machine, a set of local notebooks, a set of remote machines, or a high-end compute cluster.</p> <p><img src="https://www.jottr.org/post/early_days_video_call.jpg" alt="Postcard from 1900 showing how people in the year 2000 will communicate using audio and projected video" /> <em>The future provides comfortable and friendly long-distance interactions.</em></p> <p>The new version, future 1.1.1, provides:</p> <ul> <li><p><strong>Much easier usage of remote computers / clusters</strong></p> <ul> <li>If you can SSH to the machine, then you can also use it to resolve R expressions remotely.</li> <li>Firewall configuration and port forwarding are no longer needed.</li> </ul></li> <li><p><strong>Improved identification of global variables</strong></p> <ul> <li>Corner cases where the package previously failed to identify and export global variables are now also handled. For instance, variable <code>x</code> is now properly identified as a global variable in expressions such as <code>x$a &lt;- 3</code> and <code>x[1, 2, 4] &lt;- 3</code> as well as in formulas such as <code>y ~ x | z</code>.</li> <li>Global variables are by default identified automatically, but can now also be specified manually, either by their names (as a character vector) or by their names and values (as a named list). <br /></li> </ul></li> </ul> <p>For full details on updates, please see the <a href="https://cran.r-project.org/package=future">NEWS</a> file. 
The future package installs out-of-the-box on all operating systems.</p> <h2 id="example-remote-graphics-rendered-locally">Example: Remote graphics rendered locally</h2> <p>To illustrate how simple and powerful remote futures can be, I will show how to (i) set up locally stored data, (ii) generate <a href="https://cran.r-project.org/package=plotly">plotly</a>-enhanced <a href="https://cran.r-project.org/package=ggplot2">ggplot2</a> graphics based on these data using a remote machine, and then (iii) render these plotly graphics in the local web browser for interactive exploration of data.</p> <p>Before starting, all we need to do is to verify that we have SSH access to the remote machine, let&rsquo;s call it <code>remote.server.org</code>, and that it has R installed:</p> <pre><code class="language-sh">{local}: ssh remote.server.org
{remote}: Rscript --version
R scripting front-end version 3.3.1 (2016-06-21)
{remote}: exit
{local}: exit
</code></pre> <p>Note, it is highly recommended to use <a href="https://en.wikipedia.org/wiki/Secure_Shell#Key_management">SSH-key pair authentication</a> so that login credentials do not have to be entered manually.</p> <p>After having made sure that the above works, we are ready for our remote future demo.
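</p> <p>As a minimal sanity check (assuming the SSH setup just described, with <code>remote.server.org</code> as a placeholder for your own machine), we can evaluate a trivial expression remotely before moving on:</p> <pre><code class="language-r">library(&quot;future&quot;)

## &quot;remote.server.org&quot; is a placeholder - substitute your own host
plan(remote, workers = &quot;remote.server.org&quot;)

## Evaluated on the remote machine; the value is sent back here
hostname %&lt;-% { Sys.info()[[&quot;nodename&quot;]] }
hostname
</code></pre> <p>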
The following code is based on an online <a href="https://plot.ly/ggplot2/">plotly example</a> with only a few minor modifications:</p> <pre><code class="language-r">library(&quot;plotly&quot;)
library(&quot;future&quot;)

## %&lt;-% assignments will be resolved remotely
plan(remote, workers = &quot;remote.server.org&quot;)

## Set up data (locally)
set.seed(100)
d &lt;- diamonds[sample(nrow(diamonds), 1000), ]

## Generate ggplot2 graphics and plotly-fy (remotely)
gg %&lt;-% {
  p &lt;- ggplot(data = d, aes(x = carat, y = price)) +
    geom_point(aes(text = paste(&quot;Clarity:&quot;, clarity)), size = 4) +
    geom_smooth(aes(colour = cut, fill = cut)) +
    facet_wrap(~ cut)
  ggplotly(p)
}

## Display graphics in browser (locally)
gg
</code></pre> <p>The above renders the plotly-compiled ggplot2 graphics in our local browser. See below screenshot for an example.</p> <p>This might sound like magic, but all that is going on behind the scenes is a carefully engineered utilization of the <a href="https://cran.r-project.org/package=globals">globals</a> and the parallel packages, which is then encapsulated in the unified API provided by the future package. First, a future assignment (<code>%&lt;-%</code>) is used for <code>gg</code>, instead of a regular assignment (<code>&lt;-</code>). That tells R to use a future to evaluate the expression on the right-hand side (everything within <code>{ ... }</code>). Second, since we specified that we want to use the remote machine <code>remote.server.org</code> to resolve our futures, that is where the future expression is evaluated. Third, necessary data is automatically communicated between our local and remote machines. That is, any global variables (<code>d</code>) and functions are automatically identified and exported to the remote machine, and required packages (<code>ggplot2</code> and <code>plotly</code>) are loaded remotely.
When resolved, the value of the expression is automatically transferred back to our local machine and is available as the value of future variable <code>gg</code>, which was formally set up as a promise.</p> <p><img src="https://www.jottr.org/post/future_1.1.1-example_plotly.png" alt="Screenshot of a plotly-rendered panel of ggplot2 graphs" /> <em>An example of remote futures: This ggplot2 + plotly figure was generated on a remote machine and then rendered in the local web browser, where it can be interacted with dynamically.</em></p> <p><em>What&rsquo;s next?</em> Over the summer, I have received tremendous feedback from several people, such as (in no particular order) <a href="https://github.com/krlmlr">Kirill Müller</a>, <a href="https://github.com/gdevailly">Guillaume Devailly</a>, <a href="https://github.com/clarkfitzg">Clark Fitzgerald</a>, <a href="https://github.com/michaelsbradleyjr">Michael Bradley</a>, <a href="https://github.com/thomasp85">Thomas Lin Pedersen</a>, <a href="https://github.com/alexvorobiev">Alex Vorobiev</a>, <a href="https://github.com/hrbrmstr">Bob Rudis</a>, <a href="https://github.com/RebelionTheGrey">RebelionTheGrey</a>, <a href="https://github.com/wrathematics">Drew Schmidt</a> and <a href="https://github.com/gaborcsardi">Gábor Csárdi</a> (sorry if I missed anyone, please let me know). This feedback contributed to some of the new features found in future 1.1.1. However, there&rsquo;re many great <a href="https://github.com/HenrikBengtsson/future/issues">suggestions and wishes</a> that didn&rsquo;t make it in for this release - I hope to be able to work on those next.
Thank you all.</p> <p>Happy futuring!</p> <h2 id="links">Links</h2> <ul> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> <h2 id="see-also">See also</h2> <ul> <li><a href="https://www.jottr.org/2016/07/a-future-for-r-slides-from-user-2016.html">A Future for R: Slides from useR 2016</a>, 2016-07-02</li> </ul> </description>
</item>
<item>
<title>A Future for R: Slides from useR 2016</title>
<link>https://www.jottr.org/2016/07/02/future-user2016-slides/</link>
<pubDate>Sat, 02 Jul 2016 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2016/07/02/future-user2016-slides/</guid>
<description> <p>Unless you count DSC 2003 in Vienna, last week&rsquo;s <a href="http://user2016.org/">useR</a> conference at Stanford was my very first time at useR. It was a great event: it was awesome to meet our lovely and vibrant R community in real life, which we otherwise only get to know from online interactions, and of course it was very nice to meet old friends and make new ones.</p> <p><img src="https://www.jottr.org/post/hover_craft_car_photo_picture.jpg" alt="Classical illustration of a hovercar taking off from a yard with a house, flying above the trees" /> <em>The future is promising.</em></p> <p>At the end of the second day, I presented <em>A Future for R</em> (18-min talk; slides below) on how you can use the <a href="https://cran.r-project.org/package=future">future</a> package for asynchronous (parallel and distributed) processing using a single unified API, regardless of what backend you have available, e.g. multicore, multisession, ad hoc cluster, and job schedulers. I ended with a teaser on how futures can be used for much more than speeding up your code, e.g. 
generating graphics remotely and displaying them locally.</p> <p>Here&rsquo;s an example using two futures that process data in parallel (<code>slow_sum()</code> is a slow version of <code>sum()</code>):</p> <pre><code class="language-r">&gt; library(&quot;future&quot;) &gt; plan(multisession) ## Parallel processing &gt; a %&lt;-% slow_sum(1:50) ## These two assignments are &gt; b %&lt;-% slow_sum(51:100) ## non-blocking and in parallel &gt; y &lt;- a + b ## Waits for a and b to be resolved &gt; y [1] 5050 </code></pre> <p>Below are different formats of my talk (18 slides + 9 appendix slides), given on 2016-06-28:</p> <ul> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016.html">HTML</a> (incremental slides; requires online access)</li> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016,flat.html">HTML</a> (non-incremental slides; requires online access)</li> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016.pdf">PDF</a> (incremental slides)</li> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016,flat.pdf">PDF</a> (non-incremental slides)</li> <li><a href="http://www.aroma-project.org/share/presentations/BengtssonH_20160628-useR2016/BengtssonH_20160628-A_Future_for_R,useR2016,pure.md">Markdown</a> (screen reader friendly)</li> <li><a href="https://www.youtube.com/watch?v=K8KYi9AFRlk">YouTube</a> (video recording)</li> </ul> <p>May the future be with you!</p> <p>UPDATE 2022-12-11: Updated examples that used the deprecated <code>multiprocess</code> future backend alias to use the <code>multisession</code> backend.</p> <h2 id="links">Links</h2> <ul> <li>useR 2016: <ul> <li>Conference site: <a href="https://user2016.r-project.org/">https://user2016.r-project.org/</a></li> <li>Talk abstract: <a 
href="https://user2016.sched.org/event/7BZK/a-future-for-r">https://user2016.sched.org/event/7BZK/a-future-for-r</a></li> </ul></li> <li>future package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future">https://cran.r-project.org/package=future</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future">https://github.com/HenrikBengtsson/future</a></li> </ul></li> <li>future.BatchJobs package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=future.BatchJobs">https://cran.r-project.org/package=future.BatchJobs</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/future.BatchJobs">https://github.com/HenrikBengtsson/future.BatchJobs</a></li> </ul></li> <li>doFuture package: <ul> <li>CRAN page: <a href="https://cran.r-project.org/package=doFuture">https://cran.r-project.org/package=doFuture</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/doFuture">https://github.com/HenrikBengtsson/doFuture</a></li> </ul></li> </ul> </description>
</item>
<item>
<title>matrixStats: Optimized Subsetted Matrix Calculations</title>
<link>https://www.jottr.org/2015/12/16/matrixstats-subsetting/</link>
<pubDate>Wed, 16 Dec 2015 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2015/12/16/matrixstats-subsetting/</guid>
<description> <p>The <a href="http://cran.r-project.org/package=matrixStats">matrixStats</a> package provides highly optimized functions for computing <a href="https://cran.r-project.org/web/packages/matrixStats/vignettes/matrixStats-methods.html">common summaries</a> over rows and columns of matrices. In a <a href="https://www.jottr.org/2015/01/matrixStats-0.13.1.html">previous blog post</a>, I showed that, instead of using <code>apply(X, MARGIN = 2, FUN = median)</code>, we can speed up calculations dramatically by using <code>colMedians(X)</code>. In the most recent release (version 0.50.0), matrixStats has been extended to perform <strong>optimized calculations also on a subset of rows and/or columns</strong> specified via the new arguments <code>rows</code> and <code>cols</code>, e.g. <code>colMedians(X, cols = 1:50)</code>.</p> <p><img src="https://www.jottr.org/post/DragsterLeavingTeamBehind.gif" alt="Dragster leaving team behind" /></p> <p>For instance, assume we wish to find the median of each of the first 50 columns of a matrix <code>X</code> with 1,000,000 rows and 100 columns. For simplicity, assume</p> <pre><code class="language-r">&gt; X &lt;- matrix(rnorm(1e6 * 100), nrow = 1e6, ncol = 100) </code></pre> <p>To get the median values without matrixStats, we would do</p> <pre><code class="language-r">&gt; y &lt;- apply(X[, 1:50], MARGIN = 2, FUN = median) &gt; str(y) num [1:50] -0.001059 0.00059 0.001316 0.00103 0.000814 ... </code></pre> <p>As in the past, we could use matrixStats to do</p> <pre><code class="language-r">&gt; y &lt;- colMedians(X[, 1:50]) </code></pre> <p>which is <a href="https://www.jottr.org/2015/01/matrixStats-0.13.1.html">much faster</a> than <code>apply()</code> with <code>median()</code>.</p> <p>However, both approaches require that <code>X</code> is subsetted before the actual calculations can be performed, i.e. the temporary object <code>X[, 1:50]</code> is created. 
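</p> <p>Before turning to the overhead, we can convince ourselves that the two approaches give identical results. A quick sanity check on a small toy matrix (sizes chosen arbitrarily):</p> <pre><code class="language-r">library(&quot;matrixStats&quot;)
X &lt;- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
y1 &lt;- apply(X[, 1:5], MARGIN = 2, FUN = median)
y2 &lt;- colMedians(X, cols = 1:5)
stopifnot(all.equal(y1, y2))  ## same values, but without an explicit subset
</code></pre> <p>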
For this 1,000,000-by-100 matrix, the size of the original object is ~760 MiB and the subsetted one is ~380 MiB:</p> <pre><code class="language-r">&gt; object.size(X) 800000200 bytes &gt; object.size(X[, 1:50]) 400000100 bytes </code></pre> <p>This temporary object is created by (i) R first allocating the size for it and then (ii) copying all its values over from <code>X</code>. After the medians have been calculated, this temporary object is automatically discarded and eventually (iii) R&rsquo;s garbage collector will deallocate its memory. This introduces overhead in the form of extra memory usage as well as processing time.</p> <p>Starting with matrixStats 0.50.0, we can avoid this overhead by instead using</p> <pre><code class="language-r">&gt; y &lt;- colMedians(X, cols = 1:50) </code></pre> <p><strong>This uses less memory</strong>, because no internal copy of <code>X[, 1:50]</code> has to be created. Instead, all calculations are performed directly on the source object <code>X</code>. Because of this, the latter approach is <strong>also faster</strong>.</p> <h2 id="bootstrapping-example">Bootstrapping example</h2> <p>Subsetted calculations occur naturally in bootstrap analysis. Assume we want to calculate the median for each column of a 100-by-10,000 matrix <code>X</code> where <strong>the rows are resampled with replacement</strong> 1,000 times. 
Without matrixStats, this can be done as</p> <pre><code class="language-r">B &lt;- 1000 Y &lt;- matrix(NA_real_, nrow = B, ncol = ncol(X)) for (b in seq_len(B)) { rows &lt;- sample(seq_len(nrow(X)), replace = TRUE) Y[b, ] &lt;- apply(X[rows, ], MARGIN = 2, FUN = median) } </code></pre> <p>However, with the new matrixStats we can instead do</p> <pre><code class="language-r">B &lt;- 1000 Y &lt;- matrix(NA_real_, nrow = B, ncol = ncol(X)) for (b in seq_len(B)) { rows &lt;- sample(seq_len(nrow(X)), replace = TRUE) Y[b, ] &lt;- colMedians(X, rows = rows) } </code></pre> <p>In the first approach, with explicit subsetting (<code>X[rows, ]</code>), we are creating a large number of temporary objects - each of size <code>object.size(X[rows, ]) == object.size(X)</code> - that all need to be allocated, copied and deallocated. Thus, if <code>X</code> is a 100-by-10,000 double matrix of size 8,000,200 bytes = 7.6 MiB, we are allocating and deallocating a total of 7.5 GiB worth of RAM when using 1,000 bootstrap samples. With a million bootstrap samples, we&rsquo;re consuming a total of 7.3 TiB of RAM. In other words, we are wasting lots of compute resources on memory allocation, copying, deallocation and garbage collection. 
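</p> <p>The back-of-the-envelope numbers above follow directly from the size of each temporary copy:</p> <pre><code class="language-r">sz &lt;- 8000200    ## bytes per temporary copy X[rows, ]
B &lt;- 1000        ## number of bootstrap samples
sz * B / 2^30    ## ~7.5 GiB allocated in total
sz * 1e6 / 2^40  ## ~7.3 TiB with one million bootstrap samples
</code></pre> <p>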
Instead, by using the optimized subsetted calculations available in matrixStats (&gt;= 0.50.0), as in the second approach, we spare the computer all that overhead.</p> <p>Not only does the peak memory requirement go down by roughly half, but <strong>the overall speedup is also substantial</strong>; on a regular notebook, the above 1,000 bootstrap samples took 660 seconds (= 11 minutes) to complete using <code>apply(X[rows, ])</code>, 85 seconds (8x speedup) using <code>colMedians(X[rows, ])</code> and 45 seconds (<strong>15x speedup</strong>) using <code>colMedians(X, rows = rows)</code>.</p> <h2 id="availability">Availability</h2> <p>The matrixStats package can be installed on all common operating systems as</p> <pre><code class="language-r">&gt; install.packages(&quot;matrixStats&quot;) </code></pre> <p>The source code is available on <a href="https://github.com/HenrikBengtsson/matrixStats/">GitHub</a>.</p> <h2 id="credits">Credits</h2> <p>Support for optimized calculations on subsets was implemented by <a href="https://www.linkedin.com/in/dongcanjiang">Dongcan Jiang</a>. Dongcan is a Master&rsquo;s student in Computer Science at Peking University and worked on <a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">this project</a> from April to August 2015 with support from the <a href="https://developers.google.com/open-source/gsoc/">Google Summer of Code</a> 2015 program. This GSoC project was mentored jointly by me and Héctor Corrada Bravo at the University of Maryland. We would like to thank Dongcan again for this valuable addition to the package and the community. 
We would also like to thank Google and the <a href="https://github.com/rstats-gsoc/">R Project in GSoC</a> for making this possible.</p> <p>Any type of feedback, including <a href="https://github.com/HenrikBengtsson/matrixStats/issues/">bug reports</a>, is always appreciated!</p> <h2 id="links">Links</h2> <ul> <li>CRAN package: <a href="http://cran.r-project.org/package=matrixStats">http://cran.r-project.org/package=matrixStats</a></li> <li>Source code and bug reports: <a href="https://github.com/HenrikBengtsson/matrixStats">https://github.com/HenrikBengtsson/matrixStats</a></li> <li>Google Summer of Code (GSoC): <a href="https://developers.google.com/open-source/gsoc/">https://developers.google.com/open-source/gsoc/</a></li> <li>R Project in GSoC (R-GSoC): <a href="https://github.com/rstats-gsoc">https://github.com/rstats-gsoc</a></li> <li>matrixStats in R-GSoC 2015: <a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats</a></li> </ul> </description>
</item>
<item>
<title>Milestone: 7000 Packages on CRAN</title>
<link>https://www.jottr.org/2015/08/12/milestone-cran-7000/</link>
<pubDate>Wed, 12 Aug 2015 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2015/08/12/milestone-cran-7000/</guid>
<description><p>Another 1,000 packages were added to CRAN, which took just over nine months. Today (August 12, 2015), the Comprehensive R Archive Network (CRAN) package page reports:</p> <blockquote> <p>&ldquo;Currently, the CRAN package repository features 7002 available packages.&rdquo;</p> </blockquote> <p>While the previous 1,000 packages took 355 days, going from 6,000 to 7,000 packages took 286 days - which means that a new CRAN package is now born on average every 6.9 hours (or 3.5 packages per day). Since the start of CRAN 18.3 years ago on April 23, 1997, there has been on average one new package appearing on CRAN every 22.9 hours. It is actually more frequent than that, because dropped/archived packages are not accounted for. The 7,000 packages on CRAN are maintained by ~4,130 people.</p> <p>Thanks to the CRAN team and to all package developers. You can give back by carefully reporting bugs to the maintainers and by properly citing any packages you use in your publications (see <code>citation(&quot;pkg name&quot;)</code>).</p> <p>Milestones:</p> <ul> <li>2015-08-12: <a href="https://stat.ethz.ch/pipermail/r-package-devel/2015q3/000393.html">7000 packages</a></li> <li>2014-10-29: <a href="https://mailman.stat.ethz.ch/pipermail/r-devel/2014-October/069997.html">6000 packages</a></li> <li>2013-11-08: <a href="https://stat.ethz.ch/pipermail/r-devel/2013-November/067935.html">5000 packages</a></li> <li>2012-08-23: <a href="https://stat.ethz.ch/pipermail/r-devel/2012-August/064675.html">4000 packages</a></li> <li>2011-05-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2011-May/061002.html">3000 packages</a></li> <li>2009-10-04: <a href="https://stat.ethz.ch/pipermail/r-devel/2009-October/055049.html">2000 packages</a></li> <li>2007-04-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2007-April/045359.html">1000 packages</a></li> <li>2004-10-01: 500 packages</li> <li>2003-04-01: 250 packages</li> </ul> <p>These data are for CRAN only. 
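</p> <p>As a side note, the per-package averages quoted above follow directly from the milestone dates; for example, in R:</p> <pre><code class="language-r">d6 &lt;- as.Date(&quot;2014-10-29&quot;)  ## 6,000 packages
d7 &lt;- as.Date(&quot;2015-08-12&quot;)  ## 7,000 packages
days &lt;- as.numeric(d7 - d6)
c(hours_per_pkg = 24 * days / 1000, pkgs_per_day = 1000 / days)
</code></pre> <p>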
There are many more packages elsewhere, e.g. <a href="http://bioconductor.org/">Bioconductor</a>, <a href="http://r-forge.r-project.org/">R-Forge</a> (sic!), <a href="http://rforge.net/">RForge</a> (sic!), <a href="http://github.com/">GitHub</a> etc.</p> </description>
</item>
<item>
<title>Performance: Calling R_CheckUserInterrupt() Every 256 Iteration is Actually Faster than Every 1,000,000 Iteration</title>
<link>https://www.jottr.org/2015/06/05/checkuserinterrupt/</link>
<pubDate>Fri, 05 Jun 2015 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2015/06/05/checkuserinterrupt/</guid>
<description> <p>If your native code takes more than a few seconds to finish, it is a nice courtesy to the user to check for user interrupts (Ctrl-C) once in a while, say, every 1,000 or 1,000,000 iterations. The C-level API of R provides <code>R_CheckUserInterrupt()</code> for this (see &lsquo;Writing R Extensions&rsquo; for more information on this function). Here&rsquo;s what the code would typically look like:</p> <pre><code class="language-c">for (int ii = 0; ii &lt; n; ii++) { /* Some computationally expensive code */ if (ii % 1000 == 0) R_CheckUserInterrupt(); } </code></pre> <p>This uses the modulo operator <code>%</code> and tests when it is zero, which happens every 1,000 iterations. When this occurs, it calls <code>R_CheckUserInterrupt()</code>, which will interrupt the processing and &ldquo;return to R&rdquo; whenever an interrupt is detected.</p> <p>Interestingly, it turns out that it is <em>significantly faster to do this check every $k=2^m$ iterations</em>, e.g. instead of doing it every 1,000 iterations, it is faster to do it every 1,024 iterations. Similarly, instead of, say, doing it every 1,000,000 iterations, do it every 1,048,576 - not one less (1,048,575) or one more (1,048,577). The difference is so large that it is even 2-3 times faster to call <code>R_CheckUserInterrupt()</code> every 256 iterations rather than, say, every 1,000,000 iterations, which at least to me was a bit counterintuitive the first time I observed it.</p> <p>Below are some benchmark statistics supporting the claim that testing / calculating <code>ii % k == 0</code> is faster for $k=2^m$ (blue) than for other choices of $k$ (red).</p> <p><img src="https://www.jottr.org/post/boxplot.png" alt="Boxplot showing that testing every 2^m-th iteration is faster" /></p> <p>Note that the times are on the log scale (the results are also tabulated at the end of this post). 
Now, will it make a big difference to the overall performance of your code if you choose, say, 1,048,576 instead of 1,000,000? Probably not, but on the other hand, it does not hurt to pick an interval that is a $2^m$ integer. This observation may also be useful in algorithms that make lots of use of the modulo operator.</p> <p>So why is <code>ii % k == 0</code> a faster test when $k=2^m$? <del>I can only speculate. For instance, the integer $2^m$ is a binary number with all bits but one set to zero. It might be that this is faster to test for than other bit patterns, but I don&rsquo;t know if this is because of how the native code is optimized by the compiler and/or if it goes down to the hardware/CPU level. I&rsquo;d be interested in feedback and hear your thoughts on this.</del></p> <p><strong>UPDATE 2015-06-15</strong>: Thomas Lumley kindly <a href="https://twitter.com/tslumley/status/610627555545083904">replied</a> and pointed me to the fact that <a href="https://en.wikipedia.org/wiki/Modulo_operation#Performance_issues">&ldquo;the modulo of powers of 2 can alternatively be expressed as a bitwise AND operation&rdquo;</a>, which in C terms means that <code>ii % 2^m</code> is identical to <code>ii &amp; (2^m - 1)</code> (at least for positive integers), and this is <a href="http://stackoverflow.com/questions/22446425/do-c-c-compilers-such-as-gcc-generally-optimize-modulo-by-a-constant-power-of">an optimization that the GCC compiler does by default</a>. The bitwise AND operator is extremely fast, because the CPU can take the AND of all bits at the same time (think 64 electronic AND gates for a 64-bit integer). After this, comparing to zero is also very fast. The optimization cannot be done for integers that are not powers of two. 
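</p> <p>This identity is easy to verify from within R, which exposes the same bitwise operation via <code>bitwAnd()</code>:</p> <pre><code class="language-r">ii &lt;- 0:100000
stopifnot(identical(ii %% 256 == 0, bitwAnd(ii, 255L) == 0L))
stopifnot(identical(ii %% 1024 == 0, bitwAnd(ii, 1023L) == 0L))
</code></pre> <p>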
So, in our case, when the compiler sees <code>ii % 256 == 0</code> it optimizes it to become <code>ii &amp; 255 == 0</code>, which is much faster to calculate than the non-optimized <code>ii % 256 == 0</code> (or <code>ii % 257 == 0</code>, or <code>ii % 1000000 == 0</code>, and so on).</p> <h2 id="details-on-how-the-benchmarking-was-done">Details on how the benchmarking was done</h2> <p>I used the <a href="http://cran.r-project.org/package=inline">inline</a> package to generate a set of C-level functions with varying interrupt intervals ($k$). I&rsquo;m not passing $k$ as a parameter to these functions. Instead, I use it as a constant value so that the compiler can optimize as far as possible, but also in order to imitate how most code is written. This is why I generate multiple C functions. I benchmarked across a wide range of interval choices using the <a href="http://cran.r-project.org/package=microbenchmark">microbenchmark</a> package. The C functions (with corresponding R functions calling them) and the corresponding benchmark expressions to be called were generated as follows:</p> <pre><code class="language-r">## The interrupt intervals to benchmark ## (a) Classical values ks &lt;- c(1, 10, 100, 1000, 10e3, 100e3, 1e6) ## (b) 2^k values and the ones before and after ms &lt;- c(2, 5, 8, 10, 16, 20) as &lt;- c(-1, 0, +1) + rep(2^ms, each = 3) ## List of unevaluated expressions to benchmark mbexpr &lt;- list() for (k in sort(c(ks, as))) { name &lt;- sprintf(&quot;every_%d&quot;, k) ## The C function assign(name, inline::cfunction(c(length = &quot;integer&quot;), body = sprintf(&quot; int i, n = asInteger(length); for (i=0; i &lt; n; i++) { if (i %% %d == 0) R_CheckUserInterrupt(); } return ScalarInteger(n); &quot;, k))) ## The corresponding expression to benchmark mbexpr &lt;- c(mbexpr, substitute(every(n), list(every = as.symbol(name)))) } </code></pre> <p>The actual benchmarking of the 25 cases was then done by calling:</p> <pre><code class="language-r">n 
&lt;- 10e6 ## Number of iterations stats &lt;- microbenchmark::microbenchmark(list = mbexpr) </code></pre> <table> <thead> <tr> <th align="left">expr</th> <th align="right">min</th> <th align="right">lq</th> <th align="right">mean</th> <th align="right">median</th> <th align="right">uq</th> <th align="right">max</th> </tr> </thead> <tbody> <tr> <td align="left">every_1(n)</td> <td align="right">479.19</td> <td align="right">485.08</td> <td align="right">511.45</td> <td align="right">492.91</td> <td align="right">521.50</td> <td align="right">839.50</td> </tr> <tr> <td align="left">every_3(n)</td> <td align="right">184.08</td> <td align="right">185.74</td> <td align="right">197.86</td> <td align="right">189.10</td> <td align="right">197.31</td> <td align="right">321.69</td> </tr> <tr> <td align="left">every_4(n)</td> <td align="right">148.99</td> <td align="right">150.80</td> <td align="right">160.92</td> <td align="right">152.73</td> <td align="right">158.55</td> <td align="right">245.72</td> </tr> <tr> <td align="left">every_5(n)</td> <td align="right">127.42</td> <td align="right">129.25</td> <td align="right">134.18</td> <td align="right">131.26</td> <td align="right">134.69</td> <td align="right">190.88</td> </tr> <tr> <td align="left">every_10(n)</td> <td align="right">91.96</td> <td align="right">93.12</td> <td align="right">99.75</td> <td align="right">94.48</td> <td align="right">98.10</td> <td align="right">194.98</td> </tr> <tr> <td align="left">every_31(n)</td> <td align="right">65.78</td> <td align="right">67.15</td> <td align="right">71.18</td> <td align="right">68.33</td> <td align="right">70.52</td> <td align="right">113.55</td> </tr> <tr> <td align="left">every_32(n)</td> <td align="right">49.12</td> <td align="right">49.49</td> <td align="right">51.72</td> <td align="right">50.24</td> <td align="right">51.38</td> <td align="right">91.28</td> </tr> <tr> <td align="left">every_33(n)</td> <td align="right">63.29</td> <td align="right">64.01</td> <td 
align="right">67.96</td> <td align="right">64.76</td> <td align="right">68.79</td> <td align="right">112.26</td> </tr> <tr> <td align="left">every_100(n)</td> <td align="right">50.85</td> <td align="right">51.46</td> <td align="right">54.81</td> <td align="right">52.37</td> <td align="right">55.01</td> <td align="right">89.83</td> </tr> <tr> <td align="left">every_255(n)</td> <td align="right">56.05</td> <td align="right">56.48</td> <td align="right">59.81</td> <td align="right">57.21</td> <td align="right">59.25</td> <td align="right">119.47</td> </tr> <tr> <td align="left">every_256(n)</td> <td align="right">19.46</td> <td align="right">19.62</td> <td align="right">21.03</td> <td align="right">19.88</td> <td align="right">20.71</td> <td align="right">41.98</td> </tr> <tr> <td align="left">every_257(n)</td> <td align="right">53.32</td> <td align="right">53.70</td> <td align="right">57.16</td> <td align="right">54.54</td> <td align="right">56.34</td> <td align="right">96.61</td> </tr> <tr> <td align="left">every_1000(n)</td> <td align="right">44.76</td> <td align="right">46.68</td> <td align="right">50.40</td> <td align="right">47.50</td> <td align="right">50.19</td> <td align="right">121.97</td> </tr> <tr> <td align="left">every_1023(n)</td> <td align="right">53.68</td> <td align="right">54.89</td> <td align="right">57.64</td> <td align="right">55.57</td> <td align="right">57.71</td> <td align="right">111.59</td> </tr> <tr> <td align="left">every_1024(n)</td> <td align="right">17.41</td> <td align="right">17.55</td> <td align="right">18.86</td> <td align="right">17.80</td> <td align="right">18.78</td> <td align="right">43.54</td> </tr> <tr> <td align="left">every_1025(n)</td> <td align="right">51.19</td> <td align="right">51.72</td> <td align="right">54.09</td> <td align="right">52.28</td> <td align="right">53.29</td> <td align="right">101.97</td> </tr> <tr> <td align="left">every_10000(n)</td> <td align="right">42.82</td> <td align="right">45.65</td> <td 
align="right">48.09</td> <td align="right">46.20</td> <td align="right">47.83</td> <td align="right">82.92</td> </tr> <tr> <td align="left">every_65535(n)</td> <td align="right">51.51</td> <td align="right">53.45</td> <td align="right">55.68</td> <td align="right">54.00</td> <td align="right">55.04</td> <td align="right">87.36</td> </tr> <tr> <td align="left">every_65536(n)</td> <td align="right">16.74</td> <td align="right">16.84</td> <td align="right">17.91</td> <td align="right">16.99</td> <td align="right">17.37</td> <td align="right">47.82</td> </tr> <tr> <td align="left">every_65537(n)</td> <td align="right">60.62</td> <td align="right">61.44</td> <td align="right">65.16</td> <td align="right">62.56</td> <td align="right">64.93</td> <td align="right">104.71</td> </tr> <tr> <td align="left">every_100000(n)</td> <td align="right">43.68</td> <td align="right">44.48</td> <td align="right">46.81</td> <td align="right">44.98</td> <td align="right">46.51</td> <td align="right">83.33</td> </tr> <tr> <td align="left">every_1000000(n)</td> <td align="right">41.61</td> <td align="right">44.21</td> <td align="right">46.99</td> <td align="right">44.86</td> <td align="right">47.11</td> <td align="right">87.90</td> </tr> <tr> <td align="left">every_1048575(n)</td> <td align="right">50.98</td> <td align="right">52.80</td> <td align="right">54.92</td> <td align="right">53.55</td> <td align="right">55.36</td> <td align="right">72.44</td> </tr> <tr> <td align="left">every_1048576(n)</td> <td align="right">16.73</td> <td align="right">16.83</td> <td align="right">17.92</td> <td align="right">17.05</td> <td align="right">17.89</td> <td align="right">35.52</td> </tr> <tr> <td align="left">every_1048577(n)</td> <td align="right">60.28</td> <td align="right">62.58</td> <td align="right">65.43</td> <td align="right">63.92</td> <td align="right">65.91</td> <td align="right">87.58</td> </tr> </tbody> </table> <p>I get similar results across various operating systems (Windows, OS X and 
Linux), all using the GNU Compiler Collection (GCC).</p> <p>Feedback and comments are appreciated!</p> <p>To reproduce these results, do:</p> <pre><code class="language-r">&gt; path &lt;- 'https://raw.githubusercontent.com/HenrikBengtsson/jottr.org/master/blog/20150604%2CR_CheckUserInterrupt' &gt; html &lt;- R.rsp::rfile('R_CheckUserInterrupt.md.rsp', path = path) &gt; !html ## Open in browser </code></pre> </description>
</item>
<item>
<title>To Students: matrixStats for Google Summer of Code</title>
<link>https://www.jottr.org/2015/03/12/matrixstats-gsoc/</link>
<pubDate>Thu, 12 Mar 2015 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2015/03/12/matrixstats-gsoc/</guid>
<description> <p>We are pleased to announce our proposal &lsquo;<strong><a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">Subsetted and parallel computations in matrixStats</a></strong>&rsquo; for Google Summer of Code. The project is aimed at a student with experience in R and C; it runs for three months, and the student gets paid 5,500 USD by Google. Students from (almost) all over the world can apply. The application deadline is <strong>March 27, 2015</strong>. I, Henrik Bengtsson, and Héctor Corrada Bravo will be joint mentors. Communication and mentoring will occur online. We&rsquo;re looking forward to your application.</p> <p><img src="https://www.jottr.org/post/banner-gsoc2015.png" alt="Google Summer of Code 2015 banner" /></p> <h2 id="links">Links</h2> <ul> <li>The matrixStats GSoC project: <a href="https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats">Subsetted and parallel computations in matrixStats</a></li> <li>CRAN page: <a href="http://cran.r-project.org/package=matrixStats">http://cran.r-project.org/package=matrixStats</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/matrixStats">https://github.com/HenrikBengtsson/matrixStats</a></li> <li>R Project GSoC wiki: <a href="https://github.com/rstats-gsoc/gsoc2015">https://github.com/rstats-gsoc/gsoc2015</a></li> <li>Google Summer of Code (GSoC) page: <a href="http://www.google-melange.com/gsoc/homepage/google/gsoc2015">http://www.google-melange.com/gsoc/homepage/google/gsoc2015</a></li> </ul> <h2 id="related-posts">Related posts</h2> <ul> <li><a href="https://www.jottr.org/2015/01/matrixStats-0.13.1.html">PACKAGE: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors)</a></li> <li><a href="http://www.r-bloggers.com/?s=Google+Summer+of+Code">R Blogger posts on GSoC</a></li> </ul> </description>
</item>
<item>
<title>How to: Package Vignettes in Plain LaTeX</title>
<link>https://www.jottr.org/2015/02/21/how-to-plain-latex-vignettes/</link>
<pubDate>Sat, 21 Feb 2015 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2015/02/21/how-to-plain-latex-vignettes/</guid>
<description> <p>Ever wanted to include a plain-LaTeX vignette in your package and have it compiled into a PDF? The <a href="http://cran.r-project.org/package=R.rsp">R.rsp</a> package provides a four-line solution for this.</p> <p><em>But, first, what&rsquo;s R.rsp?</em> R.rsp is an R package that implements a compiler for the RSP markup language. RSP can be used to embed dynamic R code in <em>any</em> text-based source document to be compiled into a final document, e.g. RSP-embedded LaTeX into PDF, RSP-embedded Markdown into HTML, RSP-embedded HTML into HTML and so on. The package provides a set of <em>vignette engines</em> that make it straightforward to use RSP in vignettes, and there are also other vignette engines, e.g. for including static PDF vignettes. Starting with R.rsp v0.20.0 (on CRAN), a vignette engine for including plain LaTeX-based vignettes is also available. The R.rsp package installs out-of-the-box on all common operating systems, including Linux, OS X and Windows. Its source code is available on <a href="https://github.com/HenrikBengtsson/R.rsp">GitHub</a>.</p> <p><img src="https://www.jottr.org/post/Writing_ball_keyboard_3.jpg" alt="A Hansen writing ball - a keyboard invented by Rasmus Malling-Hansen in 1865" /></p> <h2 id="steps-to-include-a-latex-vignettes-in-your-package">Steps to include a LaTeX vignette in your package</h2> <ol> <li><p>Place your LaTeX file in the <code>vignettes/</code> directory of your package. If it needs other files such as image files, place those under this directory too.</p></li> <li><p>Rename the file to have filename extension *.ltx, e.g. 
<code>vignettes/UsingYadayada.ltx</code> (*)</p></li> <li><p>Add the following meta directives at the top of the LaTeX file:<br /> <code>%\VignetteIndexEntry{Using Yadayada}</code><br /> <code>%\VignetteEngine{R.rsp::tex}</code></p></li> <li><p>Add the following to your <code>DESCRIPTION</code> file:<br /> <code>Suggests: R.rsp</code><br /> <code>VignetteBuilder: R.rsp</code></p></li> </ol> <p>That&rsquo;s all!</p> <p>When you run <code>R CMD build</code>, the <code>R.rsp::tex</code> vignette engine will compile your LaTeX vignette into a PDF and make it part of your package&rsquo;s *.tar.gz file. As for any vignette engine, the PDF will be placed in the <code>inst/doc/</code> directory of the *.tar.gz file, ready to be installed together with your package. Users installing your package will <em>not</em> have to install R.rsp.</p> <p>If this is your first package vignette ever, you should know that you are now only baby steps away from writing your first &ldquo;dynamic&rdquo; vignette using Sweave, <a href="http://cran.r-project.org/package=knitr">knitr</a> or RSP. For RSP-embedded LaTeX vignettes, change the engine to <code>R.rsp::rsp</code>, rename the file to <code>*.ltx.rsp</code> (or <code>*.tex.rsp</code>) and start embedding R code in the LaTeX file, e.g. &lsquo;The p-value is &lt;%= signif(p, 2) %&gt;&rsquo;.</p> <p><em>Footnote:</em> (*) If one uses filename extension <code>*.tex</code>, then <code>R CMD check</code> will give a <em>false</em> NOTE saying the file &ldquo;should probably not be installed&rdquo;. Using extension <code>*.ltx</code>, which is an official LaTeX extension, avoids this issue.</p> <h3 id="why-not-use-sweave">Why not use Sweave?</h3> <p>It has always been possible to &ldquo;hijack&rdquo; the Sweave vignette engine to achieve the same thing by renaming the filename extension to <code>*.Rnw</code> and including the proper <code>\VignetteIndexEntry</code> markup. 
This would trick R into compiling it as a Sweave vignette (without Sweave markup), resulting in a PDF, which in practice would work as a plain LaTeX-to-PDF compiler. The <code>R.rsp::tex</code> engine achieves the same without the &ldquo;hack&rdquo; and without the Sweave machinery.</p> <h3 id="static-pdfs">Static PDFs?</h3> <p>If you want to use a &ldquo;static&rdquo; pre-generated PDF as a package vignette, that can also be achieved in a few steps using the <code>R.rsp::asis</code> vignette engine. There is an R.rsp <a href="http://cran.r-project.org/package=R.rsp">vignette</a> explaining how to do this, but please consider alternatives that compile from source before doing this. Also, vignettes without full source may not be accepted by CRAN. A LaTeX vignette does not have this problem.</p> <h2 id="links">Links</h2> <ul> <li>CRAN page: <a href="http://cran.r-project.org/package=R.rsp">http://cran.r-project.org/package=R.rsp</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/R.rsp">https://github.com/HenrikBengtsson/R.rsp</a></li> </ul> </description>
</item>
<item>
<title>Package: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors)</title>
<link>https://www.jottr.org/2015/01/25/matrixstats-0.13.1/</link>
<pubDate>Sun, 25 Jan 2015 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2015/01/25/matrixstats-0.13.1/</guid>
<description> <p>A new release 0.13.1 of <a href="http://cran.r-project.org/package=matrixStats">matrixStats</a> is now on CRAN. The source code is available on <a href="https://github.com/HenrikBengtsson/matrixStats">GitHub</a>.</p> <h2 id="what-does-it-do">What does it do?</h2> <p>The matrixStats package provides highly optimized functions for computing common summaries over rows and columns of matrices, e.g. <code>rowQuantiles()</code>. There are also functions that operate on vectors, e.g. <code>logSumExp()</code>. Their implementations strive to minimize both memory usage and processing time. They are often remarkably faster than good old <code>apply()</code> solutions. The calculations are mostly implemented in C, which allows us to optimize(*) beyond what is possible to do in plain R. The package installs out-of-the-box on all common operating systems, including Linux, OS X and Windows.</p> <p>The following example computes the median of the columns in a 20-by-500 matrix:</p> <pre><code class="language-r">&gt; library(&quot;matrixStats&quot;) &gt; X &lt;- matrix(rnorm(20 * 500), nrow = 20, ncol = 500) &gt; stats &lt;- microbenchmark::microbenchmark(colMedians = colMedians(X), + `apply+median` = apply(X, MARGIN = 2, FUN = median), unit = &quot;ms&quot;) &gt; stats Unit: milliseconds expr min lq mean median uq max neval cld colMedians 0.41 0.45 0.49 0.47 0.5 0.75 100 a apply+median 21.50 22.77 25.59 23.86 26.2 107.12 100 b </code></pre> <p><img src="https://www.jottr.org/post/colMedians.png" alt="Graph showing that colMedians is significantly faster than apply+median over 100 test runs" /></p> <p>It shows that <code>colMedians()</code> is ~51 times faster than <code>apply(..., MARGIN = 2, FUN = median)</code> in this particular case. The relative gain varies with matrix shape, so you should benchmark with your own configurations. You can also play around with the benchmark reports that are under development, e.g. 
<code>html &lt;- matrixStats:::benchmark(&quot;colRowMedians&quot;); !html</code>.</p> <h2 id="what-is-new">What is new?</h2> <p>With this release, all <em>the functions run faster than ever before and at the same time use less memory than ever before</em>, which in turn means that now even larger data matrices can be processed without having to upgrade the RAM. A few small bugs have also been fixed and some &ldquo;missing&rdquo; <a href="http://cran.r-project.org/web/packages/matrixStats/vignettes/matrixStats-methods.html">functions</a> have been added to the R API. This update is part of a long-term tune-up that started back in June 2014. Most of the major groundwork has already been done, but there is still room for improvement. If you&rsquo;re already using matrixStats functions in your package, you should see some notable speedups for those function calls, especially compared to what was available back in June. For instance, <code>rowMins()</code> is now <a href="http://stackoverflow.com/questions/13676878/fastest-way-to-get-min-from-every-column-in-a-matrix">5-20 times faster</a> than functions such as <code>base::pmin.int()</code> whereas in the past they performed roughly the same.</p> <p>I&rsquo;ve also added a large number of new package tests; the R and C source code coverage has recently gone up from 59% to <a href="https://coveralls.io/r/HenrikBengtsson/matrixStats?branch=develop">96%</a> (&hellip; and counting). Some of the bugs were discovered as part of this effort. Here a special thanks should go out to Jim Hester for his great work on <a href="https://github.com/jimhester/covr">covr</a>, which provides me with on-the-fly coverage reports via Coveralls. (You can run covr locally or via GitHub + Travis CI, which is very easy if you&rsquo;re already up and running there. 
<em>Try it!</em>) I would also like to thank the R core team and the CRAN team for their continuous efforts on improving the package tests that we get via <code>R CMD check</code> but also via the CRAN farm (which occasionally catches code issues that I&rsquo;m not always seeing on my end).</p> <p><em>Footnote: (*) One strategy for keeping the memory footprint at a minimum is to optimize the implementations for the integer and the numeric (double) data types separately. Because of this, a great number of data-type coercions are avoided, coercions that otherwise would consume precious memory due to temporarily allocated copies, but also precious processing time because the garbage collector later would have to spend time cleaning up the mess. The new <code>weightedMean()</code> function, which is many times faster than <code>stats::weighted.mean()</code>, is one of several cases where this strategy is particularly helpful.</em></p> <h2 id="links">Links</h2> <ul> <li>CRAN page: <a href="http://cran.r-project.org/package=matrixStats">http://cran.r-project.org/package=matrixStats</a></li> <li>GitHub page: <a href="https://github.com/HenrikBengtsson/matrixStats">https://github.com/HenrikBengtsson/matrixStats</a></li> <li>Coveralls page: <a href="https://coveralls.io/r/HenrikBengtsson/matrixStats?branch=develop">https://coveralls.io/r/HenrikBengtsson/matrixStats?branch=develop</a></li> <li>Bug reports: <a href="https://github.com/HenrikBengtsson/matrixStats/issues">https://github.com/HenrikBengtsson/matrixStats/issues</a></li> <li>covr: <a href="https://github.com/jimhester/covr">https://github.com/jimhester/covr</a></li> </ul> </description>
</item>
<item>
<title>Milestone: 6000 Packages on CRAN</title>
<link>https://www.jottr.org/2014/10/29/milestone-cran-6000/</link>
<pubDate>Wed, 29 Oct 2014 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2014/10/29/milestone-cran-6000/</guid>
<description><p>Another 1,000 packages were added to CRAN, this time in less than 12 months. Today (2014-10-29) on The Comprehensive R Archive Network (CRAN) package page:</p> <blockquote> <p>&ldquo;Currently, the CRAN package repository features 6000 available packages.&rdquo;</p> </blockquote> <p>Going from 5,000 to 6,000 packages took 355 days - which means that, on average, a new package was added every ~8.5 hours. The actual rate is even higher, since dropped packages are not accounted for. The 6,000 packages on CRAN are maintained by 3,444 people. Thanks to all package developers and to the CRAN Team for handling all this!</p> <p>You can give back by carefully reporting bugs to the maintainers and properly citing any packages you use in your publications, cf. <code>citation(&quot;pkg name&quot;)</code>.</p> <p>Milestones:</p> <ul> <li>2014-10-29: <a href="https://mailman.stat.ethz.ch/pipermail/r-devel/2014-October/069997.html">6000 packages</a></li> <li>2013-11-08: <a href="https://stat.ethz.ch/pipermail/r-devel/2013-November/067935.html">5000 packages</a></li> <li>2012-08-23: <a href="https://stat.ethz.ch/pipermail/r-devel/2012-August/064675.html">4000 packages</a></li> <li>2011-05-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2011-May/061002.html">3000 packages</a></li> <li>2009-10-04: <a href="https://stat.ethz.ch/pipermail/r-devel/2009-October/055049.html">2000 packages</a></li> <li>2007-04-12: <a href="https://stat.ethz.ch/pipermail/r-devel/2007-April/045359.html">1000 packages</a></li> <li>2004-10-01: 500 packages</li> <li>2003-04-01: 250 packages</li> </ul> <p>These data are for CRAN only. There are many more packages elsewhere, e.g. <a href="http://bioconductor.org/">Bioconductor</a>, <a href="http://r-forge.r-project.org/">R-Forge</a> (sic!), <a href="http://rforge.net/">RForge</a> (sic!), <a href="http://github.com/">Github</a> etc.</p> </description>
</item>
<item>
<title>Pitfall: Did You Really Mean to Use matrix(nrow, ncol)?</title>
<link>https://www.jottr.org/2014/06/17/matrixna-wrong-way/</link>
<pubDate>Tue, 17 Jun 2014 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2014/06/17/matrixna-wrong-way/</guid>
<description> <p><img src="https://www.jottr.org/post/wrong_way_035.jpg" alt="Road sign reading &quot;Wrong Way&quot;" /></p> <p>Are you a good R citizen who preallocates your matrices? <strong>If you are allocating a numeric matrix in one of the following two ways, then you are doing it the wrong way!</strong></p> <pre><code class="language-r">x &lt;- matrix(nrow = 500, ncol = 100) </code></pre> <p>or</p> <pre><code class="language-r">x &lt;- matrix(NA, nrow = 500, ncol = 100) </code></pre> <p>Why? Because it is counterproductive. And why is that? In the above, <code>x</code> becomes a <strong>logical</strong> matrix, and <strong>not a numeric</strong> matrix as intended. This is because the default value of the <code>data</code> argument of <code>matrix()</code> is <code>NA</code>, which is a <strong>logical</strong> value, i.e.</p> <pre><code class="language-r">&gt; x &lt;- matrix(nrow = 500, ncol = 100) &gt; mode(x) [1] &quot;logical&quot; &gt; str(x) logi [1:500, 1:100] NA NA NA NA NA NA ... </code></pre> <p>Why is that bad? Because, as soon as you assign a numeric value to any of the cells in <code>x</code>, the matrix will first have to be coerced to numeric when the new value is assigned. 
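</p> <p>You can watch this coercion happen by checking <code>mode(x)</code> before and after the first numeric assignment, e.g. (a small illustration using a 2-by-2 matrix)</p> <pre><code class="language-r">&gt; x &lt;- matrix(nrow = 2, ncol = 2) &gt; mode(x) [1] &quot;logical&quot; &gt; x[1, 1] &lt;- 3.14 &gt; mode(x) [1] &quot;numeric&quot; </code></pre> <p>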
<strong>The originally allocated logical matrix was allocated in vain and just adds an unnecessary memory footprint and extra work for the garbage collector</strong>.</p> <p>Instead, allocate it using <code>NA_real_</code> (or <code>NA_integer_</code> for integers):</p> <pre><code class="language-r">x &lt;- matrix(NA_real_, nrow = 500, ncol = 100) </code></pre> <p>Of course, if you wish to allocate a matrix with all zeros, use <code>0</code> instead of <code>NA_real_</code> (or <code>0L</code> for integers).</p> <p>The exact same thing happens with <code>array()</code>, again because the default value is <code>NA</code>, e.g.</p> <pre><code class="language-r">&gt; x &lt;- array(dim = c(500, 100)) &gt; mode(x) [1] &quot;logical&quot; </code></pre> <p>Similarly, be careful when you set up vectors using <code>rep()</code>, e.g. compare</p> <pre><code class="language-r">x &lt;- rep(NA, times = 500) </code></pre> <p>to</p> <pre><code class="language-r">x &lt;- rep(NA_real_, times = 500) </code></pre> <p>Note, if all you want is a vector of all zeros, you may as well use</p> <pre><code class="language-r">x &lt;- double(500) </code></pre> <p>for doubles and</p> <pre><code class="language-r">x &lt;- integer(500) </code></pre> <p>for integers.</p> <h2 id="details">Details</h2> <p>In the &lsquo;base&rsquo; package there is a neat little function called <code>tracemem()</code> that can be used to trace the internal copying of objects. We can use it to show how the two cases differ. 
Let&rsquo;s start by doing it the wrong way:</p> <pre><code class="language-r">&gt; x &lt;- matrix(nrow = 500, ncol = 100) &gt; tracemem(x) [1] &quot;&lt;0x00000000100a0040&gt;&quot; &gt; x[1,1] &lt;- 3.14 tracemem[0x00000000100a0040 -&gt; 0x000007ffffba0010]: &gt; x[1,2] &lt;- 2.71 &gt; </code></pre> <p>That &lsquo;tracemem&rsquo; output message basically tells us that <code>x</code> is copied, or more precisely that a new internal object (0x000007ffffba0010) is allocated and that <code>x</code> now refers to that instead of the original one (0x00000000100a0040). This happens because <code>x</code> needs to be coerced from logical to numerical before assigning cell (1,1) the (numerical) value 3.14. Note that there is no need for R to create a copy in the second assignment to <code>x</code>, because at this point it is already of a numeric type.</p> <p>To avoid the above, let&rsquo;s make sure to allocate a numeric matrix from the start, and there will be no extra copies created:</p> <pre><code class="language-r">&gt; x &lt;- matrix(NA_real_, nrow = 500, ncol = 100) &gt; tracemem(x) [1] &quot;&lt;0x000007ffffd70010&gt;&quot; &gt; x[1,1] &lt;- 3.14 &gt; x[1,2] &lt;- 2.71 &gt; </code></pre> <h2 id="appendix">Appendix</h2> <h3 id="session-information">Session information</h3> <pre><code class="language-r">R version 3.1.0 Patched (2014-06-11 r65921) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] R.utils_1.32.5 R.oo_1.18.2 R.methodsS3_1.6.2 loaded via a namespace (and not attached): [1] R.cache_0.10.0 R.rsp_0.19.0 tools_3.1.0 </code></pre> <h3 id="reproducibility">Reproducibility</h3> <p>This report was generated from an RSP-embedded Markdown <a 
href="https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/matrixNA.md.rsp">document</a> using <a href="http://cran.r-project.org/package=R.rsp">R.rsp</a> v0.19.0. <!-- It can be recompiled as `R.rsp::rfile("https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/matrixNA.md.rsp")`. --></p> </description>
</item>
<item>
<title>Performance: captureOutput() is Much Faster than capture.output()</title>
<link>https://www.jottr.org/2014/05/26/captureoutput/</link>
<pubDate>Mon, 26 May 2014 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2014/05/26/captureoutput/</guid>
<description> <p>The R function <code>capture.output()</code> can be used to &ldquo;collect&rdquo; the output of functions such as <code>cat()</code> and <code>print()</code> to strings. For example,</p> <pre><code class="language-r">&gt; s &lt;- capture.output({ + cat(&quot;Hello\nworld!\n&quot;) + print(pi) + }) &gt; s [1] &quot;Hello&quot; &quot;world!&quot; &quot;[1] 3.141593&quot; </code></pre> <p>More precisely, it captures all output sent to the <a href="http://www.wikipedia.org/wiki/Standard_streams">standard output</a> and returns a character vector where each element corresponds to a line of output. By the way, it does not capture the output sent to the standard error, e.g. <code>cat(&quot;Hello\nworld!\n&quot;, file = stderr())</code> and <code>message(&quot;Hello\nworld!\n&quot;)</code>.</p> <p>However, as currently implemented (R 3.1.0), this function is <a href="https://stat.ethz.ch/pipermail/r-devel/2014-February/068349.html">very slow</a> in capturing a large number of lines. Its processing time is approximately <em>quadratic (= $O(n^2)$)</em>, <del>exponential (= O(e^n))</del> in the number of lines captured, e.g. on my notebook 10,000 lines take 0.7 seconds to capture, whereas 50,000 take 12 seconds, and 100,000 take 42 seconds. The culprit is <code>textConnection()</code> which <code>capture.output()</code> utilizes. Without going into the <a href="https://github.com/wch/r-source/blob/R-3-1-branch/src/main/connections.c#L2920-2960">details</a>, it turns out that <code>textConnection()</code> copies lines one by one internally, which is extremely inefficient.</p> <p><strong>The <code>captureOutput()</code> function of <a href="http://cran.r-project.org/package=R.utils">R.utils</a> does not have this problem.</strong> Its processing time is <em>linear</em> in the number of lines and characters, because it relies on <code>rawConnection()</code> instead of <code>textConnection()</code>. 
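</p> <p>For the common use case above, <code>captureOutput()</code> should work as a drop-in replacement for <code>capture.output()</code> (assuming the R.utils package is installed), e.g.</p> <pre><code class="language-r">&gt; library(&quot;R.utils&quot;) &gt; s &lt;- captureOutput({ + cat(&quot;Hello\nworld!\n&quot;) + print(pi) + }) &gt; s [1] &quot;Hello&quot; &quot;world!&quot; &quot;[1] 3.141593&quot; </code></pre> <p>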
For instance, 100,000 lines take 0.2 seconds and 1,000,000 lines take 2.5 seconds to capture when the lines are 100 characters long. For 100,000 lines with 1,000 characters it takes 2.4 seconds.</p> <h2 id="benchmarking">Benchmarking</h2> <p>The above benchmark results were obtained as follows. We first create a function that generates a string with a large number of lines:</p> <pre><code class="language-r">&gt; lineBuffer &lt;- function(n, len) { + line &lt;- paste(c(rep(letters, length.out = len), &quot;\n&quot;), collapse = &quot;&quot;) + line &lt;- charToRaw(line) + lines &lt;- rep(line, times = n) + rawToChar(lines, multiple = FALSE) + } </code></pre> <p>For example,</p> <pre><code class="language-r">&gt; cat(lineBuffer(n = 2, len = 10)) abcdefghij abcdefghij </code></pre> <p>For very long character vectors <code>paste()</code> becomes very slow, which is why <code>rawToChar()</code> is used above.</p> <p>Next, let&rsquo;s create a function that measures the processing time for a capture function to capture the output of a given number of lines:</p> <pre><code class="language-r">&gt; benchmark &lt;- function(fcn, n, len) { + x &lt;- lineBuffer(n, len) + system.time({ + fcn(cat(x)) + }, gcFirst = TRUE)[[3]] + } </code></pre> <p>Note that the measured processing time neither includes the creation of the line buffer string nor the garbage collection.</p> <p>The functions to be benchmarked are:</p> <pre><code class="language-r">&gt; fcns &lt;- list(capture.output = capture.output, captureOutput = captureOutput) </code></pre> <p>and we choose to benchmark for outputs with a varying number of lines:</p> <pre><code class="language-r">&gt; ns &lt;- c(1, 10, 100, 1000, 10000, 25000, 50000, 75000, 1e+05) </code></pre> <p>Finally, let&rsquo;s benchmark all of the above with lines of length 100 and 1,000 characters:</p> <pre><code class="language-r">&gt; benchmarkAll &lt;- function(ns, len) { + stats &lt;- lapply(ns, FUN = function(n) { + message(sprintf(&quot;n=%d&quot;, n)) + t 
&lt;- sapply(fcns, FUN = benchmark, n = n, len = len) + data.frame(name = names(t), n = n, time = unname(t)) + }) + Reduce(rbind, stats) + } &gt; stats_100 &lt;- benchmarkAll(ns, len = 100L) &gt; stats_1000 &lt;- benchmarkAll(ns, len = 1000L) </code></pre> <p>The results are:</p> <table> <thead> <tr> <th align="right">n</th> <th align="right">capture.output(100)</th> <th align="right">captureOutput(100)</th> <th align="right">capture.output(1000)</th> <th align="right">captureOutput(1000)</th> </tr> </thead> <tbody> <tr> <td align="right">1</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> </tr> <tr> <td align="right">10</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.00</td> </tr> <tr> <td align="right">100</td> <td align="right">0.00</td> <td align="right">0.00</td> <td align="right">0.01</td> <td align="right">0.00</td> </tr> <tr> <td align="right">1000</td> <td align="right">0.00</td> <td align="right">0.02</td> <td align="right">0.02</td> <td align="right">0.01</td> </tr> <tr> <td align="right">10000</td> <td align="right">0.69</td> <td align="right">0.02</td> <td align="right">0.80</td> <td align="right">0.21</td> </tr> <tr> <td align="right">25000</td> <td align="right">3.18</td> <td align="right">0.05</td> <td align="right">2.99</td> <td align="right">0.57</td> </tr> <tr> <td align="right">50000</td> <td align="right">11.88</td> <td align="right">0.15</td> <td align="right">10.33</td> <td align="right">1.17</td> </tr> <tr> <td align="right">75000</td> <td align="right">25.01</td> <td align="right">0.19</td> <td align="right">25.43</td> <td align="right">1.80</td> </tr> <tr> <td align="right">100000</td> <td align="right">41.73</td> <td align="right">0.24</td> <td align="right">46.34</td> <td align="right">2.41</td> </tr> </tbody> </table> <p><em>Table: Benchmarking of <code>captureOutput()</code> and <code>capture.output()</code> 
for n lines of length 100 and 1,000 characters. All times are in seconds.</em></p> <p><img src="https://www.jottr.org/post/captureOutput_vs_capture.output,67760e64d0951ca2124886cd8c257b6c,len=100.png" alt="captureOutput_vs_capture.output" /> <em>Figure: <code>captureOutput()</code> captures standard output much faster than <code>capture.output()</code>. The processing time for the latter grows quadratically in the number of lines captured, whereas for the former it only grows linearly.</em></p> <p>These results will vary a little bit from run to run, particularly since we only benchmark once per setting. This also explains why for some settings the processing time for lines with 1,000 characters appears faster than the corresponding setting with 100 characters. Averaging over multiple runs would remove this artifact.</p> <p><strong>UPDATE:</strong><br /> 2015-02-06: Thanks to Kevin Van Horn for pointing out that the growth of the <code>capture.output()</code> is probably not as extreme as <em>exponential</em> and suggests <em>quadratic</em> growth.</p> <h2 id="appendix">Appendix</h2> <h3 id="session-information">Session information</h3> <pre><code class="language-r">R version 3.1.0 Patched (2014-05-21 r65711) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] markdown_0.7 plyr_1.8.1 R.cache_0.9.5 knitr_1.5.26 [5] ggplot2_1.0.0 R.devices_2.9.2 R.utils_1.32.5 R.oo_1.18.2 [9] R.methodsS3_1.6.2 loaded via a namespace (and not attached): [1] base64enc_0.1-1 colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 [5] formatR_0.10 grid_3.1.0 gtable_0.1.2 labeling_0.2 [9] MASS_7.3-33 mime_0.1.1 munsell_0.4.2 proto_0.3-10 [13] R.rsp_0.18.2 Rcpp_0.11.1 reshape2_1.4 scales_0.2.4 [17] stringr_0.6.2 
tools_3.1.0 </code></pre> <p>Tables were generated using <a href="http://cran.r-project.org/package=plyr">plyr</a> and <a href="http://cran.r-project.org/package=knitr">knitr</a>, and graphics using <a href="http://cran.r-project.org/package=ggplot2">ggplot2</a>.</p> <h3 id="reproducibility">Reproducibility</h3> <p>This report was generated from an RSP-embedded Markdown <a href="https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/captureOutput.md.rsp">document</a> using <a href="http://cran.r-project.org/package=R.rsp">R.rsp</a> v0.18.2. <!-- It can be recompiled as `R.rsp::rfile("https://gist.github.com/HenrikBengtsson/854d13a11a33b3d43ec3/raw/captureOutput.md.rsp")`. --></p> </description>
</item>
<item>
<title>Speed Trick: Assigning Large Object NULL is Much Faster than using rm()!</title>
<link>https://www.jottr.org/2013/05/25/trick-fast-rm/</link>
<pubDate>Sat, 25 May 2013 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2013/05/25/trick-fast-rm/</guid>
<description> <p>When processing large data sets in R you often also end up creating large temporary objects. In order to keep the memory footprint small, it is always good to remove those temporary objects as soon as possible. When done, removed objects will be deallocated from memory (RAM) the next time the garbage collection runs.</p> <h2 id="better-use-rm-list-x-instead-of-rm-x-if-using-rm">Better: Use <code>rm(list = &quot;x&quot;)</code> instead of <code>rm(x)</code>, if using <code>rm()</code></h2> <p>To remove an object in R, one can use the <code>rm()</code> function (with alias <code>remove()</code>). However, it turns out that that function has quite a bit of internal overhead (look at its R code), particularly if you call it as <code>rm(x)</code> rather than <code>rm(list = &quot;x&quot;)</code>. The former takes about three times longer to complete. Example:</p> <pre><code class="language-r">&gt; t1 &lt;- system.time(for (k in 1:1e5) { a &lt;- 1; rm(a) }) &gt; t2 &lt;- system.time(for (k in 1:1e5) { a &lt;- 1; rm(list = &quot;a&quot;) }) &gt; t1 user system elapsed 10.45 0.00 10.50 &gt; t2 user system elapsed 2.93 0.00 2.94 &gt; t1/t2 user system elapsed 3.566553 NaN 3.571429 </code></pre> <p>Note: In order to minimize the impact of the memory allocation on the benchmark, I use <code>a &lt;- 1</code> to represent the &ldquo;large&rdquo; object.</p> <h2 id="best-use-x-null-instead-of-rm">Best: Use x &lt;- NULL instead of rm()</h2> <p>Instead of using <code>rm(list = &quot;x&quot;)</code>, which still has a fair amount of overhead, one can remove a large active object by assigning the corresponding variable a new value (a small object), e.g. <code>x &lt;- NULL</code>. Whenever doing this, the previously assigned value (the large object) will become available for garbage collection. 
Example:</p> <pre><code class="language-r">&gt; t3 &lt;- system.time(for (k in 1:1e5) { a &lt;- 1; a &lt;- NULL }) &gt; t3 user system elapsed 0.05 0.00 0.05 &gt; t1/t3 user system elapsed 209 NaN 210 </code></pre> <p>That&rsquo;s a <strong>200 times speedup</strong>!</p> <h2 id="background">Background</h2> <p>I &ldquo;accidentally&rdquo; discovered this when profiling <code>readMat()</code> in my <a href="http://cran.r-project.org/web/packages/R.matlab/">R.matlab</a> package. In particular, there was one rm(x) call inside a local function that was called thousands of times when parsing modestly large MAT files. Together with some additional optimizations, R.matlab v2.0.0 (to be appear) is now 10-20 times faster. Now I&rsquo;m going to review all my other packages for expensive <code>rm()</code> calls.</p> </description>
</item>
<item>
<title>This Day in History (1997-04-01)</title>
<link>https://www.jottr.org/2013/04/01/history-r-help/</link>
<pubDate>Mon, 01 Apr 2013 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2013/04/01/history-r-help/</guid>
<description><p>Today marks 16 years and 367,496 messages since Martin Mächler started the R-help (321,119 msgs), R-devel (45,830 msgs) and R-announce (547 msgs) mailing lists [1] - a great benefit to all of us. Special thanks to Martin and also thanks to everyone else contributing to these forums.</p> <p><img src="https://www.jottr.org/post/r-help,r-devel.png" alt="Number of messages on R-help and R-devel from 1997 to 2013" /></p> <p>[1] <a href="https://stat.ethz.ch/pipermail/r-help/1997-April/001490.html">https://stat.ethz.ch/pipermail/r-help/1997-April/001490.html</a></p> </description>
</item>
<item>
<title>Speed Trick: unlist(..., use.names=FALSE) is Heaps Faster!</title>
<link>https://www.jottr.org/2013/01/07/trick-unlist/</link>
<pubDate>Mon, 07 Jan 2013 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2013/01/07/trick-unlist/</guid>
<description><p>Sometimes a minor change to your R code can make a big difference in processing time. Here is an example showing that if you don&rsquo;t care about the names attribute when <code>unlist()</code>:ing a list, specifying argument <code>use.names = FALSE</code> can speed up the processing a lot!</p> <pre><code class="language-r">&gt; x &lt;- split(sample(1000, size = 1e6, rep = TRUE), rep(1:1e5, times = 10)) &gt; t1 &lt;- system.time(y1 &lt;- unlist(x)) &gt; t2 &lt;- system.time(y2 &lt;- unlist(x, use.names = FALSE)) &gt; stopifnot(identical(y2, unname(y1))) &gt; t1/t2 user system elapsed 103 NaN 104 </code></pre> <p>That&rsquo;s more than a 100 times speedup.</p> <p>So, check your code to see which <code>unlist()</code> statements you can add a <code>use.names = FALSE</code> to.</p> </description>
</item>
<item>
<title>Force R Help HTML Server to Always Use the Same URL Port</title>
<link>https://www.jottr.org/2012/10/22/config-help-start/</link>
<pubDate>Mon, 22 Oct 2012 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2012/10/22/config-help-start/</guid>
<description><p>The below code shows how to configure the <code>help.ports</code> option in R such that the built-in R help server always uses the same URL port. Just add it to the <code>.Rprofile</code> file in your home directory (if missing, create it). For more details, see <code>help(&quot;Startup&quot;)</code>.</p> <pre><code class="language-r"># Force the URL of the help to http://127.0.0.1:21510 options(help.ports = 21510) </code></pre> <p>A slightly fancier version is to use an environment variable to set the port(s):</p> <pre><code class="language-r">local({ ports &lt;- Sys.getenv(&quot;R_HELP_PORTS&quot;, 21510) ports &lt;- as.integer(unlist(strsplit(ports, &quot;,&quot;))) options(help.ports = ports) }) </code></pre> <p>However, if you launch multiple R sessions in parallel, this means that they will all try to use the same port, but only the first one will succeed and all others will fail. An alternative is then to provide R with a set of ports to choose from (see <code>help(&quot;startDynamicHelp&quot;, package = &quot;tools&quot;)</code>). To set the ports to 21510-21519 if you run R v2.15.1, to 21520-21529 if you run R v2.15.2, to 21600-21609 if you run R v2.16.0 (&ldquo;devel&rdquo;) and so on, do:</p> <pre><code class="language-r">local({ port &lt;- sum(c(1e4, 100) * as.double(R.version[c(&quot;major&quot;, &quot;minor&quot;)])) options(help.ports = port + 0:9) }) </code></pre> <p>With this, it is easy to identify from the URL which version of R the displayed help is for. Finally, if you wish the R help server to start automatically in the background when you start R, add:</p> <pre><code class="language-r"># Try to start HTML help server if (interactive()) { try(tools::startDynamicHelp()) } </code></pre> </description>
</item>
<item>
<title>Set Package Repositories at Startup</title>
<link>https://www.jottr.org/2012/09/27/config-repos/</link>
<pubDate>Thu, 27 Sep 2012 00:00:00 +0000</pubDate>
<guid>https://www.jottr.org/2012/09/27/config-repos/</guid>
<description><p>The below code shows how to configure the <code>repos</code> option in R such that <code>install.packages()</code> etc. will locate the packages without having to explicitly specify the repository. Just add it to the <code>.Rprofile</code> file in your home directory (if missing, create it). For more details, see <code>help(&quot;Startup&quot;)</code>.</p> <pre><code class="language-r">local({ repos &lt;- getOption(&quot;repos&quot;) # http://cran.r-project.org/ # For a list of CRAN mirrors, see getCRANmirrors(). repos[&quot;CRAN&quot;] &lt;- &quot;http://cran.stat.ucla.edu&quot; # http://www.stats.ox.ac.uk/pub/RWin/ReadMe if (.Platform$OS.type == &quot;windows&quot;) { repos[&quot;CRANextra&quot;] &lt;- &quot;http://www.stats.ox.ac.uk/pub/RWin&quot; } # http://r-forge.r-project.org/ repos[&quot;R-Forge&quot;] &lt;- &quot;http://R-Forge.R-project.org&quot; # http://www.omegahat.org/ repos[&quot;Omegahat&quot;] &lt;- &quot;http://www.omegahat.org/R&quot; options(repos = repos) }) </code></pre> </description>
</item>
</channel>
</rss>