# Performance: Avoid Coercing Indices To Doubles

x[idxs + 1] or x[idxs + 1L]? That is the question. Assume that we have a vector $x$ of $n = 100,000$ random values, e.g. > n <- 100000 > x <- rnorm(n) and that we wish to calculate the $n-1$ first-order differences $y=(y_1, y_2, …, y_{n-1})$ where $y_i=x_{i+1} - x_i$. In R, we can calculate this using the following vectorized form: > idxs <- seq_len(n - 1) > y <- x[idxs + 1] - x[idxs] We can certainly do better if we turn to native code, but is there a more efficient way to implement this using plain R code?

# matrixStats: Optimized Subsetted Matrix Calculations

The matrixStats package provides highly optimized functions for computing common summaries over rows and columns of matrices. In a previous blog post, I showed that, instead of using apply(X, MARGIN = 2, FUN = median), we can speed up calculations dramatically by using colMedians(X). In the most recent release (version 0.50.0), matrixStats has been extended to perform optimized calculations also on a subset of rows and/or columns specified via new arguments rows and cols, e.

# Performance: Calling R_CheckUserInterrupt() Every 256 Iteration is Actually Faster than Every 1,000,000 Iteration

If your native code takes more than a few seconds to finish, it is a nice courtesy to the user to check for user interrupts (Ctrl-C) once in a while, say, every 1,000 or 1,000,000 iteration. The C-level API of R provides R_CheckUserInterrupt() for this (see ‘Writing R Extensions’ for more information on this function). Here’s what the code would typically look like: for (int ii = 0; ii < n; ii++) { /* Some computational expensive code */ if (ii % 1000 == 0) R_CheckUserInterrupt() } This uses the modulo operator % and tests when it is zero, which happens every 1,000 iteration.

# To Students: matrixStats for Google Summer of Code

We are pleased to announce our proposal ‘Subsetted and parallel computations in matrixStats’ for Google Summer of Code. The project is aimed for a student with experience in R and C, it runs for three months, and the student gets paid 5500 USD by Google. Students from (almost) all over the world can apply. Application deadline is March 27, 2015. I, Henrik Bengtsson, and Héctor Corrada Bravo will be joint mentors.

# Package: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors)

A new release 0.13.1 of matrixStats is now on CRAN. The source code is available on GitHub. What does it do? The matrixStats package provides highly optimized functions for computing common summaries over rows and columns of matrices, e.g. rowQuantiles(). There are also functions that operate on vectors, e.g. logSumExp(). Their implementations strive to minimize both memory usage and processing time. They are often remarkably faster compared to good old apply() solutions.

# Pitfall: Did You Really Mean to Use matrix(nrow, ncol)?

Are you a good R citizen and preallocates your matrices? If you are allocating a numeric matrix in one of the following two ways, then you are doing it the wrong way! x <- matrix(nrow = 500, ncol = 100) or x <- matrix(NA, nrow = 500, ncol = 100) Why? Because it is counter productive. And why is that? In the above, x becomes a logical matrix, and not a numeric matrix as intended.

# Performance: captureOutput() is Much Faster than capture.output()

The R function capture.output() can be used to “collect” the output of functions such as cat() and print() to strings. For example, > s <- capture.output({ + cat("Hello\nworld!\n") + print(pi) + }) > s [1] "Hello" "world!" "[1] 3.141593" More precisely, it captures all output sent to the standard output and returns a character vector where each element correspond to a line of output. By the way, it does not capture the output sent to the standard error, e.

# Speed Trick: Assigning Large Object NULL is Much Faster than using rm()!

When processing large data sets in R you often also end up creating large temporary objects. In order to keep the memory footprint small, it is always good to remove those temporary objects as soon as possible. When done, removed objects will be deallocated from memory (RAM) the next time the garbage collection runs. Better: Use rm(list = "x") instead of rm(x), if using rm() To remove an object in R, one can use the rm() function (with alias remove()).

# Speed Trick: unlist(..., use.names=FALSE) is Heaps Faster!

Sometimes a minor change to your R code can make a big difference in processing time. Here is an example showing that if you’re don’t care about the names attribute when unlist():ing a list, specifying argument use.names = FALSE can speed up the processing lots! > x <- split(sample(1000, size = 1e6, rep = TRUE), rep(1:1e5, times = 10)) > t1 <- system.time(y1 <- unlist(x)) > t2 <- system.time(y2 <- unlist(x, use.

#### Henrik Bengtsson

MSc CS | PhD Math Stat | Associate Professor | R

Associate Professor