General notes, musings and comments about the “Randomized Matrix Decompositions using R” paper by Erichson, Brunton, Voronin, and Kutz

The paper introduces the rsvd R package, which implements the randomized partial SVD estimation method of Halko, Martinsson, Tropp, Rokhlin, and others.

I think the paper is really nice. The careful description of the randomized SVD algorithm is great to read. And I think the examples are cool. Some very specific comments I have are outlined below.

Performance comparison of what exactly?

Section 3.7 of the paper compares several methods that compute rank-100 reconstructions of a 1,600 x 1,200 matrix whose entries are grayscale values from the “tiger” image included in the rsvd package. The relative Frobenius error norm of each reconstructed matrix \(\mathrm{tiger}_{est}\) compared to the original is reported, \[ \|\mathrm{tiger}_{est} - \mathrm{tiger}\|_F / \|\mathrm{tiger}\|_F, \] along with wall-clock times for each method. Other than selecting the rank of the reconstruction, default parameters are used for each method, except for a varying number of iterations (the q argument) for rsvd().

I think it’s important to carefully understand the comparison. As written, I’m not sure the timings have much meaning–that is, the comparisons are not quite valid. In a sense, apples and oranges are being compared here.

First, note that it is the relative error norms of the low-rank reconstructions that are compared–not more direct measures of the accuracy of the estimated partial singular value decompositions themselves.

Why should that matter? The SVD is known to produce the reconstruction with the lowest Frobenius error for a given rank (the Eckart–Young theorem). But, importantly, the converse need not hold–a low-rank reconstruction with low relative Frobenius error does not imply an accurate singular value decomposition estimate.

Indeed, the converse doesn’t hold for this example, because the matrix has a few big singular values followed by a long, gradually decaying tail of increasingly tiny ones. That means there are potentially lots of rank-100 reconstructions with error norms nearly as good as the optimal one found by the SVD.

library(rsvd)
data(tiger)
t_svd <- system.time(s <- svd(tiger))   # full SVD of the 1,600 x 1,200 image matrix
plot(s$d)                               # a few big singular values, then a long slowly-decaying tail
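
Since the optimal rank-k reconstruction error is determined entirely by the tail of the singular values (Eckart–Young), we can quantify the size of that tail directly from the s computed above:

k <- 100
sqrt(sum(s$d[-(1:k)]^2)) / sqrt(sum(s$d^2))   # optimal rank-100 relative Frobenius error

This should reproduce (up to rounding) the e_svd value of about 0.12 computed below–even the best rank-100 reconstruction leaves roughly 12% of the Frobenius norm behind, so many nearby rank-100 subspaces do almost as well.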

Again, why does that matter? Well, if you only want a low-rank reconstruction, as in this example, it’s great news! We can be pretty loose in our estimate and still get a good result.

The svds() and irlba() functions from the RSpectra and irlba packages, respectively, are designed to estimate partial singular value decompositions to a specified accuracy–accuracy, that is, with respect to the decomposition itself.

The main issue I find in this paper is with the reported timings. The comparisons presented in the paper do not measure the time to compute a partial SVD estimate, but rather the time to compute a decent low-rank reconstruction–a much easier problem than estimating the SVD in this case.

And, since the rsvd() function has no convergence tolerance defined in terms of the estimated partial SVD, results are shown simply for a few fixed numbers of algorithm iterations. Meanwhile the poor svds() and irlba() functions are shown busy toiling away on the different and harder problem of computing an accurate SVD estimate.

Let’s illustrate all that talk with the example itself. First, here are timings and relative errors using R’s native svd() and the rsvd() and irlba() functions, from version 0.6 of the rsvd package and version 2.3.1 of the irlba package. The timings were collected on my home PC: an AMD A10-7850K quad-core 3700 MHz CPU with 16 GB of 1333 MHz DDR3 RAM, running Ubuntu 16.04 and R version 3.4.2 linked to OpenBLAS r0.3.0 (based on Goto BLAS) with OpenMP (the OMP_NUM_THREADS variable was unset, so all four cores were used when possible).

library(rsvd)
library(irlba)
## Loading required package: Matrix
data(tiger)
set.seed(1)
k <- 100
#t_svd <- system.time(s <- svd(tiger))  # already computed above
t_irlba <- system.time(l <- irlba(tiger, k))
t_rsvd <- system.time(q <- rsvd(tiger, k=k, q=3)) # q=3 because why not

# Timings:
print(rbind(t_svd, t_irlba, t_rsvd)[, 1:3])
        user.self sys.self elapsed
t_svd       6.736    0.144   2.358
t_irlba     2.680    0.024   0.718
t_rsvd      0.688    0.028   0.364

The code below computes the relative reconstruction error for each method:

N <- norm(tiger, "F")
e_svd <- norm(tiger - s$u[, 1:k] %*% (s$d[1:k] * t(s$v[, 1:k])), "F") / N
e_irlba <- norm(tiger - l$u %*% (l$d * t(l$v)), "F") / N
e_rsvd <- norm(tiger - q$u %*% (q$d * t(q$v)), "F") / N
print(rbind(e_svd, e_irlba, e_rsvd))
             [,1]
e_svd   0.1208136
e_irlba 0.1208136
e_rsvd  0.1215720

As in the paper, rsvd() is fastest and all the methods produce reconstructions with very similar relative errors. But note that irlba() finds a reconstruction with essentially the same (optimal) error as R’s svd(), while rsvd()’s error is slightly larger.

OK, now let’s compare the estimated partial singular value decompositions. The next plot compares the basis vectors of the range subspace produced by each solution method to R’s svd() function.

# Visualize the projected ranges: for each of the first 400 singular vectors from svd(),
# plot the norm of its projection onto each method's estimated rank-100 subspace
plot(sqrt(apply(crossprod(s$u[, 1:100], s$u[, 1:400]), 2, crossprod)), type='l', ylab="",
     main="u_{svd}^T u_{est} for each basis vector estimation method")
points(sqrt(apply(crossprod(l$u, s$u[, 1:400]), 2, crossprod)), pch="+", col=4)
points(sqrt(apply(crossprod(q$u, s$u[, 1:400]), 2, crossprod)), pch="x", col=2)
legend("topright", legend=c("svd basis vectors", "irlba basis vectors", "rsvd q=3 basis vectors"),
       fill=c(1, 4, 2), bty="n")

The irlba() and svd() methods find nearly the same range subspace; rsvd() does not. You could say rsvd() found either a pretty bad estimate of the SVD, or something that isn’t an SVD at all but rather some other rank-100 decomposition.

But of course, for this problem it simply doesn’t matter!–we don’t care about the SVD here at all, just the reconstruction. This is what I mean by apples and oranges: rsvd() is really not computing the same thing as svd(), irlba(), or svds() in this case.

A more apples-to-apples comparison approach

Now that we’ve carefully established that the aim of the ‘tiger’ example is to compute a decent low-rank reconstruction rather than an SVD estimate, we can formulate a more direct comparison of the various algorithms, for example by simply relaxing the convergence tolerance used by irlba() (and, although not shown here, the same applies to svds() from RSpectra).

The following results repeat the timings and errors above, adding a relaxed-tolerance run of irlba() for comparison.

t_irlba2 <- system.time(l2 <- irlba(tiger, k, tol=0.02))
e_irlba2 <- norm(tiger - l2$u %*% (l2$d * t(l2$v)), "F") / N

print(rbind(t_svd, t_irlba, t_rsvd, t_irlba2)[, 1:3])
print(rbind(e_svd, e_irlba, e_rsvd, e_irlba2))
        user.self sys.self elapsed
t_svd       6.736    0.144   2.358
t_irlba     2.680    0.024   0.718
t_rsvd      0.688    0.028   0.364
t_irlba2    1.368    0.000   0.357

             [,1]
e_svd    0.1208136
e_irlba  0.1208136
e_rsvd   0.1215720
e_irlba2 0.1216254

With this relaxed tolerance, irlba’s run time is much closer to rsvd’s, with a similar reconstruction error norm. I have not run it, but I expect the same is true of RSpectra’s svds() for this problem.
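
For completeness, the analogous relaxed-tolerance call for RSpectra (not run here) would look something like the following–this assumes svds() accepts a tol entry in its opts list, which recent RSpectra versions do:

# Not run: a relaxed-tolerance partial SVD via RSpectra, for comparison
library(RSpectra)
t_svds2 <- system.time(r2 <- svds(tiger, k, opts=list(tol=1e-2)))
e_svds2 <- norm(tiger - r2$u %*% (r2$d * t(r2$v)), "F") / N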

An alternative oranges-to-oranges comparison

Alternatively, we can see experimentally how long it takes rsvd to come up with a comparably accurate SVD estimate. Below I use, as a measure of subspace estimate quality, the maximum norm of the projection of the estimated basis vectors onto the subspace we want to exclude (the span of the trailing singular vectors from R’s svd() function). Many other similar measures could be used instead. Again, the reported results were run on my home PC described above.

# First look at irlba's max subspace error:
max(sqrt(apply(crossprod(l$u, s$u[, -(1:100)]), 2, crossprod)))
# [1] 0.00749263

# Now a comparable rsvd result:
system.time(q <- rsvd(tiger, k=k, q=43))
#    user  system elapsed 
#   6.460   0.388   3.689 

max(sqrt(apply(crossprod(q$u, s$u[, -(1:100)]), 2, crossprod)))
# [1] 0.006233981

I found it took q=43 to get a comparable result, so in this oranges-to-oranges comparison of computing an SVD estimate, rsvd is quite a bit slower.

Comment

I think that the lack of a meaningful convergence tolerance in rsvd() is a problem (and the same goes for the randomized_svd() implementation in Python’s scikit-learn!). Users should have at least a ballpark estimate of what the output actually represents, in my opinion.
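
To make the point concrete, here is a rough sketch of the kind of stopping heuristic a user could bolt on themselves–the rsvd_until() helper below is purely hypothetical, not part of the rsvd package, and singular-value stagnation is only a crude proxy for subspace convergence:

# Hypothetical helper: increase the number of iterations q until the estimated
# singular values change by less than a relative tolerance. Wasteful, because each
# call starts over from scratch--which is exactly why a built-in tolerance would help.
rsvd_until <- function(A, k, tol=1e-3, q_max=20)
{
  prev <- NULL
  for (q in seq_len(q_max)) {
    fit <- rsvd::rsvd(A, k=k, q=q)
    if (!is.null(prev) && max(abs(fit$d - prev) / fit$d) < tol) break
    prev <- fit$d
  }
  fit
}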

Is it even important to compute accurate SVD estimates?

It’s worth asking this question because the answer is clearly no!–at least for some problems, like the one discussed here.

But I think that it is important for a large class of problems. For example, the projected range subspaces from partial SVDs are often used to cluster data (it turns out that there are interesting, deep connections between clustering and partial SVDs). Inaccuracies in the subspace approximation may result in misclassification.
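
As a toy illustration of what I mean (my own example, not from the paper), the sketch below clusters the rows of a synthetic two-blob matrix using coordinates from a rank-2 partial SVD; a sloppy subspace estimate would feed directly into the clustering step:

# Toy example: k-means on coordinates from a partial SVD
library(irlba)
set.seed(2)
y <- rbind(matrix(rnorm(500 * 50), 500) + 2,    # blob 1
           matrix(rnorm(500 * 50), 500) - 2)    # blob 2
p <- irlba(y, nv=2)                             # rank-2 partial SVD
cl <- kmeans(p$u %*% diag(p$d), centers=2)      # cluster in the projected subspace
table(cl$cluster, rep(1:2, each=500))           # should be (nearly) diagonal or anti-diagonal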

Similarly, graph partitioning using the Fiedler vector relies on an accurate estimate of the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian. Again, inaccurate subspace estimation may lead to a poor partition. Many directed graph analyses–community detection, measures of vertex centrality, measures of flow through networks, and so on–are based on partial SVDs of the corresponding adjacency matrix, and all of them can be adversely affected by inaccurate subspace estimation.

And, subspace errors can lead to poor imputation results for partial SVD-based matrix completion methods.

Etc.

So, it depends on the problem I guess.

Musings on subspace iteration versus Krylov

It should be stated somewhere that the algorithms employed by irlba() and rsvd() are both randomized (or not) methods for estimating partial SVDs, in the sense that both methods start from a subspace usually (but not necessarily) defined by random vectors.

And both methods are natural extensions of well-studied approaches for eigenvalue problems. Each effectively seeks eigenvectors of a Gram-like matrix, but instead of solving an eigenvalue problem directly, each adapts its approach to solve projected SVD sub-problems. Both methods augment their working search space with extra vectors.

Both methods find solutions based on polynomials of the matrix \(x^T x\) (or \(x x^T\) depending on dimensions)–irlba finds a solution in a Krylov subspace and rsvd in a block Krylov subspace.

Consider the following example to help illuminate a difference between the methods. Let’s think for a moment about finding the left singular vector associated with the largest singular value of \(x\). Just for a moment, let’s cripple rsvd and use just a single starting vector \(q\) (no oversampling I guess you might say–then rsvd is basically orthogonal power iteration adapted to the SVD) and run \(n\) steps of the algorithm. Consider the Krylov subspace \[ K(x, q, n) = \mathrm{span}\{q, x^T x q, \ldots, (x^T x)^{n - 1} q\}. \] Then in this highly contrived example, rsvd’s solution for the right singular vector associated with the largest singular value is a particular vector in \(K(x, q, n)\)–namely \((x^T x)^{n-1} q\)–while irlba’s solution maximizes a Ritz value over all vectors in \(K(x, q, n)\).
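
To make the single-vector case concrete, here is a toy sketch (my own illustration, not rsvd’s actual code) of power iteration applied to \(x^T x\); it converges to the leading right singular vector, i.e. to that one particular vector in \(K(x, q, n)\):

# Toy single-vector power iteration on t(a) %*% a (a plays the role of x, q0 of q)
set.seed(3)
a  <- matrix(rnorm(200 * 50), 200)
q0 <- rnorm(50)
for (i in 1:100) {
  q0 <- crossprod(a, a %*% q0)   # apply t(a) %*% a
  q0 <- q0 / sqrt(sum(q0^2))     # normalize
}
abs(sum(q0 * svd(a)$v[, 1]))     # should be near 1; how near depends on the gap
                                 # between the first two singular values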

But of course rsvd always uses oversampling and seeks a solution in a block Krylov subspace. That is, instead of an initial vector \(q\), rsvd uses an initial matrix \(Q\). The larger the block size (including oversampling), the better the chance that some vectors quickly converge to a good solution. However, rsvd still limits its solution to a particular polynomial instead of finding a solution from all possible polynomials of degree \(n-1\) as irlba does.

Stated cheaply, irlba achieves fast convergence by optimizing over lots of polynomials with little data, while rsvd achieves fast convergence by optimizing over lots of data with a specific polynomial.

The convergence theory of Lanczos methods applied to eigenvalue problems was extensively studied by Kaniel and Paige back in the day. Much of that theory carries over to irlba’s adaptation of Lanczos to the SVD problem. You can find a summary comparison of basic Lanczos and basic single-vector power iteration in section 9.1.5 of Golub and Van Loan.

Gene Golub used to call the eigenvalue version of rsvd’s block approach “orthogonal iteration”; you can read about it in section 7.3.2 of Golub and Van Loan. Another presentation can be found in Chapter 5 of Saad’s eigenvalue book, http://www-users.cs.umn.edu/~saad/books.html (Saad calls it “subspace iteration”).
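
For reference, here is a bare-bones sketch of what subspace (orthogonal) iteration looks like in the SVD setting–this is my own simplification, ignoring rsvd’s oversampling, scaling of the random test matrix, and the final small projected SVD step:

# Bare-bones subspace (orthogonal) iteration for the leading right singular subspace
subspace_iteration <- function(x, b=5, iters=10)
{
  Q <- qr.Q(qr(matrix(rnorm(ncol(x) * b), ncol(x))))   # random starting block
  for (i in seq_len(iters))
    Q <- qr.Q(qr(crossprod(x, x %*% Q)))               # apply t(x) %*% x, re-orthogonalize
  Q                                                    # approximate basis vectors
}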

I will informally comment on some specific advantages and drawbacks of each approach in the following sections.

Advantages of, and problems with, subspace iteration

Block methods like subspace iteration have a lot of advantages, some already outlined in your JSS paper. In particular, they make excellent use of vectorized arithmetic within modern CPUs (so-called level 3 BLAS operations). That makes them fast for dense problems (note, though, that this is not really an advantage for sparse problems).

Another big advantage–in my opinion their biggest–is that they can easily find subspaces associated with non-distinct (clustered) singular values. See the next section for an example of this.

The main disadvantage of power iteration methods like rsvd is slow convergence for many problems, unless a large initial random subspace is used, because of the simple polynomial used to formulate the solution. The “tiger” example above is actually a good example of that.

Advantages of, and problems with, Lanczos

Convergence is typically really fast–we know this thanks to Kaniel and Paige.

Lanczos doesn’t use level 3 BLAS, so it does not take best advantage of available CPU hardware for dense problems. This is not really an issue for sparse matrix problems. Also, the performance disadvantage is somewhat made up for by the math–that is, more efficient use of each flop thanks to the fast Lanczos convergence properties.

Standard Lanczos methods can only approximate (sometimes poorly) subspaces associated with non-distinct singular values, because the method assumes that the singular values are distinct. In my opinion, this is their biggest problem. In practice, irlba runs into the most trouble with nearly non-distinct singular values. Let’s investigate with the “tprolate” example from the irlba package, contributed by Giuseppe Rodriguez. This example exhibits two clusters of singular values, near one and near zero; they are distinct, but only barely so.

library(rsvd)
library(irlba)

# Symmetric Toeplitz "prolate" matrix: its singular values cluster near one and zero
tprolate <- function(n, w=0.25)
{
  a <- rep(0, n)
  a[1] <- 2 * w
  a[2:n] <- sin( 2 * pi * w * (1:(n-1)) ) / ( pi * (1:(n-1)) )
  toeplitz(a)
}

x <- tprolate(512)
s <- svd(x)
plot(s$d)   # two clusters of singular values, near one and near zero

set.seed(1)
l <- irlba(x, 20)
q <- rsvd(x, 20)
data.frame(matrix_products=c(l$mprod, 1),
           error=sqrt(c(crossprod(l$d - s$d[1:20]), crossprod(q$d - s$d[1:20]))),
           row.names=c("IRLBA", "Randomized SVD"))
##                matrix_products        error
## IRLBA                      110 9.686744e-04
## Randomized SVD               1 5.931187e-07

This example is really bad for irlba and really good for rsvd. Despite using more than 100 times as many matrix products, irlba’s solution is substantially less accurate than rsvd’s!

This kind of problem is definitely a sweet spot for rsvd.

Notes on block Lanczos

Block Lanczos eigenvalue methods have been well studied. They are summarized, for instance, in section 9.2.6 of Golub and Van Loan, and their convergence was extensively analyzed by Saad and others. Baglama and Reichel adapted them to the projected SVD problem in 2006.

It seems like they would be the answer to several potential issues with irlba–they can take advantage of fast SIMD/vectorized CPU features and overcome convergence issues for clustered singular values, while retaining the fast Lanczos convergence properties. Best of both worlds, so to speak.

But it’s not that simple. Increasing the block size increases the working storage requirements of the method. Even allowing for the extremely efficient internal operation of Lanczos, thanks to its use of orthogonal polynomials, the added storage can be a serious problem for large-scale computations.

We can reduce memory requirements by stopping the Krylov subspace iteration early and restarting–after all, irlba has a clever implicitly-shifted restarting technique baked in anyway. But this compromise slows down convergence. In the extreme case of restarting after one step, it reduces to rsvd’s power iteration! (And this is less extreme than it seems, since ideally the block size would be larger than the dimension of the subspace associated with the clustered singular values.)

Less theoretically, I have found experimentally that the block IRL methods from Baglama and Reichel generally perform very poorly compared to irlba. (They perform great for a few specific problems, but badly in general.)

The upshot is that I am not convinced that block Lanczos is the best way to improve irlba.

Where to go from here

I have not thought carefully about this yet, but I think it might be interesting to combine the rsvd and irlba ideas in a hybrid method–perhaps starting with irlba and finishing with rsvd, using the approximate subspace from irlba as a starting subspace for rsvd instead of random vectors, or vice versa. Or perhaps even an alternating method. This seems worth investigating–what do you think?
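
Just to make the idea concrete, here is a very rough sketch of one possibility: a single rsvd-style projection step, hand-rolled because (as far as I know) rsvd() itself does not accept a starting subspace argument, seeded with irlba’s estimated right singular subspace instead of random vectors:

# Sketch: one randomized-SVD-style projection step seeded by an existing subspace
hybrid_step <- function(x, V0)
{
  Q  <- qr.Q(qr(x %*% V0))    # orthonormal basis for the sampled range of x
  B  <- crossprod(Q, x)       # small projected matrix t(Q) %*% x
  sb <- svd(B)
  list(u=Q %*% sb$u, d=sb$d, v=sb$v)
}
# For example: h <- hybrid_step(tiger, l$v), using l from the irlba run above.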