Think Small

16-September, 2015

Think about solving your problems before trying to implement them

Think about solving your problems before trying to implement solutions

especially on distributed systems

Sometimes, evidently huge problems are smaller than they seem.

Sometimes, "big data" systems suck.

information << data

—Michael Wu

The cancer genome atlas

Big data…it's a nonsense term.

—Bill Cleveland, Interface 2015

Examples

Our conclusion is not that the systems … are obviously bad, but rather that there is not yet much evidence that they are especially good.

—Frank McSherry on Spark, GraphX, Giraph, GraphLab

Page rank

Twitter_rv

System	CPU Cores	Time (s)
Spark/GraphX	128	419
Giraph	128	596
Graphlab	128	249
McSherry's Macbook Pro	1	110

Follower network example, ~41M vertices 1.4B edges

This table shows 20 iterations of page rank computed by Frank McSherry on his laptop, compared to some big distribtued systems.

Sparse SVD biclustering

Innovative algorithm by Lee, Shen, Huang, and Marron

and a faithful R package implementation (s4vd) by Martin Sill and Sebastian Kaiser

Findings

The method requires only a low-rank SVD in each iteration
An inner matrix optimization problem could be (exactly) projected into a 1-d problem

These knock an \(m^3\) problem down to about an order \(m\) problem or so, where \(m\) is the number of data observations

Example

1,000 \(\times\) 10,000 example matrix timed on a laptop

library(s4vd)
set.seed(1)
x = matrix(rnorm(1000 * 10000), nrow=1000)
i = seq(from=1, to=nrow(x), by=2)
j = seq(from=1, to=ncol(x), by=2)
x[i,j] = rnorm(length(i)*length(j), mean=2, sd=0.5)
cl = biclust(x, method=BCssvd, K=1)

Original routines	3,522s
Reformulated routines	20s

Network centrality

A tiny part of the Bitcoin transaction graph

Viewed as an adjacency matrix \(A\):

Network centrality

It's easy to see that

\[ \left( A^k \right)_{i,j} \]

counts the number of paths of length \(k\) between vertices \(i\) and \(j\).

Network centrality

\[ \left(A + A^2 + A^3 + \cdots \right)_{i,j} \]

counts the number of paths of all lengths between vertices \(i\) and \(j\).

Network centrality

One interesting measure of centrality de-emphasizes longer paths

\[ I + 1 A + \frac{1}{2!}A^2 + \cdots = \exp(A). \]

Can be expensive to compute.

Network centrality

Rodriguez et al. show that the top \(k\) most central network nodes are often found in low-dimensional subspaces related to the SVD.

And, they have a simple method that tells us when we've got a subspace guaranteed to contain the top \(k\) most central nodes.

Bitcoin network centrality on a Chromebook

load("bitcoin_from_to_graph.rdata")
t1 = proc.time()
x = topm(B,q=2,tol=0.1,m_b=5)
proc.time() - t1

# user system elapsed
# 86.970 24.350 111.605

Compute the top 5 most central nodes for the entire Bitcoin transaction graph.

Compare to Padé approximant

On a 1000 \(\times\) 1000 subset

t1 = proc.time()
ex = diag(expm(X) + expm(-X))/2
proc.time() - t1
# user system elapsed
# 151.080 0.220 151.552

order(ex,decreasing=TRUE)[1:5]
#[1] 11 25 27 29 74

t1  = proc.time()
top = topm(X, type="cent")
proc.time() - t1
# user system elapsed
# 0.555 0.010 0.565

top$hubs
#[1] 11 25 27 29 74

Google genomics and Spark (ADAM): not thinking small enough

PCA on the 1000 genomes chrom. 20

A few lines of R code. A few minutes from start to finish on a laptop.

cf. https://youtu.be/ExNxi_X4qug

PCA on the 1000 genomes chrom. 20

library(Matrix)
p = pipe("zcat ALL.chr20.phase3_....genotypes.vcf.gz  
         | sed /^#/d  | cut  -f '10-' | parser | cut -f '1-2'")
x = read.table(p, colClasses=c("integer","integer"), fill=TRUE)
chr20 = sparseMatrix(i=x[,2], j=x[,1], x=1.0)

print(dim(chr20))
# [1]    2504 1812841

library(irlba)
cm = colMeans(chr20)
p = irlba(chr20, nv=3, nu=3, dU=rep(1,nrow(chr20)), ds=1, dV=cm)

library(threejs)
scatterplot3js(p$u)

AFR AMR EAS EUR SAS

What about more chromosomes?

Not too big a deal. Easy to supply a custom, column-wise chunked matrix-vector product.

Overlap join

The overlap join finds overlapping regions between two tables of coordinates.

B overlaps with region 2 and 3.
C overlaps with region 4.

1000 genomes on a cheap PC

Find overlaps between all 9,362,139 on chromosomes 7 and 8 against a table of 1,989 ranges.

R data load from file	260s
R IRanges find overlaps	8s
R data.table find overlaps	9s
Vertica data load from file	303s
Vertica find overlaps	28s

Linear mixed models

FastLMM

Nifty recent work of Lippert, Listgarten, Liu, Kadie, Davidson, Heckerman at MSFT Research

Beautifully simple change of basis that reduces normally expensive large-scale mixed effects models to simpler (diagonal) form

Can be combined naturally with low-dimensional subspace projection

Practical thoughts

Projection methods turn “ big data ” into data

Changing bases can exploit underlying structure to simplify problems

IRLBA

State of the art partial SVD by Baglama and Reichel
Competetive with Martinsson, Rokhlin, et al. methods.
On CRAN, development on GitHub recent improvements for PCA, partial symmetric eigensystems, …

References

https://community.lithium.com/t5/Science-of-Social-blog/The-Big-Data-Fallacy-Data-Information/ba-p/59250
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
https://github.com/bwlewis/s4vd (sparse svd biclustering)
http://compbio.cs.uic.edu/data/bitcoin/ (bitcoin data)
https://github.com/googlegenomics/spark-examples
http://htmlwidgets.org (the interactive R plots)
https://github.com/bwlewis/IRL (irlba method, also on CRAN)
https://github.com/bwlewis/1000_genomes_examples
http://bwlewis.github.io/1000_genomes_examples/ranges.html
http://www.nature.com/nmeth/journal/v8/n10/abs/nmeth.1681.html (Fast LMM)
http://illposed.net