16-September, 2015

Think about solving your problems before trying to implement them

Think about solving your problems before trying to implement solutions


especially on distributed systems

Sometimes, evidently huge problems are smaller than they seem.

Sometimes, evidently huge problems are smaller than they seem.


Sometimes, "big data" systems suck.



information << data

—Michael Wu

The cancer genome atlas



Big data…it's a nonsense term.

—Bill Cleveland, Interface 2015

Examples

Our conclusion is not that the systems … are obviously bad, but rather that there is not yet much evidence that they are especially good.

—Frank McSherry on Spark, GraphX, Giraph, GraphLab

Page rank

Twitter_rv

System CPU Cores Time (s)
Spark/GraphX 128 419
Giraph 128 596
Graphlab 128 249
McSherry's Macbook Pro 1 110

Follower network example, ~41M vertices 1.4B edges

This table shows 20 iterations of page rank computed by Frank McSherry on his laptop, compared to some big distribtued systems.

Sparse SVD biclustering



Innovative algorithm by Lee, Shen, Huang, and Marron

and a faithful R package implementation (s4vd) by Martin Sill and Sebastian Kaiser

Findings

  • The method requires only a low-rank SVD in each iteration
  • An inner matrix optimization problem could be (exactly) projected into a 1-d problem

These knock an \(m^3\) problem down to about an order \(m\) problem or so, where \(m\) is the number of data observations

Example

1,000 \(\times\) 10,000 example matrix timed on a laptop

library(s4vd)
set.seed(1)
x = matrix(rnorm(1000 * 10000), nrow=1000)
i = seq(from=1, to=nrow(x), by=2)
j = seq(from=1, to=ncol(x), by=2)
x[i,j] = rnorm(length(i)*length(j), mean=2, sd=0.5)
cl = biclust(x, method=BCssvd, K=1)
Original routines 3,522s
Reformulated routines 20s

Network centrality

A tiny part of the Bitcoin transaction graph

Viewed as an adjacency matrix \(A\):

Network centrality

It's easy to see that

\[ \left( A^k \right)_{i,j} \]

counts the number of paths of length \(k\) between vertices \(i\) and \(j\).

Network centrality

\[ \left(A + A^2 + A^3 + \cdots \right)_{i,j} \]

counts the number of paths of all lengths between vertices \(i\) and \(j\).

Network centrality

One interesting measure of centrality de-emphasizes longer paths

\[ I + 1 A + \frac{1}{2!}A^2 + \cdots = \exp(A). \]

Can be expensive to compute.

Network centrality

Rodriguez et al. show that the top \(k\) most central network nodes are often found in low-dimensional subspaces related to the SVD.

And, they have a simple method that tells us when we've got a subspace guaranteed to contain the top \(k\) most central nodes.

Bitcoin network centrality on a Chromebook

load("bitcoin_from_to_graph.rdata")
t1 = proc.time()
x = topm(B,q=2,tol=0.1,m_b=5)
proc.time() - t1

# user system elapsed
# 86.970 24.350 111.605

Compute the top 5 most central nodes for the entire Bitcoin transaction graph.

Compare to Padé approximant

On a 1000 \(\times\) 1000 subset

t1 = proc.time()
ex = diag(expm(X) + expm(-X))/2
proc.time() - t1
# user system elapsed
# 151.080 0.220 151.552

order(ex,decreasing=TRUE)[1:5]
#[1] 11 25 27 29 74
t1  = proc.time()
top = topm(X, type="cent")
proc.time() - t1
# user system elapsed
# 0.555 0.010 0.565

top$hubs
#[1] 11 25 27 29 74

Google genomics and Spark (ADAM): not thinking small enough

PCA on the 1000 genomes chrom. 20

PCA on the 1000 genomes chrom. 20

library(Matrix)
p = pipe("zcat ALL.chr20.phase3_....genotypes.vcf.gz  
         | sed /^#/d  | cut  -f '10-' | parser | cut -f '1-2'")
x = read.table(p, colClasses=c("integer","integer"), fill=TRUE)
chr20 = sparseMatrix(i=x[,2], j=x[,1], x=1.0)

print(dim(chr20))
# [1]    2504 1812841

library(irlba)
cm = colMeans(chr20)
p = irlba(chr20, nv=3, nu=3, dU=rep(1,nrow(chr20)), ds=1, dV=cm)

library(threejs)
scatterplot3js(p$u)

AFR   AMR   EAS   EUR   SAS  

What about more chromosomes?

Not too big a deal. Easy to supply a custom, column-wise chunked matrix-vector product.

Overlap join

The overlap join finds overlapping regions between two tables of coordinates.

The overlap join finds overlapping regions between two tables of coordinates.


B overlaps with region 2 and 3.
C overlaps with region 4.

1000 genomes on a cheap PC

Find overlaps between all 9,362,139 on chromosomes 7 and 8 against a table of 1,989 ranges.

R data load from file 260s
R IRanges find overlaps 8s
R data.table find overlaps 9s
Vertica data load from file 303s
Vertica find overlaps 28s

Linear mixed models

FastLMM

Nifty recent work of Lippert, Listgarten, Liu, Kadie, Davidson, Heckerman at MSFT Research

Beautifully simple change of basis that reduces normally expensive large-scale mixed effects models to simpler (diagonal) form

Can be combined naturally with low-dimensional subspace projection

Practical thoughts


Projection methods turn “ big data ” into data


Projection methods turn “ big data ” into data


Changing bases can exploit underlying structure to simplify problems

IRLBA

  • State of the art partial SVD by Baglama and Reichel
  • Competetive with Martinsson, Rokhlin, et al. methods.
  • On CRAN, development on GitHub recent improvements for PCA, partial symmetric eigensystems, …

References