6/10/2015
Think about solving your problems before trying to implement them
System | CPU Cores | Time (s) |
---|---|---|
Spark/GraphX | 128 | 419 |
Giraph | 128 | 596 |
GraphLab | 128 | 249 |
McSherry's MacBook Pro | 1 | 110 |
Follower-network example: ~41M vertices, ~1.4B edges
This table shows 20 iterations of PageRank computed by Frank McSherry on his laptop, compared with some big distributed systems.
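The per-iteration work is just a sparse matrix-vector product. A minimal sketch of the power-iteration form in R (the toy edge list and names are illustrative only, not McSherry's code):

library(Matrix)
# Tiny made-up edge list; in practice the rows would hold the follower graph's edges
edges = cbind(c(1, 2, 3, 3, 4), c(2, 3, 1, 4, 1))
n = max(edges)
A = sparseMatrix(i=edges[,1], j=edges[,2], x=1, dims=c(n, n))
# Column-stochastic transition matrix (row-normalize, then transpose);
# dangling nodes are handled crudely here
outdeg = pmax(rowSums(A), 1)
P = t(Diagonal(x=1/outdeg) %*% A)
pagerank = function(P, iters=20, d=0.85) {
  r = rep(1/ncol(P), ncol(P))
  for (k in seq_len(iters)) r = d * as.numeric(P %*% r) + (1 - d)/ncol(P)
  r
}
pagerank(P)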
Innovative algorithm by Lee, Shen, Huang, and Marron
and a faithful R package implementation (s4vd) by Martin Sill and Sebastian Kaiser
These knock an \(O(m^3)\) problem down to roughly an \(O(m)\) one, where \(m\) is the number of data observations
A 1,000 \(\times\) 10,000 example matrix, timed on a MacBook Pro laptop:
library(s4vd)
set.seed(1)
# Simulate a 1,000 x 10,000 matrix with an embedded bicluster
x = matrix(rnorm(1000 * 10000), nrow=1000)
i = seq(from=1, to=nrow(x), by=2)
j = seq(from=1, to=ncol(x), by=2)
x[i, j] = rnorm(length(i) * length(j), mean=2, sd=0.5)
# Find one bicluster with the sparse SVD (SSVD) method
cl = biclust(x, method=BCssvd, K=1)
Routine | Time (s) |
---|---|
Original routines | 3,522 |
Reformulated routines | 20 |
Let \(A\) be the graph's adjacency matrix. It's easy to see that
\[ \left( A^k \right)_{i,j} \]
counts the number of paths of length \(k\) between vertices \(i\) and \(j\).
\[ \left(A + A^2 + A^3 + \cdots \right)_{i,j} \]
counts the number of paths of all lengths between vertices \(i\) and \(j\).
One interesting measure of centrality de-emphasizes longer paths by weighting length-\(k\) terms by \(1/k!\):
\[ I + \frac{1}{1!}A + \frac{1}{2!}A^2 + \cdots = \exp(A). \]
\(\exp(A)\) can be expensive to compute for large networks.
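A quick sanity check of the path-counting interpretation and the exponential score on a made-up 5-node graph (illustrative only; expm() is from the expm package):

library(expm)
# Tiny undirected graph, made up for illustration
A = matrix(0, 5, 5)
edges = rbind(c(1,2), c(2,3), c(3,4), c(4,5), c(1,5), c(2,5))
A[edges] = 1
A = A + t(A)                 # symmetric adjacency matrix
(A %*% A)[1, 3]              # paths of length 2 between vertices 1 and 3
ec = diag(expm(A))           # exponential-based centrality score for each vertex
order(ec, decreasing=TRUE)   # vertices ranked from most to least central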
Rodriguez et al. show that the top \(k\) most central network nodes are often found in low-dimensional subspaces related to the SVD.
And, they have a simple method that tells us when we've got a subspace guaranteed to contain the top \(k\) most central nodes.
load("bitcoin_from_to_graph.rdata") t1 = proc.time() x = topm(B,q=2,tol=0.1,m_b=5) proc.time() - t1 # user system elapsed # 86.970 24.350 111.605
Compute the top 5 most central nodes for the entire Bitcoin transaction graph.
t1 = proc.time()
# Exact centrality scores via dense matrix exponentials
ex = diag(expm(X) + expm(-X)) / 2
proc.time() - t1
#    user  system elapsed
# 151.080   0.220 151.552
order(ex, decreasing=TRUE)[1:5]
# [1] 11 25 27 29 74
t1 = proc.time()
# Same ranking via the low-dimensional subspace method
top = topm(X, type="cent")
proc.time() - t1
#    user  system elapsed
#   0.555   0.010   0.565
top$hubs
# [1] 11 25 27 29 74
Variant data from the 1000 Genomes Project are often used as an example of a big data analysis...
I'm not saying this is profound or even interesting. Just that it's easy. A few minutes from start to finish on a laptop.
library(Matrix)
# Stream the compressed VCF through a shell pipeline that drops the header,
# keeps the genotype columns, and emits index pairs for the nonzero genotypes
p = pipe("zcat ALL.chr20.phase3_....genotypes.vcf.gz | sed '/^#/d' | cut -f '10-' | parser | cut -f '1-2'")
x = read.table(p, colClasses=c("integer", "integer"), fill=TRUE)
chr20 = sparseMatrix(i=x[,2], j=x[,1], x=1.0)
print(dim(chr20))
# [1]    2504 1812841

library(irlba)
# Partial SVD with the column means subtracted implicitly (no dense centering)
cm = colMeans(chr20)
p = irlba(chr20, nv=3, nu=3, dU=rep(1, nrow(chr20)), ds=1, dV=cm)

library(threejs)
scatterplot3js(p$u)   # 3-D scatterplot of the samples' principal components
Not too big a deal: it's easy to supply a custom, column-wise chunked matrix-vector product.
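For instance, a minimal sketch of a column-wise chunked product (the function name and chunk size are mine; wiring it into the SVD routine is omitted):

# Compute A %*% v one block of columns at a time; only one block needs to be
# in memory at once (in practice each block would be read from disk)
chunked_matvec = function(A, v, chunk_size=10000) {
  y = numeric(nrow(A))
  for (s in seq(1, ncol(A), by=chunk_size)) {
    cols = s:min(s + chunk_size - 1, ncol(A))
    y = y + as.numeric(A[, cols, drop=FALSE] %*% v[cols])
  }
  y
}

The transposed product needed by the Lanczos iteration follows the same pattern.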
Nifty recent work by Lippert, Listgarten, Liu, Kadie, Davidson, and Heckerman at Microsoft Research
Beautifully simple change of basis that reduces normally expensive large-scale mixed-effects models to a simpler (diagonal) form
Can be combined naturally with low-dimensional subspace projection
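A rough sketch of the rotation at the heart of that trick (simulated data and names are mine, not the authors' code): with kinship matrix \(K = U S U^\top\), rotating the phenotype and covariates by \(U^\top\) makes the model covariance \(\sigma_g^2 S + \sigma_e^2 I\) diagonal, so each evaluation reduces to elementwise operations.

set.seed(1)
n = 500
G = matrix(rnorm(n * 200), n)       # simulated genotypes (illustrative only)
K = tcrossprod(scale(G)) / ncol(G)  # kinship (genetic relatedness) matrix
e = eigen(K, symmetric=TRUE)        # one-time decomposition K = U S U'
U = e$vectors; S = e$values
y = rnorm(n)                        # simulated phenotype
X = cbind(1, rnorm(n))              # covariates
yr = crossprod(U, y)                # rotate once...
Xr = crossprod(U, X)
# ...then every GLS solve uses the diagonal covariance sg2*S + se2
gls_beta = function(sg2, se2) {
  w = 1 / (sg2 * S + se2)
  solve(crossprod(Xr, w * Xr), crossprod(Xr, w * yr))
}
gls_beta(0.5, 0.5)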
Projection methods turn “big data” into data
Changing bases can exploit underlying structure to simplify problems