16 September 2015
Think about solving your problems before trying to implement solutions
Sometimes, apparently huge problems are smaller than they seem.
| System | CPU cores | Time (s) |
|---|---|---|
| Spark/GraphX | 128 | 419 |
| Giraph | 128 | 596 |
| GraphLab | 128 | 249 |
| McSherry's MacBook Pro | 1 | 110 |

Follower network example: ~41M vertices, 1.4B edges.
This table shows 20 iterations of PageRank computed by Frank McSherry on his laptop, compared with some big distributed systems.
Innovative algorithm by Lee, Shen, Huang, and Marron
and a faithful R package implementation (s4vd) by Martin Sill and Sebastian Kaiser
These knock an order \(m^3\) problem down to roughly order \(m\), where \(m\) is the number of data observations
1,000 \(\times\) 10,000 example matrix timed on a laptop
```r
library(s4vd)
set.seed(1)
# Simulate a 1,000 x 10,000 noise matrix with a planted bicluster
x = matrix(rnorm(1000 * 10000), nrow=1000)
i = seq(from=1, to=nrow(x), by=2)
j = seq(from=1, to=ncol(x), by=2)
x[i, j] = rnorm(length(i) * length(j), mean=2, sd=0.5)
# Recover one bicluster with the sparse SVD method
cl = biclust(x, method=BCssvd, K=1)
```
| Implementation | Time (s) |
|---|---|
| Original routines | 3,522 |
| Reformulated routines | 20 |
For a graph adjacency matrix \(A\), it's easy to see that
\[ \left( A^k \right)_{i,j} \]
counts the number of paths of length \(k\) between vertices \(i\) and \(j\).
\[ \left(A + A^2 + A^3 + \cdots \right)_{i,j} \]
counts the number of paths of all lengths between vertices \(i\) and \(j\).
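A quick sanity check (a toy example added here, not from the original material) on the path graph 1-2-3:

```r
# Adjacency matrix of the path graph 1 - 2 - 3
A = rbind(c(0, 1, 0),
          c(1, 0, 1),
          c(0, 1, 0))
(A %*% A)[1, 3]  # exactly one path of length 2 from vertex 1 to vertex 3 (via 2)
# [1] 1
diag(A %*% A)    # length-2 round trips from each vertex, i.e. its degree
# [1] 1 2 1
```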
One interesting measure of centrality de-emphasizes longer paths:
\[ I + A + \frac{1}{2!}A^2 + \frac{1}{3!}A^3 + \cdots = \exp(A). \]
Can be expensive to compute.
Rodriguez et al. show that the top \(k\) most central network nodes are often found in low-dimensional subspaces related to the SVD.
And, they have a simple method that tells us when we've got a subspace guaranteed to contain the top \(k\) most central nodes.
load("bitcoin_from_to_graph.rdata") t1 = proc.time() x = topm(B,q=2,tol=0.1,m_b=5) proc.time() - t1 # user system elapsed # 86.970 24.350 111.605
Compute the top 5 most central nodes for the entire Bitcoin transaction graph.
On a 1,000 \(\times\) 1,000 subset:
```r
library(Matrix)  # expm() is provided by the Matrix package (the expm package also works)
t1 = proc.time()
ex = diag(expm(X) + expm(-X)) / 2
proc.time() - t1
#    user  system elapsed
# 151.080   0.220 151.552
order(ex, decreasing=TRUE)[1:5]
# [1] 11 25 27 29 74
```
```r
t1 = proc.time()
top = topm(X, type="cent")
proc.time() - t1
#  user  system elapsed
#  0.555   0.010   0.565
top$hubs
# [1] 11 25 27 29 74
```
A few lines of R code. A few minutes from start to finish on a laptop.
```r
library(Matrix)
# Stream the 1000 Genomes chromosome 20 VCF through a shell pipeline
p = pipe("zcat ALL.chr20.phase3_....genotypes.vcf.gz | sed /^#/d | cut -f '10-' | parser | cut -f '1-2'")
x = read.table(p, colClasses=c("integer", "integer"), fill=TRUE)
# Assemble a sparse sample-by-variant matrix
chr20 = sparseMatrix(i=x[,2], j=x[,1], x=1.0)
print(dim(chr20))
# [1]    2504 1812841
library(irlba)
# Truncated SVD of the implicitly column-centered matrix
cm = colMeans(chr20)
p = irlba(chr20, nv=3, nu=3, dU=rep(1, nrow(chr20)), ds=1, dV=cm)
library(threejs)
scatterplot3js(p$u)
```
Not too big a deal: it's easy to supply a custom, column-wise chunked matrix-vector product, sketched below.
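Here is a minimal sketch of that idea (the function name and chunking scheme are illustrative, not from the original talk): split the matrix into column blocks, multiply each block by the matching slice of the vector, and sum the partial results.

```r
# Illustrative sketch: compute A %*% v when A is stored as a list of
# column blocks, e.g. chunks read from disk one at a time.
chunked_matvec = function(blocks, v) {
  ends = cumsum(vapply(blocks, ncol, integer(1)))
  starts = c(1, head(ends, -1) + 1)
  out = 0
  for (k in seq_along(blocks)) {
    out = out + blocks[[k]] %*% v[starts[k]:ends[k]]
  }
  out
}

# Quick check against the ordinary dense product
A = matrix(rnorm(20), nrow=4)
v = rnorm(5)
blocks = list(A[, 1:2], A[, 3:5])
all.equal(as.vector(chunked_matvec(blocks, v)), as.vector(A %*% v))
# [1] TRUE
```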
The overlap join finds overlapping regions between two tables of coordinates.
(Figure: example ranges. Range B overlaps regions 2 and 3; range C overlaps region 4.)
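To make the operation concrete, here is a small sketch using data.table's foverlaps() (toy coordinates chosen to match the figure, not the benchmark data):

```r
library(data.table)
# Query ranges A, B, C and a lookup table of numbered regions
query = data.table(id = c("A", "B", "C"),
                   start = c(1, 5, 12), end = c(4, 9, 15))
regions = data.table(region = 1:4,
                     start = c(2, 6, 8, 13), end = c(3, 7, 10, 14))
setkey(regions, start, end)  # foverlaps requires the lookup table to be keyed
foverlaps(query, regions, type = "any", nomatch = 0L)
# One row per overlapping pair: B matches regions 2 and 3, C matches region 4
```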
Find overlaps between all 9,362,139 ranges on chromosomes 7 and 8 and a table of 1,989 ranges.
| Step | Time (s) |
|---|---|
| R data load from file | 260 |
| R IRanges find overlaps | 8 |
| R data.table find overlaps | 9 |
| Vertica data load from file | 303 |
| Vertica find overlaps | 28 |
Nifty recent work by Lippert, Listgarten, Liu, Kadie, Davidson, and Heckerman at Microsoft Research
A beautifully simple change of basis reduces normally expensive large-scale mixed-effects models to a simpler (diagonal) form (sketched below)
Can be combined naturally with low-dimensional subspace projection
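Here is a rough sketch of the diagonalizing idea in generic notation (an assumed setup with fixed variance components, not the authors' code): with model covariance \(V = \sigma_g^2 K + \sigma_e^2 I\) and eigendecomposition \(K = U S U^\top\), rotating the data by \(U^\top\) makes \(V\) diagonal, so after a one-time decomposition each generalized least squares solve is cheap.

```r
# Rough sketch (generic notation): y = X b + g + e, cov(g) = sg2 * K,
# cov(e) = se2 * I, so cov(y) = V = sg2 * K + se2 * I.
set.seed(1)
n = 500
Z = matrix(rnorm(n * 100), n)
K = tcrossprod(Z) / 100            # an example kinship-like matrix
ev = eigen(K, symmetric = TRUE)    # one-time cost: K = U diag(s) U'
U = ev$vectors
s = ev$values
X = cbind(1, rnorm(n))
y = rnorm(n)
ry = crossprod(U, y)               # rotate the data once: U' y
rX = crossprod(U, X)               # U' X
sg2 = 0.5; se2 = 1.0               # assumed (given) variance components
w = 1 / (sg2 * s + se2)            # V is diagonal in the rotated basis
# GLS estimate b = (X' V^-1 X)^-1 X' V^-1 y, now linear-time per evaluation
b = solve(crossprod(rX, w * rX), crossprod(rX, w * ry))
```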
Projection methods turn “big data” into data
Changing bases can exploit underlying structure to simplify problems