We're swimming in data these days.
18 September 2015
Data is easier to obtain than ever.
But the ratio of munging to math is still usually quite high.
We can scrape the morel sighting postings to get approximate locations and dates.
But what else?
[temp, spectra, precip] * x ~ sightings
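For illustration only, here is a minimal R sketch of fitting a model of this shape; the morels data frame below is entirely made up, standing in for the scraped sightings joined with weather covariates:

# Made-up stand-in for the scraped sightings joined with weather data.
set.seed(1)
morels = data.frame(temp = runif(100, 5, 20),
                    spectra = runif(100),
                    precip = runif(100, 0, 50))
morels$sightings = rpois(100, exp(0.05*morels$temp + 0.02*morels$precip))

# A simple linear model along the lines of [temp, spectra, precip] * x ~ sightings.
fit = lm(sightings ~ temp + spectra + precip, data = morels)
summary(fit)

# Predicted sightings for one new set of covariate values.
predict(fit, newdata = data.frame(temp = 12, spectra = 0.4, precip = 30))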
Our fanciful morel model prediction for March 6 of this year (2015).
It's kind of like Zillow for foragers.
Maybe this is not such a great model.
But I'm not the only one making silly predictions.
Think about solving problems before trying to implement solutions.
Sometimes (often?), apparently huge problems are smaller than they seem.
Lots of interesting information in low-dimensional subspaces!
Petabytes (raw image scans)
Terabytes (aligned reads)
Gigabytes (variants)
Kilobytes (3-d projection)
library(Matrix)

# Stream the chromosome 20 VCF through shell tools: drop the header lines, keep
# the genotype columns, and let a custom "parser" program emit two integer index
# columns locating the non-zero entries.
p = pipe("zcat ALL.chr20.phase3_....genotypes.vcf.gz | sed /^#/d | cut -f '10-' | parser | cut -f '1-2'")
x = read.table(p, colClasses=c("integer","integer"), fill=TRUE)

# Sparse sample-by-variant indicator matrix.
chr20 = sparseMatrix(i=x[,2], j=x[,1], x=1.0)
print(dim(chr20))
# [1]    2504 1812841

# Truncated SVD with irlba; dU/ds/dV pass the column means as an implicit
# rank-1 centering term, so chr20 stays sparse.
library(irlba)
cm = colMeans(chr20)
p = irlba(chr20, nv=3, nu=3, dU=rep(1,nrow(chr20)), ds=1, dV=cm)

# Interactive 3-d scatterplot of the samples projected onto the top subspace.
library(threejs)
scatterplot3js(p$u)
A tiny part of the Bitcoin transaction network.
If \(A\) is the network's adjacency matrix, it's easy to see that
\[ \left( A^k \right)_{i,j} \]
counts the number of walks of length \(k\) between vertices \(i\) and \(j\).
\[ \left(A + A^2 + A^3 + \cdots \right)_{i,j} \]
counts the number of walks of all lengths between vertices \(i\) and \(j\).
One interesting measure of centrality de-emphasizes longer walks:
\[ I + A + \frac{1}{2!}A^2 + \cdots = \exp(A). \]
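To make the formula concrete, here is a toy R sketch (not part of the Bitcoin analysis) that computes the diagonal of \(\exp(A)\) for a small made-up graph and ranks its vertices by it:

library(Matrix)   # sparseMatrix() and expm()

# Made-up 5-node undirected graph, for illustration only.
i = c(1, 1, 2, 3, 4)
j = c(2, 3, 3, 4, 5)
A = sparseMatrix(i = c(i, j), j = c(j, i), x = 1, dims = c(5, 5))

# Exponential centrality: diag(exp(A)) weights short walks most heavily.
ex = diag(expm(A))
order(ex, decreasing = TRUE)   # vertices ranked from most to least central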
Seemingly expensive to compute for big networks!
The top \(k\) most central network nodes are often found in low-dimensional subspaces.
load("bitcoin_from_to_graph.rdata") t1 = proc.time() x = topm(B,q=2,tol=0.1,m_b=5) proc.time() - t1 # user system elapsed # 86.970 24.350 111.605
Compute the top 5 most central nodes for the entire Bitcoin transaction graph.
On a 1000 \(\times\) 1000 subset
# Brute force on the 1000 x 1000 subset X: full matrix exponentials.
t1 = proc.time()
ex = diag(expm(X) + expm(-X))/2
proc.time() - t1
#    user  system elapsed
# 151.080   0.220 151.552

# The five most central vertices.
order(ex, decreasing=TRUE)[1:5]
# [1] 11 25 27 29 74
# The same top vertices from topm(), in a fraction of the time.
t1 = proc.time()
top = topm(X, type="cent")
proc.time() - t1
#    user  system elapsed
#   0.555   0.010   0.565

top$hubs
# [1] 11 25 27 29 74
Try to avoid silly prediction results with carefully designed comparisons.
We collect tons of data because we can.
But! Reasonable comparisons usually need systematic collection to make sense.
Many thousands of patients of various genders, races, and ages, treated with either Gatifloxacin or something else. Patients were evaluated for dysglycemia after treatment.
Gatifloxacin was pulled from the market!
The goal of matching is to create a data set that looks closer to one that would result from a perfectly blocked (and possibly randomized) experiment.
Gist: for each observation in the treatment group, find a "nearby" observation in the control group, with replacement.
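One common way to do this in R is with the MatchIt package; a minimal sketch, assuming a hypothetical patients data frame with a binary treated indicator and age, gender, and race columns (names not from the original study):

library(MatchIt)

# Nearest-neighbor matching with replacement: each treated patient is paired
# with a "nearby" control on age, gender, and race.
m = matchit(treated ~ age + gender + race, data = patients,
            method = "nearest", replace = TRUE)
summary(m)               # covariate balance before and after matching
matched = match.data(m)  # matched data set for comparing dysglycemia rates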
Even after matching we see an adverse effect.