18 September 2015



We're swimming in data these days.






  • Petabytes: University of Chicago Cancer Genome Atlas repository
  • 10-25 TB/day: SEC consolidated audit trail requirements
  • 15 TB/night: Large Synoptic Survey Telescope




Data is easier to obtain than ever.




But the ratio of munging to math is still usually quite high.

Ideas for modeling morels

We can scrape posted morel sightings to get approximate locations and dates.

But what else?

The public owns satellites that almost continuously measure values like
vegetation cover, temperature, …

Building blocks of a model?

  • Web-scraped historical morel sightings
  • NASA MODIS spectral radiometry
  • NASA land surface temperature/precip


[temp, spectra, precip] * x ~ sightings
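
As a minimal sketch of what fitting such a model might look like in R, assume a hypothetical data frame morels whose rows join the scraped sightings to the satellite covariates (all of these column names are made up):

# Hypothetical columns: sighting (0/1), temp, spectra, precip
fit = glm(sighting ~ temp + spectra + precip, family = binomial, data = morels)
summary(fit)   # which covariates associate with a sighting?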




Maybe this is not such a great model.

But I'm not the only one making silly predictions.

Outline

  • How to think about (big) data
  • The world needs you to help make sense of this stuff
  • Software to know about




Think about solving problems before trying to implement solutions.

Sometimes (often?), seemingly huge problems are smaller than they appear.

One taxonomy of "data science" methods

  • Exploration
  • Prediction
  • Inference

Exploration

The 1000 Genomes Project

Lots of interesting information in low-dimensional subspaces!


  • Petabytes (raw image scans)
  • Terabytes (aligned reads)
  • Gigabytes (variants)
  • Kilobytes (3-d projection)

PCA of chromosome 20

library(Matrix)
# Stream the VCF through the shell: drop header lines, keep the genotype
# columns, and have the 'parser' helper emit (variant, sample) index pairs.
p = pipe("zcat ALL.chr20.phase3_....genotypes.vcf.gz |
          sed '/^#/d' | cut -f 10- | parser | cut -f 1-2")
x = read.table(p, colClasses=c("integer", "integer"), fill=TRUE)
# Sparse matrix: 2,504 samples (rows) by ~1.8 million variants (columns)
chr20 = sparseMatrix(i=x[, 2], j=x[, 1], x=1.0)

print(dim(chr20))
# [1]    2504 1812841

library(irlba)
cm = colMeans(chr20)
# Truncated SVD of the implicitly column-centered matrix (cm holds the column
# means): PCA without forming a dense centered copy of chr20.
p = irlba(chr20, nv=3, nu=3, dU=rep(1, nrow(chr20)), ds=1, dV=cm)

library(threejs)
scatterplot3js(p$u)   # interactive 3-d plot of the projected samples

[3-d projection colored by 1000 Genomes superpopulation: AFR, AMR, EAS, EUR, SAS]

Exploring networks


A tiny part of the Bitcoin transaction network.

Viewed as an adjacency matrix \(A\):

Network centrality

It's easy to see that

\[ \left( A^k \right)_{i,j} \]

counts the number of paths of length \(k\) between vertices \(i\) and \(j\).
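
A quick toy check in R, using a made-up three-node path graph 1-2-3:

A = matrix(c(0, 1, 0,
             1, 0, 1,
             0, 1, 0), nrow = 3, byrow = TRUE)
A %*% A   # (A^2)[1, 3] = 1: the single length-2 path 1-2-3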

Network centrality

\[ \left(A + A^2 + A^3 + \cdots \right)_{i,j} \]

counts the number of paths of all lengths between vertices \(i\) and \(j\).

Network centrality

One interesting measure of centrality de-emphasizes longer paths

\[ I + A + \frac{1}{2!}A^2 + \cdots = \exp(A). \]

Seemingly expensive to compute for big networks!

Network centrality

The top \(k\) most central network nodes are often found in low-dimensional subspaces.

Bitcoin network centrality on a Chromebook

load("bitcoin_from_to_graph.rdata")
t1 = proc.time()
x = topm(B,q=2,tol=0.1,m_b=5)
proc.time() - t1

# user system elapsed
# 86.970 24.350 111.605

That computes the top five most central nodes of the entire Bitcoin transaction graph in under two minutes.

Compare to Padé approximant

On a 1000 \(\times\) 1000 subset

library(expm)   # dense matrix exponential via Padé approximation
t1 = proc.time()
ex = diag(expm(X) + expm(-X)) / 2
proc.time() - t1
# user system elapsed
# 151.080 0.220 151.552

order(ex, decreasing=TRUE)[1:5]
#[1] 11 25 27 29 74

t1 = proc.time()
top = topm(X, type="cent")
proc.time() - t1
# user system elapsed
# 0.555 0.010 0.565

top$hubs
#[1] 11 25 27 29 74

Prediction

Try to avoid silly prediction results with

  • regularization and variable selection
  • ensemble methods
  • bootstrap estimates of coefficient distributions (see the sketch below)
  • care to avoid extrapolation and over-fitting
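
For example, a minimal sketch of bootstrapped coefficient distributions, assuming a hypothetical data frame dat with response y and predictors x1 and x2:

library(boot)
# Refit the model on resampled rows and keep the coefficients
coef_fn = function(data, idx) coef(lm(y ~ x1 + x2, data = data[idx, ]))
b = boot(dat, coef_fn, R = 1000)
apply(b$t, 2, quantile, probs = c(0.025, 0.975))   # rough coefficient intervals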

Inference in the time of data

We collect tons of data because we can.


But! Reasonable comparisons usually need systematic collection to make sense.

Montefiore Gatifloxacin Data

Many thousands of patients of varying gender, race, and age were treated with either Gatifloxacin or something else. Patients were evaluated for dysglycemia after treatment.

Linear logistic model odds ratios
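
A minimal sketch of the kind of model behind these odds ratios, assuming a hypothetical data frame patients with a dysglycemia indicator, a Gatifloxacin treatment indicator, and the demographic covariates:

fit = glm(dysglycemia ~ gatifloxacin + age + gender + race,
          family = binomial, data = patients)
exp(coef(fit))   # coefficients on the odds-ratio scale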

Gatifloxacin was pulled from the market!

Careful!

The goal of matching is to create a data set that looks closer to one that would result from a perfectly blocked (and possibly randomized) experiment.

—Gary King


Gist: for each observation in the treatment group, find a "nearby" observation in the control group, with replacement.
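
A minimal sketch of matching in R using the MatchIt package, with the same hypothetical data frame and column names as above:

library(MatchIt)
# Coarsened exact matching on the measured covariates
m = matchit(gatifloxacin ~ age + gender + race,
            data = patients, method = "cem")
matched = match.data(m)   # matched sample

The matched sample can then be re-fit with the same logistic model to produce the matched odds ratios shown next.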

After CEM matching

Matched data odds ratios

Even after matching we see an adverse effect.

Software tools

  • R
  • Python
  • Scala and other JVM things
  • Databases