18 September 2015



We're swimming in data these days.






  • Petabytes: University of Chicago Cancer Genome Atlas repository
  • 10-25 TB/day: SEC consolidated audit trail requirements
  • 15 TB/night: Large Synoptic Survey Telescope




Data is easier to obtain than ever.




But the ratio of munging to math is still usually quite high.

Ideas for modeling morels

We can scrape posted morel sightings to get approximate locations and dates.

But what else?

The public owns satellites that almost continuously measure values like
vegetation cover, temperature, …

Building blocks of a model?

  • Web-scraped historical morel sightings
  • NASA MODIS spectral radiometry
  • NASA land surface temperature/precip


[temp, spectra, precip] * x ~ sightings
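
As a minimal sketch of what fitting such a model might look like in R, assume a hypothetical data frame morels whose rows join the scraped sightings to the satellite covariates (all of these column names are made up):

# Hypothetical columns: sighting (0/1), temp, spectra, precip
fit = glm(sighting ~ temp + spectra + precip, family = binomial, data = morels)
summary(fit)   # which covariates associate with a sighting?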




Maybe this is not such a great model.

But I'm not the only one making silly predictions.

Outline

  • How to think about (big) data
  • The world needs you to help make sense of this stuff
  • Software to know about




Think about solving problems before trying to implement solutions.

Sometimes (often?), seemingly huge problems are smaller than they appear.

One taxonomy of "data science" methods

  • Exploration
  • Prediction
  • Inference

Exploration

The 1000 Genomes Project

Lots of interesting information in low-dimensional subspaces!


  • Petabytes (raw image scans)
  • Terabytes (aligned reads)
  • Gigabytes (variants)
  • Kilobytes (3-d projection)

PCA of chromosome 20

library(Matrix)
# Stream the VCF through the shell: drop header lines, keep the genotype
# columns, and have the 'parser' helper emit (variant, sample) index pairs.
p = pipe("zcat ALL.chr20.phase3_....genotypes.vcf.gz |
          sed '/^#/d' | cut -f 10- | parser | cut -f 1-2")
x = read.table(p, colClasses=c("integer", "integer"), fill=TRUE)
# Sparse matrix: 2,504 samples (rows) by ~1.8 million variants (columns)
chr20 = sparseMatrix(i=x[, 2], j=x[, 1], x=1.0)

print(dim(chr20))
# [1]    2504 1812841

library(irlba)
cm = colMeans(chr20)
# Truncated SVD of the implicitly column-centered matrix (cm holds the column
# means): PCA without forming a dense centered copy of chr20.
p = irlba(chr20, nv=3, nu=3, dU=rep(1, nrow(chr20)), ds=1, dV=cm)

library(threejs)
scatterplot3js(p$u)   # interactive 3-d plot of the projected samples

[3-d projection colored by 1000 Genomes superpopulation: AFR, AMR, EAS, EUR, SAS]

Exploring networks


A tiny part of the Bitcoin transaction network.

Viewed as an adjacency matrix \(A\):

Network centrality

It's easy to see that

\[ \left( A^k \right)_{i,j} \]

counts the number of paths of length \(k\) between vertices \(i\) and \(j\).
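
A quick toy check in R, using a made-up three-node path graph 1-2-3:

A = matrix(c(0, 1, 0,
             1, 0, 1,
             0, 1, 0), nrow = 3, byrow = TRUE)
A %*% A   # (A^2)[1, 3] = 1: the single length-2 path 1-2-3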

Network centrality

\[ \left(A + A^2 + A^3 + \cdots \right)_{i,j} \]

counts the number of paths of all lengths between vertices \(i\) and \(j\).

Network centrality

One interesting measure of centrality de-emphasizes longer paths

\[ I + A + \frac{1}{2!}A^2 + \cdots = \exp(A). \]

Seemingly expensive to compute for big networks!

Network centrality

The top \(k\) most central network nodes are often found in low-dimensional subspaces.

Bitcoin network centrality on a Chromebook

load("bitcoin_from_to_graph.rdata")
t1 = proc.time()
x = topm(B,q=2,tol=0.1,m_b=5)
proc.time() - t1

# user system elapsed
# 86.970 24.350 111.605

That computes the top five most central nodes of the entire Bitcoin transaction graph in under two minutes.

Compare to Padé approximant

On a 1000 \(\times\) 1000 subset

library(expm)   # dense matrix exponential via Padé approximation
t1 = proc.time()
ex = diag(expm(X) + expm(-X)) / 2
proc.time() - t1
# user system elapsed
# 151.080 0.220 151.552

order(ex, decreasing=TRUE)[1:5]
#[1] 11 25 27 29 74

t1 = proc.time()
top = topm(X, type="cent")
proc.time() - t1
# user system elapsed
# 0.555 0.010 0.565

top$hubs
#[1] 11 25 27 29 74

Prediction

Try to avoid silly prediction results with

  • regularization and variable selection
  • ensemble methods
  • bootstrap estimates of coefficient distributions (see the sketch below)
  • care to avoid extrapolation and over-fitting
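
For example, a minimal sketch of bootstrapped coefficient distributions, assuming a hypothetical data frame dat with response y and predictors x1 and x2:

library(boot)
# Refit the model on resampled rows and keep the coefficients
coef_fn = function(data, idx) coef(lm(y ~ x1 + x2, data = data[idx, ]))
b = boot(dat, coef_fn, R = 1000)
apply(b$t, 2, quantile, probs = c(0.025, 0.975))   # rough coefficient intervals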

Inference in the time of data

We collect tons of data because we can.


But! Reasonable comparisons usually need systematic collection to make sense.

Montefiore Gatifloxacin Data

Many thousands of patients of varying gender, race, and age were treated with either Gatifloxacin or something else. Patients were evaluated for dysglycemia after treatment.

Linear logistic model odds ratios
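
A minimal sketch of the kind of model behind these odds ratios, assuming a hypothetical data frame patients with a dysglycemia indicator, a Gatifloxacin treatment indicator, and the demographic covariates:

fit = glm(dysglycemia ~ gatifloxacin + age + gender + race,
          family = binomial, data = patients)
exp(coef(fit))   # coefficients on the odds-ratio scale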

Gatifloxacin was pulled from the market!

Careful!

The goal of matching is to create a data set that looks closer to one that would result from a perfectly blocked (and possibly randomized) experiment.

—Gary King


Gist: for each observation in the treatment group, find a "nearby" observation in the control group, with replacement.
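
A minimal sketch of matching in R using the MatchIt package, with the same hypothetical data frame and column names as above:

library(MatchIt)
# Coarsened exact matching on the measured covariates
m = matchit(gatifloxacin ~ age + gender + race,
            data = patients, method = "cem")
matched = match.data(m)   # matched sample

The matched sample can then be re-fit with the same logistic model to produce the matched odds ratios shown next.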

After CEM matching

Matched data odds ratios

Even after matching we see an adverse effect.

Software tools

  • R
  • Python
  • Scala and other JVM things
  • Databases