Bryan W. Lewis
I prefer to forage, and I enjoy many mushrooms, other wild foods,
and living simply.
Everyone working on scientific computing problems should consider using R,
a wonderfully powerful and expressive system
for computation and visualization.
Send electronic mail to me at:
There are two kinds of people in the world, those that believe in dimensionality reduction, and those that believe there are 7 billion kinds of people in the world.
—told to me by Jenny Bryan
- I updated the irlba package for R (new version is 2.1.2). The new version includes a convenience 'prcomp'-like function for principal components, and a significant performance boost for most problems. I've got a web page about the method that includes references to vignettes and applications here: https://bwlewis.github.io/irlba/, and a comparison with the cool RSpectra package for R here: https://bwlewis.github.io/irlba/comparison.html. Jim Baglama informs me that the Mathworks has finally adopted the irlba method for use in MATLAB's truncated SVD solver, 'svds'. All I can say is, welcome to the present. (R has had this available in stable form for over 5 years. I guess it's hard for proprietary software companies to keep up with the times; they have to spend so much effort on marketing, after all.)
- Friends and I are working on some new methods for fast and efficient thresholded correlation, suitable for very large-scale problems: a preprint paper and a corresponding prototype R package, github.com/bwlewis/tcor. Here is an example that walks through correlation of TCGA RNASeq gene expression data: https://github.com/bwlewis/tcor/blob/master/vignettes/brca.Rmd.
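For readers new to the problem, here is a minimal brute-force sketch of what "thresholded correlation" asks for, written in plain Python to stay dependency-free. It is not the method from the preprint or the tcor package: the function names `pearson` and `thresholded_correlation` are made up for illustration, and the choice to keep pairs with correlation at or above the threshold (rather than, say, absolute correlation) is an assumption. It examines every pair of columns, which is exactly the quadratic cost a fast method would try to avoid.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den  # divides by zero if either column is constant


def thresholded_correlation(columns, t):
    """Return (i, j, r) for every pair of columns with correlation r >= t.

    Hypothetical helper: brute force over all p * (p - 1) / 2 pairs.
    """
    pairs = []
    for (i, x), (j, y) in combinations(enumerate(columns), 2):
        r = pearson(x, y)
        if r >= t:
            pairs.append((i, j, r))
    return pairs


# Three toy columns: the first two are nearly proportional, the third is not.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 5.9, 8.0],
    [4.0, 1.0, 3.0, 2.0],
]
pairs = thresholded_correlation(columns, 0.9)
```

With thousands of columns the pair count grows quadratically, which is why methods that prune most pairs without computing them matter at large scale.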
- A few folks asked how I install R on Linux for my own use. This is how I do it: r-on-linux.html.
- Some "big data" genomics problems are easier than you might think. These R examples walk through PCA and overlap join problems involving genomic variant data from the 1000 Genomes Project: https://github.com/bwlewis/1000_genomes_examples. I recently added some examples that show how to compute PCA across the whole genome here http://bwlewis.github.io/1000_genomes_examples/PCA_whole_genome.html. For instance, using plain old R we can compute PCA across all genomic variants in the 1000 Genomes data on a single Amazon EC2 instance in only about 8 minutes!
- I was honored to participate again in the 2016 NYR event/party/conference, http://www.rstats.nyc/, and saw many wonderful talks there. Here is a link to a video of my talk: https://youtu.be/PM7O6EGakKY and all the talks: http://www.rstats.nyc/2016.
William A. Stein of SageMath fame posted a heavy talk on the lack of
support for serious free and open mathematical software in academia: http://wstein.org/talks/2016-06-sage-bp/bp.pdf.
His talk is important and a bit depressing. Happily, the R world that I mostly
live in fares better. R is fortunate to receive many
significant contributions from both academia and industry.
R development benefits from superb journals like the Journal of Statistical Software, the Journal of
Computational and Graphical Statistics, the R Journal, and others that provide
software-centric publication outlets.
It's almost impossible to overstate the importance of freely
available, transparent, correct implementations of algorithms and the software
applications that make these things usable.
These applications are used to make decisions
affecting all of us like analyzing clinical drug trials, figuring the impact of
climate change, discovering disease pathways, and on and on. It may be that
some proprietary black box system can solve certain problems really fast or
exhibit some other advantage. But without high-quality, free open source
software we have no easy way to gauge the quality of those black box solutions.
That's why I'm so deeply involved with R and why I admire Python so much. I
hope that the academic world and society more broadly understands how important
- You should really be reading Andrew Gelman's blog.
- Geeks are the new jocks http://www.wsj.com/articles/a-data-scientist-dissects-the-2016-nfl-draft-1461793878
- R/Finance makes Chicago the place
to be every May.
As usual, I advocate for the obvious (avoid complexity and keep things simple): r_finance_2016.html. As you can see, I'm a big fan of feather.
- I gave a talk at Kent State on cointegration and its implementation: http://illposed.net/cointegration.html.
Mike and I have been writing down our working notes on generalized linear
models. Still incomplete and a bit rough, but maybe interesting to somebody...
Our focus is on numerics and performance. See http://bwlewis.github.io/GLM and the
associated project https://github.com/bwlewis/GLM.
- A thoughtful talk by Douglas Adams (1998) http://ia800202.us.archive.org/9/items/biota2_audio/biota_adams.mp3, "The fact that we live at the bottom of a deep gravity well, on the surface of a gas covered planet going around a nuclear fireball 90 million miles away and think this to be normal is obviously some indication of how skewed our perspective tends to be..."
- I was invited to give a few short talks at Microsoft's R and Python day in which I urged data scientists to not get so hung up on languages and think more about literature. We also discussed the AzureML package for R; see
r_and_python.html and azureml.html.
- I added a new function for plotting interactive 3-d force directed graphs to the threejs package for R, also now on CRAN. See http://bwlewis.github.io/rthreejs/graphjs.html for some basic examples. Source code is available here.
- A few of my incidental R projects recently needed very fast data compression/decompression. R's gzip implementation is excellent overall. But I wanted a bit more speed, which led to this: https://github.com/bwlewis/lz4. The lz4 method by Yann Collet is extremely fast at decompression; you can read more about it here: lz4.org.
- A new version 2.0.0 of the irlba package for fast truncated SVD, PCA, and now partial symmetric eigenvalue decomposition is on CRAN.
- Jedediah Purdy defends Thoreau in the Atlantic from the bitingly provocative article by Kathryn Schulz in the New Yorker. Henry Thoreau is either a "genuine American weirdo" or narcissistic author of "cabin porn" or, perhaps, both. I really enjoyed reading these gems.
- I gave a talk at the University of Rhode Island Friday, September 18th called "Math in the Time of Data." Slides: http://illposed.net/uri.html
- I gave a talk Wednesday, September 16th at the Boston R meetup on thinking small about big data. Slides are here: boston_rug_sept_2015.html.
- Fascinating data from the Department of Education: https://collegescorecard.ed.gov/data/
- A really cool negative result: http://dash.harvard.edu/bitstream/handle/1/3043415/imbens_bootstrap.pdf
- Three.js plot widgets for R (now on CRAN): https://bwlewis.github.io/rthreejs.
- The Interface conference was in Morgantown, WV this year. I gave a contrarian talk, criticizing some "big data" systems and examples out there and encouraging us to think carefully about solving problems before resorting to using those tools. Here are the slides: think_small.html.
- I spoke about the many uses of the singular value decomposition in computational finance at the seventh annual R in Finance Conference in Chicago. Only the very coolest people attend this conference. The slides are here: rf2015.html.
- I gave a talk at the NY R conference on foraging and visualization. Here is a video of the talk: https://www.youtube.com/watch?v=OXYX1FVlbdI and slides can be found here: http://illposed.net/nycr2015/
- A cautionary note on missing value handling in correlation: http://bwlewis.github.io/covar/missing.html
- I highly recommend this article by Frank McSherry summarizing his work with Michael Isard and Derek Murray: "Scalability! But at what COST?" Using some moderately large graph problems as examples, they advocate thinking carefully about problems and their solution methods (seems obvious, right?).
- Here is a fun example of covariance shrinkage and graph clustering of stock market returns that uses
htmlwidgets to visualize the output: https://bwlewis.github.io/covariance-shrinkage/
A fun little interactive illustration of Gerschgorin's circles and Brauer's
Please feel free to fork and use the code available on Github here:
- If you missed the Bay Area meetup, I'm giving a longer talk on htmlwidgets at the Cleveland R meetup on Wednesday, February 25th (2015). See http://www.meetup.com/Cleveland-UseR-Group/events/220140560/ for more info. Here are the slides from that talk.
- I'm giving a short talk tonight (27-Jan-2015) in San Jose at the BARUG on htmlwidgets. The crazy amazing line-up of speakers tonight includes Mike Kane, Dirk Eddelbuettel, and Gabe Becker. Not to be missed!!!
- I've been learning about clustering methods recently. Here is a link
to a simple hierarchical clustering implementation (<50 lines) that is
written only in R to make it easy to understand and experiment with:
The algorithm used by the native R hclust function in the stats package
is far faster, so use that in practice.
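To give a flavor of how little code a naive hierarchical clustering needs, here is a minimal sketch in plain Python. It is not the R implementation mentioned above: the function name `single_linkage`, the choice of single linkage with Euclidean distance, and stopping at `k` clusters (assumed to be at least 1) are all assumptions made for illustration. Like most readable versions it is slow, recomputing cluster distances at every merge.

```python
def single_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters until k remain.

    Hypothetical helper. Cluster distance is single linkage, i.e. the
    smallest Euclidean distance between any two members. Assumes k >= 1.
    Returns a list of clusters, each a sorted list of point indices.
    """
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a; b > a, so index a stays valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Two tight pairs of points plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = single_linkage(points, 3)
```

Recording the merge order and merge distances instead of stopping at `k` would give the full dendrogram.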
- I wrote up a trivially simple implementation of and examples illustrating
Gene Golub's SVD subset
selection algorithm. Mike and I are using it in one of our GLM implementations
(see below). But it's a cool method and deserves more attention.
- This is cool: http://xkcd.r-forge.r-project.org/
- I asked some questions about ill-posed problems and regularization
at Kent State recently. Here are the slides:
The slides include a simple R program that
applies regularization to stock returns in order to cluster stocks
by a relevance network graph.
- I gave a talk with Jake VanderPlas about SciDB at PyData 2013 NYC. Here is a link to a Wakari notebook: http://goo.gl/ovGaHS
The Redis client for R was recently updated! R package here on CRAN:
Source code here on GitHub:
And the package vignette (PDF):
- Here are some relatively recent papers I really like:
Network analysis via partial spectral factorization and Gauss quadrature
In Search of an Understandable Consensus Algorithm
A Scalable Bootstrap for Massive Data
Quadrature Rule-Based Bounds for Functions of Adjacency Matrices
Augmented Implicitly Restarted Lanczos Bidiagonalization Methods
OK, those last two are not so new, but they're super-cool.
- I gave a talk on tips and tricks for performance computing with R at the Cleveland R meet-up on Wednesday, August 7th. Here are the slides: http://goo.gl/gcPezs. Perhaps the most interesting part shows that it's pretty easy to install the commercial but freely available AMD BLAS and LAPACK libraries for R on Windows and Linux.
- I gave a talk at the Boston PyData conference (http://pydata.org/) about SciDB-Py -- Jake VanderPlas's new interface between SciDB and Python. The interface defines a numpy/scipy-like array class for Python backed by SciDB arrays. Install the package directly from GitHub with pip install git+ssh://github.com/jakevdp/scidb-py.git.
- I've just been reading Patrick Burns' book, http://www.burns-stat.com/documents/books/tao-te-programming/, and really enjoy it.
- I gave a talk on SciDB and R and Python at JSM on Sunday, August 4th. Here are the slides: http://goo.gl/A2RPkn.
- So you like Python muthaph*kkahz!?! You got it: https://github.com/bwlewis/irlbpy. This is the fastest
partial SVD and PCA routine for dense and sparse matrices available in Python.
It's restricted right now to real-valued matrices and is still under active
development. Mike Kane presents
our work at the SciPy Conference next week in Austin, June 24--28 (2013).
- I get a lot of questions about using the fast truncated SVD
irlba package, especially for large problems.
So, I've started a page of miscellaneous tips here:
Whit Armstrong and I ran a seminar on high performance computing with R at the
R/Finance conference in May.
We emphasized elastic computing using 0MQ and Redis with R,
and a bit of parallel linear algebra with SciDB. Here
are the slides we used:
doRedis.html, a parallel back end for the R language that uses Redis and foreach.
Here is the vignette documentation:
The irlba package for
R provides a state-of-the-art fast partial singular value decomposition. It's
suitable for very large-scale problems and supports sparse and dense matrices.
To give you an idea how fast it is, one can compute a five-dimensional
principal components analysis (PCA) on the Netflix data set
(480,189 user IDs and 17,770 movies) in a few minutes on a dual-core notebook
(using R's sparse Matrix package).
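One reason a partial SVD can handle very large sparse problems is that it only needs products of the matrix (and its transpose) with vectors, never a dense factorization. The sketch below illustrates that idea with plain power iteration for the single largest singular triplet. It is emphatically not the irlba algorithm, which is far more capable and converges much faster; the function names here are hypothetical, and the code assumes a nonzero matrix and a starting vector that is not orthogonal to the leading right singular vector.

```python
from math import sqrt


def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T y without forming the transpose."""
    return [sum(A[i][j] * y[i] for i in range(len(A)))
            for j in range(len(A[0]))]


def norm(x):
    return sqrt(sum(v * v for v in x))


def leading_singular_triplet(A, iterations=200):
    """Power iteration on A^T A; returns (u, sigma, v) with A v = sigma u.

    Hypothetical helper. The matrix is touched only through matvec and
    rmatvec, which is what makes the idea work for sparse storage too.
    """
    n = len(A[0])
    v = [1.0 / sqrt(n)] * n  # assumed not orthogonal to the leading vector
    for _ in range(iterations):
        w = rmatvec(A, matvec(A, v))
        s = norm(w)          # zero only if A v vanished (e.g. A is zero)
        v = [x / s for x in w]
    u = matvec(A, v)
    sigma = norm(u)
    u = [x / sigma for x in u]
    return u, sigma, v


# A 3 x 2 matrix whose singular values are 3 and 1.
A = [[3.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
u, sigma, v = leading_singular_triplet(A)
```

Power iteration converges slowly when the top two singular values are close, and it yields only one triplet at a time; those are the kinds of limitations that restarted Lanczos methods are designed to overcome.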
- My lightning talk on SciDB and R for the Boston R meetup on 22-Jan-2013: goo.gl/btioG.
- I gave a talk at JSM about R and websockets. Here it is:
And, here is a nifty application of websockets and R in quant. finance:
Here is a silly cool "chat" script for R using websockets (many web clients can share
a super basic R session):
Joe Cheng over at RStudio has taken over active development
of the package.
- Slides about the R bigmemory package, parallel linear algebra in R, and a preview of what I'm working on with R and SciDB from a recent talk at the Boston R Meetup:
- One new idea, and one old idea that should be better known, on the SVD and cointegration
(from a recent talk at R/Finance 2012):
- A data frame promise for R that very quickly extracts subsets directly from raw delimited text files:
A native HTML 5 Websocket library for R:
I discussed some methods other than Hadoop for analyzing large data with
the New York CTO club. My notes are available here:
Outlaw talk: "The Betfair Package" at
R/Finance 2011: Applied Finance with R
Betfair is the world's largest betting exchange with more than three million
global clients. The BetfaiR package implements the Betfair Sports API in the R
language, providing direct access to the Betfair sports exchange from R. All
of the Betfair Sports API functions are available, including functions for real
time market data and user account access. The package also provides a number of
high-level functions for sports betting analysis, modeling and graphics.
This was the first talk I ever gave where running the examples live would
require breaking the law.
- Talk: "How good are Krylov methods for discrete ill-posed problems?," March 25--28 AMS meeting in Lexington, KY: http://www.ms.uky.edu/~corso/amsmaa2010/.
Here are some slides:
pvshm.html: A Linux filesystem that
provides a memory mapping overlay for PVFS2 or other file systems lacking
memory mapping capability.
http://github.com/bwlewis/fls, an implementation of the Kalaba-Tesfatsion flexible least squares method for R.
R4P, an R library for Processing.
Ratlab, tools for foolin' with R and Octave (or Matlab) together.
http://etna.math.kent.edu/vol.30.2008/pp128-143.dir/zeros/index.html A newer Java applet illustrating the dynamical motion of the zeros of the partial sums of the exponential function (from work with Richard Varga and Amos Carpenter).