Bryan W. Lewis
I prefer to forage, and I enjoy many mushrooms, other wild foods,
and living simply.
Everyone working on scientific computing problems should consider using R,
a wonderfully powerful and expressive system
for computation and visualization.
Send electronic mail to me at:
- The West Virginia Annual Mushroom Foray is next weekend July 20--21. Not to be missed!
- The Ohio Mushroom Society just finished a great summer foray in the Zaleski Forest in the Appalachian foothills of Ohio. Despite dry conditions, a good number of interesting species were found, including Boletus roodyi, many other interesting boletes, and an unusual chanterelle with a slight purplish color on its cap that looks kind of like Cantharellus amethysteus. John Plischke, III gave a hilarious and informative talk on foraging boletes, something like "Mycologists in cars foraging boletes." He and Laura Wilson also prepared an interesting Tuber (truffle) species for DNA sequencing and entry into the http://mycoflora.org/ project. And the famous Walt Sturgeon presided as chief mycologist and identifier. The OMS main fall foray will be in Hiram, Ohio on the weekend of October 6th.
- Melissa O'Neill's PCG family of random number generators is really interesting. Strangely, it brought out some hostile responses from other researchers, including this one from Sebastiano Vigna; a good example of a bad way to critique ideas. Read O'Neill's gracious response for the right way to respond to that.
- Douglas Hofstadter on Google Translate.
- Thanks to substantial help from many users that found and fixed bugs and performance issues, version 2.3.2 of the R irlba package is on CRAN now. See https://bwlewis.github.io/irlba/ for an overview of this bug fix release.
- Here is a simple new containerized way to run Tomas Kalibera's indispensable rchk tool for R packages with compiled code, based on Singularity: https://github.com/bwlewis/rchk/blob/master/image/README_SINGULARITY.md.
- Julia Silge wrote very cool notes about computationally efficient word vector computation, https://juliasilge.com/blog/tidy-word-vectors/, inspired by Chris Moody at Stitch Fix http://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/. I love this approach!
- If you're interested in a great reference on practical parallel computing, I recommend Norm Matloff's book: Parallel Computing for Data Science: With Examples in R, C++ and CUDA.
- A thoughtful talk that I enjoyed on distributed ledgers/blockchains by Vili Lehdonvirta: https://youtu.be/eNrzE_UfkTw.
- A short note on regularization, correlation, and network analysis using stock prices (I'm still working on this a bit): https://bwlewis.github.io/correlation-regularization/ (uses the new threejs package below).
- A substantially new version of the R threejs package, 0.3.1, is now on CRAN! The new version is focused on enhanced visualization of igraph objects (networks), and includes support for graph animation, crosstalk widgets, and much more.
An updated, fun interactive example of some eigenvalue inclusion methods:
http://bwlewis.github.io/cassini/ (an update of old work with Richard Varga).
- I gave a talk on projection methods at R/Finance in Chicago in May (of course!): https://bwlewis.github.io/rfinance-2017/. Some of those ideas are still developing...we'll be working on them over the summer.
- A new version 2.2.1 of the irlba package for fast truncated SVD, PCA, and symmetric eigenvalue decomposition is on CRAN. The new version includes improved convergence criteria for tricky problems and the return of a prototype method for finding smallest singular values.
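For readers new to the idea: a truncated SVD computes only a few of the largest singular values and vectors of a matrix instead of all of them. Here is a minimal pure-Python sketch that finds just the leading singular triplet by plain power iteration. This is a much simpler technique swapped in for illustration and is not how irlba works (irlba uses implicitly restarted Lanczos bidiagonalization); the function name and the iteration count are made up for this example.

```python
from math import sqrt


def leading_singular_triplet(A, iterations=100):
    """Power iteration on A^T A (illustrative only, not the irlba algorithm).

    A is a list of rows. Returns (sigma, u, v) with A v approximately sigma * u.
    Assumes the starting vector is not orthogonal to the leading right
    singular vector and that A is not the zero matrix.
    """
    m, n = len(A), len(A[0])
    v = [1.0 / sqrt(n)] * n  # arbitrary unit-length starting vector
    for _ in range(iterations):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = A v
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]  # w = A^T u
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # renormalize the right vector
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = sqrt(sum(x * x for x in u))  # sigma = ||A v||
    u = [x / sigma for x in u]
    return sigma, u, v


A = [[3.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
sigma, u, v = leading_singular_triplet(A)
print(round(sigma, 6))  # 3.0
```

Power iteration converges slowly when the top two singular values are close together, which is one reason Lanczos-type methods are preferred in practice.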
- Michael Elad wrote an interesting editorial on deep learning in SIAM News: https://sinews.siam.org/Details-Page/deep-deep-trouble-4.
- Some more notes on PCA over full-genome variants from the NIH 1000 Genomes Project, and on computing in parallel with R:
- Brief overview
- Notes on the parallel computation
- Some notes on R and Singularity, a lightweight container system well-suited to HPC: https://bwlewis.github.io/r-and-singularity/.
- A substantially revised new version of the threejs package
is almost ready, with a big focus on improving graph visualization.
See https://bwlewis.github.io/rthreejs/ for a few examples.
- A celebrated recent paper on the fallacy of the hot hand fallacy worth reading! http://dx.doi.org/10.2139/ssrn.2627354 Miller, Joshua Benjamin and Sanjurjo, Adam, Surprised by the Gambler's and Hot Hand Fallacies? A Truth in the Law of Small Numbers (November 15, 2016). IGIER Working Paper No. 552.
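The small-sample effect in that paper's title is easy to check by brute force. Here is a minimal sketch that enumerates every equally likely sequence of three fair coin flips and averages, over the sequences where it is defined, the proportion of heads among flips that immediately follow a head. The function names are made up for illustration, and the choice of three flips is just a convenient small case.

```python
from fractions import Fraction
from itertools import product


def proportion_heads_after_heads(flips):
    """Share of heads among flips that immediately follow a head.

    Returns None when no flip follows a head (the proportion is undefined).
    """
    followers = [b for a, b in zip(flips, flips[1:]) if a == "H"]
    if not followers:
        return None
    return Fraction(followers.count("H"), len(followers))


def expected_proportion(n):
    """Average the proportion over all sequences of n fair flips,
    skipping the sequences where it is undefined."""
    values = [proportion_heads_after_heads(seq) for seq in product("HT", repeat=n)]
    defined = [v for v in values if v is not None]
    return sum(defined, Fraction(0)) / len(defined)


print(expected_proportion(3))  # 5/12
```

Even though each flip is fair, the average of the per-sequence proportions is below one half.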
- Germany is putting Tim Gowers's arguments into practice and effectively boycotting Elsevier journals next year, see www.sub.uni-goettingen.de. Elsevier is a particularly dodgy publisher with questionable ethics. Fortunately for me there are exceptionally high-quality free and open access journals like Electronic Transactions on Numerical Analysis.
- From the dada manifesto
TO MAKE A DADAIST POEM
Take a newspaper.
Take some scissors.
Choose from this paper an article of the length you want to make your poem.
Cut out the article.
Next carefully cut out each of the words that makes up this article and put them all in a bag.
Next take out each cutting one after the other.
Copy conscientiously in the order in which they left the bag.
The poem will resemble you.
And there you are - an infinitely original author of charming sensibility, even though unappreciated by the vulgar herd.
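For those without scissors, the recipe above can be sketched in a few lines of Python. This is only a playful illustration: the function name, the whitespace-based word splitting, and the optional seed (there just to make a draw repeatable) are all assumptions of the sketch.

```python
import random


def dadaist_poem(article, seed=None):
    """Cut an article into words, shake them in a bag, copy them out in draw order."""
    words = article.split()    # cut out each of the words
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    rng.shuffle(words)         # put them in a bag, take them out one after the other
    return " ".join(words)     # copy in the order in which they left the bag


article = "Take a newspaper. Take some scissors."
print(dadaist_poem(article, seed=1))
```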
- After recently upgrading my cheap laptop to Ubuntu 16.04 I discovered that my old network manager scheme for randomizing hardware MAC address broke. I saw that the latest build of network manager introduced a new scheme for doing this, but was unable to get that to work in Ubuntu (see for instance this thread https://bbs.archlinux.org/viewtopic.php?id=213855). Frustrated with making such a simple thing so difficult, I stripped out all that junk from my operating system and now use this trivial script to manually manage my wifi network interface: wifi. The script doesn't cover all connection cases, but less common connections can usually be manually configured without much problem.
- Friends and I are working on some new methods for fast and efficient thresholded correlation, suitable for very large-scale problems: preprint paper and corresponding prototype R package github.com/bwlewis/tcor. Here is an example that walks through correlation of TCGA RNASeq gene expression data: https://github.com/bwlewis/tcor/blob/master/vignettes/brca.Rmd.
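To make the problem concrete, here is a naive brute-force baseline for thresholded correlation: compute the Pearson correlation of every pair of columns and keep the pairs at or above a threshold t. This is not the method in the preprint or the tcor package; it is the quadratic-cost computation that faster methods aim to avoid. The function names, the use of the absolute value, and the toy data are assumptions of the sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def thresholded_correlation(columns, t):
    """Return (i, j, r) for every pair of columns with |r| >= t.

    Brute force: examines all p * (p - 1) / 2 pairs of the p columns.
    """
    hits = []
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) >= t:
            hits.append((i, j, r))
    return hits


# Made-up toy data: columns 0 and 1 are nearly proportional, column 2 is not.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 1.0, 3.0, 2.0],
]
for i, j, r in thresholded_correlation(columns, t=0.9):
    print(i, j, round(r, 3))
```

With many thousands of columns the number of pairs grows quadratically, which is why this baseline does not scale to very large problems.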
- A few folks asked how I install R on Linux for my own use. This is how I do it: r-on-linux.html.
- Some "big data" genomics problems are easier than you might think. These R examples walk through PCA and overlap join problems involving genomic variant data from the 1000 Genomes Project: https://github.com/bwlewis/1000_genomes_examples. I recently added some examples that show how to compute PCA across the whole genome here http://bwlewis.github.io/1000_genomes_examples/PCA_whole_genome.html. For instance, using plain old R we can compute PCA across all genomic variants in the 1000 Genomes data on a single Amazon EC2 instance in only about 8 minutes!
- I updated the irlba package for R (new version is 2.1.2). The new version includes a convenience 'prcomp'-like function for principal components, and a significant performance boost for most problems. I've got a web page about the method that includes references to vignettes and applications here: https://bwlewis.github.io/irlba/, and a comparison with the cool RSpectra package for R here: https://bwlewis.github.io/irlba/comparison.html. Jim Baglama informs me that the Mathworks has finally adopted the irlba method for use in MATLAB's truncated SVD solver, 'svds'. All I can say is, welcome to the present. (R has had this available in stable form for over 5 years. I guess it's hard for proprietary software companies to keep up with the times; they have to spend so much effort on marketing, after all.)
- I was honored to participate again in the 2016 NYR event/party/conference, http://www.rstats.nyc/, and saw many wonderful talks there. Here is a link to a video of my talk: https://youtu.be/PM7O6EGakKY and all the talks: http://www.rstats.nyc/2016.
- You should really be reading Andrew Gelman's blog.
- Geeks are the new jocks http://www.wsj.com/articles/a-data-scientist-dissects-the-2016-nfl-draft-1461793878
- R/Finance makes Chicago the place
to be every May.
As usual I advocate for the obvious, avoid complexity and keep things simple: r_finance_2016.html. As you can see, I'm a big fan of feather.
- I gave a talk at Kent State on cointegration and its implementation: http://illposed.net/cointegration.html.
Mike and I have been writing down our working notes on generalized linear
models. Still incomplete and a bit rough, but maybe interesting to somebody...
Our focus is on numerics and performance. See http://bwlewis.github.io/GLM and the
associated project https://github.com/bwlewis/GLM.
- A thoughtful talk by Douglas Adams (1998) http://ia800202.us.archive.org/9/items/biota2_audio/biota_adams.mp3, "The fact that we live at the bottom of a deep gravity well, on the surface of a gas covered planet going around a nuclear fireball 90 million miles away and think this to be normal is obviously some indication of how skewed our perspective tends to be..."
- I was invited to give a few short talks at Microsoft's R and Python day in which I urged data scientists to not get so hung up on languages and think more about literature. We also discussed the AzureML package for R; see
r_and_python.html and azureml.html.
- I added a new function for plotting interactive 3-d force directed graphs to the threejs package for R, also now on CRAN. See http://bwlewis.github.io/rthreejs/graphjs.html for some basic examples. Source code is available here.
- A few of my incidental R projects recently needed very fast data compression/decompression. R's gzip implementation is excellent overall. But I wanted a bit more speed, which led to this: https://github.com/bwlewis/lz4. The lz4 method by Yann Collet is extremely fast at decompression; you can read more about it here: lz4.org.
- Jedediah Purdy defends Thoreau in the Atlantic from the bitingly provocative article by Kathryn Schulz in the New Yorker. Henry Thoreau is either a "genuine American weirdo" or narcissistic author of "cabin porn" or, perhaps, both. I really enjoyed reading these gems.
- I gave a talk at the University of Rhode Island Friday, September 18th called "Math in the Time of Data." Slides: http://illposed.net/uri.html
- I gave a talk Wednesday, September 16th at the Boston R meetup on thinking small about big data. Slides are here: boston_rug_sept_2015.html.
- Fascinating data from the Department of Education: https://collegescorecard.ed.gov/data/
- A really cool negative result: http://dash.harvard.edu/bitstream/handle/1/3043415/imbens_bootstrap.pdf
- Three.js plot widgets for R (now on CRAN): https://bwlewis.github.io/rthreejs.
- The Interface conference was in Morgantown, WV this year. I gave a contrarian talk, criticizing some "big data" systems and examples out there and encouraging us to think carefully about solving problems before resorting to using those tools. Here are the slides: think_small.html.
- I spoke about the many uses of the singular value decomposition in computational finance at the seventh annual R in Finance Conference in Chicago. Only the very coolest people attend this conference. The slides are here: rf2015.html.
- I gave a talk at the NY R conference on foraging and visualization. Here is a video of the talk: https://www.youtube.com/watch?v=OXYX1FVlbdI and slides can be found here: http://illposed.net/nycr2015/
- A cautionary note on missing value handling in correlation: http://bwlewis.github.io/covar/missing.html
- I highly recommend this article by Frank McSherry summarizing his work with Michael Isard and Derek Murray: Scalability! But at what COST? Using some moderately large graph problems as examples, they advocate thinking carefully about problems and their solution methods (seems obvious, right?).
- If you missed the Bay Area meetup, I'm giving a longer talk on htmlwidgets at the Cleveland R meetup on Wednesday, February 25th (2015). See http://www.meetup.com/Cleveland-UseR-Group/events/220140560/ for more info. Here are the slides from that talk.
- I'm giving a short talk tonight (27-Jan-2015) in San Jose at the BARUG on htmlwidgets. The crazy amazing line-up of speakers tonight includes Mike Kane, Dirk Eddelbuettel, and Gabe Becker. Not to be missed!!!
- I've been learning about clustering methods recently. Here is a link
to a simple hierarchical clustering implementation (<50 lines) that is
written only in R to make it easy to understand and experiment with:
The algorithm used by the native R hclust function in the statistics package
is far faster, so use that in practice.
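To give a flavor of how little code the basic idea needs, here is a naive agglomerative clustering sketch in Python. It is not the R implementation mentioned above: the single-linkage rule, the function name, and the toy points are assumptions made for illustration. It starts with every point in its own cluster and repeatedly merges the two closest clusters, recording each merge.

```python
from math import dist


def single_linkage(points):
    """Naive agglomerative clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Single linkage: the distance between two clusters is the smallest
        # distance between any pair of their members.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Made-up toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
for left, right, d in single_linkage(points):
    print(left, right, round(d, 3))
```

The repeated search over all cluster pairs makes this very slow for large inputs, which is the price of keeping the code easy to read.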
- I wrote up a trivially simple implementation of and examples illustrating
Gene Golub's SVD subset
selection algorithm. Mike and I are using it in one of our GLM implementations
(see below). But it's a cool method and deserves more attention.
- This is cool: http://xkcd.r-forge.r-project.org/
- I asked some questions about ill-posed problems and regularization
at Kent State recently. Here are the slides:
The slides include a simple R program that
applies regularization to stock returns in order to cluster stocks
by a relevance network graph.
- I gave a talk with Jake VanderPlas about SciDB at PyData 2013 NYC. Here is a link to a Wakari notebook: http://goo.gl/ovGaHS
The Redis client for R was recently updated! R package here on CRAN:
Source code here on GitHub:
And the package vignette (PDF):
- Here are some relatively recent papers I really like:
Network analysis via partial spectral factorization and Gauss quadrature
In Search of an Understandable Consensus Algorithm
A Scalable Bootstrap for Massive Data
Quadrature Rule-Based Bounds for Functions of Adjacency Matrices
Augmented Implicitly Restarted Lanczos Bidiagonalization Methods
OK, those last two are not so new, but they're super-cool.
- I gave a talk on tips and tricks for performance computing with R at the Cleveland R meet-up on Wednesday, August 7th. Here are the slides: http://goo.gl/gcPezs. Perhaps the most interesting part shows that it's pretty easy to install the commercial but freely available AMD BLAS and LAPACK libraries for R on Windows and Linux.
- I gave a talk at the Boston PyData conference (http://pydata.org/) about SciDB-Py -- Jake VanderPlas' new interface between SciDB and Python. The interface defines a numpy/scipy-like array class for Python backed by SciDB arrays. Install the package directly from GitHub with pip install git+ssh://github.com/jakevdp/scidb-py.git.
- I've just been reading Patrick Burns' book, http://www.burns-stat.com/documents/books/tao-te-programming/, and really enjoy it.
- I gave a talk on SciDB and R and Python at JSM on Sunday, August 4th. Here are the slides: http://goo.gl/A2RPkn.
- So you like Python muthaph*kkahz!?! You got it: https://github.com/bwlewis/irlbpy. This is the fastest
partial SVD and PCA routine for dense and sparse matrices available in Python. It is
restricted right now to real-valued matrices and is still under active
development. Mike Kane presents
our work at the SciPy Conference next week in Austin June 24--28 (2013).
Whit Armstrong and I ran a seminar on high performance computing with R at the
R/Finance conference in May.
We emphasized elastic computing using 0MQ and Redis with R,
and a bit of parallel linear algebra with SciDB. Here
are the slides we used:
doRedis.html, a parallel back end for the R language that uses Redis and foreach.
Here is the vignette documentation:
- My lightning talk on SciDB and R for the Boston R meetup on 22-Jan-2013: goo.gl/btioG.
- I gave a talk at JSM about R and websockets. Here it is:
And, here is a nifty application of websockets and R in quant. finance:
Here is a silly cool "chat" script for R using websockets (many web clients can share
a super basic R session):
Joe Cheng over at RStudio has taken over active development
of the package.
- Slides about the R bigmemory package, parallel linear algebra in R, and a preview of what I'm working on with R and SciDB from a recent talk at the Boston R Meetup:
- One new idea and one old idea that should be better known on the SVD and cointegration
(from a recent talk at R/Finance 2012):
- A data frame promise for R that very quickly extracts subsets directly from raw delimited text files:
A native HTML 5 Websocket library for R:
I discussed some methods other than Hadoop for analyzing large data with
the New York CTO club. My notes are available here:
Outlaw talk: "The Betfair Package" at
R/Finance 2011: Applied Finance with R
Betfair is the world's largest betting exchange with more than three million
global clients. The BetfaiR package implements the Betfair Sports API in the R
language, providing direct access to the Betfair sports exchange from R. All
of the Betfair Sports API functions are available, including functions for real
time market data and user account access. The package also provides a number of
high-level functions for sports betting analysis, modeling and graphics.
This was the first talk I ever gave where running the examples live would
require breaking the law.
- Talk: "How good are Krylov methods for discrete ill-posed problems?," March 25--28 AMS meeting in Lexington, KY: http://www.ms.uky.edu/~corso/amsmaa2010/.
Here are some slides:
pvshm.html: A Linux filesystem that
provides a memory mapping overlay for PVFS2 or other file systems lacking
memory mapping capability.
http://github.com/bwlewis/fls, an implementation of Kalaba-Tesfatsion flexible least squares method for R.
R4P, an R library for Processing.
Ratlab, tools for foolin' with R and Octave (or Matlab) together.
http://etna.math.kent.edu/vol.30.2008/pp128-143.dir/zeros/index.html A newer Java applet illustrating the dynamical motion of the zeros of the partial sums of the exponential function (from work with Richard Varga and Amos Carpenter).