Send electronic mail to me at: email@example.com.
Here are some old notes (from late 2017) on an example about fraud detection using a neural network autoencoder, originally from the RStudio blog, for your interest: autoencoder.html.
I've been modeling application performance for Atrio. Here is a high-level description of that work if you're interested... atrio.html
A simple template for simple and basic HTML slide presentations https://github.com/bwlewis/slides.
Manny Salzman's wondrous trip https://www.coloradoindependent.com/171325/manny-salzmans-long-strange-and-wondrous-trip. Here is a photo of Manny and Joanne (and Laura Wilson in the netted stinkhorn disguise and me, as some kind of clavaria species) in Telluride: manny.jpg.
The Ohio Mushroom Society just finished a great summer foray in the Zaleski Forest in the Appalachian foothills of Ohio. Despite dry conditions, a lot of interesting species were found, including Boletus roodyi, many other interesting boletes, and an unusual chanterelle with a slight purplish color on its cap that looks kind of like Cantharellus amethysteus. John Plischke III gave a hilarious and informative talk on foraging boletes, something like "Mycologists in cars foraging boletes." He and Laura Wilson also prepared an interesting Tuber (truffle) species, which we could not identify, for DNA sequencing and entry into the http://mycoflora.org/ project. And the famous Walt Sturgeon presided as chief mycologist and identifier. The OMS main fall foray will be in Hiram, Ohio on the weekend of October 6th.
Melissa O'Neill's PCG family of random number generators is really interesting. Strangely, it brought out some hostile responses from other researchers, including this one from Sebastiano Vigna; a good example of a bad way to critique ideas. Read O'Neill's gracious response for the right way to respond to that.
Thanks to substantial help from many users that found and fixed bugs and performance issues, version 2.3.2 of the R irlba package is on CRAN now. See https://bwlewis.github.io/irlba/ for an overview of this bug fix release.
Here is a simple new containerized way to run Tomas Kalibera's indispensable rchk tool for R packages with compiled code, based on Singularity: https://github.com/bwlewis/rchk/blob/master/image/README_SINGULARITY.md.
Julia Silge wrote very cool notes about computationally efficient word vector computation, https://juliasilge.com/blog/tidy-word-vectors/, inspired by Chris Moody at Stitch Fix http://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/. I love this approach!
If you're interested in a great reference on practical parallel computing, I recommend Norm Matloff's book: Parallel Computing for Data Science with Examples in R C++ and CUDA.
A thoughtful talk that I enjoyed on distributed ledgers/blockchains by Vili Lehdonvirta: https://youtu.be/eNrzE_UfkTw.
A short note on regularization, correlation, and network analysis using stock prices (I'm still working on this a bit): https://bwlewis.github.io/correlation-regularization/ (uses the new threejs package below).
A substantially new version of the R threejs package, 0.3.1, is now on CRAN! The new version is focused on enhanced visualization of igraph objects (networks), and includes support for graph animation, crosstalk widgets, and much more.
An updated, fun interactive example of some eigenvalue inclusion methods: http://bwlewis.github.io/cassini/ (an update of old work with Richard Varga).
I gave a talk on projection methods at R/Finance in Chicago in May (of course!): https://bwlewis.github.io/rfinance-2017/. Some of those ideas are still developing...we'll be working on them over the summer.
A new version 2.2.1 of the irlba package for fast truncated SVD, PCA, and symmetric eigenvalue decomposition is on CRAN. The new version includes improved convergence criteria for tricky problems and the return of a prototype method for finding smallest singular values.
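For a feel of what a truncated SVD computes, here is a minimal Python sketch that estimates only the largest singular triplet of a small matrix by plain power iteration. This is not the implicitly restarted Lanczos bidiagonalization method that irlba uses; it is a much simpler stand-in. The function names and the test matrix are made up for illustration.

```python
import math

def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y without forming the transpose."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def normalize(x):
    """Return (x / ||x||, ||x||); assumes x is not the zero vector."""
    nrm = math.sqrt(sum(v * v for v in x))
    return [v / nrm for v in x], nrm

def top_singular_triplet(A, iters=200):
    """Estimate the largest singular value and its singular vectors.

    Hypothetical helper: alternates multiplication by A and A^T, normalizing
    each time. It assumes the starting vector is not orthogonal to the
    leading right singular vector, and that there is a gap between the
    first and second singular values.
    """
    v, _ = normalize([1.0] * len(A[0]))
    for _ in range(iters):
        u, _ = normalize(matvec(A, v))
        v, _ = normalize(rmatvec(A, u))
    u, sigma = normalize(matvec(A, v))
    return sigma, u, v

# The singular values of this matrix are 3 and 1.
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma, u, v = top_singular_triplet(A)
print(sigma)
```

Power iteration converges slowly when the leading singular values are close together, which is one reason Lanczos-type methods are preferred in practice.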
Michael Elad wrote an interesting editorial on deep learning in SIAM News: https://sinews.siam.org/Details-Page/deep-deep-trouble-4.
Some more notes on PCA over full-genome variants from the NIH 1000 Genomes Project, and on computing in parallel with R:
- Brief overview http://bwlewis.github.io/1000_genomes_examples/PCA_overview.html
- Notes on the parallel computation http://bwlewis.github.io/1000_genomes_examples/notes.html
Some notes on R and Singularity, a lightweight container system well-suited to HPC: https://bwlewis.github.io/r-and-singularity/.
A celebrated recent paper on the fallacy of the hot hand fallacy worth reading! http://dx.doi.org/10.2139/ssrn.2627354 Miller, Joshua Benjamin and Sanjurjo, Adam, Surprised by the Gambler's and Hot Hand Fallacies? A Truth in the Law of Small Numbers (November 15, 2016). IGIER Working Paper No. 552.
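The small-sample effect behind the paper's title can be checked by brute force. The sketch below (Python, with a made-up function name) enumerates every sequence of n fair coin flips, computes the proportion of heads among flips that immediately follow a head, and averages that proportion over the sequences where it is defined. It is only an illustration of the counting argument, not code from the paper.

```python
from fractions import Fraction
from itertools import product

def expected_proportion_heads_after_heads(n):
    """Average, over all equally likely length-n sequences, of the
    within-sequence proportion of heads among flips that follow a head.
    Sequences with no head in the first n-1 flips are skipped because
    the proportion is undefined for them."""
    total = Fraction(0)
    counted = 0
    for seq in product("HT", repeat=n):
        followers = [seq[i + 1] for i in range(n - 1) if seq[i] == "H"]
        if not followers:
            continue
        total += Fraction(followers.count("H"), len(followers))
        counted += 1
    return total / counted

print(expected_proportion_heads_after_heads(3))  # 5/12
print(expected_proportion_heads_after_heads(4))  # 17/42
```

Both values are below 1/2 even though every flip is fair, which is the selection bias the authors describe.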
Germany is putting Tim Gowers's arguments into practice and effectively boycotting Elsevier journals next year, see www.sub.uni-goettingen.de. Elsevier is a particularly dodgy publisher with questionable ethics. Fortunately for me there are exceptionally high-quality free and open access journals like Electronic Transactions on Numerical Analysis.
The dada manifesto
After recently upgrading my cheap laptop to Ubuntu 16.04 I discovered that my old network manager scheme for randomizing hardware MAC addresses broke. I saw that the latest build of network manager introduced a new scheme for doing this, but was unable to get that to work in Ubuntu (see for instance this thread https://bbs.archlinux.org/viewtopic.php?id=213855). Frustrated with making such a simple thing so difficult, I stripped out all that junk from my operating system and now use this trivial script to manually manage my wifi network interface: wifi. The script doesn't cover all connection cases, but less common connections can usually be manually configured without much problem.
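The address-generation half of MAC randomization is simple to sketch. The Python below is not the wifi script mentioned above, and `random_mac` is a hypothetical helper. It builds a random unicast, locally administered address by setting bit 0x02 and clearing bit 0x01 of the first octet. Applying the address to an interface is system-specific and left out.

```python
import random

def random_mac(rng=random):
    """Return a random MAC address string such as '3a:5f:...'.

    The first octet has the locally administered bit (0x02) set and the
    multicast bit (0x01) cleared, so the address is a valid unicast address
    that does not claim to come from a vendor-assigned block.
    """
    first = (rng.randrange(256) | 0x02) & 0xFE
    octets = [first] + [rng.randrange(256) for _ in range(5)]
    return ":".join(f"{o:02x}" for o in octets)

mac = random_mac()
print(mac)
```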
Friends and I are working on some new methods for fast and efficient thresholded correlation, suitable for very large-scale problems: preprint paper and corresponding prototype R package github.com/bwlewis/tcor. Here is an example that walks through correlation of TCGA RNASeq gene expression data: https://github.com/bwlewis/tcor/blob/master/vignettes/brca.Rmd.
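For reference, the thresholded correlation problem is: given many vectors, find every pair whose correlation is at least some threshold t in absolute value. The Python sketch below is the naive baseline that computes all pairwise correlations and filters them. It is not the method in the tcor package, and the function names and toy data are invented for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def thresholded_correlation(columns, t):
    """Brute force: return (i, j, r) for every pair i < j with |r| >= t."""
    out = []
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) >= t:
            out.append((i, j, r))
    return out

columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 1.0, 3.0, 2.0],
    [-1.0, -2.0, -3.0, -4.0],
]
hits = thresholded_correlation(columns, 0.9)
print([(i, j) for i, j, _ in hits])  # [(0, 1), (0, 3), (1, 3)]
```

The cost of this baseline grows with the square of the number of vectors, which is what makes it impractical for very large problems.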
A few folks asked how I install R on Linux for my own use. This is how I do it: r-on-linux.html.
Some "big data" genomics problems are easier than you might think. These R examples walk through PCA and overlap join problems involving genomic variant data from the 1000 Genomes Project: https://github.com/bwlewis/1000_genomes_examples. I recently added some examples that show how to compute PCA across the whole genome here http://bwlewis.github.io/1000_genomes_examples/PCA_whole_genome.html. For instance, using plain old R we can compute PCA across all genomic variants in the 1000 Genomes data on a single Amazon EC2 instance in only about 8 minutes!
I updated the irlba package for R (new version is 2.1.2). The new version includes a convenience 'prcomp'-like function for principal components, and a significant performance boost for most problems. I've got a web page about the method that includes references to vignettes and applications here: https://bwlewis.github.io/irlba/, and a comparison with the cool RSpectra package for R here: https://bwlewis.github.io/irlba/comparison.html. Jim Baglama informs me that the Mathworks has finally adopted the irlba method for use in MATLAB's truncated SVD solver, 'svds'. All I can say is, welcome to the present. (R has had this available in stable form for over 5 years, I guess it’s hard for proprietary software companies to keep up with the times–they have to spend so much effort on marketing after all.)
I was honored to participate again in the 2016 NYR event/party/conference, http://www.rstats.nyc/, and saw many wonderful talks there. Here is a link to a video of my talk: https://youtu.be/PM7O6EGakKY and all the talks: http://www.rstats.nyc/2016.
You should really be reading Andrew Gelman's blog.
Geeks are the new jocks http://www.wsj.com/articles/a-data-scientist-dissects-the-2016-nfl-draft-1461793878
I gave a talk at Kent State on cointegration and its implementation http://illposed.net/cointegration.html.
Mike and I have been writing down our working notes on generalized linear models. Still incomplete and a bit rough, but maybe interesting to somebody... Our focus is on numerics and performance. See http://bwlewis.github.io/GLM and the associated project https://github.com/bwlewis/GLM.
A thoughtful talk by Douglas Adams (1998) http://ia800202.us.archive.org/9/items/biota2_audio/biota_adams.mp3, "The fact that we live at the bottom of a deep gravity well, on the surface of a gas covered planet going around a nuclear fireball 90 million miles away and think this to be normal is obviously some indication of how skewed our perspective tends to be..."
I was invited to give a few short talks at Microsoft's R and Python day in which I urged data scientists to not get so hung up on languages and think more about literature. We also discussed the AzureML package for R; see r_and_python.html and azureml.html.
I added a new function for plotting interactive 3-d force directed graphs to the threejs package for R, also now on CRAN. See http://bwlewis.github.io/rthreejs/graphjs.html for some basic examples. Source code is available here.
A few of my incidental R projects recently needed very fast data compression/decompression. R's gzip implementation is excellent overall. But I wanted a bit more speed, which led to this: https://github.com/bwlewis/lz4. The lz4 method by Yann Collet is extremely fast at decompression, you can read more about it here: lz4.org.
Jedediah Purdy defends Thoreau in the Atlantic from the bitingly provocative article by Kathryn Schulz in the New Yorker. Henry Thoreau is either a "genuine American weirdo" or narcissistic author of "cabin porn" or, perhaps, both. I really enjoyed reading these gems.
I gave a talk at the University of Rhode Island Friday, September 18th called "Math in the Time of Data." Slides: http://illposed.net/uri.html
I gave a talk Wednesday, September 16th at the Boston R meetup on thinking small about big data. Slides are here: boston_rug_sept_2015.html.
Fascinating data from the Department of Education: https://collegescorecard.ed.gov/data/
A really cool negative result: http://dash.harvard.edu/bitstream/handle/1/3043415/imbens_bootstrap.pdf
Three.js plot widgets for R (now on CRAN): https://bwlewis.github.io/rthreejs.
The Interface conference was in Morgantown, WV this year. I gave a contrarian talk, criticizing some "big data" systems and examples out there and encouraging us to think carefully about solving problems before resorting to using those tools. Here are the slides: think_small.html.
I spoke about the many uses of the singular value decomposition in computational finance at the seventh annual R in Finance Conference in Chicago. Only the very coolest people attend this conference. The slides are here: rf2015.html.
I gave a talk at the NY R conference on foraging and visualization. Here is a video of the talk: https://www.youtube.com/watch?v=OXYX1FVlbdI and slides can be found here: http://illposed.net/nycr2015/
A cautionary note on missing value handling in correlation: http://bwlewis.github.io/covar/missing.html
I highly recommend this article by Frank McSherry summarizing his work with Michael Isard and Derek Murray: "Scalability! But at what COST?" Using some moderately large graph problems as examples, they advocate thinking carefully about problems and their solution methods (seems obvious, right?).
If you missed the Bay Area meetup, I'm giving a longer talk on htmlwidgets at the Cleveland R meetup on Wednesday, February 25th (2015). See http://www.meetup.com/Cleveland-UseR-Group/events/220140560/ for more info. Here are the slides from that talk.
I'm giving a short talk tonight (27-Jan-2015) in San Jose at the BARUG on htmlwidgets. The crazy amazing line-up of speakers tonight includes Mike Kane, Dirk Eddelbuettel, and Gabe Becker. Not to be missed!!!
I've been learning about clustering methods recently. Here is a link to a simple hierarchical clustering implementation (<50 lines) that is written only in R to make it easy to understand and experiment with: https://github.com/bwlewis/hclust_in_R. The algorithm used by the native R hclust function in the stats package is far faster, so use that in practice.
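The basic agglomerative idea fits in a few lines of any language. Here is a minimal Python sketch of single-linkage hierarchical clustering. It is not a port of the R code linked above, the function name and sample points are invented, and it makes no attempt at efficiency. Every point starts in its own cluster, and the two closest clusters are merged repeatedly until one remains.

```python
import math

def hclust_single(points):
    """Naive single-linkage agglomerative clustering.

    Returns a list of merges (id_a, id_b, height). Ids 0..n-1 are the
    original points; each merge creates a new cluster with the next id.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for ai in range(len(ids)):
            for bi in range(ai + 1, len(ids)):
                a, b = ids[ai], ids[bi]
                # Single linkage: distance between the closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = hclust_single(points)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

The merge heights are non-decreasing for single linkage, so the merge list can be read directly as a dendrogram from the bottom up.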
I wrote up a trivially simple implementation of and examples illustrating Gene Golub's SVD subset selection algorithm. Mike and I are using it in one of our GLM implementations (see below). But it's a cool method and deserves more attention. See http://bwlewis.github.io/GLM/svdss.html.
This is cool: http://xkcd.r-forge.r-project.org/
I asked some questions about ill-posed problems and regularization at Kent State recently. Here are the slides: http://illposed.net/illposed_ksu_nov_2013.pdf. The slides include a simple R program that applies regularization to stock returns in order to cluster stocks by a relevance network graph.
I gave a talk with Jake VanderPlas about SciDB at PyData 2013 NYC. Here is a link to a Wakari notebook: http://goo.gl/ovGaHS
The Redis client for R was recently updated! R package here on CRAN:
Source code here on GitHub: https://github.com/bwlewis/rredis
And the package vignette (PDF): redis.pdf
Here are some relatively recent papers I really like:
Network analysis via partial spectral factorization and Gauss quadrature
In Search of an Understandable Consensus Algorithm
A Scalable Bootstrap for Massive Data
Quadrature Rule-Based Bounds for Functions of Adjacency Matrices
Augmented Implicitly Restarted Lanczos Bidiagonalization Methods
OK, those last two are not so new, but they're super-cool.
I gave a talk on tips and tricks for performance computing with R at the Cleveland R meet-up on Wednesday, August 7th. Here are the slides: http://goo.gl/gcPezs. Perhaps the most interesting part shows that it's pretty easy to install the commercial but freely available AMD BLAS and LAPACK libraries for R on Windows and Linux.
I gave a talk at the Boston PyData conference (http://pydata.org/) about SciDB-Py -- Jake VanderPlas's new interface between SciDB and Python. The interface defines a numpy/scipy-like array class for Python backed by SciDB arrays. Install the package directly from GitHub with pip install git+ssh://github.com/jakevdp/scidb-py.git.
I've just been reading Patrick Burns' book, http://www.burns-stat.com/documents/books/tao-te-programming/, and really enjoy it.
I gave a talk on SciDB and R and Python at JSM on Sunday, August 4th. Here are the slides: http://goo.gl/A2RPkn.
So you like Python muthaph*kkahz!?! You got it: https://github.com/bwlewis/irlbpy. This is the fastest partial SVD and PCA routine for dense and sparse matrices available in Python. It's restricted right now to real-valued matrices and is still under active development. Mike Kane presents our work at the SciPy Conference next week in Austin June 24--28 (2013).
Whit Armstrong and I ran a seminar on high performance computing with R at the R/Finance conference in May. We emphasized elastic computing using 0MQ and Redis with R, and a bit of parallel linear algebra with SciDB. Here are the slides we used: elastic-r-redis.pdf. 0MQ.distributed.computing.pdf. SciDBR-brief.pdf.
My lightning talk on SciDB and R for the Boston R meetup on 22-Jan-2013: goo.gl/btioG.
I gave a talk at JSM about R and websockets. Here it is:
And, here is a nifty application of websockets and R in quant. finance: http://timelyportfolio.blogspot.com/2012/07/hi-r-and-axys-im-d3js-nice-to-meet-you.html. Here is a silly cool "chat" script for R using websockets (many web clients can share a super basic R session): http://illposed.net/rchat.R.
Joe Cheng over at RStudio has taken over active development of the package.
Slides about the R bigmemory package, parallel linear algebra in R, and a preview of what I'm working on with R and SciDB from a recent talk at the Boston R Meetup: http://illposed.net/boston_r_meetup_2012.pdf
One new idea and one old idea that should be better known on the SVD and cointegration (from a recent talk at R/Finance 2012): Lewis_RFinance_2012.pdf.
A data frame promise for R that very quickly extracts subsets directly from raw delimited text files: lazy.frame.html.
A native HTML 5 Websocket library for R: http://illposed.net/websockets.html
I discussed some methods other than Hadoop for analyzing large data with the New York CTO club. My notes are available here: http://goo.gl/PeJwm.
Outlaw talk: "The Betfair Package" at R/Finance 2011: Applied Finance with R
Betfair is the world's largest betting exchange with more than three million global clients. The BetfaiR package implements the Betfair Sports API in the R language, providing direct access to the Betfair sports exchange from R. All of the Betfair Sports API functions are available, including functions for real time market data and user account access. The package also provides a number of high-level functions for sports betting analysis, modeling and graphics. This was the first talk I ever gave where running the examples live would require breaking the law.
Talk: "How good are Krylov methods for discrete ill-posed problems?," March 25--28 AMS meeting in Lexington, KY: http://www.ms.uky.edu/~corso/amsmaa2010/.
Here are some slides: AMS_Lex_March2010-1.pdf
http://github.com/bwlewis/fls, an implementation of Kalaba-Tesfatsion flexible least squares method for R.
R4P, an R library for Processing.
Ratlab, tools for foolin' with R and Octave (or Matlab) together.
http://etna.math.kent.edu/vol.30.2008/pp128-143.dir/zeros/index.html A newer Java applet illustrating the dynamical motion of the zeros of the partial sums of the exponential function (from work with Richard Varga and Amos Carpenter).