I'm a research scientist at Google DeepMind. We're on a mission to solve artificial general intelligence. My own research interests span artificial intelligence, machine learning and computational neuroscience.
As a PhD student at the Center for Theoretical Neuroscience at Columbia, I worked on algorithms for analyzing and understanding high-dimensional data from neural recordings with Liam Paninski and nonparametric Bayesian methods for predicting time series data with Frank Wood. Prior to joining DeepMind I also consulted for Qadium, working on Data Microscopes, an open source library of fast, modular nonparametric Bayesian models.
We introduce a method to stabilize Generative Adversarial Networks (GANs) by defining the generator objective with respect to an unrolled optimization of the discriminator. This allows training to be adjusted between using the optimal discriminator in the generator’s objective, which is ideal but infeasible in practice, and using the current value of the discriminator, which is often unstable and leads to poor solutions. We show how this technique solves the common problem of mode collapse, stabilizes training of GANs with complex recurrent generators, and increases diversity and coverage of the data distribution by the generator.
Generative adversarial networks (GANs) have become popular in the world of deep unsupervised learning recently, but are notorious for being hard to optimize. This may be because the model consists of two neural networks, each of which is being optimized relative to the current state of the other one, meaning each network is trying to hit a moving target. We describe a practical method for optimizing one of these networks with respect to where the other will be in the future instead of where it is now, which we hope prevents some of the pathologies common to training GANs.
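To make the "where the other will be in the future" idea concrete, here is a minimal sketch on a toy problem. It is not the paper's setup: the two networks are replaced by two scalars g and d, the objective is assumed to be the bilinear game f(g, d) = g * d (g minimizes, d maximizes), and the function names, learning rates and step counts are all illustrative choices. The derivative through the unrolled steps is written out by hand rather than taken with an autodiff library.

```python
import math

def grad_g_unrolled(g, d, unroll_steps, inner_lr):
    """Gradient of f(g, d_K(g)) w.r.t. g for the toy game f(g, d) = g * d.

    d_K is d after `unroll_steps` gradient-ascent steps, and dd_dg tracks
    how d_K depends on g (a hand-written forward-mode derivative).
    """
    d_k, dd_dg = d, 0.0
    for _ in range(unroll_steps):
        # Ascent step on d: df/dd = g, so d_k moves by inner_lr * g.
        d_k += inner_lr * g
        dd_dg += inner_lr
    # Chain rule: df/dg + df/dd * dd_K/dg, with df/dg = d_K and df/dd = g.
    return d_k + g * dd_dg

def train(unroll_steps, steps=200, lr=0.1, inner_lr=0.1):
    """Run alternating updates and return the distance to the equilibrium (0, 0)."""
    g, d = 1.0, 1.0
    for _ in range(steps):
        g_grad = grad_g_unrolled(g, d, unroll_steps, inner_lr)
        d_grad = g  # d still takes an ordinary (non-unrolled) ascent step
        g, d = g - lr * g_grad, d + lr * d_grad
    return math.hypot(g, d)

dist_standard = train(unroll_steps=0)  # unroll_steps=0 is plain simultaneous updates
dist_unrolled = train(unroll_steps=5)
print(dist_standard, dist_unrolled)
```

In this toy game the plain updates spiral away from the equilibrium, while the unrolled generator gradient picks up an extra damping term and the iterates spiral inward. Whether the same happens for real networks is an empirical question that this sketch does not address.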
@InProceedings{metz2017unrolled,
title={Unrolled Generative Adversarial Networks},
author={Metz, Luke and Poole, Ben and Pfau, David and Sohl-Dickstein, Jascha},
booktitle={5th International Conference on Learning Representations},
year={2017}
}
Both generative adversarial networks (GAN) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being difficult to optimize. Practitioners in both fields have amassed a large number of strategies to mitigate these instabilities and improve training. Here we show that GANs can be viewed as actor-critic methods in an environment where the actor cannot affect the reward. We review the strategies for stabilizing training for each class of models, both those that generalize between the two and those that are particular to that model. We also review a number of extensions to GANs and RL algorithms with even more complicated information flow. We hope that by highlighting this formal connection we will encourage both GAN and RL communities to develop general, scalable, and stable algorithms for multilevel optimization with deep networks, and to draw inspiration across communities.
Generative adversarial networks have become popular in the world of deep unsupervised learning recently, but are notorious for being hard to optimize. Actor-critic methods in reinforcement learning have much the same reputation. We show that the two methods are actually very closely related, and review strategies used in both communities to improve the stability of training and diversity of samples, in the hopes of encouraging cross-pollination between the fields.
@InProceedings{pfau2016connecting,
title={Connecting Generative Adversarial Networks and Actor-Critic Methods},
author={Pfau, David and Vinyals, Oriol},
booktitle={NIPS Workshop on Adversarial Training},
year={2016}
}
The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
A lot of the success of deep learning has been in showing that features in domains like computer vision that had been hand-designed could be learned instead. Learning itself is mostly done with hand-designed optimization algorithms, however. This paper attempts to apply the successes of deep learning at the meta-level, to the optimization algorithms used to train the deep networks themselves. In other words: "Yo dawg, I heard you like optimizers, so I put a deep network in your deep network so you can learn while you learn."
@InProceedings{andrychowicz2016learning,
title = {Learning to Learn by Gradient Descent by Gradient Descent},
author = {Andrychowicz, Marcin and Denil, Misha and Gomez, Sergio and Hoffman, Matthew W and Pfau, David and Schaul, Tom and de Freitas, Nando},
booktitle = {Advances in Neural Information Processing Systems},
year = {2016}
}
In this work we introduce a differentiable version of the Compositional Pattern Producing Network, called the DPPN. Unlike a standard CPPN, the topology of a DPPN is evolved but the weights are learned. A Lamarckian algorithm that combines evolution and learning produces DPPNs to reconstruct an image. Our main result is that DPPNs can be evolved/trained to compress the weights of a denoising autoencoder from 157684 to roughly 200 parameters, while achieving a reconstruction accuracy comparable to a fully connected network with more than two orders of magnitude more parameters. The regularization ability of the DPPN allows it to rediscover (approximate) convolutional network architectures embedded within a fully connected architecture. Such convolutional architectures are the current state of the art for many computer vision applications, so it is satisfying that DPPNs are capable of discovering this structure rather than having to build it in by design. DPPNs exhibit better generalization than directly encoded fully connected autoencoders when tested on the Omniglot dataset after being trained on MNIST. DPPNs are therefore a new framework for integrating learning and evolution.
Evolutionary computing is a type of stochastic search that randomly changes parameters in a model and keeps around the models that score the highest. While this search is crude, a type of model called "compositional pattern producing networks" can generate interesting images, and can even fool state-of-the-art computer vision algorithms. We show that a mix of gradient descent and stochastic search works better for training compositional pattern producing networks to produce the parameters of a neural network than stochastic search alone. The neural network parameters that are learned have a structure somewhat similar to convolutions, which are a type of invariance normally built into neural networks by hand. This suggests that important invariances could possibly be discovered instead of being hand-encoded into models.
@InProceedings{fernando2016convolution,
title = {Convolution by Evolution: Differentiable Pattern Producing Networks},
author = {Fernando, Chrisantha and Banarse, Dylan and Reynolds, Malcolm and Besse, Frederic and Pfau, David and Jaderberg, Max and Lanctot, Marc and Wierstra, Daan},
booktitle = {The Genetic and Evolutionary Computation Conference},
year = {2016}
}
We present a modular approach for analyzing calcium imaging recordings of large neuronal ensembles. Our goal is to simultaneously identify the locations of the neurons, demix spatially overlapping components, and denoise and deconvolve the spiking activity from the slow dynamics of the calcium indicator. Our approach relies on a constrained nonnegative matrix factorization that expresses the spatiotemporal fluorescence activity as the product of a spatial matrix that encodes the spatial footprint of each neuron in the optical field and a temporal matrix that characterizes the calcium concentration of each neuron over time. This framework is combined with a novel constrained deconvolution approach that extracts estimates of neural activity from fluorescence traces, to create a spatiotemporal processing algorithm that requires minimal parameter tuning. We demonstrate the general applicability of our method by applying it to in vitro and in vivo multi-neuronal imaging data, whole-brain light-sheet imaging data, and dendritic imaging data.
Calcium imaging is a powerful class of experimental techniques that allow us to image from hundreds to thousands of neurons simultaneously in living animals. However, the information we really care about - which neuron is spiking when - is mixed together in complex ways in the raw video data. This paper presents a new statistical method that can simultaneously identify where neurons are, unmix the signals from overlapping neurons, and infer when a spike is occurring from noisy data, potentially saving experimenters a lot of time and energy.
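The factorization described above can be illustrated with a small forward model. This is only a sketch of the generative picture (movie = spatial matrix times temporal matrix), not the constrained fitting procedure from the paper. The first-order autoregressive calcium dynamics, the decay constant, the footprint values and all function names are assumptions made for illustration, and the "deconvolution" here only inverts the noiseless model.

```python
import random

def calcium_trace(spikes, decay):
    """Assumed AR(1) calcium model: c[t] = decay * c[t-1] + spikes[t]."""
    trace, c = [], 0.0
    for s in spikes:
        c = decay * c + s
        trace.append(c)
    return trace

def deconvolve(trace, decay):
    """Invert the AR(1) model exactly (only valid for noiseless traces)."""
    prev, spikes = 0.0, []
    for c in trace:
        spikes.append(c - decay * prev)
        prev = c
    return spikes

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

rng = random.Random(0)
n_frames, decay = 50, 0.9

# Two neurons with sparse binary spike trains.
spikes = [[1.0 if rng.random() < 0.1 else 0.0 for _ in range(n_frames)]
          for _ in range(2)]
C = [calcium_trace(s, decay) for s in spikes]  # temporal matrix, neurons x frames

# Spatial footprints over 4 pixels; pixel 1 is shared by both neurons.
A = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 0.8]]  # pixels x neurons

Y = matmul(A, C)  # noiseless fluorescence movie, pixels x frames
recovered = [deconvolve(c, decay) for c in C]
print(len(Y), len(Y[0]))
```

The shared pixel shows why demixing is needed: its fluorescence is a blend of both neurons' calcium traces, so neither spike train can be read off from that pixel alone.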
@Article{pnevmatikakis2016simultaneous,
title={Simultaneous denoising, deconvolution, and demixing of calcium imaging data},
author={Pnevmatikakis, Eftychios A and Soudry, Daniel and Gao, Yuanjun and Machado, Timothy A and Merel, Josh and Pfau, David and Reardon, Thomas and Mu, Yu and Lacefield, Clay and Yang, Weijian and others},
journal={Neuron},
volume={89},
number={2},
pages={285--299},
year={2016},
publisher={Elsevier}
}
Making intelligent decisions from incomplete information is critical in many applications: for example, robots must choose actions based on imperfect sensors, and speech-based interfaces must infer a user’s needs from noisy microphone inputs. What makes these tasks hard is that often we do not have a natural representation with which to model the domain and use for choosing actions; we must learn about the domain’s properties while simultaneously performing the task. Learning a representation also involves trade-offs between modeling the data that we have seen previously and being able to make predictions about new data. This article explores learning representations of stochastic systems using Bayesian nonparametric statistics. Bayesian nonparametric methods allow the sophistication of a representation to scale gracefully with the complexity in the data. Our main contribution is a careful empirical evaluation of how representations learned using Bayesian nonparametric methods compare to other standard learning approaches, especially in support of planning and control. We show that the Bayesian aspects of the methods result in achieving state-of-the-art performance in decision making with relatively few samples, while the nonparametric aspects often result in fewer computations. These results hold across a variety of different techniques for choosing actions given a representation.
Is it possible for an agent to learn the structure of the world while learning how to act optimally in the world if it isn't able to see everything about the world all at once? We certainly hope so, or artificial intelligence may not be possible. We use a number of techniques for learning structure from time series in a Bayesian nonparametric way, including the Probabilistic Deterministic Infinite Automata (PDIA), to try to address this question. On small problems some of the methods tried do in fact recover the true structure of the world. Not the PDIA, sadly.
@article{doshi2015bayesian,
title={Bayesian nonparametric methods for partially-observable reinforcement learning},
author={Doshi-Velez, Finale and Pfau, David and Wood, Frank and Roy, Nicholas},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume={37},
number={2},
pages={394--407},
year={2015},
publisher={IEEE}
}
Advances in neuroscience are producing data at an astounding rate - data which are fiendishly complex both to process and to interpret. Biological neural networks are high-dimensional, nonlinear, noisy, heterogeneous, and in nearly every way defy the simplifying assumptions of standard statistical methods. In this dissertation we address a number of issues with understanding the structure of neural populations, from the abstract level of how to uncover structure in generic time series, to the practical matter of finding relevant biological structure in state-of-the-art experimental techniques. To learn the structure of generic time series, we develop a new statistical model, which we dub the probabilistic deterministic infinite automata (PDIA), which uses tools from nonparametric Bayesian inference to learn a very general class of sequence models. We show that the models learned by the PDIA often offer better predictive performance and faster inference than Hidden Markov Models, while being significantly more compact than models that simply memorize contexts. For large populations of neurons, models like the PDIA become unwieldy, and we instead investigate ways to robustly reduce the dimensionality of the data. In particular, we adapt the generalized linear model (GLM) framework for regression to the case of matrix completion, which we call the low-dimensional GLM. We show that subspaces and dynamics of neural activity can be accurately recovered from model data, and that the model, with only minimal assumptions about the structure of the dynamics, can still achieve good predictive performance on real data. Finally, to bridge the gap between recording technology and analysis, particularly as recordings from ever-larger populations of neurons become the norm, automated methods for extracting activity from raw recordings become a necessity. We present a number of methods for automatically segmenting biological units from optical imaging data, with applications to light sheet recording of genetically encoded calcium indicator fluorescence in the larval zebrafish, and optical electrophysiology using genetically encoded voltage indicators in culture. Together, these methods are a powerful set of tools for addressing the diverse challenges of modern neuroscience.
Six years of my life compressed into 150-odd pages. Most of Chapters 2 and 3 had been published at NIPS already, but some of the material included has not been published elsewhere. Chapter 1 gives a good summary of the role of information theory in neuroscience and provides some of the motivation for the work in Chapter 2 on time series models. Chapter 2 includes experiments with the PDIA on neuroscience data that have not been published elsewhere, showing that we can learn long-range dependencies in data better than a GLM (at least for data where the observations are binary). Chapter 4 provides a number of experiments in processing calcium imaging data that eventually led to work published in Neuron, but in quite a different form from what's presented here.
@phdthesis{pfau2015learning,
author = {Pfau, David},
title = {Learning Structure in Time Series for Neuroscience and Beyond},
school = {Columbia University},
year = 2015,
month = 2,
}
We present a structured matrix factorization approach to analyzing calcium imaging recordings of large neuronal ensembles. Our goal is to simultaneously identify the locations of the neurons, demix spatially overlapping components, and denoise and deconvolve the spiking activity of each neuron from the slow dynamics of the calcium indicator. The matrix factorization approach relies on the observation that the spatiotemporal fluorescence activity can be expressed as a product of two matrices: a spatial matrix that encodes the location of each neuron in the optical field and a temporal matrix that characterizes the calcium concentration of each neuron over time. We present a simple approach for estimating the dynamics of the calcium indicator as well as the observation noise statistics from the observed data. These parameters are then used to set up the matrix factorization problem in a constrained form that requires no further parameter tuning. We discuss initialization and post-processing techniques that enhance the performance of our method, along with efficient and largely parallelizable algorithms. We apply our method to in vivo large scale multi-neuronal imaging data and also demonstrate how similar methods can be used for the analysis of in vivo dendritic imaging data.
A preliminary version of our work on processing calcium imaging data, which was later published in Neuron.
@Article{pnevmatikakis2014structured,
title={A structured matrix factorization framework for large scale calcium imaging data analysis},
author={Pnevmatikakis, Eftychios A and Gao, Yuanjun and Soudry, Daniel and Pfau, David and Lacefield, Clay and Poskanzer, Kira and Bruno, Randy and Yuste, Rafael and Paninski, Liam},
journal={arXiv preprint arXiv:1409.2903},
year={2014}
}
Recordings from large populations of neurons make it possible to search for hypothesized low-dimensional dynamics. Finding these dynamics requires models that take into account biophysical constraints and can be fit efficiently and robustly. Here, we present an approach to dimensionality reduction for neural data that is convex, does not make strong assumptions about dynamics, does not require averaging over many trials and is extensible to more complex statistical models that combine local and global influences. The results can be combined with spectral methods to learn dynamical systems models. The basic method extends PCA to the exponential family using nuclear norm minimization. We evaluate the effectiveness of this method using an exact decomposition of the Bregman divergence that is analogous to variance explained for PCA. We show on model data that the parameters of latent linear dynamical systems can be recovered, and that even if the dynamics are not stationary we can still recover the true latent subspace. We also demonstrate an extension of nuclear norm minimization that can separate sparse local connections from global latent dynamics. Finally, we demonstrate improved prediction on real neural data from monkey motor cortex compared to fitting linear dynamical models without nuclear norm smoothing.
New technologies make it possible to record from massive populations of neurons, but making that data interpretable is challenging. Dimensionality reduction is one approach, which looks for a few factors in the data which account for most of the variability. State space models extend dimensionality reduction by modeling dynamics in the low-dimensional space of factors, as well as allowing for more complex models of noise that are more appropriate for neural data. In the machine learning community, state space models are typically fit with methods like expectation-maximization, which may be sensitive to the choice of initialization. We show here that state space models that are of interest to neuroscientists can also be fit using techniques from convex optimization - in particular techniques from the matrix completion and system identification community.
@InProceedings{pfau2013robust,
title={Robust learning of low-dimensional dynamics from large neural ensembles},
author={Pfau, David and Pnevmatikakis, Eftychios A and Paninski, Liam},
booktitle={Advances in Neural Information Processing Systems},
pages={2391--2399},
year={2013}
}
The opacity of typical objects in the world results in occlusion, an important property of natural scenes that makes inference of the full three-dimensional structure of the world challenging. The relationship between occlusion and low-level image statistics has been hotly debated in the literature, and extensive simulations have been used to determine whether occlusion is responsible for the ubiquitously observed power-law power spectra of natural images. To deepen our understanding of this problem, we have analytically computed the two- and four-point functions of a generalized “dead leaves” model of natural images with parameterized object transparency. Surprisingly, transparency alters these functions only by a multiplicative constant, so long as object diameters follow a power-law distribution. For other object size distributions, transparency more substantially affects the low-level image statistics. We propose that the universality of power-law power spectra for both natural scenes and radiological medical images, formed by the transmission of x-rays through partially transparent tissue, stems from power-law object size distributions, independent of object opacity.
If you compute the correlation between pixels in an image as a function of distance between pixels, a common statistical distribution emerges across nearly all images. The reason for this distribution was not clear - one camp held that it was due to the presence of objects of many different sizes in an image, while another held that it was caused by sharp edges. We show conclusively that the former camp is correct by analytically calculating the correlations in a model of natural images that factors in transparency. We show that changing the transparency of objects in the model does not change the correlation structure, but changing the distribution of sizes in the model does. Thus object sizes, not edges, lead to the complex correlations in nearly all natural images.
@Article{zylberberg2012dead,
title={Dead leaves and the dirty ground: Low-level image statistics in transmissive and occlusive imaging environments},
author={Zylberberg, Joel and Pfau, David and DeWeese, Michael Robert},
journal={Physical Review E},
volume={86},
number={6},
pages={066112},
year={2012},
publisher={APS}
}
A major goal for brain-machine interfaces is to allow patients to control prosthetic devices with many degrees of independent movement. Devices such as robotic arms and hands require this high dimensionality of control to restore the full range of actions exhibited in natural movement. Current BMI strategies fall well short of this goal, allowing the control of only a few degrees of freedom at a time. In this paper we present work towards the decoding of 27 joint angles from the shoulder, arm and hand as subjects perform reach and grasp movements. We also extend previous work in examining and optimizing the recording depth of electrodes to maximize the movement information that can be extracted from recorded neural signals.
One of the great potential applications of neural decoding is in neural prosthetics - potentially granting locked-in patients the ability to move again. Neural decoding has demonstrated the ability to decode monkey reaching in 2 or 3 dimensions, but natural motion is far more complex than that. We showed that baseline algorithms from the neural prosthetics community could scale to controlling a virtual limb with dozens of degrees of freedom, opening the way to more realistic and rich movements from brain-machine interfaces.
@InProceedings{wong2012decoding,
title={Decoding arm and hand movements across layers of the macaque frontal cortices},
author={Wong, Yan T and Vigeral, Mariana and Putrino, David and Pfau, David and Merel, Josh and Paninski, Liam and Pesaran, Bijan},
booktitle={2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society},
pages={1757--1760},
year={2012},
organization={IEEE}
}
We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models.
Probabilistic deterministic finite automata (PDFA) are a class of probabilistic models for sequences - they assign a probability to every possible sequence, like a string of text. They fall in between hidden Markov models and n-gram models in complexity. Like n-gram models, inference is fast and cheap as there is no uncertainty about what the context is. Like hidden Markov models, complex dependencies in how states transition can be learned - transitions which cannot be learned by an n-gram model no matter how long the context is. We develop a nonparametric Bayesian way of learning PDFA and show it can recover the true structure of artificial grammars that psychologists used to study human sequence learning, as well as learning very compact models of text and DNA.
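As a rough illustration of why inference in a PDFA is cheap, here is a hand-built two-state automaton (not one learned by the PDIA). The alphabet, emission probabilities and transition table are all made up for this example. The state tracks the parity of the number of "b" symbols seen so far, a dependency that a fixed-length context cannot capture because arbitrarily many "a" symbols can intervene.

```python
import itertools
import math

# Hypothetical two-state PDFA over the alphabet {"a", "b"}.
# EMIT[state][symbol] is the probability of emitting `symbol` in `state`.
EMIT = {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.3, "b": 0.7}}
# NEXT[(state, symbol)] is the single, deterministic successor state:
# "a" keeps the state, "b" flips it, so the state is the parity of b's.
NEXT = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0}

def log_prob(sequence, start_state=0):
    """Log-probability of a sequence; the state path is known with certainty."""
    state, total = start_state, 0.0
    for symbol in sequence:
        total += math.log(EMIT[state][symbol])
        state = NEXT[(state, symbol)]
    return total

# Sanity check: probabilities of all length-4 sequences sum to one.
total_mass = sum(math.exp(log_prob(seq))
                 for seq in itertools.product("ab", repeat=4))
print(math.exp(log_prob("ab")), total_mass)
```

Scoring a sequence is a single pass with one table lookup per symbol, with no sum over hidden state paths as in a hidden Markov model.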
@InProceedings{pfau2010probabilistic,
title={Probabilistic deterministic infinite automata},
author={Pfau, David and Bartlett, Nicholas and Wood, Frank},
booktitle={Advances in Neural Information Processing Systems},
pages={1930--1938},
year={2010}
}
We propose a novel dependent hierarchical Pitman-Yor process model for discrete data. An incremental Monte Carlo inference procedure for this model is developed. We show that inference in this model can be performed in constant space and linear time. The model is demonstrated in a discrete sequence prediction task where it is shown to achieve state of the art sequence prediction performance while using significantly less memory.
The sequence memoizer is a powerful probabilistic model of sequential data like text. One downside of the sequence memoizer is that it grows linearly in memory with the amount of data. We show that by forgetting intelligently, a constant-memory sequence memoizer performs comparably to the original linear-memory algorithm.
@InProceedings{bartlett2010forgetting,
title={Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process},
author={Bartlett, Nicholas and Pfau, David and Wood, Frank},
booktitle={Proceedings of the 27th International Conference on Machine Learning (ICML-10)},
pages={63--70},
year={2010}
}
Under natural viewing conditions, our eyes alternate between saccadic movement and fixation. However, even during fixation there are constant small movements, which can be decomposed into miniature saccades and diffusion-like random eye movements. Some diffusion helps prevent adaptation to a particular stimulus, but diffusion also blurs the image of the world across the retina. Despite this, humans can resolve fine spatial detail very well, and this diffusion may even enhance the ability to distinguish high-frequency components of an image [1]. This suggests that the brain compensates for fixational eye diffusion and may even extract useful information from it. To investigate the effect of eye diffusion on image reconstruction, we extended a generalized linear model (GLM) of retinal encoding/decoding to incorporate random-walk drift of the image falling on the retina. GLMs have been successfully applied to modeling a range of neural systems, including retinal ganglion cells [2]. Previously developed GLMs of the retina, directly estimated from spiking data, generate simulated network spike trains with the correct spatiotemporal filtering and correlation structure. Finally, given this network spiking encoding model and a statistical model of the spatiotemporal visual inputs, there is a natural Bayesian method for decoding the response [3]. For our model incorporating fixational eye diffusion, the decoding model would assign a probability to all possible random walks the image could take. However, the number of possible paths grows exponentially with time, making this method computationally intractable. Instead, we approximate the posterior distribution of images given the observed spikes as a mixture of Gaussians, and track the diffusive movements of the mixture components by a particle filtering approximation. This method is both computationally tractable and effective at reconstructing the encoded image. Preliminary results show that the image reconstruction is poor at both very low and very high diffusion rates, while reconstruction works reasonably well at intermediate diffusion rates. Thus, a well-defined optimal diffusion rate exists, and in general depends on statistical properties of both the stimulus and the retinal spatiotemporal receptive fields, such as the strength of the sustained response component and whether the transient component lasts longer than the persistence time of the eye movements. We are currently pursuing quantitative comparisons to the real diffusion coefficient during head-fixed viewing.
References
[1] Miniature eye movements enhance fine spatial detail. M. Rucci, R. Iovin, M. Poletti, and F. Santini, Nature 447(7146):851-854, 2007.
[2] Spatio-temporal correlations and visual signalling in a complete neuronal population. J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli, Nature 454(7202):995-999, 2008.
[3] Model-based decoding, information estimation, and change-point detection in multi-neuron spike trains. J. W. Pillow and L. Paninski, 2008.
Even when staring fixedly at an object, our eyes are moving in subtle ways, yet the world appears fixed to us. Somehow, our brain must be compensating for these random movements of our eyes to create a coherent and stable perception of the world. We developed a Bayesian method to decode both the content of a scene and the motion of the eye simultaneously from a model of the signal the optic nerve sends to the brain. We also derived the optimal amount of eye movement given this model, but found it was ten times smaller than the actual amount of movement, suggesting other factors are in play.
@InProceedings{pfau2009bayesian,
title={A Bayesian method to predict the optimal diffusion coefficient in random fixational eye movements},
author={Pfau, David and Pitkow, Xaq and Paninski, Liam},
booktitle={Conference abstract: Computational and systems neuroscience},
year={2009}
}
Data from neuroscience is fiendishly complex. Neurons exhibit correlations on very long timescales and across large populations, and the activity of individual neurons is difficult to extract from noisy experimental data. I will present work on several projects to address these issues, both abstract and applied. First I will discuss the Probabilistic Deterministic Infinite Automata (PDIA), a nonparametric model of discrete sequences such as natural language or neural spiking. The PDIA explicitly enumerates latent states that are predictive of the future, and by using a Hierarchical Dirichlet Process prior can learn arbitrary transitions between those states. The model class learned by the PDIA is smaller than hidden Markov models but yields superior predictive performance on data with strong history dependence, like text. One weakness of the PDIA is that it is hard to scale when the space of possible observations is very large, as is the case with large populations of neurons. In this limit we are instead interested in reducing the dimensionality of data, and I will present work on unifying the generalized linear model (GLM) framework in neuroscience with dimensionality reduction. The resulting models can be efficiently learned using convex techniques from the matrix completion literature, and can be combined with spectral methods to learn surprisingly accurate models of the dynamics of real neural data. To apply these models to the kinds of high-dimensional neural data now becoming available, we have to bridge the gap between raw data and units of neural activity. I will present joint work with Misha Ahrens and Jeremy Freeman on extracting neural activity from whole-brain recordings in larval zebrafish, as a step towards the long-term goal of making dynamics modeling a daily part of the data analysis routine in neuroscience.
You can find my personal GitHub here. Notable projects include a collection of methods for learning state space models for neuroscience data, some of which have been integrated into the pop_spik_dyn package, a Matlab implementation of Learning Recurrent Neural Networks with Hessian-Free Optimization, and the Java implementation of the Probabilistic Deterministic Infinite Automata used in our paper. For those interested in probabilistic programming, I have also provided a PDIA implementation in WebChurch. I also contributed a C++ implementation of Beam Sampling for the Infinite Hidden Markov Model to the Data Microscopes project. At a factor of 40 faster than existing Matlab code, it's likely the fastest beam sampler for the iHMM in the world.
Not everything makes it into a paper, but that doesn't mean it's not important. You can find short notes and other writings that don't have a home elsewhere here.
A simple result that I haven't seen published elsewhere. Other research on generalized bias-variance decompositions historically has focused on 0-1 loss and is relevant to classification and boosting. In probabilistic modeling, error is measured through log probabilities instead of classification accuracy, often with distributions in the exponential family. Exponential family likelihoods and Bregman divergences are closely related, and it turns out it's straightforward to generalize the bias-variance decomposition for squared error to all Bregman divergences.
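One way to write such a decomposition, which may differ in notation from the note itself: for a Bregman divergence D with generator F, a label y and an independent prediction p, the expected loss E[D(y, p)] splits into noise E[D(y, E[y])], bias D(E[y], p*) and variance E[D(p*, p)], where p* is the "dual mean" defined by grad F(p*) = E[grad F(p)]. The sketch below checks this numerically for one assumed choice, the generalized KL divergence on positive scalars, using made-up samples. For squared error, F(x) = x^2 makes p* the ordinary mean and recovers the familiar decomposition.

```python
import math
import random

def bregman(p, q):
    """Generalized KL divergence: the Bregman divergence of F(x) = x*log(x) - x."""
    return p * math.log(p / q) - p + q

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)
# Made-up samples: labels and predictions are drawn independently.
targets = [rng.uniform(0.5, 2.0) for _ in range(30)]
predictions = [rng.uniform(0.5, 2.0) for _ in range(40)]

y_bar = mean(targets)
# Dual mean: grad F(x) = log(x), so average in log space and map back with exp.
pred_star = math.exp(mean(math.log(p) for p in predictions))

# Averaging over every (label, prediction) pair treats them as independent.
expected_loss = mean(bregman(y, p) for y in targets for p in predictions)
noise = mean(bregman(y, y_bar) for y in targets)
bias = bregman(y_bar, pred_star)
variance = mean(bregman(pred_star, p) for p in predictions)

print(expected_loss, noise + bias + variance)
```

The two printed numbers agree up to floating-point error. Note that the variance term is centered on the geometric mean of the predictions here, not the arithmetic mean, which is the main difference from the squared-error case.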