I'm a senior research scientist at DeepMind. We're on a mission to solve artificial general intelligence. My own research interests span artificial intelligence, machine learning and computational science.
As a PhD student at the Center for Theoretical Neuroscience at Columbia, I worked with Liam Paninski on algorithms for analyzing and understanding high-dimensional data from neural recordings, and with Frank Wood on nonparametric Bayesian methods for predicting time series data.
Current research interests include applications of machine learning to computational physics and connections between group theory and "disentangling" in unsupervised learning.
How should a machine intelligence perform unsupervised structure discovery over streams of sensory input? One approach to this problem is to cast it as an apperception task. Here, the task is to construct an explicit interpretable theory that both explains the sensory sequence and also satisfies a set of unity conditions, designed to ensure that the constituents of the theory are connected in a relational structure.
However, the original formulation of the apperception task had one fundamental limitation: it assumed the raw sensory input had already been parsed using a set of discrete categories, so that all the system had to do was receive this already-digested symbolic input, and make sense of it. But what if we don't have access to pre-parsed input? What if our sensory sequence is raw unprocessed information?
The central contribution of this paper is a neuro-symbolic framework for distilling interpretable theories out of streams of raw, unprocessed sensory experience. First, we extend the definition of the apperception task to include ambiguous (but still symbolic) input: sequences of sets of disjunctions. Next, we use a neural network to map raw sensory input to disjunctive input. Our binary neural network is encoded as a logic program, so the weights of the network and the rules of the theory can be solved jointly as a single SAT problem. This way, we are able to jointly learn how to perceive (mapping raw sensory information to concepts) and apperceive (combining concepts into declarative rules).
The world around us consists of both objects and rules governing how those objects behave. For many simple model worlds, like games, the rules describing how objects interact are quite simple and can be easily written down in formal logic. But in many settings in artificial intelligence, we do not have access to either the rules or knowledge of the individual objects - all we have is an undifferentiated stream of sensory input. This work combines deep learning methods popular for perception with tools from logic programming to create a system, called the Apperception Engine, which can simultaneously learn to recognize simple objects and learn the rules that govern interactions between those objects, for simple domains like the game Sokoban.
@article{evans2021making,
title={Making Sense of Raw Input},
author={Evans, Richard and Bošnjak, Matko and Buesing, Lars and Ellis, Kevin and Pfau, David and Kohli, Pushmeet and Sergot, Marek},
journal={Artificial Intelligence},
year={2021},
volume={299},
pages={103521},
doi = {10.1016/j.artint.2021.103521},
url = {https://www.sciencedirect.com/science/article/pii/S0004370221000722}
}
We introduce a method for reconstructing an infinitesimal normalizing flow given only an infinitesimal change to a (possibly unnormalized) probability distribution. This reverses the conventional task of normalizing flows -- rather than being given samples from an unknown target distribution and learning a flow that approximates the distribution, we are given a perturbation to an initial distribution and aim to reconstruct a flow that would generate samples from the known perturbed distribution. While this is an underdetermined problem, we find that choosing the flow to be an integrable vector field yields a solution closely related to electrostatics, and a solution can be computed by the method of Green's functions. Unlike conventional normalizing flows, this flow can be represented in an entirely nonparametric manner. We validate this derivation on low-dimensional problems, and discuss potential applications to problems in quantum Monte Carlo and machine learning.
For many problems in machine learning and computational physics, an optimization problem and sampling problem are coupled together. The optimization depends on the sampler reaching equilibrium, and the sampler has to re-run every iteration as the optimization changes the target equilibrium. It would be very convenient if it were possible to update the sampler based on previous iterations, instead of restarting an MCMC algorithm with no knowledge of past steps. We derive a deterministic update for samples from a distribution when given the change to the log probability of the (unnormalized) distribution. Essentially, we calculate the "electric field" created by a set of samples, where the charge is the change in the probability distribution. Mathematically, this has close connections to Neural ODEs, Stein Variational Gradient Descent, and the Fokker-Planck equation. Unfortunately, the update seems to suffer quite badly from the curse of dimensionality, meaning its application to real problems is uncertain.
@article{pfau2020integrable,
title={Integrable Nonparametric Flows},
author={Pfau, David and Rezende, Danilo},
journal={arXiv preprint arXiv:2012.02035},
year={2020}
}
The Fermionic Neural Network (FermiNet) is a recently-developed neural network architecture that can be used as a wavefunction Ansatz for many-electron systems, and has already demonstrated high accuracy on small systems. Here we present several improvements to the FermiNet that allow us to set new records for speed and accuracy on challenging systems. We find that increasing the size of the network is sufficient to reach chemical accuracy on atoms as large as argon. Through a combination of implementing FermiNet in JAX and simplifying several parts of the network, we are able to reduce the number of GPU hours needed to train the FermiNet on large systems by an order of magnitude. This enables us to run the FermiNet on the challenging transition of bicyclobutane to butadiene and compare against the PauliNet on the automerization of cyclobutadiene, and we achieve results near the state of the art for both.
The FermiNet, which we introduced in a paper earlier in the same year, is a neural network that can represent wavefunctions of many-electron systems. This makes it possible to solve for the energies of chemical systems from first principles to very high accuracy. But scaling this method is very difficult. We show that by switching from TensorFlow to JAX and stripping out a few features from the network that weren't really doing anything, we can run calculations much faster. This allows us to train bigger networks that reach higher accuracy on systems like larger atoms, and run many calculations in parallel, like different possible transition states from the same chemical reaction.
@article{spencer2020better,
title={Better, Faster Fermionic Neural Networks},
author={Spencer, James S. and Pfau, David and Botev, Aleksandar and Foulkes, W. M. C.},
journal={arXiv preprint arXiv:2011.07125},
year={2020}
}
We present a novel nonparametric algorithm for symmetry-based disentangling of data manifolds, the Geometric Manifold Component Estimator (GEOMANCER). GEOMANCER provides a partial answer to the question posed by Higgins et al. (2018): is it possible to learn how to factorize a Lie group solely from observations of the orbit of an object it acts on? We show that fully unsupervised factorization of a data manifold is possible if the true metric of the manifold is known and each factor manifold has nontrivial holonomy -- for example, rotation in 3D. Our algorithm works by estimating the subspaces that are invariant under random walk diffusion, giving an approximation to the de Rham decomposition from differential geometry. We demonstrate the efficacy of GEOMANCER on several complex synthetic manifolds. Our work reduces the question of whether unsupervised disentangling is possible to the question of whether unsupervised metric learning is possible, providing a unifying insight into the geometric nature of representation learning.
"Disentangling" is a somewhat nebulous term in ML, but it is broadly about building models that can separate out different latent factors of variation - for instance, in vision, separating translation, rotation, and changes in lighting or color that leave objects invariant. There are many definitions of disentangling - this paper is focused on the "symmetry-based" definition, which formalizes different possible invariances in the world as a product of continuous transformations, also known as Lie groups. We formalized symmetry-based disentangling in a previous paper - in short, a representation is disentangled if it matches the product structure of the group transformations that act on objects in the world. While this helped clarify terms used in the field, it did not provide any recipe for how to learn this product structure for Lie groups. That's where GEOMANCER comes in.
GEOMANCER was inspired by an observation about analogical reasoning. When working with vector representations, you can make analogies just by adding vectors together. For instance, hand + leg - arm = foot. But this model of analogies breaks down when you move from vector representations to Lie groups. Suddenly things don't commute any more! This is especially a problem when dealing with 3D rotations, which are ubiquitous in computer vision. Some image analogies we can complete without a problem, while others are more ambiguous. The idea behind GEOMANCER is to use this ambiguity as a learning signal itself. Directions that are disentangled from one another will be those such that analogies made in those directions can be completed unambiguously, even over long distances. Formalizing this idea mathematically leads to a branch of differential geometry known as holonomy theory, which deals specifically with how much vectors deviate from their behavior in flat spaces when moved around a curved manifold. Working through the math, we arrive at an algorithm based around the idea of subspaces undergoing random walk diffusion on a data manifold.
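To make the "things don't commute any more" point concrete, here is a minimal sketch in plain Python. It is not part of GEOMANCER; the helper names (`rot_x`, `rot_z`, `matmul`, `max_abs_diff`) and the example vectors are made up for illustration. It checks that adding vectors gives the same result in either order, while composing two quarter-turn 3D rotations does not.

```python
import math

def rot_x(theta):
    """3x3 rotation matrix about the x axis (hypothetical helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_z(theta):
    """3x3 rotation matrix about the z axis (hypothetical helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def max_abs_diff(a, b):
    """Largest entrywise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))

# Vector addition commutes: the order of the two steps does not matter.
u, v = [1.0, 2.0, 3.0], [0.5, -1.0, 4.0]
sum_uv = [a + b for a, b in zip(u, v)]
sum_vu = [b + a for a, b in zip(u, v)]

# Rotation composition does not: x-then-z differs from z-then-x.
quarter = math.pi / 2
xz = matmul(rot_x(quarter), rot_z(quarter))
zx = matmul(rot_z(quarter), rot_x(quarter))
rotation_gap = max_abs_diff(xz, zx)

print("vector sums equal:", sum_uv == sum_vu)
print("max entrywise gap between the two rotation orders:", rotation_gap)
```

The gap between the two rotation orders is the kind of ambiguity described above: an analogy built from "rotate about x" and "rotate about z" depends on which step is taken first.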
On synthetic manifolds, we are able to automatically discover the correct number of submanifolds, their dimension, and (up to sampling noise) learn the disentangled directions almost exactly. This works on the product of as many as 5 manifolds, far more than other methods. But our method assumes that the data is already in a space where distances are correct and disentangled directions are at right angles. That usually isn't the case for raw data - so the problem of symmetry-based disentangling is only half-solved! Because we start from the symmetry-based definition and work backwards from first principles, we believe that GEOMANCER is a promising first step in a research direction that will lead to more general and robust disentangling algorithms.
@article{pfau2020disentangling,
title={Disentangling by Subspace Diffusion},
author={Pfau, David and Higgins, Irina and Botev, Aleksandar and Racani\`ere, S{\'e}bastian},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
year={2020}
}
Given access to accurate solutions of the many-electron Schrödinger equation, nearly all chemistry could be derived from first principles. Exact wavefunctions of interesting chemical systems are out of reach because they are NP-hard to compute in general, but approximations can be found using polynomially-scaling algorithms. The key challenge for many of these algorithms is the choice of wavefunction approximation, or Ansatz, which must trade off between efficiency and accuracy. Neural networks have shown impressive power as accurate practical function approximators and promise as a compact wavefunction Ansatz for spin systems, but problems in electronic structure require wavefunctions that obey Fermi-Dirac statistics. Here we introduce a novel deep learning architecture, the Fermionic Neural Network, as a powerful wavefunction Ansatz for many-electron systems. The Fermionic Neural Network is able to achieve accuracy beyond other variational Monte Carlo Ansätze on a variety of atoms and small molecules. Using no data other than atomic positions and charges, we predict the dissociation curves of the nitrogen molecule and hydrogen chain, two challenging strongly-correlated systems, to significantly higher accuracy than the coupled cluster method, widely considered the gold standard for quantum chemistry. This demonstrates that deep neural networks can outperform existing ab-initio quantum chemistry methods, opening the possibility of accurate direct optimisation of wavefunctions for previously intractable molecules and solids.
The Schrödinger equation - basically Newton's laws at the atomic scale - has been known for almost 100 years. But the equation is impossible to solve in closed form for anything more complicated than a hydrogen atom. To quote Paul Dirac, "Physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble." People have been solving these equations computationally almost as long as there have been computers - but an incredibly high level of accuracy is needed for these computations to be relevant to chemistry - something like 99.999% accuracy or higher. We've developed a new neural network architecture that can represent wavefunctions for systems of fermions - the kind of particles that make up most matter - and show that it is much more accurate than conventional approximate wavefunctions.
@article{pfau2020abinitio,
title={Ab-initio Solution of the Many-Electron Schr{\"o}dinger Equation with Deep Neural Networks},
author={Pfau, David and Spencer, James S. and Matthews, Alexander G. de G. and Foulkes, W. M. C.},
journal={Phys. Rev. Research},
year={2020},
volume={2},
issue = {3},
pages={033429},
doi = {10.1103/PhysRevResearch.2.033429},
url = {https://link.aps.org/doi/10.1103/PhysRevResearch.2.033429}
}
We present Spectral Inference Networks, a framework for learning eigenfunctions of linear operators by stochastic optimization. Spectral Inference Networks generalize Slow Feature Analysis to generic symmetric operators, and are closely related to Variational Monte Carlo methods from computational physics. As such, they can be a powerful tool for unsupervised representation learning from video or pairs of data. We derive a training algorithm for Spectral Inference Networks that addresses the bias in the gradients due to finite batch size and allows for online learning of multiple eigenfunctions. We show results of training Spectral Inference Networks on problems in quantum mechanics and feature learning for videos on synthetic datasets as well as the Arcade Learning Environment. Our results demonstrate that Spectral Inference Networks accurately recover eigenfunctions of linear operators, can discover interpretable representations from video and find meaningful subgoals in reinforcement learning environments.
Computing the eigendecomposition of a matrix is a ubiquitous problem in computational sciences. Often this is an approximation of an eigenfunction of a linear operator from finite points. We show how to use generic function approximators like neural networks trained by gradient descent to approximately solve this problem. This has the advantage that generalization is extremely fast and simple compared to alternative approaches like the Nystrom method. Slow feature analysis, a classic unsupervised learning algorithm, is a special case of the framework we outline here.
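As a toy illustration of treating an eigenproblem as something to optimize by gradient steps, the sketch below runs gradient ascent on the Rayleigh quotient of a small symmetric matrix. This is not the Spectral Inference Networks training algorithm: there is no neural network, no stochastic minibatching and only one eigenvector, and the function names, step size and example matrix are all assumptions chosen for illustration.

```python
def matvec(a, v):
    """Multiply a square matrix (nested lists) by a vector."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def top_eigenpair(a, v0, lr=0.1, steps=500):
    """Gradient ascent on the Rayleigh quotient v.Av over unit vectors v.

    Hypothetical helper: for a unit vector v, the direction Av - (v.Av) v
    is proportional to the gradient of the Rayleigh quotient, so stepping
    along it and renormalizing moves v toward the top eigenvector.
    """
    v = normalize(v0)
    for _ in range(steps):
        av = matvec(a, v)
        rayleigh = dot(v, av)
        grad = [avi - rayleigh * vi for avi, vi in zip(av, v)]
        v = normalize([vi + lr * gi for vi, gi in zip(v, grad)])
    return dot(v, matvec(a, v)), v

# Small symmetric example whose eigenvalues are 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = top_eigenpair(A, [1.0, 0.0])
print("estimated top eigenvalue:", eigenvalue)
print("estimated top eigenvector:", eigenvector)
```

Here the parameters being optimized are the entries of the vector itself; the framework described above replaces them with the outputs of a function approximator evaluated at data points.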
@InProceedings{pfau2019spectral,
title={Spectral Inference Networks: Unifying Deep and Spectral Learning},
author={Pfau, David and Petersen, Stig and Agarwal, Ashish and Barrett, David and Stachenfeld, Kimberly L.},
booktitle={7th International Conference on Learning Representations},
year={2019}
}
How can intelligent agents solve a diverse set of tasks in a data-efficient manner? The disentangled representation learning approach posits that such an agent would benefit from separating out (disentangling) the underlying structure of the world into disjoint parts of its representation. However, there is no generally agreed-upon definition of disentangling, not least because it is unclear how to formalise the notion of world structure beyond toy datasets with a known ground truth generative process. Here we propose that a principled solution to characterising disentangled representations can be found by focusing on the transformation properties of the world. In particular, we suggest that those transformations that change only some properties of the underlying world state, while leaving all other properties invariant, are what gives exploitable structure to any kind of data. Similar ideas have already been successfully applied in physics, where the study of symmetry transformations has revolutionised the understanding of the world structure. By connecting symmetry transformations to vector representations using the formalism of group and representation theory we arrive at the first formal definition of disentangled representations. Our new definition is in agreement with many of the current intuitions about disentangling, while also providing principled resolutions to a number of previous points of contention. While this work focuses on formally defining disentangling - as opposed to solving the learning problem - we believe that the shift in perspective to studying data transformations can stimulate the development of better representation learning algorithms.
Learning to automatically disentangle different factors of variation in data (for instance, object pose, illumination, color and identity) is a major recent topic of interest in unsupervised learning. However, no one can really agree on what it means for a representation to be "disentangled". This paper is an attempt to make our intuitive notion of "disentangled representation" mathematically precise, using machinery from group representation theory.
@article{higgins2018towards,
title={Towards a Definition of Disentangled Representations},
author={Higgins, Irina and Amos, David and Pfau, David and Racaniere, Sebastian and Matthey, Loic and Rezende, Danilo and Lerchner, Alexander},
journal={arXiv preprint arXiv:1812.02230},
year={2018}
}
Spectral algorithms for learning low-dimensional data manifolds have largely been supplanted by deep learning methods in recent years. One reason is that classic spectral manifold learning methods often learn collapsed embeddings that do not fill the embedding space. We show that this is a natural consequence of data where different latent dimensions have dramatically different scaling in observation space. We present a simple extension of Laplacian Eigenmaps to fix this problem based on choosing embedding vectors which are both orthogonal and \textit{minimally redundant} to other dimensions of the embedding. In experiments on NORB and similarity-transformed faces we show that Minimally Redundant Laplacian Eigenmap (MR-LEM) significantly improves the quality of embedding vectors over Laplacian Eigenmaps, accurately recovers the latent topology of the data, and discovers many disentangled factors of variation of comparable quality to state-of-the-art deep learning methods.
In the early 2000s, algorithms like LLE, IsoMap and Laplacian Eigenmaps became popular tools for dimensionality reduction, often under the rubric "manifold learning". For a number of reasons, these methods largely fell by the wayside, except for a few like t-SNE that remain popular for visualization. We address the reason behind one of the failure modes of a certain type of manifold learning method. We show that once fixed, these classic algorithms can learn to disentangle complex data as well as modern deep learning methods - particularly data with complex topology - without the need for a generative model and with limited data.
@InProceedings{pfau2018minimally,
title={Minimally Redundant Laplacian Eigenmaps},
author={Pfau, David and Burgess, Christopher P.},
booktitle={6th International Conference on Learning Representations, Workshop Track},
year={2018}
}
We introduce a method to stabilize Generative Adversarial Networks (GANs) by defining the generator objective with respect to an unrolled optimization of the discriminator. This allows training to be adjusted between using the optimal discriminator in the generator’s objective, which is ideal but infeasible in practice, and using the current value of the discriminator, which is often unstable and leads to poor solutions. We show how this technique solves the common problem of mode collapse, stabilizes training of GANs with complex recurrent generators, and increases diversity and coverage of the data distribution by the generator.
Generative adversarial networks (GANs) have become popular in the world of deep unsupervised learning recently, but are notorious for being hard to optimize. This may be because the model consists of two neural networks, each of which is being optimized relative to the current state of the other one, meaning each network is trying to hit a moving target. We describe a practical method for optimizing one of these networks with respect to where the other will be in the future instead of where it is now, hopefully preventing some of these pathologies common to training GANs.
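The "moving target" problem and the effect of looking ahead can be seen in a toy two-player game. The sketch below is not a GAN and not the paper's implementation: it uses the bilinear game f(x, y) = x * y, where x plays the generator (minimizing f) and y plays the discriminator (maximizing f), and the unrolled gradient is worked out by hand rather than by automatic differentiation. All names and step sizes are assumptions for illustration.

```python
def simulate(unroll_steps, lr=0.1, inner_lr=0.1, iters=500):
    """Play the toy game f(x, y) = x * y and return the final distance from (0, 0).

    The generator x minimizes f; the discriminator y maximizes f.
    With unroll_steps = K, the generator differentiates through K simulated
    discriminator ascent steps: y_K(x) = y + K * inner_lr * x, so its
    surrogate loss is x * y_K(x) = x * y + K * inner_lr * x ** 2.
    K = 0 recovers ordinary simultaneous gradient updates.
    """
    x, y = 1.0, 1.0
    for _ in range(iters):
        # Hand-derived gradient of the surrogate loss with respect to x.
        grad_x = y + 2.0 * unroll_steps * inner_lr * x
        # The discriminator itself still takes one ordinary ascent step.
        grad_y = x
        x, y = x - lr * grad_x, y + lr * grad_y
    return (x * x + y * y) ** 0.5

start_distance = 2.0 ** 0.5
plain_distance = simulate(unroll_steps=0)
unrolled_distance = simulate(unroll_steps=5)
print("start:", start_distance)
print("no unrolling:", plain_distance)
print("5 unrolled steps:", unrolled_distance)
```

In this toy setting the plain updates spiral away from the equilibrium at the origin, while the unrolled generator picks up an extra damping term and spirals in. Real GAN training is far less tidy, so this only illustrates the mechanism.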
@InProceedings{metz2017unrolled,
title={Unrolled Generative Adversarial Networks},
author={Metz, Luke and Poole, Ben and Pfau, David and Sohl-Dickstein, Jascha},
booktitle={5th International Conference on Learning Representations},
year={2017}
}
Both generative adversarial networks (GAN) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being difficult to optimize. Practitioners in both fields have amassed a large number of strategies to mitigate these instabilities and improve training. Here we show that GANs can be viewed as actor-critic methods in an environment where the actor cannot affect the reward. We review the strategies for stabilizing training for each class of models, both those that generalize between the two and those that are particular to that model. We also review a number of extensions to GANs and RL algorithms with even more complicated information flow. We hope that by highlighting this formal connection we will encourage both GAN and RL communities to develop general, scalable, and stable algorithms for multilevel optimization with deep networks, and to draw inspiration across communities.
Generative adversarial networks have become popular in the world of deep unsupervised learning recently, but are notorious for being hard to optimize. Actor-critic methods in reinforcement learning have much the same reputation. We show that the two methods are actually very closely related, and review strategies used in both communities to improve the stability of training and diversity of samples, in the hopes of encouraging cross-pollination between the fields.
@InProceedings{pfau2016connecting,
title={Connecting Generative Adversarial Networks and Actor-Critic Methods},
author={Pfau, David and Vinyals, Oriol},
booktitle={NIPS Workshop on Adversarial Training},
year={2016}
}
The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
A lot of the success of deep learning has been in showing that features in domains like computer vision that had been hand-designed could be learned instead. Learning itself is mostly done with hand-designed optimization algorithms, however. This paper attempts to apply the successes of deep learning at the meta-level, to the optimization algorithms used to train the deep networks themselves. In other words: "Yo dawg, I heard you like optimizers, so I put a deep network in your deep network so you can learn while you learn."
@InProceedings{andrychowicz2016learning,
title = {Learning to Learn by Gradient Descent by Gradient Descent},
author = {Andrychowicz, Marcin and Denil, Misha and Gomez, Sergio and Hoffman, Matthew W and Pfau, David and Schaul, Tom and de Freitas, Nando},
booktitle = {Advances in Neural Information Processing Systems},
year = {2016}
}
In this work we introduce a differentiable version of the Compositional Pattern Producing Network, called the DPPN. Unlike a standard CPPN, the topology of a DPPN is evolved but the weights are learned. A Lamarckian algorithm, that combines evolution and learning, produces DPPNs to reconstruct an image. Our main result is that DPPNs can be evolved/trained to compress the weights of a denoising autoencoder from 157684 to roughly 200 parameters, while achieving a reconstruction accuracy comparable to a fully connected network with more than two orders of magnitude more parameters. The regularization ability of the DPPN allows it to rediscover (approximate) convolutional network architectures embedded within a fully connected architecture. Such convolutional architectures are the current state of the art for many computer vision applications, so it is satisfying that DPPNs are capable of discovering this structure rather than having to build it in by design. DPPNs exhibit better generalization when tested on the Omniglot dataset after being trained on MNIST, than directly encoded fully connected autoencoders. DPPNs are therefore a new framework for integrating learning and evolution.
Evolutionary computing is a type of stochastic search that randomly changes parameters in a model and keeps around the models that score the highest. While the approach is crude, a type of model called "compositional pattern producing networks" can generate interesting images, and can even fool state-of-the-art computer vision algorithms. We show that a mix of gradient descent and stochastic search works better for training compositional pattern producing networks to produce the parameters of a neural network than stochastic search alone. The neural network parameters that are learned have a structure somewhat similar to convolutions, which are a type of invariance normally built into neural networks by hand. This suggests that important invariances could possibly be discovered instead of being hand-encoded into models.
@InProceedings{fernando2016convolution,
title = {Convolution by Evolution: Differentiable Pattern Producing Networks},
author = {Fernando, Chrisantha and Banarse, Dylan and Reynolds, Malcolm and Besse, Frederic and Pfau, David and Jaderberg, Max and Lanctot, Marc and Wierstra, Daan},
booktitle = {The Genetic and Evolutionary Computation Conference},
year = {2016}
}
We present a modular approach for analyzing calcium imaging recordings of large neuronal ensembles. Our goal is to simultaneously identify the locations of the neurons, demix spatially overlapping components, and denoise and deconvolve the spiking activity from the slow dynamics of the calcium indicator. Our approach relies on a constrained nonnegative matrix factorization that expresses the spatiotemporal fluorescence activity as the product of a spatial matrix that encodes the spatial footprint of each neuron in the optical field and a temporal matrix that characterizes the calcium concentration of each neuron over time. This framework is combined with a novel constrained deconvolution approach that extracts estimates of neural activity from fluorescence traces, to create a spatiotemporal processing algorithm that requires minimal parameter tuning. We demonstrate the general applicability of our method by applying it to in vitro and in vivo multi-neuronal imaging data, whole-brain light-sheet imaging data, and dendritic imaging data.
Calcium imaging is a powerful class of experimental techniques that allow us to image from hundreds to thousands of neurons simultaneously in living animals. However, the information we really care about - which neuron is spiking when - is mixed together in complex ways in the raw video data. This paper presents a new statistical method that can simultaneously identify where neurons are, unmix the signals from overlapping neurons, and infer when a spike is occurring from noisy data, potentially saving experimenters a lot of time and energy.
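The core modelling idea, that the movie is approximately a product of a spatial matrix and a temporal matrix, can be sketched with plain nonnegative matrix factorization. The code below uses the standard multiplicative updates for unconstrained NMF on a tiny made-up example; it leaves out the constraints, the calcium dynamics and the deconvolution that the paper's method relies on, and every name and number in it is an assumption for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def squared_error(y, a, c):
    """Squared Frobenius error between the movie y and the product a @ c."""
    approx = matmul(a, c)
    return sum((y[i][j] - approx[i][j]) ** 2
               for i in range(len(y)) for j in range(len(y[0])))

def nmf_step(y, a, c, eps=1e-9):
    """One round of multiplicative updates; entries stay nonnegative."""
    at = transpose(a)
    num, den = matmul(at, y), matmul(matmul(at, a), c)
    c = [[c[k][t] * num[k][t] / (den[k][t] + eps) for t in range(len(c[0]))]
         for k in range(len(c))]
    ct = transpose(c)
    num, den = matmul(y, ct), matmul(a, matmul(c, ct))
    a = [[a[i][k] * num[i][k] / (den[i][k] + eps) for k in range(len(a[0]))]
         for i in range(len(a))]
    return a, c

# Made-up movie: 4 pixels, 5 frames, 2 partly overlapping "neurons".
true_footprints = [[1.0, 0.0], [1.0, 0.5], [0.0, 1.0], [0.0, 1.0]]
true_traces = [[0.0, 1.0, 0.5, 0.25, 0.0], [1.0, 0.0, 0.0, 1.0, 0.5]]
movie = matmul(true_footprints, true_traces)

# Deterministic positive initial guesses (pixels x neurons, neurons x frames).
footprints = [[0.5 + 0.1 * ((i + 2 * k) % 3) for k in range(2)] for i in range(4)]
traces = [[0.5 + 0.1 * ((k + t) % 4) for t in range(5)] for k in range(2)]

initial_error = squared_error(movie, footprints, traces)
for _ in range(300):
    footprints, traces = nmf_step(movie, footprints, traces)
final_error = squared_error(movie, footprints, traces)
print("error before:", initial_error, "after:", final_error)
```

Each column of `footprints` plays the role of one neuron's spatial footprint and each row of `traces` its activity over time.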
@Article{pnevmatikakis2016simultaneous,
title={Simultaneous denoising, deconvolution, and demixing of calcium imaging data},
author={Pnevmatikakis, Eftychios A and Soudry, Daniel and Gao, Yuanjun and Machado, Timothy A and Merel, Josh and Pfau, David and Reardon, Thomas and Mu, Yu and Lacefield, Clay and Yang, Weijian and others},
journal={Neuron},
volume={89},
number={2},
pages={285--299},
year={2016},
publisher={Elsevier}
}
Making intelligent decisions from incomplete information is critical in many applications: for example, robots must choose actions based on imperfect sensors, and speech-based interfaces must infer a user’s needs from noisy microphone inputs. What makes these tasks hard is that often we do not have a natural representation with which to model the domain and use for choosing actions; we must learn about the domain’s properties while simultaneously performing the task. Learning a representation also involves trade-offs between modeling the data that we have seen previously and being able to make predictions about new data. This article explores learning representations of stochastic systems using Bayesian nonparametric statistics. Bayesian nonparametric methods allow the sophistication of a representation to scale gracefully with the complexity in the data. Our main contribution is a careful empirical evaluation of how representations learned using Bayesian nonparametric methods compare to other standard learning approaches, especially in support of planning and control. We show that the Bayesian aspects of the methods result in achieving state-of-the-art performance in decision making with relatively few samples, while the nonparametric aspects often result in fewer computations. These results hold across a variety of different techniques for choosing actions given a representation.
Is it possible for an agent to learn the structure of the world while learning how to act optimally in the world if it isn't able to see everything about the world all at once? We certainly hope so, or artificial intelligence may not be possible. We use a number of techniques for learning structure from time series in a Bayesian nonparametric way, including the Probabilistic Deterministic Infinite Automata (PDIA), to try to address this question. On small problems some of the methods tried do in fact recover the true structure of the world. Not the PDIA, sadly.
@article{doshi2015bayesian,
title={Bayesian nonparametric methods for partially-observable reinforcement learning},
author={Doshi-Velez, Finale and Pfau, David and Wood, Frank and Roy, Nicholas},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={37},
number={2},
pages={394--407},
year={2015},
publisher={IEEE}
}
Advances in neuroscience are producing data at an astounding rate - data which are fiendishly complex both to process and to interpret. Biological neural networks are high-dimensional, nonlinear, noisy, heterogeneous, and in nearly every way defy the simplifying assumptions of standard statistical methods. In this dissertation we address a number of issues with understanding the structure of neural populations, from the abstract level of how to uncover structure in generic time series, to the practical matter of finding relevant biological structure in state-of-the-art experimental techniques. To learn the structure of generic time series, we develop a new statistical model, which we dub the probabilistic deterministic infinite automata (PDIA), which uses tools from nonparametric Bayesian inference to learn a very general class of sequence models. We show that the models learned by the PDIA often offer better predictive performance and faster inference than Hidden Markov Models, while being significantly more compact than models that simply memorize contexts. For large populations of neurons, models like the PDIA become unwieldy, and we instead investigate ways to robustly reduce the dimensionality of the data. In particular, we adapt the generalized linear model (GLM) framework for regression to the case of matrix completion, which we call the low-dimensional GLM. We show that subspaces and dynamics of neural activity can be accurately recovered from model data, and with only minimal assumptions about the structure of the dynamics can still lead to good predictive performance on real data. Finally, to bridge the gap between recording technology and analysis, particularly as recordings from ever-larger populations of neurons become the norm, automated methods for extracting activity from raw recordings become a necessity.
We present a number of methods for automatically segmenting biological units from optical imaging data, with applications to light sheet recording of genetically encoded calcium indicator fluorescence in the larval zebrafish, and optical electrophysiology using genetically encoded voltage indicators in culture. Together, these methods are a powerful set of tools for addressing the diverse challenges of modern neuroscience.
Six years of my life compressed into 150-odd pages. Most of Chapters 2 and 3 had already been published at NIPS, but some of the material included has not been published elsewhere. Chapter 1 gives a good summary of the role of information theory in neuroscience and provides some of the motivation for the work in Chapter 2 on time series models. Chapter 2 includes experiments with the PDIA on neuroscience data that have not been published elsewhere, showing that we can learn long-range dependencies in data better than a GLM (at least for data where the observations are binary). Chapter 4 provides a number of experiments in processing calcium imaging data that eventually led to work published in Neuron, but in quite a different form from what's presented here.
@phdthesis{pfau2015learning,
author = {Pfau, David},
title = {Learning Structure in Time Series for Neuroscience and Beyond},
school = {Columbia University},
year = 2015,
month = 2,
}
We present a structured matrix factorization approach to analyzing calcium imaging recordings of large neuronal ensembles. Our goal is to simultaneously identify the locations of the neurons, demix spatially overlapping components, and denoise and deconvolve the spiking activity of each neuron from the slow dynamics of the calcium indicator. The matrix factorization approach relies on the observation that the spatiotemporal fluorescence activity can be expressed as a product of two matrices: a spatial matrix that encodes the location of each neuron in the optical field and a temporal matrix that characterizes the calcium concentration of each neuron over time. We present a simple approach for estimating the dynamics of the calcium indicator as well as the observation noise statistics from the observed data. These parameters are then used to set up the matrix factorization problem in a constrained form that requires no further parameter tuning. We discuss initialization and post-processing techniques that enhance the performance of our method, along with efficient and largely parallelizable algorithms. We apply our method to in vivo large scale multi-neuronal imaging data and also demonstrate how similar methods can be used for the analysis of in vivo dendritic imaging data.
A preliminary version of our work on processing calcium imaging data, which was later published in Neuron.
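As a rough illustration of the factorization described in the abstract, the sketch below simulates fluorescence data as the product of a spatial matrix and a temporal matrix, plus noise. The first-order autoregressive calcium model, the decay constant `gamma`, the footprint shapes and the noise level are all assumptions chosen for illustration; none of them is taken from the paper, and this is the generative picture only, not the paper's estimation algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_neurons, n_frames = 100, 2, 500

# Spatial matrix A: each column is one neuron's footprint in the field of view.
A = np.zeros((n_pixels, n_neurons))
A[20:40, 0] = 1.0
A[30:50, 1] = 1.0  # deliberately overlaps neuron 0 on pixels 30-39

# Sparse spike trains (hypothetical 2% firing probability per frame).
spikes = (rng.random((n_neurons, n_frames)) < 0.02).astype(float)

# Temporal matrix C: calcium concentration, modelled here (as an assumption)
# by an AR(1) process that decays by a factor gamma per frame.
gamma = 0.95
C = np.zeros_like(spikes)
C[:, 0] = spikes[:, 0]
for t in range(1, n_frames):
    C[:, t] = gamma * C[:, t - 1] + spikes[:, t]

# Observed movie Y: pixels x frames, a low-rank product plus observation noise.
Y = A @ C + 0.1 * rng.standard_normal((n_pixels, n_frames))

# Under this toy model, deconvolution inverts the AR(1) filter exactly.
recovered_spikes = C[:, 1:] - gamma * C[:, :-1]
```

Demixing means recovering `A` and `C` from `Y` alone; the overlapping footprints are what make that harder than simple region-of-interest averaging.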
@Article{pnevmatikakis2014structured,
title={A structured matrix factorization framework for large scale calcium imaging data analysis},
author={Pnevmatikakis, Eftychios A and Gao, Yuanjun and Soudry, Daniel and Pfau, David and Lacefield, Clay and Poskanzer, Kira and Bruno, Randy and Yuste, Rafael and Paninski, Liam},
journal={arXiv preprint arXiv:1409.2903},
year={2014}
}
Recordings from large populations of neurons make it possible to search for hypothesized low-dimensional dynamics. Finding these dynamics requires models that take into account biophysical constraints and can be fit efficiently and robustly. Here, we present an approach to dimensionality reduction for neural data that is convex, does not make strong assumptions about dynamics, does not require averaging over many trials and is extensible to more complex statistical models that combine local and global influences. The results can be combined with spectral methods to learn dynamical systems models. The basic method extends PCA to the exponential family using nuclear norm minimization. We evaluate the effectiveness of this method using an exact decomposition of the Bregman divergence that is analogous to variance explained for PCA. We show on model data that the parameters of latent linear dynamical systems can be recovered, and that even if the dynamics are not stationary we can still recover the true latent subspace. We also demonstrate an extension of nuclear norm minimization that can separate sparse local connections from global latent dynamics. Finally, we demonstrate improved prediction on real neural data from monkey motor cortex compared to fitting linear dynamical models without nuclear norm smoothing.
New technologies make it possible to record from massive populations of neurons, but making that data interpretable is challenging. Dimensionality reduction is one approach, which looks for a few factors in the data that account for most of the variability. State space models extend dimensionality reduction by modeling dynamics in the low-dimensional space of factors, as well as allowing for more complex models of noise that are more appropriate for neural data. In the machine learning community, state space models are typically fit with methods like expectation-maximization, which may be sensitive to the choice of initialization. We show here that state space models that are of interest to neuroscientists can also be fit using techniques from convex optimization - in particular techniques from the matrix completion and system identification communities.
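To give a feel for nuclear norm minimization, here is a minimal sketch of its simplest special case: with squared error in place of an exponential-family likelihood, the problem min_X 0.5 * ||Y - X||_F^2 + lam * ||X||_* has a closed-form solution given by soft-thresholding the singular values of Y. This is a simplification for illustration, not the solver used in the paper; the function name `svt`, the synthetic data and the value of `lam` are all assumptions.

```python
import numpy as np

def svt(Y, lam):
    """Singular value soft-thresholding: the proximal operator of lam * nuclear norm."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # shrink every singular value, clip at zero
    return (U * s_shrunk) @ Vt           # scale the columns of U, then recombine

# Synthetic "population activity": 50 units, 200 time bins, 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 200))
noisy = latent + 0.1 * rng.standard_normal(latent.shape)

denoised = svt(noisy, lam=5.0)
estimated_rank = np.linalg.matrix_rank(denoised)
```

Unlike PCA, the number of factors is not fixed in advance: the penalty `lam` decides how many singular values survive, and the problem stays convex.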
@InProceedings{pfau2013robust,
title={Robust learning of low-dimensional dynamics from large neural ensembles},
author={Pfau, David and Pnevmatikakis, Eftychios A and Paninski, Liam},
booktitle={Advances in neural information processing systems},
pages={2391--2399},
year={2013}
}
The opacity of typical objects in the world results in occlusion, an important property of natural scenes that makes inference of the full three-dimensional structure of the world challenging. The relationship between occlusion and low-level image statistics has been hotly debated in the literature, and extensive simulations have been used to determine whether occlusion is responsible for the ubiquitously observed power-law power spectra of natural images. To deepen our understanding of this problem, we have analytically computed the two- and four-point functions of a generalized “dead leaves” model of natural images with parameterized object transparency. Surprisingly, transparency alters these functions only by a multiplicative constant, so long as object diameters follow a power-law distribution. For other object size distributions, transparency more substantially affects the low-level image statistics. We propose that the universality of power-law power spectra for both natural scenes and radiological medical images, formed by the transmission of x-rays through partially transparent tissue, stems from power-law object size distributions, independent of object opacity.
If you compute the correlation between pixels in an image as a function of distance between pixels, a common statistical distribution emerges across nearly all images. The reason for this distribution was not clear - one camp held that it was due to the presence of objects of many different sizes in an image, while another held that it was caused by sharp edges. We show conclusively that the former camp is correct by analytically calculating the correlations in a model of natural images that factors in transparency. We show that changing the transparency of objects in the model does not change the correlation structure, but changing the distribution of sizes in the model does. Thus object sizes, not edges, lead to the complex correlations in nearly all natural images.
@Article{zylberberg2012dead,
title={Dead leaves and the dirty ground: Low-level image statistics in transmissive and occlusive imaging environments},
author={Zylberberg, Joel and Pfau, David and DeWeese, Michael Robert},
journal={Physical Review E},
volume={86},
number={6},
pages={066112},
year={2012},
publisher={APS}
}
A major goal for brain machine interfaces is to allow patients to control prosthetic devices with high degrees of independent movements. Devices such as robotic arms and hands require this high dimensionality of control to restore the full range of actions exhibited in natural movement. Current BMI strategies fall well short of this goal, allowing the control of only a few degrees of freedom at a time. In this paper we present work towards the decoding of 27 joint angles from the shoulder, arm and hand as subjects perform reach and grasp movements. We also extend previous work in examining and optimizing the recording depth of electrodes to maximize the movement information that can be extracted from recorded neural signals.
One of the great potential applications of neural decoding is in neural prosthetics - potentially granting locked-in patients the ability to move again. Neural decoding has demonstrated the ability to decode monkey reaching in 2 or 3 dimensions, but natural motion is far more complex than that. We showed that baseline algorithms from the neural prosthetics community could scale to controlling a virtual limb with dozens of degrees of freedom, opening the way to more realistic and rich movements from brain-machine interfaces.
@InProceedings{wong2012decoding,
title={Decoding arm and hand movements across layers of the macaque frontal cortices},
author={Wong, Yan T and Vigeral, Mariana and Putrino, David and Pfau, David and Merel, Josh and Paninski, Liam and Pesaran, Bijan},
booktitle={2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society},
pages={1757--1760},
year={2012},
organization={IEEE}
}
We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models.
Probabilistic deterministic finite automata (PDFA) are a class of probabilistic models for sequences - they assign a probability to every possible sequence, like a string of text. They fall in between hidden Markov models and n-gram models in complexity. Like n-gram models, inference is fast and cheap as there is no uncertainty about what the context is. Like hidden Markov models, complex dependencies in how states transition can be learned - transitions which cannot be learned by an n-gram model no matter how long the context is. We develop a nonparametric Bayesian way of learning PDFA and show it can recover the true structure of artificial grammars that psychologists used to study human sequence learning, as well as learn very compact models of text and DNA.
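To make the idea concrete, here is a minimal sketch of a hand-specified PDFA scoring a sequence. The class name `PDFA`, its fields and the two-state example are hypothetical and chosen for illustration; this shows only how a fixed PDFA assigns probabilities, not the PDIA learning procedure from the paper.

```python
import math

class PDFA:
    """Toy probabilistic deterministic finite automaton (illustrative only)."""

    def __init__(self, transitions, emissions, start=0):
        self.transitions = transitions  # (state, symbol) -> next state
        self.emissions = emissions      # state -> {symbol: probability}
        self.start = start

    def log_prob(self, sequence):
        """Log probability of a sequence; the state path is fully determined."""
        state, total = self.start, 0.0
        for symbol in sequence:
            p = self.emissions[state].get(symbol, 0.0)
            if p == 0.0:
                return float("-inf")    # the automaton cannot emit this symbol here
            total += math.log(p)
            state = self.transitions[(state, symbol)]
        return total

# Two-state example: every run of 'b' must have even length.
# State 0 emits 'a' or 'b' with equal probability; after a 'b' the automaton
# moves to state 1, which must emit a second 'b' before returning to state 0.
even = PDFA(
    transitions={(0, "a"): 0, (0, "b"): 1, (1, "b"): 0},
    emissions={0: {"a": 0.5, "b": 0.5}, 1: {"b": 1.0}},
)

p_abb = math.exp(even.log_prob("abb"))  # 0.5 * 0.5 * 1.0
lp_aba = even.log_prob("aba")           # a lone 'b' is impossible
```

Because each symbol determines the next state, scoring is a single pass with no sum over hidden paths, which is the speed advantage over hidden Markov models. The even-length constraint depends on how many 'b's have been seen in the current run, which no fixed-length context captures.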
@InProceedings{pfau2010probabilistic,
title={Probabilistic deterministic infinite automata},
author={Pfau, David and Bartlett, Nicholas and Wood, Frank},
booktitle={Advances in neural information processing systems},
pages={1930--1938},
year={2010}
}
We propose a novel dependent hierarchical Pitman-Yor process model for discrete data. An incremental Monte Carlo inference procedure for this model is developed. We show that inference in this model can be performed in constant space and linear time. The model is demonstrated in a discrete sequence prediction task where it is shown to achieve state-of-the-art sequence prediction performance while using significantly less memory.
The sequence memoizer is a powerful probabilistic model of sequential data like text. One downside of the sequence memoizer is that it grows linearly in memory with the amount of data. We show that by forgetting intelligently, a constant-memory sequence memoizer performs comparably to the original linear-memory algorithm.
@InProceedings{bartlett2010forgetting,
title={Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process},
author={Bartlett, Nicholas and Pfau, David and Wood, Frank},
booktitle={Proceedings of the 27th International Conference on Machine Learning (ICML-10)},
pages={63--70},
year={2010}
}
Under natural viewing conditions, our eyes alternate between saccadic movement and fixation. However, even during fixation there are constant small movements, which can be decomposed into miniature saccades and diffusion-like random eye movements. Some diffusion helps prevent adaptation to a particular stimulus, but diffusion also blurs the image of the world across the retina. Despite this, humans can resolve fine spatial detail very well, and this diffusion may even enhance the ability to distinguish high-frequency components of an image [1]. This suggests that the brain compensates for fixational eye diffusion and may even extract useful information from it. To investigate the effect of eye diffusion on image reconstruction, we extended a generalized linear model (GLM) of retinal encoding/decoding to incorporate random-walk drift of the image falling on the retina. GLMs have been successfully applied to modeling a range of neural systems, including retinal ganglion cells [2]. Previously developed GLMs of the retina, directly estimated from spiking data, generate simulated network spike trains with the correct spatiotemporal filtering and correlation structure. Finally, given this network spiking encoding model and a statistical model of the spatiotemporal visual inputs, there is a natural Bayesian method for decoding the response [3]. For our model incorporating fixational eye diffusion, the decoding model would assign a probability to all possible random walks the image could take. However, the number of possible paths grows exponentially with time, making this method computationally intractable. Instead, we approximate the posterior distribution of images given the observed spikes as a mixture of Gaussians, and track the diffusive movements of the mixture components by a particle filtering approximation. This method is both computationally tractable and effective at reconstructing the encoded image. Preliminary results show that the image reconstruction is poor at both very low and very high diffusion rates, while reconstruction works reasonably well at intermediate diffusion rates. Thus, a well-defined optimal diffusion rate exists, and in general depends on statistical properties of both the stimulus and the retinal spatiotemporal receptive fields, such as the strength of the sustained response component and whether the transient component lasts longer than the persistence time of the eye movements. We are currently pursuing quantitative comparisons to the real diffusion coefficient during head-fixed viewing.
References
[1] Miniature eye movements enhance fine spatial detail. M. Rucci, R. Iovin, M. Poletti, and F. Santini, Nature 447(7146):851-854, 2007.
[2] Spatio-temporal correlations and visual signalling in a complete neuronal population. J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli, Nature 454(7202):995-999, 2008.
[3] Model-based decoding, information estimation, and change-point detection in multi-neuron spike trains. J. W. Pillow and L. Paninski, 2008.
Even when staring fixed at an object, our eyes are moving in subtle ways, yet the world appears fixed to us. Somehow, our brain must be compensating for these random movements of our eyes to create a coherent and stable perception of the world. We developed a Bayesian method to decode both the content of a scene and the motion of the eye simultaneously from a model of the signal the optic nerve sends to the brain. We also derived the optimal amount of eye movement given this model, but found it was ten times smaller than the actual amount of movement, suggesting other factors are in play.
@InProceedings{pfau2009bayesian,
title={A Bayesian method to predict the optimal diffusion coefficient in random fixational eye movements},
author={Pfau, David and Pitkow, Xaq and Paninski, Liam},
booktitle={Conference abstract: Computational and systems neuroscience},
year={2009}
}
Data from neuroscience is fiendishly complex. Neurons exhibit correlations on very long timescales and across large populations, and the activity of individual neurons is difficult to extract from noisy experimental data. I will present work on several projects to address these issues, both abstract and applied. First I will discuss the Probabilistic Deterministic Infinite Automata (PDIA), a nonparametric model of discrete sequences such as natural language or neural spiking. The PDIA explicitly enumerates latent states that are predictive of the future, and by using a Hierarchical Dirichlet Process prior can learn arbitrary transitions between those states. The model class learned by the PDIA is smaller than hidden Markov models but yields superior predictive performance on data with strong history dependence, like text. One weakness of the PDIA is that it is hard to scale when the space of possible observations is very large, as is the case with large populations of neurons. In this limit we are instead interested in reducing the dimensionality of data, and I will present work on unifying the generalized linear model (GLM) framework in neuroscience with dimensionality reduction. The resulting models can be efficiently learned using convex techniques from the matrix completion literature, and can be combined with spectral methods to learn surprisingly accurate models of the dynamics of real neural data. To apply these models to the kinds of high-dimensional neural data now becoming available, we have to bridge the gap between raw data and units of neural activity. I will present joint work with Misha Ahrens and Jeremy Freeman on extracting neural activity from whole-brain recordings in larval zebrafish, as a step towards the long-term goal of making dynamics modeling a daily part of the data analysis routine in neuroscience.
I try to contribute to open source as much as I can from within a private corporation, and some examples include the code from our Spectral Inference Networks paper, as well as various useful linear algebra operators and gradients in TensorFlow and JAX. In particular, the matrix exponential operator in TensorFlow was used to make a novel discovery in the theory of supergravity.
Though it hasn't been updated much since I joined DeepMind, you can find my personal GitHub here. Notable projects include a collection of methods for learning state space models for neuroscience data, some of which have been integrated into the pop_spik_dyn package, a Matlab implementation of Learning Recurrent Neural Networks with Hessian-Free Optimization, and the Java implementation of the Probabilistic Deterministic Infinite Automata used in our paper. For those interested in probabilistic programming, I have also provided a PDIA implementation in WebChurch.
I also contributed a C++ implementation of Beam Sampling for the Infinite Hidden Markov Model to the Data Microscopes project. At a factor of 40 faster than existing Matlab code, it's likely the fastest beam sampler for the iHMM in the world.
Not everything makes it into a paper, but that doesn't mean it's not important. You can find short notes and other writings that don't have a home elsewhere here.
A simple result that I haven't seen published elsewhere. Other research on generalized bias-variance decompositions has historically focused on 0-1 loss and is relevant to classification and boosting. In probabilistic modeling, error is measured through log probabilities instead of classification accuracy, often with distributions in the exponential family. Exponential family likelihoods and Bregman divergences are closely related, and it turns out it's straightforward to generalize the bias-variance decomposition for squared error to all Bregman divergences.
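For readers who want the shape of the result, one way to write such a decomposition is sketched below. The notation is chosen here for illustration and may differ from the note itself; it assumes a strictly convex, differentiable F, a target y that is independent of the prediction \hat{y}, and that the point \hat{y}^{*} defined through the gradient of F exists.

```latex
% Bregman divergence generated by F
D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle

% Mean target, and the "dual mean" of the prediction
\bar{y} = \mathbb{E}[y], \qquad
\nabla F(\hat{y}^{*}) = \mathbb{E}\big[\nabla F(\hat{y})\big]

% Expected loss splits into three non-negative terms
\mathbb{E}\big[D_F(y, \hat{y})\big]
  = \underbrace{\mathbb{E}\big[D_F(y, \bar{y})\big]}_{\text{noise}}
  + \underbrace{D_F(\bar{y}, \hat{y}^{*})}_{\text{bias}}
  + \underbrace{\mathbb{E}\big[D_F(\hat{y}^{*}, \hat{y})\big]}_{\text{variance}}
```

With F(x) = \tfrac{1}{2}\|x\|^2 the divergence is D_F(p, q) = \tfrac{1}{2}\|p - q\|^2 and \hat{y}^{*} = \mathbb{E}[\hat{y}], which recovers the familiar decomposition for squared error.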
A short essay about the process behind writing the paper "Disentangling by Subspace Diffusion", containing my own thoughts on the research process and giving some insight into just how long and arduous the process of going from idea to paper can be.