There are some papers that go from idea to submission in just a few weeks. Some of my own most-cited papers fall into that category - some were not much more than a few notes that I didn’t even think were worth publishing until someone encouraged me. Our NeurIPS paper this year, "Disentangling by Subspace Diffusion," is not one of those papers. Geoff Hinton said he was working on the idea behind capsule networks for over 17 years before he published anything. The ideas behind this paper have been rattling around in my head for almost as long.
I can trace the first germ of the idea for this paper all the way back to 2006. I was a senior in college, taking differential geometry and trying to figure out what to do with my life. That semester, I went to a lecture by Douglas Hofstadter on “Analogy as the Core of Cognition”. There are certain thinkers who can have a huge impact on a person if they encounter them at the right age. For me, listening to Hofstadter was like alchemy. I realized sitting in the audience that all the machinery of affine connections and parallel transport that I was studying was precisely a mathematical model of analogy - it was one way to formalize exactly what he was on stage talking about. At the same time, I knew I couldn’t be the first to realize this - there must be a community of people out there thinking about these things as well. I resolved to find that community.
Fast forward 3 years, and I’m in grad school, taking an advanced seminar on machine learning. The topic that semester was manifold learning. “Perfect!” I thought. “This must be where the people thinking about curvature and analogy are!” Imagine my surprise when I found out that almost none of the actual machinery of differential geometry came into it - it was just a fancy term for finding a nonlinear projection of data into a low-dimensional space. And no one was even using it for making the kind of large leaps in analogical reasoning that had first stirred my interest in the area. But it did give me the confidence to think that I could make an impact in this area. If an algorithm that you could write down in a few lines of MATLAB could get published in Science, then maybe I could do this too.
By 2013, I was nearly at the end of grad school. I’d moved into Bayesian machine learning, and then out of it. I’d published a NIPS paper or two. I was still trying to work out what to do with my life. Deep learning was starting to take over the world. Word2vec appeared, and analogical reasoning with latent representations became trendy. But this was just arithmetic in vector spaces! Surely that couldn’t work for data on curved manifolds, where the operations wouldn’t commute! In fact, what does it even mean to make analogies on curved manifolds? It seemed like a paradox to me. At the same time, Tomaso Poggio’s group published their massive “Magic Materials” manuscript, which seemed like it was on the right track in connecting invariances in vision to Lie groups. But even there, it seemed like something was missing. There are a huge number of transformations that leave objects invariant. Shouldn’t there be one Lie group for each individual invariance? Shouldn’t that product structure be something that can inform learning, reducing its complexity dramatically? Translation invariance is what makes ConvNets work. Couldn’t understanding the structure of all invariances, even discovering that structure, lead to even greater leaps in machine learning than the ConvNet?
Putting these ideas together was another lightbulb moment for me. I realized that the paradox that was bothering me so much about analogies on curved manifolds could be resolved if the manifold was a product of several smaller manifolds. More than that, I realized that the failure of operations to commute on curved manifolds didn’t have to be a failure at all - it could be used as a learning signal. If you could identify which operations on the data manifold commuted and which ones didn’t, you had a way to identify how the manifold could be factorized into its fundamental constituents. I had simply never seen anything like this in machine learning - using curvature to learn a factorization! I felt like for the first time in my career, I’d stumbled on a truly new idea. But when I tried to turn that intuition into an algorithm, I found I could make no headway. I was stuck.
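To make the commutation idea concrete, here is a toy sketch. It is not the Geomancer algorithm, only an illustration under stated assumptions: the data manifold is taken to be a product of two spheres, and the four "operations" (`A_x`, `A_y`, `B_x`, `B_y`, all hypothetical names) are fixed rotations of one sphere or the other. Operations that act on the same factor fail to commute, while operations that act on different factors commute, and that pattern alone reveals the two-factor structure.

```python
import numpy as np

def rot(axis, t):
    """3x3 rotation by angle t about coordinate axis 0 (x), 1 (y) or 2 (z)."""
    c, s = np.cos(t), np.sin(t)
    i, j = [(1, 2), (2, 0), (0, 1)][axis]
    r = np.eye(3)
    r[i, i], r[i, j], r[j, i], r[j, j] = c, -s, s, c
    return r

def make_op(factor, axis, t=0.5):
    """Operation on S^2 x S^2 that rotates only one of the two spheres."""
    def op(point):
        out = point.copy()
        out[factor] = rot(axis, t) @ point[factor]
        return out
    return op

# Hypothetical operations: A_* move the first sphere, B_* move the second.
OPS = {
    "A_x": make_op(0, 0), "A_y": make_op(0, 1),
    "B_x": make_op(1, 0), "B_y": make_op(1, 1),
}

def commutator_gap(f, g, point):
    """How far apart 'g then f' and 'f then g' land from the same start."""
    return np.linalg.norm(f(g(point)) - g(f(point)))

def commuting_pairs(point, tol=1e-9):
    """Pairs of operations whose order does not matter at this point."""
    names = list(OPS)
    return {(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if commutator_gap(OPS[a], OPS[b], point) < tol}

# A point on S^2 x S^2: one unit vector per row.
START = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

if __name__ == "__main__":
    for pair in sorted(commuting_pairs(START)):
        print(pair)
```

Only the cross-factor pairs survive the commutation test, so grouping operations by "does not commute with" recovers the two spheres as separate components. In real data nothing hands you such clean operations, which is part of why turning the intuition into an algorithm was hard.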
I finished grad school. I got a job at DeepMind and moved to London. AlphaGo happened. Machine Learning became Artificial Intelligence. NIPS became NeurIPS...and it kept growing, growing, growing. Around 2017, my colleague Irina Higgins published the beta-VAE, an algorithm to “disentangle” different factors of variation in image data. I realized that what others called disentangling was essentially what I’d been thinking of as manifold factorization. I knew I could offer a radically different approach to tackling the problem, if only I could formalize it. I reached out to Seb Racaniere, who did his PhD in differential geometry, to discuss my idea with Irina. After 15 minutes of me rambling, Seb just said “ah, you are thinking of the de Rham decomposition,” at which point I ran off to look up what that was. Right there on the page was the precise mathematical encapsulation of what I’d been trying to formalize for four years. In an afternoon, all of the pieces of what would become the Geometric Manifold Component Estimator, aka Geomancer, came together in my mind. I just had to make it work.
In just a few weeks, I had something that could work on synthetic data, but I knew that to make a great paper, I would need to get it to work on more “real” data - something like the image data that had first inspired me. For months I hammered away at it, but it seemed like nothing would work. In the meantime, disentangling became a hot topic - I even helped organize a workshop on it. While my work on Geomancer stalled, it started to lead to other ideas. I began working on spectral learning, which led to work on quantum chemistry, which became a major project in its own right. But still, I knew that I had to finish Geomancer before the field moved on. So I resolved that this would be the year to get it out, no matter how much of it I had managed to get working. And with a lot of help from my coauthors, we were able to make that happen.
You can judge the results for yourself - we did not in the end make it work on image data, though we think we correctly identified where it fails, and we believe the paper provides some very deep insights into the relationship between different branches of machine learning. The result is a bit like a bridge that goes halfway to its destination - but I think it’s a beautiful bridge, a marvel of engineering, and I still believe the other half will be completed someday. You could say this paper is the culmination of 14 years of thought and work, but I prefer to think it’s only the beginning. There is so much pressure in machine learning today to get out a paper that beats some big benchmark, or renders some amazing-looking video, or racks up a thousand citations in a year. But the ideas that will still be influential in a decade will probably not be the things that make the biggest splash today. I don’t know if Geomancer will be useful for anyone any time soon - though I hope it will be. But I do know that Geomancer is a genuinely new idea, and I truly believe that in the long run, it’s new ideas that make an impact. It took a decade and a half for that college lecture to turn into Geomancer. I just hope that Geomancer will inspire something like that in someone else.
8 December 2020