☎️ Interview: Ben Mildenhall, Co-Inventor of Neural Radiance Fields (NeRFs), on the State of Neural Rendering, Generative AI, and the Metaverse #007
On why generating 3D renders from 2D images is the best approach to synthetic 3D data; why adoption will come first from filmmaking and games; and the need for spatial interfaces
We must create high-quality objects and scenes to populate digital worlds with 3D assets. But collecting 3D data is expensive, and there isn’t much of it available. Ben Mildenhall, Research Scientist at Google and co-inventor of NeRF (Neural Radiance Fields), thinks the answer is to extract 3D data from something we have a lot of: 2D images. Ben is pioneering work on generating synthetic 3D renders from 2D images and believes this is the only way to build 3D worlds. Digital twins and the Metaverse will not be possible without the emerging field of neural rendering.
Takeaways
The debate in the field is how to scale 3D content creation. One group wants to collect more 3D data, trading off cost and time for accuracy. Another wants to create 3D renders from 2D images, trading off accuracy for speed and scalability.
There are two paths for progress in neural rendering. The first is making existing models faster, cheaper, and more efficient for commercial applications; this path will lead to high-quality text-to-video models producing clips of roughly 10 seconds. The more interesting path is combining multiple generative models for high-level reasoning. Imagine passive scene reconstruction and segmentation from a few seconds of video, plus the ability to manipulate the scene.
Neural rendering is the pathway to digital twins, larger-scale simulations, and the Metaverse. They are all just high-fidelity renders of something physical, and without neural rendering, you can’t create digital assets at the scale required.
Neural rendering will first see adoption in filmmaking and games, where designers feel the pain of creating assets daily. Models will be cheap and fast but of lower quality, so they are likely more suitable for mock-ups, prototypes and inspiration. The higher-value use cases for digital twins and Metaverse require higher quality and will take longer.
Spatial interfaces will become the dominant user interface only when we have high-quality mixed-reality headsets to make the interfaces intuitive. Mixed reality headsets have a long way to go before they are mainstream-ready regarding battery life, weight, computing power, and optics. Still, it’s all engineering and not impossible.
The speed of progress is extraordinary, and it’s hard to predict what will come from the combination of models. It’s hard to look more than two years out, but we can certainly expect useful, fast, cheap, and high-fidelity neural rendering models in the next few years for things for which we already have lots of 3D data.
By 2030, we won’t see “an entirely AI-generated 20-minute short film wins an Oscar.” That’s the wrong framing. Filmmakers will use neural rendering models, and it will be about who uses the tools most effectively. Why limit yourself to a single few-line prompt when you can easily write a richer prompt, or add more images, videos, and music, to get a better output? At what point is that co-creation rather than entirely “AI-generated”?
---
Reflections
Validated the view that what he calls neural rendering is a key building block for digital twins and the Metaverse. Neural rendering should be treated as a distinct technology, even though it is part of computer vision.
Tested the assumption that the size and diversity of 3D datasets like ShapeNet limited the field. Ben’s work and neural rendering offer a way around the data problem. Neural rendering fits our internal thesis that future progress in AI will be driven by “algorithms not data.”
Validated that adoption will come first in filmmaking and gaming, where 3D asset creation already happens and there is a known cost that neural rendering can reduce. Further validated the view that the outputs will be limited in quality and video length; therefore, the first use cases will be for inspiration and prototyping.
Tested the assumption that progress would be rapid and that we can expect minute-plus videos of reasonable quality soon (within 12 months). This assumption was speculative but based on rapid progress with text-to-image models like Stable Diffusion. But neural rendering is still in its infancy, and we are still relatively far from general-purpose models.
Still, this interview reinforced the view that investors should be looking to invest in neural rendering startups doing foundational work, even if market adoption may only begin in the 2025-2030 range rather than 2023-2025.
"So we can imagine text/single-image-to-3d models that will get much attention, like Stable Diffusion. And then there is just much better scene reconstruction from 10 or 50 images. That will have maybe less PR-y value but be super useful for AR/VR and many retail and industrial use cases. And then this all leads us to a world in which you use smart glasses to capture a few seconds of videos to create a full high-resolution scene."
Let’s start by getting oriented. Do you consider your field to be 3D reconstruction or, broadly, computer vision or something else?
Computer vision is a little too broad, and 3D reconstruction is too narrow, as it implies specifically reconstructing geometry. We care about rendered images. A good new term is neural rendering: it covers any graphics or rendering pipeline where some component is learned rather than hand-designed. That could be part of the representation, the renderer, or both. So let’s go with neural rendering.
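To ground what “learned rather than hand-designed” means in practice, here is a minimal NeRF-style sketch: a small MLP stands in for the scene representation, and a conventional volume-rendering loop turns it into pixel colours. This is a toy illustration in PyTorch; the names (`RadianceField`, `render_ray`) and sizes are made up for this sketch and are not from any released codebase.

```python
import torch
import torch.nn as nn


class RadianceField(nn.Module):
    """Toy learned scene representation: maps a 3D point and a view direction
    to a density and an RGB colour. A real NeRF adds positional encoding and
    is trained from posed 2D photos; this is an illustrative stub only."""

    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 colour channels
        )

    def forward(self, points, dirs):
        out = self.mlp(torch.cat([points, dirs], dim=-1))
        density = torch.relu(out[..., :1])   # non-negative density
        rgb = torch.sigmoid(out[..., 1:])    # colours in [0, 1]
        return density, rgb


def render_ray(field, origin, direction, near=0.1, far=4.0, n_samples=64):
    """Classic volume rendering: sample points along the ray, query the
    learned field, and alpha-composite front to back into one pixel colour."""
    t = torch.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction              # (n_samples, 3)
    dirs = direction.expand(n_samples, 3)
    density, rgb = field(points, dirs)
    delta = (far - near) / (n_samples - 1)                 # uniform spacing
    alpha = 1.0 - torch.exp(-density.squeeze(-1) * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                # per-sample contribution
    return (weights[:, None] * rgb).sum(dim=0)             # composited pixel colour


# One ray through an untrained field, just to show the shapes involved.
pixel = render_ray(RadianceField(), torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```

In a real pipeline this network would be optimized so that rendered pixels match a set of posed 2D photographs of the scene, which is what makes the representation “learned” rather than hand-designed.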
I classify computer vision into two application areas: visual understanding and content creation, whether reconstruction or generating something new. Is 2D versus 3D computer vision a useful framing?
I don’t think 3D versus 2D is the right way to think about it. Most tasks people have done in 2D, they’ve also tried to do in 3D; they’re just harder. So many of the things we are good at today are 2D. We classify and segment images relatively easily now. And we are pretty good with some videos with 3D bounding box detection and such, but it’s harder. It’s harder because we don’t have a single 3D sensor to capture the ground truth. There’s no 3D sensor analogous to a 2D image, where you get a perfect, dense measurement of the truth. There’s a lot of post-processing of the raw data that introduces artifacts and adds noise. You can’t trust 3D data in the same way as 2D data, which makes the whole field a lot more complicated. But that’s not specific to adding a dimension; it’s just harder because it’s more complex, you need to use synthetic data techniques, and you have to do more computation. So you might say it’s not a binary 2D versus 3D, but more a spectrum of increasing complexity.
Also, when you are dealing with 3D, you are now dealing with synthetic data. In 2D computer vision you can just take an image and process it. That’s not the case in 3D, because you just don’t have the variety and size of datasets you need. So you need to create new synthetic data. There is a debate in the community at the moment about whether 3D synthetic data is the right approach at all, whether it scales, and whether you even need 3D ground truth. For a long time, the main line of 3D computer vision research used the Stanford ShapeNet dataset, a dataset of thousands of 3D models grouped into semantic categories: cars, planes, chairs, and so on. Researchers spent their time using that dataset to make generative models. Researchers have also collected 3D data from games and simulations to build bigger datasets and train better models. But we think that approach is expensive and slow. We have taken a different approach. I’m focused on extracting 3D information from 2D. So I don’t use 3D data or create synthetic data. I just see what you can get from 2D data and lift from that, since that’s the data source we can easily collect. And that scales the best, since there is a ton of cheap, labeled data already out there. Also, 2D needs less storage, memory, computing, and so on. It’s just a much more scalable approach.
Over the last two years, we have maxed out what you can do with images of a single scene alone. So there are now many paths to progress. If you start to bring in priors from models trained over large image collections or some other source of data, maybe that can aid the reconstruction by adding some semantic knowledge of how images and the world should look. We can take a category from ShapeNet, take some 3D representation, maybe a mesh, voxel grid, or point cloud, and then learn a standard diffusion or other generative model architecture over that. Then you can sample a latent code or a noise vector and get out a new chair, a shape that looks like a chair but isn’t one from the dataset. But the problem with ShapeNet is that the models are pretty simple, and you don’t have many. So you’re limited.
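As a rough sketch of that “sample a noise vector, get a new shape” pattern, here is a toy decoder over a coarse occupancy grid. `ShapeDecoder` and its dimensions are invented for illustration; an actual model trained on a ShapeNet category would typically be a diffusion model or GAN over a far richer representation.

```python
import torch
import torch.nn as nn


class ShapeDecoder(nn.Module):
    """Illustrative stand-in for 'train a generative model over one ShapeNet
    category, then sample new shapes': a latent code in, a low-resolution
    occupancy grid out. A real system would use a diffusion model or GAN,
    a richer representation (mesh, point cloud, SDF), and far more detail."""

    def __init__(self, latent_dim=64, res=16):
        super().__init__()
        self.res = res
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, res ** 3),
        )

    def forward(self, z):
        logits = self.net(z)
        return torch.sigmoid(logits).view(-1, self.res, self.res, self.res)


# Sampling a "new chair": draw a noise vector and decode it.
decoder = ShapeDecoder()   # in practice, trained on a ShapeNet category
z = torch.randn(1, 64)     # latent / noise vector
voxels = decoder(z)        # (1, 16, 16, 16) occupancy probabilities
```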
How fast will progress be over the next couple of years? And more specifically, what are you working on beyond just making neural rendering faster and cheaper? What new capabilities can we expect?
You can split it into two lines of work: the first is, yes, faster and cheaper models. That is probably the applied side, because there are already many applications you can imagine where short, roughly 10-second, high-quality videos would be valuable. And then you have the more interesting line of work, which is around reducing the number of images you need, to make the thing even more scalable. We have a pretty good intuition about how many pictures you need to capture of an object or scene for a good reconstruction. So we can improve on the capture side with things like Polycam, Luma, and ARKit. But that is still pretty unwieldy. A more interesting path might be using generative model priors to hallucinate the missing information and see what we end up with. So there is the accuracy piece and the creativity piece. We might have text- or single-image-to-3D models; that might be more for creativity. And then there’s the case of having 10 or 15, even 50 images of a room or something. That’s not enough to completely constrain the problem, but I would still like to get a full, clean reconstruction where anything ambiguous has been corrected. So there will be this quality slider and different models to fill those needs.
Maybe the most exciting area of research is combining neural rendering with high-level reasoning. You can do simple examples where you run 2D segmentation on all the input images. You get some features, and then you can train a model that recovers not only shape and colour but also, in 3D, a semantic label or a semantic feature, so you get this kind of feature field instead. I’m guessing people will start to push on that a lot and try to get better results, jointly doing the 3D reconstruction and a full semantic decomposition of the scene, such that you can maybe even start to edit or modify it somehow. I’m sure that kind of stuff is coming.
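A minimal sketch of what such a feature field could look like, assuming a small PyTorch MLP with an extra semantic head whose outputs are composited with the same volume-rendering weights as colour. `FeatureField` and `composite` are hypothetical names for illustration, not drawn from any specific paper.

```python
import torch
import torch.nn as nn


class FeatureField(nn.Module):
    """Toy 'feature field': the network predicts a semantic feature vector
    alongside density and colour, e.g. distilled from a 2D segmentation model
    run on the input images. Illustrative only."""

    def __init__(self, hidden=64, feat_dim=16):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Linear(hidden, 3)
        self.feat_head = nn.Linear(hidden, feat_dim)

    def forward(self, points):
        h = self.trunk(points)
        density = torch.relu(self.density_head(h))
        rgb = torch.sigmoid(self.rgb_head(h))
        feats = self.feat_head(h)
        return density, rgb, feats


def composite(weights, rgb, feats):
    """Alpha-composite colour and semantic features along a ray with the same
    volume-rendering weights, so 2D features can supervise the 3D semantics."""
    pixel_rgb = (weights[:, None] * rgb).sum(dim=0)
    pixel_feat = (weights[:, None] * feats).sum(dim=0)
    return pixel_rgb, pixel_feat
```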
So we can imagine text- or single-image-to-3D models that will get much attention, like Stable Diffusion. And then there is just much better scene reconstruction from 10 or 50 images. That will have maybe less PR value but be super useful for AR/VR and many retail and industrial use cases. And then this all leads us to a world in which you use smart glasses to capture a few seconds of video to create a full high-resolution scene.
I don’t think I understand the high-level reasoning bit. Are you imagining a case in which a scene is uploaded, you run an algorithm, and we get real-time context like, “It looks like this is a scene of Paris; do you want me to add a richer cityscape and highlight places of interest?”
Well, let’s start at a low level. First, context means reconstructing the scene well enough that you can render a new view, or look at the underlying geometry, and it looks clean to the eye. And then we ask, what else would be valuable? You can imagine a full scene-graph-type decomposition where you have a room, and then the room scene is decomposed into a table, chairs, a TV, a whiteboard, and so on. And then maybe you break the chair down into assets like legs and materials. So to some extent, you have “context” about the room, and now it’s modifiable. This is, in some sense, a digital twin. And now we ask, what else might be useful for a digital twin? Maybe you add all the materials, their estimated weights, or predictions of how the shadows might change under different sunlight conditions. You can plug this data source into any algorithm you want. And so I think about 3D assets, digital twins, simulation, and the Metaverse as the same thing. They are all just high-fidelity renders of something physical. So if you start creating 3D renders from 2D data, you end up with these richer and more useful environments.
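One way to picture that decomposition as data is a simple scene graph, sketched below in Python. The node fields (material, estimated mass) are illustrative guesses at what a digital twin might attach to each asset, not a defined format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SceneNode:
    """One node in a hypothetical scene-graph decomposition of a reconstructed
    room. Field names are illustrative, not from any digital-twin standard."""
    label: str                               # e.g. "room", "chair", "chair_leg"
    geometry: Optional[object] = None        # handle to a mesh / NeRF asset
    material: Optional[str] = None           # e.g. "oak", "steel"
    estimated_mass_kg: Optional[float] = None
    children: List["SceneNode"] = field(default_factory=list)


# A reconstructed room decomposed into editable assets.
room = SceneNode("room", children=[
    SceneNode("table", material="oak"),
    SceneNode("chair", children=[SceneNode("chair_leg", material="steel") for _ in range(4)]),
    SceneNode("whiteboard"),
])
```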
We already see that access to data is the bottleneck for machine learning algorithms. We’ll never have enough high-quality data, so we need algorithms to make more training data. And if we want big, full-blown simulations and digital twins, we’ll never get there because we can’t get high enough fidelity. Unless we have NeRF, right?
I guess there’s a question of which aspect you’re referring to with high fidelity. Okay, so there’s the basic fact that you can’t just scale to a Metaverse environment by hand. You literally can’t scale video game design; the limitations are cost and time. These massively multiplayer online games just aren’t that massive. A truly massive one would take 50 years to make and cost something like a hundred billion dollars. So absolutely, we have to automate creating new data, both for training models and for the environments themselves.
But I don’t know, I suppose this is why we are working with the data we have in 2D. We probably have enough of that type of data. The unknown is how to think about cost and fidelity. There are probably thousands of applications that don’t need extremely high-fidelity rendering. Do we have enough data, assuming we can get down to 50 or fewer images or something? Is that enough for huge simulations and digital twins? Probably not yet, but it’s going to be meaningful. And then, as I said earlier, pair this with some language models for context, and maybe for some synthetic generation, and you might get to levels of fidelity you might not expect.
So yeah, broadly, you are right; data is a bottleneck for high-end things. But it’s entirely possible we can get there with the data we have, combined with some generative models. Also, it depends on what you are trying to achieve. You may need to collect the data directly if you need a perfect replica of a real-world thing, an advanced digital twin. But if you just want a more realistic shop in a virtual environment, maybe 2D images, NeRF, and some synthetic generation are enough, right?
I imagine a persistent virtual environment, ideally with millions of people simultaneously experiencing the same state. It becomes more realistic and, therefore, more immersive over time. A lot of infrastructure is needed for that vision, and one piece is populating these worlds with realistic content. It strikes me that any vision of the Metaverse requires a way to create things quickly and cheaply. And things in the Metaverse are 3D. So the recent big advances in generative models for text and images don’t get us anywhere near that vision. But what you are working on, and NeRF, do. They are as much a fundamental enabler of the Metaverse as new databases or virtual reality headsets.
When people say Metaverse, it’s code for a full, hundred percent complete 3D understanding of the world, plus any application you would want on top of that. I guess with the Metaverse there’s the additional component of people being inside it, which is fine. So there is a lot of technology that needs to mature. You need higher-fidelity headsets with lower latency. You need hand recognition and gesture control and things like that. But yeah, getting cheap, realistic objects and scenes into these spaces is a massive unlock. You need to be able to view these things in high fidelity too, so the access device is important. But yeah, it’s an enabler. That’s all applied, though; I am working on the research side, so I don’t think too much about the downstream users. But it’s not that you have nothing and then suddenly populate a Metaverse. You have these intermediate stages where NeRFs are useful, like the McDonald’s advert and people using it for VFX in filmmaking. So we will see the technology in the wild way before a Metaverse. Even on e-commerce websites, for example.
Is that sort of the way you think about it? That this is a barrier-lowering technology?
I think that’s true. So it’s lowering the barrier for new users to do 3D content creation. But it’s also lowering the barrier for the people who already understand and use this stuff, and for whom it is too slow and expensive. That’s where we will see this first. These people feel the pain today; it’s slow and time-consuming to make 3D assets. Like today, why would a normal person use Blender or Maya? They don’t have an obvious use case, and it takes thousands of hours to learn. There’s no motivation unless you’re a devoted hobbyist. We sort of saw this with Lightroom and Facebook. They offered 3D photos and some basic capabilities, but no one used the tools. It lacks a clear use case. Even with Apple today, you can scan a room with an iPhone. The feature is great and creates some amazing stuff. But again, people aren’t using it; why? When I am looking for an apartment, I should create a 3D render, but I don’t.
I think it’s still one of those things humans are better at than computers. Humans have spatial memory and are good at recalling and navigating 3D space. It’s remarkable how good we are at making guesses about spatial affordances; on a higher semantic level, you have perspective cues and so on, and you know where everything is in its context. And you can interact with things immediately. One of the things I thought was cool about VR, beyond the display part, is that you get a six-degree-of-freedom controller you can use to manipulate stuff, which is mind-blowing. Even if you use that with a 2D screen, like what Leap Motion had. Anyone who’s used a 3D rotation controller to manipulate the rotation of an object in three dimensions can tell you there is something intuitive about it. It’s way more natural than a mouse. I think it’s inevitable that we will use these 3D creation tools, but probably not until we have 3D interfaces on mixed-reality devices.
The PC was command line, then GUI. Mobile was touch. You could introduce touch to laptops; people have tried it, but it hasn’t worked. So is that the right frame: we need high-quality headsets before people embrace a new UX like spatial interfaces?
We’re talking about the relationship between hardware and software. There’s a back-and-forth where every time the hardware makes a step, the software isn’t ready, and every time the software makes a step, the hardware isn’t ready. People have gotten flak for preaching about the Metaverse or the inevitability of AR and VR, but if you’ve used a VR headset, you can’t deny there is something there. There is still a long way to go on the hardware side; the headsets are too low-powered, there isn’t much content, and 3D scanning doesn’t work well.
But when headsets first came out, there was no way to scan the room and get a mesh of it. And look how long pass-through took to work. The software was, and still is, kind of clunky. It was not a robustly solved computer vision problem to say, okay, we will have a couple of cameras on the headset and pipe the correctly interpolated view to your two eyes so it looks like you are looking straight out into the room. That was not an easily solvable problem, and there wasn’t a great reason to work on that exact problem, so progress was pretty slow. It turned out all that work went into adding depth to cameras and solving some fundamental computer vision problems. Then you had companies like Meta working just on VR and building huge R&D teams to ship products that could do pass-through. And that is super basic: just getting an exact feed close to your eyes and warping it slightly but correctly. But now we are talking about doing a whole room scan. Ideally offline. Ideally maintaining accuracy over an extended period. And from a bunch of different views too.
And that’s not even thinking about the optical challenges of avoiding nausea. We need faster computing at lower latency and with less heat dissipation so people can safely wear headsets for hours. So it’s very easy to envision how this could look in the future, but there are all these annoying issues at each intermediate stage, and it’s very easy to criticize it and say, oh, I don’t know if this is going to work. People have made much progress, but it’s slow. Think about how limited 3D printing was initially and how far it’s come. So this isn’t unique to AR/VR, nor are the challenges much harder. So I’m optimistic we will get high-quality headsets relatively soon. And yes, the experience will be so much better than touch, because of that spatial intuitiveness, that people will switch en masse.
Interesting, so let’s think about that a bit more. We need to hype a technology’s readiness to raise capital, bring on talent, and get attention to overcome the technical hurdles in the first place. But then, when the hurdles inevitably take longer than promised, it’s called a fad, and everyone moves on. The Metaverse is clearly in that stage now. So what do you think about this boom and bust regarding market adoption?
It’s probably more nuanced than boom-and-bust. As I said, there are obvious intermediate milestones where you can track definitive progress, like pass-through. So, on the one hand, you can point to NeRF and progress in 3D content creation in terms of model accuracy and usefulness of outputs. But what can you do if someone on Twitter says we will get feature-length videos from a few prompts this year? The discussion is much more sensible if you talk to people in the industry, the people who know. We all know the size and timing of the market are being overplayed right now. The reality is that filmmakers and game studios will experiment with the tech first, and in very limited ways, because the outputs aren’t good enough for their needs yet. But their costs are high, so it’s worth experimenting. It’s easier first to reach the people fluent in existing 3D pipelines and give them cheaper or faster tools. They have a much higher tolerance for lower quality at higher speed because they know how long it takes to create good 3D content. The average person doesn’t. So if an average person were to use a NeRF-powered content creation tool today, they would think it’s really bad. They won’t appreciate how difficult it is. This is why it’s hard to imagine user-generated Metaverse content in the short term: people want to create good stuff, not bad stuff.
I think it’s more likely that consumers won’t create content as such; rather, it will be passively created for them to manipulate. An iPhone or an Apple headset has the hardware and software to run scene reconstruction all the time, like the parallax effects already on the display. So the “create” part of the workflow is automated. Consumers can then manipulate the output, or it can be fed into some digital twin or a continuously updated Metaverse environment. However, that has to be long-term, given the computing and power requirements. I imagine it will be like the Apple Watch: some of these features launch but are limited, maybe real-time head tracking for avatars or whatever, and then capabilities are added over time. Ambient 3D capture feels like the obvious end state, but that’s far away.
My mental model is that the Apple and Samsung devices will have limited battery life, maybe a couple of hours at best. So there will only be a couple of very clear use cases. I think maybe something around fitness. I don’t know, you go to the gym or you do yoga, something like that makes sense. But it’s a case of having a tight use case where neural rendering is happening in the background in really high fidelity.
I guess it’s a given that if Apple is launching something this year, it will be more of a Watch thing than an iPhone thing. The device will not be cheap, so the target use cases will be at the high end. The devices must get out there so developers can understand how to develop for them and what experiences are possible. So we will see applications that are either low-powered or don’t use much computing, or short applications that are useful for, at best, a few hours. Continuous, high-performance applications just won’t be possible without a tethered connection. Once devices are in the market, we might start to see more domain-specific datasets and tweaked models that help extend battery life through optimization. That will all be in the realm of commercialization, though. Our work will still be on making the underlying models more useful.
Where will the most exciting research come out in the next few years? If I'm looking at papers, what sort of things can I expect?
The most exciting and surprising things will come from combining fields, just because they can piggyback on what’s happened in text and images. Video is the next domain, so you should expect many examples of combining the fields. Text-to-3D has been around for less than a year. The speed of development is very fast, and while I have a good sense of what’s possible on the neural rendering front, there will be some crazy things that come out of the combinations of generative tools. The next year or two will be pretty wild to watch. I honestly have no idea what’s going to happen after two years. There is even an AGI scenario, which shows how hard it is to make predictions in AI right now.
By 2030 a prediction would be: An entirely AI-generated 20-minute short film wins an Oscar.
Yeah, but that’s too vague. What exactly is “entirely AI-generated”? A single prompt? Or just text? What about a mood board and a one-page script? How many pictures? I do not doubt that filmmaking tools will allow a creator to use a variety of assets, including text, images, videos, and music, to create videos much more quickly. We can overlay other tools to ensure the right emotions are hit at the right time. Films are pretty formulaic. But I’m not sure something like what you are describing is useful; why limit it to a text-to-20-minute video? Why not have a better output with 50 images and some demo music, for example? And at that point, you have to ask, well, is it entirely AI-generated? This will probably be a broader debate in the creative industries going forward.