A developer offers a glimpse of VR's generative AI future with a world built on Stable Diffusion.
Generative AI systems for text, image, audio, video and 3D have made tremendous strides recently. They have the potential to change how we work - and in some cases already are - by enabling people to create sophisticated audio-visual media, or simply better text.
Generative AI is also the foundation for the further proliferation of 3D content - much as smartphone cameras were for photography. The well-known Silicon Valley venture capital firm Sequoia Capital believes that current generative AI systems are the vanguard of a computing revolution.
A developer is now demonstrating the potential of generative AI with a VR world generated by the open-source image AI Stable Diffusion.
Stable Diffusion for VR
The developer combines Stable Diffusion with the visual programming environment TouchDesigner and calls the result an "immersive latent space in real time". He sees the video below as proof of the technology's future potential and has announced further refinements. According to the developer, you can move freely in the demonstrated Stable Diffusion VR world.
The fact that objects in the video change continuously when you look at them for a while is, according to the developer, a side effect of the current Stable Diffusion implementation: the image AI assumes it could have drawn an object better the longer you look at it, and generates a new variant.
Google's Phenaki text-to-video system shows that generative AI can also create coherent scenes. The video AI renders videos up to two minutes long based on sequential prompts.
Great technical effort - with prospects for rapid improvements
Besides Stable Diffusion, the developer uses a second AI system: Intel's MiDaS is responsible for the 3D representation of the environment. The MiDaS model estimates 3D depth from a single image, and the Stable Diffusion generations are then projected onto that depth.
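For context, this is roughly what single-image depth estimation with MiDaS looks like when the model is loaded via torch.hub, following Intel's published usage. It is a generic sketch rather than the developer's actual TouchDesigner pipeline, and the input file name is a placeholder.

```python
# Minimal single-image depth estimation with MiDaS via torch.hub.
# "frame.png" is a placeholder input image.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")  # small model for speed
midas.eval()

transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
input_batch = transform(img)

with torch.no_grad():
    prediction = midas(input_batch)
    # Resize the predicted depth back to the input resolution
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth_map = depth.cpu().numpy()  # relative (inverse) depth per pixel
```

The resulting depth map can then serve as the geometry onto which a generated image is projected.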
The demo runs in real time but requires an enormous amount of computing power: according to the developer, it consumes 40 credits per hour on Google Colab with an Nvidia A100. The demo itself was created on an Nvidia RTX 2080 Ti with 11 GB of VRAM.
The MiDaS model runs continuously on every frame, while Stable Diffusion runs at a preset rate. To further reduce the computing load, the system only renders the image in the current field of view instead of the full 360-degree environment. In the demo, the same image is rendered for each eye, so stereoscopic 3D is not yet supported, but that will "definitely be improved," according to the developer.
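The scheduling described above could look something like the following sketch. All helper functions (crop_to_fov, estimate_depth, run_diffusion, depth_to_mesh, project_texture, present_to_both_eyes) and the update interval are hypothetical stand-ins, not the developer's actual TouchDesigner components.

```python
import time

# Hypothetical sketch of the update scheduling: depth every frame,
# Stable Diffusion at a preset rate, rendering only the field of view.
DIFFUSION_INTERVAL = 0.5          # assumed seconds between Stable Diffusion updates
prompt = "a surreal dreamscape"   # placeholder prompt
last_diffusion = 0.0
latest_texture = None

while True:
    frame = crop_to_fov(capture_view())   # only the current field of view,
                                          # not the full 360-degree environment
    depth = estimate_depth(frame)         # depth model runs on every frame

    now = time.time()
    if latest_texture is None or now - last_diffusion >= DIFFUSION_INTERVAL:
        # the image model runs at a preset rate, less often than the depth model
        latest_texture = run_diffusion(prompt, init_image=frame)
        last_diffusion = now

    mesh = depth_to_mesh(depth)                   # displace geometry with the depth map
    view = project_texture(mesh, latest_texture)  # project the latest generation onto it
    present_to_both_eyes(view)                    # same image per eye: no stereo 3D yet
```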
"The speed of Stable Diffusion is skyrocketing these days, but we're still in need of better optimizations," the developer writes. He can't say when the demo or something similar might be released as a test version. Currently, the code is spread across two neural networks and three different hardware configurations, and bringing it all together would be more effort than he could do on his own.
Nevertheless, further refinements are in the works. Those who want to participate can find more information in the Deforum GitHub repository or join the group directly on Discord.
Carmack's vision: Automated VR worlds for every video
At the same time, star developer and former Oculus CTO John Carmack is weighing in on Twitter. A VR enthusiast who now works on AI, he knows both technologies. His dream is automatically generated photogrammetric 3D worlds "constructed from every movie and video ever filmed," Carmack writes.
Many technical challenges remain to be solved, he says, especially around geometry, such as merging different camera positions. But according to Carmack, "it feels like we are right at the cusp of neural models solving everything, including outpainting."
His vision is a generative AI system that creates 3D worlds based on any given video. "I’m sure there are already experiments with this, but if it escapes the lab like Stable Diffusion did, it will be fantastic," Carmack writes.