Image generated with Stable Diffusion, a deep learning text-to-image diffusion model, based on the prompt “A cat sitting beside a paper bag.”

Visual AI: The cat is out of the bag

Published in UBC Science · Nov 17, 2023 · 5 min read


A UBC computer scientist teamed up with Adobe and Apple to reconstruct 3D images and videos from situations that never existed

By Geoff Gilliard

Until very recently, debates about the growing power of artificial intelligence (AI) weren’t a primary concern for computer scientist Dr. Kwang Moo Yi. He’s at the forefront of a field called visual geometry, which uses AI to create 3D models of things that already exist and to simulate how those real things behave under certain conditions, such as determining the optimal shape of a windshield for the best aerodynamics.

But since deep learning was introduced to computer vision in 2012, its capabilities have grown by leaps and bounds. AI now allows for the reconstruction of 3D images and videos from situations that never existed. The photo-realistic rendering is important for augmented reality experiences, but it also raises the threat of deep fake videos at a time when politics is already polarized and disinformation rampant.

“How do you know what you’re seeing is real? What’s the credibility of a video?” asks Dr. Yi, assistant professor in the University of British Columbia’s Department of Computer Science. “So there are societal implications now, because the technology is working so well. And it’s very close to being available more broadly.”

Dr. Kwang Moo Yi wants to make 3D reconstruction and reasoning accessible through his algorithms so everybody can have the same level of image editing expertise.

In research with Apple, Dr. Yi has shown that it’s possible, starting with an authentic video of a person walking around a static scene, to create an interactive 3D model of that person posing and moving in ways that never happened.

From a 10-second video clip, two neural radiance field models (one of the human, one of the scene) were trained to estimate the geometry of each and to render both the human and the scene in new poses and from new viewpoints.
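To give a rough sense of how this works (this is a toy sketch, not the method from the Apple collaboration), a neural radiance field is a small network that maps a 3D point and viewing direction to a colour and a density, and a pixel is rendered by sampling points along a camera ray and compositing them. The sketch below composites two such fields, one for the human and one for the scene; every name and size in it is illustrative.

```python
# Conceptual sketch only: two tiny neural radiance fields (one for the human,
# one for the static scene) whose densities and colours are composited along
# each camera ray. The actual research uses far more elaborate,
# pose-conditioned models; nothing here comes from the paper.
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),   # input: 3D point + view direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # output: RGB colour + density
        )

    def forward(self, points, view_dirs):
        out = self.mlp(torch.cat([points, view_dirs], dim=-1))
        rgb = torch.sigmoid(out[..., :3])      # colour in [0, 1]
        sigma = torch.relu(out[..., 3:])       # non-negative density
        return rgb, sigma

def render_ray(fields, points, view_dirs, deltas):
    """Alpha-composite samples along one camera ray from several radiance fields."""
    rgbs, sigmas = zip(*(f(points, view_dirs) for f in fields))
    sigma = torch.stack(sigmas).sum(0)                           # combined density
    rgb = (torch.stack(rgbs) * torch.stack(sigmas)).sum(0) / (sigma + 1e-8)
    alpha = 1.0 - torch.exp(-sigma * deltas)                     # opacity per sample
    trans = torch.cumprod(                                       # transmittance so far
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]], 0), 0)
    return (trans * alpha * rgb).sum(0)                          # final pixel colour

human_field, scene_field = TinyRadianceField(), TinyRadianceField()
n = 64                                         # samples along the ray
points = torch.rand(n, 3)                      # sample positions (placeholder)
view_dirs = torch.zeros(n, 3)                  # viewing direction (placeholder)
deltas = torch.full((n, 1), 1.0 / n)           # spacing between samples
pixel = render_ray([human_field, scene_field], points, view_dirs, deltas)
```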

Then there’s interactive neural video editing (INVE), a project from Dr. Yi’s lab conducted in partnership with Silicon Valley software giant Adobe. INVE allows even novice video editors to quickly add or modify visual effects on existing clips.

Layered neural atlases (LNA), the framework INVE is built on, is too slow for interactive video editing, so Dr. Yi’s lab improved LNA’s training speed and added layered editing. Each set of new data added to a video (graphics, vectorized sketches or colour adjustments) sits on its own editable layer, as in Photoshop, or directly on the frames themselves. An edit made on one frame then moves realistically with the video’s original contents throughout the entire clip.

Interactive neural video editing can enable object or actor recoloring, relighting, texture editing and many other tasks in real time.
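The layering itself is conceptually close to ordinary alpha compositing; what LNA and INVE add is a learned mapping that carries each edit layer through the clip. The toy sketch below is purely illustrative and is not the INVE code: a hand-made per-frame translation stands in for that learned mapping.

```python
# Toy illustration of layered video editing: an edit painted once (an RGBA
# "sticker" layer) is re-placed on every frame by a per-frame mapping.
# INVE/LNA learn that mapping with neural atlases; here a simple translation
# stands in for it, purely for illustration.
import numpy as np

T, H, W = 30, 120, 160
video = np.random.rand(T, H, W, 3)              # placeholder video frames

sticker = np.zeros((20, 20, 4))                 # RGBA edit layer
sticker[..., 0] = 1.0                           # a red square...
sticker[..., 3] = 0.8                           # ...at 80% opacity

def composite(frame, layer, top, left):
    """Alpha-composite an RGBA layer onto a frame at (top, left)."""
    h, w = layer.shape[:2]
    alpha = layer[..., 3:4]
    region = frame[top:top + h, left:left + w, :3]
    frame[top:top + h, left:left + w, :3] = alpha * layer[..., :3] + (1 - alpha) * region
    return frame

edited = video.copy()
for t in range(T):
    # Stand-in for the learned mapping: the edit follows a point moving right.
    edited[t] = composite(edited[t], sticker, top=40, left=10 + 3 * t)
```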

AI’s self-taught manipulation of reality isn’t limited to video. With FaceLit, Dr. Yi conducted research with Apple to generate photorealistic 3D facial images from existing photo datasets.

“We used existing algorithms to estimate, given a facial image, where the camera was and lighting conditions when the photo was taken. Then we use AI to train a ‘generative model’ of how things should look given certain lighting conditions and camera views. It figures out what the image would look like if the subject was the image of a person drawn at random from a dataset.”

Previous methods of turning photos of faces into 3D renderings weren’t generative: models had to be trained for each face and couldn’t generate new ones. Beyond that, the face could only be set in scenes built from a collection of images of that scene taken from multiple perspectives, rather than in images drawn from existing datasets.

With earlier generative models, the components of scenes — geometry, appearance and lighting — were entangled. Users couldn’t control each component separately. But using physics-based rendering, FaceLit learns to disentangle the components so that users can control the camera angle as well as the subjects’ pose and illumination, and present them in scenes from existing datasets of images “in-the-wild.”
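In code terms, disentanglement means the generator receives the camera pose and the lighting as explicit inputs alongside a random identity code, so each can be varied on its own. The sketch below illustrates only that interface, not FaceLit’s actual 3D-aware, physics-based architecture; the dimensions and parameter names are assumptions made for illustration.

```python
# Minimal sketch of the disentanglement idea (not FaceLit's architecture):
# an identity code, a camera pose and lighting parameters are separate inputs,
# so changing the lighting or viewpoint never touches the identity.
import torch
import torch.nn as nn

class ConditionalFaceGenerator(nn.Module):
    def __init__(self, z_dim=64, cam_dim=3, light_dim=9, img_size=32):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(z_dim + cam_dim + light_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * img_size * img_size), nn.Sigmoid(),
        )

    def forward(self, z, camera, lighting):
        x = torch.cat([z, camera, lighting], dim=-1)
        return self.net(x).view(-1, 3, self.img_size, self.img_size)

gen = ConditionalFaceGenerator()
z = torch.randn(1, 64)                        # identity: a "person" drawn at random
camera = torch.tensor([[0.1, -0.2, 0.0]])     # e.g. yaw, pitch, roll (assumed)
lighting = torch.randn(1, 9)                  # e.g. spherical-harmonic lighting coefficients

img_a = gen(z, camera, lighting)              # one face under one light
img_b = gen(z, camera, 0.5 * lighting)        # same face, same pose, dimmer light
```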

“What FaceLit learns is an abstract space, where every point of the space corresponds to an actual object such as a human face, a cat or a building,” Dr. Yi explains.

“Anything that doesn’t make sense physically or in terms of data will not be allowed in the abstract space. Because this space is only about meaningful things, chances are you can find a best fit able to describe what’s happening in reality. For example, even if you don’t know the source of the lighting or different materials of a surface, you’ll be able to determine those things because the only rational explanation for how this image is formed is already learned in the model.”

FaceLit can generate 3D faces while allowing users to control the camera pose and illumination. The environment map is rendered using the half-sphere at the bottom right. The images are from a model trained on the Flickr-Faces-HQ dataset and the MetFaces dataset of images extracted from works of art.

“Machine learning and deep learning are really good interpolators — if you have nearby data points that the model has seen during training, it does a good job,” says Dr. Yi. “The moment it steps outside, there’s a blank spot so ML is probably going to have trouble.”
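A toy experiment makes the point concrete: fit a simple model to data drawn from one range of inputs and its predictions hold up inside that range but fall apart outside it. The numbers below are illustrative only.

```python
# Interpolation vs. extrapolation: a polynomial fit to noisy sine data does
# well inside the training range [0, 2*pi] and badly outside it.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, 200)
y_train = np.sin(x_train) + 0.05 * rng.normal(size=200)

coeffs = np.polyfit(x_train, y_train, deg=7)            # a simple stand-in model

x_inside = np.linspace(0, 2 * np.pi, 100)               # inside the training range
x_outside = np.linspace(2 * np.pi, 4 * np.pi, 100)      # outside it

err_inside = np.abs(np.polyval(coeffs, x_inside) - np.sin(x_inside)).mean()
err_outside = np.abs(np.polyval(coeffs, x_outside) - np.sin(x_outside)).mean()
print(f"mean error inside training range:  {err_inside:.3f}")
print(f"mean error outside training range: {err_outside:.3f}")   # typically far larger
```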

Blank spots in datasets, though, are becoming less of a problem. Over the past year, software companies have started training text-to-image diffusion models with very, very large datasets. Stable Diffusion is an open-source deep learning model trained on five billion image-caption pairs scraped from the internet. Type in a request for a photo of a ballerina in boots and the software will respond with a photorealistic image.

An image generated by Stable Diffusion based on the text prompt “A photograph of a ballerina in hip waders.”
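With the open-source weights, producing such an image takes only a few lines, for example with Hugging Face’s diffusers library; the checkpoint name and settings below are common defaults rather than anything specified in the article.

```python
# Illustrative only: text-to-image generation with open-source Stable Diffusion
# weights via Hugging Face's diffusers library.
# Requires: pip install diffusers transformers accelerate torch (and a CUDA GPU).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",      # a commonly used public checkpoint
    torch_dtype=torch.float16,             # halves memory use on the GPU
).to("cuda")

prompt = "A photograph of a ballerina in hip waders"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("ballerina.png")
```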

“With Stable Diffusion it’s very difficult to get out of data coverage because of the sheer number of images it uses. Everything that you’re going to ask it to create often relates to at least one of those five billion images. The images aren’t generated from scratch but are more or less a combination of the five billion images injected into the dataset.”

Dr. Yi has also found that these text-to-image models can be hacked: without any training or supervision, they can be repurposed for tasks such as 3D extraction in real-world scenarios, in a process similar to reverse engineering.

“If you’re able to hack into these models just the right way, without any training or supervision, the models will perform better than models trained by human labels.”

That makes unintentional copyright violation likely. Users have no idea where the components of the generated photos come from. Lawsuits are pending against Stability AI, the company that created Stable Diffusion.

Beyond the ethical and legal quandaries surrounding AI-generated deep fakes are the implications for equity. Dr. Yi’s primary concern is that the stupendous advantage AI provides will be limited to those who can afford the technology. Like ChatGPT, Stable Diffusion is free, for the time being.

“Given that the cat is already out of the bag, it’s very important that these large generative models remain in the public domain,” he says. “No single entity that pursues commercial interests should have control over how AI is used.”

Dr. Yi intends to make 3D reconstruction and reasoning accessible through the algorithms his group is developing, so everybody can have the same level of image editing expertise. All his research is built on generic open-source software libraries, with contributions from all over the world. This is one of the reasons machine learning and deep learning are advancing at such a rate.

“If I were to make my codebase closed source, it wouldn’t advance fast enough to be relevant,” he says.
