Published in AI

Nvidia creates a simple new AI text-to-image method

02 August 2023

Only 100KB in size and four minutes of training time

Nvidia researchers have created a new text-to-image personalisation method called Perfusion.

Unlike the million-dollar, super-heavyweight models out there, Perfusion is just 100KB and takes only four minutes to train.

Perfusion was presented in a research paper from Nvidia and Tel Aviv University in Israel. Despite its small size, in some areas it can outperform leading AI art generators such as Stability AI's Stable Diffusion v1.5, the newly released Stable Diffusion XL (SDXL), and MidJourney in terms of efficiency.

The main new idea in Perfusion is called "Key-Locking."

This works by connecting new concepts a user wants to add, such as a specific cat or chair, to a more general category during image generation. For example, the cat would be linked to the broader idea of a "feline." This helps avoid overfitting, where the model becomes too narrowly tuned to the exact training examples.

Overfitting makes it hard for the AI to generate new creative versions of the concept. By tying the new cat to the general notion of a feline, the model can portray the cat in many different poses, appearances, and surroundings while still retaining its essential "catness."

Key-Locking lets the AI flexibly portray personalised concepts while keeping their core identity. It's like giving an artist the following direction: "Draw my cat Tom as he uses the White House as a litter tray."
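The idea above can be sketched in miniature. This is an illustrative toy, not Nvidia's code: in a diffusion model's cross-attention, each text token produces a key (roughly, where and how the model attends) and a value (what content gets injected). Key-Locking pins the personalised token's key to its superclass key, while a learned value carries the new concept's appearance. All names and numbers below are hypothetical.

```python
# Toy sketch of the Key-Locking idea (illustrative, not Nvidia's implementation).
# The personalised token ("my_cat") reuses the KEY of its superclass ("feline"),
# so attention behaves like the general category, while its own learned VALUE
# injects the specific concept's appearance.

def key_locked_projection(token, keys, values, superclass):
    """Return (key, value) for a token; personalised tokens reuse the
    superclass key but keep their own learned value."""
    personalised = token not in keys  # assume tokens without a stock key are personalised
    key = keys[superclass] if personalised else keys[token]
    value = values.get(token, values[superclass])
    return key, value

# Hypothetical embeddings for illustration only.
keys = {"feline": [1.0, 0.0], "chair": [0.0, 1.0]}
values = {"feline": [0.2, 0.8], "my_cat": [0.9, 0.1]}  # learned value for the new concept

k, v = key_locked_projection("my_cat", keys, values, superclass="feline")
print(k, v)  # key is "feline"'s, value is my_cat's own
```

The point of the sketch is the asymmetry: generalisation comes from the locked key, personalisation from the free value.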

Perfusion also enables multiple personalised concepts to be combined in a single image with natural interactions, unlike existing tools that learn concepts in isolation. Users can guide the image creation process through text prompts, merging concepts like a specific cat and chair.

According to Nvidia, Perfusion offers a remarkable feature that lets users control the balance between visual fidelity (the image) and textual alignment (the prompt) during inference by adjusting a single 100KB model. This capability allows users to easily explore the Pareto front (text similarity vs image similarity) and select the optimal trade-off that suits their specific needs without retraining.
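Conceptually, that runtime knob can be pictured as scaling the learned personalisation update before adding it to the frozen base weights. The sketch below is a hedged illustration of that idea, not Nvidia's API; the weights and update values are made up.

```python
# Illustrative sketch (not Nvidia's code): a single inference-time scalar
# trades prompt adherence against fidelity to the personalised concept.

def blended_weights(base, update, strength):
    """strength=0 -> pure base model (follows the prompt);
    strength=1 -> full personalised update (faithful to the concept)."""
    return [b + strength * u for b, u in zip(base, update)]

base = [0.5, -0.3, 1.0]    # hypothetical frozen base weights
update = [0.1, 0.4, -0.2]  # hypothetical learned 100KB update

# Sweeping the knob traces the text-vs-image similarity trade-off
# without any retraining.
for s in (0.0, 0.5, 1.0):
    print(s, blended_weights(base, update, s))
```

Because the update is tiny and the base model stays frozen, exploring the whole trade-off curve costs nothing beyond re-running inference.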

It's important to note that training a model requires some finesse. Focusing too heavily on reproducing the training images leads the model to produce the same output repeatedly, while forcing it to follow the prompt too closely, with no freedom, usually produces a bad result. The flexibility to tune how closely the generator sticks to the prompt is an essential piece of customisation.
