---
license: apache-2.0
prior:
- kandinsky-community/kandinsky-2-1-prior
tags:
- kandinsky
- image-to-image
duplicated_from: kandinsky-community/kandinsky-2-1
pipeline_tag: image-to-image
---

# Kandinsky 2.1

Kandinsky 2.1 inherits best practices from DALL-E 2 and latent diffusion while introducing some new ideas.

It uses the CLIP model as a text and image encoder, and a diffusion image prior (mapping) between the latent spaces of the CLIP modalities. This approach improves the visual quality of the model's output and opens up new possibilities for blending images and for text-guided image manipulation.

The Kandinsky model was created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov).

## Usage

Kandinsky 2.1 is available in diffusers!

```bash
pip install diffusers transformers
```

### Text to image

```python
from diffusers import KandinskyPipeline, KandinskyPriorPipeline
import torch

# Load the prior, which maps a text prompt to a CLIP image embedding
pipe_prior = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

# Seeded generator for reproducible sampling (the seed value is arbitrary)
generator = torch.Generator(device="cuda").manual_seed(12)

# The prior returns its result in the `image_embeds` field of its output
image_emb = pipe_prior(
    prompt, guidance_scale=1.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
).image_embeds

zero_image_emb = pipe_prior(
    negative_prompt, guidance_scale=1.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
).image_embeds

# Load the text-to-image pipeline, which turns the embeddings into an image
pipe = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.to("cuda")

# Two images are generated per prompt; only the first one is kept
image = pipe(
    prompt,
    image_embeds=image_emb,
    negative_image_embeds=zero_image_emb,
    num_images_per_prompt=2,
    height=768,
    width=768,
    num_inference_steps=100,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("./cheeseburger_monster.png")
```

### Text Guided Image-to-Image Generation

```python
from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline
import torch

from PIL import Image
import requests
from io import BytesIO

# Download the input sketch and resize it
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image.resize((768, 512))

# create prior
pipe_prior = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")

# create img2img pipeline
pipe = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.to("cuda")

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

# Seeded generator for reproducible sampling (the seed value is arbitrary)
generator = torch.Generator(device="cuda").manual_seed(12)

# The prior returns its result in the `image_embeds` field of its output
image_emb = pipe_prior(
    prompt, guidance_scale=4.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
).image_embeds

zero_image_emb = pipe_prior(
    negative_prompt, guidance_scale=4.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
).image_embeds

# `strength` controls how far the result may move away from the input image
out = pipe(
    prompt,
    image=original_image,
    image_embeds=image_emb,
    negative_image_embeds=zero_image_emb,
    height=768,
    width=768,
    num_inference_steps=500,
    strength=0.3,
)

out.images[0].save("fantasy_land.png")
```

### Interpolate

```python
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers.utils import load_image
import torch

pipe_prior = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")

img1 = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
)

img2 = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/starry_night.jpeg"
)

# Conditions to blend (text and images) and one weight per condition
images_texts = ["a cat", img1, img2]
weights = [0.3, 0.3, 0.4]

# Unpack the prior output into the blended and the negative image embeddings
image_emb, zero_image_emb = pipe_prior.interpolate(images_texts, weights).to_tuple()

pipe = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.to("cuda")

# The prompt is empty because all conditioning comes from the blended embedding
image = pipe(
    "", image_embeds=image_emb, negative_image_embeds=zero_image_emb, height=768, width=768, num_inference_steps=150
).images[0]

image.save("starry_cat.png")
```

## Model Architecture

### Overview

Kandinsky 2.1 is a text-conditional diffusion model based on unCLIP and latent diffusion, composed of a transformer-based image prior model, a UNet diffusion model, and a decoder.

The model architectures are illustrated in the figure below: the chart on the left describes the process used to train the image prior model, the figure in the center shows the text-to-image generation process, and the figure on the right shows image interpolation.
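
As a rough illustration of the interpolation idea, the sketch below blends toy embedding vectors with the same weights used in the Interpolate example. It assumes that interpolation amounts to a weighted sum of per-condition CLIP embeddings; the helper name `blend_embeddings` and the 4-dimensional vectors are hypothetical stand-ins, not part of the model or of diffusers.

```python
from typing import List, Sequence


def blend_embeddings(
    embeddings: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Return the weighted sum of equally sized embedding vectors.

    Hypothetical helper: an assumption about how conditions are mixed,
    not the library's implementation.
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    dim = len(embeddings[0])
    if any(len(emb) != dim for emb in embeddings):
        raise ValueError("all embeddings must have the same dimension")

    blended = [0.0] * dim
    for emb, weight in zip(embeddings, weights):
        for i, value in enumerate(emb):
            blended[i] += weight * value
    return blended


# Toy stand-ins for the embeddings of "a cat", the cat photo, and the painting
text_emb = [1.0, 0.0, 0.0, 0.0]
cat_emb = [0.0, 1.0, 0.0, 0.0]
style_emb = [0.0, 0.0, 1.0, 0.0]

mixed = blend_embeddings([text_emb, cat_emb, style_emb], [0.3, 0.3, 0.4])
print(mixed)
```

Under this assumption, a larger weight pulls the blended embedding, and therefore the decoded image, toward the corresponding condition.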