Clip
Like the variational autoencoder (VAE), the vision model CLIP (Contrastive Language-Image Pre-training), first released by OpenAI in 2021, is largely unknown to the general public. Like the VAE, it is used in the image generation pipeline as a component that encodes input into latents. Its role is to transform the input, typically a text prompt, into embeddings: representations that can be operated upon in the latent space.
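As a minimal sketch of this encoding role, the following example turns a prompt into CLIP embeddings. It assumes the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint, neither of which is prescribed by this article; image-generation pipelines typically use their own bundled CLIP text encoder in the same way.

  # Sketch: encode a text prompt into CLIP embeddings (assumes the
  # Hugging Face `transformers` library and a public CLIP checkpoint).
  import torch
  from transformers import CLIPModel, CLIPTokenizer

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

  # Tokenize the prompt, then project it into CLIP's embedding space.
  inputs = tokenizer(["a photograph of an astronaut riding a horse"],
                     padding=True, return_tensors="pt")
  with torch.no_grad():
      text_embeds = model.get_text_features(**inputs)

  print(text_embeds.shape)  # (1, 512) for this checkpoint

The resulting vector is the embedding the rest of the pipeline operates on, for example as the conditioning signal that steers a diffusion model toward the prompt.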