Just playing with getting CLIP Guided Diffusion running locally, rather than having to use colab.

Original Colab notebooks by Katherine Crowson:

  • Original 256×256 notebook: Open In Colab

It uses OpenAI’s 256×256 unconditional ImageNet diffusion model.

  • Original 512×512 notebook: Open In Colab

It uses a 512×512 unconditional ImageNet diffusion model, fine-tuned from OpenAI’s 512×512 class-conditional ImageNet diffusion model.

Together with CLIP, they connect text prompts with images.

Either the 256 or 512 model can be used here by setting --output_size to either 256 or 512.

Some example images:

“A woman standing in a park”:

“An alien landscape”:

“A painting of a man”:

*images enhanced with Real-ESRGAN

You may also be interested in VQGAN-CLIP

Tested on:
  • Ubuntu 20.04 (Windows untested but should work)
  • Anaconda
  • Nvidia RTX 3090

Typical VRAM requirements:

  • 256 defaults: 10 GB
  • 512 defaults: 18 GB

Set up

This example uses Anaconda to manage virtual Python environments.

Create a new virtual Python environment for CLIP-Guided-Diffusion:

conda create --name cgd python=3.9
conda activate cgd

Download and change directory:

git clone
cd CLIP-Guided-Diffusion

Run the setup file:


Or if you want to run the commands manually:

# Install dependencies

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f
git clone
git clone
pip install -e ./CLIP
pip install -e ./guided-diffusion
pip install lpips matplotlib

# Download the diffusion models

curl -OL --http1.1 ''
curl -OL ''


The simplest way to run is just to pass in your text prompt. For example:

python -p "A painting of an apple"

Multiple prompts

Text and image prompts can be split using the pipe symbol in order to allow multiple prompts. You can also use a colon followed by a number to set a weight for that prompt. For example:

python -p "A painting of an apple:1.5|a surreal painting of a weird apple:0.5"
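The pipe/colon syntax can be sketched in a few lines of plain Python (an illustrative parser, not the repo’s actual code; the function names are assumptions):

```python
def parse_prompt(prompt):
    # Split "text:weight" into (text, weight); weight defaults to 1.0.
    # rsplit keeps any earlier colons inside the prompt text itself.
    if ":" in prompt:
        text, weight = prompt.rsplit(":", 1)
        return text.strip(), float(weight)
    return prompt.strip(), 1.0

def parse_prompts(arg):
    # Multiple prompts are separated by the pipe symbol.
    return [parse_prompt(p) for p in arg.split("|")]

print(parse_prompts("A painting of an apple:1.5|a surreal painting of a weird apple:0.5"))
```

Each prompt then contributes to the CLIP loss in proportion to its weight.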

Other options

There are a variety of other options to play with. Use the -h option to display them:

python -h

usage: [-h] [-p PROMPTS] [-ip IMAGE_PROMPTS] [-ii INIT_IMAGE]
[-tvs TV_SCALE] [-rgs RANGE_SCALE] [-os IMAGE_SIZE] [-s SEED] [-o OUTPUT] [-nfp] [-pl]


  • ‘skip_timesteps’ needs to be between approx. 200 and 500 when using an init image.
  • ‘init_scale’ enhances the effect of the init image, a good value is 1000.
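As a rough sketch of what ‘init_scale’ does, the sampler adds a loss term pulling the generated image back toward the init image (a simplified stand-in: this example uses plain MSE, whereas the notebooks install lpips and use a perceptual distance; the function name is illustrative):

```python
import numpy as np

def init_image_loss(sample, init, init_scale=1000):
    # Extra loss term pulling the generated sample back toward the init
    # image; a larger init_scale gives the init image more influence.
    return init_scale * np.mean((sample - init) ** 2)

init = np.zeros((4, 4))
sample = np.full((4, 4), 0.5)
print(init_image_loss(sample, init))  # 1000 * 0.25 = 250.0
```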


The number of timesteps, or one of ddim25, ddim50, ddim150, ddim250, ddim500, ddim1000. The value must divide evenly into diffusion_steps.
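That divisibility constraint can be checked with a small helper (illustrative only, not part of the repo; 1000 is assumed as the default diffusion_steps):

```python
def valid_timesteps(spec, diffusion_steps=1000):
    # spec is either a plain number of timesteps ("200") or a DDIM
    # spacing string ("ddim25"); either way the step count must divide
    # diffusion_steps evenly.
    n = int(spec[4:]) if spec.startswith("ddim") else int(spec)
    return diffusion_steps % n == 0

print(valid_timesteps("ddim25"))  # True: 1000 / 25 = 40
print(valid_timesteps("300"))     # False: 300 does not divide 1000
```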

Image guidance

  • ‘clip_guidance_scale’ Controls how much the image should look like the prompt.
  • ‘tv_scale’ Controls the smoothness of the final output.
  • ‘range_scale’ Controls how far out of range RGB values are allowed to be.
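To make the last two knobs concrete, here is a minimal NumPy sketch of the quantities they scale (illustrative only; the repo computes these as losses on PyTorch tensors):

```python
import numpy as np

def tv(img):
    # Total variation: sum of absolute differences between neighbouring
    # pixels. Smoother images score lower, so weighting this loss more
    # heavily (tv_scale) pushes the output toward smoothness.
    return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()

def out_of_range(img):
    # How far pixel values stray outside the valid [0, 1] range;
    # weighting this (range_scale) keeps RGB values in range.
    return np.abs(img - np.clip(img, 0.0, 1.0)).sum()

flat = np.full((8, 8), 0.5)
print(tv(flat))                             # 0.0: perfectly smooth
print(out_of_range(np.array([1.5, -0.5])))  # 1.0: 0.5 over + 0.5 under
```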

Examples using a number of options:

python -p "An amazing fractal" -os=256 -cgs=1000 -tvs=50 -rgs=50 -cuts=16 -cutb=4 -t=200 -se=200 -m=ViT-B/32 -o=my_fractal.png

python -p "An impressionist painting of a cat:1.75|trending on artstation:0.25" -cgs=500 -tvs=55 -rgs=50 -cuts=16 -cutb=2 -t=100 -ds=2000 -m=ViT-B/32 -pl -o=cat_100.png

(Funny looking cat, but hey!)

Other repos

You may also be interested in

For upscaling images, try Real-ESRGAN.


@misc{clip2021,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford and Ilya Sutskever and Jong Wook Kim and Gretchen Krueger and Sandhini Agarwal},
    year   = {2021}
}