CleanX is an open-source Python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images. JPEG files can be extracted from DICOM files or used directly.



primary author: Candace Makeda H. Moore

other authors + contributors: Oleg Sivokon, Andrew Murphy

Continuous Integration (CI) status



Developer's Guide

Please refer to the Developer's Guide
for a more detailed explanation.

Developing Using Anaconda's Python

Use Git to check out the project's source, then, in the source
directory run:

conda create -n cleanx
conda activate cleanx
python ./ install_dev

You may have to do this for Python 3.7, Python 3.8 and Python 3.9 if
you need to check that your changes will work in all supported
versions.

Developing Using Standard Python (venv)

Use Git to check out the project's source, then in the source
directory run:

python -m venv .venv
. ./.venv/bin/activate
python ./ install_dev

As with the conda-based setup, you may have to use Python versions
3.7, 3.8 and 3.9 to create three different environments to recreate
our CI process.

Supported Platforms

cleanX package is a pure Python package, but it has many
dependencies on native libraries.  We try to test it on as many
platforms as we can to see if dependencies can be installed there.
Below is the list of platforms that will potentially work.

Wherever standard Python or Anaconda Python is listed as supported,
this means that versions 3.7, 3.8 and 3.9 are supported.  We know for
certain that 3.6 is not supported, and there will be no support for it
in the future.
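
A small guard along these lines can make the supported range explicit at the top of a script.  This is only a sketch: the names SUPPORTED_MINORS and is_supported_python are hypothetical and are not part of cleanX.

```python
import sys

# Minor versions of Python 3 that this README lists as supported.
SUPPORTED_MINORS = (7, 8, 9)


def is_supported_python(version_info=None):
    """Return True if the (major, minor, ...) tuple is a supported version."""
    major, minor = (version_info or sys.version_info)[:2]
    return major == 3 and minor in SUPPORTED_MINORS


if __name__ == "__main__":
    if not is_supported_python():
        print("Warning: this Python version is not in the supported list.")
```

Passing an explicit tuple, for example is_supported_python((3, 6)), allows the check to be exercised without switching interpreters.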

32-bit Intel and ARM

We don't know if either one of these is supported.  There's a good
chance that 32-bit Intel will work.  There's a good chance that ARM
won't.  It's unlikely that support will be added in the future.

AMD64 (x86)

                   Linux        Win          OSX
Python             Supported    Unknown      Unknown
Anaconda Python    Supported    Supported    Supported


Seems to be unsupported at the moment on both Linux and OSX, but it's
likely that support will be added in the future.


Online documentation at

You can also build up-to-date documentation yourself with the
following commands:

python apidoc
python build_sphinx

The documentation will be generated in the ./build/sphinx/html
directory. Documentation is generated automatically as new functions
are added.

Special additional documentation for medical professionals with
limited programming ability is available on the wiki.

To get a high level overview of some of the functionality of the
program you can look at the Jupyter notebooks inside workflow_demo.


  • Setting up a virtual environment is desirable, but not absolutely
    necessary.
  • Activate the environment.

Anaconda Installation

  • use command for conda as below
conda install -c doctormakeda -c conda-forge cleanx

You need to specify both channels because there are some cleanX
dependencies that exist in both the Anaconda main channel and in
conda-forge.

pip installation

  • use pip as below
pip install cleanX
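
After either installation route, one way to confirm that the package can be found by the interpreter is sketched here.  The helper name is_importable is hypothetical, and the import name cleanX is taken from the python -m cleanX commands used in this README.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


print("cleanX importable:", is_importable("cleanX"))
```

This only checks that the module can be located; it does not import it or verify that its native dependencies load.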

Getting Started

We will imagine a very simple scenario, where we need to automate
normalization of the images we have.  We stored the images in
directory /images/to/clean/ and they all have jpg extension.  We
want the cleaned images to be saved in the cleaned directory.

Normalization here means ensuring that the lowest pixel value (the
darkest part of the image) is as dark as possible and that the
lightest part of the image is as light as possible.
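
As an illustration of that idea, here is a minimal min-max contrast stretch written with NumPy.  It is a sketch of one common normalization method, not the implementation used by cleanX, and stretch_contrast is a hypothetical name.

```python
import numpy as np


def stretch_contrast(image):
    """Linearly rescale pixels so the minimum maps to 0 and the maximum to 255."""
    pixels = np.asarray(image, dtype=np.float64)
    low, high = pixels.min(), pixels.max()
    if high == low:
        # A flat image has no contrast to stretch; return all zeros.
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - low) / (high - low) * 255.0
    return np.rint(scaled).astype(np.uint8)


# A dull 2x2 image whose values only span 50..200.
dull = np.array([[50, 100], [150, 200]], dtype=np.uint8)
print(stretch_contrast(dull))
```

For the sample array, the darkest pixel (50) becomes 0 and the lightest (200) becomes 255, with the values in between spread evenly.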

CLI Example

The problem above doesn't require writing any new Python code.  We can
accomplish our task by calling the cleanX command like this:

mkdir cleaned

python -m cleanX images run-pipeline \
    -s Acquire \
    -s Normalize \
    -s "Save(target='cleaned')" \
    -j \
    -r "/images/to/clean/*.jpg"

Let's look at the command's options and arguments:

  • python -m cleanX is Python's command-line option for loading
    the cleanX package.  All command-line arguments that follow this
    part are interpreted by cleanX.
  • images sub-command is used for processing images.
  • run-pipeline sub-command is used to start a Pipeline to process
    the images.
  • -s (repeatable) option specifies a Pipeline Step.  Steps map to
    their class names as found in the cleanX.image_work.steps module.
    If the __init__ function of a step doesn't take any arguments, only
    the class name is necessary.  If, however, it takes arguments, they
    must be given as Python literals, using Python's named-argument
    syntax.
  • -j option instructs cleanX to create a journaling pipeline.
    Journaling pipelines can be restarted from the point where they
    failed or were interrupted.
  • -r option specifies the source for the pipeline.  While, normally,
    we will want to start with the Acquire step, if the pipeline was
    interrupted, we need to tell it where to look for the initial
    source.
Once the command finishes, we should see the cleaned directory filled
with images with the same names they had in the source directory.
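
A quick way to confirm that outcome is to compare file names between the source pattern and the target directory.  The sketch below uses only the standard library; missing_outputs is a hypothetical helper and not part of cleanX.

```python
import glob
import os


def missing_outputs(source_pattern, target_dir):
    """Return source file names that have no same-named file in target_dir."""
    source_names = {os.path.basename(p) for p in glob.glob(source_pattern)}
    if os.path.isdir(target_dir):
        target_names = set(os.listdir(target_dir))
    else:
        target_names = set()
    return sorted(source_names - target_names)


if __name__ == "__main__":
    # Paths match the CLI example; an empty list means nothing is missing.
    print(missing_outputs("/images/to/clean/*.jpg", "cleaned"))
```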

Let's consider another simple task: batch-extraction of images from
DICOM files:

mkdir extracted

python -m cleanX dicom extract \
    -i dir /path/to/dicoms/ \
    -o extracted

This calls the cleanX CLI in a way similar to the example above;
however, it calls the dicom sub-command with the extract sub-command.

  • -i tells cleanX to look for a directory named /path/to/dicoms
  • -o tells cleanX to save extracted JPGs in the extracted directory.

If you have any problems with this, check
#40 and open an
issue or start a discussion.

Coding Example

Below is the equivalent code in Python:

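import os

from cleanX.image_work import (
    Acquire, GlobSource, Normalize, Save, create_pipeline,
)

dst = 'cleaned'
os.makedirs(dst, exist_ok=True)  # same role as `mkdir cleaned` in the CLI example

src = GlobSource('/images/to/clean/*.jpg')
# Assumed call shape (these lines were lost from this copy): check the API docs.
steps = (Acquire(), Normalize(), Save(target=dst))
p = create_pipeline(steps=steps, journal=True)
p.process(src)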


Let's look at what's going on here.  As before, we've created a
pipeline using create_pipeline with three steps: Acquire,
Normalize and Save.  There are several kinds of sources available
for pipelines.  We use GlobSource to match our CLI example.
We specify journal=True to match the -j flag in our CLI
example.
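
The idea behind journaling can be illustrated with a short, self-contained sketch: progress is written to a file after each item, so a restarted run skips what was already finished.  This is a conceptual illustration under assumed names (run_with_journal, a JSON journal file), not a description of how cleanX implements its journal.

```python
import json
import os


def run_with_journal(items, process, journal_path):
    """Apply process() to each item, skipping items recorded in the journal."""
    done = set()
    if os.path.exists(journal_path):
        with open(journal_path) as journal:
            done = set(json.load(journal))
    for item in items:
        if item in done:
            continue  # already finished in an earlier run
        process(item)
        done.add(item)
        # Persist progress after every item so an interruption loses at most one.
        with open(journal_path, "w") as journal:
            json.dump(sorted(done), journal)
    return done
```

If process() raises partway through, calling run_with_journal again with the same journal path resumes at the first unfinished item.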

And for the DICOM extraction we might use similar code:

import os

from cleanX.dicom_processing import DicomReader, DirectorySource

dst = 'extracted'

reader = DicomReader()
reader.rip_out_jpgs(DirectorySource('/path/to/dicoms/', 'file'), dst)

This will look for files with the dcm extension in
/path/to/dicoms/ and try to extract images found in those files,
saving them in the extracted directory.
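
Before running an extraction, it can help to preview which files would be picked up.  The sketch below only lists .dcm files and pairs each with a proposed .jpg path; it does not read DICOM data, and both the helper name plan_extraction and the output naming scheme are assumptions rather than cleanX behavior.

```python
from pathlib import Path


def plan_extraction(source_dir, target_dir):
    """Pair every .dcm file in source_dir with a .jpg path in target_dir."""
    target = Path(target_dir)
    return [
        (dcm, target / (dcm.stem + ".jpg"))
        for dcm in sorted(Path(source_dir).glob("*.dcm"))
    ]


if __name__ == "__main__":
    for dcm, jpg in plan_extraction("/path/to/dicoms/", "extracted"):
        print(dcm, "->", jpg)
```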

About using this library

If you use the library, please credit me and my collaborators.  You
are only free to use this library according to its license. We hope
that if you use the library you will open source your entire code
base, and send us modifications.  If you have a legitimate reason to
use my library without open-sourcing your code base, or without
following other conditions, you can get in touch with me by starting a
discussion, and I can make a different license specifically for you.

We are adding new functions and classes all the time. Many unit tests
are available in the test folder. Test coverage is currently
partial. Some newly added functions allow for rapid automated data
augmentation (in ways that are realistic for radiological data). Some
other classes and functions are for cleaning datasets including ones
that:

  • Get image and metadata out of dcm (DICOM) files into jpeg and csv
    files
  • Process datasets from csv or json or other formats to generate
  • Run on dataframes to make sure there is no image leakage
  • Run on a dataframe to look for demographic or other biases in
  • Crop off excessive black frames (run this on single images) one at a
    time
  • Run on a list to make a prototype tiny Xray others can be compared
    to
  • Run on image files which are inside a folder to check if they are
  • Take a dataframe with image names and return plotted (visualized)
    images
  • Run to make a dataframe of pics in a folder (assuming they all have
    the same 'label'/diagnosis)
  • Normalize images in terms of pixel values (multiple methods)
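
To give a flavor of what a leakage check involves, here is a minimal standard-library sketch that flags test images whose bytes are identical to a training image.  It works on plain lists of file paths rather than dataframes, and find_leaked_images is a hypothetical name, so treat it as an illustration of the idea and not as the cleanX function.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_leaked_images(train_paths, test_paths):
    """Return test paths whose content is byte-identical to a training image."""
    train_digests = {file_digest(p) for p in train_paths}
    return [p for p in test_paths if file_digest(p) in train_digests]
```

A hash comparison only catches exact copies; resized or re-encoded duplicates would need an image-similarity method instead.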

All important functions are documented in the online documentation for
programmers. You can also check out one of our videos by clicking the
linked picture below:
