torchdata
torchdata is a PyTorch-oriented library focused on data processing and input pipelines in general.
It extends torch.utils.data.Dataset and equips it with functionality known from tensorflow.data, like map or cache (with some additions unavailable in the latter).
All of that comes with minimal interference in original PyTorch datasets: a single call to super().__init__().
Functionalities overview:
- Use map, apply, reduce or filter
- cache data in RAM, on disk, or with your own method (even partially, say the first 20%)
- Full support for PyTorch's Dataset and IterableDataset (including torchvision)
- General torchdata.maps like Flatten or Select
- Extensible interface (your own cache methods, cache modifiers, maps etc.)
- Concrete torchdata.datasets designed for file reading and other general tasks
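To make the map and cache items concrete, here is a minimal pure-Python sketch of what chaining the two on a map-style dataset amounts to. It is an illustration only, not torchdata's implementation: the class names Base, Source, Mapped and Cached and the reads counter are hypothetical and exist only for this example.

```python
from typing import Any, Callable, Dict, Iterable


class Base:
    """Hypothetical base class: chaining helpers shared by every wrapper."""

    def map(self, function: Callable[[Any], Any]) -> "Mapped":
        return Mapped(self, function)

    def cache(self) -> "Cached":
        return Cached(self)


class Source(Base):
    """Holds the raw samples and counts how often they are actually read."""

    def __init__(self, samples: Iterable[Any]) -> None:
        self.samples = list(samples)
        self.reads = 0

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Any:
        self.reads += 1
        return self.samples[index]


class Mapped(Base):
    """Applies a function lazily, one sample at a time."""

    def __init__(self, dataset: Base, function: Callable[[Any], Any]) -> None:
        self.dataset = dataset
        self.function = function

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> Any:
        return self.function(self.dataset[index])


class Cached(Base):
    """Stores every sample in a dict the first time it is requested."""

    def __init__(self, dataset: Base) -> None:
        self.dataset = dataset
        self.store: Dict[int, Any] = {}

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> Any:
        if index not in self.store:
            self.store[index] = self.dataset[index]
        return self.store[index]


source = Source([1, 2, 3])
dataset = source.map(lambda x: x * 10).cache()

first_pass = [dataset[i] for i in range(len(dataset))]
second_pass = [dataset[i] for i in range(len(dataset))]  # served from the cache
```

In this sketch the second pass never touches the source, so source.reads stays at one read per index. This is the behaviour a cache step placed after an expensive map is meant to provide.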
Quick examples
- Create an image dataset, convert it to Tensors, cache it and concatenate it with smoothed labels:
import pathlib
import torchdata
import torchvision
from PIL import Image

class Images(torchdata.Dataset):  # Different inheritance
    def __init__(self, path: str):
        super().__init__()  # This is the only change
        self.files = [file for file in pathlib.Path(path).glob("*")]

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)

images = Images("./data").map(torchvision.transforms.ToTensor()).cache()
You can concatenate the above dataset with another one (say, labels) and iterate over them as usual:
for data, label in images | labels:
    ...  # Do whatever you want with your data
- Cache the first 1000 samples in memory and save the rest on disk in the folder ./cache:
images = (
    ImageDataset.from_folder("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples in memory
    .cache(torchdata.modifiers.UpToIndex(1000, torchdata.cachers.Memory()))
    # Samples from index 1000 to the end saved with Pickle on disk
    .cache(torchdata.modifiers.FromIndex(1000, torchdata.cachers.Pickle("./cache")))
    # You can define your own cachers, modifiers, see docs
)
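The idea of splitting a cache by index can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not how torchdata implements its cachers or modifiers: MemoryStore, PickleStore, IndexRange, fetch and expensive_read are hypothetical names, and a temporary directory stands in for ./cache.

```python
import pathlib
import pickle
import tempfile
from typing import Any, Callable, Dict, List


class MemoryStore:
    """Hypothetical cacher keeping samples in a dict."""

    def __init__(self) -> None:
        self.data: Dict[int, Any] = {}

    def __contains__(self, index: int) -> bool:
        return index in self.data

    def load(self, index: int) -> Any:
        return self.data[index]

    def save(self, index: int, sample: Any) -> None:
        self.data[index] = sample


class PickleStore:
    """Hypothetical cacher writing one pickle file per sample."""

    def __init__(self, folder: str) -> None:
        self.folder = pathlib.Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)

    def _path(self, index: int) -> pathlib.Path:
        return self.folder / f"{index}.pkl"

    def __contains__(self, index: int) -> bool:
        return self._path(index).is_file()

    def load(self, index: int) -> Any:
        with self._path(index).open("rb") as handle:
            return pickle.load(handle)

    def save(self, index: int, sample: Any) -> None:
        with self._path(index).open("wb") as handle:
            pickle.dump(sample, handle)


class IndexRange:
    """Hypothetical modifier: route only indices in [start, stop) to store."""

    def __init__(self, start: int, stop: int, store: Any) -> None:
        self.start, self.stop, self.store = start, stop, store

    def handles(self, index: int) -> bool:
        return self.start <= index < self.stop


def fetch(index: int, read: Callable[[int], Any], routes: List[IndexRange]) -> Any:
    """Return a sample, filling the first cache whose range covers the index."""
    for route in routes:
        if route.handles(index):
            if index not in route.store:
                route.store.save(index, read(index))
            return route.store.load(index)
    return read(index)  # no cache covers this index


reads: List[int] = []


def expensive_read(index: int) -> int:
    reads.append(index)  # record every real read
    return index * index


with tempfile.TemporaryDirectory() as folder:
    memory, disk = MemoryStore(), PickleStore(folder)
    routes = [IndexRange(0, 2, memory), IndexRange(2, 5, disk)]
    for _ in range(2):  # the second pass is served from the caches
        results = [fetch(i, expensive_read, routes) for i in range(5)]
    on_disk = sorted(path.name for path in pathlib.Path(folder).iterdir())
```

Here the first two indices end up in memory and the remaining three on disk, mirroring the split at index 1000 in the example above on a much smaller scale.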
To see what else you can do, please check the torchdata documentation.
Installation
pip
Latest release:
pip install --user torchdata
Nightly:
pip install --user torchdata-nightly
Docker
CPU standalone and various versions of GPU-enabled images are available at dockerhub.
For CPU quickstart, issue:
docker pull szymonmaszke/torchdata:18.04
Nightly builds are also available; just prefix the tag with nightly_. If you are going for a GPU image, make sure you have nvidia/docker installed and its runtime set.