Introducing Imgup Uptools

Elby
8 min read · Oct 11, 2020

A large portion of my machine learning projects to date have been centred around image classification, which involves many tedious and cumbersome preprocessing steps. One day, I decided to make a small function to automate one of these steps. This small function quickly snowballed into a family of functions, and eventually into an entire package: Imgup!

So welcome to Imgup, a simple library that makes image data processing quick and easy. Imgup is primarily geared towards image classification but can be applied to many other types of projects. The library is split into two main modules: Uptools and Augtools. This post will serve as an introduction to the basic functionality of Uptools.

Photo by Markus Spiske on Unsplash

Processing Image Datasets with Uptools

Uptools helps to streamline various image data preprocessing tasks, such as:

  • Reading images from a local disk
  • Rescaling images
  • Normalizing and standardizing pixel values
  • Converting image datasets into numpy-arrays
  • One-hot-encoding label data
  • Splitting image datasets into training, validation, and testing subsets
  • Saving numpy-arrays as images in class subdirectories

To begin, ensure Imgup is installed by running pip install imgup. From Imgup, we can then import Uptools:

from imgup import uptools

The key to most of Uptools’ functionality is the Image_Dataset class which represents image datasets. The class comes with several attributes and methods that can be used to complete a variety of image processing tasks in a single line of code.

Reading Image Data from Disk

For the purposes of this demonstration, we will be working from the GitHub repo for Imgup, which contains a few collections of images: array_data/ contains data in the form of X (image) and y (label) numpy-arrays, imgs_a/ contains raw images organized into subdirectories by class, and imgs_b/ contains unorganized raw images.

For the simplest case, let us create an image dataset from the images in imgs_a/. We can do this by creating an instance of the Image_Dataset class and passing the base path of the images as an argument:

my_dataset = uptools.Image_Dataset("imgs_a")

The object we have just created is an image dataset composed of X_data (image) and y_data (one-hot-encoded label) arrays. We can access these two arrays through the X_data and y_data attributes of my_dataset:

X_data = my_dataset.X_data
y_data = my_dataset.y_data

Or by using the generate method:

X_data, y_data = my_dataset.generate()

We can investigate the contents of these arrays manually, or we can get some summary information using the following methods:

# total number of images
my_dataset.size
15

# number of distinct classes
my_dataset.class_no
3

# list of distinct classes
my_dataset.class_list
['boat', 'car', 'helicopter']

Or, we can use the details method to print some information:

# general summary info
my_dataset.details()

Image_Dataset initialized.
--------------------------
total_images: 15
images_per_class: {'boat': 5, 'car': 5, 'helicopter': 5}
image_size: various
pixel_values: {'min': 0, 'max': 255}

Or we can even get a pretty visualization.

We can also get a full list of y_data labels:

my_dataset.labels

['boat',
'boat',
'boat',
'boat',
'boat',
'car',
'car',
'car',
'car',
'car',
'helicopter',
'helicopter',
'helicopter',
'helicopter',
'helicopter']

The y_data array itself, however, is automatically one-hot-encoded, as we can see:

y_data position    y_data        label
---------------    ------        -----
y_data[1]          [1. 0. 0.]    boat
y_data[4]          [1. 0. 0.]    boat
y_data[7]          [0. 1. 0.]    car
y_data[10]         [0. 0. 1.]    helicopter
y_data[13]         [0. 0. 1.]    helicopter
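Under the hood, this kind of one-hot encoding can be sketched with plain numpy. This is an illustration only, not Uptools' actual implementation:

```python
import numpy as np

# sorted list of distinct classes; the index of each class
# determines which position in the one-hot vector is set
classes = ['boat', 'car', 'helicopter']
labels = ['boat', 'car', 'helicopter']

# build a one-hot row for each label by indexing an identity matrix
one_hot = np.eye(len(classes))[[classes.index(l) for l in labels]]

print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```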

Also, we can display random batches of images from X_data using the display_batch method (passing the number of rows and columns to display as arguments):

my_dataset.display_batch(3,4)

Processing Images

Because the primary aim of Uptools is to facilitate the preprocessing stage of computer vision projects, the image data contained in an Image_Dataset object will often need to be rescaled and/or normalized. This can be done when initializing the Image_Dataset object by passing the target dimensions to the keyword argument resize and True to the keyword argument normalize, like so:

my_dataset_2 = uptools.Image_Dataset("imgs_a",
                                     resize=[220,220],
                                     normalize=True)

my_dataset_2.details()

Image_Dataset initialized.
--------------------------
total_images: 15
images_per_class: {'boat': 5, 'car': 5, 'helicopter': 5}
image_size: [220, 220]
pixel_values: {'min': 0.0, 'max': 1.0}

We see that for my_dataset_2 the image size is now 220x220 pixels and the pixel values range from 0.0 to 1.0 (compare this to my_dataset above).
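Normalization of this sort typically amounts to dividing the 8-bit pixel values by 255. A minimal numpy sketch of the idea (an illustration, not necessarily how Uptools implements it):

```python
import numpy as np

# a stand-in 8-bit image: values in [0, 255]
img = np.array([[0, 128, 255]], dtype=np.uint8)

# scale to the range [0.0, 1.0]
normalized = img.astype(np.float32) / 255.0

print(normalized.min(), normalized.max())  # 0.0 1.0
```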

It is also likely that we will need to split the dataset into training, validation, and/or testing subsets. We can do this simply by calling the tvt_split method on the Image_Dataset object, and passing as arguments the ratios for each subset. If three numbers are given, the dataset will be split into training, validation, and testing subsets. If only two are given, the dataset will be split into training and testing sets only.

my_dataset_2.tvt_split([0.6,0.2,0.2])

X_train shape: (9, 220, 220, 3)
y_train shape: (9, 3)
X_val shape: (3, 220, 220, 3)
y_val shape: (3, 3)
X_test shape: (3, 220, 220, 3)
y_test shape: (3, 3)

The tvt_split method creates and sets an attribute for each of the subsets, so X_train can be accessed by calling my_dataset_2.X_train, and so on. The method also has in-built random seed setting and subset stratification to preserve class distributions. These can be implemented by passing the seed and stratify keyword arguments, respectively. Additionally, the data in X_val and X_test can be standardized to the mean and standard deviation of the X_train subset using the keyword argument standardize.
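Standardizing the validation (and test) data against the training statistics presumably amounts to something like the following numpy sketch. The arrays here are random stand-ins, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((9, 4))   # stand-in for training images
X_val = rng.random((3, 4))     # stand-in for validation images

# compute statistics on the training set only, then apply
# the same transform to every subset
mean, std = X_train.mean(), X_train.std()
X_train_std = (X_train - mean) / std
X_val_std = (X_val - mean) / std

# the training set is now zero-mean and unit-variance;
# X_val_std is shifted/scaled by the *training* statistics
```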


Saving and Loading Image Datasets

Once the Image_Dataset object has been initialized, it can be saved to disk for later use. There are two possibilities for saving: either in numpy-array form, or as .jpg images in a main directory (with subdirectories for each class).

Saving as numpy-arrays:

To save the data in numpy-array form, we can use the method save_as_np. A number of options are available when saving in this way: the path to the save directory is given by the save_path argument, and the file format (either .npy or .npz) is given by the save_mode argument. If the dataset has been split using the tvt_split method, we can choose to save the subsets separately by specifying save_split=True.

The method will check for a directory at the path given by the save_path argument, creating it if it does not exist. If the data has been split, the arrays are saved in subdirectories named for the subsets (when split three ways, the validation data is saved in the same folder as the training data). If the folders are not empty (i.e. if the data has already been saved), the data will not be overwritten unless the argument overwrite=True is given.

my_dataset_2.save_as_np("my_dataset_2_arrays_non_split","npz")

X_data.npz saved in my_dataset_2_arrays_non_split/
y_data.npz saved in my_dataset_2_arrays_non_split/

my_dataset_2.save_as_np("my_dataset_2_arrays_split","npz",save_split=True)

X_train.npz saved in my_dataset_2_arrays_split/train_data/
X_val.npz saved in my_dataset_2_arrays_split/train_data/
y_train.npz saved in my_dataset_2_arrays_split/train_data/
y_val.npz saved in my_dataset_2_arrays_split/train_data/
X_test.npz saved in my_dataset_2_arrays_split/test_data/
y_test.npz saved in my_dataset_2_arrays_split/test_data/

Saving as images in folders organized by class:

To save the data in image form, we can use the save_as_imgdirs method. Like the save_as_np method, this method will create a directory given by the save_path argument (unless it already exists). Within this directory, a subdirectory called ds_images will be created. Within this subdirectory, a subdirectory will be created for each class and the images belonging to that class will be saved therein. Images can be overwritten by specifying overwrite=True.

my_dataset_2.save_as_imgdirs("my_dataset_2_images")

Loading Image Datasets

Datasets that have been saved to disk (in either numpy or image form) can be loaded simply by initializing a new Image_Dataset object.

If the data is in image form, we create a new image dataset in the same way we created it above:

my_dataset_3 = uptools.Image_Dataset("my_dataset_2_images/ds_images",
                                     normalize=True)

my_dataset_3.details()

Image_Dataset initialized.
--------------------------
total_images: 15
images_per_class: {'boat': 5, 'car': 5, 'helicopter': 5}
image_size: various
pixel_values: {'min': 0.0, 'max': 1.0}

If the data is in numpy form, we can load it by initializing a new Image_Dataset and passing some additional arguments.

my_dataset_4 = uptools.Image_Dataset("array_data",
                                     resize=[220,220],
                                     normalize=True,
                                     from_np=("X_data.npz","y_data.npz"),
                                     np_classes=["panda","horse","duck","squirrel"],
                                     np_prenorm=True)

my_dataset_4.details()

Image_Dataset initialized.
--------------------------
total_images: 24
images_per_class: {'duck': 6, 'horse': 6, 'panda': 6, 'squirrel': 6}
image_size: [220, 220]
pixel_values: {'min': 0.0, 'max': 1.0}

Caution: the np_classes argument assumes that the label data has been one-hot-encoded in alphabetical order. The class list given by np_classes is sorted alphabetically when the Image_Dataset object is initialized, so it can itself be given in any order.
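The effect of that alphabetical sort can be checked with plain Python: whatever order the class list is passed in, position i in a one-hot vector always refers to the i-th class alphabetically.

```python
# np_classes as passed above, in arbitrary order
classes = ["panda", "horse", "duck", "squirrel"]

# Image_Dataset sorts the class list on initialization,
# so the one-hot positions are determined by this order:
print(sorted(classes))
# ['duck', 'horse', 'panda', 'squirrel']
```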

my_dataset_4.display_batch(4,6)

The new dataset can then be saved to disk in either numpy or image form, using save_as_np or save_as_imgdirs, respectively.

Converting Unlabelled Images to Class Subdirectories (With List of Labels)

There might be a case in which we have a collection of images that are not labelled, but for which we have a list of labels prepared (for example, an Excel spreadsheet, or simply a list of labels that corresponds to the images). Using Imgup Uptools, we can turn this collection into an Image_Dataset object paired with the list of labels, and save it to disk in either numpy form or as images in class subdirectories. To do this, we will make use of some of Uptools' module-wide functions.

First, we can generate a Pandas DataFrame from the directory in which the images are stored using the img_to_df function:

df = uptools.img_to_df('imgs_b')

We can then use display_img_df to display the images in order. The images are shown in batches, with the first argument specifying which batch to display, the second specifying the number of images per batch, and the last two being the number of rows and columns of images to display in the batch, respectively.

uptools.display_img_df(df,0,15,3,5)

We can then use the read_img_df function to read and process the images in the DataFrame. We can also save these images to disk by passing save=True as an argument. This will create a new folder (if it does not already exist) called preprocessing, and the images will be saved in numpy-array form in a subdirectory within this folder called read_img_df.

unlabelled_x_array = uptools.read_img_df(df,save=True)

Now suppose we have obtained a list of labels corresponding to each image, like so:

labels_for_ids_4 = ['heli',
'car',
'boat',
'boat',
'boat',
'car',
'heli',
'heli',
'heli',
'heli',
'car',
'car',
'car',
'boat',
'boat']

From these labels, we can obtain a list of unique classes:

classes_for_newly_labelled_dataset = list(set(labels_for_ids_4))

['boat', 'heli', 'car']

We can then use the one_hot_encode function to one-hot encode the list of labels:

ohe_y_data = uptools.one_hot_encode(labels_for_ids_4,classes_for_newly_labelled_dataset,save=True)

Again, we can save the resulting one-hot encoded array as the y_data by passing the save=True argument. This will save in the preprocessing folder in another subdirectory called one_hot_encoding.

We now have our X_data and our y_data saved to disk in numpy-array form, and so we can create a new Image_Dataset object using the numpy-related arguments:

newly_labelled_dataset = uptools.Image_Dataset('preprocessing',
                                               from_np=('read_img_df/X_data_1.npz','one_hot_encoding/y_data_1.npz'),
                                               np_classes=classes_for_newly_labelled_dataset)

newly_labelled_dataset.details()

Image_Dataset initialized.
--------------------------
total_images: 15
images_per_class: {'boat': 5, 'car': 5, 'heli': 5}
image_size: various
pixel_values: {'min': 0, 'max': 255}

Now we can just save our newly_labelled_dataset to disk in image form!

newly_labelled_dataset.save_as_imgdirs('newly_labelled_dataset')

Final Remarks

Of course, Imgup is still in its early stages, but preliminary testing has gone well, and the package will soon be available via pip!

In the next post, I will be introducing the Augtools module, which aims to be a one-stop-shop for all image augmentation needs.
