Imgo: Progress Update

Elby
Oct 31, 2020

Hello from Imgo! This is a blog series about my largest and most complex project to date: a home-made library, distributed through PyPI, whose aim is to streamline and facilitate the data-processing phase of image classification projects. In this series, I hope to document the ups and downs of building a library from scratch.

Photo by Qingbao Meng on Unsplash

Where Were We?

In the previous post, I announced the release of Imgo 1.0.0 and discussed its working state. The library, comprising an image-data processing module, Uptools, and an image augmentation module, Augtools, was discussed in the context of its performance on a demonstration dataset. This dataset (the Boat/Car/Helicopter dataset) is simple and minuscule, and was the set on which unit tests during the development of Imgo were carried out. The library was deemed ready for initial release after successfully passing a number of tests on this dataset.

It was at this point that I began experimenting with a much larger (but still relatively small) dataset: my DeepShroom dataset, comprising 20,000 images of mushrooms across 20 different species classes.

The Augtools module proved effective at generating decent mushroom images to smooth out class imbalances. The Uptools module, however, ran into difficulty with this larger dataset. Further experimentation identified a number of inefficiencies and inadequacies, primarily:

  1. Calculations being distributed across multiple class methods
  2. Calculations being repeated in multiple class methods
  3. Excessive computation time for what should be simple methods
  4. Excessive computation time for, and storage requirements of, the resulting datasets when saved to disk

I am pleased to report that all of these issues have been fixed (to within an acceptable tolerance). In what follows, I shall discuss the reasons behind these issues, and how they have been remedied.

Calculation Inefficiencies

Uptools is largely based around a single class, the Image_Dataset. This class is capable of representing large image datasets as numpy arrays, and allows data processing tasks to be executed simply through various class methods. The underlying principle is that the Image_Dataset object has a sufficiently expressive set of attributes, allowing methods to be called on it without additional calculation being required. The aim, from the beginning, was to front-load the computationally expensive steps into object initialization, in order to allow for quick and efficient dataset manipulation.
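
As a toy illustration of this principle (not Imgo's actual implementation; the class and attribute names here are mine), a front-loaded class might look something like this:

```python
import numpy as np

class FrontLoadedDataset:
    """Toy illustration of the front-loading idea: expensive summary
    statistics are computed once at initialization, so later method
    calls only read cached attributes."""

    def __init__(self, images, labels):
        self.images = np.asarray(images)  # shape (N, H, W, C)
        self.labels = np.asarray(labels)  # shape (N,)
        # The full passes over the data happen here, once.
        self.n_images = self.images.shape[0]
        self.img_shape = self.images.shape[1:]
        self.classes, self.class_counts = np.unique(
            self.labels, return_counts=True
        )

    def details(self):
        # Cheap: everything was precomputed in __init__.
        return {
            "n_images": self.n_images,
            "img_shape": self.img_shape,
            "images_per_class": dict(
                zip(self.classes.tolist(), self.class_counts.tolist())
            ),
        }
```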

After various experiments and iterations, I found that there is inevitably a tradeoff between the class’ usefulness and the possibility of front-loading the intensive computation in this way. Some manipulations, such as splitting into testing and training subsets, will be rather intensive and require a large amount of time to complete, simply owing to the size of the datasets. However, in version 1.0.0, this, as well as other more simple manipulations, was found to be needlessly inefficient because certain processes were either being repeated, shared among various methods, or were simply superfluous.

The most obvious of these was the details method, which was intended to be a simple reference for the Image_Dataset object’s key attributes, such as the number of classes, the number of images per class, the image dimensions, and so on. Also included were the minimum and maximum pixel values of the dataset, used to determine whether or not the image data had been normalized or standardized.

In the first Imgo release, the minimum and maximum pixel values were identified when the details method was called. This meant that the numpy functions np.min and np.max were called on the entire image array, comprising, in the test case, 20,000*200*200*3 pixels. This workflow is far less efficient than calculating the minimum and maximum per image as the image data is being read, and updating the global values as you go. Doing so also moves the computation out of the details method and into the initialization step, in keeping with one of the fundamental principles of the project.
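
The sketch below illustrates this approach (the function and variable names are mine, not Imgo's actual code, and it assumes all images have already been resized to the same dimensions):

```python
import numpy as np
from PIL import Image

def load_images_tracking_range(image_paths):
    """Illustrative sketch: track the global pixel min/max while each
    image is read, instead of calling np.min/np.max on the full
    stacked array afterwards."""
    images = []
    pixel_min, pixel_max = np.inf, -np.inf
    for path in image_paths:
        arr = np.asarray(Image.open(path))
        images.append(arr)
        # Update the running range from this single image only.
        pixel_min = min(pixel_min, arr.min())
        pixel_max = max(pixel_max, arr.max())
    # Assumes all images share the same dimensions.
    return np.stack(images), pixel_min, pixel_max
```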

Normalization and Standardization

In the first Imgo release, the images were normalized on initialization of the Image_Dataset object, and the dataset attributes of the object were immutably cast to the normalized versions. This meant that if a normalized dataset was split into training and testing subsets, the split was performed on float-type data, which takes considerably longer than performing it on int-type data. In fact, it turned out to be faster to de-normalize the data before splitting and re-normalize it after the split than to attempt the manipulation on the normalized data.

This workaround seemed somewhat perverse, and I decided to avoid the issue entirely by creating a duplicate, or "shadow", attribute holding the non-normalized, non-standardized data. This shadow version is the one that is split by the train/val/test split manipulation, and the resulting subsets are then normalized or standardized. Overall, this means that the data can be normalized and split without having to be de-normalized at any point, saving a large amount of computation time.
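
The sketch below shows the general idea (the names are illustrative rather than Imgo's actual attributes, and it borrows scikit-learn's train_test_split purely for demonstration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

class ShadowSplitSketch:
    """Sketch of the 'shadow dataset' idea: keep an untouched integer
    copy of the data, split that copy, and only normalize the
    resulting subsets."""

    def __init__(self, images, labels):
        # Shadow copy: compact uint8 pixel values, never normalized in place.
        self.shadow_X = np.asarray(images, dtype=np.uint8)
        self.y = np.asarray(labels)

    def split_and_normalize(self, test_size=0.2, seed=42):
        # The split operates on the small, integer-typed shadow data...
        X_train, X_test, y_train, y_test = train_test_split(
            self.shadow_X, self.y, test_size=test_size, random_state=seed
        )
        # ...and normalization to [0, 1] floats happens afterwards,
        # so nothing ever needs to be de-normalized.
        return X_train / 255.0, X_test / 255.0, y_train, y_test
```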

Saving To Disk

The implementation of a shadow dataset outlined above also served to remedy the issues surrounding the savedown methods. In the original version, saving normalized or standardized data would take an excessive amount of time and result in inordinately large files: the X_train.npz file for the mushroom dataset took somewhere around 12 minutes to save, and came to a file size of around 18GB.

In the new version, the option to save standardized or normalized data has been removed, and the savedown only takes the shadow dataset into account. Additionally, the save format has been changed to h5 rather than npz or npy. These changes mean that the data can be saved and reloaded in seconds, and simply re-normalized when initialized as a new Image_Dataset object.
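
As an illustration of the approach (not Imgo's actual savedown code), saving and reloading the raw shadow data with h5py might look something like this:

```python
import h5py

def save_shadow_dataset(path, X_train, y_train, X_test, y_test):
    """Sketch: write the raw (uint8) shadow arrays to a single HDF5
    file with gzip compression."""
    with h5py.File(path, "w") as f:
        f.create_dataset("X_train", data=X_train, compression="gzip")
        f.create_dataset("y_train", data=y_train, compression="gzip")
        f.create_dataset("X_test", data=X_test, compression="gzip")
        f.create_dataset("y_test", data=y_test, compression="gzip")

def load_shadow_dataset(path):
    """Sketch: reload the shadow arrays and re-normalize only then,
    when building a new dataset object."""
    with h5py.File(path, "r") as f:
        X_train, y_train = f["X_train"][:], f["y_train"][:]
        X_test, y_test = f["X_test"][:], f["y_test"][:]
    return X_train / 255.0, y_train, X_test / 255.0, y_test
```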

Other Changes

In order to make the module more useful, I felt it was necessary to implement compatibility with pre-split datasets, i.e. to enable the user to create a split Image_Dataset object from a collection of training, validation, and testing subsets. This has been added to the new version of Imgo, along with a merge option in the initialization of the Image_Dataset object, which recombines split subsets into a single dataset.
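
Conceptually, the merge is just a concatenation of the subsets along the sample axis; here is a trivial sketch (with illustrative names, not Imgo's API):

```python
import numpy as np

def merge_subsets(X_train, y_train, X_val, y_val, X_test, y_test):
    """Sketch of the 'merge' idea: recombine pre-split subsets into a
    single dataset, e.g. before re-splitting with different proportions."""
    X = np.concatenate([X_train, X_val, X_test], axis=0)
    y = np.concatenate([y_train, y_val, y_test], axis=0)
    return X, y
```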

These changes have greatly improved the performance of Imgo on large datasets such as the mushroom dataset. The task now is to continue testing the library's functionality ahead of the eventual release of Imgo 2.0.0. Watch this space!
