Imgo: Progress Update 5

Elby
4 min read · Nov 29, 2020

Hello from Imgo! This is a blog series about my largest and most complex project to date: a home-made library, distributed through PyPI, whose aim is to streamline the data processing phase of image classification projects. In this series, I hope to document the ups and downs of building a library from scratch.

Photo by niko photos on Unsplash

Recap

Last week I discussed the dilemma surrounding the generation of additional training images for image classification projects. The problem, in the end, was that generating additional images before splitting the dataset would lead to data leakage, whereby images in the testing and validation sets would be derived from the same source images as those in the training set. Any model trained on such a split would be invalid, as its evaluation scores would be artificially inflated and would not reflect performance on genuinely unseen data.

The obvious solution, therefore, would be to generate images only after splitting the dataset, so that the testing and validation subsets are not contaminated by generated images. The problem with this approach, however, is that there is no simple way to ensure that the classes end up balanced, which was part of the point of generating images in the first place.

Progress Report

After much head scratching, I have finally managed to implement a solution that provides the intended class rebalancing functionality while also avoiding data leakage. Hooray!

This solution is a new method for the Image_Dataset class called split_rebalance. The method takes an imbalanced dataset, splits it according to split ratios given as an argument, and then calls on an Augmenter to generate additional images for each class, to the extent permitted by those ratios. It works as intended and leaves the validation and testing subsets completely untouched. What’s more, the split is performed in a way that ensures an equal distribution of classes across all of the subsets.

For example, suppose we have the following dataset called my_dataset:
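(The class distribution was shown as a chart in the original post. As a stand-in, the sketch below prints a hypothetical distribution; only the size of the smallest class, b, comes from the numbers discussed later in this post, and the other counts are invented for illustration.)

# Hypothetical class counts for my_dataset (illustrative only; the real
# distribution was shown as a chart). Only class b's count comes from the text.
class_counts = {"a": 112, "b": 21, "c": 74, "d": 58}

for name, count in sorted(class_counts.items()):
    print(f"class {name}: {'#' * (count // 4)} ({count})")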

There is a marked class imbalance here. This can be solved by calling the split_rebalance method, like so:

my_dataset.split_rebalance((0.8, 0.1, 0.1),
                           augmenter=augmenter,
                           augment_scale=500)

This splits off a validation and a testing subset, each containing 10% of the images. After splitting, the Augmenter generates additional images for each class so that the training set, once complete, makes up the remaining 80% of the rebalanced dataset:
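(The resulting distribution was also shown as a chart in the original post. The sketch below spells out the end state implied by the numbers in the caveats section: with a smallest class of 21 images and an 80/10/10 split, every class ends up with 40 training, 5 validation, and 5 testing images.)

# Per-class end state implied by the example above (80/10/10 split,
# smallest class of 21 images): 40 training, 5 validation, 5 testing.
per_class = {"training": 40, "validation": 5, "testing": 5}

for subset, n in per_class.items():
    print(f"{subset}: {n} images per class")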

The dataset is now not only balanced, but also split into training, validation, and testing subsets. What’s more, the testing and validation subsets are stratified like the rebalanced training set, so each contains an equal number of images from every class.

We can take a look at some of the images, and note that only the training set contains augmented images.
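(The images themselves appeared as a figure in the original post. For anyone following along at home, a minimal matplotlib sketch along these lines will do; it assumes the three subsets have been pulled out as NumPy image arrays, and the names x_train, x_val, and x_test are placeholders rather than imgo attributes.)

# Minimal sketch for eyeballing a few images from each subset.
# x_train, x_val, and x_test are placeholder arrays of shape (n, h, w, c).
import matplotlib.pyplot as plt

subsets = {"training": x_train, "validation": x_val, "testing": x_test}
fig, axes = plt.subplots(len(subsets), 3, figsize=(9, 9))
for row, (name, images) in zip(axes, subsets.items()):
    for ax, img in zip(row, images[:3]):
        ax.imshow(img)
        ax.set_title(name)
        ax.axis("off")
plt.show()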

Of course, this method comes with some caveats…

The first is the fact that there is a limit to how many images can be generated. Because the dataset is split before rebalancing, in order to preserve an equal number of images from each class, the combined size of the testing and validation subsets (or just the testing subset, if the dataset is split in two rather than three) is limited, per class, to half the size of the smallest class. The maximum size of the training set is therefore determined by the size of the smallest class, together with the split ratios given. In the above example, the splits are 80% training data, 10% validation data, and 10% testing data. The smallest class, b, contained 21 images, so the combined testing and validation subsets are capped at 10 images per class. The maximum training set size (per class) is therefore limited to 10 / 0.2 * 0.8 = 40 images. A different split will result in different sizes; in general, the higher the proportional size of the training data, the greater the number of images that can be generated.
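(To make the arithmetic concrete, here is a back-of-the-envelope version of the calculation above, written out in plain Python; it is just the reasoning spelled out, not imgo's internal code.)

# Per-class size limits for an 80/10/10 split with a smallest class of 21 images.
smallest_class = 21
train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

held_out = smallest_class // 2                       # 10 images for val + test
held_out_fraction = val_ratio + test_ratio           # 0.2
max_train = round(held_out / held_out_fraction * train_ratio)  # 40 images

print(f"validation + testing (per class): {held_out}")
print(f"maximum training size (per class): {max_train}")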

The second limitation is time. The method is rather computationally intensive, and using it on a large dataset can take a very long time. For a dataset of 10,000 images (such as the DeepShroom dataset), it can take around 3 hours to execute on a basic system, depending on the transformation functions used and the image dimensions. But that's life! This kind of operation is only intended to be performed once.

Next Steps

The library is now ready for a more publicised release (version 2.3), and I will be writing up an accompanying introductory post next week.

I believe it is finally time to close the book on Imgo for now, and concentrate on other projects. That said, the library being ready for its intended use case means that I will be able to put it to the test on the DeepShroom project.

Stay tuned!
