Imgo: Progress Update 4

Elby
3 min read · Nov 22, 2020

Hello from Imgo! This is a blog series about my largest and most complex project to date: a home-made library, distributed through PyPI, whose aim is to streamline and facilitate various aspects of the data processing phase of image classification projects. In this series, I hope to document the ups and downs of building a library from scratch.

Photo by Fabrice Villard on Unsplash

Recap

Last week, I discussed the search for optimal parameters for the augmentation of images in order to generate usable training data for the DeepShroom project. The final word concerned a desired feature: a way to apply the augmentations in a specified order, so that the chaotic way in which the transformations interact with one another could be controlled to some degree.

Progress Report

I am pleased to announce that imgo 2.2 is now up and running, and contains the desired additional functionality. In the augment_flow and related methods, there is now an option to specify the augment_type as either random or simple. If simple is chosen, there is an additional optional argument, order, by which the transformation order can be specified. The mechanism under the hood is a bit convoluted; in essence, the transformations are ordered alphabetically by default, and a list of indices can be passed in to sort the alphabetical list as desired.
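To make that concrete, here is a minimal sketch of the sorting mechanic in plain Python. It is illustrative only: the function and transform names below are hypothetical and do not reproduce imgo’s actual internals.

```python
# Illustrative sketch only -- not imgo's actual internals.
# Transforms are keyed by name, ordered alphabetically by default,
# and an optional list of indices into that alphabetical list
# rearranges them into the desired application order.

def resolve_order(transforms, order=None):
    names = sorted(transforms)        # alphabetical default
    if order is None:
        return names
    return [names[i] for i in order]  # reorder via index list

# Hypothetical transform names:
transforms = {"blur": None, "flip": None, "rotate": None, "shear": None}

print(resolve_order(transforms))
# ['blur', 'flip', 'rotate', 'shear']
print(resolve_order(transforms, order=[2, 0, 3, 1]))
# ['rotate', 'blur', 'shear', 'flip']
```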

With this additional functionality, the stochastic nature of the augmentation can be controlled to a higher degree, and therefore the usability of the augtools module is substantially increased.

A New Dilemma

After a week of testing imgo 2.2 for various use cases, I realised with shock and horror that a fundamental issue lies deep within it. Unfortunately, this issue completely invalidates some of the desired functionality. The issue is simple and it is immensely frustrating that I hadn’t noticed it before, alas…

The problem is that using augtools to generate additional training data is only safe if the new images end up exclusively within the training subset. As it stands, there is no way to ensure this, because the augment_flow method only works with unsplit data. This means that, in all likelihood, the validation and testing subsets will be contaminated by augmented images, and therefore there is a serious problem of data leakage. Training a model on such data would render the results untrustworthy.
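The leakage mechanism is easy to demonstrate with a toy example. In the sketch below (plain Python, not imgo code), augmented copies are generated before splitting, and a random split then scatters variants of the same source image across the training and held-out subsets:

```python
import random

random.seed(0)

# Toy "dataset": each item records which source image it came from.
originals = [f"img_{i}" for i in range(10)]

# Augment BEFORE splitting: two variants per original (a toy
# stand-in for real transformations).
augmented = [f"{img}_aug{k}" for img in originals for k in (1, 2)]
pool = originals + augmented
random.shuffle(pool)

train, val, test = pool[:24], pool[24:27], pool[27:]

def source(name):
    return name.split("_aug")[0]

# Leakage: source images whose variants appear both in the training
# subset and in the held-out subsets.
leaked = {source(x) for x in val + test} & {source(x) for x in train}
print(f"sources leaked across subsets: {len(leaked)}")
# Almost certainly > 0 -- variants of the same image straddle splits.
```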

The point of designing augtools so that it could be used to generate training images was twofold: first, to increase the size of the dataset (more data = better performance… mostly); second, to have a way of balancing the classes. This matters for the DeepShroom project because the dataset contains a large variation in class sizes: the smallest class contains roughly 370 images, while the largest contains almost 900. And herein lies the dilemma: if I generate new images before splitting, I can’t avoid data leakage; but if I split before generating new images, I lose the desired class-rebalancing functionality.

Where We Are Now

After a week spent trying to resolve this dilemma, I believe I may have found an answer. As it stands, I have found a way to rebalance the classes with augmented images while ensuring that there is no data leakage into the validation and testing sets. The catch is that the classes can only be balanced to a number of images that is, unfortunately, smaller than the size of the largest classes. This means that some images cannot be used and will have to be dropped. This is, I hate to say it, a mathematical restriction rather than a technical one.
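To see how a ceiling like this can arise, here is a back-of-the-envelope version of the constraint. The split fraction and augmentation factor below are assumptions made purely for illustration, not imgo’s actual parameters: if the hold-out sets are carved off first, and each raw training image can yield only a fixed number of augmented variants, then the smallest class caps the balanced target, and any class whose raw training count exceeds that cap has to drop images.

```python
# Back-of-the-envelope illustration. The split fraction and the
# augmentation factor are ASSUMPTIONS for the sake of the arithmetic,
# not imgo's actual parameters.

class_sizes = {"smallest": 370, "largest": 900}  # from the DeepShroom data
train_frac = 0.8  # assumed 80/10/10 split
k = 1             # assumed: each raw image yields k augmented variants

train_counts = {c: int(n * train_frac) for c, n in class_sizes.items()}

# Each class can grow at most (1 + k)x via augmentation, so the
# smallest class caps the balanced target for every class.
target = min(train_counts.values()) * (1 + k)

for c, n in train_counts.items():
    dropped = max(0, n - target)
    print(f"{c}: raw train={n}, balanced target={target}, dropped={dropped}")
# smallest: raw train=296, balanced target=592, dropped=0
# largest: raw train=720, balanced target=592, dropped=128
```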

That said, it does resolve the dilemma in an acceptable way and results in a balanced dataset with no leakage. Hooray!

The work is not yet finished, however. As it stands, I am unable to ensure that the validation and testing subsets are stratified relative to the rebalanced training set, i.e. that an equal number of images from each class is included in the validation and testing subsets. But I have a plan… Watch this space!
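For context, one standard way to achieve that kind of per-class hold-out (not necessarily the plan hinted at above) is to sample a fixed number of raw images from each class before any augmentation happens. A minimal sketch, with hypothetical names throughout:

```python
import random

def stratified_holdout(class_to_images, n_val, n_test, seed=0):
    """Take an equal number of raw images per class for the validation
    and testing subsets, leaving the rest for training/augmentation."""
    rng = random.Random(seed)
    train, val, test = {}, {}, {}
    for cls, images in class_to_images.items():
        pool = list(images)
        rng.shuffle(pool)
        val[cls] = pool[:n_val]
        test[cls] = pool[n_val:n_val + n_test]
        train[cls] = pool[n_val + n_test:]
    return train, val, test
```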
