GPU image augmentation benchmark
OpenCV + CuPy = GPU acceleration for augmenting images
author
Juan Pulido & Bojan Ilijoski
In the last year, we've seen GPUs used as a computational resource to unlock work that was previously bottlenecked in sectors like gaming, computer vision and blockchain. One of the most common GPU-accelerated, high-complexity workloads is training deep neural networks. But what happens if we use that computational resource for something other than training machine learning models?
Imagine GPUs being used to accelerate the three common stages of every data science project: ETL (extract, transform and load). How much of a speedup could we get by using GPUs instead of CPUs for image augmentation tasks?
Our Goal: “Get better performance using the GPU for image augmentation than using the CPU”
This article illustrates how GPUs can accelerate the transformation stage in techniques like image augmentation. There are many CPU libraries for image augmentation, such as imgaug and torchvision, but we decided to build on the Python library albumentations because of the sheer number of transformations it supports and its performance compared to the others.
After a closer look at the functions and implementation of the albumentations library, we found that it generally relies on two libraries: NumPy and OpenCV. Since our goal was to run those functions on the GPU, we decided to use CuPy as a replacement for NumPy, together with a build of OpenCV that has Compute Unified Device Architecture (CUDA) support.
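As a rough illustration of the NumPy-to-CuPy swap described above, the sketch below writes a flip once and runs it on whichever array module is available. It is not the wrapper's actual code: the helper names (pick_array_module, horizontal_flip, to_host) are hypothetical, and it assumes CuPy is only usable on a machine with a CUDA device, falling back to NumPy everywhere else.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device or driver
except Exception:  # CuPy missing or GPU unavailable: fall back to NumPy
    cp = None


def pick_array_module(use_gpu):
    """Return CuPy when requested and usable, otherwise NumPy."""
    return cp if (use_gpu and cp is not None) else np


def horizontal_flip(image, xp):
    """Reverse the column axis; the same code works for NumPy and CuPy arrays."""
    return xp.ascontiguousarray(image[:, ::-1, ...])


def to_host(array):
    """Copy a CuPy array back to RAM; NumPy arrays pass through unchanged."""
    if cp is not None and isinstance(array, cp.ndarray):
        return cp.asnumpy(array)
    return array


xp = pick_array_module(use_gpu=True)
# A tiny H x W x C test image, moved to the GPU only if CuPy is in use.
image = xp.asarray(np.arange(24, dtype=np.uint8).reshape(2, 4, 3))
flipped = to_host(horizontal_flip(image, xp))
print(flipped.shape)  # (2, 4, 3)
```

Passing the array module in as a parameter keeps each transformation identical on both devices, so only the upload and download steps differ between the CPU and GPU paths.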
Experiment Setup
We are using our wrapper for the following image augmentation functions:
HorizontalFlip, VerticalFlip, RandomTranspose, CLAHE, GaussianFilter, GaussNoise, MedianFilter, MotionFilter, RandomContrast, RandomBrightness, ShiftHSV, ShiftScaleRotate, Resize, Cutout, ElasticTransform, OpticalDistortion and GridDistortion
  • AWS Setup
We ran this benchmark on a G-type EC2 instance (g4dn.xlarge), which has an NVIDIA Tesla T4 GPU and 4 vCPUs.
  • Docker Image
To use OpenCV on the GPU, we built a modified version of the datamachines cuda_tensorflow_opencv image (https://github.com/datamachines/cuda_tensorflow_opencv). This image freed us from concerns over NVIDIA drivers, CUDA installation, and OpenCV compilation.
  • Datasets
For testing the functions mentioned in the experiment setup above, we used four different datasets: ISIC and ImageNet (small images), HDR+ (4K images) and a set of 16K-resolution images.
Experiment Execution
The experiment consisted of comparing equivalent function executions on the CPU and the GPU, using the Albumentations library and our wrapper San Cristobal for OpenCV.
The metric was execution time per image. Every function listed above was executed on the four image datasets to compare its behavior under different conditions, with five repetitions for each transformation.
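The measurement described above can be sketched as a small timing harness. This is a minimal illustration, not the benchmark code itself: the function name time_per_image and its parameters are hypothetical, and averaging the five repetitions is an assumption, since the aggregation method is not stated here. The optional sync callback is there because GPU libraries often launch work asynchronously; with CuPy, for example, one could pass cp.cuda.Stream.null.synchronize so the clock stops only after the device has finished.

```python
import time
from statistics import mean


def time_per_image(transform, images, repetitions=5, sync=None):
    """Return the mean execution time in seconds per image.

    transform   -- callable applied to each image
    images      -- non-empty sequence of images
    repetitions -- number of full passes over the dataset
    sync        -- optional callable that waits for pending GPU work
    """
    if not images:
        raise ValueError("images must not be empty")

    per_image_times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        for image in images:
            transform(image)
        if sync is not None:
            sync()  # make sure asynchronous GPU work is included in the timing
        elapsed = time.perf_counter() - start
        per_image_times.append(elapsed / len(images))
    return mean(per_image_times)


# Usage with a stand-in "transformation" on plain Python lists.
fake_images = [list(range(1000)) for _ in range(100)]
seconds = time_per_image(lambda img: img[::-1], fake_images)
print(f"{seconds:.2e} s per image")
```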
Benchmark Results
Results using CPU in seconds per transformation:
Results using GPU in seconds per transformation:
These plots show the execution time of each image augmentation function for every dataset. Note that the Y-axis uses a logarithmic scale; the unscaled results are in the tables at the beginning of this section.
For the ISIC dataset, 9 of 17 functions got better results on CPU.
For the ImageNet dataset, 8 of 17 functions got better results on CPU.
For the HDR+ dataset, 4 of 17 functions got better results on CPU.
For the 16K dataset, 3 of 17 functions got better results on CPU.
Comparing GPU performance on ImageNet against the 16K-resolution images, we can see that the GPU's advantage keeps growing for most functions as resolution increases (the RandomTranspose function is the exception), meaning it takes less time to apply all of those transformations.
The most remarkable example is the ElasticTransform function, which took almost 0.015 seconds per image using GPU and approximately 142 seconds per image using CPU.
If we had a dataset with 10,000 high-resolution (16K) images, we could save more than 13 days of computing time. That's a huge financial incentive.
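The saving quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses the rounded ElasticTransform timings from this section (about 142 seconds per image on CPU, about 0.015 seconds on GPU); the function name days_saved is only illustrative. With these rounded inputs the estimate comes out somewhat above 13 days, which is consistent with "more than 13 days".

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_saved(n_images, cpu_s_per_image, gpu_s_per_image):
    """Days of compute saved by moving one transformation from CPU to GPU."""
    saved_seconds = n_images * (cpu_s_per_image - gpu_s_per_image)
    return saved_seconds / SECONDS_PER_DAY


# Rounded ElasticTransform timings on 16K images, as quoted in the text.
saved = days_saved(n_images=10_000, cpu_s_per_image=142.0, gpu_s_per_image=0.015)
print(f"{saved:.1f} days")  # 16.4 days
```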
CPU behavior across all the data
GPU behavior across all the data
Even though most functions take longer on high-res images, in 14 out of 17 cases the GPU improved on the CPU.
Main difficulties and ongoing challenges
  • We weren’t able to find large datasets with high-resolution images
  • Lack of GPU Memory
  • Excessive time spent transferring data between GPU memory and RAM for visualization
Conclusion
Based on our work, it's clearly beneficial to use GPU processing to accelerate image augmentation with OpenCV – even if we don’t have one of the latest and most powerful GPUs. Additionally, we can use cloud services like AWS to improve performance and minimize execution time.
For datasets with small images, like ISIC and ImageNet, we can conclude that there is an insignificant difference between CPU and GPU.
More than half of the functions have differences of less than 0.001 seconds, and 30% differ by around 0.01 seconds or less. For the very complex ElasticTransform, there was a difference of 0.2 seconds.
For almost half of the image transformations, the CPU performs better.
For the 4k and 16k datasets, we can see the real performance of the GPU.
For 13 functions we see improvements of 47% to 100%. GaussianFilter performs better on CPU for ImageNet, ISIC and 4k, but the CPU's advantage shrinks with each larger dataset, and for 16k the GPU performs 22% better. MedianFilter performs better on CPU, most likely due to an inefficient implementation, so there is room for improvement. The CPU performs far better for RandomTranspose and is also better for VerticalFlip, which is quite interesting because HorizontalFlip performs better on GPU.
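One plausible way to read these percentages is as the relative reduction in time per image when moving from CPU to GPU. The exact formula behind the figures is not stated here, so the definition below is an assumption, and relative_improvement is a hypothetical helper name. Under that reading, the ElasticTransform timings quoted earlier (about 142 seconds on CPU, about 0.015 seconds on GPU) land at the top of the range.

```python
def relative_improvement(cpu_seconds, gpu_seconds):
    """Percentage of CPU time saved by the GPU (negative if the GPU is slower)."""
    return 100.0 * (cpu_seconds - gpu_seconds) / cpu_seconds


# ElasticTransform on 16K images, using the rounded timings quoted in the text.
elastic = relative_improvement(cpu_seconds=142.0, gpu_seconds=0.015)
print(f"{elastic:.2f}%")  # 99.99%
```

A negative value under this definition would correspond to the cases where the CPU wins, such as RandomTranspose.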
Next Steps
We noticed that some of the transformations were faster using other libraries like CuPy or Numba, or using other implementations. We want to get better results for the VerticalFlip, RandomTranspose and MedianFilter functions, so next we'll measure the performance of the image augmentation functions that were not as good as we expected. After that, we'll try other methods or decorators from Numba, or even implement high-performance functions by inserting CUDA kernels into our Python code.
The transformations could also be parallelized further with a multi-GPU, multi-node Dask cluster.
Notes
These results apply to the configuration described in the Experiment Setup section. If you re-run the benchmarks on a different setup, you will likely get different results.
Juan Pulido
Juan Pulido is a Data Engineer at Loka. He's currently exploring the possibilities of speeding up distributed computations using new GPU-based technologies. Customers know Juan as a curious guy who is always finding ways to use AI to give their startups superpowers. When he's not at his virtual office, you can catch Juan streaming about Korean pop trends or eating cannelloni in the nearest Italian restaurant.
Bojan Ilijoski
Bojan Ilijoski is a Senior Machine Learning and Data Science Engineer at Loka. He is also a PhD student and a teaching and research assistant at Ss. Cyril and Methodius University. Bojan loves exercising his curiosity both mentally and physically. He's a programming and sports lover, pythonista and father, runner and biker who's interested in AI, algorithms, and HPC. When he's offline, Bojan is probably out hiking or immersed in a game of backgammon at the local cafe.