Authors: Juan Pulido & Bojan Ilijoski
In the last year, we've seen that GPUs can be used as a computational resource to unlock work that was previously bottlenecked in sectors like gaming, computer vision, and blockchain.
One of the most common GPU-accelerated, high-complexity workloads is training deep neural networks. But what happens if we use that computational resource for something other than training machine learning models?
Imagine GPUs being used to accelerate the three common stages of every data science project: ETL (extraction, transformation, and load). How much of a speedup could we get by using GPUs instead of CPUs for image augmentation tasks?
Our Goal
"Get better performance for image augmentation using a GPU than using a CPU"
This article will illustrate how GPUs can accelerate the transformation step in techniques like image augmentation. There are many CPU libraries for image augmentation, such as imgaug and torchvision, but we decided to pursue the Python library albumentations because of the sheer number of transformations it can manage, as well as their performance compared to other libraries.
After a closer look at the functions and implementation of the albumentations library, we found that it generally relies on two libraries: NumPy and OpenCV. Since our goal was to enable the use of these functions on the GPU, we decided to use CuPy as a replacement for NumPy, together with a build of OpenCV with Compute Unified Device Architecture (CUDA) support.
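As a minimal sketch of this idea (not the actual wrapper implementation; the function names and parameters below are our own illustration), a random brightness shift can be moved from NumPy to CuPy, and a Gaussian filter onto OpenCV's CUDA module:

```python
import cv2                  # OpenCV built with CUDA support
import cupy as cp           # GPU drop-in replacement for NumPy
import numpy as np


def random_brightness_gpu(image: np.ndarray, limit: float = 0.2) -> np.ndarray:
    """Illustrative brightness shift computed with CuPy instead of NumPy."""
    img = cp.asarray(image, dtype=cp.float32)            # copy host -> device
    shift = np.random.uniform(-limit, limit) * 255.0     # random shift, picked on CPU
    img = cp.clip(img + shift, 0, 255)                   # arithmetic runs on the GPU
    return cp.asnumpy(img).astype(np.uint8)              # copy device -> host


def gaussian_filter_gpu(image: np.ndarray, ksize: int = 5, sigma: float = 1.5) -> np.ndarray:
    """Illustrative Gaussian filter using OpenCV's CUDA module."""
    gpu_img = cv2.cuda_GpuMat()
    gpu_img.upload(image)                                 # copy host -> device
    gauss = cv2.cuda.createGaussianFilter(cv2.CV_8UC3, cv2.CV_8UC3, (ksize, ksize), sigma)
    return gauss.apply(gpu_img).download()                # filter on GPU, copy back
```

The upload and download copies at the edges of each function are also the source of much of the transfer overhead discussed later in this article.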
Experiment Setup
We are using our wrapper for the following image augmentation functions: HorizontalFlip, VerticalFlip, RandomTranspose, CLAHE, GaussianFilter, GaussNoise, MedianFilter, MotionFilter, RandomContrast, RandomBrightness, ShiftHSV, ShiftScaleRotate, Resize, Cutout, ElasticTransform, OpticalDistortion, and GridDistortion.
We used a G-type EC2 instance (g4dn.xlarge) to run this benchmark; it has one NVIDIA Tesla T4 GPU and 4 vCPUs.
To test the functions mentioned in the experiment setup above, we used four different datasets: ISIC, ImageNet, HDR+, and a set of 16K-resolution images.
Experiment Execution
The experiment consisted of comparing equivalent function executions on CPU and GPU, using the albumentations library and our OpenCV-based wrapper, San Cristobal. The metric was execution time per image, so every function mentioned above was executed on the four image datasets in order to compare behavior across several environments, with five repetitions for each transformation.
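A minimal sketch of how such a per-image timing loop might look (the dataset loading and the wrapped transforms are placeholders, not the exact benchmark code):

```python
import time
from statistics import mean

REPETITIONS = 5


def seconds_per_image(transform, images):
    """Average execution time per image over several repetitions of a transform."""
    per_image = []
    for img in images:
        runs = []
        for _ in range(REPETITIONS):
            start = time.perf_counter()
            transform(img)                          # CPU (albumentations) or GPU (wrapper) call
            runs.append(time.perf_counter() - start)
        per_image.append(mean(runs))
    return mean(per_image)


# Hypothetical usage, comparing both paths for one augmentation:
# import albumentations as A
# cpu_time = seconds_per_image(lambda im: A.HorizontalFlip(p=1.0)(image=im)["image"], dataset)
# gpu_time = seconds_per_image(horizontal_flip_gpu, dataset)
```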
Benchmark Results
Results using CPU in
seconds per transformation:
Results using GPU in
seconds per transformation:
These are plots of the execution time of each image augmentation function for every dataset; note that the Y-axis is on a logarithmic scale. You can check the unscaled results in the tables at the beginning of this section.
- For the ISIC dataset, 9 of 17 functions got better results on CPU
- For the ImageNet dataset, 8 of 17 functions got better results on CPU
- For the HDR+ dataset, 4 of 17 functions got better results on CPU
- For the 16K dataset, 3 of 17 functions got better results on CPU
Comparing GPU acceleration on ImageNet against the 16K-resolution images, we can see that the GPU advantage keeps increasing for most of the functions as resolution grows (except for the RandomTranspose function), meaning it takes less time to apply all of those transformations. The most remarkable example is the ElasticTransform function, which took almost 0.015 seconds per image on GPU and approximately 142 seconds per image on CPU.
If we had a dataset with 10,000 high-resolution (16K) images, we could save more than 13 days of computing time. That's a huge financial incentive.
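A quick back-of-the-envelope check of that figure, using the per-image ElasticTransform timings quoted above:

```python
cpu_s_per_image = 142.0     # ElasticTransform on CPU, 16K images
gpu_s_per_image = 0.015     # ElasticTransform on GPU, 16K images
n_images = 10_000

saved_days = (cpu_s_per_image - gpu_s_per_image) * n_images / 86_400
print(f"{saved_days:.1f} days saved")   # roughly 16 days, i.e. more than 13 days
```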
CPU behavior across all the data
GPU behavior across all the data
Even though execution time increases for most high-resolution images, in 14 out of 17 cases the GPU still showed an improvement over the CPU.
Main difficulties and
ongoing challenges
- We weren't able to find large datasets with high-resolution images
- Lack of GPU memory
- Excessive time spent uploading and downloading data between GPU memory and RAM for visualization (a small measurement sketch follows this list)
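A minimal sketch of how that host-to-device transfer overhead can be measured with CuPy (the array size and timing approach are illustrative):

```python
import time
import numpy as np
import cupy as cp

image = np.random.randint(0, 256, size=(8640, 15360, 3), dtype=np.uint8)  # one ~16K frame

start = time.perf_counter()
gpu_image = cp.asarray(image)            # RAM -> GPU memory
cp.cuda.Stream.null.synchronize()        # wait until the copy actually finishes
upload_s = time.perf_counter() - start

start = time.perf_counter()
host_copy = cp.asnumpy(gpu_image)        # GPU memory -> RAM (e.g., for visualization)
cp.cuda.Stream.null.synchronize()
download_s = time.perf_counter() - start

print(f"upload: {upload_s:.3f} s, download: {download_s:.3f} s")
```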
Conclusion
Based on our work, it's clearly beneficial to use GPU
processing to accelerate image
augmentation with OpenCV – even if we don’t have one of
the latest and most powerful GPUs.
Additionally, we can use cloud services like AWS to
improve performance and minimize execution time.
For datasets with small images, like ISIC and ImageNet, we can conclude that the difference between CPU and GPU is insignificant. More than half of the functions show differences of less than 0.001 seconds, and about 30% are at or around 0.01 s. For the very complex ElasticTransform, there was a difference of 0.2 s. For almost half of the image transformations, the CPU performs better.
For the 4K and 16K datasets, we can see the real performance of the GPU. For 13 functions we saw improvements of 47% to 100%. GaussianFilter gets better performance on CPU for ImageNet, ISIC, and 4K, but the gap shrinks as resolution grows, and for 16K we get 22% better performance on GPU. MedianFilter performs better on CPU, most likely due to an inefficient GPU implementation, so there is room for improvement. The CPU performs far better for RandomTranspose, and it is also better for VerticalFlip, which is quite interesting because HorizontalFlip performs better on GPU.
Next Steps
We noticed that some of the transformations were faster using other libraries, like CuPy or Numba, or using other implementations. We want to get better results for the VerticalFlip, RandomTranspose, and MedianFilter functions. Next, we'll measure the performance of some of the image augmentation functions that did not perform as well as we expected. After that, we'll try other methods or decorators from Numba, or even implement high-performance functions by writing CUDA kernels inside our Python code.
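As an illustration of that last idea (a hand-written CUDA kernel called from Python via Numba, not our final implementation), a vertical flip could look like this:

```python
import numpy as np
from numba import cuda


@cuda.jit
def vertical_flip_kernel(src, dst):
    # one thread per pixel: copy row y of src into row (height - 1 - y) of dst
    x, y = cuda.grid(2)
    height, width = src.shape[0], src.shape[1]
    if x < width and y < height:
        for c in range(src.shape[2]):
            dst[height - 1 - y, x, c] = src[y, x, c]


def vertical_flip_gpu(image: np.ndarray) -> np.ndarray:
    d_src = cuda.to_device(image)                    # copy host -> device
    d_dst = cuda.device_array_like(d_src)
    threads = (16, 16)
    blocks = ((image.shape[1] + 15) // 16, (image.shape[0] + 15) // 16)
    vertical_flip_kernel[blocks, threads](d_src, d_dst)
    return d_dst.copy_to_host()                      # copy device -> host
```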
There is also the possibility of parallelizing the transformations even further using a Dask cluster with multiple GPUs and multiple nodes.
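A minimal sketch of that direction, assuming the dask_cuda package and a simple CuPy-based placeholder transform (cluster and chunk parameters are illustrative):

```python
import numpy as np
import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def augment_block(block):
    # placeholder GPU transform: a fixed brightness shift computed with CuPy
    gpu = cp.asarray(block, dtype=cp.float32)
    gpu = cp.clip(gpu + 25.0, 0, 255)
    return cp.asnumpy(gpu).astype(np.uint8)


if __name__ == "__main__":
    cluster = LocalCUDACluster()                 # one Dask worker per visible GPU
    client = Client(cluster)

    images = np.random.randint(0, 256, size=(1_000, 512, 512, 3), dtype=np.uint8)
    stack = da.from_array(images, chunks=(100, 512, 512, 3))

    augmented = stack.map_blocks(augment_block, dtype=np.uint8).compute()
```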
Notes
These results were obtained with the configuration mentioned in the Experiment Setup section. If you re-run the benchmarks on a different setup, you may get different results.