Benchmark-Driven GPU Performance Optimization for Medical Imaging, Genomics, and Large-Scale AI Workloads
Abstract
This paper shows how benchmarking guides the optimization of GPU-accelerated workloads in clinical imaging, genomics, and generative AI training. We evaluated High-Performance Linpack (HPL) tuning, memory throughput optimization, NCCL communication optimization, and GPU health validation on multi-GPU clusters. Peak floating-point performance ranged from 12.3 TFLOPS to 34.7 TFLOPS across the GPUs tested. Memory optimizations increased effective bandwidth by 1.8-2.2x. In distributed AI workloads, NCCL optimization cut communication latency by 35-42%, and memory virtualization made it possible to train large models, including VGG-16 (batch size 256), on a 12 GB GPU with only an 18% loss in performance. In medical imaging, GPU acceleration reduced reconstruction time by a factor of 2.1-3.3 with no quality loss. In genomics, microRNA identification ran almost 166x faster than on a CPU. These findings demonstrate that benchmark-driven optimization can shorten time-to-diagnosis and training time and improve cluster utilization in healthcare and AI.