Multiscale Fusion at What Cost? Quantifying Efficiency-Accuracy Trade-offs in Hybrid Models

Thomas Kinyanjui Njoroge, Kelvin Mugoye, Rachael Kibuku

Abstract

Multiscale feature fusion enhances deep vision models but often introduces computational overhead, a cost that remains under-quantified in hybrid CNN-Transformer architectures, especially for edge-based agricultural deployments. This study proposes an adaptive hybrid framework combining MobileNetV2, EfficientNetV2, and Transformers, trained on 76 classes spanning 22 crop diseases using Kaggle and field-sourced images. To address the efficiency-accuracy trade-off, we incorporate Squeeze-and-Excitation (SE) blocks (under 1% parameter increase), gating mechanisms that reduce scale bias and improve small-object detection at marginal FLOPs cost, and hierarchical fusion, which raises FLOPs by 15% but yields diminishing returns on high-resolution data. The model converged strongly (training: 0.9957, validation: 0.9868) and reached 97.97% accuracy on 249 unseen field images. Final metrics (accuracy: 0.992, AUC: 0.999998) surpassed standalone CNNs and Transformers, but only when scale diversity was present. Statistical validation via confidence-variance analysis and Kruskal-Wallis testing (H = 597.40, p = 8.48e-126) showed that the proposed model had the lowest variance (0.000010), confirming stable predictions; most pairwise comparisons were significant at p < 0.05. ANOVA and bootstrapping further confirmed that fusion costs scale non-linearly. We demonstrate Pareto-efficient frontiers along which hybrid models outperform their standalone counterparts only under certain conditions. This work challenges the notion that "more fusion is better" and advocates context-aware fusion: fusion is viable for cloud/server systems but must be pruned for edge deployment. We offer design guidelines for building cost-efficient, high-accuracy vision models in resource-constrained agricultural environments.
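
To make the fusion components named in the abstract concrete, below is a minimal Keras sketch of an SE block and a gated two-scale fusion head. The feature-map shapes, channel widths, reduction ratio of 16, and fusion point are illustrative assumptions, not the authors' released implementation; the full MobileNetV2/EfficientNetV2/Transformer hybrid and the hierarchical fusion stage are omitted for brevity.

import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-Excitation: recalibrates channels with a small bottleneck MLP.
    For typical channel counts this adds well under 1% of backbone parameters."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # squeeze to (batch, C)
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)   # per-channel excitation gates
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                      # reweight the feature map

def gated_fusion(fine, coarse, channels=256):
    """Learned gate that blends a fine-grained and a coarse stream,
    letting the network damp scale bias instead of simply concatenating."""
    a = layers.Conv2D(channels, 1, padding="same")(fine)
    b = layers.Conv2D(channels, 1, padding="same")(coarse)
    b = layers.Resizing(a.shape[1], a.shape[2])(b)        # match spatial resolution
    gate = layers.Conv2D(channels, 1, activation="sigmoid")(
        layers.Concatenate()([a, b]))
    inv_gate = layers.Lambda(lambda g: 1.0 - g)(gate)
    return layers.Add()([layers.Multiply()([gate, a]),
                         layers.Multiply()([inv_gate, b])])

# Hypothetical feature maps standing in for shallow and deep backbone stages.
fine_in = layers.Input(shape=(56, 56, 96))
coarse_in = layers.Input(shape=(14, 14, 320))
fused = gated_fusion(se_block(fine_in), se_block(coarse_in))
head = layers.GlobalAveragePooling2D()(fused)
outputs = layers.Dense(76, activation="softmax")(head)    # 76 classes, as in the abstract
model = tf.keras.Model([fine_in, coarse_in], outputs)
model.summary()

Because the gate is a single 1x1 convolution over the concatenated streams, its FLOPs cost is marginal relative to the backbones, which is the efficiency-accuracy trade-off the abstract highlights for edge deployment.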
