Advanced computer architecture optimization for machine learning/deep learning

Authors

  • Shefqet Meda
  • Ervin Domazet

DOI:

https://doi.org/10.59380/crj.vi5.5108

Abstract

The recent progress in Machine Learning (Géron, 2022) and particularly Deep Learning (Goodfellow, 2016) models has exposed the limitations of traditional computer architectures. Modern algorithms place sharply increased computational and data demands that most existing architectures cannot handle efficiently. These demands create bottlenecks in training speed, inference latency, and power consumption, which is why advanced methods of computer architecture optimization are required to enable efficient hardware platforms dedicated to ML/DL (Engineers, 2019). Optimizing computer architecture for ML/DL applications has become critical because Neural Networks demand efficient execution of complex computations (Goodfellow, 2016). This paper reviews the numerous approaches and methods used to optimize computer architecture for ML/DL workloads. The following sections discuss hardware-level optimizations, enhancements of traditional software frameworks and their specialized variants, and explorations of novel architectures. In particular, we discuss hardware accelerators that improve the performance and efficiency of a computing system, including multicore CPUs (Hennessy, 2017), GPUs (Hwu, 2015), and TPUs (Contributors, 2017); parallelism in multicore architectures; data movement in hardware systems, especially techniques such as caching; and sparsity, compression, and quantization, together with related techniques and configurations such as specialized data formats.
Moreover, the paper provides a comprehensive analysis of current trends in software frameworks, data movement optimization strategies (A.Bienz, 2021), sparsity, quantization, and compression methods, the use of machine learning for architecture exploration, dynamic voltage and frequency scaling (DVFS) (Hennessy, 2017), and runtime systems, which together offer strategies for maximizing hardware utilization and managing power consumption during training. Finally, the paper discusses directions for future research and the potential influence of computer architecture optimization on various industrial and academic areas of ML/DL technologies. The objective of applying these optimization techniques is to substantially narrow the current gap between the computational needs of ML/DL algorithms and the capabilities of current hardware. This will lead to significant improvements in training times, enable real-time inference for various applications, and ultimately unlock the full potential of cutting-edge machine learning algorithms.
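To make one of the surveyed techniques concrete, the sketch below (ours, not taken from the paper) illustrates symmetric post-training quantization of floating-point weights to the int8 range, the kind of method discussed by Jacob et al. (2018). The function names and the example weight values are our own illustrative choices.

```python
# Illustrative sketch (not the paper's method): symmetric post-training
# quantization of float weights to int8, one of the techniques surveyed above.

def quantize_int8(weights):
    """Map float weights onto the int8 range [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
```

Storing `q` instead of 32-bit floats cuts weight memory roughly fourfold and allows inference kernels to use integer arithmetic; the worst-case rounding error per weight is half the scale.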

Keywords:

Computer Architecture Optimization, Machine Learning, Deep Learning, Parallelism, Sparsity, Data Movement Optimization, Quantization, Compression, Software Framework Optimization, DVFS, TPU, CPU, GPU, TensorFlow, PyTorch


Author Biographies

Shefqet Meda

Department of Electronics & Telecommunications Engineering, Faculty of Engineering, Canadian Institute of Technology, Albania

Ervin Domazet

Department of Computer Engineering, Faculty of Engineering, International Balkan University, Skopje, North Macedonia

References

  1. A. Bienz, L. N. (2021). Modeling Data Movement Performance on Heterogeneous Architectures. IEEE High Performance Extreme Computing Conference (HPEC) (pp. 1-7). Waltham, MA, USA: Institute of Electrical and Electronics Engineers Inc.

  2. Abadi, M. B. (2016). TensorFlow: A System for Large-Scale Machine Learning. 12th USENIX Symposium on Operating Systems Design and Implementation, 265–283.

  3. apache.org. (2024). Apache MXNet: A Flexible and Efficient Library for Deep Learning. Retrieved from https://mxnet.apache.org/versions/1.9.1/

  4. Brandon Reagen, R. A.-Y. (2017). Deep Learning for Computer Architects. In P. U. Margaret Martonosi, Synthesis Lectures on Computer Architecture. Springer Nature Switzerland.

  5. Contributors. (2017, June 26). In-Datacenter Performance Analysis of a Tensor Processing Unit. Retrieved from https://arxiv.org/pdf/1704.04760

  6. eitc.org. (2023, January). CPU vs GPU vs TPU. Retrieved from http://www.eitc.org/research-opportunities/photos1/cpu-vs-gpu-vs-tpu_012023a/image_view_fullscreen

  7. Engineers, I. o. (2019). 25th IEEE International Symposium on High Performance Computer Architecture. IEEE International Symposium on High Performance Computer Architecture, p. 734.

  8. Gao, L. W. (2019). An Overview of Machine Learning in Computer Architecture. Journal of Computer Science and Technology, 709–731.

  9. Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition. O’Reilly Media.

  10. Goodfellow, I. B. (2016). Deep Learning. Cambridge, Massachusetts, London: MIT.

  11. Google. (2017, May 12). An in-depth look at Google’s first Tensor Processing Unit (TPU). Retrieved from https://cloud.google.com/blog/u/1/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

  12. Hendrik Borghorst, O. S. (2019). CyPhOS – A Component-Based Cache-Aware Multi-core Operating System. Architecture of Computing Systems – ARCS 2019 (pp. 171–182). Springer, Cham.

  13. Hennessy, J. L. (2017). Computer Architecture: A Quantitative Approach. Morgan Kaufmann.

  14. Hwu, W.-m. W. (2015). GPU Computing Gems, Emerald Edition. Morgan Kaufmann.

  15. Jacob, B. K. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704-2713.

  16. Joo-Young Kim, B. K.-H. (2022). Processing-in-Memory for AI: From Circuits to Systems. Springer Nature.

  17. Larkin Ridgway Scott, T. C. (2021). Scientific Parallel Computing. Princeton University Press.

  18. Migacz, S. (2024). Performance Tuning Guide. Retrieved from https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

  19. Hennessy, J. L., & Patterson, D. A. (2019). A New Golden Age for Computer Architecture. Communications of the ACM, 48-60.

  20. Ruud Van Der Pas, E. S. (2017). Using OpenMP-The Next Step: Affinity, Accelerators, Tasking, and SIMD (Scientific and Engineering Computation). MIT Press.

  21. Sze, V., Chen, Y.-H., Yang, T.-J., & Emer, J. S. (2017, November 20). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, pp. 2295 - 2329.

  22. TensorFlow. (2024). Theoretical and advanced machine learning with TensorFlow . Retrieved from https://www.tensorflow.org/resources/learn-ml/theoretical-and-advanced-machine-learning

  23. Tony Pourmohamad, H. K. (2021). Bayesian Optimization with Application to Computer Experiments. Springer.

  24. Vinh Nguyen, T. G. (2020, June 18). Optimizing the Deep Learning Recommendation Model on NVIDIA GPUs. Retrieved from https://developer.nvidia.com/blog/optimizing-dlrm-on-nvidia-gpus/

  25. Wijtvliet, M. W. (2019). Accelerating Machine Learning Workloads with OpenCL on FPGAs. IEEE International Conference on Cluster Computing.

  26. Zoran Jakšić, N. C. (2020). A highly parameterizable framework for Conditional Restricted Boltzmann Machine based workloads accelerated with FPGAs and OpenCL. Elsevier, 201-211.

Published

2024-07-31

How to Cite

Meda, S., & Domazet, E. (2024). Advanced computer architecture optimization for machine learning/deep learning. CRJ, (1), 28–41. https://doi.org/10.59380/crj.vi5.5108

Issue

Section

Articles