Advanced computer architecture optimization for machine learning/deep learning
DOI: https://doi.org/10.59380/crj.vi5.5108

Abstract
Recent progress in Machine Learning (Géron, 2022), and particularly in Deep Learning (Goodfellow, 2016), has exposed the limitations of traditional computer architectures. Modern algorithms place sharply increased computational and data demands on hardware that most existing architectures cannot satisfy efficiently. These demands create bottlenecks in training speed, inference latency, and power consumption, which is why advanced computer architecture optimization methods are required to build efficient hardware platforms dedicated to ML/DL (IEEE, 2019). Optimizing computer architecture for ML/DL applications has become critical because neural networks (Goodfellow, 2016) demand efficient execution of vast numbers of complex computations. This paper reviews the principal approaches and methods used to optimize computer architecture for ML/DL workloads. The following sections discuss hardware-level optimizations, enhancements to traditional software frameworks and their specialized variants, and explorations of novel architectures. On the hardware side, we cover specialized accelerators that improve the performance and efficiency of a computing system, including multicore CPUs (Hennessy, 2017), GPUs (Hwu, 2015), and TPUs (Jouppi, 2017); parallelism in multicore architectures; data movement in hardware systems, especially caching; and sparsity, compression, and quantization, together with related configurations such as specialized data formats. The paper also analyzes current trends in software frameworks, data movement optimization strategies (Bienz, 2021), sparsity, quantization, and compression methods, the use of machine learning for architecture exploration, runtime systems, and dynamic voltage and frequency scaling (DVFS) (Hennessy, 2017), which balances hardware utilization against power consumption during training. Finally, the paper discusses directions for future research and the potential influence of computer architecture optimization on industrial and academic applications of ML/DL technologies. The objective of these optimization techniques is to narrow the gap between the computational needs of ML/DL algorithms and the capabilities of current hardware, leading to significant improvements in training times, enabling real-time inference for a wide range of applications, and ultimately unlocking the full potential of cutting-edge machine learning algorithms.

Keywords:
Computer Architecture Optimization, Machine Learning, Deep Learning, Parallelism, Sparsity, Data Movement Optimization, Quantization, Compression, Software Framework Optimization, DVFS, TPU, CPU, GPU, TensorFlow, PyTorch
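To make the surveyed techniques concrete, a minimal sketch of post-training affine quantization of a weight tensor is shown below, in the spirit of integer-arithmetic-only inference (Jacob, B., et al., 2018). The helper names, bit width, and toy tensor are illustrative assumptions, not code from the cited work.

```python
# Hedged sketch: affine (asymmetric) quantization of a float tensor to uint8.
# Illustrative only; names and values are assumptions, not from the paper.
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map float values to unsigned integers via a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is exactly representable.
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover approximate float values from the quantized tensor."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)  # toy weight matrix
q, scale, zp = quantize_affine(weights)
print("max abs quantization error:", np.abs(weights - dequantize_affine(q, scale, zp)).max())
```

Data movement optimization through caching can likewise be illustrated by loop tiling (cache blocking) of a matrix multiply: the computation is reorganized so that small blocks of the operands are reused while they are resident in cache. The function name and block size below are hypothetical tuning choices, not values from the paper.

```python
# Hedged sketch: cache blocking (loop tiling) for a dense matrix multiply.
import numpy as np

def blocked_matmul(A, B, block=64):
    """Compute C = A @ B block by block so operands stay cache-resident."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, block):          # block rows of C
        for j in range(0, m, block):      # block columns of C
            for p in range(0, k, block):  # shared (reduction) dimension
                C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
print("matches NumPy:", np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3))
```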
References
- Abadi, M., et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 265-283.
- apache.org. (2024). Apache MXNet: A Flexible and Efficient Library for Deep Learning. Retrieved from https://mxnet.apache.org/versions/1.9.1/
- Bienz, A., Olson, L. N., et al. (2021). Modeling Data Movement Performance on Heterogeneous Architectures. IEEE High Performance Extreme Computing Conference (HPEC) (pp. 1-7). Waltham, MA, USA: IEEE.
- Borghorst, H., & Spinczyk, O. (2019). CyPhOS – A Component-Based Cache-Aware Multi-core Operating System. Architecture of Computing Systems – ARCS 2019 (pp. 171-182). Springer, Cham.
- eitc.org. (2023, January). CPU vs GPU vs TPU. Retrieved from http://www.eitc.org/research-opportunities/photos1/cpu-vs-gpu-vs-tpu_012023a/image_view_fullscreen
- Gao, L. W. (2019). An Overview of Machine Learning in Computer Architecture. Journal of Computer Science and Technology, 709-731.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.). O’Reilly Media.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, MA: MIT Press.
- Google. (2017, May 12). An in-depth look at Google’s first Tensor Processing Unit (TPU). Retrieved from https://cloud.google.com/blog/u/1/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
- Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach. Morgan Kaufmann.
- Hennessy, J. L., & Patterson, D. A. (2019). A New Golden Age for Computer Architecture. Communications of the ACM, 48-60.
- Hwu, W.-m. W. (2015). GPU Computing Gems, Emerald Edition.
- IEEE. (2019). 25th IEEE International Symposium on High Performance Computer Architecture (HPCA), p. 734.
- Jacob, B., et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704-2713.
- Jakšić, Z., et al. (2020). A highly parameterizable framework for Conditional Restricted Boltzmann Machine based workloads accelerated with FPGAs and OpenCL. Elsevier, 201-211.
- Jouppi, N. P., et al. (2017, June 26). In-Datacenter Performance Analysis of a Tensor Processing Unit. Retrieved from https://arxiv.org/pdf/1704.04760
- Kim, J.-Y., et al. (2022). Processing-in-Memory for AI: From Circuits to Systems. Springer Nature.
- Migacz, S. (2024). Performance Tuning Guide. Retrieved from https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
- Nguyen, V., et al. (2020, June 18). Optimizing the Deep Learning Recommendation Model on NVIDIA GPUs. Retrieved from https://developer.nvidia.com/blog/optimizing-dlrm-on-nvidia-gpus/
- Pourmohamad, T., & Lee, H. K. H. (2021). Bayesian Optimization with Application to Computer Experiments. Springer.
- Reagen, B., et al. (2017). Deep Learning for Computer Architects. In M. Martonosi (Ed.), Synthesis Lectures on Computer Architecture. Springer Nature Switzerland.
- Scott, L. R., Clark, T., et al. (2021). Scientific Parallel Computing. Princeton University Press.
- Sze, V., Chen, Y.-H., Yang, T.-J., & Emer, J. S. (2017, November 20). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, 2295-2329.
- TensorFlow. (2024). Theoretical and advanced machine learning with TensorFlow. Retrieved from https://www.tensorflow.org/resources/learn-ml/theoretical-and-advanced-machine-learning
- Van der Pas, R., Stotzer, E., & Terboven, C. (2017). Using OpenMP – The Next Step: Affinity, Accelerators, Tasking, and SIMD (Scientific and Engineering Computation). MIT Press.
- Wijtvliet, M. (2019). Accelerating Machine Learning Workloads with OpenCL on FPGAs. IEEE International Conference on Cluster Computing.