ACADSTAFF UGM

CREATION
Title : An Advantage Actor-Critic Deep Reinforcement Learning Method for Power Management in HPC Systems
Author : FITRA RAHMANI K (1), KADEK GEMILANG SANTIYUDA (2), Faizal Makhrus, S.Kom., M.Sc., Ph.D. (4), Muhammad Alfian Amrizal, B.Eng., M.I.S., Ph.D. (5)

Date : August 2023
Keyword : HPC, Power management, Energy consumption, Deep Reinforcement Learning, Advantage actor-critic
Abstract : A primary concern when deploying a High-Performance Computing (HPC) system is its high energy consumption. Typical HPC systems consist of hundreds to thousands of compute nodes that consume a large amount of electrical power even when idle. One way to increase energy efficiency is to apply the backfilling method to the First Come First Serve (FCFS) job scheduler (FCFS+Backfilling). Backfilling allows jobs that arrive later than the first job in the queue to be executed earlier, provided the starting time of the first job is not affected, thereby increasing the throughput and the energy efficiency of the system. Nodes that have been idle for a specific amount of time can also be switched off to further improve energy efficiency. However, switching off nodes based only on their idle time can impair energy efficiency and throughput instead of improving them. For example, new jobs may arrive immediately after nodes are switched off, so the chance to execute those jobs directly via backfilling is missed. This paper proposes a Deep Reinforcement Learning (DRL)-based method to predict the most appropriate time to switch nodes on or off. A DRL agent is trained with the Advantage Actor-Critic algorithm to decide which nodes must be switched on or off at a specific timestep. Our simulation results on the NASA iPSC/860 HPC historical job dataset show that the proposed method can reduce total energy consumption compared to most of the conventional timeout policies, which switch off nodes after they have been idle for some period of time.
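The conventional timeout policy that the abstract uses as a baseline can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `timeout_policy`, the idle times, and the 300-second threshold are hypothetical values chosen only to show the decision rule (switch off any node whose idle time has reached a fixed timeout).

```python
from typing import List


def timeout_policy(idle_seconds: List[float], timeout: float) -> List[int]:
    """Return the indices of nodes whose idle time has reached the timeout.

    Hypothetical baseline: the rule looks only at idle time, so it cannot
    anticipate a job that arrives right after a node is switched off.
    """
    return [i for i, idle in enumerate(idle_seconds) if idle >= timeout]


# Hypothetical idle times (in seconds) for four nodes.
idle = [0, 120, 600, 45]
nodes_to_switch_off = timeout_policy(idle, timeout=300)
print(nodes_to_switch_off)
```

Because the rule depends only on elapsed idle time, it is exactly the kind of policy that can switch a node off just before a backfillable job arrives, which is the weakness the abstract describes.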
Group of Knowledge :
Level : International
Status : Published