Zeroing memory deallocator to reduce checkpoint sizes in virtualized HPC environments
Journal contribution posted on 28.09.2018 by Ramy Gad, Simon Pickartz, Tim Suss, Lars Nagel, Stefan Lankes, Antonello Monti, Andre Brinkmann
Virtualization has become an indispensable tool in data centers and cloud environments for flexibly assigning virtual machines (VMs) to resources. It is also becoming increasingly attractive for high-performance computing (HPC). This is mainly due to the strong isolation of VMs, which enables: (1) the sharing of cluster nodes and optimization of the system’s overall utilization; (2) load balancing by means of migrations, owing to the reduction of residual dependencies; and (3) the creation of system-level checkpoints, which increase fault tolerance in an application-transparent way. On the downside, the additional virtualization layer conceals information that is only available on the process level. This information has a direct influence on the checkpoint size, which should be kept as small as possible. In this paper, we propose a novel technique for checkpoint size reduction in virtualized environments. We exploit the fact that the hypervisor detects zero pages and omits them when capturing a checkpoint. Moreover, compression techniques are applied to reduce the checkpoint size further. We therefore fill freed memory regions with zeros, which supports both the zero-page detection and the compression. We evaluate our approach using HPC applications. The results reveal a reduction of the checkpoint size by up to 9% when compression is disabled in the hypervisor and by up to 49% when compression is enabled. Furthermore, memory zeroing reduces VM migration time by up to 10% when compression is disabled and by up to 60% when compression is enabled.
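The core idea of filling freed memory with zeros can be illustrated with a minimal sketch in C. This is not the authors' implementation, which the abstract does not describe in detail; the names `zalloc`, `zwipe`, `zfree`, and `is_zero` are hypothetical and introduced here only for illustration. The sketch assumes a wrapper around the standard `malloc`/`free` that records each block's size in a small header, so that the payload can be cleared before it is returned to the allocator. `is_zero` stands in for the kind of check a hypervisor might apply to a page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header placed in front of every block so that the
 * payload size is known when the block is freed. */
typedef union {
    size_t size;        /* payload size in bytes */
    max_align_t align;  /* keeps the payload suitably aligned */
} zhdr_t;

/* Calling memset through a volatile function pointer discourages the
 * compiler from removing the store as "dead" because the memory is
 * freed immediately afterwards. */
static void *(*const volatile memset_v)(void *, int, size_t) = memset;

/* Allocate `size` bytes and remember the size in the header. */
void *zalloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(zhdr_t))
        return NULL;
    zhdr_t *h = malloc(sizeof(zhdr_t) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;
}

/* Fill the payload with zeros and return its size. */
size_t zwipe(void *ptr)
{
    assert(ptr != NULL);
    zhdr_t *h = (zhdr_t *)ptr - 1;
    memset_v(ptr, 0, h->size);
    return h->size;
}

/* Zeroing deallocator: clear payload and header, then free the block. */
void zfree(void *ptr)
{
    if (ptr == NULL)
        return;
    zhdr_t *h = (zhdr_t *)ptr - 1;
    zwipe(ptr);
    memset_v(h, 0, sizeof *h);
    free(h);
}

/* Return 1 if all `len` bytes are zero, 0 otherwise. */
int is_zero(const void *buf, size_t len)
{
    const unsigned char *b = buf;
    for (size_t i = 0; i < len; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}
```

A wrapper of this kind only approximates the technique: the underlying allocator may write its own bookkeeping data into a freed block, and whether a zeroed region ends up as a full zero page depends on its size and alignment. Integrating the zeroing into the allocator itself would avoid both the extra header and that limitation.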
This research and development was supported by the Federal Ministry of Education and Research (BMBF) under Grant 01IH13004 (Project FAST) and Grant 01IH16010B (Project Envelope).
- Computer Science