Crash Recovery Improvements in Non-Volatile Main Memory (NVMM)

Technology #34234

Researchers
Yan Solihin, Ph.D.
External Link (www.cs.ucf.edu)
Mohammad Alshboul
James Tuck, Ph.D.
Patent Protection

US Patent Pending
Publications
Lazy Persistency: A High-Performing and Write-Efficient Software Persistency Technique
45th Annual International Symposium on Computer Architecture (ISCA), June 2018 IEEE, DOI: 10.1109/ISCA.2018.00044
Efficient Checkpointing of Loop-Based Codes for Non-Volatile Main Memory
26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017 IEEE, DOI: 10.1109/PACT.2017.58

Key Points

  • Improved methods of recovering data from NVMM using recomputation and lazy persistency
  • Reduces recovery time and overhead and removes the need to keep checkpoints or logs
  • Requires small software changes but no changes to hardware

Abstract

Researchers at the University of Central Florida and North Carolina State University have developed a way to reduce the execution time and write amplification associated with restoring data from non-volatile main memory (NVMM). Current crash recovery solutions use logging or checkpointing to provide failure safety to applications. However, these solutions were designed for systems with volatile main memory and non-volatile disks, not for NVMM-based systems. As a result, they incur much higher execution time and write endurance overheads when applied to NVMM. Other existing technologies require specific hardware or instruction set architecture (ISA) support to recover NVMM-stored data, and such support is not readily available in most machines today.

In comparison, the UCF technology provides unique data writing/backup methods to effectively use persistent main memory so that data recovery after a crash is faster and more accurate. The approach uses two main methods: recomputation and lazy persistency (LP). This combination avoids the need to expend large amounts of energy to rewrite lost data. Companies can rewrite their software and run it on any hardware platform to obtain the system recovery benefits.

Technical Details

The UCF invention comprises methods for accelerating program execution on NVM while at the same time reducing the number of writes. Included are steps for organizing a set of instructions into multiple regions. At least one of the regions is a recovery unit, and another is an error checking unit. The recovery unit includes written data to be transferred to NVMM, while the error checking unit summarizes the written data into a value.
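
As a rough illustration of the region structure described above, the following Python sketch splits a loop into fixed-size regions. Each region's writes play the role of the recovery unit, and a checksum over those writes plays the role of the error checking unit. The names `run_region`, `compute`, and `REGION_SIZE`, and the choice of `zlib.crc32` as the summary function, are assumptions made for illustration and are not details of the invention; ordinary Python lists stand in for NVMM.

```python
import struct
import zlib

REGION_SIZE = 4  # hypothetical number of loop iterations per region


def compute(i):
    """Stand-in for the real per-iteration work (hypothetical)."""
    return i * i


def checksum(values):
    """Summarize a region's written data into a single value (CRC32 here)."""
    return zlib.crc32(struct.pack(f"{len(values)}q", *values))


def run_region(region_id, data, checksums):
    """Execute one region: write its results, then record their checksum."""
    start = region_id * REGION_SIZE
    end = min(start + REGION_SIZE, len(data))
    for i in range(start, end):
        data[i] = compute(i)  # recovery unit: data destined for NVMM
    checksums[region_id] = checksum(data[start:end])  # error checking unit


n = 10
data = [0] * n
num_regions = (n + REGION_SIZE - 1) // REGION_SIZE
checksums = [None] * num_regions
for r in range(num_regions):
    run_region(r, data, checksums)
```

In this sketch the region size sets the trade-off: larger regions mean fewer checksums to compute and store, but more work to redo if a region has to be recomputed.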

One key aspect of the invention relaxes requirements for data consistency in logging and checkpointing schemes. Instead, it allows data to be in an inconsistent state during some phases of a program's lifetime by only logging enough state to enable recomputation. When a failure occurs, the approach recovers to a consistent state by determining which parts of the computation were incomplete and then recomputing them. Another aspect is the use of LP, a software persistency method. LP exploits natural cache evictions to provide persistency without the need to eagerly flush cache blocks from the cache to the NVMM. Thus, the technique allows caches to write dirty blocks (that is, modified and unsaved data) back to the NVMM gradually through natural evictions. Software error detection mechanisms (checksums) enable the system to discover persistency failures. Compared to the state-of-the-art Eager Persistency technique, LP reduces the execution time and write amplification overheads from 9 percent and 21 percent to only 1 percent and 3 percent, respectively.
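
The detect-and-recompute flow can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: Python lists stand in for NVMM, the hypothetical `lost_regions` argument models writes that were still in the cache when the crash happened, and the checksum of every region is assumed to have persisted (a lost checksum would simply show up as a mismatch as well). The helper names `run`, `recover`, `bounds`, and `compute` are invented for this example.

```python
import zlib

REGION = 4  # hypothetical region size (loop iterations per region)


def compute(i):
    """Stand-in for the real per-iteration work (hypothetical)."""
    return 3 * i + 1


def checksum(values):
    """Software error detection: summarize a region's data into one value."""
    return zlib.crc32(repr(list(values)).encode())


def bounds(region_id, n):
    """Return the [start, end) index range covered by a region."""
    start = region_id * REGION
    return start, min(start + REGION, n)


def run(nvmm_data, nvmm_sums, lost_regions):
    """Run all regions with no explicit flushes.

    Writes from regions in `lost_regions` are treated as still sitting in
    the cache at crash time, so they never reach the simulated NVMM.
    """
    n = len(nvmm_data)
    for r in range(len(nvmm_sums)):
        start, end = bounds(r, n)
        values = [compute(i) for i in range(start, end)]
        nvmm_sums[r] = checksum(values)  # assumption: the checksum persisted
        if r not in lost_regions:
            nvmm_data[start:end] = values  # evicted naturally before the crash


def recover(nvmm_data, nvmm_sums):
    """After a crash, recompute every region whose checksum does not match."""
    n = len(nvmm_data)
    recomputed = []
    for r in range(len(nvmm_sums)):
        start, end = bounds(r, n)
        if checksum(nvmm_data[start:end]) != nvmm_sums[r]:
            nvmm_data[start:end] = [compute(i) for i in range(start, end)]
            nvmm_sums[r] = checksum(nvmm_data[start:end])
            recomputed.append(r)
    return recomputed


n = 10
nvmm_data = [0] * n
nvmm_sums = [None] * 3  # ceil(10 / 4) regions
run(nvmm_data, nvmm_sums, lost_regions={1})
repaired = recover(nvmm_data, nvmm_sums)
```

Here only region 1 fails its checksum comparison, so recovery redoes that region alone and leaves the regions that persisted correctly untouched.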

Stage of Development

Prototype available.

Benefit

  • Provides near-zero execution time overhead and write endurance overhead
  • Works on any hardware platform without requiring any changes to the hardware and ISA
  • Eliminates the need for additional writes to NVMM while maintaining write endurance

Market Application

  • Emerging NVMs
  • Software development at the library level to provide failure recovery to existing code
  • Loop-based kernels used in scientific computing