In modern microprocessors, locality of memory access is crucial to achieving high performance due to a phenomenon known as the "memory wall". The quest for access locality tends to make algorithms more complex and often architecture dependent: programs are in principle portable, but performance, as a non-functional property, is not. This problem is exacerbated for parallel programs on multi-core architectures, where a hierarchy of private and shared caches determines memory access latency.
Memory temperatures are an architectural extension and a novel programming model. The key idea is to associate with each memory location a "temperature" that indicates the latency a processor core should expect when accessing that location: a high temperature expresses 'closeness' and efficient access, a low temperature the opposite. Different processor cores may observe different temperature values for the same memory address. Temperatures are merely a hint: inaccurate or untimely use of temperature information can result in inefficiency, but it cannot cause incorrect program behavior.
The idea of enhancing locality through temperature information is simple: a thread should preferably carry out work on warm or hot input data. Many parallel algorithms follow a pattern in which parallel threads choose work from a fine-grained pool, e.g., task-parallel programs. Such programs can, with little effort, be adapted so that individual threads prioritize task selection using temperature information.
We developed a Valgrind-based emulator for an x86 architecture with support for memory temperatures and evaluated a temperature-aware task scheduler on mergesort. Our experiments demonstrate that temperature-aware task selection can achieve almost perfect cache locality. However, the task-scheduler extension requires careful tuning to prevent the overhead of the more complex task-selection algorithm from outweighing the performance gains from cache locality.
Recent changes: May 19, 2009.