Memory-3

  • Fully associative mapping:
    • A memory block can go into any cache block. This is not practical to implement, especially when the number of blocks in the cache is large, since every block must be checked on a lookup.
  • Set-associative mapping:
    • The cache is divided into a number of sets; in an n-way set-associative mapping, each set has n blocks (n = 2 for a 2-way set-associative mapping).

Here each memory block is mapped to a unique set in the cache, but within the set the memory block can be placed in any of the n blocks. The mapping to a set is given by: cache set = (memory block number) mod (number of sets in cache)

              Cache
        0 +==============+ <- set 0
          |              |
        1 +--------------+
          |              |
        2 +==============+ <- set 1
          |              |
        3 +--------------+
          |              |
        4 +==============+ <- set 2
          |              |
        5 +--------------+
          |              |
        6 +==============+ <- set 3
          |              |
        7 +--------------+
          |              |
          +==============+
  • Since many memory blocks can be mapped to the same cache block, we need to identify which memory block is currently in a cache block. For this purpose the most significant (MS) bits of the memory address are used; these bits constitute the tag field. If we divide the memory address into offset, index and tag fields, we get the order: tag | index | offset. Each cache line also has a valid bit to indicate whether its contents are valid.
  • In a fully associative cache, any line can be placed anywhere, so there is no index field. To find a particular line, the tag of every line is compared and the valid bit is checked; if a tag matches and its valid bit is set, there is a cache hit.
  • For an n-way set-associative cache, there are n entries for each cache index. The index field of the address selects a particular set, and the tags of the lines in that set can be compared in parallel. If there is a match in tag and the valid bit is set, we have a cache hit.
  • Evidently, associative caches require more hardware and are slower than a direct-mapped cache, but they reduce conflict misses, especially for workloads whose addresses would map to the same line in a direct-mapped cache. Since an associative cache can hold multiple addresses in a set, bringing a new line into a full set requires a "replacement policy" to choose which existing line to evict. Common replacement policies are LRU, LFU, etc.
  • LRU can be implemented simply by recording a timestamp with each cache block and evicting the block with the oldest timestamp.
  • Improving cache performance
    • Improving hit latency
      • Keep the cache small.
      • Use multi-banked caches.
      • Simple cache replacement algorithms (minimum work during a hit).
    • Reducing miss rate
      • Larger cache.
      • Better cache replacement algorithm.
      • Increase associativity.
      • Pre-fetching.
      • Increase block size.
    • Reducing miss penalty
      • Multi-level caches.
      • Deliver critical word first.
  • Loop Interchange optimization
    • Consider a programming language (like C) that uses row-major ordering for multi-dimensional arrays: the elements of each row are stored contiguously. If one accesses the elements in column order, consecutive accesses may not fall in the same memory (and hence cache) block. Here is an example:

        for ( j = 0 ; j < 4000 ; j++ )
          for ( i = 0 ; i < 4000 ; i++ )
          {
            A[i][j] = A[i][j] + B[i][j] ;
          }
    • The elements of the array are stored in memory in the order
        A[0][0], A[0][1], A[0][2] ... A[1][0], A[1][1], ...
        ...
        A[3999][0], A[3999][1], ...
      But the elements are accessed in the order
        A[0][0], A[1][0], A[2][0] ... A[0][1], A[1][1], ...
        ...
        A[0][3999], A[1][3999], ...
    • If the array is large (as in this case) and the cache is small (say consisting of 512 blocks), by the time we access all the elements of column 0 (i.e. A[0][0], A[1][0], ... A[3999][0]), the blocks brought in the first iterations are replaced by the accesses in the later iterations. Thus no temporal locality is exploited. Further since we access only one element in each row in the inner iteration, no spatial locality is exploited either.
    • In this case, if we interchange the i and j loops, the elements are accessed in the order
        A[0][0], A[0][1], A[0][2] ... A[1][0], A[1][1], ...
        ...
        A[3999][0], A[3999][1], ...
      which is the same as the order in which they are stored in memory, so spatial locality is exploited. But it should be noted that loop interchange cannot be performed if it violates any data dependency.

Page last modified on April 10, 2008
