Proceedings of the IEEE/MELECON-96, Bari, Italy, May 13.-16., 1996.
Department of Computer Engineering (1)
School of Electrical Engineering
University of Montenegro
Cetinjski put bb.
81000 Podgorica
Montenegro, Yugoslavia
Sun Microsystems Computer Corporation (2)
2550 Garcia Avenue
Mountain View, CA 94043, USA
The major problem addressed by this research is how to make cache memory considerably more efficient for the same transistor count (which is important for high-end microprocessor design), or how to make it considerably less area- and power-consumptive for the same performance (which is important for low-cost and low-power microprocessor design). This particular research paper is aimed at the uniprocessor cache memory [3], although the concept under consideration is fully applicable to the multiprocessor cache memory as well [4].
Existing cache memory architectures are based on the application characteristics of spatial and temporal locality [1]. In current cache memories, the two characteristics are utilized jointly, i.e., the existing cache architectures imply one cache organization which benefits from both characteristics and makes no attempt to take special advantage of one locality type in one set of conditions and the other locality type in another set of conditions. Such an approach is obviously not optimal, since data/instructions which exhibit predominantly temporal locality and data/instructions which exhibit predominantly spatial locality are treated equally, using the same architectural constructs.
In this paper, under the term "data exhibiting predominantly temporal locality," we mean data for which the probability of referencing the same address again in the near future is high. Note that a relatively large percentage of data from this group exhibits the temporal locality "strongly," rather than "predominantly" (e.g., scalars, local variables, synchronization variables, loop control variables, and similar).
On the other hand, under the term "data exhibiting predominantly spatial locality," we mean data for which, after a given address is accessed, the probability is relatively high that a neighboring address will also be accessed. That is why data are fetched into the cache in blocks on a cache miss. Note that a relatively small percentage of data from this group exhibits the spatial locality "strongly"; actually, most of the data from this group also exhibits a relatively strong temporal locality component. Consequently, the term spatio-temporal, rather than spatial, would better reflect the locality type; however, for simplicity, from now on we will use the term "spatial" exclusively (e.g., elements of complex data structures, array elements, record elements, and similar).
The approach to be elaborated in this paper was initially introduced in [2]; however, this paper represents the first attempt to present and analyse the approach thoroughly.
The essence of the STS (Split Temporal/Spatial) cache approach under consideration here is as follows. AT COMPILE TIME, with some estimation probability, the data can be classified as exhibiting predominantly temporal locality or predominantly spatial locality. The "temporal" data need a cache hierarchy; however, a smaller cache capacity on each hierarchy level can satisfy the needs, since fetching of entire blocks is not necessary. The "spatial" data do not need any cache hierarchy, and a relatively small prefetch buffer is expected to satisfy the needs. Therefore, splitting the cache according to the STS principle makes the overall cache memory considerably smaller for approximately the same performance. AT RUN TIME, it may become obvious that (due to a wrong compile-time estimation) some "temporal" data ended up in the spatial part of the cache architecture, and vice versa. Consequently, hardware mechanisms have to be provided to monitor the run-time behavior of data and to allow run-time transfer and reallocation between the two cache parts. These hardware mechanisms increase the hardware complexity, but only by a relatively small amount. Details of the STS cache organization are given in Figure 1.
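As a sketch of the compile-time step, the examples given earlier suggest a simple static tagging rule: scalars, loop-control and synchronization variables go to the "temporal" part, while array and record elements go to the "spatial" part. The enum names and the function below are our own illustration, not a construct from the paper:

```c
/* Hypothetical compile-time tagging pass: each variable kind that the
   compiler can recognize is mapped to one of the two STS cache parts. */
typedef enum { TAG_TEMPORAL, TAG_SPATIAL } sts_tag;

typedef enum {
    KIND_SCALAR,       /* plain scalars and local variables */
    KIND_LOOP_VAR,     /* loop control variables            */
    KIND_SYNC_VAR,     /* synchronization variables         */
    KIND_ARRAY_ELEM,   /* array elements                    */
    KIND_RECORD_ELEM   /* record (struct) elements          */
} var_kind;

sts_tag sts_classify(var_kind k) {
    switch (k) {
    case KIND_ARRAY_ELEM:
    case KIND_RECORD_ELEM:
        return TAG_SPATIAL;   /* prefetch-buffer path, no hierarchy */
    default:
        return TAG_TEMPORAL;  /* small multi-level hierarchy path   */
    }
}
```

A real compiler would attach the resulting tag to each load/store it emits, so that the hardware can steer the access to the proper cache part.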
Figure 1: Organization of the STS Cache. Legend: MM - main memory; TM+RC - transfer mechanism and reallocation logic; PFB+PFM - prefetch buffer; CPU - central processing unit; C1S - first-level "spatial" cache; CnT - n-th level "temporal" cache (n=1,2,3); B - word-size prefetch buffer. Explanation: The optimal cache structure (from the locality-related cache organization point of view) implies that the cache is split into a "temporal" part (with a cache hierarchy and smaller caches on each of three levels) and a "spatial" part (without a cache hierarchy, and with a prefetch buffer to main memory and a small word-size buffer for prefetching the next word to the CPU). In addition, there is a transfer and retagging logic based on a run-time profiling mechanism to point to data items being initially allocated to a wrong cache part, as well as to those changing the behavior pattern during run time.
Also, in the rest of the paper, we will consider only the splitting of the data cache, since that is where much better performance improvements are expected (instructions exhibit mostly spatial locality, so the "spatial" part of the STS cache is expected to dominate).
Further discussion is organized into the following subsections: (a) organization of the hardware, (b) compile-time algorithm for initial data allocation, (c) profile-time algorithm for the optional profile time retagging and optimization of relevant parameters, and (d) run-time algorithm for on-line reallocation of data. All these algorithms will be referred to as the STS set of algorithms.
The counter value (initially zero) is incremented on each access to the upper half of the block, decremented on each access to the lower half of the block, and reset on replacement of the block. When the counter value reaches the upper (x) or lower (-x) limit, further counting is disabled. The counter values are checked periodically (because the density of accesses per unit of time determines the degree of temporal locality). The period of counter checking is controlled by another counter, which counts the overall number of accesses to the block. When this counter reaches some prespecified value (y), the value of the up-down counter is checked. If either limit (x or -x) has been reached, the block is tagged as "temporal"; otherwise, it is tagged as "spatial." After checking, the proper tag is written into a D flip-flop, and both counters are reset. Even if the block is retagged on this occasion, it stays where it is until being evicted to memory on replacement, with its new tag. On the next access to this block, a cache miss will be incurred, and the block will be fetched into the appropriate cache part according to the new tag.
```
if (hit.in.block) {
    if (-X < Xcount < X)
        if (Hi) Xcount = Xcount + 1;
        else Xcount = Xcount - 1;
    if (Ycount < Y)
        Ycount = Ycount + 1;
    else {
        if (-X < Xcount < X) Tag = Spatial;
        else Tag = Temporal;
        Xcount = 0;
        Ycount = 0;
    }
}
```

Figure 2: The simplest run-time logic and algorithm for reallocation of data between the S- and the T-parts. Explanation: The two-counter logic is associated with each block in C3T and C1S to monitor the accesses for possible retagging. Blocks with changed tags are later reallocated to the other part of the STS cache (indirectly, via memory). The precise run-time algorithm for tagging is shown in this figure.
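The pseudocode of Figure 2 can be turned into a runnable sketch. The structure layout, the function name, and the concrete values of the saturation limit X and the checking period Y below are our own illustrative choices, not the paper's:

```c
#include <stdbool.h>

#define X 4    /* saturation limit of the up-down counter (assumed value)    */
#define Y 16   /* checking period, in accesses to the block (assumed value)  */

typedef struct {
    int  xcount;    /* up-down counter: +1 for the upper half, -1 for lower */
    int  ycount;    /* access counter that triggers the periodic check      */
    bool temporal;  /* current tag of the block (the D flip-flop)           */
} block_state;

/* One hit on a cached block; 'hi' is true when the access falls into
   the upper half of the block. Mirrors the logic of Figure 2. */
void sts_access(block_state *b, bool hi) {
    if (b->xcount > -X && b->xcount < X)      /* counting disabled at limits */
        b->xcount += hi ? 1 : -1;
    if (b->ycount < Y) {
        b->ycount++;
    } else {                                  /* periodic check              */
        b->temporal = !(b->xcount > -X && b->xcount < X);
        b->xcount = 0;                        /* both counters reset         */
        b->ycount = 0;
    }
}
```

A stream of hits concentrated in one half of the block saturates the up-down counter, so the periodic check tags the block "temporal"; an evenly spread stream keeps the counter near zero and the block is tagged "spatial."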
Organizational parameters of the cache memory are: 4-way set associativity and an LRU replacement algorithm. The inclusion property is maintained in the entire temporal hierarchy.
The assumed access times are as follows:

Hit[C1T] = 1
Miss[C1T] and Hit[C2T] = 2
Miss[C1T] and Miss[C2T] and Hit[C3T] = 8
Miss[C1T] and Miss[C2T] and Miss[C3T] = 32
Hit[B(s)/on.the.cpu.side] = 1
Miss[B(s)/on.the.cpu.side] and Hit[C1S] = 8
Miss[B(s)/on.the.cpu.side] and Miss[C1S] = 32

All time-related parameters were chosen for a typical state-of-the-art CMOS VLSI. All analyzed structures are chosen to be of the same transistor count, for an easier comparison.
Simulation analysis should best be done using real benchmarks. In order to provide the ground for comparison with other existing approaches, the widespread SPLASH-2 benchmark suite can be used. The major goal of the simulation analysis would be to determine performance [5].
All performance related estimations can be based on the equal transistor count designs for Solution#1, Solution#2, and Solution#3. Also, the estimations could be performed for a smaller cache design STS-32 (cache transistor count typical of popular 32-bit microprocessors) and a larger cache design STS-64 (cache transistor count typical of popular 64-bit microprocessors).
Basic conclusions of the performance analysis are expected to be as follows: the results are approximately the same for the STS-32 and the STS-64 cases, which means that the obtained results can be extrapolated to the cache sizes of future 128-bit microprocessors and beyond. This may not be immediately clear, given that data structures do not necessarily map into a larger space on a 128-bit processor (the block size and the trend in data structure sizes matter more).
Implementation analysis should best be done using VLSI design tools which provide area/time-related information. First, the analysis can be performed using an FPGA VLSI family, to prove the implementability of the run-time retagging and reallocation control logic (the core of the solution). Second, the analysis can be done using standard-cell VLSI area estimation (for the control part of the design) and full-custom VLSI area estimation (for the storage part of the design), using the widespread MOSIS-compatible tool L-EDIT. The major goal of the implementation analysis would be to determine complexity.
As far as the feasibility of the design is concerned, it can be concluded even at this point that the entire control logic for run-time reallocation can fit into a single FPGA VLSI chip (e.g., Lattice, Xilinx, Altera, etc.).
As far as the complexity analysis is concerned, our first goal should be to compute the transistor count of the added run-time counters and the added constructs for run-time reallocation control. Our final goal should be to determine the transistor-count reduction of an STS cache with the same performance as a classical cache.
Basic conclusions of the complexity analysis are expected to be as follows: The results are again approximately the same for the STS-32 and the STS-64 cases, which means that the obtained results can be extrapolated for the cases corresponding to the cache sizes of the future modifications of these machines.
[2] Milutinovic, V., "Surviving the Design of a 200MHz RISC Microprocessor: Lessons Learned," IEEE Computer Society Press, Los Alamitos, California, USA, 1995.
[3] Smith, A.J., "Cache Memories," ACM Computing Surveys, September 1982, pp. 473-530.
[4] Tomasevic, M., Milutinovic,V., "The Cache Coherence Problem in Shared Memory Multiprocessor Systems," IEEE Computer Society Press, Los Alamitos, California, USA, 1993.
[5] Tremblay, M., Maturana, G., Inoue, A., Kohn, L., "A Fast and Flexible Performance Simulator for Micro-Architecture Trade-off Analysis on UltraSPARC-1," Proceedings of the IEEE Design Automation Conference, San Francisco, California, USA, June 1995.