Technical Report of the School of Electrical Engineering, University of Belgrade,
#35/95, January 1995, Belgrade, Serbia, Yugoslavia

and

Proceedings of the IEEE/MELECON-96, Bari, Italy, May 13-16, 1996.


A New Cache Architecture Concept:
The Split Temporal/Spatial Cache

V. Milutinovic, M. Tomasevic, B. Markovic (1), and M. Tremblay (2)
Department of Computer Engineering
School of Electrical Engineering
University of Belgrade
POB 816
11000 Belgrade
Serbia, Yugoslavia
Phone: (381) 11-762-214; Fax: (381) 11-762-215;
E-Mail: emilutiv@etf.bg.ac.yu

Department of Computer Engineering (1)
School of Electrical Engineering
University of Montenegro
Cetinjski put bb.
81000 Podgorica
Montenegro, Yugoslavia

Sun Microsystems Computer Corporation (2)
2550 Garcia Avenue
Mountain View, CA 94043, USA

Abstract - This paper concentrates on the problem of better utilization of cache memory. Existing solutions imply no separation of data based on their type of locality. Here, a new cache architecture concept is introduced, in which data are split into two different cache memory subsystems, according to the type of locality they exhibit: predominantly temporal or predominantly spatial. The two cache subsystems employ different internal organizations, for better utilization of data characteristics. This paper concentrates on both performance and complexity issues. The major conclusion is that the performance benefits of the proposed approach considerably outweigh its somewhat increased complexity.

I. INTRODUCTION

The major problem of this research is how to make cache memory considerably more efficient for the same transistor count (which is important for the high-end microprocessor design) or how to make it considerably less area and power consumptive for the same performance (which is important for low-cost and low-power microprocessor design). This particular research paper is aimed at the uniprocessor cache memory [3], although the concept under consideration is fully applicable to the multiprocessor cache memory, as well [4].

Existing cache memory architectures are based on the application characteristics of spatial and temporal locality [1]. In current cache memories, the two characteristics are utilized jointly, i.e., the existing cache architectures imply one cache organization which benefits from both characteristics, and makes no attempt to take special advantage of one locality type in one set of conditions, and the other locality type in another set of conditions. Such an approach is obviously not optimal, since data/instructions which exhibit a predominantly temporal locality and data/instructions which exhibit a predominantly spatial locality are treated equally, using the same architectural constructs.

In this paper, under the term "data exhibiting predominantly temporal locality", we mean that the probability of referencing a certain address again in the near future is high. Note that a relatively large percentage of data from this group exhibits the temporal locality "strongly", rather than "predominantly" (e.g., scalars, local variables, synchronization variables, loop control variables, and similar).

On the other hand, under the term "data exhibiting predominantly spatial locality" we mean that, after a given address is accessed, the probability is relatively high that a neighboring address will also be accessed. That is why, on a cache miss, data are fetched into the cache in blocks. Note that a relatively small percentage of data from this group exhibits the spatial locality "strongly"; actually, most of the data from this group also exhibits a relatively strong temporal locality component. Consequently, the term spatio-temporal, rather than spatial, would better reflect the locality type; however, for simplicity reasons, from now on, we will be using the term "spatial" exclusively for such data (e.g., elements of complex data structures, array elements, record elements, and similar).

The approach to be elaborated in this paper was initially introduced in [2]; however, this paper represents the first attempt to present and analyse the approach thoroughly.

The essence of the STS (Split Temporal/Spatial) cache approach under consideration here is as follows: AT COMPILE TIME, with some estimation probability, the data can be classified as exhibiting either predominantly temporal locality or predominantly spatial locality. The "temporal" data need a cache hierarchy; however, a smaller cache capacity on each hierarchy level can satisfy the needs, since fetching of entire blocks is not necessary. The "spatial" data do not need any cache hierarchy, and a relatively small prefetch buffer is expected to satisfy the needs. Therefore, splitting the cache according to the STS principle makes the overall cache memory considerably smaller for approximately the same performance. AT RUN TIME, it may become obvious that (due to a wrong compile-time estimation) some "temporal" data ended up in the spatial part of the cache architecture, and vice versa. Consequently, hardware mechanisms have to be provided to monitor the run-time behavior of data, and to allow the run-time transfer and reallocation between the two cache parts. These hardware mechanisms increase the hardware complexity, but only by a relatively small amount. Details of the STS cache organization are given in Figure 1.
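The split lookup path can be illustrated with a minimal sketch. Everything below (the class and attribute names, the block size, and the unbounded set-based storage with no replacement) is an illustrative assumption rather than part of the proposal itself; the point is only that "temporal" accesses fetch single words, while "spatial" accesses fetch whole blocks.

```python
BLOCK_WORDS = 8  # assumed block size of the "spatial" part, in words

class STSCache:
    """Toy model of the split lookup: word-grain T-part, block-grain S-part."""
    def __init__(self):
        self.temporal = set()  # word addresses resident in the T-part
        self.spatial = set()   # word addresses resident in the S-part

    def access(self, addr, tag):
        """tag is 'T' or 'S', as assigned at compile time; returns hit/miss."""
        if tag == 'T':
            hit = addr in self.temporal
            if not hit:
                self.temporal.add(addr)        # fetch a single word only
            return hit
        hit = addr in self.spatial
        if not hit:
            base = addr - addr % BLOCK_WORDS   # fetch the entire block
            self.spatial.update(range(base, base + BLOCK_WORDS))
        return hit
```

A miss on a "spatial" word brings in its whole 8-word block, so a following access to a neighboring word hits; a miss on a "temporal" word brings in that word alone.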

Figure 1: Organization of the STS Cache

Legend: MM- main memory; TM+RC - transfer mechanism and reallocation logic; PFB+PFM - prefetch buffer; CPU - central processing unit; C1S - first level "spatial" cache; CnT - n-th level "temporal" cache (n=1,2,3); B - word-size prefetch buffer.

Explanation: The optimal cache structure (from the locality-related cache organization point of view) implies that the cache is split into a "temporal" part (with a cache hierarchy and smaller caches on each of three levels) and a "spatial" part (without a cache hierarchy, with a prefetch buffer toward main memory, and with a small word-size buffer for prefetching the next word to the CPU). In addition, there is a transfer and retagging logic based on a run-time profiling mechanism, to point to data items initially allocated to the wrong cache part, as well as to those changing their behavior pattern during run time.

II. DETAILS OF THE SOLUTIONS TO BE ANALYSED

This section presents the details of the overall STS cache architecture and its variations. One variation implies the incorporation of separate instructions for each data behavior type (e.g., temporal load, temporal store, spatial load, spatial store, etc.). Such a variation may be easier to support on some existing instruction set architectures (ISAs). For instance, the UltraSPARC architecture includes ASI (Address Space Identifier) loads which could be used for this purpose. The other variation implies the tagging and retagging of data ("temporal" and "spatial"). The two variations are theoretically equivalent. However, the analysis to follow is based only on data tagging, since it is practically more feasible for the majority of today's microprocessors, whose ISAs include no ASI-like instructions.

Also, in the rest of the paper, we will consider only the splitting of the data cache, since that is where considerably better performance improvements are expected (instructions exhibit mostly the spatial locality, and the "spatial" part of the STS cache is expected to dominate).

Further discussion is organized into the following subsections: (a) organization of the hardware, (b) compile-time algorithm for initial data allocation, (c) profile-time algorithm for the optional profile time retagging and optimization of relevant parameters, and (d) run-time algorithm for on-line reallocation of data. All these algorithms will be referred to as the STS set of algorithms.

A. The solutions

The follow-up paper aims to compare three solutions of the same transistor count: the baseline, classical cache organization with a three-level hierarchy, referred to as Solution#1, and two variations of the STS data cache, referred to as Solution#2 (two hierarchy levels in the temporal part, with larger capacity caches on each level) and Solution#3 (three hierarchy levels in the temporal part, with smaller capacity caches on each level). Solutions #2 and #3 also include a prefetch buffer (PFB) in the spatial part, for automatic prefetching from memory of the block next to the one currently being accessed. There is also a small word-size buffer (B) for prefetching the next word between the CPU and C1S. The access times for different cache hierarchy levels should be chosen according to the technology under consideration; however, for comparison and illustration purposes, in the analysis to follow, we have chosen the access time of each next cache hierarchy level to be four times larger. The access time for the main memory (L) is again technology-dependent; in the analysis to follow, we have chosen one concrete value (e.g., L=32). For example, UltraSPARC and Alpha 21164 have access times of 2 and 8 for the first two hierarchy levels, and 32 for the main memory.

B. The simplest compile-time algorithm

The simplest compile-time algorithm for the initial allocation implies that simple variables and constants are labeled as T (temporal), and the elements of complex data structures are labeled as S (spatial). More sophisticated algorithms are possible, but will not be considered in this paper (they are the subject of a follow-up paper). Please note that performance evaluation with semantic benchmarks (e.g., SPLASH-2) makes the initial allocation easy. However, performance evaluation with ready-to-use traces (e.g., ATUM) makes the initial allocation relatively cumbersome, so a default initial allocation may be more appropriate. Later on, at run time, after some "cold start" period, the retagging mechanism (to be explained next) will bring the data into some kind of "steady state". Note that compiler issues are outside the scope of this paper.
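The rule above can be sketched in a few lines. The toy type model below (the set of names treated as "simple") is an assumption for illustration only, since the paper leaves compiler issues out of scope; a real compiler would consult its symbol table rather than match type names.

```python
# Assumed toy set of "simple" types, for illustration only.
SIMPLE_TYPES = {"int", "float", "char", "bool", "const"}

def initial_tag(var_type):
    """Simplest compile-time rule: simple variables and constants -> T
    (temporal); elements of complex data structures -> S (spatial)."""
    return 'T' if var_type in SIMPLE_TYPES else 'S'
```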

C. The simplest profile-time algorithm

The simplest profile-time algorithm implies the existence of a control logic to create different initial setups of the relevant parameters for the run-time reallocation (counter thresholds, etc.), and to measure the program execution time for each parameter setup. Normally, the setup which enables the best execution time would be selected to support the run-time activities. Of course, the same effect could be accomplished off-line, using an appropriate simulator. Once again, the profile time activities are optional, and will not be further elaborated in this paper.

D. The simplest run-time algorithm

The simplest run-time algorithm implies that each block in the spatial part (C1S) and in the highest level of the temporal part (C3T) has a small up-down counter associated with it (which increases the overall cache transistor count by about 5% to 10%), as shown in Figure 2.

The counter value (initially zero) is incremented on each access to the upper half of the block, decremented on each access to the lower half of the block, and reset on the replacement of the block. When the counter value reaches the upper (x) or the lower limit (-x), further counting is disabled. The counter values are checked periodically (because the density of accesses in a unit of time determines the degree of temporal locality). The period of counter checking is controlled by another counter, which counts the overall number of accesses to the block. When this counter reaches some prespecified value (y), the value of the up-down counter is checked. If either limit (x or -x) has been reached, the block is tagged as "temporal"; otherwise, it is tagged as "spatial". After checking, the proper tag is written into a D flip-flop, and both counters are reset. Even if the block is retagged on this occasion, it stays in its current cache part until it is evicted to memory on replacement, carrying its new tag. On the next access to this block, a cache miss will be incurred, and the block will be fetched into the appropriate cache part according to the new tag.

III. CONDITIONS AND ASSUMPTIONS

Next, this paper specifies precisely the conditions and the assumptions of the research to follow. Under the term "conditions" we mean the specifiers of the real environment. Under the term "assumptions" we mean the simplifications that facilitate the analysis without any negative impact on the validity and the generality of the results.
if (hit.in.block)
      {    if (-X < Xcount && Xcount < X)
                 if (Hi) Xcount = Xcount + 1;
                 else    Xcount = Xcount - 1;
            if (Ycount < Y) Ycount = Ycount + 1;
            else {   if (-X < Xcount && Xcount < X) Tag = Spatial;
                     else                           Tag = Temporal;
                     Xcount = 0;
                     Ycount = 0;
                 }
      }
Figure 2: The simplest run time logic and algorithm for reallocation of data between the S- and the T-parts.

Explanation: The two-counter logic is associated with each block in C3T and C1S, to monitor the accesses for possible retagging. Blocks with changed tags are later reallocated to the other part of the STS cache (indirectly, via memory). The precise run-time algorithm for tagging is shown in this figure.
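The figure's pseudocode can also be exercised as a small runnable model. The concrete parameter values (X = 4, Y = 8, consistent with an 8-word block), the class, and the method names are assumptions for illustration; the replacement-triggered counter reset and the eviction path are not modeled here.

```python
X, Y = 4, 8  # example limits: x is half of an 8-word block, and y >= x

class BlockCounters:
    """Two-counter retagging logic, one instance per block in C1S/C3T."""
    def __init__(self, tag):
        self.tag = tag      # current tag: 'T' (temporal) or 'S' (spatial)
        self.xcount = 0     # saturating up-down counter
        self.ycount = 0     # access counter controlling the checking period

    def on_hit(self, upper_half):
        # count up on upper-half accesses, down on lower-half accesses,
        # saturating at +X / -X
        if -X < self.xcount < X:
            self.xcount += 1 if upper_half else -1
        # after Y accesses, check the up-down counter and retag
        if self.ycount < Y:
            self.ycount += 1
        else:
            # a saturated counter means one-sided accesses -> temporal
            self.tag = 'T' if abs(self.xcount) >= X else 'S'
            self.xcount = self.ycount = 0
```

A block hit repeatedly in the same half saturates the counter and is retagged "T" at the next check; accesses alternating between the halves keep the counter near zero and yield "S".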

A. The architecture and organization level

The two lower hierarchy levels in the temporal part are write-through caches with a block size of one word, while the spatial part and the highest level in the temporal part are write-back caches with an 8-word block size. In the "spatial" part, by definition, there are no other cache levels. The conditions for the correct functioning of the retagging logic are: a) the count limit (x) of the up-down counter must be no less than a half of the block size (in words); and b) the period of checking (y) must be no less than the count limit (x).
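These two conditions can be expressed as a mechanical check; the concrete values used below are merely examples consistent with the 8-word block size, not values prescribed by the paper.

```python
def valid_retagging_params(block_words, x, y):
    """Conditions a) and b): x >= half the block size (in words), y >= x."""
    return x >= block_words // 2 and y >= x
```

For an 8-word block, x = 4 and y = 8 satisfy both conditions, while x = 3 violates condition a).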

Organizational parameters of the cache memory are: 4-way set associativity, and the LRU replacement algorithm. The inclusion property is maintained throughout the entire temporal hierarchy.

B. The design level

Design parameters of the cache memory are: each higher level has four times larger capacity, and the highest levels (C3T in the temporal part and C1S in the spatial part) are of the copy-back type, while all other levels (C2T and C1T in the temporal part) are of the write-through type.

C. The technology level

The following technology related values are assumed:
		Hit[C1T] = 1
		Miss[C1T] and Hit[C2T] = 2
		Miss[C1T] and Miss[C2T] and Hit[C3T] = 8
		Miss[C1T] and Miss[C2T] and Miss[C3T] = 32
		Hit[B/on.the.cpu.side] = 1
		Miss[B/on.the.cpu.side] and Hit[C1S] = 8
		Miss[B/on.the.cpu.side] and Miss[C1S] = 32
All time-related parameters were chosen for a typical state-of-the-art CMOS VLSI. All analyzed structures are chosen to be of the same transistor count, for an easier comparison.
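The latency values above can be folded into an effective access time for the temporal hierarchy. The function below is a straightforward weighted sum; any hit rates plugged in are placeholders, not measurements from the paper.

```python
# Latencies from Section III.C: hit in C1T, C2T, C3T, and main memory.
LATENCY = (1, 2, 8, 32)

def temporal_access_time(h1, h2, h3):
    """Effective access time, given local hit rates of C1T, C2T, C3T."""
    t = h1 * LATENCY[0]
    t += (1 - h1) * h2 * LATENCY[1]
    t += (1 - h1) * (1 - h2) * h3 * LATENCY[2]
    t += (1 - h1) * (1 - h2) * (1 - h3) * LATENCY[3]
    return t
```

With a perfect C1T hit rate the effective time is 1; with misses at all three levels it degenerates to the main-memory latency of 32.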

IV. ANALYSIS

Any rigorous analysis implies two elements - performance analysis and complexity analysis. Both elements of the analysis can be done analytically (using a mathematical model), through simulation (using a behavioral model), and through implementation (using a VLSI design model). The major purpose of the analytical analysis, in the context of this research, would be to calculate the optimal values of the parameters relevant for the analysis to follow.

Simulation analysis should best be done using real benchmarks. In order to provide the ground for comparison with other existing approaches, the widespread SPLASH-2 benchmark suite can be used. The major goal of the simulation analysis would be to determine performance [5].

All performance related estimations can be based on the equal transistor count designs for Solution#1, Solution#2, and Solution#3. Also, the estimations could be performed for a smaller cache design STS-32 (cache transistor count typical of popular 32-bit microprocessors) and a larger cache design STS-64 (cache transistor count typical of popular 64-bit microprocessors).

Basic conclusions of the performance analysis are expected to be as follows: The results are approximately the same for the STS-32 and the STS-64 cases, which means that the obtained results can be extrapolated to the cache sizes of future 128-bit microprocessors, and beyond. This may not be immediately clear, given that data structures do not necessarily map into a larger space on a 128-bit processor; the block size and the trend in data structure sizes are more important.

Implementation analysis should best be done using VLSI design tools which provide area/time-related information. First, the analysis can be performed using an FPGA VLSI family, to prove the implementability of the run-time retagging and reallocation control logic (the core of the solution). Second, the analysis can be done using standard-cell VLSI area estimation (for the control part of the design) and full-custom VLSI area estimation (for the storage part of the design), using the widespread MOSIS-compatible tool L-EDIT. The major goal of the implementation analysis would be to determine complexity.

As far as the feasibility of the design is concerned, it can be concluded even at this point that the entire control logic for run-time reallocation can fit into a single FPGA VLSI chip (e.g., Lattice, Xilinx, Altera, ...).

As far as the complexity analysis is concerned, our first goal should be to compute the transistor count of the added run-time counters and the added constructs for the run-time reallocation control. Our final goal should be to determine the transistor count reduction of an STS cache of the same performance as the classical cache.

Basic conclusions of the complexity analysis are expected to be as follows: The results are again approximately the same for the STS-32 and the STS-64 cases, which means that the obtained results can be extrapolated for the cases corresponding to the cache sizes of the future modifications of these machines.

V. CONCLUSION

A new cache architecture is proposed and discussed, which is based on the separation of data exhibiting predominantly temporal and spatial locality. It is referred to as the STS cache architecture. The STS cache architecture considered here is expected to be both significantly more time-efficient and power-efficient, compared to the classical cache architectures. Exact performance and complexity data are the subject of a follow-up paper.

ACKNOWLEDGMENTS

The authors are thankful to the following individuals for providing the response to the basic notions presented here, and for their encouragement to continue with the research efforts in the domain of the STS cache: Professor Michael Flynn of Stanford University, Professor Alan Jay Smith of the University of California at Berkeley, Professor Gul Agha of the University of Illinois at Urbana-Champaign, Professors Hank Dietz, H.J. Siegel, and Jose Fortes of Purdue University, Professors Yale Patt and Trevor Mudge of the University of Michigan, Professor Ali Hurson of the Pennsylvania State University, and Professor Daniel Tabak of George Mason University.

REFERENCES

[1] Flynn, M.J., "Advanced Computer Architecture and Organization," Jones and Bartlett Publishers, Boston, Massachusetts, USA, 1995.

[2] Milutinovic, V., "Surviving the Design of a 200MHz RISC Microprocessor: Lessons Learned," IEEE Computer Society Press, Los Alamitos, California, USA, 1995.

[3] Smith, A.J., "Cache Memories," ACM Computing Surveys, September 1982, pp. 473-530.

[4] Tomasevic, M., Milutinovic,V., "The Cache Coherence Problem in Shared Memory Multiprocessor Systems," IEEE Computer Society Press, Los Alamitos, California, USA, 1993.

[5] Tremblay, M., Maturana, G., Inoue, A., Kohn, L., "A Fast and Flexible Performance Simulator for Micro-Architecture Trade-off Analysis on UltraSPARC-1," Proceedings of the IEEE Design Automation Conference, San Francisco, California, USA, June 1995.