%T Instruction Level Characterization of the Cray Y-MP Processor
%D May 1992
%R TR 1086
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X Evolutionary computer architecture fundamentally relies on information
obtained from empirical characterizations of the nature of programs and of
dynamic program usage of machine features. While vector architectures have
dominated the supercomputer arena for over two decades and promise to continue
to provide superior single processor performance on a large class of scientific and engineering applications, detailed empirical characterizations of these
architectures and their workloads have not been reported to date in the
literature. This dissertation fills this void in the empirical understanding
of machines by reporting an instruction-level study of a single processor
of the CRAY Y-MP, using as benchmarks the scientific and engineering
application programs that comprise the PERFECT Club benchmark suite.
The capability of the compiler is key to harnessing the power of a
machine today. Hence we study a version of the benchmarks that is
automatically optimized and vectorized by the Cray Research, Inc. production
FORTRAN compiler. Furthermore, several optimizations that are easily
implementable manually provide significant performance improvements over
the best efforts of the compiler today. Therefore we also study a version
of the benchmarks, hand-optimized by a team of Cray Research, Inc. programmers, that won the 1990 Gordon Bell PERFECT award for the fastest version of
the PERFECT Club benchmarks on any machine. In both cases, we study only
the user routines of the benchmarks.
We observe that the vectorization level of the user routines of the
programs varies widely for the compiler-optimized version of the programs,
as opposed to being uniformly high. While hand optimizations do improve
the vectorization level, several benchmarks still have vectorization
levels below 80%. Consequently, the performance of the non-vector
features of vector machines is important to program performance.
We observe that the scalar code in the programs contains several address
calculation operations and Cray-specific miscellaneous operations, in
addition to the floating-point operations. Hence adequate attention needs
to be paid to these instruction types. Towards exploiting the data
dependencies in the scalar code for faster execution, we characterize
the dependencies within the scalar basic blocks of the programs.
We observe that the utilization of the pipeline parallelism of
the functional units by scalar code is low. Thus, any latency
improvements that result from lower levels of functional unit
pipelining will enhance scalar performance. For the overall programs,
we observe that large basic blocks are significant in number and are
important to program performance. Thus compiler optimization techniques
geared towards large basic blocks are desirable. The level of vectorization
of memory operations is usually very high, thus emphasizing the need
for memory bandwidth. We observe that for the CRAY Y-MP hardware and
for the techniques currently employed by the Cray Research compiler, the
peak instruction issue rate of the CRAY Y-MP is quite adequate, especially
because of the presence of vector instructions. In addition to the
issues mentioned above, several related topics are discussed and explored
in the dissertation.
%A G. Ramalingam
%A Thomas Reps
%T An Incremental Algorithm for a Generalization of the Shortest-Path
Problem
%D May 1992
%R TR 1087
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X The grammar problem, a generalization of the single-source
shortest-path problem introduced by Knuth, is to compute the
minimum-cost derivation of a terminal string from one or
more non-terminals of a given context-free grammar, with the
cost of a derivation being suitably defined. In this paper
we present an incremental algorithm for a version of the
grammar problem. As a special case of this algorithm we
obtain an efficient incremental algorithm for the single-
source shortest-path problem with positive edge lengths.
The aspect of our incremental algorithm that distinguishes
it from all other work on the dynamic shortest-path problem
is its ability to handle "multiple heterogeneous modifica-
tions": between updates, the input graph is allowed to be
restructured by an arbitrary mixture of edge insertions,
edge deletions, and edge-length changes.
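As an illustration (a sketch of ours, with made-up names, not code from
the paper), the batch version of the grammar problem can be solved with
Knuth's Dijkstra-like algorithm: each production carries a monotone
"superior" cost function of the costs of its right-hand-side
nonterminals, and single-source shortest paths with positive lengths is
the special case in which every rule for v has the form v <- w with cost
dist(w) + length(w, v). The paper's algorithm maintains such a solution
incrementally under the mixed modifications described above.

    import heapq

    def knuth_grammar_problem(productions):
        # productions maps each nonterminal A to a list of rules
        # (cost_fn, rhs); rhs is a list of nonterminals and cost_fn maps
        # their final costs to the cost of deriving A by that rule.
        # Each cost_fn is assumed monotone and "superior" (its value is
        # at least each argument), which lets the smallest open
        # candidate be finalized, exactly as in Dijkstra's algorithm.
        best, heap = {}, []

        def push_ready_candidates():
            # Push a candidate cost for every rule whose right-hand
            # side consists only of already-finalized nonterminals.
            for a, rules in productions.items():
                if a in best:
                    continue
                for cost_fn, rhs in rules:
                    if all(b in best for b in rhs):
                        cost = cost_fn(*(best[b] for b in rhs))
                        heapq.heappush(heap, (cost, a))

        push_ready_candidates()
        while heap:
            cost, a = heapq.heappop(heap)
            if a in best:
                continue              # stale candidate
            best[a] = cost            # smallest candidate is final
            push_ready_candidates()
        return best

    # Shortest paths from 's' as the special case:
    # one zero-cost rule for the source, one rule per edge.
    edges = {('s', 'a'): 2, ('s', 'b'): 5, ('a', 'b'): 1}
    productions = {'s': [(lambda: 0, [])]}
    for (w, v), length in edges.items():
        productions.setdefault(v, []).append(
            (lambda dw, l=length: dw + l, [w]))
    print(knuth_grammar_problem(productions))  # {'s': 0, 'a': 2, 'b': 3}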
%A Men-Chow Chiang
%T Memory System Design for Bus Based Multiprocessors
%D May 1992
%R TR 1088
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X This dissertation studies the design of single bus, shared memory
multiprocessors.
The purpose of these studies is to find optimum points in the design
space for the memory system components: private caches, the shared
bus, and main memory.
Two different methodologies are used based on the operating environment of
a multiprocessor. For a multiprocessor operating in the throughput-oriented
environment, Customized Mean Value Analysis (CMVA) models are developed to
evaluate the performance of the multiprocessor. The accuracy of the models
is validated by comparing their results to those generated by actual
trace-driven simulation over several thousand multiprocessor configurations.
The comparison results show that the CMVA models can be as accurate as
trace-driven simulation in predicting the multiprocessor throughput and
bus utilization. The validated models are then used to evaluate design
choices that include cache size, cache block size, cache set-associativity,
bus switching method, bus width, and main memory latency.
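For orientation only (textbook exact Mean Value Analysis for a closed
queueing network, not the customized CMVA models of the dissertation;
the service demands and population below are made-up numbers), the
basic MVA recurrence iterates over the customer population, treating
processors as customers and the bus and memory as queueing centers:

    def exact_mva(demands, n_customers):
        # demands[k]: total service demand at center k (e.g. the bus or
        # a memory module) per customer cycle through the system.
        # Returns system throughput and mean queue lengths at the
        # requested population, using the standard exact MVA recurrence.
        q = [0.0] * len(demands)       # mean queue length at each center
        x = 0.0
        for n in range(1, n_customers + 1):
            # residence time: service demand inflated by queueing behind
            # the customers already present (arrival theorem)
            r = [d * (1.0 + qk) for d, qk in zip(demands, q)]
            x = n / sum(r)             # system throughput at population n
            q = [x * rk for rk in r]   # Little's law applied per center
        return x, q

    # e.g. 8 processors, bus demand 0.4 and memory demand 0.6 time units
    # per memory request; throughput is bounded by 1 / max(demands).
    throughput, queues = exact_mva([0.4, 0.6], 8)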
For a multiprocessor operating in the speedup-oriented
environment, a multiprocessor simulator is built to conduct
an execution-driven simulation of several parallel benchmarks.
Performance statistics are collected separately for different
computational phases during the execution of a parallel program,
and are analyzed to characterize the behavior of the multiprocessor
in each computational phase and the overhead of parallel processing.
The results show that, with loop-level parallelization, performance
bottlenecks can occur in the scheduling phase (if dynamic scheduling
is used to schedule loop iterations to processors) as well as the
independent computing phase. The cache miss ratio of shared data in
the independent computing phase is also found to increase with the
number of processors, meaning that even for this phase the speedup
will eventually decrease, rather than leveling off, once it has
reached its maximum.
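The dynamic scheduling referred to above can be pictured with a small
Python sketch (ours, not the thesis's simulator; names and the chunk
size are arbitrary): workers claim chunks of loop iterations from a
shared counter, and contention on that counter is the scheduling-phase
cost, since in hardware the counter would be updated with a lock or an
atomic fetch&add.

    import threading

    def run_self_scheduled_loop(n_iters, n_workers, body, chunk=1):
        # Dynamic (self-scheduling) loop execution: each worker grabs
        # the next chunk of iterations from a shared counter, then runs
        # the loop body for that chunk independently.
        next_iter = 0
        counter_lock = threading.Lock()   # stands in for fetch&add

        def worker():
            nonlocal next_iter
            while True:
                with counter_lock:        # scheduling phase: claim a chunk
                    start = next_iter
                    next_iter += chunk
                if start >= n_iters:
                    return
                for i in range(start, min(start + chunk, n_iters)):
                    body(i)               # independent computing phase

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    results = [0] * 100
    run_self_scheduled_loop(100, 4, lambda i: results.__setitem__(i, i * i),
                            chunk=8)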
The barrier operation in a bus-based multiprocessor is found not to
be a serious concern. With an efficient atomic decrement instruction
available, the total time for all processors to decrement the barrier
counter can be a very small portion of overall execution time. Both the
barrier and serial times can become less significant, i.e., become a smaller
portion of total execution time of a parallel program, when the problem
size is increased.
The impact of the synchronization method on the execution time of the
scheduling phase is also considered in some detail. The performance
of several alternatives to the base test&test&set implementation, such
as a software queue, a lock table (proposed in this thesis),
and a restricted fetch&add operation with combining, is evaluated.
The simulation results show that both the lock table and restricted
combining methods can improve the scheduling rate significantly.
%A Michael J. Franklin
%A Michael J. Carey
%T Client-Server Caching Revisited
%D May 1992
%R TR 1089
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X Most commercial and experimental Object-Oriented Database Management Systems (OODBMSs)
are implemented using a client-server software architecture. An effective
technique for improving the performance of a client-server database system
is to allow client workstations to cache data and/or locks across transaction
boundaries. This paper extends an earlier study on the performance of caching
in client-server database systems [Care91a] in three ways. The first is a
re-examination of heuristics for algorithms that decide dynamically between
propagating changes and invalidating remote copies of data pages in order to
maintain cache consistency. The second extension is a performance study of
a class of caching algorithms known as "callback locking" algorithms. These
algorithms are of interest because they provide an alternative to the
optimistic techniques used in the earlier study and because they have recently
begun to find use in commercial systems. The third way that this paper extends
the previous study is to examine the performance of alternative caching
algorithms in light of current trends in processor and network speeds and to
more closely examine their robustness in the presence of data contention.
%A Manolis M. Tsangaris
%A Jeffrey F. Naughton
%T On the Performance of Object Clustering Techniques
%D June 1992
%R TR 1090
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X In this paper we investigate the performance of some of the best-known
object clustering algorithms on four different workloads based upon
the Tektronix benchmark. For all four workloads, stochastic
clustering gave the best performance for a variety of performance
metrics. Since stochastic clustering is computationally expensive, it
is interesting that for every workload there was at least one cheaper
clustering algorithm that matched or almost matched stochastic
clustering. Unfortunately, for each workload, the algorithm that
approximated stochastic clustering was different. Our experiments
also demonstrated that even when the workload and object graph are fixed,
the choice of the clustering algorithm depends upon the goals of the
system. For example, if the goal is to perform well on traversals of
small portions of the database starting with a cold cache, the
important metric is the per-traversal expansion factor, and a well-chosen
placement tree will be nearly optimal; if the goal is to achieve a
high steady-state performance with a reasonably large cache, the
appropriate metric is the number of pages to which the clustering
algorithm maps the active portion of the database. For
this metric, the PRP clustering algorithm, which only uses access