%T Instruction Level Characterization of the Cray Y-MP Processor
%D May 1992
%R TR 1086
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X Evolutionary computer architecture fundamentally relies on information
obtained from empirical characterizations of the nature of programs and of
dynamic program usage of machine features. While vector architectures have
dominated the supercomputer arena for over two decades and promise to continue
to provide superior single processor performance on a large class of scientific and engineering applications, detailed empirical characterizations of these
architectures and their workloads have not been reported to date in the
literature. This dissertation fills this void in the empirical understanding
of machines by reporting an instruction-level study of a single processor
of the CRAY Y-MP, using as benchmarks the scientific and engineering
application programs that comprise the PERFECT Club benchmark suite.
The capability of the compiler is key to harnessing the power of a
machine today. Hence we study a version of the benchmarks that is
automatically optimized and vectorized by the Cray Research, Inc. production
FORTRAN compiler. Furthermore, several optimizations that are easily
implementable manually provide significant performance improvements over
the best efforts of the compiler today. Therefore we also study a version
of the benchmarks, hand-optimized by a team of Cray Research, Inc. programmers, that won the 1990 Gordon Bell PERFECT award for the fastest version of
the PERFECT Club benchmarks on any machine. In both cases, we study only
the user routines of the benchmarks.
We observe that the vectorization level of the user routines of the
programs varies widely for the compiler-optimized version of the programs,
as opposed to being uniformly high. While hand optimizations do improve
the vectorization level, several benchmarks still have vectorization
levels below 80%. Consequently, the performance of the non-vector
features of vector machines is important to program performance.
We observe that the scalar code in the programs contains several address
calculation operations and Cray-specific miscellaneous operations, in
addition to the floating-point operations. Hence adequate attention needs
to be paid to these instruction types. Towards exploiting the data
dependencies in the scalar code for faster execution, we characterize
the dependencies within the scalar basic blocks of the programs.
We observe that the utilization of the pipeline parallelism of
the functional units by scalar code is low. Thus, any latency
improvements that result from lower levels of functional unit
pipelining will enhance scalar performance. For the overall programs,
we observe that large basic blocks are significant in number and are
important to program performance. Thus compiler optimization techniques
geared towards large basic blocks are desirable. The level of vectorization
of memory operations is usually very high, thus emphasizing the need
for memory bandwidth. We observe that for the CRAY Y-MP hardware and
for the techniques currently employed by the Cray Research compiler, the
peak instruction issue rate of the CRAY Y-MP is quite adequate, especially
because of the presence of vector instructions. In addition to the
issues mentioned above, several related topics are discussed and explored
in the dissertation.
%A G. Ramalingam
%A Thomas Reps
%T An Incremental Algorithm for a Generalization of the Shortest-Path
Problem
%D May 1992
%R TR 1087
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X The grammar problem, a generalization of the single-source
shortest-path problem introduced by Knuth, is to compute the
minimum-cost derivation of a terminal string from one or
more non-terminals of a given context-free grammar, with the
cost of a derivation being suitably defined. In this paper
we present an incremental algorithm for a version of the
grammar problem. As a special case of this algorithm we
obtain an efficient incremental algorithm for the single-
source shortest-path problem with positive edge lengths.
The aspect of our incremental algorithm that distinguishes
it from all other work on the dynamic shortest-path problem
is its ability to handle "multiple heterogeneous modifica-
tions": between updates, the input graph is allowed to be
restructured by an arbitrary mixture of edge insertions,
edge deletions, and edge-length changes.
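As an illustration (a sketch of ours, with made-up names, not code from
the paper), the batch version of the grammar problem can be solved with
Knuth's Dijkstra-like algorithm: each production carries a monotone
"superior" cost function of the costs of its right-hand-side
nonterminals, and single-source shortest paths with positive lengths is
the special case in which every rule for v has the form v <- w with cost
dist(w) + length(w, v). The paper's algorithm maintains such a solution
incrementally under the mixed modifications described above.

    import heapq

    def knuth_grammar_problem(productions):
        # productions maps each nonterminal A to a list of rules
        # (cost_fn, rhs); rhs is a list of nonterminals and cost_fn maps
        # their final costs to the cost of deriving A by that rule.
        # Each cost_fn is assumed monotone and "superior" (its value is
        # at least each argument), which lets the smallest open
        # candidate be finalized, exactly as in Dijkstra's algorithm.
        best, heap = {}, []

        def push_ready_candidates():
            # Push a candidate cost for every rule whose right-hand
            # side consists only of already-finalized nonterminals.
            for a, rules in productions.items():
                if a in best:
                    continue
                for cost_fn, rhs in rules:
                    if all(b in best for b in rhs):
                        cost = cost_fn(*(best[b] for b in rhs))
                        heapq.heappush(heap, (cost, a))

        push_ready_candidates()
        while heap:
            cost, a = heapq.heappop(heap)
            if a in best:
                continue              # stale candidate
            best[a] = cost            # smallest candidate is final
            push_ready_candidates()
        return best

    # Shortest paths from 's' as the special case:
    # one zero-cost rule for the source, one rule per edge.
    edges = {('s', 'a'): 2, ('s', 'b'): 5, ('a', 'b'): 1}
    productions = {'s': [(lambda: 0, [])]}
    for (w, v), length in edges.items():
        productions.setdefault(v, []).append(
            (lambda dw, l=length: dw + l, [w]))
    print(knuth_grammar_problem(productions))  # {'s': 0, 'a': 2, 'b': 3}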
%A Men-Chow Chiang
%T Memory System Design for Bus Based Multiprocessors
%D May 1992
%R TR 1088
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X This dissertation studies the design of single bus, shared memory
multiprocessors.
The purpose of these studies is to find optimum points in the design
space for the memory system components: private caches, the shared
bus, and main memory.
Two different methodologies are used based on the operating environment of
a multiprocessor. For a multiprocessor operating in the throughput-oriented
environment, Customized Mean Value Analysis (CMVA) models are developed to
evaluate the performance of the multiprocessor. The accuracy of the models
is validated by comparing their results to those generated by actual
trace-driven simulation over several thousand multiprocessor configurations.
The comparison results show that the CMVA models can be as accurate as
trace-driven simulation in predicting the multiprocessor throughput and
bus utilization. The validated models are then used to evaluate design
choices that include cache size, cache block size, cache set-associativity,
bus switching method, bus width, and main memory latency.
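For orientation only (textbook exact Mean Value Analysis for a closed
queueing network, not the customized CMVA models of the dissertation;
the service demands and population below are made-up numbers), the
basic MVA recurrence iterates over the customer population, treating
processors as customers and the bus and memory as queueing centers:

    def exact_mva(demands, n_customers):
        # demands[k]: total service demand at center k (e.g. the bus or
        # a memory module) per customer cycle through the system.
        # Returns system throughput and mean queue lengths at the
        # requested population, using the standard exact MVA recurrence.
        q = [0.0] * len(demands)       # mean queue length at each center
        x = 0.0
        for n in range(1, n_customers + 1):
            # residence time: service demand inflated by queueing behind
            # the customers already present (arrival theorem)
            r = [d * (1.0 + qk) for d, qk in zip(demands, q)]
            x = n / sum(r)             # system throughput at population n
            q = [x * rk for rk in r]   # Little's law applied per center
        return x, q

    # e.g. 8 processors, bus demand 0.4 and memory demand 0.6 time units
    # per memory request; throughput is bounded by 1 / max(demands).
    throughput, queues = exact_mva([0.4, 0.6], 8)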
For a multiprocessor operating in the speedup-oriented
environment, a multiprocessor simulator is built to conduct
an execution-driven simulation of several parallel benchmarks.
Performance statistics are collected separately for different
computational phases during the execution of a parallel program,
and are analyzed to characterize the behavior of the multiprocessor
in each computational phase and the overhead of parallel processing.
The results show that, with loop-level parallelization, performance
bottlenecks can occur in the scheduling phase (if dynamic scheduling
is used to schedule loop iterations to processors) as well as the
independent computing phase. The cache miss ratio of shared data in
the independent computing phase is also found to increase with the
number of processors, meaning that even for this phase the speedup
will eventually decrease, rather than leveling off, once it has
reached its maximum.
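The dynamic scheduling referred to above can be pictured with a small
Python sketch (ours, not the thesis's simulator; names and the chunk
size are arbitrary): workers claim chunks of loop iterations from a
shared counter, and contention on that counter is the scheduling-phase
cost, since in hardware the counter would be updated with a lock or an
atomic fetch&add.

    import threading

    def run_self_scheduled_loop(n_iters, n_workers, body, chunk=1):
        # Dynamic (self-scheduling) loop execution: each worker grabs
        # the next chunk of iterations from a shared counter, then runs
        # the loop body for that chunk independently.
        next_iter = 0
        counter_lock = threading.Lock()   # stands in for fetch&add

        def worker():
            nonlocal next_iter
            while True:
                with counter_lock:        # scheduling phase: claim a chunk
                    start = next_iter
                    next_iter += chunk
                if start >= n_iters:
                    return
                for i in range(start, min(start + chunk, n_iters)):
                    body(i)               # independent computing phase

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    results = [0] * 100
    run_self_scheduled_loop(100, 4, lambda i: results.__setitem__(i, i * i),
                            chunk=8)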
The barrier operation in a bus-based multiprocessor is found not to
be a serious concern. With an efficient atomic decrement instruction
available, the total time for all processors to decrement the barrier
counter can be a very small portion of overall execution time. Both the
barrier and serial times can become less significant, i.e., become a smaller
portion of total execution time of a parallel program, when the problem
size is increased.
The impact of the synchronization method on the execution time of the
scheduling phase is also considered in some detail. The performance
of several alternatives to the base test&test&set implementation, such
as a software queue, a lock table (proposed in this thesis),
and a restricted fetch&add operation with combining, is evaluated.
The simulation results show that both the lock table and restricted
combining methods can improve the scheduling rate significantly.
%A Michael J. Franklin
%A Michael J. Carey
%T Client-Server Caching Revisited
%D May 1992
%R TR 1089
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X Most commercial and experimental Object-Oriented Database Management Systems (OODBMSs)
are implemented using a client-server software architecture. An effective
technique for improving the performance of a client-server database system
is to allow client workstations to cache data and/or locks across transaction
boundaries. This paper extends an earlier study on the performance of caching
in client-server database systems [Care91a] in three ways. The first is a
re-examination of heuristics for algorithms that decide dynamically between
propagating changes and invalidating remote copies of data pages in order to
maintain cache consistency. The second extension is a performance study of
a class of caching algorithms known as "callback locking" algorithms. These
algorithms are of interest because they provide an alternative to the
optimistic techniques used in the earlier study and because they have recently
begun to find use in commercial systems. The third way that this paper extends
the previous study is to examine the performance of alternative caching
algorithms in light of current trends in processor and network speeds and to
more closely examine their robustness in the presence of data contention.
%A Manolis M. Tsangaris
%A Jeffrey F. Naughton
%T On the Performance of Object Clustering Techniques
%D June 1992
%R TR 1090
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X In this paper we investigate the performance of some of the best-known
object clustering algorithms on four different workloads based upon
the Tektronix benchmark. For all four workloads, stochastic
clustering gave the best performance for a variety of performance
metrics. Since stochastic clustering is computationally expensive, it
is interesting that for every workload there was at least one cheaper
clustering algorithm that matched or almost matched stochastic
clustering. Unfortunately, for each workload, the algorithm that
approximated stochastic clustering was different. Our experiments
also demonstrated that even when the workload and object graph are fixed,
the choice of the clustering algorithm depends upon the goals of the
system. For example, if the goal is to perform well on traversals of
small portions of the database starting with a cold cache, the
important metric is the per-traversal expansion factor, and a well-chosen
placement tree will be nearly optimal; if the goal is to achieve a
high steady-state performance with a reasonably large cache, the
appropriate metric is the number of pages to which the clustering
algorithm maps the active portion of the database. For
this metric, the PRP clustering algorithm, which only uses access