Xref: sparky comp.arch:10603 comp.lang.forth:3476
Path: sparky!uunet!news.tek.com!psgrain!charnel!rat!usc!elroy.jpl.nasa.gov!ames!sun-barr!male.EBay.Sun.COM!jethro.Corp.Sun.COM!exodus.Eng.Sun.COM!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch,comp.lang.forth
Subject: Re: What's RIGHT with stack machines
Date: 10 Nov 1992 22:39:42 GMT
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 68
Message-ID: <lg0eheINNs7l@exodus.Eng.Sun.COM>
References: <Bx5AIr.EAy.2@cs.cmu.edu> <1992Nov4.103008.2641@Informatik.TU-Muenchen.DE> <MIKE.92Nov9004026@guam.vlsivie.tuwien.ac.at> <id.D6UU.5Z@ferranti.com>
NNTP-Posting-Host: rbbb

>In article <MIKE.92Nov9004026@guam.vlsivie.tuwien.ac.at> mike@vlsivie.tuwien.ac.at (Michael Gschwind) writes:
>> Once again, with technology of 10 years ago, they were nice,
>> but it does pay to allocate registers and do scheduling, AND WE HAVE
>> THE TECHNOLOGY NOW to do it.

In article <id.D6UU.5Z@ferranti.com> peter@ferranti.com (peter da silva) writes:

>If you can afford to compile your code for each new processor, yes. Otherwise
>you have to assume that most code will use the scheduling that was best for
>the first generation of the chip. Outside of engineering-class workstations
>(a vanishingly small proportion of the total end-user micro market: PCs and
>game machines clobber it by orders of magnitude) this is the normal case,
>and in embedded systems (almost all the rest of the market) it's highly cost-
>effective to minimize code size: ROMS are slow and expensive.

>I predict that before too long all high performance commodity micros will do
>scheduling at runtime.

I don't think the situation is as clear-cut as you describe it.

There are certain scheduling techniques that tend to work well no
matter where you use them -- as long as you have enough registers, it
doesn't hurt to stick a few instructions between a load into a
register and the subsequent use of that register. On superscalar
machines, it is generally a bad idea to do too many of exactly the
same thing in a big lump (i.e., ld, ld, ld or fadd, fadd, fadd), and
if you have the option of mixing things up a bit, you should.
Increasing the size of basic blocks (through code replication,
typically) is another trick for helping most machines, since branches
often stall pipelines.

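The first of those heuristics (separate a load from its use with
independent work) can be sketched as a toy greedy scheduler. All of it
is invented for illustration -- the instruction tuples, the latency
value, and the pick-first-ready policy are assumptions, not any real
compiler's algorithm:

```python
def schedule(instrs, load_latency=2):
    """Greedy toy scheduler: instrs is a list of (op, dst, srcs).
    At each step, issue the first instruction whose sources are not
    still waiting on an in-flight load; otherwise stall on the first."""
    pending = list(instrs)
    in_flight = {}          # register -> cycles until its load completes
    out = []
    while pending:
        # one cycle passes: age the in-flight loads
        in_flight = {r: c - 1 for r, c in in_flight.items() if c - 1 > 0}
        pick = next((i for i in pending
                     if not any(s in in_flight for s in i[2])),
                    pending[0])
        pending.remove(pick)
        op, dst, srcs = pick
        if op == "ld":
            in_flight[dst] = load_latency
        out.append(pick)
    return out

prog = [("ld",  "r1", ["a"]),    # load r1
        ("add", "r2", ["r1"]),   # uses r1 immediately -- would stall
        ("mul", "r3", ["b"]),    # independent work
        ("sub", "r4", ["c"])]    # independent work

sched = schedule(prog)
# the independent mul is hoisted between the load and its use:
# ld, mul, add, sub
```

The point of the sketch is that nothing in it depends on a particular
chip: hiding load latency behind independent work is the kind of rule
that helps almost everywhere.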
These are general rules, and they won't yield optimum performance, but
you must trade them off against the costs of generating
implementation-specific code. Those costs include:

(1) less sharing of text and libraries,
(2) lots of cache flushing, and
(3) the scheduling and register allocation themselves, which are not
    necessarily cheap.

Also, good scheduling and pipelining will require more information in
the "binary" than is traditionally stored there. For starters, you'll
need dependence information. Debugging your compilers (and your buggy
applications) will be a real party in this sort of a world, because
different loop schedules (based on buggy dependence information) may or
may not exhibit the bug on different implementations of the same
architecture.

Other techniques that do appear to be highly implementation-dependent
(such as compiler-directed data prefetching) can instead be
parameterized by a per-loop constant (that is, the prefetch distance
is often loop-invariant, but the best prefetch distance varies from
processor to processor).

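That per-loop constant is just arithmetic: prefetch far enough ahead
that the data arrives before the loop body touches it. The cycle
counts below are made-up numbers for illustration, not measurements of
any real processor:

```python
import math

def prefetch_distance(mem_latency_cycles, loop_body_cycles):
    """Iterations ahead to issue a prefetch so the data has arrived
    by the time the loop body needs it."""
    return max(1, math.ceil(mem_latency_cycles / loop_body_cycles))

# The same 10-cycle loop body on two hypothetical implementations
# of the same architecture:
slow = prefetch_distance(30, 10)   # slower memory system: 3 ahead
fast = prefetch_distance(12, 10)   # faster memory system: 2 ahead
```

The loop structure (and hence the prefetch insertion point) is fixed
at compile time; only this one constant need change per processor.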
I'm not saying it isn't possible, but I think it will be fairly hard,
and I suspect that the wins will not be as large as you hope.
Obviously, people working at Sun worry about this, seeing as how
there are at least 3 different chips that we might want our code to
run on (SS2, SS10, SPARC Classic), and they all have somewhat
different scheduling characteristics.

On the other hand, if what you are optimizing is ROM usage,
high-performance commodity micros might just run little byte-code
interpreters. Of course, one trick to making your interpreted code
run faster is to compile little fragments, which is sort of a
generalization of scheduling at run-time.

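Both halves of that idea fit in a few lines: a stack-machine byte-code
interpreter for density, and a crude run-time compiler for the hot
fragments. The instruction set here is invented for illustration:

```python
def run(code):
    """Minimal byte-code interpreter for a hypothetical stack machine."""
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]; pc += 1
        if op == "push":
            stack.append(code[pc]); pc += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
    return stack[-1]

def compile_fragment(code):
    """Translate a straight-line fragment into one Python function,
    eliminating the per-opcode dispatch: a crude run-time compiler."""
    body, pc = ["def f():", "    stack = []"], 0
    while pc < len(code):
        op = code[pc]; pc += 1
        if op == "push":
            body.append(f"    stack.append({code[pc]!r})"); pc += 1
        elif op in ("add", "mul"):
            sym = "+" if op == "add" else "*"
            body.append("    b, a = stack.pop(), stack.pop()")
            body.append(f"    stack.append(a {sym} b)")
    body.append("    return stack[-1]")
    ns = {}
    exec("\n".join(body), ns)
    return ns["f"]

frag = ["push", 2, "push", 3, "add", "push", 4, "mul"]   # (2 + 3) * 4
assert run(frag) == compile_fragment(frag)() == 20
```

The compiled fragment computes the same answer with no decode loop at
all, which is exactly the trade the paragraph above describes: spend
compile effort at run time on the fragments that are hot, keep the
compact encoding for everything else.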
David Chase
Sun