Xref: sparky comp.arch:10603 comp.lang.forth:3476
Path: sparky!uunet!news.tek.com!psgrain!charnel!rat!usc!elroy.jpl.nasa.gov!ames!sun-barr!male.EBay.Sun.COM!jethro.Corp.Sun.COM!exodus.Eng.Sun.COM!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch,comp.lang.forth
Subject: Re: What's RIGHT with stack machines
Date: 10 Nov 1992 22:39:42 GMT
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 68
Message-ID: <lg0eheINNs7l@exodus.Eng.Sun.COM>
References: <Bx5AIr.EAy.2@cs.cmu.edu> <1992Nov4.103008.2641@Informatik.TU-Muenchen.DE> <MIKE.92Nov9004026@guam.vlsivie.tuwien.ac.at> <id.D6UU.5Z@ferranti.com>
NNTP-Posting-Host: rbbb

>In article <MIKE.92Nov9004026@guam.vlsivie.tuwien.ac.at> mike@vlsivie.tuwien.ac.at (Michael Gschwind) writes:
>> Once again, with technology of 10 years ago, they were nice,
>> but it does pay to allocate registers and do scheduling, AND WE HAVE
>> THE TECHNOLOGY NOW to do it.

In article <id.D6UU.5Z@ferranti.com> peter@ferranti.com (peter da silva) writes:

>If you can afford to compile your code for each new processor, yes. Otherwise
>you have to assume that most code will use the scheduling that was best for
>the first generation of the chip. Outside of engineering-class workstations
>(a vanishingly small proportion of the total end-user micro market: PCs and
>game machines clobber it by orders of magnitude) this is the normal case,
>and in embedded systems (almost all the rest of the market) it's highly cost-
>effective to minimize code size: ROMS are slow and expensive.

>I predict that before too long all high performance commodity micros will do
>scheduling at runtime.

I don't think the situation is as clear-cut as you describe it.

There are certain scheduling techniques that tend to work well no
matter where you use them -- as long as you have enough registers, it
doesn't hurt to stick a few instructions between a load into a
register and the subsequent use of that register. On superscalar
machines, it is generally a bad idea to do too many of exactly the
same thing in a big lump (i.e., ld, ld, ld or fadd, fadd, fadd), and
if you have the option of mixing things up a bit, you should.
Increasing the size of basic blocks (through code replication,
typically) is another trick for helping most machines, since branches
often stall pipelines.

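The first of those heuristics (separate a load from its use with
independent work) can be sketched as a toy greedy scheduler. All of it
is invented for illustration -- the instruction tuples, the latency
value, and the pick-first-ready policy are assumptions, not any real
compiler's algorithm:

```python
def schedule(instrs, load_latency=2):
    """Greedy toy scheduler: instrs is a list of (op, dst, srcs).
    At each step, issue the first instruction whose sources are not
    still waiting on an in-flight load; otherwise stall on the first."""
    pending = list(instrs)
    in_flight = {}          # register -> cycles until its load completes
    out = []
    while pending:
        # one cycle passes: age the in-flight loads
        in_flight = {r: c - 1 for r, c in in_flight.items() if c - 1 > 0}
        pick = next((i for i in pending
                     if not any(s in in_flight for s in i[2])),
                    pending[0])
        pending.remove(pick)
        op, dst, srcs = pick
        if op == "ld":
            in_flight[dst] = load_latency
        out.append(pick)
    return out

prog = [("ld",  "r1", ["a"]),    # load r1
        ("add", "r2", ["r1"]),   # uses r1 immediately -- would stall
        ("mul", "r3", ["b"]),    # independent work
        ("sub", "r4", ["c"])]    # independent work

sched = schedule(prog)
# the independent mul is hoisted between the load and its use:
# ld, mul, add, sub
```

The point of the sketch is that nothing in it depends on a particular
chip: hiding load latency behind independent work is the kind of rule
that helps almost everywhere.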
These are general rules, and they won't yield optimum performance, but
you must trade them off against the costs of generating
implementation-specific code. Those costs include:

(1) less sharing of text and libraries,
(2) lots of cache flushing, and
(3) the scheduling and register allocation themselves, which are not
    necessarily cheap.

Also, good scheduling and pipelining will require more information in
the "binary" than is traditionally stored there. For starters, you'll
need dependence information. Debugging your compilers (and your buggy
applications) will be a real party in this sort of a world, because
different loop schedules (based on buggy dependence information) may or
may not exhibit the bug on different implementations of the same
architecture.

Other techniques that do appear to be highly implementation-dependent
(such as compiler-directed data prefetching) can instead be
parameterized by a per-loop constant (that is, the prefetch distance
is often loop-invariant, but the best prefetch distance varies from
processor to processor).

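That per-loop constant is just arithmetic: prefetch far enough ahead
that the data arrives before the loop body touches it. The cycle
counts below are made-up numbers for illustration, not measurements of
any real processor:

```python
import math

def prefetch_distance(mem_latency_cycles, loop_body_cycles):
    """Iterations ahead to issue a prefetch so the data has arrived
    by the time the loop body needs it."""
    return max(1, math.ceil(mem_latency_cycles / loop_body_cycles))

# The same 10-cycle loop body on two hypothetical implementations
# of the same architecture:
slow = prefetch_distance(30, 10)   # slower memory system: 3 ahead
fast = prefetch_distance(12, 10)   # faster memory system: 2 ahead
```

The loop structure (and hence the prefetch insertion point) is fixed
at compile time; only this one constant need change per processor.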
I'm not saying it isn't possible, but I think it will be fairly hard,
and I suspect that the wins will not be as large as you hope.
Obviously, people working at Sun worry about this, seeing as how
there are at least 3 different chips that we might want our code to
run on (SS2, SS10, SPARC Classic), and they all have somewhat
different scheduling characteristics.

On the other hand, if what you are optimizing is ROM usage,
high-performance commodity micros might just run little byte-code
interpreters. Of course, one trick to making your interpreted code
run faster is to compile little fragments, which is sort of a
generalization of scheduling at run-time.

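Both halves of that idea fit in a few lines: a stack-machine byte-code
interpreter for density, and a crude run-time compiler for the hot
fragments. The instruction set here is invented for illustration:

```python
def run(code):
    """Minimal byte-code interpreter for a hypothetical stack machine."""
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]; pc += 1
        if op == "push":
            stack.append(code[pc]); pc += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
    return stack[-1]

def compile_fragment(code):
    """Translate a straight-line fragment into one Python function,
    eliminating the per-opcode dispatch: a crude run-time compiler."""
    body, pc = ["def f():", "    stack = []"], 0
    while pc < len(code):
        op = code[pc]; pc += 1
        if op == "push":
            body.append(f"    stack.append({code[pc]!r})"); pc += 1
        elif op in ("add", "mul"):
            sym = "+" if op == "add" else "*"
            body.append("    b, a = stack.pop(), stack.pop()")
            body.append(f"    stack.append(a {sym} b)")
    body.append("    return stack[-1]")
    ns = {}
    exec("\n".join(body), ns)
    return ns["f"]

frag = ["push", 2, "push", 3, "add", "push", 4, "mul"]   # (2 + 3) * 4
assert run(frag) == compile_fragment(frag)() == 20
```

The compiled fragment computes the same answer with no decode loop at
all, which is exactly the trade the paragraph above describes: spend
compile effort at run time on the fragments that are hot, keep the
compact encoding for everything else.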
David Chase
Sun