- <head>
- <title="...forever...">
- <font=monaco10.fnt>
- <font=newy36.fnt>
- <font=time24.fnt>
- <image=back.raw w=256 h=256 t=-1>
- <buf=3590>
- <bgcolor=-1>
- <background=0>
- <link_color=000>
- <module=console.mod>
- <pal=back.pal>
- colors:
- 251 - black
- </head>
- <body>
- <frame x=0 y=0 w=640 h=3590 b=-1 c=-1>
-
-
- --- - -- - --------------------------------------------------------------------
- <f1><c000> 68030 <f0>
- <f1><c000> ST RAM and things around it <f0>
- ---------------------------------------------------------------------- ---- - -
-
- I tell you the truth - I'm always reading about how slow ST RAM is and how
- unbeatable Fast RAM is, but I never knew WHY. There is only one Falcon timing
- document - the one published by Rodolphe Czuba with the CT2. But in my eyes
- this doc is so short/unclear and SHITTILY translated that there's no way for
- non-hw freaks like me to get the idea. But one nice day I told myself: "It
- can't be so hard... other ppl understood that, I can too" :-) So I downloaded
- both the UK and FR versions of the mentioned doc and started to read...
-
- OK, after some weeks :) I got that damned idea... I have to say I haven't got
- an oscilloscope or any other device to verify my results, but there shouldn't
- be big differences. From time to time I use some numbers from Rodolphe.
- So, let's start!
-
- Let's talk about timing tables first:
-
- We want to know how many cycles the move.l (an)+,d0 instruction takes. Take a
- look into the official timing tables by Motorola (also rewritten by
- Zyx/Inferiors, JHL/TRSI, aRt/TBL and others):
-
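-                  Head  Tail  I-Cache Case  No-Cache Case
-   MOVE EA,Dn      0     0     2(0/0/0)      2(0/1/0)
-
- and for EA calculation:
-
-   (An)+           0     1     3(1/0/0)      3(1/0/0)
-   ------------------------------------------------------
-                               5(1/0/0)      5(1/1/0)
-
- (the numbers in brackets are bus-cycles: reads/i-prefetches/writes)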
-
- From this example you can see:
- - no overlapping will occur
- - move (an)+,dn takes 5 clock periods in total (cache setting doesn't matter)
- - data reading takes 1 bus-cycle
- - i-prefetching takes 0 (if it's in i-cache) or 1 (if it isn't) bus-cycle
- - data writing takes 0 bus-cycles
-
- According to Motorola, this table assumes:
- - 32 bit data bus
- - 1 bus-cycle = 2 clock (cpu) periods
- - 2 i-prefetches per 1 bus cycle
- - long aligned operands incl. system stack
- - no wait states
- - data cache disabled
-
- Instruction prefetch means "loading" the instruction into the CPU, including
- any additional <ea> extension words. In the simplest case, one instruction
- prefetch = one word prefetch (since the basic 68k instruction is 16 bits). In
- the case of a movem instruction, for example, the CPU has to prefetch one long
- = 2 word prefetches. But it still takes the same time, because with a 32 bit
- data bus it isn't important whether we transfer a word or a long. Conclusion:
- it doesn't matter if we have to prefetch one or two words for a complete
- instruction, it takes the same time.
-
- bus activity: 1*2 (data read) + 0*2 (prefetching) = 2 in cache-case or
- 1*2 (data read) + 1*2 (prefetching) = 4 in non-cache-case
-
- So, number of internal cycles for execution:
- all cycles - bus activity = 5 - 2 = 3 for cache-case or
- 5 - 4 = 1 for non-cache-case
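-
- The same subtraction as a small Python sketch (Python is used only as a
- calculator here; the function name and its parameters are my own invention,
- not anything from Motorola). It takes the total from the table plus the
- reads/prefetches/writes counts and returns what is left for internal work:
-
```python
def internal_cycles(total_clocks, reads, prefetches, writes, clocks_per_bus_cycle=2):
    """Clock periods left for internal execution once bus activity is subtracted."""
    bus_clocks = (reads + prefetches + writes) * clocks_per_bus_cycle
    return total_clocks - bus_clocks


# move.l (an)+,dn from the summed table rows: 5(1/0/0) and 5(1/1/0)
print(internal_cycles(5, 1, 0, 0))  # cache case
print(internal_cycles(5, 1, 1, 0))  # no-cache case
```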
-
- Please note there's no fixed relationship between the internal cycles of the
- cache and non-cache cases. Sometimes executing/decoding is faster with cache,
- sometimes without. Of course we're talking only about internal cycles and very
- fast (2 clock periods per 32 bits) RAM access; with slow memory you will see
- the difference. Also keep in mind the values for i-prefetching are AVERAGE ones
- for odd/even word -> the real bus activity is always less than or equal to the
- calculated one. In practice: let's take our movem example. In all cases it is
- min. 2 words long, right? But in calculations it's always counted as 2
- i-prefetches -> i-prefetching takes 2*2 = 4 clock periods while the real bus
- activity is only 1*2 clock periods (in the case of long aligned operands and a
- 32 bit data bus, of course!!!). Also don't be surprised if you sometimes get a
- result of 0 internal cycles.
-
- But hey, that's our dream machine and not Falcon030! =) Reference for us should
- be:
-
- - 16 bit data bus
- - 1 WORD bus-cycle = 4 clock periods (according to Czuba's doc)
- - refreshing every 15.6 us [wait states] (again Czuba's doc)
-
- This implies that the table above is wrong for us, at least the overall number
- of cycles. Some changes are needed:
-
- - if you have a 32 bit data bus, it doesn't matter if you read a (long aligned)
- long or a word - it still takes the same time. In our case it DOES matter
- since 1 bus read = 1 WORD read, NOT LONG
- - as you can see, a bus-cycle is 2 times longer (4 instead of 2 clock periods)
-
- So correct timing should look like this:
-
- move.w (an)+,dn 0 1 7(1/0/0) 9(1/1/0)
- move.l (an)+,dn 0 1 11(2/0/0) 13(2/1/0)
-
- How did I get these numbers?
-
- bus activity for move.w:
- 1*4 (data read) + 0*4 (prefetching) = 4 in cache-case or
- 1*4 (data read) + 1*4 (prefetching) = 8 in non-cache-case
-
- bus activity for move.l:
- 2*4 (data read) + 0*4 (prefetching) = 8 in cache-case or
- 2*4 (data read) + 1*4 (prefetching) = 12 in non-cache-case
-
- And from example above we know this move takes 3 (cache) / 1 (non-cache)
- internal cycles: 4+3=7 / 8+1=9 (move.w) and 8+3=11 / 12+1=13 (move.l)
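-
- A minimal sketch of the same recipe in Python, assuming (as above) that the
- internal cycles found in the 32 bit table carry over unchanged to the Falcon
- and that every word on the bus costs 4 clock periods. The names are made up
- for this example:
-
```python
FALCON_BUS_CLOCKS = 4  # one 16 bit bus-cycle on the Falcon (Czuba's figure)

# Internal cycles taken from the 32 bit table: 5 - 2 and 5 - 4
INTERNAL_CACHE, INTERNAL_NOCACHE = 3, 1


def falcon_clocks(words_read, words_prefetched, internal):
    """Bus activity on a 16 bit bus plus the internal cycles derived earlier."""
    return (words_read + words_prefetched) * FALCON_BUS_CLOCKS + internal


for name, words in (("move.w", 1), ("move.l", 2)):
    cache = falcon_clocks(words, 0, INTERNAL_CACHE)      # instruction already cached
    nocache = falcon_clocks(words, 1, INTERNAL_NOCACHE)  # one word prefetch on the bus
    print(name, cache, nocache)
```
-
- Feeding it other (r/p/w) counts from the Motorola tables should work the same
- way, as long as you recompute the internal cycles for that instruction first.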
-
- Here we see we needn't care about 1-word vs 2-word instructions anymore since
- our bus can transfer only words :) That means if an instruction takes one word,
- it needs 1*4 clock periods, if it takes two words, it needs 2*4 clock periods
- etc.
-
- We're more than 2 times slower than an 'original' 68030 in a long transfer!
- I'll stop talking about timing tables here. It's more complicated than you
- might expect, especially the pipelining & instruction/data cache stuff, which
- is a topic for a separate article. (mail me and I'll write about it)
-
- So, back to ST-RAM:
-
- The Falcon is equipped with a Motorola 68030@16 MHz and a 16 bit data bus. That
- means the CPU can access only 1 word of RAM per bus-cycle. One bus-cycle is 4
- clock periods long -> the CPU can read/write one word every 4 clock periods (in
- the following text I assume 16 MHz clock period = 1 cycle). Luckily, the CPU
- isn't totally halted during this time - as long as it doesn't access any
- external source (ST/TT RAM, hardware regs, ...) it can execute other
- instructions which work inside the CPU (this should allow use of the data
- cache, too, but I didn't get very good results..) By the way, instructions...
- the fastest instruction on the 030 takes 2 cycles, but you can't fit two of
- them into one bus write since you have got less than 4 cycles, I don't know
- exactly why.. Or, you can write one long to ST-RAM (that means a pause of <8
- cycles) and then you have time for three 2-cycle instructions.. not very much,
- I know =)
-
- So, if 1 cycle is 1/16000000 s = 62.5 * 10^-9 s = 62.5 ns and the CPU needs 4
- cycles for a RAM access, a complete read/write is 4 * 62.5 = 250 ns long.
- Remember our SIMMs are usually 80 ns parts, which means the RAM could be
- accessed 3.125 times faster than the CPU does it! And this is only the
- beginning of the time wasting :)
-
- 1 word = 2 bytes -> we can transfer 2 bytes per 4 cycles -> 1/2 byte[4 bits]
- per cycle. If one cycle is 62.5 ns:
-
- 0.5/62.5ns = 8 000 000 bytes per second. Fiction.
-
- Let's take a look at the move.l (an)+,dn again. We calculated in worst case (no
- instruction cache) it takes 13 cycles, right? What is the max reading speed?
-
- 16000000/13 longs/s = 16000000*4/13 bytes/s = 4 923 076 bytes/s. Shame!
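-
- The numbers of this part in one place, as a small Python sketch. It assumes
- nothing more than the 16 MHz clock and the cycle counts used above; the
- helper names are invented for the example:
-
```python
CPU_HZ = 16_000_000  # Falcon030 CPU clock


def clock_period_ns(hz=CPU_HZ):
    """Length of one clock period in nanoseconds."""
    return 1e9 / hz


def bytes_per_second(bytes_moved, clocks, hz=CPU_HZ):
    """Whole bytes per second if 'bytes_moved' bytes are moved every 'clocks' cycles."""
    return hz * bytes_moved // clocks


print(clock_period_ns())        # one cycle in ns
print(4 * clock_period_ns())    # one word bus-cycle in ns
print(bytes_per_second(2, 4))   # bus doing nothing but word transfers
print(bytes_per_second(4, 13))  # move.l (an)+,dn without instruction cache
```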
-
- Do you think it can't be worse? Okie, let's continue...
-
- Here is another bus-cycle-stealer: DMA. DMA means Direct Memory Access. That
- 'direct' means that if a DMA device wants something from RAM, the CPU gets
- kicked off the bus and has to wait until the DMA device has finished. In the
- Falcon we have 3 DMA devices: Videl, Blitter and Sound DMA. I'm not going to
- talk about the Blitter or SDMA since these chips aren't necessary for EVERY
- application, but grafix we need from time to time :-)
-
- Eeeehhh... what example to give? I think 320x240xTC could be interesting,
- couldn't it?
-
- 320 x 240 x 2 bytes = 153600 bytes = 38400 longs. Videl accesses video data in
- so-called BURST mode, that means 17 longs per one Videl (!) access. This BURST
- mode looks like:
-
- 1st long = 2 (RAS cycle = init) + 1 (CAS cycle = data) = 3 cycles
- 2nd ~ 17th long = 16 (CAS cycles) = 16 cycles
- ---------
- 19 cycles
- (thanks to Rodolphe for this explanation)
-
- So one BURST access = 19 cycles.
- 38400/17 = 2259 BURST accesses (rounded up) and 2259 * 19 = 42921 cycles per
- screen.
-
- That gives us 42921 * 60 = 2 575 260 cycles per second for Videl (!!!). 60
- stands for 60 Hz of course; use 50 for 50 Hz, 100 for 100 Hz etc.
-
- DMA stealing leaves us: 16000000 - 2575260 = 13 424 740 cycles/s !!!
-
- So, what about our move.l ?
-
- 13424740/13 longs/s = 13424740*4/13 bytes/s = 4 130 689 bytes per second...
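-
- If you want to try other resolutions or refresh rates, here is a rough Python
- sketch of the calculation above. It assumes the 17 longs / 19 cycles burst
- figures and nothing else stealing the bus; the function names are mine:
-
```python
CPU_HZ = 16_000_000
LONGS_PER_BURST = 17
CLOCKS_PER_BURST = 19  # 2 RAS + 1 CAS for the first long, 16 CAS for the rest


def videl_clocks_per_second(width, height, bytes_per_pixel, refresh_hz):
    """Cycles per second Videl takes from the bus for one video mode."""
    longs = width * height * bytes_per_pixel // 4
    bursts = (longs + LONGS_PER_BURST - 1) // LONGS_PER_BURST  # round up
    return bursts * CLOCKS_PER_BURST * refresh_hz


def cpu_bytes_per_second(stolen_clocks, clocks_per_long):
    """Long transfer rate with the cycles that are left for the CPU."""
    return (CPU_HZ - stolen_clocks) * 4 // clocks_per_long


stolen = videl_clocks_per_second(320, 240, 2, 60)  # 320x240xTC at 60 Hz
print(stolen, CPU_HZ - stolen, cpu_bytes_per_second(stolen, 13))
```
-
- Changing the last argument of videl_clocks_per_second() to 50 or 100 gives
- the figures for the other refresh rates.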
-
- Do you still think the Falcon is fantastically designed? =) Ok...
-
- This example assumes all your RAM reading is sequential, that means you're
- reading address n+$0, n+$2, n+$4, n+$6 etc. Maybe you think: "so what? I
- have all my data together and my program doesn't jump all the time". Hahah my
- little lady! =) What does your RAM access look like?
-
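-         lea     something,a0    ; a0 -> data area (NOT where the code lives)
-         move.l  d0,(a0)+        ; write one long, a0 moves to the next long
-         move.l  d3,(a0)+        ; and another one right behind it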
- :
- :
- for example. But remember the CPU fetches an instruction from your code-space
- (text-segment), then writes some long data to ANOTHER area, then fetches the
- next instruction, ... etc. And we still assume all your code and data is
- word/long aligned! (for more info about aligning see below) For
- non-sequential access you have to add 2 cycles of precharge time (again
- Czuba's number) to the instruction time. In our case:
-
- 13424740/(13+2) longs/s = 13424740*4/(13+2) bytes/s = 3 579 930 bytes/s...
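-
- To see what the precharge costs, a tiny Python sketch. It assumes, like the
- calculation above, exactly one 2-cycle precharge per instruction and the
- 13 424 740 cycles/s left over after Videl; the names are hypothetical:
-
```python
FREE_CLOCKS = 13_424_740  # 16 MHz minus Videl at 320x240xTC, 60 Hz
PRECHARGE = 2             # extra cycles per non-sequential access (Czuba's number)


def long_rate(clocks_per_long, precharge=0, free_clocks=FREE_CLOCKS):
    """Whole bytes per second when every long costs clocks_per_long + precharge."""
    return free_clocks * 4 // (clocks_per_long + precharge)


sequential = long_rate(13)
scattered = long_rate(13, PRECHARGE)
print(sequential, scattered)
print(f"precharge costs {100 * (sequential - scattered) / sequential:.1f}%")
```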
-
- Our 256 byte cache can solve these troubles with precharging. If you put your
- loop into the instruction cache, the data bus will be used only for
- reading/writing data. If you're a really, really lucky boy, in the case of
- reading you can have both the program and the data in the instr/data cache and
- no bus transfer will be done at all. In the case of writing, data is ALWAYS
- transferred to memory, since our cache works in so-called 'write-through'
- mode. The 040 and higher can work in 'copyback' (write-back) mode, where it
- isn't always necessary to access RAM (better internal logic)
-
- Now we've killed the last cycle in our Falcon :-) Ah, not the last - there are
- still those stupid wait states... but as Rodolphe wrote, they don't affect the
- calculations a lot. If you want some additional info on how to work them into
- the timing tables, I refer you to the 68030 User's Manual by Motorola.
-
- <link=art46b.scr>Go to NEXT PART</l>
- </frame>
- </body>
-
-