No Fragments Archive 10: Diskmags

Text File | 1989-06-08 | 10.9 KB | 241 lines

<head>
<title="...forever...">
<font=monaco10.fnt>
<font=newy36.fnt>
<font=time24.fnt>
<image=back.raw w=256 h=256 t=-1>
<buf=3590>
<bgcolor=-1>
<background=0>
<link_color=000>
<module=console.mod>
<pal=back.pal>
colors:
251 - black
</head>
<body>
<frame x=0 y=0 w=640 h=3590 b=-1 c=-1>
--- - -- - --------------------------------------------------------------------
<f1><c000> 68030 <f0>
<f1><c000> ST RAM and things around it <f0>
---------------------------------------------------------------------- ---- - -

I'll tell you the truth - I keep reading about how slow ST RAM is and how Fast RAM is something unbeatable etc., but I never knew WHY. There is only one Falcon timing document - the one published by Rodolphe Czuba with the CT2. But in my eyes this doc is so short/unclear and SHITTILY translated that there's no way for non-hw freaks like me to get the idea. But one nice day I told myself: "It can't be that hard... other ppl understood it, so I can too" :-)

So I downloaded both the uk and fr versions of the mentioned doc and started to read... OK, after some weeks :) I finally got that damned idea... I have to say I haven't got an oscilloscope or any other device to verify my results, but there shouldn't be big differences. From time to time I use some numbers by Rodolphe. So, let's start!

Let's talk about timing tables first: we want to know how many cycles the move.l (an)+,d0 instruction takes.
If you take a look into the official timing tables by Motorola (rewritten by Zyx/Inferiors, JHL/TRSI, aRt/TBL and others), you'll find:

                Head  Tail  I-Cache Case  No-Cache Case
MOVE EA,Dn        0     0     2(0/0/0)      2(0/1/0)

and for EA calculation:

(An)+             0     1     3(1/0/0)      3(1/0/0)
                ----------------------------------------
                              5(1/0/0)      5(1/1/0)

The number in front of the brackets is the total in clock periods, the numbers in brackets are bus-cycles (reads/prefetches/writes).

From this example you can see:
- no overlapping will occur
- move (an)+,dn takes 5 clock periods in total (cache setting doesn't matter)
- data reading takes 1 bus-cycle
- i-prefetching takes 0 (if it's in the i-cache) or 1 (if it isn't) bus-cycles
- data writing takes 0 bus-cycles

According to Motorola, this table assumes:
- 32 bit data bus
- 1 bus-cycle = 2 clock (cpu) periods
- 2 i-prefetches per 1 bus cycle
- long aligned operands incl. system stack
- no wait states
- data cache disabled

Instruction prefetch means "loading" the instruction into the CPU, including additional <ea> extensions. In the simplest case, one instruction prefetch = one word prefetch (since the basic instruction on 68k is 16 bit). In the case of the movem instruction, for example, the CPU has to prefetch one long = 2 word prefetches. But as we see, it still takes the same time (with a 32 bit data bus it isn't important whether we transfer a word or a long).

Conclusion: it doesn't matter if we have to prefetch one or two words for a complete instruction, it takes the same time.

bus activity:
1*2 (data read) + 0*2 (prefetching) = 2 in cache-case
or
1*2 (data read) + 1*2 (prefetching) = 4 in non-cache-case

So, number of internal cycles for execution:
all cycles - bus activity = 5 - 2 = 3 for cache-case
or
5 - 4 = 1 for non-cache-case

Please note there's no dependency between the internal cycles of the cache and non-cache cases. Sometimes executing/decoding is faster with cache, sometimes without. Of course we're talking only about internal cycles and very fast (2 clock periods per 32 bit) RAM access; with slow memory you will see the difference.
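The subtraction above can be rechecked with a few lines of Python. This is only a sketch of the article's arithmetic, not anything from Motorola: the function name `internal_cycles` and its parameters are made up for illustration, and it assumes the "dream machine" conditions listed above (1 bus-cycle = 2 clock periods).

```python
# Sketch of the "all cycles - bus activity" calculation from the text.
# Names are hypothetical; 2 clock periods per bus-cycle is the
# assumption Motorola's table is built on.

def internal_cycles(total_clocks, reads, prefetches, writes,
                    clocks_per_bus_cycle=2):
    """Clock periods left for internal execution after bus activity."""
    bus_clocks = (reads + prefetches + writes) * clocks_per_bus_cycle
    return total_clocks - bus_clocks

# move (an)+,dn: 5(1/0/0) in the cache case, 5(1/1/0) without cache
cache_case = internal_cycles(5, reads=1, prefetches=0, writes=0)
no_cache_case = internal_cycles(5, reads=1, prefetches=1, writes=0)

print("internal cycles, cache case:   ", cache_case)
print("internal cycles, no-cache case:", no_cache_case)
```

The two results are the 3 and 1 internal cycles used in the rest of the calculations.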
Also keep in mind that the values for i-prefetching are AVERAGE ones for odd/even words -> the real bus activity is always less than or equal to the calculated one. In practice: let's take our movem example. In all cases it is min. 2 words long, right? But in the calculations it's always divided into 2 i-prefetches -> i-prefetching takes 2*2 = 4 clock periods, while the real bus activity is only 1*2 clock periods (in the case of long aligned operands and a 32 bit data bus, of course !!!). Also don't be surprised if you sometimes get a result of 0 internal cycles.

But hey, that's our dream machine and not the Falcon030! =) Our reference should be:
- 16 bit data bus
- 1 WORD bus-cycle = 4 clock periods (according to Czuba's doc)
- refreshing every 15.6 us [wait states] (again Czuba's doc)

This stuff implies that the table above is wrong for us, at least the overall number of cycles. Some changes are needed:
- with a 32 bit data bus it doesn't matter whether you read a (long aligned) long or a word - it takes the same time. In our case it DOES matter, since 1 bus read = 1 WORD read, NOT a LONG
- as you can see, a bus access takes 2 times longer (4 vs 2 clock periods)

So the correct timing should look like this:

                 Head  Tail  I-Cache Case  No-Cache Case
move.w (an)+,dn    0     1     7(1/0/0)      9(1/1/0)
move.l (an)+,dn    0     1    11(2/0/0)     13(2/1/0)

How did I get these numbers?

bus activity for move.w:
1*4 (data read) + 0*4 (prefetching) = 4 in cache-case
or
1*4 (data read) + 1*4 (prefetching) = 8 in non-cache-case

bus activity for move.l:
2*4 (data read) + 0*4 (prefetching) = 8 in cache-case
or
2*4 (data read) + 1*4 (prefetching) = 12 in non-cache-case

And from the example above we know this move takes 3 (cache) / 1 (non-cache) internal cycles:
4+3=7 / 8+1=9 (move.w) and 8+3=11 / 12+1=13 (move.l)

Here we see we needn't care about 1-word vs 2-word instructions anymore, since our bus can transfer only words :) That means if an instruction takes one word, it needs 1*4 clock periods, if it takes two words, it needs 2*4 clock periods etc.
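The Falcon numbers above follow one simple rule (word bus accesses times 4, plus the internal cycles), so they are easy to recheck. A minimal sketch, assuming the 4 clock periods per word taken from Czuba's doc; the name `falcon_cycles` is hypothetical and wait states are ignored, just like in the text.

```python
# Sketch: rebuild the corrected Falcon timings from the text.
# Assumption (from Czuba's doc as quoted above): one 16 bit bus access
# costs 4 clock periods. Function and variable names are made up.

CLOCKS_PER_WORD_ACCESS = 4

def falcon_cycles(internal, word_reads, word_prefetches, word_writes=0):
    """Total clock periods = bus activity + internal cycles."""
    bus = (word_reads + word_prefetches + word_writes) * CLOCKS_PER_WORD_ACCESS
    return bus + internal

# internal cycles derived earlier: 3 (cache case), 1 (no-cache case)
move_w = (falcon_cycles(3, word_reads=1, word_prefetches=0),
          falcon_cycles(1, word_reads=1, word_prefetches=1))
move_l = (falcon_cycles(3, word_reads=2, word_prefetches=0),
          falcon_cycles(1, word_reads=2, word_prefetches=1))

print("move.w (an)+,dn  cache / no cache:", move_w)
print("move.l (an)+,dn  cache / no cache:", move_l)
```

A long read simply counts as two word reads here, which is where the jump from 7/9 to 11/13 cycles comes from.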
We're more than 2 times slower than an 'original' 68030 in long transfers!

Here I stop talking about timing tables. It's more complicated than you might expect, I mean especially the pipelining & instruction/data cache stuff; it's a topic for a separate article (mail me and I'll write about it).

So, back to ST-RAM: The Falcon is equipped with a Motorola 68030@16 MHz and a 16 bit data bus. That means the CPU can access only 1 word of RAM per bus-cycle. One bus-cycle is 4 clock periods long -> the CPU can read/write one word every 4th clock period (in the following text 1 cycle = one 16 MHz clock period). Luckily, the CPU isn't halted totally during this time - as long as it doesn't access any external source (ST/TT RAM, hardware regs, ...) you can execute some other instructions which do their stuff inside the CPU (this should allow use of the data cache, too, but I didn't get very good results..)

By the way, instructions... the fastest instruction on the 030 takes 2 cycles, but you can't fit two such instructions into one bus write, since you have got less than 4 cycles; I don't know exactly why.. Or you can write one long to ST-RAM (that means a <8 cycle pause) and then you have time for three 2-cycle instructions.. not very much, I know =)

So, if 1 cycle is 1/16000000 s = 62.5 * 10^-9 s = 62.5 ns and the CPU needs 4 cycles for a RAM access, a complete read/write is 4 * 62.5 = 250 ns long. Remember our SIMMs are usually 80 ns parts, which means the RAM itself could be accessed 3.125 times faster! And this is only the beginning of the time wasting :)

1 word = 2 bytes -> we can transfer 2 bytes per 4 cycles -> 1/2 byte [4 bits] per cycle. If one cycle is 62.5 ns: 0.5/62.5ns = 8 000 000 bytes per second. Fiction.

Let's take a look at move.l (an)+,dn again. We calculated that in the worst case (no instruction cache) it takes 13 cycles, right? What is the max reading speed?

16000000/13 longs/s = 16000000*4/13 bytes/s = 4 923 076 bytes/s. Shame!

Do you think it can't be worse? Okie, let's continue...
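If you want to play with these numbers yourself (another clock, another instruction length), here is the same arithmetic as a small Python sketch. All names are made up for illustration; the 16 MHz clock, the 4 cycles per word, the 80 ns SIMMs and the 13 cycle move.l are the values from the text.

```python
# Sketch of the ST-RAM bandwidth arithmetic from the text.
# Variable names are hypothetical; the figures come from the article.

CPU_HZ = 16_000_000           # 68030 @ 16 MHz
CYCLES_PER_WORD = 4           # one 16 bit bus access
SIMM_NS = 80                  # typical SIMM access time
MOVE_L_CYCLES = 13            # move.l (an)+,dn, no instruction cache

cycle_ns = 1e9 / CPU_HZ                       # length of one cycle in ns
access_ns = CYCLES_PER_WORD * cycle_ns        # one word read/write
simm_ratio = access_ns / SIMM_NS              # how much faster the SIMM is

peak_bytes_per_s = CPU_HZ * 2 // CYCLES_PER_WORD   # 2 bytes per bus access
move_l_bytes_per_s = CPU_HZ * 4 // MOVE_L_CYCLES   # 4 bytes per move.l

print("cycle:", cycle_ns, "ns, word access:", access_ns, "ns")
print("SIMM could be", simm_ratio, "times faster")
print("theoretical peak:", peak_bytes_per_s, "bytes/s")
print("move.l reading:  ", move_l_bytes_per_s, "bytes/s")
```

The integer division just drops the fraction of the last byte, the same way the figures in the text are rounded down.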
Here is another bus-cycle-stealer: DMA. DMA means Direct Memory Access. That 'direct' means that if a DMA device wants something from RAM, the CPU gets kicked off the bus and has to wait until the DMA device has finished. In the Falcon we have 3 DMA devices: Videl, Blitter and Sound DMA. I'm not going to talk about the Blitter or SDMA since these chips aren't necessary for EVERY application, but grafix we need from time to time :-)

Eeeehhh... what example to give? I think 320x240xTC could be interesting, couldn't it?

320 x 240 x 2 bytes = 153600 bytes = 38400 longs.

Videl accesses video data in so called BURST mode, that means 17 longs per one Videl (!) access. This BURST mode looks like:

1st long        = 2 (RAS cycle = init) + 1 (CAS cycle = data) =  3 cycles
2nd ~ 17th long = 16 (CAS cycles)                             = 16 cycles
                                                                ---------
                                                                19 cycles
(thanks to Rodolphe for this explanation)

So one BURST access = 19 cycles. 38400/17 = 2259 BURST accesses (rounded up) and 2259 * 19 = 42921 cycles per screen. That gives us 42921 * 60 = 2 575 260 cycles per second for Videl (!!!). 60 stands for 60 Hz of course. 50 would be for 50 Hz, 100 for 100 Hz etc.

Due to DMA stealing, all that is left for the CPU is: 16000000 - 2575260 = 13 424 740 cycles/s !!!

So, what about our move.l ?

13424740/13 longs/s = 13424740*4/13 bytes/s = 4 130 689 bytes per second...

Are you still thinking the Falcon is fantastically designed? =)

Ok... This example assumes all your RAM reading is sequential, that means you're reading address n+$0, n+$2, n+$4, n+$6 etc. Maybe you think: "nah, so what? I have all my data together and my program doesn't jump all the time". Hahah my little lady! =) How does your RAM accessing look?

    lea     something,a0
    move.l  d0,(a0)+
    move.l  d3,(a0)+
    :
    :

for example. But remember the CPU fetches instructions from your code-space (text-segment), then writes some long data to ANOTHER area, then fetches the next instruction, ... etc. And we still assume all your code and data is word/long aligned!
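The Videl burst arithmetic from the 320x240 truecolour example above can be rechecked (or redone for another resolution or refresh rate) with this little Python sketch. The names are hypothetical; the 17 longs per burst and 19 cycles per burst are Rodolphe's numbers as quoted in the text, and the 13 cycle move.l is the no-cache case calculated earlier. Only sequential access is modelled here.

```python
# Sketch of the Videl DMA "cycle stealing" arithmetic from the text.
# All names are made up; burst figures are the ones quoted from Czuba.

CPU_HZ = 16_000_000
LONGS_PER_BURST = 17
CYCLES_PER_BURST = 19        # 3 cycles for the 1st long + 16 for the rest
MOVE_L_CYCLES = 13           # move.l (an)+,dn without instruction cache

def videl_cycles_per_second(width, height, bytes_per_pixel, refresh_hz):
    """Cycles per second Videl takes from the bus for one video mode."""
    longs = width * height * bytes_per_pixel // 4
    # a partly filled last burst still costs a whole burst (round up)
    bursts = (longs + LONGS_PER_BURST - 1) // LONGS_PER_BURST
    return bursts * CYCLES_PER_BURST * refresh_hz

stolen = videl_cycles_per_second(320, 240, 2, 60)
left_for_cpu = CPU_HZ - stolen
move_l_bytes_per_s = left_for_cpu * 4 // MOVE_L_CYCLES

print("Videl takes:      ", stolen, "cycles/s")
print("left for the CPU: ", left_for_cpu, "cycles/s")
print("move.l reading:   ", move_l_bytes_per_s, "bytes/s")
```

Calling `videl_cycles_per_second(320, 240, 2, 50)` gives the 50 Hz figure the same way.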
(For more info about alignment see below.)

For non-sequential access you have to add 2 cycles of precharge time (again Czuba's number) to the instruction time. In our case:

13424740/(13+2) longs/s = 13424740*4/(13+2) bytes/s = 3 579 930 bytes/s...

Our 256 byte cache can solve these troubles with precharging. If you put your loop into the instruction cache, the data bus will be used only for reading/writing data. If you're a really, really lucky boy, in the case of reading you can have both the program and the data you read in the instr/data cache, and no bus transfer will be done. In the case of writing, data is ALWAYS transferred to memory, since our cache works in so called 'write-through' mode. The 040 and higher work in 'write-back' mode, where it isn't always necessary to access RAM (better internal logic).

Now we have killed the last cycle in our Falcon :-) Ah, not the last - there are still those stupid wait states... but as Rodolphe wrote, they don't affect the calculations a lot. If you want some additional info on how to work them into the timing tables, I refer you to the 68030 User's Manual by Motorola.

<link=art46b.scr>Go to NEXT PART</l>
</frame>
</body>