- Xref: sparky comp.unix.bsd:8565 comp.benchmarks:1622 comp.arch:10504 comp.arch.storage:750
- Path: sparky!uunet!spool.mu.edu!olivea!sgigate!odin!sgi!igor!jbass
- From: jbass@igor.tamri.com (John Bass)
- Newsgroups: comp.unix.bsd,comp.benchmarks,comp.arch,comp.arch.storage
- Subject: Disk performance issues, was IDE vs SCSI-2 using iozone
- Message-ID: <1992Nov7.102940.12338@igor.tamri.com>
- Date: 7 Nov 92 10:29:40 GMT
- Sender: jbass@dmsd.com
- Organization: DMS Design
- Lines: 317
-
- Copyright 1992, DMS Design, all rights reserved.
-
- There should be something to learn or debate in this posting for everyone!
-
- (Most of what follows is from a talk I gave this summer at SCO Forum,
- and have for several years been trying to pound into systems designers &
- integrators everywhere. Many thanks to doug@sco.com for letting/paying me to
- learn so much about PC UNIX systems over the last 3 years and give the
- interesting IDE vs SCSI SCO Forum presentation. Too bad the politics
- prevented me from using it to vastly improve their filesystem and disk i/o.)
-
- There are a number of significant issues in comparing IDE vs SCSI-2, to
- avoid comparing apples to space ships -- this topic is loaded with traps.
-
- I am not a new kid on the block ... been around/in the computer biz since '68,
- UNIX since '75, SASI/SCSI from the dark days including design of several
- SCSI hostadapters (The 3 day MAC hack published in DDJ, a WD1000
- emulating SCSI hostadapter for a proprietary bus, and a single board computer
- design). I understand UNIX/Disk systems performance better than most.
-
- For years people have generally claimed SCSI to be faster than ESDI/IDE
- for all the wrong reasons ... this was mostly due to the fact that
- SCSI drives implemented lookahead caching of a reasonable length before
- caching appeared in WD1003 interface-compatible controllers (MFM, RLL, ESDI,
- IDE). Today, nearly all have lookahead caching. Properly done, ESDI/IDE
- should be slightly faster than an equivalent SCSI implementation. This
- means hostadapters & drives of equivalent technology.
-
- PC Architecture and I/O subsystem architecture issues are the real answer
- to this question ... it is not only a drive technology issue.
-
-
-
-
- First, dumb IDE adapters are WD1003-WHA interface compatible, which means:
-
- 	1) the transfer between memory and controller/drive is done in
- 	software .... IE a tight IN16/WRITE16 or READ16/OUT16 loop. The
- 	IN and OUT instructions run at 286-6MHz bus speeds on all ISA
- 	machines 286 to 486 ... time invariant ... about 900ns per 16 bits.
- 	This will be referred to as Programmed I/O, or PIO (see the
- 	sketch after this list).
-
- 	2) the controller interrupts once per 512-byte sector, and the driver
- 	must transfer the sector. On drives much faster than 10 Mbit/sec
- 	the processor is completely saturated during disk I/O interrupts
- 	and no concurrent user processing takes place. This is fine for
- 	DOS, but causes severe performance problems for all multitasking OS's.
- (Poor man's disconnect/reconnect, allows multiple concurrent drives).
-
- 3) The DMA chips on the motherboard are much slower, since they
- lose additional bus cycles to request/release the bus, which is
- why WD/IDE boards are PIO.
-
- 	4) Since data transfers hog the processor, the OS/application is
- 	not able to keep up with the disk data rate, and WILL LOSE REVS
- 	(miss interleave).
-
- 5) for sequential disk reads with lookahead caching, the
- system is completely CPU bound for drives above 10mbit/sec.
- All writes, and reads without lookahead caching lose one
- rev per request, unless a very large low level interleave
- is present. 1:1 is ALMOST ALWAYS the best performing interleave,
- even taking this into account, due to multiple sector requests
- at the filesystem and paging interfaces.
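-
- As a rough sketch of items 1 and 2 above (mine, not driver source), this is
- the shape of a WD1003-style PIO sector read; the port addresses and the
- inb()/inw() port-I/O helpers are the usual ISA task-file conventions and are
- assumptions for the example:
-
-     #define WD_DATA    0x1F0     /* 16-bit data register (task file base) */
-     #define WD_STATUS  0x1F7     /* status register                       */
-     #define WD_BUSY    0x80
-     #define WD_DRQ     0x08
-
-     /* Read one 512-byte sector with programmed I/O.  Every 16-bit word
-      * costs one IN instruction at ISA bus speed (~0.9us, independent of
-      * CPU clock), so one sector ties up the CPU for roughly
-      * 256 * 0.9us = ~230us -- plus the per-sector interrupt of item 2. */
-     static void pio_read_sector(unsigned short *buf)
-     {
-         int i;
-
-         while (inb(WD_STATUS) & WD_BUSY)      /* wait for controller   */
-             ;
-         while (!(inb(WD_STATUS) & WD_DRQ))    /* wait for data ready   */
-             ;
-         for (i = 0; i < 256; i++)             /* 256 words = 512 bytes */
-             buf[i] = inw(WD_DATA);
-     }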
-
- There will be a strong market for high performance IDE hostadapters
- in the future, for UNIX, Novell, OS/2 and NT ... which are NOT PIO via
- IN/OUT instructions. Both ISA memory mapped and Bus Mastering IDE
- host adapters should appear soon .... some vendors are even building
- RAID IDE hostadapters. I hope this article gets enough press to make
- endusers/vars aware enough to start asking for the RIGHT technology.
- While the ISA bus used like this is a little slow, it is fast enough
- to handle the same transfer rates and number of drives as SCSI. With
- a smart IDE drive ... we could also implement DEC RP05-style rotational
- position sensing to improve disk queue scheduling ... worth another
- 15%-30% in performance (a sketch of the idea follows).
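-
- A minimal sketch of rotational-position-aware scheduling, assuming a drive (or
- an index-synchronized timer) that can report the current angular position; the
- structure names and the SECTORS_PER_TRACK value are illustrative only:
-
-     struct req {
-         int cyl;                  /* target cylinder        */
-         int sector;               /* target sector on track */
-         struct req *next;
-     };
-
-     #define SECTORS_PER_TRACK 34
-
-     static int rot_delay(int cur_pos, int target)
-     {
-         int d = target - cur_pos;
-         return d >= 0 ? d : d + SECTORS_PER_TRACK;
-     }
-
-     /* Among queued requests on the current cylinder, pick the one that
-      * will pass under the heads soonest; returns 0 when nothing is on
-      * this cylinder and the ordinary seek sort should decide instead. */
-     static struct req *pick_next(struct req *q, int cur_cyl, int cur_pos)
-     {
-         struct req *r, *best = 0;
-         int best_delay = SECTORS_PER_TRACK + 1;
-
-         for (r = q; r; r = r->next) {
-             if (r->cyl != cur_cyl)
-                 continue;
-             if (rot_delay(cur_pos, r->sector) < best_delay) {
-                 best_delay = rot_delay(cur_pos, r->sector);
-                 best = r;
-             }
-         }
-         return best;
-     }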
-
- Local bus IDE may prove one solution.
-
-
-
-
- Secondly, SCSI hostadapters for the PC come in a wide variety of flavors:
-
- 1) (Almost) Fast Bus Mastering (ok, 3rd Party DMA) like the
- Adaptec 1542 and WD7000 series controllers. These are generally
- expensive compared to dumb IDE, but allow full host CPU concurrency
- with the disk data transfers (486 and cached 386 designs more so
- than simple 386 designs).
-
- 	2) Dumb (or smart) PIO hostadapters like the ST01, which are
- 	very cheap, and are CPU hogs with poor performance just like IDE,
- 	for all the same reasons, plus a few. These are common with
- 	cheap CD-ROM and scanner SCSI adapters.
-
- What the market really needs are some CHEAP but very dumb IDE and SCSI
- adapters that are only a bus to bus interface with a fast Bus Mastering
- DMA for each drive. In theory these would be a medium sized gate array
- for IDE, plus a 53C?? for SCSI and cost about $40 IDE, and $60 SCSI. For
- 486 systems they would blow the sockets off even the fastest adapters built
- today since the 486 has faster CPU resources to follow SCSI protocol -- more
- so than what we find on the fast adapters, and certainly faster than the
- crawlingly slow 8085/Z80 adapters. With such adapters, IDE would be both
- faster and cheaper than SCSI -- maybe we would see more IDE tapes and CD-ROMs.
- Certainly the product's firmware development would be shorter than any SCSI-II effort.
-
- All IDE and SCSI drives have a microprocessor which oversees the bus and
- drive operation. Generally this is a VERY SLOW 8 bit micro ... 8048, Z80,
- or 8085 core/class CPU. The IDE bus protocol is MUCH simpler than SCSI-2,
- which allows IDE drives to be more responsive. Some BIG/FAST/EXPENSIVE
- SCSI drives are starting to use 16-bit micros to get the performance up.
-
- First generation SCSI drives often had 6-10ms of command overhead in the
- drive ... limiting performance to two or three commands per revolution,
- which had better be multiple sectors to get any reasonable performance.
- SCO's AFS uses 8-16k clusters for this reason, ditto for berkeley fs.
-
- The fastest drives today still have command overhead times of near/over 1ms
- (partly command decode, the rest in the status/msg_out/disconnect/select sequence).
- Most are still in the 2-5ms range.
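-
- A quick back-of-the-envelope on what that overhead costs, assuming a 3600 RPM
- spindle (about 16.7ms per revolution); the numbers are only illustrative:
-
-     #include <stdio.h>
-
-     int main(void)
-     {
-         double ms_per_rev = 60000.0 / 3600;            /* ~16.7 ms       */
-         double overhead[] = { 10.0, 6.0, 2.0, 1.0 };   /* ms per command */
-         int i;
-
-         for (i = 0; i < 4; i++)
-             printf("%4.1f ms overhead -> %4.1f commands/rev max\n",
-                    overhead[i], ms_per_rev / overhead[i]);
-         return 0;
-     }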
-
- What was fine in microcontrollers for SASI/SCSI-I ... is impossibly
- slow with the increased firmware requirements for a conforming SCSI-II!
-
- High performance hostadapters on the EISA and MC platforms are appearing
- that have fast 16 bit micros ... and the current prices reflect not only
- the performance .... but reduced volumes as well.
-
- The Adaptec 1542 and WD 7000 hostadapters also use very slow microprocessors
- (8085 and Z80 respectively) and also suffer from command overhead problems.
- For this reason the Adaptec 1542 introduced queuing multiple requests inside
- the hostadapter, to minimize delays between requests that left the drive idle.
- For single process disk accesses ... this works just fine ... for multiple
- processes, the disk queue sorting breaks down and generates terrible seeking
- and a performance reduction of about 60%, unless very special disk sort and
- queuing steps are taken. Specifically, this means that steps should be taken
- to allow the process associated with the current request to lock the heads
- on track during successive I/O completions and filesystem read-ahead operations
- to make use of the data in the lookahead cache. IE ... keep other processes'
- requests out of the hostadapter! Allowing other regions' requests into the
- queue flushes the lookahead cache when the sequential stream is broken
- (see the sketch below).
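-
- A minimal sketch of that driver-level gate (my illustration, not SCO code):
- the process whose sequential run currently owns the heads gets a short
- exclusive window, and other processes' requests are held back in the driver
- queue instead of being handed to the hostadapter. The names and the budget
- value are assumptions:
-
-     struct buf { int pid; long blkno; int nblks; struct buf *av_forw; };
-
-     static int  owner    = -1;   /* pid owning the current stream     */
-     static long next_blk = -1;   /* block we expect the owner to read */
-     static int  budget   = 0;    /* remaining back-to-back requests   */
-     #define STREAM_BUDGET 8
-
-     /* Return 1 to pass the request to the hostadapter now,
-      * 0 to hold it in the driver queue a little longer. */
-     static int send_to_adapter(struct buf *bp)
-     {
-         if (owner == bp->pid && bp->blkno == next_blk && budget > 0) {
-             budget--;                       /* same sequential stream */
-         } else if (owner == -1 || budget == 0) {
-             owner  = bp->pid;               /* new stream takes over  */
-             budget = STREAM_BUDGET;
-         } else if (owner != bp->pid) {
-             return 0;                       /* would flush the drive's
-                                                lookahead cache         */
-         }
-         next_blk = bp->blkno + bp->nblks;
-         return 1;
-     }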
-
- Lookahead caches are very good things ... but fragile ... the filesystem,
- disksort, and driver must all attempt to preserve locality long enough to
- allow them to work. This is a major problem for many UNIX systems ... DOS
- is usually easy .... single process, mostly reads. Other than the several
- extent based filesystems (SGI) ... the balance of the UNIX filesystems
- fail to maintain locality during allocation of blocks in a file ... some
- like the BSD filesystem and SCO's AFS manage short runs ... but not good
- enough. Log-structured filesystems without extensive cache memory
- and late binding suffer the same problem.
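-
- For contrast, a toy sketch (purely illustrative) of the extent-style approach:
- grab a contiguous run of blocks up front instead of allocating one block at a
- time, so a growing file keeps its data under one head position:
-
-     /* Toy extent allocator: first-fit over a free-extent list.  Real
-      * extent filesystems (e.g. SGI's) are far more involved; this only
-      * shows why allocation stays contiguous. */
-     struct extent { long start; long len; struct extent *next; };
-
-     /* Allocate 'want' contiguous blocks; returns start block or -1. */
-     static long alloc_extent(struct extent *freelist, long want)
-     {
-         struct extent *e;
-
-         for (e = freelist; e; e = e->next) {
-             if (e->len >= want) {
-                 long start = e->start;
-                 e->start += want;       /* shrink the free run */
-                 e->len   -= want;
-                 return start;           /* one seek covers it  */
-             }
-         }
-         return -1;      /* no single run big enough: caller splits */
-     }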
-
- Some controllers are attempting to partition the cache to minimize lookahead
- cache flushing ... for a small number of active disk processes/regions. For
- DOS this is ideal, handles the multiple read/write case as well. With UNIX at
- some point the number of active regions will exceed the number of cache
- partitions; the resulting cache flushing creates a step discontinuity in
- throughput, a reduction with severe hysteresis-induced performance problems.
-
- There are several disk-related step discontinuity/hysteresis problems that
- cause unstable performance walls or limits in most unix systems, even today.
- Poor filesystem designs, partitioning strategies, paging strategies are
- at the top of my list, which prevent current systems from linearly degrading
- with load ... nearly every vendor has done one or more stupid things to
- create limiting walls in the performance curve due to a step discontinuity
- with a large hysteresis function.
-
- Too much development, without the oversight of a skilled systems architect.
-
-
-
-
- One final note on caching hostadapters ... the filesystem should make better
- use of any memory devoted to caching, compared with ANY hostadapter. Unless
- there is some unreasonable restriction, the OS should yield better
- performance with the additional buffer cache space than the adapter.
- What flows in/out of the OS buffer cache generally is completely redundant
- if cached in the hostadapter. If the hostadapter shows gains by using
- some statistical caching ... vs the fifo cache in UNIX, then the
- UNIX cache should be able to get better gains by incorporating the
- same statistical cache in addition to the fifo cache. (If you are running
- a BINARY DISTRIBUTION and can not modify the OS, then if the hostadapter
- cache shows a win ... you are stuck and have no choice except to use it.)
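-
- The point about folding a winning adapter strategy back into the OS cache can
- be sketched like this (my own illustration, not any vendor's code): keep the
- normal FIFO buffer cache, but pin a small set of statistically hot blocks that
- would otherwise cycle out with the FIFO stream. Slot count and decay rule are
- assumptions:
-
-     #define HOT_SLOTS 64
-
-     struct hot { long blkno; unsigned hits; char valid; };
-     static struct hot hot[HOT_SLOTS];
-
-     /* Record a buffer-cache reference; return 1 if the block is (now)
-      * pinned as hot and should be kept out of the FIFO aging list. */
-     static int note_reference(long blkno)
-     {
-         int i, coldest = 0;
-
-         for (i = 0; i < HOT_SLOTS; i++) {
-             if (hot[i].valid && hot[i].blkno == blkno) {
-                 hot[i].hits++;
-                 return 1;
-             }
-             if (!hot[i].valid)
-                 coldest = i;                    /* free slot wins     */
-             else if (hot[coldest].valid &&
-                      hot[i].hits < hot[coldest].hits)
-                 coldest = i;                    /* else least-hit one */
-         }
-         /* Displace the coldest slot; halve its count so stale favorites
-          * can eventually age out. */
-         hot[coldest].blkno = blkno;
-         hot[coldest].hits  = hot[coldest].hits / 2 + 1;
-         hot[coldest].valid = 1;
-         return 0;
-     }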
-
- Systems vendors should look at caching adapters' performance results and
- reflect winning caching strategies back into the OS for even bigger
- improvements. There should never be a viable UNIX caching controller
- market ... DOS needs all the help it can get.
-
-
-
-
-
-
-
- Now for a few interesting closing notes ...
-
- Even the fastest 486 PC UNIX systems are filesystem CPU bound to between
- 500KB and 2.5MB/sec ... drive subsystems faster than this are largely
- useless (a waste of money) ... especially most RAID designs. Doing
- page flipping (not bcopy) to put the data into user space can improve
- things if aligned properly inside well-behaved applications.
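-
- Inside the kernel, page flipping means remapping the buffer page into the
- user address space instead of copying it. A user-level analogue of the same
- no-copy idea (a hedged sketch, not the kernel mechanism itself) is to mmap
- the file rather than read() it:
-
-     #include <sys/mman.h>
-     #include <sys/stat.h>
-     #include <fcntl.h>
-     #include <stdio.h>
-     #include <unistd.h>
-
-     int main(int argc, char **argv)
-     {
-         struct stat st;
-         char *p;
-         long i, sum = 0;
-         int fd;
-
-         if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
-             return 1;
-         if (fstat(fd, &st) < 0)
-             return 1;
-         /* Data is touched in place in the mapped pages; nothing is
-          * bcopy'd from the buffer cache into a private user buffer. */
-         p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
-         if (p == MAP_FAILED)
-             return 1;
-         for (i = 0; i < st.st_size; i++)
-             sum += p[i];
-         printf("%ld bytes, checksum %ld\n", (long)st.st_size, sum);
-         munmap(p, st.st_size);
-         close(fd);
-         return 0;
-     }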
-
- The single most critical issue for 486 PC UNIX applications under X/Motif is
- disk performance -- both filesystem and paging. Today's systems are only
- getting about 10% of the available or required disk bandwidth to provide
- acceptable performance for anything other than trivial applications. The
- current UNIX filesystem designs and paging algorithms are a hopeless
- bottleneck for even uni-processor designs ... MP designs REQUIRE much better if
- they are going to support multiple Xterm sessions using significant X/Motif
- applications like FrameMaker or Island Write. RAID performance gains are
- not enough to make up for the poor filesystem/paging algorithms. With the
- current filesystem/paging bottleneck, neither MP nor Xterms are cost
- effective technologies.
-
- For small systems the Berkeley 4.2BSD and Sprite LFS filesystems both fail
- to maintain head locality ... and as a result overwork the head positioner,
- resulting in lower performance and early drive failure. With the student
- facility model, a large number of users with small quotas consume the entire
- disk and present a largely random access pattern to the filesystem. With such
- a model there is no penalty for spreading files evenly across cylinder groups
- or cyclically across the drive ... in fact it helps minimize excessively long
- worst-case seeks at deep queue lengths and results in linear degradation without
- step discontinuities.
-
- Workstations and medium-sized development systems/servers have little if any
- queue depth, and locality/sequential allocation of data within a file is
- ABSOLUTELY CRITICAL to filesystem AND exec/paging performance. Compaction of
- referenced disk area by migration of frequently accessed files to cylinder and
- rotationally optimal locations is also ABSOLUTELY necessary, by order of
- reference (directories in search paths, include files, x/motif font and
- resource files) to get control of response times for huge applications. With
- this model, less than 5% of the disk is referenced in any day, and less than
- 10% in any week ... 85% has reference times greater than 30 days, if ever.
- Any partitioning of the disk is absolutely WRONG ... paging should occur
- inside active filesystems to maintain locality of reference (no seeking).
- For zoned recorded drives the active region should be the outside tracks,
- highest transfer rates and performance ... ditto for fixed frequency drives,
- but for highest reliability due to lowest bit density. This runs counter to
- the center-of-the-disk theory ... which is generally wrong for other reasons
- as well (when files are sorted by probability of access and age).
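-
- A toy placement pass along those lines (illustration only, names assumed):
- rank files by how recently and how often they are referenced, then hand out
- cylinders from the outer zone inward, so the hot few percent of data sits
- where transfer rate is highest and seeks are shortest:
-
-     #include <stdlib.h>
-
-     struct finfo { long atime; long refs; long cyl; };
-
-     /* Sort hottest first: most recently referenced, then most often. */
-     static int hotter(const void *a, const void *b)
-     {
-         const struct finfo *x = a, *y = b;
-         if (x->atime != y->atime)
-             return (y->atime > x->atime) - (y->atime < x->atime);
-         return (y->refs > x->refs) - (y->refs < x->refs);
-     }
-
-     /* Assign cylinders 0..n-1 (0 = outermost zone) by hotness. */
-     static void place(struct finfo *f, int n)
-     {
-         int i;
-
-         qsort(f, n, sizeof *f, hotter);
-         for (i = 0; i < n; i++)
-             f[i].cyl = i;
-     }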
-
- In fact the filesystem should span multiple spindles with replication of
- critical/high use files across multiple drives, cylinder locations, and
- rotational positions -- for both performance and redundancy. Late binding
- of buffer data to disk addresses (preferably long after close) allows not only
- best fit allocation, but choice of least loaded drive queue as well, at the
- time of queuing the request ... not long before. Mirroring, striping,
- and most forms of RAID are poor ways to gain lesser redundancy and load
- balancing. Dynamic migration to & retrieval from a jukebox optical disk
- or tape backing store solves the backup problem and provides transparent
- recovery across media/drive failures. Strict write ordering and checkpointing
- should make FSCK obsolete on restarts ... just a tool for handling media
- failures and OS corruptions.
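-
- A minimal sketch of the late-binding idea above: a delayed-write buffer has no
- disk address until flush time, when we pick whichever mirror/spindle currently
- has the shortest queue and only then allocate a block. The structures and the
- alloc_near() helper are hypothetical:
-
-     struct drive { int qlen; long head_cyl; };
-
-     #define NDRIVES 4
-
-     extern long alloc_near(int drive, long cyl);  /* hypothetical allocator */
-
-     static int least_loaded(struct drive *d)
-     {
-         int i, best = 0;
-
-         for (i = 1; i < NDRIVES; i++)
-             if (d[i].qlen < d[best].qlen)
-                 best = i;
-         return best;
-     }
-
-     /* Called when a delayed-write buffer is finally pushed to disk. */
-     static void bind_and_queue(struct drive *d, long *blkno, int *drv)
-     {
-         *drv   = least_loaded(d);                     /* least loaded queue  */
-         *blkno = alloc_near(*drv, d[*drv].head_cyl);  /* best fit, bound now */
-         d[*drv].qlen++;
-     }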
-
- I spent the last 2-1/2 years trying to get SCO to accept a contract to
- increase filesystem/paging performance by 300%-600% using the above ideas
- plus a few, but failed due to NIH, lack of marketing support for performance
- enhancements, and some confusion/concern over lack of portability of such
- development to the next generation OS platform (ACE/OSF/SVR4/???). I even
- proposed doing the development on XENIX without any SCO investment, just a
- royalty position if the product was marketable -- marketing didn't want to have
- possible XENIX performance enhancements outshine ODT performance efforts,
- which if I even got close to my goals would be TINY in comparison. Focused
- on POSIX, X/Motif, networked ODT desktops -- they seem to have lost sight
- of a small low-cost minimalist character-based platform for point of sale
- and traditional terminal-based VARs. I know character-based apps are the
- stone age ... but xterms just aren't cheap yet.
-
-
-
-
- Shrink-wrapped OS's mean fewer Systems Programmer development jobs!
-
- With the success of SCO on PC platforms, there are only three viable
- UNIX teams left ... USL/UNIVEL, SunSoft, and SCO. The remaining vendors
- have too few shipments and such tight margins that they are unable to fund any
- significant development effort outside simply porting and maintaining
- a relatively stock USL product -- unless subsidized by hardware sales
- (hp, apple, ibm, everex, ncr .... etc). A big change from 10 years
- ago, when a strong UNIX team was a key part of every systems vendor's success.
-
- With the emerging USL Destiny and Microsoft NT battle, price for a
- minimalist UNIX application platform will be EVERYTHING, to offset
- the DOS/Windows camp's compatibility and market lead in desktops. UNIX's
- continued existence outside a few niche markets lies largely in USL's
- ability to expand distribution channels for Destiny without destroying
- SCO, SUN, and other traditional unix suppliers ... failure to do so will
- give the press even more ammunition that "UNIX is dead".
-
- It may require that USL's royalties be reduced to the point that UNIVEL,
- SCO & SUN can profitably re-sell Destiny in the face of an all out price
- war with Microsoft NT. A tough problem for USL in both distribution channel
- strategy and revenue capture ... Microsoft has direct retail presence (1 or 2
- levels of distribution vs 2, 3 or 4) which improves its margins significantly
- over USL (and ability to cut cost). SCO and SunSoft are going to see per unit
- profits tumble 50% to 80% without a corresponding increase in sales in the
- near term -- many ENDUSERS/VAR's buying the current bundled products can get
- along quite well with the minimalist Destiny product -- as an applications
- platform. Developers (a much smaller market) will continue buying the
- whole bundle.
-
-
-
-
- Microsoft is VERY large & profitable, sadly USL, SCO & SUN only wish to be.
- It might be time to think about building NT experience and freeware clones.
- I hope our universities can focus on training NT systems jocks; there are going
- to be more than enough UNIX guys for quite some time .... even if USL pulls it
- off and only loses 60% of the market to NT.
-
- Fortunately for the hardware guys, design efforts will be up for NT-optimized
- high performance 486 & RISC systems! and everybody is going to go crazy for
- a while.
-
- John Bass, Sr. Engineer, DMS Design (415) 615-6706
- UNIX Consultant Development, Porting, Performance by Design
-