- Xref: sparky comp.benchmarks:1684 comp.arch.storage:768
- Path: sparky!uunet!europa.asd.contel.com!emory!swrinde!cs.utexas.edu!rutgers!cbmvax!jesup
- From: jesup@cbmvax.commodore.com (Randell Jesup)
- Newsgroups: comp.benchmarks,comp.arch.storage
- Subject: Re: Disk performance issues, was IDE vs SCSI-2 using iozone
- Message-ID: <37043@cbmvax.commodore.com>
- Date: 14 Nov 92 04:17:29 GMT
- References: <1992Nov11.064154.17204@fasttech.com> <1992Nov11.210749.3953@igor.tamri.com> <36995@cbmvax.commodore.com> <1992Nov12.193308.20297@igor.tamri.com>
- Reply-To: jesup@cbmvax.commodore.com (Randell Jesup)
- Organization: Commodore, West Chester, PA
- Lines: 352
-
- jbass@igor.tamri.com (John Bass) writes:
- >
- >I've gotten a lot of wonderful mail on this series of postings; the few
- >detractors make arguments similar to Randell's, so I will rebut his
- >public position a little more strongly than otherwise.
-
- Hmmm, I have this vision of a bunch of spectators waving flags in
- stands as we "battle" it out.... :-)
-
- The discussion topic (as you noted in the subject) has changed quite
- a bit from the original topic; I suspect assumptions about this topic have
- been part of the reasons for disagreement.
-
- (For historical interest, this started out with a "why is IOZone
- dangerous to use without understanding" message.)
-
- For the record, I used the Amiga (nee Tripos) filesystem more as an
- example of how implementation details of an OS, such as the Unix buffer
- cache, can influence filesystem design, performance, and benchmarks. The
- Amiga filesystem is by no means perfect. It has some good qualities, such as
- good (close to driver limits) performance on large reads/writes and fast
- opening of files (due to hashed directories). It has some bad qualities too,
- such as using block lists instead of extents (fixing this would be an easy
- big win), slow directory scanning (though the recent release has an OK
- solution to this), and a few other things. It is designed for micros
- (multitasking, but micros). It performs well in a single-user environment
- in most situations.
-
- Now on to the real discussion...
-
- > 2) Substantial in-field performance anomalies where critical
- > data is in remapped blocks ... what if this was a benchmark
- > evaluation purchase and the customer based a $50M order
- > on its performance? What about the ordinary customer who
- > just has to live with its poor behaviour?
-
- That's a good example of why I said I think we've been talking
- about different things/worlds. That's not the desktop market. That's
- not even close. In that sort of market, spending LOTS of engineering dollars
- to avoid worst-case problems, or to get the last bit of performance out of
- a drive that is the gating performance issue for many users, makes sense. It
- (usually) doesn't in the desktop market.
-
- Most drives that remap do so to spares on the same cylinder. While
- fetching such a spare will take time, it won't take much (average 1/2 rev).
- If you have contiguous data and the drive is doing read-ahead, it may only
- take an extra sector time, since the drive will be using that extra half-rev
- to put data you'll want into its buffer. Of course, if you run out of
- spares on the cylinder, you have to do an expensive extra seek.
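To put rough numbers on that half-rev claim, here is a back-of-the-envelope latency model. The geometry (3600 RPM, 17 sectors/track) is an illustrative assumption, not taken from any particular drive:

```python
# Rough latency model for fetching a same-cylinder spare.
# Geometry below (3600 RPM, 17 sectors/track) is illustrative only.
RPM = 3600
SECTORS_PER_TRACK = 17

rev_time_ms = 60_000 / RPM                        # one full revolution
half_rev_ms = rev_time_ms / 2                     # average rotational penalty
sector_time_ms = rev_time_ms / SECTORS_PER_TRACK  # penalty if read-ahead hides the wait

print(f"average spare penalty: {half_rev_ms:.2f} ms")    # ~8.33 ms
print(f"with read-ahead:       {sector_time_ms:.2f} ms") # ~0.98 ms
```

Either figure is small next to the full seek you pay once the cylinder's spares are exhausted.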
-
- >my position was that remapping should be done in the filesystem
- >so that the bad-blocks would NEVER be allocated and need to be remapped. This
- >IS a major divergence in thought from current practice in UNIX ... not at all
- >for DOS which has always managed bad block info in the filesystem.
-
- There are costs with that approach also. Quite possibly the costs
- aren't large, but they are there. In our particular application, we make use
- of knowing that all blocks are usable to allow image-backups and copies of
- partitions - that's painful at best and at worst impossible without mapping
- in the driver or the device. However, for systems like most Unixes which
- rarely use removable media this would not be very important.
-
- >For WD1003 type controllers the long accepted practice was to flag each
- >sector/track bad at low-level format, so that when DOS format or SCO badtrk
- >scanned the media it was ASSURED of finding the area bad and marking it so.
- >From a performance point of view (the one I consistently take) this is vastly
- >better!
-
- One thing you miss in your discussion, though: all reasonable SCSI
- drives will tell you what sectors have been mapped out - Read Defect Data.
- If the filesystem wishes to use the information, it's there. You can even
- figure out if it will replace a cylinder by examining some of the mode pages,
- or change the formatting so it doesn't spare on a per-cylinder basis, at
- least on some drives (exact options vary from drive to drive, but you can
- check the mode pages).
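The request itself is cheap to form. A sketch of building a SCSI-2 READ DEFECT DATA (10) CDB (opcode 0x37) follows; the field layout is per the SCSI-2 draft, and actually issuing it would of course need some pass-through mechanism, which is omitted here:

```python
def read_defect_data_cdb(plist=True, glist=True, fmt=0b101, alloc=4096):
    """Build a READ DEFECT DATA (10) CDB (SCSI-2, opcode 0x37).
    PLIST/GLIST select the factory and grown defect lists; fmt 0b101
    requests the physical-sector defect list format."""
    byte2 = (int(plist) << 4) | (int(glist) << 3) | (fmt & 0b111)
    return bytes([0x37, 0x00, byte2, 0, 0, 0, 0,
                  (alloc >> 8) & 0xFF, alloc & 0xFF, 0x00])  # alloc len, control

cdb = read_defect_data_cdb()
print(cdb.hex())
```

A filesystem that wants to lay out around the defects only has to parse the returned list.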
-
- >> The Read Multiple command performs similarly to the Read Sectors command.
- >> Interrupts are not generated on every sector, but on the transfer of a block
- >> which contains the number of sectors defined by a Set Multiple command.
- >
- >Sorry, but for PCs BRDY of the WD1010 is tied to IRQ, and does generate
- >an interrupt per sector -- EVEN WITH IDE .... This discussion started out as
- >an IDE/SCSI comparison from a PC point of view ... you need to keep that in
- >mind ... since it may not be implemented that way on your Amiga.
-
- Let me reverse that: I told you what the AT-IDE spec says. Have you
- read the spec? IDE (_not_ WD1010) has NO BRDY signal. When using Read
- Multiple, it does NOT generate more than 1 interrupt per N sectors, where
- N is the value set with Set Multiple. IDE has one interrupt line: INTRQ.
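The interrupt arithmetic is simple enough to state directly; a small sketch (the function name is mine, only the count is being modeled):

```python
import math

def irqs_per_transfer(total_sectors, block_size=1):
    """Interrupts an ATA Read Multiple transfer generates: one per block
    of `block_size` sectors (the value set with Set Multiple); a short
    final block still costs one interrupt. block_size=1 models plain
    Read Sectors behaviour."""
    return math.ceil(total_sectors / block_size)

print(irqs_per_transfer(128))      # Read Sectors: 128 interrupts
print(irqs_per_transfer(128, 16))  # Read Multiple, N=16: 8 interrupts
```

At a few thousand sectors per second, cutting the interrupt count by N is exactly the point of the command.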
-
- >This is a host adapter issue, and hopefully will go away in the future as
- >cheap DMA host adapters become available.
-
- If they become available. The DOS/Windows people have little reason
- to care about DMA, and the OS/2/Windows-NT/Unix people generally opt for
- SCSI, for many reasons, including tape drives, larger drives, CDROM, more
- devices per interface, disconnect, and the ability to mount things outside
- the CPU box (IDE is limited to 12" of cable; basically it's a straight CPU
- bus). We used IDE for only one reason: it's _cheap_. For mid/high-end we're
- committed to SCSI (with one exception, for reasons that have little to do
- with technical issues).
-
- >> Yes, write-buffering does lose some error recovery chances, especially
- >>if there's no higher-level knowledge of possible write-buffering so
- >>filesystems can insert lock-points to retain consistency. However, it can be
- >>a vast speed improvement. It all depends on your (the user's) needs. Some
- >>can easily live with it, some can't, some need raid arrays and UPS's.
- >
- >It is only a vast speed improvement on single block filesystem designs ...
- >any design which combines requests into a single I/O will not see such
- >improvement ... log structured filesystems are a good modern example.
- >It certainly has no such effect for my current filesystem design.
-
- If it can provide the optimization and make the design of a higher
- level significantly simpler, it may be a win. Also, it can help non-"single-
- block" filesystems - the Amiga FS writes in as large amounts as possible, but
- it doesn't gather separate requests. Small writes end up in filesystem
- buffers, and when flushed the buffers are non-contiguous, so they must be
- sent as separate requests (drivers don't do scatter-gather on the Amiga).
- Large writes will become large writes at the drive, and write-caching will be
- irrelevant.
-
- >For the DOS user, while the speed up may be great ... I have grave questions
- >regarding data reliability when the drive fails to post an error for the
- >sectors involved because the spare table overflowed. I also strongly
- >disagree with drive-based automatic remapping, since an overloaded power
- >supply or a power supply going out of regulation will create excessive soft
- >errors which will trigger unnecessary remapping.
-
- First, I've only seen spurious drive errors once in my entire time
- working with HD's (when the air conditioning went out and it hit ~110F -
- without AC my office roasts), and the drive quickly shut down totally.
- Second, as I mentioned, write-buffering needn't be hidden from higher layers.
- Introducing filesystem lock-points (similar in concept to trap barrier
- instructions on a CPU) would resolve all your worries about missed errors.
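A minimal sketch of that lock-point idea (all names here are hypothetical, not any shipping interface): the buffering layer may sort or combine writes freely within a batch, but a barrier fixes the order between batches, so metadata can never be reordered past the data it describes:

```python
class WriteQueue:
    """Buffered writes, free to be reordered within a batch but never
    across a barrier (the filesystem's lock-point)."""

    def __init__(self):
        self.batches = [[]]            # lists of block numbers between barriers

    def write(self, block):
        self.batches[-1].append(block)

    def barrier(self):
        """Lock-point: everything queued so far must reach the disk
        before anything queued afterwards."""
        self.batches.append([])

    def flush(self):
        # Sorting within a batch models disksort/combining; batch order
        # itself is fixed, which is what preserves consistency.
        flushed = [sorted(batch) for batch in self.batches]
        self.batches = [[]]
        return flushed

q = WriteQueue()
q.write(90); q.write(5)    # data blocks
q.barrier()                # lock-point before the metadata update
q.write(40)                # metadata block
print(q.flush())           # [[5, 90], [40]]
```

The barrier is also the natural place for the driver to report any deferred write error back to the filesystem.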
-
- I also have almost never seen errors on write; the vast majority of
- errors I see are read errors (which isn't surprising given how few people
- use verify).
-
- >Write buffering requires automatic remapping ... A good filesystem design
- >should not see any benefits from write buffering, and doesn't need/want
- >remapping. Nor do customers want random/unpredictable performance/response
- >times.
-
- Again, in the desktop market, few if any people notice what you
- call random performance.
-
- >>> Tapes are (or will be) here, and I
- >>>expect CDROMs (now partly proprietary & SCSI) to be mostly IDE & SCSI
- >>>in the future. IDE is already extending the WD1003 interface, I expect
- >>>additional drive support will follow at some point, although multiple
- >>>host adapters is a minor cost issue for many systems.
- >>
- >> There are rumbles in that direction. I'm not certain it's worth
- >>it, or that it can be compatible enough to gain any usage. Existing users
- >>who need lots of devices have no reason to switch from SCSI to IDE, and
- >>systems vendors have few reasons to spend money on lots of interfaces
- >>for devices that don't exist. The reason IDE became popular is that it was
- >>_cheap_, and no software had to be modified to use it.
- >
- >The fact is that IDE has become the storage bus of choice for low end
- >systems ... and other storage vendors will follow it to reduce interface
- >(extra slot/adapter) costs. In laptops, IDE IS THE STORAGE BUS,
- >no slots for other choices.
-
- Actually, the laptop/palmtop market seems to be moving towards PCMCIA
- slots for IO and storage. I don't disagree that in low-power and/or low-
- cost applications IDE has its place. That's why it's in our 2 lowest-
- end machines. However, as I said, I don't see significant movement to
- extend it to more drives (I hear rumbles, but see no lightning...). Certainly
- there's no rush to IDE tape drives or CDROM drives. IDE is only _just_
- starting to try to support removable media at all (pushed by Syquest and/or
- Bernoulli, I think). Again, there's no indication that this is going to
- be a major factor. It could happen; I doubt it will.
-
- >> Sounds like the old IPI vs SCSI arguments over whether smart or dumb
- >>controllers are better (which is perhaps very similar to our current
- >>discussion, with a few caveats).
- >
- >This IS VERY MUCH LIKE that discussion ... BUT about how a seemingly good
- >10 year old decision has gone bad. Given the processor and drive speeds
- >of that era ... I also supported SCSI while actively pushing for reduced
- >Command Decode times. See article by Dan Jones @ Fortune Systems 1986 I
- >think in EDN regarding SCSI performance issues ... resulting from
- >the WD1000 emulating SCSI hostadapter I did for them under contract.
-
- Could you summarize for those without large libraries close at hand?
-
- >Has IPI largely died due to a lack of volume? ... SCSI proved easier to
- >interface, lowering base system costs .... just as IDE has. Certainly
- >the standardization on SCSI by Apple after my cheap DDJ-published
- >hostadapter was a major volume factor toward the success of SCSI
- >and embedded SCSI drives. The market changed so fast after the MacPlus
- >that DTC, XEBEC, and OMTI became has-beens, even though they shaped
- >the entire market up to that point.
-
- Quite true. I don't follow IPI, but it seems to have faded from view
- at least.
-
- >>I would suggest
- >> (a) that's a highly contrived example, especially for
- >> the desktop machines that all IDE and most SCSI drives
- >> are designed for,
- >
- >For DOS you are completely right .... for any multitasking with a
- >high performance filesystem you missed the mark.
-
- You need to separate multi-tasking from multi-user. Single-user
- machines (and this includes most desktop Unix boxes) don't have the activity
- levels for the example you gave to have any relevance. It's rare that more
- than one or two files are being accessed in any given second, or even minute.
- Also, in a single-user environment, average response time becomes the
- overriding factor over total throughput.
-
- >> (b) both C-SCAN and most other disksorting algorithms have tradeoffs
- >> for the increase in total-system throughput and decrease in
- >> worst-case performance; FCFS actually performs quite well in
- >> actual use (I have some old comp.arch articles discussing this
- >> if people want to see them). The tradeoffs are usually in
- >> fairness, response time, and average performance (no straight
- >> disksort will help you until 3 requests are queued in the first
- >> place).
- >
- >While your short queue observations are quite true, your assumptions
- >stating this is the norm are quite different than mine, and are largely
- >an artifact of 1975 filesystem designs and buffering constraints.
- >In my world systems with 5-20 active users are common, with similar
- >average queue depths -- and FCFS is not an acceptable solution since it
- >blocks request locality resulting from steady state read-ahead service,
- >resulting in a significant loss of thruput (80% or more).
-
- I never said that FCFS was the best (or even acceptable) for all
- uses. Your "world" is not the desktop world.
-
- >The primary assumption of FCFS proponents is that all requests are unrelated
- >and have no bearing on the locality of future requests. In addition they
- >extrapolate response-time fairness for a given single block globally.
- >In reality, from the user's perspective, they judge such things by how
- >quickly a given task completes ... and most things that improve thruput
- >will improve task completion times .... as long as they don't create the
- >ability for some process to hog resources.
-
- I think that even in larger systems, both throughput and response
- time are important. Throughput does you no good if you have to wait 5
- seconds because some other user was loading a 10MB simulation. Again, on
- desktop machines throughput becomes even less relevant. If you can show
- something is better than FCFS in those constraints, fine.
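To make the tradeoff concrete, here is a toy comparison (textbook cylinder numbers, nothing measured) of FCFS against a single sorted sweep over the same queue. The sort wins big on total head movement, but only because eight requests happen to be queued at once, which is exactly the condition that's rare on a desktop:

```python
def seek_distance(order, start):
    """Total cylinders the head travels servicing `order` from `start`."""
    total, pos = 0, start
    for cyl in order:
        total += abs(cyl - pos)
        pos = cyl
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]    # pending requests (cylinders)

print(seek_distance(queue, start=53))          # FCFS: 640 cylinders
print(seek_distance(sorted(queue), start=53))  # one sorted sweep: 208 cylinders
```

With zero or one request queued, the two orders are identical, so no disksort helps until a backlog actually forms.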
-
- >> For most desktop machines, even if we ignore single-tasking OS's,
- >>probably 99.44% of the time when disk requests occur there are no other
- >>currently active requests.
- >
- >Any strategy that offers performance gains under load can not be
- >dismissed out of hand. Especially RISC systems that completely out run
- >the disk subsystems. If your company is happy with slow single request
- >single process filesystems and hardware ... so be it, but to generalize
- >it to the rest of the market is folly. There are better filesystem designs
- >that do not produce this profile on even a single user, single process
- >desktop box.
-
- Certainly they shouldn't be "dismissed out of hand". However,
- recognition of the costs and complexity involved should be there also.
- Filesystems are complex beasts, and that complexity can lead to high
- maintenance costs. Things that increase the complexity of an already
- complex item for performance gains that are only rarely needed for the
- application are suspect. Also, adding something to an already complex
- object can be more costly than adding it to a simple object, because the
- complexities interact. An example: the Amiga FS is built around coroutines
- (which should give a good clue as to when its base was designed...),
- basically one per active filehandle, plus others as needed. This alone makes
- it very hard to modify; I have to take time to absorb the design every time
- before I can work on it, because of the complexity. Anything that added
- significant complexity to it would be a nightmare (rewriting from scratch
- is not an option, though I'd love to throw it out and start again...). This
- makes it easier to add mapping to either the driver or the device itself.
-
- >If this single user is going to run all the requests thru the cache
- >anyway ... why not help it up front ... and queue a significant amount
- >of the I/O on open or first reference. There are a few files where this
- >is not true ... such as append-only log files ... but there are clues
- >that can be taken.
-
- Why should the single-user have to run everything through a cache?
- I think direct-DMA works very nicely for a single-user environment, especially
- if you don't own stock in RAM vendors (as the current maintainers of Unix,
- OS/2, and WindowsNT seem to). Reserve the buffers for things that are likely
- to be requested again or soon - most sequential reads of files are not re-used.
-
- >>>IDE drives should support the WD1010 SCAN_ID command allowing the driver
- >>>locate current head position ... no such SCSI feature exists.
-
- >The key word was should .... it was part of the WD1010 chipset that formed
- >the WD1003WHA interface standard. Again, if nobody makes use of a feature,
- >it is optional ... I make the argument that it has significant value.
-
- It also has significant cost for them to support, especially since
- they try to present an interface that looks like an old ST-506 MFM disk to
- the system. 99% of them already remap, since few of them have 17 sectors
- per track any more, not even counting zone recording or sparing.
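A toy illustration of that translation layer (both geometries invented for the example): the drive advertises a fake 17-sector, ST-506-style geometry to the host and internally remaps each request onto its real layout, which is why "current head position" has no stable meaning to the host:

```python
LOGICAL = dict(heads=4, sectors=17)    # geometry the host is shown
PHYSICAL = dict(heads=2, sectors=34)   # what the platters actually have

def chs_to_lba(cyl, head, sector, geom):
    """Standard CHS-to-LBA formula; sectors are 1-based."""
    return (cyl * geom["heads"] + head) * geom["sectors"] + (sector - 1)

def lba_to_chs(lba, geom):
    """Inverse mapping: linear block number back to cylinder/head/sector."""
    cyl, rem = divmod(lba, geom["heads"] * geom["sectors"])
    head, sec = divmod(rem, geom["sectors"])
    return cyl, head, sec + 1

# Host asks for logical C/H/S (3, 2, 5); the drive services it from a
# completely different physical location.
lba = chs_to_lba(3, 2, 5, LOGICAL)
print(lba, lba_to_chs(lba, PHYSICAL))   # 242 (3, 1, 5)
```

Real firmware adds zone recording and sparing on top of this, making the host's view even further removed from the media.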
-
- >>change, performance/cost ratios change, etc. I also think your opinions
- >>come from a large, heavy-io multi-user systems background; which while
- >>interesting for those working with such systems is pretty irrelevant to
- >>the desktop systems that IDE and (most of) SCSI are designed for.
- >
- >My background and perspective run the full range, single-user single-process
- >desktops to multiuser multiprocess multiprocessor servers. Clinging to old
- >filesystem designs is an option, but as I outlined ... your assumptions
- >and conclusions are vastly in conflict with designing storage systems
- >capable of servicing 486/RISC desktop machines ... and certainly in
- >conflict with the needs of fileservers and multiuser application engines
- >found in 100,000's of point-of-sale systems managing inventory and register
- >scanners in each sales island -- Sears, Kmart, every supermarket, autoparts
- >stores and a growing number of restaurants and fast food stores.
-
- Those are interesting areas, but those are not desktop markets,
- and those are not single-user markets. Those are large, multi-user (from the
- point of the server) transaction and database server systems.
-
- >Your perspective on design tradeoffs for the Amiga home market may well
- >be correct ... but is in conflict with the larger commercial markets that
- >are the focus of the technologies I have discussed here and elsewhere.
-
- I made the mistake of trying to keep the discussion relevant to the
- original topic. If you had been more clear that your interests and comments
- were directed elsewhere, there would have been less confusion.
-
- >Ignore what your system does ... what CAN IT DO if things were better?
- >
- >Again ... Performance comes by DESIGN, not tuning, selection, or
- >non-reflective evolution. Too much performance engineering is making do
- >with existing poor designs ... instead of designing how it should be.
-
- True, but one rarely gets to re-do an existing design. If you're
- lucky, you get to occasionally redo a part of it, or spend a bit of time
- smoothing out one aspect of it. Most companies can't afford to take something
- that is sub-optimal but functional and start again from scratch. Look at
- the glorified program loader called MSDOS for an example of inertia. Even
- if I agreed that XYZ filesystem would be a major improvement for desktop
- systems (such as the Amiga), I would have to weigh the many months of work
- it takes to write and debug a full filesystem against both its gains AND
- what other things I could do in that time elsewhere. If the gains will only
- be seen by .1% of the users, .1% of the time, I'm not going to spend _any_
- time on it. Call it the FS version of the RISC philosophy - keep things simple.
-
- Sometimes good enough is all that's needed.
-
- --
- To be or not to be = 0xff
- -
- Randell Jesup, Jack-of-quite-a-few-trades, Commodore Engineering.
- {uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com BIX: rjesup
- Disclaimer: Nothing I say is anything other than my personal opinion.
-