- Xref: sparky comp.benchmarks:1673 comp.arch.storage:767
- Newsgroups: comp.benchmarks,comp.arch.storage
- Path: sparky!uunet!ukma!darwin.sura.net!sgiblab!sgigate!sgi!igor!jbass
- From: jbass@igor.tamri.com (John Bass)
- Subject: Re: Disk performance issues, was IDE vs SCSI-2 using iozone
- Message-ID: <1992Nov12.193308.20297@igor.tamri.com>
- Organization: TOSHIBA America MRI, South San Francisco, CA
- References: <1992Nov11.064154.17204@fasttech.com> <1992Nov11.210749.3953@igor.tamri.com> <36995@cbmvax.commodore.com>
- Date: Thu, 12 Nov 92 19:33:08 GMT
- Lines: 303
-
-
- I've gotten a lot of wonderful mail on this series of postings; the few
- detractors make arguments similar to Randell's, so I will rebut his
- public position a little more strongly than I otherwise would.
-
- jesup@cbmvax.commodore.com (Randell Jesup) writes:
- > You're making the assumption that IDE doesn't hide that from you as
- >well. It does. Some (many?) current IDE drives use the cylinder/head/sector
- >registers as merely a convoluted way to specify a block number, use zone
- >recording, etc. I strongly suspect that this will continue, as the benefits
- >of zone recording, sector replacement, etc are too large to ignore (a number
- >of filesystems (most?) require that the device drivers present them with a
- >perfect media, no unusable blocks. This requires remapping of bad blocks in
- >the disk controller (SCSI or IDE) or in the device driver itself. Usually the
- >controller has a better chance to do a good job at this.)
- > ...
- > There's a big philosophical argument over who should deal with media
- >problems. The consensus seems to be that they should be pushed into the
- >lowest level possible (fs->driver->host controller->drive controller). Having
- >written both filesystems and device drivers, I must agree with that (and yes
- >I've had to implement bad-block mapping at the driver level, such as for IDE).
-
- I didn't assume so; in fact I bitched about drives that refuse to turn it
- off and don't allow the driver/filesystem to deal directly with the unmapped
- zones. Remapping is certainly an issue to debate ... while it makes
- filesystems and drivers a little easier (a minor development saving in
- a relatively complex filesystem/driver design) ... as I mentioned, the
- performance cost for mid-to-high performance systems is too high on
- two counts:
-
- 1) The remapping task can be performed more quickly by any
- 386/486/RISC processor ... the on-drive micros are slow.
-
- 2) Substantial in-field performance anomalies where critical
- data lands in remapped blocks ... what if this was a benchmark
- evaluation purchase and the customer based a $50M order on its
- performance? What about the ordinary customer who just has to
- live with its poor behaviour?
-
- I don't understand why "the controller has a better chance to do a good job
- at this" ... my position was that remapping should be done in the filesystem,
- so that the bad blocks would NEVER be allocated and would never need
- remapping. This IS a major divergence in thought from current practice in
- UNIX ... not at all for DOS, which has always managed bad block info in the
- filesystem.
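-
- As a minimal sketch of what I mean (my illustration, not anyone's
- production code): fold the vendor defect list into the allocation bitmap
- when the filesystem is made, and a bad block looks like any other
- allocated block -- the question of run-time remapping never comes up.
-
-     /* Minimal sketch: bad blocks pre-marked "in use" at mkfs time. */
-     #include <stdio.h>
-
-     #define NBLOCKS 4096
-     static unsigned char freemap[NBLOCKS / 8]; /* 1 bit/block; 1 = in use */
-
-     static void mark_used(unsigned b) { freemap[b >> 3] |= 1 << (b & 7); }
-     static int  is_used(unsigned b) { return freemap[b >> 3] & (1 << (b & 7)); }
-
-     /* mkfs time: absorb the media defect list into the bitmap. */
-     static void absorb_defects(const unsigned *bad, int nbad)
-     {
-         for (int i = 0; i < nbad; i++)
-             mark_used(bad[i]);
-     }
-
-     /* Allocator: a bad block is indistinguishable from an allocated one,
-      * so it can NEVER be handed out and never needs remapping. */
-     static int alloc_block(void)
-     {
-         for (unsigned b = 0; b < NBLOCKS; b++)
-             if (!is_used(b)) { mark_used(b); return (int)b; }
-         return -1;                           /* filesystem full */
-     }
-
-     int main(void)
-     {
-         unsigned defects[] = { 7, 8, 100 };  /* hypothetical defect list */
-         absorb_defects(defects, 3);
-         printf("first allocation: block %d\n", alloc_block());
-         return 0;
-     }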
-
- For WD1003 type controllers the long accepted practice was to flag each
- sector/track bad at low level format, so that when DOS format or SCO badtrk
- scanned the media it was ASSURED of finding the area bad and marking it so.
- From a performance point of view (the one I consistently take) this is vastly
- better!
-
- I have also written filesystems and drivers, and I take a STRONGLY different
- stand ... given the DOS filesystem's handling of bad blocks, I hardly
- consider it a consensus to do it in the drive, although many software
- guys would like it there to make their job simpler. I strongly prefer
- the ability to have the drive present vendor defects with the IDs marked
- as bad.
-
- Nor would I like to be the customer who has his FAT or root inode over
- a remapped sector.
-
- >>> Once per sector? Don't PC's use the ReadMultiple/WriteMultiple
- >>>commands? I guess not (which matches what I've heard elsewhere). Our IDE
- >>Yes, Yes ... the interrupt for WD1003/IDE interfaces means the 512 byte sector
- >>buffer is full, and must be emptied. R/W Multiple are used, but it requires
- >>handling a transfer request interrupt for each sector, or busy waiting on
- >>data_request in the command status register ... hence poor man's disconnect
- >>from the processor bus.
- >
- > I think you're confused. The CAM-ATA spec (and all the IDE drives I've
- >played with) says that when read/write Multiple is used (with SetMultiple),
- >you get 1 interrupt per N sectors. From CAM-ATA rev 2.3:
- >
- > 9.12 Read Multiple Command
- >
- > The Read Multiple command performs similarly to the Read Sectors command.
- > Interrupts are not generated on every sector, but on the transfer of a block
- > which contains the number of sectors defined by a Set Multiple command.
-
- Sorry, but on PC's the WD1010's BRDY is tied to IRQ, and does generate
- an interrupt per sector -- EVEN WITH IDE .... This discussion started out as
- an IDE vs SCSI comparison from a PC point of view ... you need to keep that
- in mind ... since it may not be implemented that way on your Amiga.
-
- This is a host adapter issue, and hopefully will go away in the future as
- cheap DMA host adapters become available.
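-
- For those who haven't written one of these drivers, a rough sketch of the
- per-sector service loop (register access is stubbed so the sketch stands
- alone; on a real PC AT these would be port reads from 1F0-1F7 hex):
-
-     #include <stdint.h>
-     #include <string.h>
-
-     #define STATUS_BSY 0x80              /* drive busy */
-     #define STATUS_DRQ 0x08              /* sector buffer needs service */
-
-     /* Stubs standing in for inb(0x1F7) and insw(0x1F0). */
-     static uint8_t read_status(void) { return STATUS_DRQ; }
-     static void read_data(uint16_t *buf, int nw) { memset(buf, 0, nw * 2); }
-
-     /* READ SECTORS on a WD1003-style interface: the host must drain the
-      * 512-byte sector buffer once per sector, either on IRQ14 or by
-      * busy-waiting on DRQ as here -- the "poor man's disconnect". */
-     void pio_read(uint16_t *dst, int sectors)
-     {
-         for (int s = 0; s < sectors; s++) {
-             while (read_status() & STATUS_BSY)
-                 ;                        /* wait out the drive */
-             while (!(read_status() & STATUS_DRQ))
-                 ;                        /* wait for the buffer */
-             read_data(dst + s * 256, 256);   /* 256 words = 512 bytes */
-         }
-     }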
-
- > Yes, write-buffering does lose some error recovery chances, especially
- >if there's no higher-level knowledge of possible write-buffering so
- >filesystems can insert lock-points to retain consistency. However, it can be
- >a vast speed improvement. It all depends on your (the user's) needs. Some
- >can easily live with it, some can't, some need raid arrays and UPS's.
-
- It is only a vast speed improvement for single-block filesystem designs ...
- any design which combines requests into a single I/O will not see such an
- improvement ... log-structured filesystems are a good modern example.
- It certainly has no such effect on my current filesystem design.
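-
- A sketch of the point about combining (my illustration): a filesystem
- that clusters adjacent dirty blocks into one transfer leaves the drive's
- write buffer nothing useful to do.
-
-     #include <stdio.h>
-
-     /* Coalesce a sorted list of dirty block numbers into contiguous
-      * (start, count) runs, each issued as a single I/O request. */
-     void cluster_writes(const int *dirty, int n)
-     {
-         int i = 0;
-         while (i < n) {
-             int start = dirty[i], len = 1;
-             while (i + len < n && dirty[i + len] == start + len)
-                 len++;                   /* extend the contiguous run */
-             printf("one I/O: blocks %d..%d\n", start, start + len - 1);
-             i += len;
-         }
-     }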
-
- For the DOS user, while the speedup may be great ... I have grave questions
- regarding data reliability when the drive fails to post an error for the
- sectors involved because the spare table overflowed. I also strongly
- disagree with drive-based automatic remapping, since an overloaded power
- supply, or a power supply going out of regulation, will create excessive
- soft errors which trigger unnecessary remapping. When it was demanded by the
- powers that be at Fortune Systems ... we put it in ... only to take it out
- when Field Service grew tired of the bad block tables overflowing, and of
- taking a big loss on good drives being returned for "excessive bad blocks"
- as the result of normal (or abnormal) soft error rates due to other factors.
-
- Write buffering requires automatic remapping ... A good filesystem design
- should not see any benefits from write buffering, and doesn't need/want
- remapping. Nor do customers want random/unpredictable performance/response
- times.
-
- >
- >> Tapes are (or will be) here, and I
- >>expect CDROMS (now partly proprietary & SCSI) to be mostly IDE & SCSI
- >>in the future. IDE is already extending the WD1003 interface, I expect
- >additional drive support will follow at some point, although multiple
- >>hostadapters is a minor cost issue for many systems.
- >
- > There are rumbles in that direction. I'm not certain it's worth
- >it, or that it can be compatible enough to gain any usage. Existing users
- >who need lots of devices have no reason to switch from SCSI to IDE, and
- >systems vendors have few reasons to spend money on lots of interfaces
- >for devices that don't exist. The reason IDE became popular is that it was
- >_cheap_, and no software had to be modified to use it.
-
- The fact is that IDE has become the storage bus of choice for low end
- systems ... and other storage vendors will follow it to reduce interface
- (extra slot/adapter) costs. In laptops, IDE IS THE STORAGE BUS,
- no slots for other choices.
-
-
- I combined both his postings into a single reply.....
-
- > Sounds like the old IPI vs SCSI arguments over whether smart or dumb
- >controllers are better (which is perhaps very similar to our current
- >discussion, with a few caveats).
-
- This IS VERY MUCH LIKE that discussion ... BUT about how a seemingly good
- 10-year-old decision has gone bad. Given the processor and drive speeds
- of that era ... I also supported SCSI while actively pushing for reduced
- command decode times. See the article by Dan Jones @ Fortune Systems, 1986 I
- think, in EDN regarding SCSI performance issues ... resulting from
- the WD1000-emulating SCSI host adapter I did for them under contract.
-
- Hasn't IPI largely died due to a lack of volume? ... SCSI proved easier to
- interface, lowering base system costs .... just as IDE has. Certainly
- Apple's standardization on SCSI after my cheap DDJ-published
- host adapter was a major volume factor in the success of SCSI
- and embedded SCSI drives. The market changed so fast after the MacPlus
- that DTC, XEBEC, and OMTI became has-beens, even though they had shaped
- the entire market up to that point.
-
- >I would suggest
- > (a) that's a highly contrived example, especially for
- > the desktop machines that all IDE and most SCSI drives
- > are designed for,
-
- For DOS you are completely right .... for any multitasking system with a
- high-performance filesystem you missed the mark.
-
- > (b) both C-SCAN and most other disksorting algorithms have tradeoffs
- > for the increase in total-system throughput and decrease in
- > worst-case performance; FCFS actually performs quite well in
- > actual use (I have some old comp.arch articles discussing this
- > if people want to see them). The tradeoffs are usually in
- > fairness, response time, and average performance (no straight
- > disksort will help you until 3 requests are queued in the first
- > place).
-
- While your short-queue observations are quite true, your assumption
- that this is the norm is quite different from mine, and is largely
- an artifact of 1975 filesystem designs and buffering constraints.
- In my world, systems with 5-20 active users are common, with similar
- average queue depths -- and FCFS is not an acceptable solution, since it
- breaks up the request locality produced by steady-state read-ahead
- service, costing a significant amount of thruput (80% or more).
-
- The primary assumption of FCFS proponents is that all requests are unrelated
- and have no bearing on the locality of future requests. In addition, they
- extrapolate single-block response-time fairness globally. In reality, users
- judge fairness by how quickly a given task completes ... and most things
- that improve thruput will improve task completion times .... as long as
- they don't create the ability for some process to hog resources.
-
- As such, windowed CSCAN (cyl/trk) in the reverse order of file block
- allocation gives you the best of both worlds ... the ability to gain
- localized burst behaviour during read-ahead, as long as the application
- has low enough cpu requirements and can process the blocks ... plus forced
- breaks to round-robin the queue at each window boundary. I first presented
- this issue at the Santa Monica USENIX conference in the late 70's, and have
- made it a key point in numerous other performance presentations at other
- conferences and lectures.
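-
- A much-simplified sketch of the windowed sort (mine, for illustration --
- a real driver re-merges new arrivals at each window boundary instead of
- printing):
-
-     #include <stdio.h>
-     #include <stdlib.h>
-
-     struct req { int cyl; int blkno; };
-
-     static int by_cyl(const void *a, const void *b)
-     {
-         return ((const struct req *)a)->cyl - ((const struct req *)b)->cyl;
-     }
-
-     #define WINDOW 8     /* requests per segment -- an assumed tuning */
-
-     void dispatch(struct req *q, int n)
-     {
-         qsort(q, n, sizeof *q, by_cyl);  /* one C-SCAN sweep order */
-         for (int i = 0; i < n; i += WINDOW) {
-             int end = i + WINDOW < n ? i + WINDOW : n;
-             for (int j = i; j < end; j++)
-                 printf("issue cyl %d blk %d\n", q[j].cyl, q[j].blkno);
-             /* window boundary: forced break -- round-robin the queue and
-              * merge newly arrived requests before the next segment */
-         }
-     }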
-
- In addition, my current filesystem design, and the XENIX work I did for SCO,
- both attempt to read ahead entire files of moderate length. Single-block
- read-ahead doesn't allow for improved scheduling ... burst read-ahead
- does, and uniformly gets the task completed quicker.
-
- > (c) Your example also depends on very fast track-to-track stepping
- > times AND very high locality of the requests (within your fast
- > stepping distance), and becomes less relevant on drives with
- > large numbers of cylinders and fast spin times.
-
- Or drives with steppers or dedicated servo that have more than one head.
- The larger the number of heads, the bigger the gain. In practice any drive
- with multiple heads, or a track-to-track seek time less than 1/2 the rev
- time, will service a minimum of 15%-30% additional requests per second under
- normal UNIX timesharing loads ... with a potential of several hundred percent
- on fast-seek and/or many-headed drives.
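-
- (To put rough numbers on the 1/2-rev rule -- my arithmetic, not from any
- spec sheet: at 3600 RPM a revolution is about 16.7 ms, so the average
- rotational wait is about 8.3 ms; any head switch or track-to-track seek
- cheaper than that picks up a sorted neighboring request faster, on
- average, than FCFS waits out the rotation.)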
-
- Any facility that uses a disk reorg utility that provides locality
- will benefit greatly .... in the case of my filesystem, which does
- active data migration and maintains a high degree of locality, the
- results are VERY significant.
-
- >
- > For most desktop machines, even if we ignore single-tasking OS's,
- >probably 99.44% of the time when disk requests occur there are no other
- >currently active requests.
-
- Any strategy that offers performance gains under load cannot be
- dismissed out of hand, especially on RISC systems that completely outrun
- the disk subsystems. If your company is happy with slow single-request,
- single-process filesystems and hardware ... so be it, but to generalize
- that to the rest of the market is folly. There are better filesystem designs
- that do not produce this profile on even a single-user, single-process
- desktop box.
-
- If this single user is going to run all the requests thru the cache
- anyway ... why not help it up front ... and queue a significant amount
- of the I/O on open or first reference. There are a few files where this
- is not true ... such as append-only log files ... but there are clues
- that can be taken.
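-
- In sketch form (the queue_read() interface here is hypothetical, just to
- show the shape of the idea):
-
-     #include <stdio.h>
-
-     #define RA_MAX_BLOCKS 64    /* "moderate length" cutoff -- assumed */
-
-     struct inode { int nblocks; int blklist[RA_MAX_BLOCKS]; };
-
-     /* Stub for an async request queue; a real one feeds the disk sort. */
-     static void queue_read(int blkno) { printf("queued blk %d\n", blkno); }
-
-     /* On open or first reference, queue the file's whole block list so
-      * the scheduler sees the burst at once, not one block at a time. */
-     void open_readahead(const struct inode *ip)
-     {
-         if (ip->nblocks <= RA_MAX_BLOCKS)   /* skip huge/append files */
-             for (int i = 0; i < ip->nblocks; i++)
-                 queue_read(ip->blklist[i]);
-     }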
-
- >>In addition, if the filesystem were then presented with storing a 9K
- >>file it would use 12/3/4 to 12/3/12 (best fit nearest active region).
- >>A big win over using 10/0/[2,5,8], 10/3/[4,8], 10/4/6-7, and 11/0/8 as
- >>convential filesystem with a bitmap would tend to allocate.
- >
- > Any best-fit will produce better read and rewrite performance over
- >a first-fit, bitmapped or not, at the potential cost of increased fragmentation
- >and slower block allocation (especially if the bitmap or free-list isn't all
- >kept in memory at once).
-
- I argue that, given the distribution of file sizes and lifetimes,
- my filesystem model shows otherwise ... decreased fragmentation, a slight CPU
- cost (0.2-1.5%), and a significant reduction in disk requests -- the primary
- resource to optimize. Much of the decreased fragmentation comes from a
- secondary effect of freeing contiguous files ... more contiguous free space.
- Allocating fragmented files .... promotes fragmented free space. Again, the
- common wisdom from studies on resource allocators fails to take these side
- effects into account .... their models don't match this usage, and it is an
- error to extrapolate their results to all systems/applications.
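-
- A simplified sketch of such an allocator (my illustration -- the real
- thing keeps the extent list sorted, and the tie-break policy is tunable):
-
-     #include <stdlib.h>
-
-     struct extent { int start, len; };
-
-     /* Best fit, nearest active region: among free extents that can hold
-      * the whole file contiguously, prefer the tightest fit, breaking
-      * ties by distance from 'near' (the current active region). */
-     int best_fit(const struct extent *fl, int n, int want, int near)
-     {
-         int best = -1;
-         long bestkey = 0;
-         for (int i = 0; i < n; i++) {
-             if (fl[i].len < want)
-                 continue;                /* can't keep the file whole */
-             long waste = fl[i].len - want;
-             long dist  = labs((long)fl[i].start - near);
-             long key   = waste * 1000000L + dist;  /* fit, then distance */
-             if (best < 0 || key < bestkey) { best = i; bestkey = key; }
-         }
-         return best;                     /* extent index, or -1: none fits */
-     }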
-
- >>IDE drives should support the WD1010 SCAN_ID command, allowing the driver
- >>to locate the current head position ... no such SCSI feature exists.
- >
- > No such IDE feature exists. This _was_ a discussion of IDE vs. SCSI.
- >As far as I know, no one has even proposed such a feature to the CAM-ATA
- >committee, let alone implemented it. The only positional information is the
- >index bit in the status register (I suspect many drives just leave it 0).
-
- The key word was should .... it was part of the WD1010 chipset that formed
- the WD1003WHA interface standard. Again, if nobody makes use of a feature,
- it becomes optional ... I make the argument that it has significant value.
-
- >>Given 1974-1986 hardware, most of the current filesystem design issues
- >>were correct .... to just OK. Given 1992-1995 hardware, the same tradeoffs
- >>are mostly WRONG. Performance comes by DESIGN, not tuning, selection, or
- >>non-reflective evolution. Too much performance engineering is making do with
- >>existing poor designs ... instead of designing how it should be.
- >
- > While your statement is correct, I think you're guilty here also.
- >Just because method/hack/whatever was a good choice for V7 Unix running RPS
- >drives, doesn't mean that that approach is (as) effective today. Technologies
- >change, performance/cost ratios change, etc. I also think your opinions
- >come from a large, heavy-io multi-user systems background; which while
- >interesting for those working with such systems is pretty irrelevant to
- >the desktop systems that IDE and (most of) SCSI are designed for.
-
- My background and perspective run the full range, from single-user,
- single-process desktops to multiuser, multiprocess, multiprocessor servers.
- Clinging to old filesystem designs is an option, but as I outlined ... your
- assumptions and conclusions are vastly in conflict with designing storage
- systems capable of servicing 486/RISC desktop machines ... and certainly in
- conflict with the needs of fileservers and the multiuser applications engines
- found in 100,000's of point-of-sale systems managing inventory and register
- scanners in each sales island -- Sears, Kmart, every supermarket, autoparts
- stores, and a growing number of restaurants and fast food stores.
-
- Your perspective on design tradeoffs for the Amiga home market may well
- be correct ... but it is in conflict with the larger commercial markets that
- are the focus of the technologies I have discussed here and elsewhere.
-
- Ignore what your system does ... what COULD IT DO if things were better?
-
- Again ... Performance comes by DESIGN, not tuning, selection, or
- non-reflective evolution. Too much performance engineering is making do
- with existing poor designs ... instead of designing how it should be.
-