-
- Mini_HOWTO: Multi Disk System Tuning
-
- Version 0.6
- Date 960806
- By Stein Gjoen <sgjoen@nyx.net>
-
- This document was written for two reasons, mainly because I got hold
- of 3 old SCSI disks to set up my Linux system on and I was pondering
- how best to utilise the inherent possibilities of parallelizing in a
- SCSI system. Secondly I hear there is a prize for people who write
- docs...
-
- This is intended to be read in conjunction with the Linux File System
- Standard (FSSTND). It does not in any way replace it, but tries to
- suggest where physically to place the directories detailed in the
- FSSTND, in terms of drives, partitions, types, RAID, file system (fs),
- physical sizes and other parameters that should be considered and
- tuned in a Linux system, ranging from single home systems to large
- servers on the Internet.
-
- This is also a learning experience for myself and I hope I can start
- the ball rolling with this Mini-HOWTO and that it perhaps can evolve
- into a larger more detailed and hopefully even more correct HOWTO.
- Notes in square brackets indicate where I need more information.
-
- Note that this is a guide on how to design and map logical partitions
- onto multiple disks and tune for performance and reliability, NOT how
- to actually partition the disks or format them - yet.
-
- This is the third update, still without much in the way of inputs...
- Nevertheless this mini-HOWTO seems to be growing regardless and I
- expect I will have to turn this into a fully fledged HOWTO one of
- these days, I just need to learn the format.
-
- Hot news: I have been asked to add information on physical storage
- media as well as partitioning and make it all into a full sized
- HOWTO. This will take a bit of time to complete, which means that,
- other than bug fixes, this mini-HOWTO will not be updated much
- until the new HOWTO is ready. In the meantime I will of course still
- be interested in feedback.
-
- More news: there has been a fair bit of interest in new kinds of
- file systems in the comp.os.linux newsgroups, in particular
- logging, journaling and inherited file systems. Watch out for
- updates.
-
- The latest version number of this document can be gleaned from my
- plan entry if you do "finger sgjoen@nox.nyx.net"
-
-
- In this version I have the pleasure of acknowledging even more people
- who have contributed in one way or another:
-
- ronnej@ucs.orst.edu
- cm@kukuruz.ping.at
- armbru@pond.sub.org
- nakano@apm.seikei.ac.jp (who is also doing the Japanese translation)
- R.P.Blake@open.ac.uk
- neuffer@goofy.zdv.Uni-Mainz.de
- sjmudd@phoenix.ea4els.ampr.org
-
- Not many still, so please read through this document, make a contribution
- and join the elite. If I have forgotten anyone, please let me know.
-
- So let's cut to the chase, where swap and /tmp are racing along the
- hard drive...
-
- ---------------------------------------------------------------
-
- 1. Considerations
-
- The starting point in this will be to consider where you are and what
- you want to do. The typical home system starts out with existing
- hardware and the newly converted Linux user will want to get the most
- out of existing hardware. Someone setting up a new system for a
- specific purpose (such as an Internet provider) will instead have to
- consider what the goal is and buy accordingly. Being ambitious I will
- try to cover the entire range.
-
- Various purposes will also have different requirements regarding file
- system placement on the drives; to give an example, a large multiuser
- machine would probably be best off with the /home directory on a
- separate disk.
-
- In general, for performance it is advantageous to split most things
- over as many disks as possible but there is a limited number of
- devices that can live on a SCSI bus and cost is naturally also a
- factor. Equally important, file system maintenance becomes more
- complicated as the number of partitions and physical drives increases.
-
-
- 1.1 File system features
-
- The various parts of the FSSTND have different requirements regarding
- speed, reliability and size; for instance, losing root is a pain but
- is easily recovered, while losing /var/spool/mail is a rather
- different issue. Here is a quick summary of some essential parts and
- their properties and requirements. [This REALLY needs some beefing up]:
-
- 1.1.1 Swap
-
- Speed: Maximum! Though if you rely too much on swap you
- should consider buying some more RAM.
-
- Size: Similar to the amount of RAM. Quick and dirty algorithm,
- just as for tea: 16M for the machine and 2M for each user. The
- smallest kernels run in 1M but that is tight; use 4M for general work
- and light applications, 8M for X11 or GCC, or 16M to be comfortable.
- [The author is known to brew a rather powerful cuppa tea...]
-
- Some suggest that swapspace should be 1-2 times the size of the
- RAM, pointing out that the locality of the programs determines how
- effective your added swapspace is. Note that using the same
- algorithm as for 4BSD is slightly incorrect as Linux does not
- allocate space for pages in core [More on this is coming soon].
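- The quick and dirty rule above can be written down as a tiny script.
- This is just my 16M-plus-2M-per-user heuristic, not a hard rule, and
- the function name is my own invention:

```shell
#!/bin/sh
# Suggested swap size in MB from the quick and dirty rule above:
# 16 MB for the machine plus 2 MB for each user.
swap_mb() {
    users=$1
    echo $((16 + 2 * users))
}

swap_mb 10    # a 10-user machine: prints 36
```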
-
- Reliability: Medium. When it fails you know it pretty quickly and
- failure will cost you some lost work. You save often, don't you?
-
- Note 1: Linux offers the possibility of interleaved swapping
- across multiple devices, a feature that can gain you much. Check out
- "man 8 swapon" for more details. However, software RAID across
- multiple devices adds more overhead than it gains you.
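- As a sketch, interleaving is enabled by giving the swap areas equal
- priority. The device names below are made up, and support for the
- pri= option depends on your version of the swapon utilities; see
- "man 8 swapon" for what your system accepts:

```
# /etc/fstab: two swap partitions on different drives with equal
# priority, so the kernel interleaves pages across both.
/dev/sda2   none   swap   sw,pri=1   0 0
/dev/sdb2   none   swap   sw,pri=1   0 0
```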
-
- Note 2: Some people use a RAM disk for swapping or for other
- file systems. However, unless you have some very unusual requirements
- or setups you are unlikely to gain much from this, as it cuts into
- the memory available for caching and buffering.
-
- 1.1.2 /tmp and /var/tmp
-
- Speed: Very high. On a separate disk/partition this will
- reduce fragmentation generally, though ext2fs handles fragmentation
- rather well.
-
- Size: Hard to tell. Small systems are easy to run with just
- a few MB, but these are notorious hiding places for stashing files
- away from prying eyes and quota enforcement, and can grow without
- control on larger machines. Suggested: small machine: 8M; large
- machines up to 500M. (The machine used by the author at work has 1100
- users and a 300M /tmp directory.)
-
- Reliability: Low. Often programs will warn or fail gracefully when
- these areas fail or are filled up. Random file errors will of course
- be more serious, no matter what file area this is.
-
- (* That was 50 lines, I am home and dry! *)
-
- 1.1.3 Spool areas (/var/spool/news, /var/spool/mail)
-
- Speed: High, especially on large news servers. News transfer
- and expiring are disk intensive and will benefit from fast drives.
- Print spools: low. Consider RAID0 for news.
-
- Size: For news/mail servers: whatever you can afford. For
- single user systems a few MB will be sufficient if you read
- continuously. Joining a list server and then taking a holiday is, on
- the other hand, not a good idea. (Again, the machine I use at work
- has 100M reserved for the entire /var/spool.)
-
- Reliability: Mail: very high, news: medium, print spool: low. If
- your mail is very important (isn't it always?) consider RAID for
- reliability. [Is mail spool failure frequent? I have never experienced
- it but there are people catering to this market of reliability...]
-
- Note: Some of the news documentation suggests putting all
- the .overview files on a drive separate from the news files, check out
- all news FAQs for more information.
-
- 1.1.4 Home directories (/home)
-
- Speed: Medium. Although many programs use /tmp for temporary
- storage, others, such as some newsreaders, frequently update files in
- the home directory, which can be noticeable on large multiuser
- systems. For small systems this is not a critical issue.
-
- Size: Tricky! On some systems people pay for storage, so this
- then usually becomes a question of finance. Large systems such as nyx.net
- (which is a free Internet service with mail, news and WWW services)
- run successfully with a suggested limit of 100K per user and 300K as
- max. If however you are writing books or are doing design work the
- requirements balloon quickly.
-
- Reliability: Variable. Losing /home on a single user machine is
- annoying but when 2000 users call you to tell you their home
- directories are gone it is more than just annoying. For some their
- livelihood relies on what is here. You do regular backups of course?
-
- Note: You might consider RAID for either speed or
- reliability. If you want extremely high speed and reliability you
- might be looking at other OSes and platforms anyway. (Fault tolerance
- etc.)
-
- 1.1.5 Main binaries ( /usr/bin and /usr/local/bin)
-
- Speed: Low. Often the data is bigger than the programs, which
- are demand-loaded anyway, so this is not speed critical. Witness the
- success of live file systems on CD-ROM.
-
- Size: The sky is the limit but 200M should give you most of
- what you want for a comprehensive system. (The machine I use, including
- the libraries, uses about 800M)
-
- Reliability: Low. This is usually mounted under root where all
- the essentials are collected. Nevertheless losing all the binaries is
- a pain...
-
- 1.1.6 Libraries ( /usr/lib and /usr/local/lib)
-
- Speed: Medium. These are large chunks of data loaded often,
- ranging from object files to fonts, all susceptible to bloating. Often
- these are also loaded in their entirety and speed is of some use here.
-
- Size: Variable. This is for instance where word processors
- store their immense font files. The few that have given me feedback on
- this report about 70M in their various lib directories. Some of the
- largest disk hogs are GCC, Emacs, TeX/LaTeX, X11 and Perl.
-
- Reliability: Low. See point 1.1.5
-
- 1.1.7 Root
-
- Speed: Quite low: only the bare minimum is here, much of
- which is only run at startup time.
-
- Size: Relatively small. However it is a good idea to keep
- some essential rescue files and utilities on the root partition, and
- some people keep several kernel versions there. Feedback suggests
- about 20M would be sufficient.
-
-
- Reliability: High. A failure here will possibly cause a bit of grief,
- and rescuing your boot partition can take a little time. Naturally
- you do have a rescue disk?
-
-
- 1.2 Explanation of terms
-
- Naturally the faster the better, but often the happy installer of
- Linux has several disks of varying speed and reliability, so even
- though this document describes performance as 'fast' and 'slow' it is
- just a rough guide, since no finer granularity is feasible. Even so
- there are a few details that should be kept in mind:
-
- 1.2.1 Speed
-
- This is really a rather woolly mix of several terms: CPU load,
- transfer setup overhead, disk seek time and transfer rate. It is in
- the very nature of tuning that there is no fixed optimum, and in most
- cases price is the dictating factor. CPU load is only significant for
- IDE systems where the CPU does the transfer itself [more details
- needed here !!] but is generally low for SCSI, see SCSI documentation
- for actual numbers. Disk seek time is also small, usually in the
- millisecond range. This is however not a problem if you use command
- queueing on SCSI, where you overlap commands to keep the bus busy
- all the time. News spools are a special case consisting of a huge
- number of normally small files so in this case seek time can become
- more significant.
-
- 1.2.2 Reliability
-
- Naturally no one wants low reliability disks, but one might be better
- off regarding old disks as unreliable. Also, for RAID purposes (see
- the relevant docs) it is suggested to use a mixed set of disks so that
- simultaneous disk crashes become less likely.
-
-
- 1.3 Technologies
-
- In order to decide how to get the most of your devices you need to
- know what technologies are available and their implications. As always
- there can be some tradeoffs with respect to speed, reliability, power,
- flexibility, ease of use and complexity.
-
- 1.3.1 RAID
-
- This is a method of increasing reliability, speed or both by using multiple
- disks in parallel thereby decreasing access time and increasing transfer
- speed. A checksum or mirroring system can be used to increase reliability.
- Large servers can take advantage of such a setup but it might be overkill
- for a single user system unless you already have a large number of disks
- available. See other docs and FAQs for more information.
-
- For Linux one can set up a RAID system using either software (the md module
- in the kernel) or hardware, using a Linux compatible controller. Check the
- documentation for what controllers can be used. A hardware solution is
- usually faster, and perhaps also safer, but comes at a significant cost.
-
- 1.3.2 AFS, Veritas and Other Volume Management Systems
-
- Although multiple partitions and disks have the advantage of more
- space and higher speed and reliability, there is a significant snag:
- if for instance the /tmp partition is full you are in trouble even if
- the news spool is empty, as it is not easy to move allocated space
- across disks. Volume management is a system that does just this, and
- AFS and Veritas are two of the best known examples. Some also offer
- other file systems, such as log file systems and others optimised for
- reliability or speed. Note that Veritas is not available (yet) for
- Linux, and it is not certain they can sell kernel modules without
- providing source for their proprietary code; this is mentioned just as
- information on what is out there. Still, you can check their web page
- http://www.veritas.com to see how such systems function.
-
- Derek Atkins, of MIT, ported AFS to Linux and has also set up a
- mailing list for this: linux-afs@mit.edu, which is open to the public.
- Requests to join the list go to linux-afs-request@mit.edu, and bug
- reports should go to linux-afs-bugs@mit.edu. Important: as AFS uses
- encryption it is restricted software and cannot easily be exported
- from the US. AFS is now sold by Transarc and they have set up a web
- site. The directory structure there has been reorganized recently so I
- cannot give a more accurate URL than just http://www.transarc.com
- which lands you in the root of the web site. There you can also find
- much general information as well as a FAQ.
-
- 1.3.3 Linux md Kernel Patch
-
- There is however one kernel project that attempts to do some of this:
- md, which has been part of the kernel distributions since 1.3.69.
- Currently providing spanning and RAID, it is still in early
- development and people are reporting varying degrees of success as
- well as total wipe-outs. Use with caution.
-
- 1.3.4 General File System Considerations
-
- In the Linux world ext2fs is well established as a general purpose
- system. Still, for some purposes others can be a better choice. News
- spools lend themselves to a log file based system, whereas high
- reliability data might need other formats. This is a hotly debated
- topic and there are currently few choices available, but work is
- underway. Log file systems also have the advantage of very fast file
- checking; mail servers in the 100G class can suffer file checks taking
- several days before becoming operational after rebooting.
-
-
- [I believe someone from Yggdrasil mentioned a log file based
- system once, details? And AFS is available to Linux I think, sources
- anyone?]
-
-
- There is room for access control lists (ACL) and other unimplemented
- features in the existing ext2fs, stay tuned for future updates. There has
- been some talk about adding on the fly compression too.
-
- 1.3.5 Compression
-
- Disk versus file compression is a hotly debated topic especially regarding
- the added danger of file corruption. Nevertheless there are several options
- available for the adventurous administrator. These take on many forms,
- from kernel modules and patches to extra libraries, but note that most
- suffer various limitations, such as being read-only. As development
- takes place at breakneck speed the specs have undoubtedly changed by
- the time you read this. As always: check the latest updates yourself.
- Here only a few references are given.
-
- - DouBle features file compression with some limitations.
- - Zlibc adds transparent on-the-fly decompression of files as they load.
- There are many modules available for reading compressed files or
- partitions that are native to various other operating systems though
- currently most of these are read-only.
-
- There is also the user file system, which allows an FTP-based file
- system and some compression (arcfs), plus fast prototyping and many
- other features.
-
- Recent kernels feature the loop or loopback device which can be used to put
- a complete file system within a file. There are some possibilities for
- using this for making new filesystems with compression, tarring etc.
-
- Note that this device is unrelated to the network loopback device.
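- A sketch of how the loop device can be used (as root; device names,
- sizes and mount points are all illustrative):

```
# Put an ext2 file system inside an ordinary 10 MB file:
dd if=/dev/zero of=/tmp/fsfile bs=1024k count=10
losetup /dev/loop0 /tmp/fsfile     # attach the file to a loop device
mke2fs /dev/loop0                  # make the file system
mount /dev/loop0 /mnt              # use it like any other partition
# ... and when done:
umount /mnt
losetup -d /dev/loop0              # detach again
```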
-
- 1.3.6 Physical Sector Positioning
-
- Some seek time reduction can be achieved by positioning frequently
- accessed sectors in the middle so that the average seek distance and
- therefore the seek time is short. This can be done either by using
- fdisk or cfdisk to make a partition on the middle sectors or by first
- making a file (using dd) equal to half the size of the entire disk
- before creating the files that are frequently accessed, after which
- the dummy file can be deleted. Both cases assume starting from an
- empty disk.
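- The dummy file variant can be sketched as a small shell function. The
- function name is mine and the sizes are for you to fill in:

```shell
#!/bin/sh
# Reserve the outer half of a fresh partition with a dummy file so
# that files created afterwards land near the middle of the disk,
# then delete the dummy to free the reserved half again.
reserve_middle() {
    mountpoint=$1   # where the empty partition is mounted
    half_mb=$2      # half the partition size, in MB
    dd if=/dev/zero of="$mountpoint/dummy" bs=1024k count="$half_mb" 2>/dev/null
    # ... now create the frequently accessed files here ...
    rm "$mountpoint/dummy"
}

# Example: reserve_middle /mnt/newdisk 500
```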
-
- This little trick can be used both on ordinary drives as well as RAID
- systems. In the latter case the calculation for centering the sectors
- will be different, if possible. Consult the latest RAID manual.
-
-
- 2 Disk Layout
-
- With all this in mind we are now ready to embark on the layout [and no
- doubt controversy]. I have based this on my own method developed when I
- got hold of 3 old SCSI disks and boggled over the possibilities.
-
- 2.1 Selection
-
- Determine your needs and set up a list of all the parts of the file
- system you want on separate partitions, sort them in descending order
- of speed requirement, and note how much space you want to give each
- partition.
-
- If you plan to use RAID, make a note of the disks you want to use and
- which partitions you want to RAID. Remember that various RAID
- solutions offer different speeds and degrees of reliability.
-
- (Just to make it simple I'll assume we have a set of identical SCSI disks
- and no RAID)
-
- 2.2 Mapping
-
- Then we want to place the partitions onto physical disks. The point of the
- following algorithm is to maximise parallelizing and bus capacity. In this
- example the drives are A, B and C and the partitions are 987654321 where 9
- is the partition with the highest speed requirement. Starting at one drive
- we 'meander' the partition line over and over the drives in this way:
-
- A : 9 4 3
- B : 8 5 2
- C : 7 6 1
-
- This makes the 'sum of speed requirements' the most equal across each
- drive.
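- The meander can be written down as a small script. The drive and
- partition names here are just the letters and digits from the example
- above:

```shell
#!/bin/sh
# Snake-order ("meander") assignment of partitions to drives.
# Partitions are listed fastest-first; drive names are placeholders.
partitions="9 8 7 6 5 4 3 2 1"
set -- A B C
n=$#
i=0
result=""
for p in $partitions; do
    j=$((i % n))                      # position within the current row
    if [ $(( (i / n) % 2 )) -eq 1 ]; then
        j=$((n - 1 - j))              # odd rows run backwards
    fi
    eval "drive=\$$((j + 1))"         # pick the j-th drive name
    result="$result$drive:$p "
    i=$((i + 1))
done
echo "$result"
```

- This reproduces the table above: A gets 9, 4 and 3; B gets 8, 5 and
- 2; C gets 7, 6 and 1.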
-
- 2.3 Optimizing
-
- After this there are usually a few partitions that have to be 'shuffled' over
- the drives either to make them fit or if there are special considerations
- regarding speed, reliability, special file systems etc. Nevertheless this
- gives [what this author believes is] a good starting point for the complete
- setup of the drives and the partitions. In the end it is actual use that will
- determine the real needs after we have made so many assumptions. After
- commencing operations one should assume a time comes when a repartitioning
- will be beneficial.
-
- 2.4 Pitfalls
-
- The dangers of splitting up everything into separate partitions are
- briefly mentioned in the section about volume management. Still, several
- people have asked me to emphasize this point more strongly: when one
- partition fills up it cannot grow any further, no matter if there is
- plenty of space in other partitions.
-
- In particular look out for explosive growth in the news spool
- (/var/spool/news). For multi user machines with quotas keep
- an eye on /tmp and /var/tmp as some people try to hide their
- files there, just look out for filenames ending in gif or jpeg...
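- A crude sweep for such stashes can be done with find. The function
- name and the suffix list are my own choices:

```shell
#!/bin/sh
# List image files hiding in a temp area (common quota-dodging loot).
find_stash() {
    find "$1" \( -name '*.gif' -o -name '*.jpg' -o -name '*.jpeg' \) -print
}

# Example: find_stash /tmp ; find_stash /var/tmp
```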
-
- In fact, for single physical drives this scheme offers very few gains
- at all, other than making file growth monitoring easier (using 'df'),
- as there is no scope for parallel disk access. A freely available
- volume management system would solve this, but that is still some time
- in the future.
-
- Partitions and disks are easily monitored using 'df' and should be done
- frequently, perhaps using a cron job or some other general system
- management tool. [Is any such tool currently available?]
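- A minimal watcher along these lines, suitable for cron; the 90%
- threshold and the function name are arbitrary choices of mine:

```shell
#!/bin/sh
# Print a warning for every file system at or above 90% capacity.
# Reads "df -P" style output on stdin, so it is easy to test.
check_full() {
    awk 'NR > 1 && $5 + 0 >= 90 { print "WARNING: " $6 " at " $5 }'
}

df -P | check_full
```

- A crontab line such as "0 * * * * df -P | check_full | mail root"
- would be one way of running it regularly.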
-
-
- 3 Further Information
-
- There is a wealth of information one should go through when setting
- up a major system, for instance for a news or general Internet
- service provider. The FAQs in the following groups are useful:
-
- News groups:
- comp.arch.storage, comp.sys.ibm.pc.hardware.storage, alt.filesystems.afs,
- comp.periphs.scsi ...
- Mailing lists:
- raid, scsi ...
-
- Many mailing lists are at vger.rutgers.edu but this is notoriously
- overloaded, try to find a mirror. There are some lists mirrored at
- http://www.redhat.com [more references please!].
- Remember you can also use the web search engines and that some, like
- altavista, also can search usenet news.
-
- [much more info needed here]
-
- 4 Concluding Remarks
-
- Disk tuning and partition decisions are difficult to make, and there
- are no hard rules here. Nevertheless it is a good idea to work more on
- this, as the payoffs can be considerable. Maximizing usage on one
- drive only while the others are idle is unlikely to be optimal; watch
- the drive lights, they are not there just for decoration. For a
- properly set up system the lights should look like Christmas in a
- disco. Linux offers software RAID but also supports some hardware
- based SCSI RAID controllers. Check what is available. As your system
- and experiences evolve you are likely to repartition, and you might
- look at this document again. Additions are always welcome.
-
- Currently the only supported hardware SCSI RAID controllers are the
- SmartCache [I/III/IV] and SmartRAID [I/III/IV] controller families
- from DPT. These controllers are supported by the EATA/DMA driver in
- the standard kernel. This company also has an informative web page
- at http://www.dpt.com which describes various general aspects
- of RAID and SCSI in addition to the product related information.
- [Please let me know if there are other hardware RAID controllers
- available for Linux.]
-
-