- Path: sparky!uunet!mcsun!news.funet.fi!hydra!klaava!torvalds
- From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds)
- Newsgroups: comp.os.linux
- Subject: Re: A question about Kernel system call mechanism
- Keywords: linux, kernel, system call
- Message-ID: <1992Aug20.122051.24901@klaava.Helsinki.FI>
- Date: 20 Aug 92 12:20:51 GMT
- References: <1992Aug19.174117.21233@ramsey.cs.laurentian.ca>
- Organization: University of Helsinki
- Lines: 102
-
- In article <1992Aug19.174117.21233@ramsey.cs.laurentian.ca> ron@ramsey.cs.laurentian.ca (Ron Prediger [Velociraptor]) writes:
- >I am relatively new to Linux and have been examining the kernel source.
- >
- >1) Does anyone know how linux passes parameters from the user process to the
- >kernel service routine ? Below is what I think is happening and where I
- >am confused.
- >
- >It appears that system calls are handled using interrupt or trap gates
- >resident in the Interrupt descriptor table (IDT). From reading the Intel
- >386 ref. manual I understand that a stack switch occurs automatically when
- >a less privileged process accesses a gate for a more privileged subroutine.
-
- Correct so far...
-
- >What I can't see is how the kernel service routine gets the system call
- >parameters (ie. addresses, etc) from the user process. Is there code
- >somewhere which copies these parameters from the original (level 3) stack to
- >the more privileged (level 0) stack ? If linux had used call gates to
- >implement system calls, the parameters would automatically be copied to the
- >privileged routine's stack by the 386. (This automatic
- >copy of parameters does not occur when referencing interrupt/trap gates.)
-
- I didn't like system call gates: they are too complicated for my taste
- (besides, you have to know how many arguments to copy, or have a
- specific system call gate for each type of argument: maybe not a bad
- idea, but...). Anyway, things are easier than you make them out to be:
- the arguments are simply passed in the normal registers.
-
- Passing arguments in the registers allows you 6 (32-bit) direct
- arguments (not counting %eax, which is used to tell which system call
- you want handled), and more if you simply set up a pointer to a block in
- user space. And the beauty of it all is that they are automatically put
- on the stack as arguments to the system calls when the registers are
- saved - see the file linux/kernel/sys_call.S, which saves all the
- necessary state information. It's the simplest and fastest way I could
- find: linux doesn't even save the state in some special task-structure
- like other unices seem to do, but just leaves the regs on the stack,
- ready for popping when the process returns from the interrupt.
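
The two argument styles described above (values placed directly in registers, or one register holding the address of a larger block) can be sketched in portable C. This is a user-space model, not kernel code: `demo_reg_t`, `struct demo_regs`, `struct demo_arg_block`, `demo_sum_direct()` and `demo_sum_block()` are hypothetical names invented for the illustration, and the model dereferences the block pointer directly where a real kernel would fetch it from user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register type: pointer-sized so an address fits in it.
 * On the 386 each of these would be a 32-bit register. */
typedef intptr_t demo_reg_t;

/* Hypothetical model of the registers loaded before "int $0x80". */
struct demo_regs {
    demo_reg_t eax;             /* system call index */
    demo_reg_t ebx, ecx, edx;   /* direct arguments */
};

/* Hypothetical argument block for a call that needs more values
 * than there are registers. */
struct demo_arg_block {
    size_t count;
    const demo_reg_t *values;
};

/* Direct style: the three values travel in %ebx, %ecx and %edx. */
static demo_reg_t demo_sum_direct(demo_reg_t ebx, demo_reg_t ecx, demo_reg_t edx)
{
    return ebx + ecx + edx;
}

/* Block style: %ebx carries only the address of the block. */
static demo_reg_t demo_sum_block(demo_reg_t ebx)
{
    const struct demo_arg_block *blk = (const struct demo_arg_block *)ebx;
    demo_reg_t total = 0;
    size_t i;

    assert(blk != NULL);
    for (i = 0; i < blk->count; i++)
        total += blk->values[i];
    return total;
}
```

A caller would fill a `struct demo_regs`, then either pass `ebx`, `ecx`, `edx` straight through, or store `(demo_reg_t)&block` in `ebx` and let the handler unpack the rest.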
-
- >2) It appears that Linux is making use of segment registers (FS,GS) and the
- >LDT/GDT to transfer the actual data (ie. from a read system call) between
- >user and kernel address spaces. Is this observation correct ?
-
- Actually, only %fs is used: it points to the user-space segment when in
- a system call. Thus linux never needs to check any bounds when copying
- from/to user space: it's automatically handled by the hardware. The
- get_fs_XXX() and put_fs_XXX() (XXX=byte, word, long) inline functions
- can be used to transfer bytes from/to user space, and memcpy_tofs() or
- memcpy_fromfs() can be used to move bigger blocks between kernel and
- user segments.
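
To make the "hardware checks the bounds" point concrete, here is a minimal user-space model in C. It is only a sketch under loud assumptions: the `seg_` names are hypothetical stand-ins, addresses are modelled as offsets into a byte array, and the explicit range test plays the role that the segment limit plays on the 386. The real get_fs_XXX()/put_fs_XXX() and memcpy_tofs()/memcpy_fromfs() routines contain no such test, which is the whole point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the segment that %fs refers to. */
struct seg_descriptor {
    unsigned char *base;  /* start of the modelled user memory */
    size_t limit;         /* number of valid bytes */
};

static struct seg_descriptor seg_fs;  /* models %fs */
static int seg_fault;                 /* set when an access is out of range */

/* Models the limit check the hardware applies to every access via %fs. */
static int seg_in_range(size_t offset, size_t len)
{
    assert(seg_fs.base != NULL);
    if (len > seg_fs.limit || offset > seg_fs.limit - len) {
        seg_fault = 1;
        return 0;
    }
    return 1;
}

/* Single-byte transfers, modelled on the get_fs_byte()/put_fs_byte() idea. */
static unsigned char seg_get_fs_byte(size_t offset)
{
    return seg_in_range(offset, 1) ? seg_fs.base[offset] : 0;
}

static void seg_put_fs_byte(unsigned char value, size_t offset)
{
    if (seg_in_range(offset, 1))
        seg_fs.base[offset] = value;
}

/* Block transfers: kernel buffer to user segment, and back. */
static void seg_memcpy_tofs(size_t to_offset, const void *from, size_t n)
{
    if (seg_in_range(to_offset, n))
        memcpy(seg_fs.base + to_offset, from, n);
}

static void seg_memcpy_fromfs(void *to, size_t from_offset, size_t n)
{
    if (seg_in_range(from_offset, n))
        memcpy(to, seg_fs.base + from_offset, n);
}
```

Pointing `seg_fs` at a 16-byte array and writing at offset 16 sets `seg_fault` and leaves the memory untouched, which is the modelled equivalent of the processor refusing the access.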
-
- What happens at a system call is roughly:
-
- user space:
- - load the arguments into registers (%eax contains the system call
- index, %ebx... contain the parameters)
- - do an "int $0x80", moving to kernel mode:
-
- kernel space:
- - clear the direction-flag, as gcc assumes this
- - save the system call number: a negative number means the interrupt
- was caused by a hardware IRQ or trap.
- - save all the segment registers
- - save %eax (which happens to be the same number we saved earlier if
- this is a normal system call)
- - save the other registers: they automatically form the stack frame for
- the system call.
- - call the appropriate system call handler by indexing the appropriate
- table with %eax.
-
- The handler does its stuff - it /can/ change the stack frame if it
- wants to, and thus return information in any registers it wants to, but
- that is really discouraged, and all system calls currently just return
- their result in %eax as part of their normal return.
-
- - check if there were any signals, and change the return stack (both in
- kernel and user space) appropriately if so, invoking the signal
- handler instead of returning directly.
- - pop all the saved registers, and do an iret, returning to user mode.
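
The sequence above can be modelled in plain C to show how the saved registers double as the handler's arguments. Everything here is a hypothetical sketch: the `sc_` names, the two toy handlers, the three-argument limit and the -1 returned for an unknown call number are all assumptions made for the example, and the real frame layout is whatever sys_call.S sets up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t sc_reg_t;  /* hypothetical; 32 bits wide on the 386 */

/* Hypothetical saved-register frame built on the kernel stack. */
struct sc_frame {
    sc_reg_t ebx, ecx, edx;  /* saved registers, reused as arguments */
    sc_reg_t eax;            /* call index on entry, result on exit */
    sc_reg_t orig_eax;       /* saved call number; negative = IRQ or trap */
};

typedef sc_reg_t (*sc_handler)(sc_reg_t, sc_reg_t, sc_reg_t);

/* Two toy handlers standing in for real system calls. */
static sc_reg_t sc_sys_add(sc_reg_t a, sc_reg_t b, sc_reg_t c)
{
    return a + b + c;
}

static sc_reg_t sc_sys_max(sc_reg_t a, sc_reg_t b, sc_reg_t c)
{
    sc_reg_t m = a > b ? a : b;
    return m > c ? m : c;
}

static const sc_handler sc_call_table[] = { sc_sys_add, sc_sys_max };
#define SC_NR_CALLS (sizeof sc_call_table / sizeof sc_call_table[0])

static int sc_signal_pending;     /* models "were there any signals?" */
static int sc_signals_delivered;  /* counts modelled handler invocations */

/* Models what happens between "int $0x80" and "iret". */
static sc_reg_t sc_int80(sc_reg_t eax, sc_reg_t ebx, sc_reg_t ecx, sc_reg_t edx)
{
    struct sc_frame frame;

    /* Save the call number and the registers: this is the stack frame. */
    frame.orig_eax = eax;
    frame.eax = eax;
    frame.ebx = ebx;
    frame.ecx = ecx;
    frame.edx = edx;

    /* Index the table with %eax; the result replaces the saved %eax. */
    if (frame.eax < 0 || (size_t)frame.eax >= SC_NR_CALLS) {
        frame.eax = -1;  /* placeholder error value, an assumption */
    } else {
        assert(sc_call_table[frame.eax] != NULL);
        frame.eax = sc_call_table[frame.eax](frame.ebx, frame.ecx, frame.edx);
    }

    /* Check for signals before going back to user mode. */
    if (sc_signal_pending) {
        sc_signal_pending = 0;
        sc_signals_delivered++;
    }

    /* "Pop" the frame: user code finds the result in %eax. */
    return frame.eax;
}
```

Calling `sc_int80(0, 1, 2, 3)` runs the first table entry on the three saved values and hands back their sum, just as a real call returns its result in %eax.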
-
- While the system call runs, the %ds and %es registers point to kernel
- data space, and %fs points to user space. But the system calls may
- change %fs for their own needs: for example symbolic links result in
- changing %fs to kernel space for a while as the name is parsed directly
- from the kernel buffers instead of from user space etc.
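
The temporary %fs switch can be illustrated with a small save/switch/restore model in C. This is a sketch with hypothetical names only (`fsw_fs`, `fsw_get_fs_byte()`, `fsw_name_length()`, `fsw_link_target_length()`); it does not reproduce the kernel's actual symbolic link code, it just shows why one name-parsing routine can serve both address spaces.

```c
#include <assert.h>
#include <stddef.h>

#define FSW_MEM_SIZE 32

enum fsw_space { FSW_USER, FSW_KERNEL };

static char fsw_user_mem[FSW_MEM_SIZE];    /* modelled user space */
static char fsw_kernel_mem[FSW_MEM_SIZE];  /* modelled kernel buffers */
static enum fsw_space fsw_fs = FSW_USER;   /* models where %fs points */

/* Read one byte through "%fs"; out-of-range reads yield a NUL. */
static char fsw_get_fs_byte(size_t offset)
{
    const char *base = (fsw_fs == FSW_KERNEL) ? fsw_kernel_mem : fsw_user_mem;
    return offset < FSW_MEM_SIZE ? base[offset] : '\0';
}

/* Name parser that always reads via "%fs", wherever it points. */
static size_t fsw_name_length(size_t offset)
{
    size_t len = 0;

    assert(offset <= FSW_MEM_SIZE);
    while (fsw_get_fs_byte(offset + len) != '\0')
        len++;
    return len;
}

/* Parse a name held in a kernel buffer: point "%fs" at kernel space,
 * reuse the ordinary parser, then restore the previous setting. */
static size_t fsw_link_target_length(size_t kernel_offset)
{
    enum fsw_space saved = fsw_fs;
    size_t len;

    fsw_fs = FSW_KERNEL;
    len = fsw_name_length(kernel_offset);
    fsw_fs = saved;
    return len;
}
```

The restore step matters: anything that later copies to or from the caller's buffers assumes %fs is back on user space.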
-
- Note that normal faults/traps and IRQs do essentially the same thing,
- except for "fast" IRQs, which just save a minimal amount of information
- and don't do the signal checking (used by things like the serial
- handlers). Also, they naturally haven't got any "system call number",
- but have their own routine that is called after the stack is set up.
-
- As to the GDT: the GDT contains just two normal segment entries: GDT[1]
- is the kernel code segment descriptor, and GDT[2] is the kernel data
- descriptor. The rest of the global descriptor table is filled with TSS
- and LDT descriptors. The local descriptor tables normally contain just
- the user-space code/data descriptors in LDT[1] and LDT[2], but it's
- flexible enough to be extended if something wants to have more segments
- in user space (I think the xenix emulator uses this, although I haven't
- looked at the code yet).
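
For readers mapping these table slots to the values that end up in the segment registers, the 386 selector encoding is: index in bits 3 and up, a table-indicator bit (0 = GDT, 1 = LDT) in bit 2, and the requested privilege level in bits 0-1. The helper below is a hypothetical illustration of that arithmetic; the assumption that kernel selectors use privilege level 0 and user selectors level 3 comes from the ring 0 / ring 3 split discussed earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

enum sel_table { SEL_GDT = 0, SEL_LDT = 1 };

/* Build a 386 segment selector from a table index, the table it
 * lives in, and the requested privilege level (0..3). */
static uint16_t sel_make(unsigned index, enum sel_table table, unsigned rpl)
{
    assert(rpl <= 3);
    assert(index < 8192);  /* a descriptor table holds at most 8192 entries */
    return (uint16_t)((index << 3) | ((unsigned)table << 2) | rpl);
}
```

With those assumptions, GDT[1] and GDT[2] at level 0 give selectors 0x08 and 0x10, while LDT[1] and LDT[2] at level 3 give 0x0f and 0x17.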
-
- Linus
-