NetNews Usenet Archive 1992 #18

Internet Message Format | 1992-08-20 | 5.6 KB

Path: sparky!uunet!mcsun!news.funet.fi!hydra!klaava!torvalds
From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds)
Newsgroups: comp.os.linux
Subject: Re: A question about Kernel system call mechanism
Keywords: linux, kernel, system call
Message-ID: <1992Aug20.122051.24901@klaava.Helsinki.FI>
Date: 20 Aug 92 12:20:51 GMT
References: <1992Aug19.174117.21233@ramsey.cs.laurentian.ca>
Organization: University of Helsinki
Lines: 102

In article <1992Aug19.174117.21233@ramsey.cs.laurentian.ca> ron@ramsey.cs.laurentian.ca (Ron Prediger [Velociraptor]) writes:
>I am relatively new to Linux and have been examining the kernel source.
>
>1) Does anyone know how linux passes parameters from the user process to the
>kernel service routine ? Below is what I think is happening and where I
>am confused.
>
>It appears that system calls are handled using interrupt or trap gates
>resident in the Interrupt descriptor table (IDT). From reading the Intel
>386 ref. manual I understand that a stack switch occurs automatically when
>a less privileged process accesses a gate for a more privileged subroutine.

Correct so far...

>What I can't see is how the kernel service routine gets the system call
>parameters (ie. addresses, etc) from the user process. Is there code
>somewhere which copies these parameters from the original (level 3) stack to
>the more privileged (level 0) stack ? If linux had used call gates to
>implement system calls, the parameters would automatically be copied to the
>privileged routine's stack by the 386. (This automatic
>copy of parameters does not occur when referencing interrupt/trap gates.)

I didn't like system call gates: they are too complicated for my taste
(besides, you have to know how many arguments to copy, or have a
specific system call gate for each type of argument: maybe not a bad
idea, but...). Anyway, things are easier than you make them out to be:
the arguments are simply passed in the normal registers.
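
[Editor's sketch] The register convention can be pictured with a small
user-space model in C. This is not kernel code: struct demo_regs,
demo_trap(), demo_syscall(), the two handlers and the DEMO_ENOSYS value
are all hypothetical names made up for the illustration, and only three
argument registers are modeled to keep it short. A real i386 stub would
load the registers and execute "int $0x80" instead of calling a function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the registers a process loads before the trap. */
struct demo_regs {
    int32_t eax;            /* system call index in, return value out */
    int32_t ebx, ecx, edx;  /* first three arguments */
};

/* Illustrative handlers (not real kernel routines). */
static int32_t demo_add(int32_t a, int32_t b, int32_t c)
{
    (void)c;                /* third argument unused by this call */
    return a + b;
}

static int32_t demo_max3(int32_t a, int32_t b, int32_t c)
{
    int32_t m = a > b ? a : b;
    return m > c ? m : c;
}

typedef int32_t (*demo_handler)(int32_t, int32_t, int32_t);

/* Table indexed by the call number found in eax. */
static const demo_handler demo_table[] = { demo_add, demo_max3 };

#define DEMO_NCALLS (sizeof demo_table / sizeof demo_table[0])
#define DEMO_ENOSYS 38      /* assumed error code for "no such call" */

/* Stand-in for the trap entry: the saved registers are the arguments. */
static void demo_trap(struct demo_regs *r)
{
    assert(r != NULL);
    if (r->eax < 0 || (size_t)r->eax >= DEMO_NCALLS) {
        r->eax = -DEMO_ENOSYS;
        return;
    }
    /* The result goes back in eax, as with a normal return value. */
    r->eax = demo_table[r->eax](r->ebx, r->ecx, r->edx);
}

/* Stand-in for the user-space stub: load registers, trap, read eax. */
static int32_t demo_syscall(int32_t nr, int32_t a, int32_t b, int32_t c)
{
    struct demo_regs r = { nr, a, b, c };
    demo_trap(&r);
    return r.eax;
}
```

In this model demo_syscall(0, 2, 3, 0) goes through demo_add and
demo_syscall(1, 4, 9, 7) through demo_max3, while an index outside the
table comes back as -DEMO_ENOSYS. Nothing is copied between stacks:
the handler just reads what the caller left in the registers.
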

Passing arguments in the registers allows you 6 (32-bit) direct
arguments (not counting %eax, which is used to tell which system call
you want handled), and more if you simply set up a pointer to a block in
user space. And the beauty of it all is that they are automatically put
on the stack as arguments to the system calls when the registers are
saved - see the file linux/kernel/sys_call.S, which saves all the
necessary state information. It's the simplest and fastest way I could
find: linux doesn't even save the state in some special task-structure
like other unices seem to do, but just leaves the regs on the stack,
ready for popping when the process returns from the interrupt.

>2) It appears that Linux is making use of segment registers (FS,GS) and the
>LDT/GDT to transfer the actual data (ie. from a read system call) between
>user and kernel address spaces. Is this observation correct ?

Actually, only %fs is used: it points to the user-space segment when in
a system call. Thus linux never needs to check any bounds when copying
from/to user space: it's automatically handled by the hardware. The
get_fs_XXX() and put_fs_XXX() (XXX=byte, word, long) inline functions
can be used to transfer bytes from/to user space, and memcpy_tofs() or
memcpy_fromfs() can be used to move bigger blocks between kernel and
user segments.

What happens at a system call is roughly:

user space:
 - load the arguments into registers (%eax contains the system call
   index, %ebx... contain the parameters)
 - do an "int $0x80", moving to kernel mode:

kernel space:
 - clear the direction-flag, as gcc assumes this
 - save the system call number: a negative number means the interrupt
   was caused by a hardware IRQ or trap.
 - save all the segment registers
 - save %eax (which happens to be the same number we saved earlier if
   this is a normal system call)
 - save the other registers: they automatically form the stack frame for
   the system call.
 - call the appropriate system call handler by indexing the appropriate
   table with %eax. The handler does its stuff - it /can/ change the
   stack frame if it wants to, and thus return information in any
   registers it wants to, but that is really discouraged, and all system
   calls currently just return their result in %eax as part of their
   normal return.
 - check if there were any signals, and change the return stack (both in
   kernel and user space) appropriately if so, invoking the signal
   handler instead of returning directly.
 - pop all the saved registers, and do an iret, returning to user mode.

While the system call runs, the %ds and %es registers point to kernel
data space, and %fs points to user space. But the system calls may
change %fs for their own needs: for example symbolic links result in
changing %fs to kernel space for a while as the name is parsed directly
from the kernel buffers instead of from user space etc.

Note that normal faults/traps and IRQs do essentially exactly the same,
except for "fast" IRQs, which just save a minimal amount of information
and don't do the signal checking (used by things like the serial
handlers). Also, they naturally haven't got any "system call number",
but have their own routine that is called after the stack is set up.

As to the GDT: the GDT contains just two normal segment entries: GDT[1]
is the kernel code segment descriptor, and GDT[2] is the kernel data
descriptor. The rest of the global descriptor table is filled with TSS
and LDT descriptors. The local descriptor tables normally contain just
the user-space code/data descriptors in LDT[1] and LDT[2], but it's
flexible enough to be extended if something wants to have more segments
in user space (I think the xenix emulator uses this, although I haven't
looked at the code yet).

		Linus
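
[Editor's sketch] The %fs mechanism can be approximated in portable C by
treating a segment as a base plus a limit and rejecting any access past
the limit. This is only a model under stated assumptions: on the 386 the
limit check is done by the CPU, which raises a fault, whereas here it is
an explicit test that returns -1. The demo_ names and their signatures
are invented for the illustration and do not match the kernel's
get_fs_XXX(), put_fs_XXX(), memcpy_tofs() or memcpy_fromfs().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a segment: a base and the last valid offset. */
struct demo_segment {
    uint8_t *base;
    size_t   limit;
};

/* Stands in for %fs; a handler may repoint it, as the post describes. */
static struct demo_segment demo_fs;

/* Models the hardware limit check: 0 if [off, off+len) fits, else -1. */
static int demo_fs_check(size_t off, size_t len)
{
    assert(demo_fs.base != NULL);
    if (len == 0)
        return 0;
    if (off > demo_fs.limit)
        return -1;
    if (len - 1 > demo_fs.limit - off)
        return -1;
    return 0;
}

/* Read one byte at an offset within the modeled user segment. */
static int demo_get_fs_byte(size_t off, uint8_t *out)
{
    if (demo_fs_check(off, 1) != 0)
        return -1;
    *out = demo_fs.base[off];
    return 0;
}

/* Write one byte at an offset within the modeled user segment. */
static int demo_put_fs_byte(uint8_t val, size_t off)
{
    if (demo_fs_check(off, 1) != 0)
        return -1;
    demo_fs.base[off] = val;
    return 0;
}

/* Copy a block from the modeled user segment into a kernel buffer. */
static int demo_memcpy_fromfs(void *to, size_t from_off, size_t n)
{
    if (demo_fs_check(from_off, n) != 0)
        return -1;
    memcpy(to, demo_fs.base + from_off, n);
    return 0;
}

/* Copy a block from a kernel buffer into the modeled user segment. */
static int demo_memcpy_tofs(size_t to_off, const void *from, size_t n)
{
    if (demo_fs_check(to_off, n) != 0)
        return -1;
    memcpy(demo_fs.base + to_off, from, n);
    return 0;
}
```

To try it, point demo_fs at a small array with limit set to its size
minus one: accesses inside the array succeed, and any access that would
run past the last byte is refused without touching memory. Repointing
demo_fs at a different buffer plays the role of a system call switching
%fs to kernel space for a while.
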