6.S081/6.828 2019 Lecture 4: System Call Entry/Exit

Today:
  how system calls get into / out of the kernel
  the start of some detailed investigation of O/S internals

What happens when a program calls write(fd, buf, n)?
  [user/kernel diagram]
  could we use an ordinary function call into the kernel?
    that would be fast
    and flexible -- easy to e.g. pass and return complex data types
    and a mechanism familiar to programmers

Can't use function call for system call!
  Due to the need for isolation.
  Isolation looms over much of O/S design.

What is isolation?
  enforced separation to contain effects of failures
  the process is the usual unit of isolation
  prevent process X from wrecking or spying on process Y
    r/w memory, use 100% of CPU, change FDs, &c
  prevent a process from interfering with the operating system
  in the face of malice as well as bugs
    a bad process may try to trick the h/w or kernel

the main isolation tools
  address spaces
    a process can address only its own memory (not kernel, not other processes)
    [user address space: 0..trapframe+trampoline]
  privilege mode in CPU
    hardware prevents a process from accessing devices
    and sensitive CPU registers
      e.g. address space configuration registers
  syscall implementations explicitly enforce controls
  careful user/kernel transfer (today)

CPU privilege mode on RISC-V: supervisor mode vs user mode
  supervisor mode can do many things that user mode cannot
    access devices
    configure address spaces (virtual memory)
    read/write special registers
  the kernel runs in supervisor mode
  ordinary programs run in user mode
  every serious microprocessor has a similar user/kernel mode flag

let's look at how xv6 system calls enter/leave the kernel
  example: the shell printing a prompt with write(2, "$ ", 2)
  [sh.c's write(2, "$ ", 2); emacs kernel/usys.S]
  that's a C function call to write in usys.S
  note the ecall instruction!
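From the user program's side the system call is just a C function call. A minimal sketch of the shell's prompt write, assuming a POSIX host rather than xv6; write_prompt() is a made-up helper name, not something in sh.c:

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Write the two-byte prompt "$ " to fd, the way the shell calls
   write(2, "$ ", 2). write_prompt() is a hypothetical helper for this
   sketch. On xv6/RISC-V the write() stub in usys.S loads the call
   number into a7 and executes ecall; a host libc does the equivalent
   with its own trap instruction. */
ssize_t write_prompt(int fd)
{
    const char *prompt = "$ ";
    size_t n = strlen(prompt);   /* '$' and ' ' */
    assert(n == 2);
    return write(fd, prompt, n); /* returns bytes written, or -1 on error */
}
```

Calling write_prompt(2) sends the prompt to standard error, matching the fd the shell uses; everything after the call instruction is the kernel's business.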

the overall trajectory
  write()
  trampoline / uservec
  usertrap() in trap.c
  syscall() in syscall.c
  sys_write() in sysfile.c
  usertrapret()
  trampoline / userret
  write()

what's the state of the machine at this point?
  user address space, kernel address space
  trampoline at top of user address space
  trapframe just under the trampoline
  kernel stack for shell in the kernel

trapframe content, set up in advance by kernel [on board]
  see proc.h
  kernel page table
  kernel stack pointer
  address of usertrap() function in kernel
  space for saved user pc
  space for 32 saved user registers

special RISC-V registers (many more, here are relevant ones)
  Chapter 10 in The RISC-V Reader
  [on board]
  stvec -- ecall jumps here in kernel; address of trampoline
  sepc -- ecall saves user pc here
  scause -- ecall sets to 8 to indicate a system call
  sscratch -- address of trapframe
  satp -- current page table

C function calling convention on RISC-V
  i.e. how function calls use the 32 RISC-V registers
  important since system calls start as C function calls
    e.g. shell's call to write()
  a0..a7 -- arguments
  ra -- return address
  a0 -- return value

Now I'll walk through the system call entry with the gdb debugger.

sh.asm says that write() function is at address 0xd68
  let's start there
  $ make qemu-gdb
  (gdb) b *0xd68 -- write() in the shell
  (gdb) c
  (gdb) delete 1

the "li a7, 16" tells the kernel which system call
  SYS_write is 16
  kernel code will eventually check user's register a7

let's look at the registers
  (gdb) info reg
  $sp is a low address -- user memory starts at zero
  write() arguments: a0 is fd, a1 is buf, a2 is n
  (gdb) x/1c $a1
  the shell is printing the $ prompt

(gdb) stepi

the ecall instruction is going to switch to kernel mode!

(gdb) stepi

we're now in kernel mode. ecall did three things:
  jump to $stvec (i.e. set $pc to $stvec)
  save $pc in $sepc
  change mode from user to supervisor -- we can't see this
  (gdb) print/x $pc
  (gdb) print/x $stvec
    the kernel earlier set up $stvec
  (gdb) print/x $sepc
    the hardware saves $pc since it overwrites it
  again, all registers but $pc still have precious user values
  (gdb) info reg

note: ecall lets a user program switch to privileged supervisor mode
  but it does *not* let the user program control what instructions
  are executed in that mode. ecall always jumps to $stvec, which
  can only be written in supervisor mode (not by user programs), and
  which the kernel carefully set up to a known entry point (trampoline).

can we jump to kernel C functions at this point?
  no:
  need to switch to kernel address space
  need to save all 32 registers
    they still hold user values
    but kernel code will use (overwrite) the registers
  need to make $sp point to top of kernel stack

why didn't RISC-V design have ecall do all these things automatically?

how to save the 32 user registers?
  we want to store them in the trapframe
  but a store instruction needs an address, which must be in a register.
  where to get the address?
  what register to put it in, given that all 32 hold user values?
  the answer: RISC-V has a special sscratch register, previously set by the kernel.
  kernel set sscratch to the address of the trapframe.
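The effects of ecall and the role of sscratch can be sketched as a toy model in C. This is only an illustration of the steps listed above, not how the hardware or QEMU is written; struct toy_cpu and both function names are invented for the sketch, and it models a single general-purpose register (a0).

```c
#include <assert.h>
#include <stdint.h>

enum toy_mode { USER_MODE, SUPERVISOR_MODE };

/* Toy model of the CPU state discussed above (hypothetical type). */
struct toy_cpu {
    uint64_t pc;
    uint64_t a0;        /* stands in for the 32 user registers */
    uint64_t stvec;     /* trap entry address, set earlier by the kernel */
    uint64_t sepc;      /* where ecall saves the user pc */
    uint64_t scause;    /* reason for the trap */
    uint64_t sscratch;  /* kernel-set value: address of the trapframe */
    enum toy_mode mode;
};

#define SCAUSE_SYSCALL 8   /* from the notes: scause 8 means system call */

/* What ecall does: save pc, record the cause, raise privilege, and jump
   to stvec. Note that it does NOT touch a0 or any other general-purpose
   register, and the user program has no say in the destination. */
void toy_ecall(struct toy_cpu *c)
{
    assert(c->mode == USER_MODE);
    c->sepc = c->pc;
    c->scause = SCAUSE_SYSCALL;
    c->mode = SUPERVISOR_MODE;
    c->pc = c->stvec;
}

/* Exchange a0 with sscratch, the effect of "csrrw a0, sscratch, a0".
   Afterwards a0 holds the trapframe address and the user's a0 is parked
   in sscratch, so no user value has been lost. */
void toy_swap_a0_sscratch(struct toy_cpu *c)
{
    uint64_t tmp = c->a0;
    c->a0 = c->sscratch;
    c->sscratch = tmp;
}
```

After toy_ecall() followed by toy_swap_a0_sscratch(), the model has one register that points at the trapframe, which is exactly what the register-saving stores need.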
csrrw instruction can exchange a register and sscratch

we are now executing uservec in kernel/trampoline.S
  the purpose of trampoline.S is to hold the machine code needed to set up for C

uservec starts with csrrw a0, sscratch, a0
  saves user a0 in sscratch
  causes a0 to point to trapframe for saving user registers
  gdb has already executed this instruction (I don't know why)
  (gdb) print/x $a0 -- now points to trapframe in kernel
  (gdb) print/x $sscratch -- now holds saved user a0

now there are 32 saves of user registers to the trapframe, via a0
  so they can be restored when the system call returns
  let's skip them
  (gdb) b *0x3ffffff076
  (gdb) c

now we're setting up to be able to run C code in the kernel
  first a stack
    previously, kernel put a pointer to top of this process's
      kernel stack in trapframe
    "ld sp, 8(a0)" fetches it
    remember a0 points to the trapframe
    at this point the only kernel object the code knows how to get at is
      the trapframe, so everything has to be loaded from there.
  (gdb) stepi
  retrieve a pointer to info about the current process, from trapframe, into tp
  (gdb) stepi
  we want to jump to the kernel C function usertrap(), which
    the kernel previously saved in the trapframe.
    "ld t0, 16(a0)" fetches it into t0, we'll use it in a moment
  (gdb) stepi
  we are still running in an address space set up for the user program
    address spaces are configured using page tables, which the CPU knows about
    we need to tell the CPU to switch to the kernel page table / address space
    again, previously the kernel stashed a page table pointer in the trapframe
    we'll load the page table pointer into t1
    (more on page tables &c next lecture)
  (gdb) stepi
  (gdb) stepi
  the csrw satp, t1 actually installs the kernel page table
    now we can directly get at all kernel data and instructions
  (gdb) stepi
  the jr t0 is a jump to usertrap() (using t0 retrieved from trapframe)
  (gdb) stepi

we're now in usertrap() in kernel/trap.c
  various traps come here, e.g. errors, device interrupts, and system calls
  it looks in the scause register to see that the trap is a sys call
    see Figure 10.3 on page 102 of The RISC-V Reader
    cause 8 is a system call
  (gdb) next ... until syscall()
  (gdb) step

now we're in syscall() in kernel/syscall.c
  it retrieves the system call number from saved register a7 (remember?)
  p->tf points to the trapframe, with saved registers
    p->tf->a7 holds 16, SYS_write
    p->tf->a0 holds write() first argument -- fd
    p->tf->a1 holds buf
    p->tf->a2 holds n
  (gdb) next ...
  (gdb) print num
  then dispatches through syscalls[num], a table of functions
  (gdb) next ...
  (gdb) step

aha, we're in sys_write
  at this point system call implementations are fairly ordinary C code
  let's skip to the end, to see how a system call returns to user code
  (gdb) finish
  notice that write() produced console output (the shell's $ prompt)

back to syscall()
  the p->tf->a0 assignment puts the sys call return value in user a0
    will be restored later
    the C calling convention on RISC-V puts return values in a0
  (gdb) print ret
  write() returned 2 -- two characters -- $ and space
  (gdb) finish

back to usertrap()
  (gdb) next
  (gdb) step

now we're in usertrapret(), which starts the process of returning to the user program
  at the end, we'll use the RISC-V sret instruction
  but we need to prepare a few registers that sret uses
    sepc (user program counter)
    sstatus (the "previous mode" bit)
  before sret we need to prepare for the next system call
    set up trapframe and s* registers
  and we need to restore the user environment
    restore user registers
    switch to user address space

code in usertrapret():
  the w_stvec(TRAMPOLINE ...) causes ecall to jump to uservec (as we've seen)
  the assignments to p->tf->... set up values the trampoline code will need
    e.g. the address of the top of the kernel stack
  the sstatus code will cause sret to return to user (non-privileged) mode
  the w_sepc tells sret what user pc to resume at
  ends with jump to userret in trampoline.S
  (gdb) b *0x3ffffff090
  (gdb) c

the csrw satp switches to the user address space
  (gdb) stepi
  (gdb) stepi
  (gdb) stepi

the csrw sscratch puts the user a0 into sscratch
  just before sret we'll do a swap, so that a0 holds the user a0
    and sscratch holds the trapframe pointer, which is what uservec expects.

now lots of loads from the trapframe into registers
  these restore the user registers
  (gdb) b *0x3ffffff10a

here's the csrrw that swaps a0 with sscratch
  (gdb) print/x $a0 -- the return value from write()
  (gdb) print/x $sscratch -- trapframe address for uservec

now we're at the sret instruction
  (gdb) stepi

now we're back in the user program (pc = 0xd6e)
  returning 2 from the write() function
  (gdb) print/x $a0

and we're done!

summary
  system call entry/exit is far more complex than function call
  design is driven by the need for isolation
  you should ask yourself whether all this complexity is needed
    what other designs might be possible?

next lecture: address spaces, page tables, virtual memory
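
As a recap of the dispatch step in syscall(), here is a small stand-alone sketch: the call number comes from the saved a7, the handler is looked up in a table, and the result goes into the saved a0. It is a simplified model, not xv6's code. The toy_ names are invented, the trapframe holds only four registers, the handler takes the trapframe as a parameter, and returning -1 for an unknown number is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYS_write 16   /* from the notes: "li a7, 16" selects write */

/* Only the saved user registers this sketch needs (hypothetical type);
   the real trapframe in proc.h has room for all 32 plus kernel fields. */
struct toy_trapframe {
    uint64_t a0, a1, a2, a7;
};

typedef uint64_t (*toy_syscall_fn)(struct toy_trapframe *);

/* Stand-in for sys_write(): pretends all n bytes (saved a2) were written. */
static uint64_t toy_sys_write(struct toy_trapframe *tf)
{
    return tf->a2;
}

/* Table of handlers indexed by system call number; unused slots are NULL. */
static toy_syscall_fn toy_syscalls[] = {
    [SYS_write] = toy_sys_write,
};

/* Dispatch on the saved a7 and store the result in the saved a0, which
   the return path later restores into the user's a0 register. */
void toy_syscall(struct toy_trapframe *tf)
{
    uint64_t num = tf->a7;
    size_t count = sizeof(toy_syscalls) / sizeof(toy_syscalls[0]);

    assert(count == SYS_write + 1);  /* table ends at its last named slot */
    if (num < count && toy_syscalls[num] != NULL) {
        tf->a0 = toy_syscalls[num](tf);
    } else {
        tf->a0 = (uint64_t)-1;       /* assumption: unknown number fails */
    }
}
```

With a7 = 16 and a2 = 2, toy_syscall() leaves 2 in the saved a0, mirroring the "write() returned 2" step of the walk-through.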