6.1810 2022 Lecture 6: System Call Entry/Exit Today: user -> kernel transition system calls, faults, interrupts enter the kernel in the same way lots of careful design and important detail important for isolation and performance What needs to happen when a program makes a system call, e.g. write()? [CPU | user/kernel diagram] CPU resources are set up for user execution (not kernel) 32 registers, sp, pc, privilege mode, satp, stvec, sepc, ... what needs to happen? save 32 user registers and pc switch to supervisor mode switch to kernel page table switch to kernel stack jump to kernel C code high-level goals don't let user code interfere with user->kernel transition e.g. don't execute user code in supervisor mode! transparent to user code -- resume without disturbing Today we're focusing on the user/kernel transition and ignoring what the system call implemenation does once in the kernel but the sys call impl has to be careful and secure also! What does the CPU's "mode" protect? i.e. what does switching mode from user to supervisor allow? supervisor can read/write CPU control registers: satp -- page table physical address stvec -- ecall jumps here in kernel; points to trampoline sepc -- ecall saves user pc here sscratch -- temporary for a0 supervisor can use PTEs that have no PTE_U flag but supervisor has no other powers! e.g. can't use addresses that aren't the in page table so kernel has to carefully set things up so it can work preview: write() write() returns User ecall ----------------------------------------------------------------- sret uservec() in trampoline.S userret() in trampoline.S Kernel usertrap() in trap.c usertrapret() in trap.c syscall() in syscall.c ^ sys_write() in sysfile.c ---| let's watch an xv6 system call entering/leaving the kernel xv6 shell writing its $ prompt sh.c line 137: write(2, "$ ", 2); user/usys.S line 29 this is the write() function, still in user space a7 tells the kernel what system call we want -- SYS_write = 16 ecall -- triggers the user/kernel transition let's start by putting a breakpoint on the ecall user/sh.asm says write()'s ecall is at address 0xe18 $ make qemu-gdb (gdb) b *0xe18 (gdb) c (gdb) delete 1 (gdb) x/3i 0xe16 let's look at the registers (gdb) print $pc (gdb) info reg $pc and $sp are at low addresses -- user memory starts at zero C on RISC-V puts function arguments in a0, a1, a2, &c write() arguments: a0 is fd, a1 is buf, a2 is n (gdb) x/2c $a1 the shell is printing the $ prompt what page table is in use? (gdb) print/x $satp not very useful qemu: control-a c, info mem there are mappings for seven pages [address space diagram] instructions x2, data, stack guard (no PTE_U), stack then two high mystery pages: trapframe and trampoline there are no mappings for kernel memory, devices, physical mem let's execute the ecall (gdb) stepi where are we? (gdb) print $pc we're executing at a very high virtual address (gdb) x/6i 0x3ffffff000 these are the instructions we're about to execute see uservec in kernel/trampoline.S it's the start of the kernel's trap handling code (gdb) info reg the registers hold user values (except $pc) qemu: info mem we're still using the user page table note that $pc is in the trampoline page, the very last page we're executing in the "trampoline" page, which contains the start of the kernel's trap handling code. ecall doesn't switch page tables, so these kernel instructions have to exist somewhere in the user page table. the trampoline page is the answer: the kernel maps it at the top of every user page table. the kernel sets $stvec to the trampoline page's virtual address. the trampoline is protected: no PTE_U flag. (gdb) print/x $stvec can we tell that we're in supervisor mode? I don't know a way to find the mode directly but observe $pc is executing in a page with no PTE_U flag lack of crash implies we are in supervisor mode how did we get here? ecall did three things: change mode from user to supervisor save $pc in $sepc (gdb) print/x $sepc jump to $stvec (i.e. set $pc to $stvec) the kernel previously set $stvec, before jumping to user space note: ecall lets user code switch to supervisor mode but the kernel immediately gains control via $stvec so the user program itself can't execute as supervisor what needs to happen now? save the 32 user register values (for later transparent resume) switch to kernel page table set up stack for kernel C code jump to kernel C code why didn't the RISC-V designers have ecall do these things for us? ecall does as little as possible to give O/S designers scope for very fast syscalls / faults / intrs maybe O/S can handle some traps w/o switching page tables maybe we can map BOTH user and kernel into a single page table so no page table switch required maybe some registers do not have to be saved maybe no stack is required for simple system calls what are the options at this point for saving user registers? can we just write them somewhere convenient in physical memory? no, even supervisor mode is constrained to use the page table can we first set satp to the kernel page table? supervisor mode is allowed to set satp... but we don't know the address of the kernel page table at this point! and we need a free register to even execute csrw satp, $xx two parts to the solution for where to save the 32 user registers: 1) xv6 maps a 2nd kernel page, the trapframe, into every user page table it has space to hold the saved registers the kernel gives each process a different trapframe page the page at 0x3fffffe000 is the trapframe page see struct trapframe in kernel/proc.h (but we still need a register holding the trapframe's address...) 2) RISC-V provides the sscratch register supervisor code can use sscratch for temporary storage user code isn't allowed to use sscratch, so no need to save see this at the start of uservec in trapframe.S: csrw sscratch, a0 then a few instructions to load TRAPFRAME into a0 (gdb) stepi (gdb) stepi (gdb) stepi (gdb) stepi (gdb) print/x $a0 address of the trapframe (gdb> print/x $sscratch 0x2, the old first argument (fd) now uservec() has 32 saves of user registers to the trapframe, via a0 so they can be restored later, when the system call returns let's skip them (gdb) b *0x3ffffff07e (gdb) c now we're setting up to be able to run C code in the kernel first a stack previously, kernel put a pointer to top of this process's kernel stack in trapframe look at struct trapframe in kernel/proc.h "ld sp, 8(a0)" fetches the kernel stack pointer remember a0 points to the trapframe at this point the only kernel data the code can get at is the trapframe, so everything has to be loaded from there. (gdb) stepi retrieve hart ID into tp (gdb) stepi we want to jump to the kernel C function usertrap(), which the kernel previously saved in the trapframe. "ld t0, 16(a0)" fetches it into t0, we'll use it in a moment, after switching to the kernel page table (gdb) stepi load a pointer to the kernel pagetable from the trapframe, and load it into satp, and issue an sfence to clear the TLB. (gdb) stepi (gdb) stepi (gdb) stepi why isn't there a crash at this point? after all we just switched page tables while executing! answer: the trampoline page is mapped at the same virtual address in the kernel page table as well as every user page table (gdb) print $pc qemu: info mem with the kernel page table we can now use kernel functions and data the jr t0 is a jump to usertrap() (using t0 retrieved from trapframe) (gdb) print/x $t0 (gdb) x/4i $t0 (gdb) stepi (gdb) tui enable we're now in usertrap() in kernel/trap.c various traps come here, e.g. errors, device interrupts, and system calls usertrap() looks in the scause register to see the trap cause see Figure 10.3 on page 102 of The RISC-V Reader scause = 8 is a system call (gdb) next ... until syscall() (gdb) step (gdb) next now we're in syscall() kernel/syscall.c myproc() uses tp to retrieve current struct proc * p->xxx is usually a slot in the current process's struct proc syscall() retrieves the system call number from saved register a7 p->trapframe points to the trapframe, with saved registers p->trapframe->a7 holds 16, SYS_write p->trapframe->a0 holds write() first argument -- fd p->trapframe->a1 holds buf p->trapframe->a2 holds n (gdb) next ... (gdb) print num then dispatches through syscall[num], a table of functions (gdb) next ... (gdb) step aha, we're in sys_write. at this point system call implementations are fairly ordinary C code. let's skip to the end, to see how a system call returns to user space. (gdb) finish notice that write() produced console output (the shell's $ prompt) back to syscall() the p->tf->a0 assignment causes (eventually) a0 to hold the return value the C calling convention on RISC-V puts return values in a0 (gdb) next back to usertrap() (gdb) print p->trapframe->a0 write() returned 2 -- two characters -- $ and space (gdb) next (gdb) step now we're in usertrapret(), which starts the process of returning to the user program we need to prepare for the next user->kernel transition stvec = uservec (the trampoline), for the next ecall trapframe satp = kernel page table, for next uservec trapframe sp = top of kernel stack trapframe trap = usertrap trapframe hartid = hartid (in tp) at the end, we'll use the RISC-V sret instruction we need to prepare a few registers that sret uses sstatus -- set the "previous mode" bit to user sepc -- the saved user program counter (from trap entry) we'll need to switch to the user page table not OK in usertrapret(), since it's not mapped in the user page table. need a page that's mapped in both user and kernel page table -- the trampoline. jump to userret in trampoline.S (gdb) tui disable (gdb) step (gdb) x/8i 0x3ffffff09c a0 holds user page table address the csrw satp switches to the user address space (gdb) stepi (gdb) stepi (qemu) info mem now 32 loads from the trapframe into registers these restore the user registers let's skip over them (gdb) b *0x3ffffff11a (gdb) c a0 is restored last, after which we can no longer get at TRAPFRAME (gdb) print/x $a0 -- the return value from write() now we're at the sret instruction (gdb) print $pc (gdb) stepi (gdb) print $pc now we're back in the user program ($pc = 0xe1c) returning 2 from the write() function (gdb) print/x $a0 and we're done with a system call! summary system call entry/exit is far more complex than function call much of the complexity is due to the requirement for isolation and the desire for simple and fast hardware mechanisms a few design questions to ponder: can an evil program abuse the entry mechanism? can you think of ways to make the hardware or software simpler? can you think of ways to make traps faster?