6.1810 2023 Lecture 6: System Call Entry/Exit

Today: user -> kernel transition
  system calls, exceptions, device interrupts enter the kernel in the same way
  lots of careful design and important detail
  important for isolation and performance

What needs to happen when a program makes a system call, e.g. write()?
  [CPU | user/kernel diagram]
  CPU resources are set up for user execution (not kernel)
    32 registers, pc, privilege mode, satp
  what needs to happen?
    save 32 user registers and pc
    switch to supervisor mode
    switch to kernel page table
    switch to kernel stack
    jump to kernel C code
  high-level goals
    don't let user code interfere with user->kernel transition
      e.g. don't execute user code in supervisor mode!
    transparent to user code -- resume without disturbing

Today we're focusing on the user/kernel transition
  and ignoring what the system call implementation does once in the kernel

preview:
  write()                        write() returns
  ecall                                                     User
  ----------------------------------------------------------------
                                 sret                       Kernel
  uservec in trampoline.S        userret in trampoline.S  
  usertrap() in trap.c           usertrapret() in trap.c
  syscall() in syscall.c           ^
  sys_write() in sysfile.c      ---|

let's watch an xv6 system call entering/leaving the kernel
  xv6 shell writing its $ prompt
  user/sh.c line 137: write(2, "$ ", 2);
  user/usys.S line 29
    this is the write() function, still in user space
  a7 tells the kernel what system call we want -- SYS_write = 16
  ecall -- triggers the user/kernel transition

let's start by putting a breakpoint on the ecall
  user/sh.asm
  search for <write>:
  write()'s ecall is at address 0xe18

$ make qemu-gdb
$ gdb
(gdb) b *0xe18
(gdb) c
(gdb) delete 1
(gdb) b usertrapret
(gdb) x/3i 0xe16

let's print the registers
(gdb) p $pc
(gdb) p $sp
(gdb) p $a0 -- fd
(gdb) p/x $a1 -- "$ "
(gdb) p $a2 -- n

$pc and $sp are at low addresses -- user memory starts at zero
C on RISC-V puts function arguments in a0, a1, a2, &c
write() arguments: a0 is fd, a1 is buf, a2 is n

(gdb) x/2c $a1

the shell is printing the $ prompt, as we expected

what page table is in use?
  (gdb) p/x $satp
        not very useful
  qemu: control-a c, info mem
    there are mappings for seven pages
    [address space diagram -- 3.4.pdf]
    instructions x2, data, stack guard (no PTE_U), stack
    then two high pages: trapframe and trampoline
      data and code for user->kernel transition
    there are no mappings for kernel memory, devices, physical mem
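
sketch: decoding $satp by hand
  the raw value packs three fields, which is why printing it is not very useful.
  a minimal C sketch, assuming the RV64 satp layout from the privileged spec
  (MODE in bits 63:60, ASID in 59:44, PPN in 43:0).
  the helper names are invented for illustration; they are not xv6 functions.

```c
#include <stdint.h>

// Field layout of satp on RV64 (assumed from the RISC-V privileged spec).
#define SATP_MODE_SHIFT 60
#define SATP_ASID_SHIFT 44
#define SATP_ASID_MASK  0xffffULL
#define SATP_PPN_MASK   ((1ULL << 44) - 1)
#define SATP_MODE_SV39  8ULL

// Hypothetical helpers (not part of xv6).
uint64_t satp_mode(uint64_t satp) { return satp >> SATP_MODE_SHIFT; }

uint64_t satp_asid(uint64_t satp) {
  return (satp >> SATP_ASID_SHIFT) & SATP_ASID_MASK;
}

// Physical address of the root page-table page: PPN * 4096.
uint64_t satp_root_pa(uint64_t satp) { return (satp & SATP_PPN_MASK) << 12; }

// Inverse: build an Sv39 satp value from a page-aligned root address.
uint64_t make_satp(uint64_t root_pa) {
  return (SATP_MODE_SV39 << SATP_MODE_SHIFT) | (root_pa >> 12);
}
```

  a MODE of 8 means Sv39; satp_root_pa() gives the physical address of the
  top-level page-table page, which is what "info mem" walks for us.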

let's execute the ecall

(gdb) stepi

where are we?
  (gdb) p $pc
        a very high virtual address -- the trampoline
  (gdb) x/6i $pc
        this is uservec in kernel/trampoline.S
        it's the start of the kernel's trap handling code
  (gdb) p $sp
  (gdb) p $a0
        the registers hold user values (except $pc)
  qemu: info mem
        we're still using the user page table
        $pc is in the very last page, the trampoline

trampoline: the start of the kernel's trap handling code.
  must be in user page table, since ecall doesn't change satp.
  at the top to avoid punching a hole in the user virtual address space.
  protected: no PTE_U flag.
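
sketch: why "no PTE_U" protects the trampoline
  a small C model of the instruction-fetch permission check, assuming the
  Sv39 PTE flag bits (the same low bits xv6 defines in kernel/riscv.h).
  can_fetch() and enum mode are hypothetical names; real hardware does this
  check in the MMU, not in software.

```c
#include <stdbool.h>
#include <stdint.h>

// Low flag bits of an Sv39 page-table entry.
#define PTE_V (1ULL << 0) // valid
#define PTE_R (1ULL << 1) // readable
#define PTE_W (1ULL << 2) // writable
#define PTE_X (1ULL << 3) // executable
#define PTE_U (1ULL << 4) // accessible in user mode

enum mode { MODE_USER, MODE_SUPERVISOR };

// Hypothetical model: may code running in mode m fetch instructions
// from the page described by pte?
// User mode needs PTE_U; supervisor mode may only fetch from pages
// that do NOT have PTE_U.
bool can_fetch(uint64_t pte, enum mode m) {
  if (!(pte & PTE_V) || !(pte & PTE_X))
    return false;
  bool user_page = (pte & PTE_U) != 0;
  return m == MODE_USER ? user_page : !user_page;
}
```

  with the trampoline mapped as PTE_V|PTE_R|PTE_X, can_fetch() is false for
  user mode and true for supervisor mode, so user code cannot jump into it.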

the kernel previously set $stvec to the trampoline page:
  (gdb) p/x $stvec

can we tell that we're in supervisor mode?
  I don't know a way to find the mode directly
  but observe $pc is executing in a page with no PTE_U flag
    lack of crash implies we are in supervisor mode

how did we get here?
  ecall did four things:
    change mode from user to supervisor
    save $pc in $sepc
      (gdb) p/x $sepc
    jump to $stvec (i.e. set $pc to $stvec)
      the kernel previously set $stvec, before jumping to user space
    disable (really postpone) further interrupts
      (gdb) p/x $sstatus
      the SIE bit (0x02) is clear; the 0x20 bit is SPIE, the previous value of SIE
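
sketch: the four effects of ecall as a C model
  a software model only, assuming stvec is in direct mode and using the
  sstatus bit positions from the privileged spec. struct hart and
  ecall_from_user() are made-up names. the model also records scause and
  SPP, which the hardware sets as part of the same trap.

```c
#include <stdint.h>

#define SSTATUS_SIE  (1ULL << 1) // supervisor interrupt enable
#define SSTATUS_SPIE (1ULL << 5) // previous interrupt enable
#define SSTATUS_SPP  (1ULL << 8) // previous mode: 0 = user, 1 = supervisor
#define SCAUSE_ECALL_U 8         // environment call from user mode

enum mode { MODE_USER, MODE_SUPERVISOR };

// Hypothetical model of the CPU state that a trap touches.
struct hart {
  enum mode mode;
  uint64_t pc, sepc, stvec, scause, sstatus;
};

// Model of a user-mode ecall. Note what is NOT touched:
// satp, sp, and the 32 general-purpose registers.
void ecall_from_user(struct hart *h) {
  h->scause = SCAUSE_ECALL_U;
  h->sepc = h->pc;               // save the user pc
  if (h->sstatus & SSTATUS_SIE)  // remember the old SIE in SPIE
    h->sstatus |= SSTATUS_SPIE;
  else
    h->sstatus &= ~SSTATUS_SPIE;
  h->sstatus &= ~SSTATUS_SIE;    // postpone further interrupts
  h->sstatus &= ~SSTATUS_SPP;    // previous mode was user
  h->mode = MODE_SUPERVISOR;     // change mode
  h->pc = h->stvec;              // jump to the trap vector
}
```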

note: ecall lets user code switch to supervisor mode
  but the kernel immediately gains control via $stvec
  so the user program itself can't execute as supervisor

what needs to happen now?
  save the 32 user register values (for later transparent resume)
  switch to kernel page table
  set up stack for kernel C code
  jump to kernel C code -- usertrap()

why didn't the RISC-V designers have ecall do these things for us?
  to give O/S designers scope for very fast syscalls / exceptions / intrs
    maybe O/S can handle some traps w/o switching page tables
    maybe we can map BOTH user and kernel into a single page table
       so no page table switch required
    maybe some registers do not have to be saved
    maybe no stack is required for simple system calls
  so ecall does as little as possible

can we just write the 32 registers somewhere convenient in physical memory?
  no, even supervisor mode is constrained to use the page table

can we first set satp to the kernel page table?
  supervisor mode is allowed to set satp...
  but we don't know the address of the kernel page table at this point!
  and we need a free register to even execute csrw satp, $xx

we need one of the 32 general purpose registers to hold an address
  of the memory into which we'll save the 32 user registers
  but all 32 hold user values which we must preserve for eventual return

two parts to the solution for where to save the 32 user registers:
  1) xv6 maps a 2nd kernel page, the trapframe, into the user page table
     at a known virtual address, always the same: 0x3fffffe000
     trapframe has space to hold the saved registers
     the kernel gives each process a different trapframe page
     see struct trapframe in kernel/proc.h
     (but we still need a register holding the trapframe's address...)
  2) RISC-V provides the sscratch register
     supervisor code can use sscratch for temporary storage
     user code isn't allowed to use sscratch, so no need to save
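
sketch: where 0x3fffffe000 comes from
  the arithmetic, written as C constants. the macro names and definitions
  follow xv6's kernel/riscv.h and kernel/memlayout.h as best I recall them;
  check them against your copy. in_page() is a hypothetical helper.

```c
#include <stdint.h>

#define PGSIZE 4096ULL
// xv6 uses only 38 of Sv39's 39 address bits, to avoid sign-extension issues.
#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1))
#define TRAMPOLINE (MAXVA - PGSIZE)      // top page of every address space
#define TRAPFRAME  (TRAMPOLINE - PGSIZE) // the page just below it

// Hypothetical helper: is va inside the page that starts at base?
int in_page(uint64_t va, uint64_t base) {
  return va >= base && va < base + PGSIZE;
}
```

  TRAPFRAME works out to 0x3fffffe000 and TRAMPOLINE to 0x3ffffff000,
  the two high pages seen in "info mem".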

see this at the start of uservec in trampoline.S:
  csrw sscratch, a0
then a few instructions to load TRAPFRAME into a0

(gdb) stepi
(gdb) p/x $sscratch
      0x2, the old first argument (fd)
(gdb) stepi
(gdb) stepi
(gdb) stepi
(gdb) p/x $a0
      address of the trapframe

now uservec() has 32 saves of user registers to the trapframe, via a0
  so they can be restored later, when the system call returns
  let's skip them

(gdb) b *0x3ffffff07e
(gdb) c

now we're setting up to be able to run C code in the kernel
  couldn't before this, since C code would have overwritten
  user registers and stack; thus trampoline is assembler.
first a stack
  previously, kernel put a pointer to top of this process's
    kernel stack in trapframe
  look at struct trapframe in kernel/proc.h
  "ld sp, 8(a0)" fetches the kernel stack pointer
  remember a0 points to the trapframe
  at this point the only kernel data mapped in the page table
    is the trapframe, so everything has to be loaded from there.

(gdb) stepi

retrieve hart ID into tp

(gdb) stepi

we want to jump to the kernel C function usertrap(), which
  the kernel previously saved in the trapframe.
  "ld t0, 16(a0)" fetches it into t0, we'll use it in a moment,
    after switching to the kernel page table
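
sketch: the trapframe offsets used by uservec
  the "8(a0)" and "16(a0)" offsets are byte offsets into struct trapframe.
  a cut-down C struct with only the leading fields; the field order is what
  I believe kernel/proc.h uses, so treat it as an assumption and compare
  with the real struct. trapframe_head is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

// Leading fields of the trapframe (the saved user registers follow
// in the real struct and are omitted here).
struct trapframe_head {
  uint64_t kernel_satp;   //  0: kernel page table (satp value)
  uint64_t kernel_sp;     //  8: top of this process's kernel stack
  uint64_t kernel_trap;   // 16: address of usertrap()
  uint64_t epc;           // 24: saved user program counter
  uint64_t kernel_hartid; // 32: hart ID, loaded into tp
};

// The two offsets quoted in the notes, checked at compile time.
_Static_assert(offsetof(struct trapframe_head, kernel_sp) == 8,
               "ld sp, 8(a0)");
_Static_assert(offsetof(struct trapframe_head, kernel_trap) == 16,
               "ld t0, 16(a0)");
```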

(gdb) stepi

fetch the kernel page table's satp value from the trapframe,
write it to satp, and issue an sfence.vma to flush stale TLB entries.

(gdb) stepi
(gdb) stepi
(gdb) stepi

why isn't there a crash when satp is changed?
  after all we just switched page tables while executing!
  answer: the trampoline page is mapped at the same virtual address
    in the kernel page table as well as every user page table

(gdb) p $pc
qemu: info mem

with the kernel page table we can now use kernel functions and data

the jr t0 is a jump to usertrap() (using t0 retrieved from trapframe)

(gdb) p/x $t0
(gdb) p usertrap
(gdb) stepi
(gdb) layout src

we're now in usertrap() in kernel/trap.c
  all traps from user-space follow the path we've just seen
  system calls, device interrupts, exceptions

Q: what if a device interrupt occurs while executing in the trampoline code?

usertrap() looks in the scause register to see the trap cause
  (gdb) p $scause
  page 71 of riscv-privileged-X.pdf -- Table 4.2
  scause = 8 is a system call
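
sketch: reading scause
  a hedged C helper that names a handful of scause values, assuming the
  RV64 encoding (top bit set means interrupt, low bits are the cause code).
  scause_name() is hypothetical and covers only part of the spec's table.

```c
#include <stdint.h>

#define SCAUSE_INTERRUPT (1ULL << 63) // 1 = interrupt, 0 = exception

// Hypothetical helper: name a few scause values (not the full table).
const char *scause_name(uint64_t scause) {
  if (scause & SCAUSE_INTERRUPT) {
    switch (scause & ~SCAUSE_INTERRUPT) {
    case 1:  return "supervisor software interrupt";
    case 5:  return "supervisor timer interrupt";
    case 9:  return "supervisor external interrupt";
    default: return "other interrupt";
    }
  }
  switch (scause) {
  case 2:  return "illegal instruction";
  case 8:  return "ecall from user mode";
  case 12: return "instruction page fault";
  case 13: return "load page fault";
  case 15: return "store/AMO page fault";
  default: return "other exception";
  }
}
```

  system calls, exceptions, and device interrupts all arrive in usertrap();
  scause is how it tells them apart.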

(gdb) next ... until syscall()
(gdb) step
(gdb) next

now we're in syscall() kernel/syscall.c
myproc() uses tp to retrieve current struct proc *
p->xxx is usually a slot in the current process's struct proc

syscall() retrieves the system call number from saved register a7
  p->trapframe points to the trapframe, with saved registers
  p->trapframe->a7 holds 16, SYS_write
  p->trapframe->a0 holds write() first argument -- fd
  p->trapframe->a1 holds buf
  p->trapframe->a2 holds n

Q: why can't this code just look at a7? why must it look in p->trapframe?

(gdb) next ...
(gdb) p num

then dispatches through syscalls[num], a table of functions

(gdb) p syscalls[num]
(gdb) next ...
(gdb) step

aha, we're in sys_write.
at this point system call implementations are fairly ordinary C code.
let's skip to the end, to see how a system call returns to user space.

(gdb) finish

notice that write() produced console output (the shell's $ prompt)
syscall()'s assignment to p->trapframe->a0 causes (eventually) a0 to hold the return value
  the C calling convention on RISC-V puts return values in a0
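
sketch: dispatch on a7, result into a0
  a standalone C model of the dispatch step, not the xv6 source.
  assumptions: NSYSCALL, sys_write_stub(), and syscall_model() are invented
  names, and handlers here take the trapframe as a parameter for simplicity,
  whereas the real handlers are reached through syscalls[num] and fetch
  their arguments from the saved registers themselves.

```c
#include <stdint.h>

#define SYS_write 16
#define NSYSCALL  32 // hypothetical table size

// Only the saved registers this sketch needs.
struct trapframe { uint64_t a0, a1, a2, a7; };

// Stand-in for the real sys_write(): pretend all n bytes were written.
static uint64_t sys_write_stub(struct trapframe *tf) { return tf->a2; }

// Table of handlers indexed by system call number.
static uint64_t (*syscalls[NSYSCALL])(struct trapframe *) = {
  [SYS_write] = sys_write_stub,
};

// Dispatch on the saved a7 and store the result in the saved a0,
// which userret later restores into the real a0 register.
void syscall_model(struct trapframe *tf) {
  uint64_t num = tf->a7;
  if (num > 0 && num < NSYSCALL && syscalls[num])
    tf->a0 = syscalls[num](tf);
  else
    tf->a0 = (uint64_t)-1; // unknown system call
}
```

  note the model reads tf->a7, not a live register: by the time C code runs,
  the real a7 has been reused by the kernel.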

(gdb) next

back to usertrap()

(gdb) p p->trapframe->a0

write() returned 2 -- two characters -- $ and space

(gdb) next
(gdb) step

now we're in usertrapret(), which starts the process of returning
  to the user program

(gdb) b 129

we need to prepare for the next user->kernel transition
  stvec = uservec (the trampoline), for the next ecall
  trapframe satp = kernel page table, for next uservec
  trapframe sp = top of kernel stack
  trapframe trap = usertrap
  trapframe hartid = hartid (in tp)

at the very end, we'll use the RISC-V sret instruction
  we need to prepare a few registers that sret uses
  sstatus -- set the "previous mode" bit to user
  sepc -- the saved user program counter (from trap entry)
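
sketch: preparing sstatus for sret
  the bit manipulation, as a pure C function, assuming the sstatus bit
  positions from the privileged spec. sstatus_for_user_return() is a
  hypothetical name; the kernel does the same thing inline with
  CSR read/write helpers.

```c
#include <stdint.h>

#define SSTATUS_SPIE (1ULL << 5) // interrupt enable to restore on sret
#define SSTATUS_SPP  (1ULL << 8) // mode to return to: 0 = user, 1 = supervisor

// Hypothetical helper: the sstatus value to write before an sret
// that should land in user mode with interrupts enabled.
uint64_t sstatus_for_user_return(uint64_t sstatus) {
  sstatus &= ~SSTATUS_SPP; // sret will drop to user mode
  sstatus |= SSTATUS_SPIE; // sret will copy this into SIE
  return sstatus;
}
```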

we'll need to switch to the user page table
  can't do that in usertrapret(), since its code isn't mapped in the user page table.
  need a page that's mapped in both user and kernel page table -- the trampoline.
  jump to userret in trampoline.S

(gdb) tui disable
(gdb) step
(gdb) x/8i $pc

a0 holds user page table address
the csrw satp switches to the user address space

(gdb) stepi
(gdb) stepi
(qemu) info mem

now 32 loads from the trapframe into registers
  these restore the user registers
  let's skip over them

(gdb) b *0x3ffffff11a
(gdb) c

a0 is restored last, after which we can no longer get at TRAPFRAME

(gdb) p/x $a0 -- the return value from write()

now we're at the sret instruction

sret:
  copies sepc to pc
  changes mode to user
  re-enables interrupts (really copies SPIE to SIE)
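
sketch: sret as a C model
  a software model of the three effects listed above, plus the two sstatus
  updates the privileged spec also describes (SPIE set to 1, SPP reset to
  user). struct hart and sret_model() are made-up names.

```c
#include <stdint.h>

#define SSTATUS_SIE  (1ULL << 1) // supervisor interrupt enable
#define SSTATUS_SPIE (1ULL << 5) // previous interrupt enable
#define SSTATUS_SPP  (1ULL << 8) // previous mode: 0 = user, 1 = supervisor

enum mode { MODE_USER, MODE_SUPERVISOR };

// Hypothetical model of the CPU state that sret touches.
struct hart {
  enum mode mode;
  uint64_t pc, sepc, sstatus;
};

void sret_model(struct hart *h) {
  h->pc = h->sepc;                      // copy sepc to pc
  h->mode = (h->sstatus & SSTATUS_SPP)  // return to the "previous" mode
                ? MODE_SUPERVISOR : MODE_USER;
  if (h->sstatus & SSTATUS_SPIE)        // copy SPIE to SIE
    h->sstatus |= SSTATUS_SIE;
  else
    h->sstatus &= ~SSTATUS_SIE;
  h->sstatus |= SSTATUS_SPIE;           // SPIE becomes 1
  h->sstatus &= ~SSTATUS_SPP;           // SPP becomes user
}
```

  this is why usertrapret() sets sepc and the sstatus bits beforehand:
  sret only consumes what the kernel has already placed there.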

(gdb) p $pc
(gdb) p $sepc
(gdb) stepi
(gdb) p $pc

now we're back in the user program ($pc = 0xe1c)
  returning 2 from the write() function

(gdb) p/x $a0

and we're done with a system call!

summary
  system call entry/exit is far more complex than function call
  much of the complexity is due to the requirement for isolation
    and the desire for simple and fast hardware mechanisms
  a few design questions to ponder:
    can an evil program abuse the entry mechanism?
    can you think of ways to make the hardware or software simpler?
    can you think of ways to make traps faster?