6.1810 2022 Lecture 6: System Call Entry/Exit

Today: user -> kernel transition
  system calls, faults, interrupts enter the kernel in the same way
  lots of careful design and important detail
  important for isolation and performance

What needs to happen when a program makes a system call, e.g. write()?
  [CPU | user/kernel diagram]
  CPU resources are set up for user execution (not kernel)
    32 registers, sp, pc, privilege mode, satp, stvec, sepc, ...
  what needs to happen?
    save 32 user registers and pc
    switch to supervisor mode
    switch to kernel page table
    switch to kernel stack
    jump to kernel C code
  high-level goals
    don't let user code interfere with user->kernel transition
      e.g. don't execute user code in supervisor mode!
    transparent to user code -- resume without disturbing

Today we're focusing on the user/kernel transition
  and ignoring what the system call implementation does once in the kernel
  but the sys call impl has to be careful and secure also!

What does the CPU's "mode" protect?
  i.e. what does switching mode from user to supervisor allow?
  supervisor can read/write CPU control registers:
    satp -- page table physical address
    stvec -- ecall jumps here in kernel; points to trampoline
    sepc -- ecall saves user pc here
    sscratch -- temporary for a0
  supervisor can use PTEs that have no PTE_U flag
  but supervisor has no other powers!
    e.g. can't use addresses that aren't in the page table
    so kernel has to carefully set things up so it can work

preview:
  write()                        write() returns              User
  ecall
  -----------------------------------------------------------------
                                 sret
  uservec() in trampoline.S      userret() in trampoline.S   Kernel
  usertrap() in trap.c           usertrapret() in trap.c
  syscall() in syscall.c           ^
  sys_write() in sysfile.c      ---|

let's watch an xv6 system call entering/leaving the kernel
  xv6 shell writing its $ prompt
  sh.c line 137: write(2, "$ ", 2);
  user/usys.S line 29
    this is the write() function, still in user space
  a7 tells the kernel what system call we want -- SYS_write = 16
  ecall -- triggers the user/kernel transition

let's start by putting a breakpoint on the ecall
  user/sh.asm says write()'s ecall is at address 0xe18

$ make qemu-gdb
(gdb) b *0xe18
(gdb) c
(gdb) delete 1
(gdb) x/3i 0xe16

let's look at the registers
(gdb) print $pc
(gdb) info reg

$pc and $sp are at low addresses -- user memory starts at zero
C on RISC-V puts function arguments in a0, a1, a2, &c
write() arguments: a0 is fd, a1 is buf, a2 is n

(gdb) x/2c $a1

the shell is printing the $ prompt

what page table is in use?
  (gdb) print/x $satp
        not very useful
  qemu: control-a c, info mem
    there are mappings for seven pages
    [address space diagram]
    instructions x2, data, stack guard (no PTE_U), stack
    then two high mystery pages: trapframe and trampoline
    there are no mappings for kernel memory, devices, physical mem
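
the raw satp value is hard to read because it packs several fields.
a sketch of decoding it, assuming the Sv39 layout (MODE in bits 63..60,
ASID in bits 59..44, physical page number in bits 43..0). the function
names are hypothetical; make_satp() follows the shape of xv6's
MAKE_SATP macro.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_SV39 8u  /* value of the 4-bit MODE field for Sv39 */

/* MODE lives in bits 63..60 of satp. */
unsigned satp_mode(uint64_t satp)
{
  return (unsigned)(satp >> 60);
}

/* The low 44 bits hold the physical page number of the root page
   table; shifting left by 12 recovers its physical address. */
uint64_t satp_root_pa(uint64_t satp)
{
  uint64_t ppn = satp & ((UINT64_C(1) << 44) - 1);
  return ppn << 12;
}

/* Build a satp value from a page-aligned root page table address. */
uint64_t make_satp(uint64_t root_pa)
{
  assert((root_pa & 0xfff) == 0);  /* must be page-aligned */
  return ((uint64_t)SATP_MODE_SV39 << 60) | (root_pa >> 12);
}
```

with a made-up root address of 0x87f6f000, make_satp() gives
0x8000000000087f6f, which is the kind of value print/x $satp shows.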

let's execute the ecall

(gdb) stepi

where are we?
  (gdb) print $pc
        we're executing at a very high virtual address
  (gdb) x/6i 0x3ffffff000
        these are the instructions we're about to execute
        see uservec in kernel/trampoline.S
        it's the start of the kernel's trap handling code
  (gdb) info reg
        the registers hold user values (except $pc)
  qemu: info mem
        we're still using the user page table
        note that $pc is in the trampoline page, the very last page

we're executing in the "trampoline" page, which contains the start of
the kernel's trap handling code. ecall doesn't switch page tables, so
these kernel instructions have to exist somewhere in the user page
table. the trampoline page is the answer: the kernel maps it at the
top of every user page table. the kernel sets $stvec to the trampoline
page's virtual address. the trampoline is protected: no PTE_U flag.
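
where does 0x3ffffff000 come from? a sketch of the arithmetic, using
the macro names from xv6's kernel/riscv.h and kernel/memlayout.h.
page_of() is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096
/* Sv39 has 39 bits of virtual address; xv6 uses one bit less, which
   avoids having to sign-extend addresses with the high bit set. */
#define MAXVA (UINT64_C(1) << (9 + 9 + 9 + 12 - 1))
#define TRAMPOLINE (MAXVA - PGSIZE)      /* the very last page */
#define TRAPFRAME (TRAMPOLINE - PGSIZE)  /* the page just below it */

static_assert(TRAMPOLINE == UINT64_C(0x3ffffff000), "top page");

/* Round a virtual address down to the start of its page. */
uint64_t page_of(uint64_t va)
{
  return va & ~(uint64_t)(PGSIZE - 1);
}
```

any $pc inside the trap code, e.g. 0x3ffffff07e, satisfies
page_of(pc) == TRAMPOLINE.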

(gdb) print/x $stvec

can we tell that we're in supervisor mode?
  I don't know a way to find the mode directly
  but observe $pc is executing in a page with no PTE_U flag
    lack of crash implies we are in supervisor mode

how did we get here?
  ecall did three things:
    change mode from user to supervisor
    save $pc in $sepc
      (gdb) print/x $sepc
    jump to $stvec (i.e. set $pc to $stvec)
      the kernel previously set $stvec, before jumping to user space

note: ecall lets user code switch to supervisor mode
  but the kernel immediately gains control via $stvec
  so the user program itself can't execute as supervisor
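
the effect of ecall can be sketched as a function over a toy CPU
state. struct hart and ecall_from_user() are hypothetical names; this
is a simplified model, not a description of the hardware. it also
records the cause in scause (8 for a system call), which usertrap()
reads later.

```c
#include <assert.h>
#include <stdint.h>

enum priv { PRIV_USER, PRIV_SUPERVISOR };

/* Hypothetical model of the CPU state that ecall touches. */
struct hart {
  enum priv mode;
  uint64_t pc;
  uint64_t sepc;
  uint64_t stvec;
  uint64_t scause;
};

#define SCAUSE_ECALL_FROM_U 8

/* Simplified effect of an ecall executed in user mode. Note what is
   NOT changed: satp, sp, and the general-purpose registers. */
void ecall_from_user(struct hart *h)
{
  assert(h->mode == PRIV_USER);
  h->sepc = h->pc;                  /* remember where the ecall was */
  h->scause = SCAUSE_ECALL_FROM_U;  /* record why we trapped */
  h->mode = PRIV_SUPERVISOR;
  h->pc = h->stvec;                 /* entry point chosen by the kernel */
}
```

the user program picks when to trap, but never where execution
continues: pc always comes from stvec, which only the kernel can write.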

what needs to happen now?
  save the 32 user register values (for later transparent resume)
  switch to kernel page table
  set up stack for kernel C code
  jump to kernel C code

why didn't the RISC-V designers have ecall do these things for us?
  ecall does as little as possible
  to give O/S designers scope for very fast syscalls / faults / intrs
    maybe O/S can handle some traps w/o switching page tables
    maybe we can map BOTH user and kernel into a single page table
       so no page table switch required
    maybe some registers do not have to be saved
    maybe no stack is required for simple system calls

what are the options at this point for saving user registers?
  can we just write them somewhere convenient in physical memory?
    no, even supervisor mode is constrained to use the page table
  can we first set satp to the kernel page table?
    supervisor mode is allowed to set satp...
    but we don't know the address of the kernel page table at this point!
    and we need a free register to even execute csrw satp, $xx

two parts to the solution for where to save the 32 user registers:
  1) xv6 maps a 2nd kernel page, the trapframe, into every user page table
     it has space to hold the saved registers
     the kernel gives each process a different trapframe page
     the page at 0x3fffffe000 is the trapframe page
     see struct trapframe in kernel/proc.h
     (but we still need a register holding the trapframe's address...)
  2) RISC-V provides the sscratch register
     supervisor code can use sscratch for temporary storage
     user code isn't allowed to use sscratch, so no need to save

see this at the start of uservec in trampoline.S:
  csrw sscratch, a0
then a few instructions to load TRAPFRAME into a0

(gdb) stepi
(gdb) stepi
(gdb) stepi
(gdb) stepi
(gdb) print/x $a0
      address of the trapframe
(gdb) print/x $sscratch
      0x2, the old first argument (fd)

now uservec() has 32 saves of user registers to the trapframe, via a0
  so they can be restored later, when the system call returns
  let's skip them
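
a sketch of how a0 is parked in sscratch and later saved, modeled in C.
struct regs, struct saved and save_user_a0() are hypothetical; the
instruction names in the comments assume the xv6 trampoline.S source,
where a0's slot is at offset 112 in the trapframe.

```c
#include <assert.h>
#include <stdint.h>

#define TRAPFRAME UINT64_C(0x3fffffe000)

/* Hypothetical stand-ins for two CPU registers and one trapframe slot. */
struct regs { uint64_t a0; uint64_t sscratch; };
struct saved { uint64_t a0; };

/* Mirrors the shape of uservec's handling of a0. */
void save_user_a0(struct regs *r, struct saved *tf)
{
  r->sscratch = r->a0;  /* csrw sscratch, a0: park the user's a0 */
  r->a0 = TRAPFRAME;    /* li a0, TRAPFRAME: a0 now addresses the trapframe */
  /* ... the other user registers are stored through a0 here ... */
  tf->a0 = r->sscratch; /* csrr t0, sscratch; sd t0, 112(a0) */
}
```

t0 is safe to use as a temporary in the last step because the user's
t0 has already been stored in the trapframe by then.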

(gdb) b *0x3ffffff07e
(gdb) c

now we're setting up to be able to run C code in the kernel
first a stack
  previously, kernel put a pointer to top of this process's
    kernel stack in trapframe
  look at struct trapframe in kernel/proc.h
  "ld sp, 8(a0)" fetches the kernel stack pointer
  remember a0 points to the trapframe
  at this point the only kernel data the code can
    get at is the trapframe, so everything has to be loaded from there.

(gdb) stepi

retrieve hart ID into tp

(gdb) stepi

we want to jump to the kernel C function usertrap(), whose address
  the kernel previously saved in the trapframe.
  "ld t0, 16(a0)" fetches it into t0, we'll use it in a moment,
    after switching to the kernel page table

(gdb) stepi

load a pointer to the kernel pagetable from the trapframe,
and load it into satp, and issue an sfence to clear the TLB.
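
the offsets in those loads come from the layout of struct trapframe.
an abbreviated sketch follows; the field names are as I recall them
from kernel/proc.h, so check them against your own tree. the real
struct names each saved register individually; user_regs[] here is a
hypothetical shorthand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Abbreviated sketch of struct trapframe. */
struct trapframe {
  uint64_t kernel_satp;    /*   0: kernel page table, loaded into satp */
  uint64_t kernel_sp;      /*   8: top of this process's kernel stack */
  uint64_t kernel_trap;    /*  16: address of usertrap() */
  uint64_t epc;            /*  24: saved user program counter */
  uint64_t kernel_hartid;  /*  32: hart ID, loaded into tp */
  uint64_t user_regs[31];  /*  40: ra, sp, gp, ..., t6 (x1..x31) */
};

/* The offsets the trampoline code depends on. */
static_assert(offsetof(struct trapframe, kernel_sp) == 8, "ld sp, 8(a0)");
static_assert(offsetof(struct trapframe, kernel_trap) == 16, "ld t0, 16(a0)");
```

a0 is x10, so its slot is user_regs[9], at byte offset 40 + 9*8 = 112.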

(gdb) stepi
(gdb) stepi
(gdb) stepi

why isn't there a crash at this point?
  after all we just switched page tables while executing!
  answer: the trampoline page is mapped at the same virtual address
    in the kernel page table as well as every user page table
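
a toy illustration of that answer. the page tables here are flat
lists, not real Sv39 trees, and every physical address is made up; all
names are hypothetical. the point is only that the next instruction
fetch translates to the same physical page before and after the switch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRAMPOLINE UINT64_C(0x3ffffff000)

struct mapping { uint64_t va; uint64_t pa; };

/* Toy lookup over a flat list; returns 0 when va is unmapped. */
uint64_t translate(const struct mapping *pt, size_t n, uint64_t va)
{
  for (size_t i = 0; i < n; i++)
    if (pt[i].va == va)
      return pt[i].pa;
  return 0;
}

/* Made-up physical addresses, chosen only for illustration. */
static const struct mapping user_pt[] = {
  { 0x0, 0x87f60000 },         /* user text */
  { TRAMPOLINE, 0x80007000 },  /* trampoline code */
};
static const struct mapping kernel_pt[] = {
  { 0x80000000, 0x80000000 },  /* kernel text, direct-mapped */
  { TRAMPOLINE, 0x80007000 },  /* same VA, same physical page */
};

/* Nonzero if the page at `pc` holds the same code in both tables,
   i.e. execution can continue across "csrw satp". */
int fetch_survives_switch(uint64_t pc)
{
  uint64_t before = translate(user_pt, 2, pc);
  uint64_t after = translate(kernel_pt, 2, pc);
  return before != 0 && before == after;
}
```

fetch_survives_switch(TRAMPOLINE) is true; for user text at 0x0 it is
false, which is why the switch can't be done from ordinary code.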

(gdb) print $pc
qemu: info mem

with the kernel page table we can now use kernel functions and data

the jr t0 is a jump to usertrap() (using t0 retrieved from trapframe)

(gdb) print/x $t0
(gdb) x/4i $t0
(gdb) stepi
(gdb) tui enable

we're now in usertrap() in kernel/trap.c
  various traps come here, e.g. errors, device interrupts, and system calls
  usertrap() looks in the scause register to see the trap cause
    see Figure 10.3 on page 102 of The RISC-V Reader
  scause = 8 is a system call
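
a sketch of decoding scause. the top bit distinguishes interrupts from
exceptions and the remaining bits hold the cause code. the helper names
are hypothetical, not xv6 functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCAUSE_INTERRUPT (UINT64_C(1) << 63)  /* set for interrupts */
#define EXC_ECALL_FROM_U 8                    /* environment call from user mode */

bool scause_is_interrupt(uint64_t scause)
{
  return (scause & SCAUSE_INTERRUPT) != 0;
}

/* The cause code with the interrupt bit stripped off. */
uint64_t scause_code(uint64_t scause)
{
  return scause & ~SCAUSE_INTERRUPT;
}

/* A system call is an exception (not an interrupt) with code 8. */
bool scause_is_syscall(uint64_t scause)
{
  return !scause_is_interrupt(scause) &&
         scause_code(scause) == EXC_ECALL_FROM_U;
}
```

other exception codes, e.g. 13 (load page fault) and 15 (store page
fault), take the same path into usertrap() but fail this test.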

(gdb) next ... until syscall()
(gdb) step
(gdb) next

now we're in syscall() kernel/syscall.c
myproc() uses tp to retrieve current struct proc *
p->xxx is usually a slot in the current process's struct proc

syscall() retrieves the system call number from saved register a7
  p->trapframe points to the trapframe, with saved registers
  p->trapframe->a7 holds 16, SYS_write
  p->trapframe->a0 holds write() first argument -- fd
  p->trapframe->a1 holds buf
  p->trapframe->a2 holds n

(gdb) next ...
(gdb) print num

then dispatches through syscalls[num], a table of functions

(gdb) next ...
(gdb) step

aha, we're in sys_write.
at this point system call implementations are fairly ordinary C code.
let's skip to the end, to see how a system call returns to user space.

(gdb) finish

notice that write() produced console output (the shell's $ prompt)
back to syscall()
the p->trapframe->a0 assignment causes (eventually) a0 to hold the return value
  the C calling convention on RISC-V puts return values in a0
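
the dispatch and return-value steps can be sketched outside the
kernel. this is a simplified stand-in, not xv6's syscall(): the
trapframe has only the slots used here, cur_tf plays the role of
myproc()->trapframe, and sys_write_stub() just pretends all n bytes
were written. all of those names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SYS_write 16

/* Hypothetical trapframe holding only the registers used below. */
struct trapframe { uint64_t a0, a1, a2, a7; };

static struct trapframe *cur_tf;  /* stand-in for myproc()->trapframe */

/* Stand-in for sys_write(): reports that all n bytes were written. */
static uint64_t sys_write_stub(void)
{
  return cur_tf->a2;
}

/* Table of handlers indexed by system call number. */
static uint64_t (*const syscalls[])(void) = {
  [SYS_write] = sys_write_stub,
};

#define NSYSCALLS (sizeof(syscalls) / sizeof(syscalls[0]))

void dispatch_syscall(struct trapframe *tf)
{
  uint64_t num = tf->a7;             /* number chosen by user code */
  cur_tf = tf;
  if (num > 0 && num < NSYSCALLS && syscalls[num])
    tf->a0 = syscalls[num]();        /* return value goes in saved a0 */
  else
    tf->a0 = (uint64_t)-1;           /* unknown system call */
}
```

the result is written to the saved a0, not the live register; it only
reaches the real a0 when the trapframe is restored on the way out.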

(gdb) next

back to usertrap()

(gdb) print p->trapframe->a0

write() returned 2 -- two characters -- $ and space

(gdb) next
(gdb) step

now we're in usertrapret(), which starts the process of returning
  to the user program

we need to prepare for the next user->kernel transition
  stvec = uservec (the trampoline), for the next ecall
  trapframe satp = kernel page table, for next uservec
  trapframe sp = top of kernel stack
  trapframe trap = usertrap
  trapframe hartid = hartid (in tp)

at the end, we'll use the RISC-V sret instruction
  we need to prepare a few registers that sret uses
  sstatus -- set the "previous mode" bit to user
  sepc -- the saved user program counter (from trap entry)
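
a sketch of what that preparation and sret itself do, over a toy CPU
state. the SSTATUS_* bit positions are those in xv6's kernel/riscv.h;
struct hart, prepare_sret() and sret() are hypothetical names and the
model is simplified.

```c
#include <assert.h>
#include <stdint.h>

#define SSTATUS_SPP  (UINT64_C(1) << 8)  /* previous mode: 1 = supervisor, 0 = user */
#define SSTATUS_SPIE (UINT64_C(1) << 5)  /* interrupt-enable value to restore */
#define SSTATUS_SIE  (UINT64_C(1) << 1)  /* supervisor interrupt enable */

enum priv { PRIV_USER, PRIV_SUPERVISOR };

/* Hypothetical model of the state that sret reads and writes. */
struct hart { enum priv mode; uint64_t pc, sepc, sstatus; };

/* Kernel side: arrange for sret to land in user mode at epc. */
void prepare_sret(struct hart *h, uint64_t epc)
{
  h->sstatus &= ~SSTATUS_SPP;  /* "previous mode" = user */
  h->sstatus |= SSTATUS_SPIE;  /* interrupts enabled once in user mode */
  h->sepc = epc;
}

/* Hardware side: simplified effect of the sret instruction. */
void sret(struct hart *h)
{
  assert(h->mode == PRIV_SUPERVISOR);
  h->mode = (h->sstatus & SSTATUS_SPP) ? PRIV_SUPERVISOR : PRIV_USER;
  if (h->sstatus & SSTATUS_SPIE)
    h->sstatus |= SSTATUS_SIE;   /* SIE takes the value of SPIE */
  else
    h->sstatus &= ~SSTATUS_SIE;
  h->sstatus |= SSTATUS_SPIE;
  h->sstatus &= ~SSTATUS_SPP;
  h->pc = h->sepc;
}
```

xv6's usertrap() adds 4 to the saved pc for a system call, so that
sret resumes after the ecall: 0xe18 + 4 = 0xe1c.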

we'll need to switch to the user page table
  can't do that in usertrapret(), since usertrapret()'s code isn't mapped in the user page table.
  need a page that's mapped in both user and kernel page table -- the trampoline.
  jump to userret in trampoline.S

(gdb) tui disable
(gdb) step
(gdb) x/8i 0x3ffffff09c

a0 holds user page table address
the csrw satp switches to the user address space

(gdb) stepi
(gdb) stepi
(qemu) info mem

now 32 loads from the trapframe into registers
  these restore the user registers
  let's skip over them

(gdb) b *0x3ffffff11a
(gdb) c

a0 is restored last, after which we can no longer get at TRAPFRAME

(gdb) print/x $a0 -- the return value from write()

now we're at the sret instruction

(gdb) print $pc
(gdb) stepi
(gdb) print $pc

now we're back in the user program ($pc = 0xe1c)
  returning 2 from the write() function

(gdb) print/x $a0

and we're done with a system call!

summary
  system call entry/exit is far more complex than function call
  much of the complexity is due to the requirement for isolation
    and the desire for simple and fast hardware mechanisms
  a few design questions to ponder:
    can an evil program abuse the entry mechanism?
    can you think of ways to make the hardware or software simpler?
    can you think of ways to make traps faster?