6.828 2017 Lecture 8: Isolation mechanisms

Today:
  user/kernel isolation
  xv6 system call as case study

* Multiple processes drive the key requirements:
  multiplexing
  isolation
  interaction/sharing

* Isolation is often the most constraining requirement.

* What is isolation?
  enforced separation to contain effects of failures
  the process is the usual unit of isolation
  prevent process X from wrecking or spying on process Y
    r/w memory, use 100% of CPU, change FDs, &c
  prevent a process from interfering with the operating system
  in the face of malice as well as bugs
    a bad process may try to trick the h/w or kernel

* the kernel uses hardware mechanisms as part of process isolation:
  user/kernel mode flag
  address spaces
  timeslicing
  system call interface

* the hardware user/kernel mode flag
  controls whether instructions can access privileged h/w
  called CPL on the x86, bottom two bits of %cs register
    CPL=0 -- kernel mode -- privileged
    CPL=3 -- user mode -- no privilege
  x86 CPL protects many processor registers relevant to isolation
    I/O port accesses
    control register accesses (eflags, %cr4, ...)
      including %cs itself
    affects memory access permissions, but indirectly
    the kernel must set all this up correctly
  every serious microprocessor has some kind of user/kernel flag

* how to do a system call -- switching CPL
  Q: would this be an OK design for user programs to make a system call:
    set CPL=0
    jmp sys_open
    bad: user-specified instructions with CPL=0
  Q: how about a combined instruction that sets CPL=0,
    but *requires* an immediate jump to someplace in the kernel?
    bad: user might jump somewhere awkward in the kernel
  the x86 answer:
    there are only a few permissible kernel entry points ("vectors")
    INT instruction sets CPL=0 and jumps to an entry point
    but user code can't otherwise modify CPL or jump anywhere else in kernel
  system call return sets CPL=3 before returning to user code
    also a combined instruction (can't separately set CPL and jmp)

* the result: well-defined notion of user vs kernel
  either CPL=3 and executing user code
  or CPL=0 and executing from entry point in kernel code
  not:
    CPL=0 and executing user code
    CPL=0 and executing anywhere in kernel the user pleases

* how to isolate process memory?
  idea: "address space"
  give each process some memory it can access
    for its code, variables, heap, stack
  prevent it from accessing other memory (kernel or other processes)

* how to create isolated address spaces?
  xv6 uses x86 "paging hardware" in the memory management unit (MMU)
  MMU translates (or "maps") every address issued by program
    CPU -> MMU -> RAM
            |
         pagetable
    VA -> PA
  MMU translates all memory references: user and kernel, instructions and data
    instructions use only VAs, never PAs
  kernel sets up a different page table for each process
    each process's page table allows access only to that process's RAM

### Let's look at how xv6 system calls are implemented

xv6 process/stack diagram:
  user process ; kernel thread
  user stack ; kernel stack
  two mechanisms:
    switch between user/kernel
    switch between kernel threads
  trap frame
  kernel function calls...
  struct context

* simplified xv6 user/kernel virtual address-space setup
  FFFFFFFF: ...
  80000000: kernel
            user stack
            user data
  00000000: user instructions
  kernel configures MMU to give user code access only to lower half
  separate address space for each process
    but kernel (high) mappings are the same for every process

system call starting point:
  executing in user space, sh writing its prompt
  sh.asm, write() library function
  break *0xd42
  x/3i
    0x10 in eax is the system call number for write
  info reg
    cs=0x1b, B=1011 -- CPL=3 => user mode
    esp and eip are low addresses -- user virtual addresses
  x/4x $esp
    ebf is return address -- in printf
    2 is fd
    0x3f7a is buffer on the stack
    1 is count
    i.e. write(2, 0x3f7a, 1)
  x/c 0x3f7a

INT instruction, kernel entry
  stepi
  info reg
    cs=0x8 -- CPL=0 => kernel mode
    note INT changed eip and esp to high kernel addresses
  where is eip?
    at a kernel-supplied vector -- only place user can go
    so user program can't jump to random places in kernel with CPL=0
  x/6wx $esp
    INT saved a few user registers
    err, eip, cs, eflags, esp, ss
  why did INT save just these registers?
    they are the ones that INT overwrites
  what INT did:
    switched to current process's kernel stack
    saved some user registers on kernel stack
    set CPL=0
    start executing at kernel-supplied "vector"
  where did esp come from?
    kernel told h/w what kernel stack to use when creating process

Q: why does INT bother saving the user state?
   how much state should be saved?
   transparency vs speed

saving the rest of the user registers on the kernel stack
  trapasm.S alltraps
  pushal pushes 8 registers: eax .. edi
  x/19x $esp
  19 words at top of kernel stack:
    ss
    esp
    eflags
    cs
    eip
    err    -- INT saved from here up
    trapno
    ds
    es
    fs
    gs
    eax..edi
  will eventually be restored, when system call returns
  meanwhile the kernel C code sometimes needs to read/write saved values
  struct trapframe in x86.h

Q: why are user registers saved on the kernel stack?
   why not save them on the user stack?
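
The 19-word kernel-stack layout and the CPL check above can be sketched in C. This is only an illustration: struct trapframe_sketch, cpl_of() and from_user_mode() are hypothetical names, and the struct is a simplified stand-in for the real struct trapframe in x86.h (each segment register is widened to a full word here). Fields are listed from lowest address to highest, i.e. in the reverse of push order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for xv6's struct trapframe (an assumption, not a copy). */
struct trapframe_sketch {
    /* pushed by pushal; the last register pushed (edi) is at the lowest address.
       oesp is the esp value pushal stores; it is ignored on restore. */
    uint32_t edi, esi, ebp, oesp, ebx, edx, ecx, eax;
    /* segment registers pushed by alltraps, one word each in this sketch */
    uint32_t gs, fs, es, ds;
    /* trap number pushed by the per-vector stub */
    uint32_t trapno;
    /* saved by the INT instruction itself, from here up */
    uint32_t err, eip, cs, eflags, esp, ss;
};

/* 8 + 4 + 1 + 6 = 19 words, matching x/19x $esp */
_Static_assert(sizeof(struct trapframe_sketch) == 19 * 4, "19 words");
_Static_assert(offsetof(struct trapframe_sketch, err) == 13 * 4,
               "INT-saved words start after 13 software-saved words");

/* CPL is the bottom two bits of the %cs selector. */
static inline unsigned cpl_of(uint32_t cs) { return cs & 3u; }

/* True if the saved %cs says the trap came from user mode (CPL=3). */
static inline int from_user_mode(const struct trapframe_sketch *tf) {
    return cpl_of(tf->cs) == 3u;
}
```

With the values from the gdb session, cpl_of(0x1b) is 3 (user) and cpl_of(0x8) is 0 (kernel).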
entering kernel C code
  the pushl %esp creates an argument for trap(struct trapframe *tf)
  now we're in trap() in trap.c
  print tf
  print *tf

kernel system call handling
  device interrupts and faults also enter trap()
  trapno == T_SYSCALL
  myproc()
  struct proc in proc.h
  myproc()->tf -- so syscall() can get at call # and arguments
  syscall() in syscall.c
    looks at tf->eax to find out which system call
  SYS_write in syscalls[] maps to sys_write
  sys_write() in sysfile.c
  arg*() read write(fd,buf,n) arguments from the user stack
  argint() in syscall.c
    proc->tf->esp + xxx

restoring user registers
  syscall() sets tf->eax to return value
  back to trap()
  finish -- returns to trapasm.S
  info reg -- still in kernel, registers overwritten by kernel code
  stepi to iret
  info reg
    most registers hold restored user values
    eax has write() return value of 1
    esp, eip, cs still have kernel values
  x/5x $esp
    saved user state: eip, cs, eflags, esp, ss
  IRET pops those user registers from the stack
    and thereby re-enters user space with CPL=3

Q: do we really need IRET?
   could we use ordinary instructions to restore the registers?
   could IRET be simpler?

back to user space
  stepi
  info reg

* Summary
  intricate design for User/Kernel transition
    how bad is a bug in this design?
  kernel must take adversarial view of user process
    doesn't trust user stack
    checks arguments
  page table confines what memory user program can read/write
  next lecture
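
The "doesn't trust user stack / checks arguments" point can be sketched as a user-space simulation. This is not xv6's code: struct proc_sketch, USER_SIZE, sys_write_sketch() and syscall_sketch() are hypothetical, user memory is modeled as a plain byte array, and the fetchint/argint/argptr helpers only imitate the role of xv6's arg*() functions. The idea shown is that every address taken from the user stack is range-checked before the kernel touches it, and the result goes back through the saved eax.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define USER_SIZE 0x4000u /* hypothetical size of the simulated user space */
#define SYS_WRITE 16      /* 0x10, the call number seen in eax */

struct proc_sketch {
    uint8_t  mem[USER_SIZE]; /* simulated user memory, addresses 0..USER_SIZE-1 */
    uint32_t esp;            /* saved user stack pointer (stands in for tf->esp) */
    uint32_t eax;            /* saved eax: call number in, return value out */
};

/* Read one word at user address addr; refuse anything outside user memory. */
static int fetchint(const struct proc_sketch *p, uint32_t addr, int32_t *out) {
    if (addr > USER_SIZE - 4u)
        return -1;
    memcpy(out, &p->mem[addr], 4);
    return 0;
}

/* n-th word-sized argument: just above the return address on the user stack. */
static int argint(const struct proc_sketch *p, int n, int32_t *out) {
    return fetchint(p, p->esp + 4u + 4u * (uint32_t)n, out);
}

/* n-th argument as a pointer to len bytes that must lie inside user memory. */
static int argptr(const struct proc_sketch *p, int n, uint32_t *addr, uint32_t len) {
    int32_t v;
    if (argint(p, n, &v) < 0)
        return -1;
    uint32_t a = (uint32_t)v;
    if (a >= USER_SIZE || len > USER_SIZE - a)
        return -1;
    *addr = a;
    return 0;
}

/* write(fd, buf, n): validates arguments, then pretends n bytes were written.
   File-descriptor lookup is omitted in this sketch. */
static int32_t sys_write_sketch(const struct proc_sketch *p) {
    int32_t fd, n;
    uint32_t buf;
    if (argint(p, 0, &fd) < 0 || argint(p, 2, &n) < 0 || n < 0)
        return -1;
    if (argptr(p, 1, &buf, (uint32_t)n) < 0)
        return -1;
    (void)fd;
    (void)buf;
    return n;
}

/* Dispatch on the saved eax and store the return value back into it. */
static void syscall_sketch(struct proc_sketch *p) {
    if (p->eax == SYS_WRITE)
        p->eax = (uint32_t)sys_write_sketch(p);
    else
        p->eax = (uint32_t)-1; /* unknown system call */
}
```

Laying out the stack as in the gdb session (return address, 2, 0x3f7a, 1) makes syscall_sketch() leave 1 in eax; replacing the buffer with a kernel address such as 0x80000000 makes it leave -1.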