6.1810 2023 Lecture 3: OS design

Lecture Topic:
  OS design -- high level
  Starting up xv6
  Many details deferred to later lectures

OS picture
  apps: sh, echo, ...
  system call interface (open, read, fork, ...)
  kernel
  CPU, RAM, disk
  "OS" versus "kernel"
  Isolation a big reason for separate protected kernel

Strawman design: No OS
  [sh, echo | CPU, RAM, disk]
  Applications directly interact with hardware
    efficient! flexible!
    and sometimes a good idea.

Main problem with No OS: lack of isolation
  Resource isolation:
    One app uses too much memory, or hogs the CPU, or uses all the disk space
  Memory isolation:
    One app's bug writes into another's memory

Unix system call interface limits apps in a way that helps isolation
  often by abstracting hardware resources
  fork() and processes abstract cores
    OS transparently switches cores among processes
    Enforces that processes give them up
    Can have more processes than cores
  exec()/sbrk() and virtual addresses abstract RAM
    Each process has its "own" memory -- an address space
    OS decides where to place app in memory
    OS confines a process to using its own memory
  files abstract disk-level blocks
    OS ensures that different uses don't conflict
    OS enforces permissions
  pipes abstract memory sharing
  System call interface carefully thought out to provide isolation
    But still allow controlled sharing, and portability

Isolation is about security as well as bugs
  What do OS designers assume about security?
  We assume user code is actively malicious
    Actively trying to break out of isolation
    Actively trying to trick system calls into doing something stupid
    Actively trying to interfere with other programs
  We assume kernel code is trustworthy
    We assume kernel developers are well-meaning and competent
    We're not too worried about kernel code abusing other kernel code.
  Of course, there are nevertheless bugs in kernels
    So the kernel must treat all user interaction carefully
    => Requires a security mindset
    Any bug in the kernel may be a security exploit

How can a kernel defend itself against user code?
  two big components:
    hardware-level controls on user instructions
    careful system call interface and implementation

hardware-level isolation
  CPUs and kernels are co-designed:
    - user/supervisor mode
    - virtual memory

user/supervisor mode (supervisor mode is also called kernel mode)
  supervisor mode: can execute "privileged" instructions
    e.g., device h/w access
    e.g., modify page tables
  user mode: cannot execute privileged instructions
  Kernel runs in supervisor mode, applications in user mode
  [RISC-V also has an M mode, which we mostly ignore]

Processors provide virtual memory
  page table maps VA -> PA
  Limits what memory a user process can use
  OS sets up page tables so that each application can access only its own memory
    And cannot get at the kernel's memory
  Page tables can only be changed in supervisor mode
  We'll spend a lot of time looking at virtual memory...

The RISC-V Instruction Set Manual Volume II: Privileged Architecture
  supervisor-only instructions, registers -- p. 11
  page tables

How do system calls work?
  Applications run in user mode
  System calls must execute in the kernel, in supervisor mode
  Must somehow allow applications to get at privileged resources!
  Solution: an instruction that changes mode in a controlled way
    open():
      ecall
  ecall does a few things
    change to supervisor mode
    start executing at a known point in kernel code
    kernel is expecting to receive control at that point in its code
  a bit involved, will discuss in a later lecture
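  The kernel-side landing point is worth a quick preview. Once the trap
  machinery (later lecture) has saved the user registers, xv6 dispatches
  on the system call number the user left in register a7. Below is a
  lightly condensed sketch of syscall() from kernel/syscall.c -- we reach
  the real one in the gdb walkthrough below; the table entries and an
  error printf are elided here (uint64 is xv6's typedef from
  kernel/types.h):

    #define NELEM(x) (sizeof(x)/sizeof((x)[0]))

    extern uint64 sys_exec(void);           // one kernel handler per call
    static uint64 (*syscalls[])(void) = {
      [SYS_exec] = sys_exec,                // SYS_exec is 7 (kernel/syscall.h)
      // ... entries for the other system calls ...
    };

    void
    syscall(void)
    {
      struct proc *p = myproc();
      int num = p->trapframe->a7;           // user code put the call number in a7
      if(num > 0 && num < NELEM(syscalls) && syscalls[num]){
        p->trapframe->a0 = syscalls[num](); // return value goes back in user a0
      } else {
        p->trapframe->a0 = -1;              // reject bad numbers: user code may
                                            // be buggy or malicious
      }
    }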
Aside: can one have process isolation WITHOUT h/w-supported
supervisor/user mode and virtual memory?
  yes! use a strongly-typed programming language
    For example, see the Singularity O/S
  but users can then use only approved languages/compilers
  still, h/w user/supervisor mode is the most popular plan

Monolithic kernel
  [diagram]
  kernel is a single big program implementing all system calls
    Xv6 does this. Linux etc. too.
  kernel interface == system call interface
  - good: easy for subsystems to cooperate
    one cache shared by file system and virtual memory
  - bad: interactions are complex
    leads to bugs
    no isolation within kernel for e.g. device drivers

Microkernel design
  [diagram]
  minimal kernel: IPC, memory, processes
    but *not* other system calls
  OS services run as ordinary user programs
    FS, net, device drivers
  so shell opens a file by sending a msg thru the kernel to the FS service
    (see the sketch below)
  kernel interface != system call interface
  - good: encourages modularity; limits damage from kernel bugs
  - bad: may be hard to get good performance
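  To make the message-passing concrete, here is a hypothetical sketch of
  a user-level open() on a microkernel. Everything in it is invented for
  illustration (FS_SERVER, FS_OPEN, ipc_send(), ipc_recv() are not from
  any real system); real microkernels differ in the details:

    #include <string.h>

    enum { FS_SERVER = 1 };        // well-known id of the FS server process
    enum { FS_OPEN = 100 };        // request type understood by the FS server

    struct msg {
      int op;                      // request: FS_OPEN; reply: fd or -1
      int flags;
      char path[64];
    };

    // the only kernel involvement: deliver messages between processes
    int ipc_send(int server, const void *m, int len);
    int ipc_recv(int server, void *m, int len);

    int
    open(const char *path, int flags)
    {
      struct msg m = { .op = FS_OPEN, .flags = flags };
      strncpy(m.path, path, sizeof(m.path) - 1);  // m.path starts zero-filled
      ipc_send(FS_SERVER, &m, sizeof(m));  // kernel forwards to the FS server
      ipc_recv(FS_SERVER, &m, sizeof(m));  // FS server's reply comes back
      return m.op;                         // server wrote the fd (or -1) here
    }

  Note that the file system runs with no more privilege than the shell;
  a bug in it can't directly corrupt the kernel.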
How common are kernel bugs?
  Common Vulnerabilities and Exposures web site
  https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux

Both monolithic and microkernel designs widely used
  O/S kernels are an active area of development
    phone, cloud, embedded, iot, &c
    lwn.net

Let's look at xv6 in particular

xv6 runs only on RISC-V CPUs and requires a specific setup of
surrounding devices -- the board
  modeled on the "SiFive HiFive Unleashed" board
    hifive.pdf
  A simple board (e.g., no display)
    - RISC-V processor with 4 cores
    - RAM (128 MB)
    - UART for console
    - disk-like storage
    - ethernet
  - boards like this are pretty cheap, though not powerful

Qemu emulates this CPU and a similar set of board devices
  - called "virt", as in "qemu -machine virt"
    https://github.com/riscv/riscv-qemu/wiki
  - close to the SiFive board (https://www.sifive.com/boards)
    but with virtio for disk

What's inside the RISC-V chip on this board?
  four cores, each with:
    32 registers
    ALU (add, mul, &c)
    MMU
    control registers
    timer, interrupt logic
    bus interface
  the cores are largely independent, e.g. each has its own registers
  they share RAM
  they share the board devices

xv6 kernel source
  % make clean
  % ls kernel
    e.g. file system in kernel/fs.c
  % vi kernel/defs.h -- shows modules, internal interfaces
  small enough for us to understand all by end of semester
  much smaller than linux, but captures some key ideas

building xv6
  % make
    gcc on each kernel/*.c, .o files, linker, kernel/kernel
  % ls -l kernel/kernel
  % more kernel/kernel.asm
  and produces a disk image containing the file system
  % ls -l fs.img

qemu
  % make qemu
  qemu loads the kernel binary into "memory", simulates a disk with fs.img,
    and jumps to the kernel's first instruction
  qemu maintains mock hardware registers and RAM, interprets instructions

I'll walk through xv6 booting up, to the first process making the first
system call

% make CPUS=1 qemu-gdb
% riscv64-unknown-elf-gdb
(gdb) b *0x80000000
(gdb) c
  kernel is loaded at 0x80000000 b/c that's where RAM starts
  lower addresses are device hardware

% vi kernel/entry.S
  the CPU starts in machine mode ("M mode")
  set up stack for C function calls
  jump to start, which is C code

% vi kernel/start.c
  sets up hardware for interrupts &c
  changes to supervisor mode
  jumps to main

(gdb) b main
(gdb) c
(gdb) tui enable

main()
  core 0 sets up a lot of software / hardware
  other cores wait
  "next" through first kernel printfs

let's glance at an example of initialization -- the kernel memory allocator
(gdb) step -- into kinit()
(gdb) step -- into freerange()
(gdb) step -- into kfree()
% vi kernel/kalloc.c
  kinit/freerange find all pages of physical RAM
  make a list of them, threaded through the first 64 bits (8 bytes)
    of each page
  [diagram]
  struct run
  the cast in kfree() and the list insert
  a simple allocator, only 4096-byte units, for e.g. user memory

how to get processes going?
  our goal is to get the first C user-level program running
    called init (see user/init.c)
    init starts up everything else (just console sh on xv6)
  need:
    struct proc
    user memory
    instruction bytes in user memory
    user registers, at least sp and epc
  main() does this by calling userinit()

(gdb) b userinit
(gdb) continue
% vi kernel/proc.c
  allocproc()
    struct proc
    p->pagetable
  back to userinit()
% vi user/initcode.S
  exec("/init", ...)
  ecall
  a7, SYS_exec
% vi kernel/syscall.h
  note SYS_exec is number 7
back to userinit()
  epc -- where the process will start in *user* space
  and sp
  p->state = RUNNABLE

(gdb) b *0x0
(gdb) c
(gdb) tui disable
(gdb) x/10i 0

what's the effect of ecall?
(gdb) b syscall
(gdb) c
  back in the kernel
(gdb) tui enable
(gdb) n
(gdb) n
(gdb) n
(gdb) print num
  from saved user register a7
(gdb) print syscalls[7]
(gdb) b exec
(gdb) c

% vi kernel/exec.c
  a complex system call
  read file from disk
    "ELF" format
    text, data
  defensive, lots of checks
    don't be tricked into overwriting kernel memory!
  allocate stack
  write arguments onto stack
  epc =
  sp =
(gdb) c

% vi user/init.c
  top-level process
  console file descriptors, 0 and 1
  sh

Next lecture: virtual memory and page tables
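For reference, the init program in condensed form (abridged from
user/init.c; error printfs and a couple of checks are omitted, and minor
details may differ across xv6 versions):

  #include "kernel/types.h"
  #include "kernel/stat.h"
  #include "kernel/spinlock.h"
  #include "kernel/sleeplock.h"
  #include "kernel/fs.h"
  #include "kernel/file.h"            // defines CONSOLE
  #include "user/user.h"
  #include "kernel/fcntl.h"

  char *argv[] = { "sh", 0 };

  int
  main(void)
  {
    int pid, wpid;

    if(open("console", O_RDWR) < 0){
      mknod("console", CONSOLE, 0);   // create the console device node
      open("console", O_RDWR);        // fd 0
    }
    dup(0);                           // fd 1 (stdout)
    dup(0);                           // fd 2 (stderr)

    for(;;){
      printf("init: starting sh\n");
      pid = fork();
      if(pid < 0)
        exit(1);
      if(pid == 0){
        exec("sh", argv);             // child becomes the shell
        exit(1);                      // reached only if exec fails
      }
      // reap orphans re-parented to init; restart sh when it exits
      while((wpid = wait((int *) 0)) >= 0 && wpid != pid)
        ;
    }
  }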