OSTEP Note: Ch2-10, Intro & CPU Virtualization

These are my reading notes for Operating Systems: Three Easy Pieces; the full book is available at:
http://pages.cs.wisc.edu/~remzi/OSTEP/

This post covers chapters 2 through 10, which explain how the OS runs multiple processes on a single CPU. The book approaches the problem from two angles:

  • Mechanism: Limited Direct Execution
    The basic machinery behind CPU virtualization: how to share the CPU while the OS keeps absolute control (so that every process gets a fair share of resources and no process can read another process's private data).
  • Policy: the various scheduling algorithms
    How the OS decides who gets the CPU: starting from the idealized STCF (shortest time-to-completion first) algorithm, then the MLFQ proposed by Turing Award winner Corbato, the Linux CFS, and scheduling on multiprocessors.

The notes follow:

Intro

What is OS

  • Virtualization: physical resources -> virtual resources
    • CPU/ Memory
  • APIs: System call, allow users to use virtual machine
  • Resource Manager

Question

  • Virtualization: How does OS attain virtualization?
    • policy, efficiency, hardware support
  • Concurrency: How to write program?
    • OS mechanism/ hardware support
  • Persistence: Data is easily lost; the OS provides the file system
    • mechanism/ performance/ reliability

CPU Virtualization

The Abstraction: Process

RoadMap

  • process is a running program
  • Many processes share the CPU
    • Mechanism(low-level): time-sharing, with context switch
    • Policy(high-level): scheduling algorithm

Process in detail

a running program, summarized by the pieces of the system it accesses, such as:

  • Memory -> address space
  • Registers: Program counter, stack pointer, frame pointer
  • I/O information: a list of files the process opened

Process API

create, destroy, wait, miscellaneous control, status (how long it has run for, what state it is in)

  1. Creation

    Step 1. Load code and static data from disk into memory

    Modern OSes load lazily (only when the data is needed)

    Step 2. Allocate run-time stack, initialize args(argc, argv)

    Step 3. Allocate heap

    Step 4. Initialize file descriptors (stdin, stdout, stderr)

    Step 5. Jump to main(), context switch to new process

  2. Process States

    Running, Ready, Blocked(waiting for I/O)

Example: Process creation in UNIX (Ch.5)

  1. fork() create new process

    The child process is an almost exact copy of its parent: it gets its own address space (a copy of the parent's), and it inherits the files the parent opened; essentially everything else is the same!

  2. wait()

    Returns the pid of a child that has exited (blocking until one does); if the calling process has no children, the return value is -1.

  3. exec() execute another program

    overwrite the code segment of the current process with the target executable

    -> a successful exec() never returns

    Variants - p: search $PATH for the executable; e: pass an environment explicitly; l: pass args as a list of parameters (vs. v: args in an array)

    TIP: `man 2 pipe` shows the system-call (section 2) man page!

Q: Why separate fork() and exec()? Why not a single system call that creates a new process running another program?

A: It lets the shell modify the child's environment after fork() and before exec(), e.g. for I/O redirection.

  4. Process Control and Users

    Process control is achieved by sending a signal (e.g. via kill()) to a process or a process group.

Process Control Block

  • Process list - PCB(Process Control Block)
    • ready processes, blocked processes
  • About each process
    • Context (physical registers)
    • Process state(initial(Embryo), ready, running, blocked, final(zombie, exit but not cleaned up by parent))
    • PID, parent process, opened files
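A toy PCB as a Python dataclass; the field names are illustrative (loosely modeled on xv6's `struct proc`), not taken from any real kernel:

```python
from dataclasses import dataclass, field

@dataclass
class PCB:
    """Per-process bookkeeping the OS keeps on the process list."""
    pid: int
    parent: "PCB | None"
    state: str = "EMBRYO"        # EMBRYO -> READY -> RUNNING -> BLOCKED / ZOMBIE
    context: dict = field(default_factory=dict)    # saved registers (PC, SP, ...)
    open_files: list = field(default_factory=list)

init = PCB(pid=1, parent=None)
child = PCB(pid=2, parent=init)
child.state = "READY"
print(child.pid, child.state, child.parent.pid)
```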

Mechanism: Limited Direct Execution

Problem:

  1. Performance: implement virtualization without adding excessive overhead.
  2. Control: the OS must retain control and stop a process from executing unwanted code.

Problem 1: Restricted Operations

Process may need restricted resources: I/O devices, CPU, memory

System call: (in assembly) put the system-call number and arguments into designated registers, then execute the trap instruction.

  1. Executing the trap instruction raises the privilege level and saves the process state (PC, flags, …) on the kernel stack.

  2. Trap table: holds the trap handlers for hardware interrupts, keyboard interrupts, and system calls.

    Each system call is assigned a system-call number.

    Q1. Is kernel code executed in the kernel address space? Is it another process?

A: Yes, it runs in the kernel address space, but it is not a separate process: the kernel code runs in the context of the process that made the system call.

Q2. What if you could install your own trap table? Could you take over the machine?

A: Essentially yes: installing your own handlers would let any trap jump to arbitrary code running at kernel privilege.

Q3. What is kernel stack and kernel register?

A: During a trap the user stack can't be used, because a malicious process could have corrupted it. The kernel stack lives in the kernel address space, is used only by kernel code, and cannot be accessed by the user, so it is safe.

Problem2: Switching Between Processes

How does OS regain control?

Solution 1: Wait for system call.

Solution 2: Timer interrupt. (Similar to system call, but called from time to time)

What about an interrupt arriving while a handler is running? The OS may disable interrupts for the duration of the handler.

lmbench can be used to measure the cost of a context switch.


Timer interrupt: some registers are saved by the hardware itself, since the interrupt is raised by hardware rather than by an instruction.

Q1. What happens while interrupts are disabled: are interrupts lost?

A. Typically the hardware latches a pending interrupt and services it once interrupts are re-enabled, but the exact behavior depends on the hardware.
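A user-level analogy for the timer interrupt (not how the kernel does it, just an illustration): `setitimer` delivers SIGALRM periodically, asynchronously interrupting a busy loop, much as a hardware timer lets the OS regain control from a spinning process.

```python
import signal
import time

ticks = 0

def on_tick(signum, frame):
    """Plays the role of the OS's timer-interrupt handler."""
    global ticks
    ticks += 1

signal.signal(signal.SIGALRM, on_tick)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)   # fire every 10 ms

deadline = time.time() + 0.1
while time.time() < deadline:
    pass                                           # "user code" busy-looping

signal.setitimer(signal.ITIMER_REAL, 0, 0)         # disarm the timer
print(ticks > 0)
```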

Scheduling

Metrics

Performance:
T_{turnaround} = T_{completion} - T_{arrival}
T_{response} = T_{first\ run} - T_{arrival}
Fairness
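The two metrics above, directly in code, with a small worked example (the job lengths are made up):

```python
def turnaround(arrival, completion):
    """Time from arrival to completion."""
    return completion - arrival

def response(arrival, first_run):
    """Time from arrival until the job first runs."""
    return first_run - arrival

# Three equal jobs arriving at t=0, each running 10 time units under FIFO:
# completions at 10, 20, 30 -> average turnaround (10+20+30)/3 = 20.
avg = sum(turnaround(0, c) for c in (10, 20, 30)) / 3
print(avg)
```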

Basic Algorithm

FIFO/ FCFS

Advantage:

  1. Simple, easy to implement.
  2. Low overhead
  3. No starvation
  4. Optimal average turnaround time for same-sized jobs

Disadvantage:

  1. Convoy effect: a heavyweight job that arrives first makes everyone queue behind it, giving poor average turnaround for mixed-size jobs (i.e. the result is sensitive to arrival order).
  2. Not responsive to interactive tasks

Shortest job first(SJF)

Advantage:

  1. Optimal average turnaround time when all jobs arrive simultaneously.

Disadvantage:

  1. If a short task arrives slightly late, the convoy effect still occurs.
  2. Bad for response time
  3. Starve long jobs
  4. Needs estimate of execution time

Shortest time-to-completion first (STCF)

Preemptive

Disadvantage:

  1. Needs estimate of execution time

Round Robin

Advantages:

  1. Decrease response time
  2. No starvation

Disadvantage:

  1. Increases turnaround time (especially for simultaneous, equal-length jobs).
  2. Extra overhead.

Summary

  1. There is a tradeoff between turnaround time and response time; each basic algorithm is good at one or the other. Balancing the two requires a more advanced algorithm.

  2. When I/O is considered, the scheduler should overlap one job's I/O with another job's computation to hide the I/O time.

  3. Without an oracle that knows job lengths, SJF/STCF cannot be applied directly.
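Toy numbers for the tradeoff above: all jobs arrive at t=0, and only the run lengths matter (the lengths are made up; no real scheduler is consulted):

```python
def fifo_stats(lengths):
    """Average (turnaround, response) when jobs run to completion in list order."""
    t, turnarounds, responses = 0, [], []
    for length in lengths:
        responses.append(t)        # job first runs when the queue reaches it
        t += length
        turnarounds.append(t)      # arrival = 0, so completion == turnaround
    n = len(lengths)
    return sum(turnarounds) / n, sum(responses) / n

jobs = [100, 10, 10]               # a long job arrives first: convoy effect
print(fifo_stats(jobs))            # (110.0, 70.0)
print(fifo_stats(sorted(jobs)))    # SJF order: both metrics improve here
```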

The Multi-Level Feedback Queue(MLFQ)

Target: Short turnaround time + responsive + without oracle

Solution: Priority + Change priority based on history

Workload: short-running interactive jobs + long-running CPU-bound jobs(response time not important)

Rule 1. Higher priority runs first.

Rule 2. Jobs of equal priority run in Round Robin.

Change priority

  • Move down

    Rule 3. A new job starts at the highest priority.

    Rule 4a. If a job uses up an entire time slice, reduce its priority. -> CPU-bound

    Rule 4b. If a job gives up the CPU before its slice ends, keep its priority. -> interactive

    Intent: approximate SJF. A new job is assumed to be short; if it is actually long, it slowly moves down the queues. Interactive jobs, which give up the CPU before the slice expires, are not penalized.

    Problem:

  1. Starvation

  2. User can game the scheduler

  3. Program change its behavior (CPU bound->interactive)

  • Priority Boost

    Solves Problems 1 and 3.

    Rule 5. After some time period S, move all the jobs to the topmost queue.

  • Better Accounting

    Solves Problem 2.

    Rule 4. Once a job uses up its time allotment, reduce priority.
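A sketch of Rule 4's anti-gaming accounting plus Rule 5's boost. The constants and class structure are illustrative, not from any real scheduler:

```python
TOP, BOTTOM = 2, 0
ALLOTMENT = 10          # total CPU a job may use at one level before demotion

class Job:
    def __init__(self):
        self.priority = TOP          # Rule 3: start at the top priority
        self.used = 0                # CPU used at the current level

    def ran_for(self, t):
        self.used += t               # counted even if the job yields early,
        if self.used >= ALLOTMENT:   # so the scheduler can't be gamed (Rule 4)
            self.priority = max(BOTTOM, self.priority - 1)
            self.used = 0

    def boost(self):                 # Rule 5: periodic boost fixes starvation
        self.priority, self.used = TOP, 0

j = Job()
for _ in range(4):
    j.ran_for(9); j.ran_for(1)       # "sneaky" job yields just before each slice ends
print(j.priority)                    # demoted anyway, thanks to the allotment
j.boost()
print(j.priority)
```

Under the old Rule 4a/4b, yielding at 99% of each slice would keep the job at top priority forever; with the allotment it is demoted regardless.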

Tuning MLFQ and Other Issues

  1. #queues
  2. Time slice per queue (shorter time slice for higher priority)
  3. Priority boost time S

User advice is needed!

Proportional Share

Lottery Scheduling

Each job holds some tickets; its share of the tickets determines its share of CPU time. Example with 100 tickets:

A: tickets 0-74

B: tickets 75-99

Scheduling is just drawing a random number in 0-99 and running the job that holds the winning ticket.

  • User assign ticket currency
  • Ticket can be transferred between processes
  • Ticket inflation(get a boost in ticket) in cooperative environment

Tip: keep the job list sorted with the highest ticket counts first, so the search for the winner terminates sooner.
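A minimal lottery pick over (job, tickets) pairs, with the biggest ticket holder first as the tip above suggests; the ticket values are made up:

```python
import random

jobs = [("A", 75), ("B", 25)]           # 100 tickets total

def pick(jobs, total=100):
    """Draw a winning ticket and walk the list until we pass it."""
    winner = random.randrange(total)
    counter = 0
    for name, tickets in jobs:
        counter += tickets
        if winner < counter:
            return name

random.seed(0)
wins = sum(pick(jobs) == "A" for _ in range(10_000))
print(wins / 10_000)                    # should be close to A's share, 0.75
```

No per-job state is needed between decisions, which is what makes the lottery easier to implement than stride scheduling.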

Q1. How to assign ticket?

A. Let the user decide? This problem actually remains open.

Q2. Why not deterministic?

A. Compared with stride scheduling, lottery scheduling has no global state and is easier for implementation.

Stride Scheduling

Proportional share with a deterministic scheduler: each job gets a stride inversely proportional to its tickets and a pass counter; always run the job with the lowest pass, then advance its pass by its stride.
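A stride-scheduling sketch (the big constant and ticket counts are made-up example values):

```python
BIG = 10_000
tickets = {"A": 100, "B": 50, "C": 250}
stride = {j: BIG // t for j, t in tickets.items()}   # fewer tickets -> bigger stride
passes = {j: 0 for j in tickets}

schedule = []
for _ in range(7):
    j = min(passes, key=passes.get)     # deterministic, unlike the lottery
    schedule.append(j)
    passes[j] += stride[j]

print(schedule)                          # C (most tickets) runs most often
```

The cost of determinism is the global per-job pass state, which (for example) makes it awkward to add a new job mid-run.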

The Linux Completely Fair Scheduler (CFS)

fair-share scheduling, efficient, scalable

Idea: Count vruntime of each process. Pick the process with lowest vruntime to run. (RR if there is a tie)

Control Parameters

  1. sched_latency: total schedule latency, with the goal of giving each process at least one chance to run in that cycle

    time slice = sched_latency / (number of processes)

    min_granularity: a lower bound on the time slice, to guarantee it never gets too small

  2. Weighting (Niceness)

    adjust priority of the process

    nice varies from -20 to +19, can be mapped to weight
    time\_slice_k = \dfrac{weight_k}{\sum_{i=0}^{n-1} weight_i} \cdot sched\_latency
    vruntime_i = vruntime_i + \dfrac{weight_0}{weight_i} \cdot runtime_i
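Plugging numbers into the formulas above; the weights come from the kernel's nice-to-weight table (nice 0 -> 1024, nice 5 -> 335), and sched_latency = 48 ms is just an example value:

```python
sched_latency = 48.0                      # ms
weights = {"A": 1024, "B": 335}           # A: nice 0, B: nice 5
total = sum(weights.values())

# time_slice_k = weight_k / sum(weights) * sched_latency
time_slice = {k: sched_latency * w / total for k, w in weights.items()}

# vruntime after each job runs for its own slice, scaled by weight_0/weight_i
# (weight_0 is the nice-0 weight, 1024):
vruntime = {k: (1024 / w) * time_slice[k] for k, w in weights.items()}

print({k: round(v, 1) for k, v in time_slice.items()})   # A gets ~3x B's slice
print({k: round(v, 1) for k, v in vruntime.items()})     # equal vruntime: fair
```

The higher-priority job gets a longer real-time slice, yet both accumulate vruntime at the same effective rate, which is exactly what lets CFS keep picking the minimum-vruntime process.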

Red-Black Trees

  • Stores the runnable processes, with vruntime as the key

  • Makes finding the lowest-vruntime process and reinserting it efficient (O(log n))

Avoid Starvation

  • Set the vruntime of a newly woken process to the minimum vruntime in the tree, so a long sleep does not let it monopolize the CPU afterwards

Multiprocessor Scheduling(Advanced)

How can we extend the single-CPU scheduling ideas to work on multiple processors?

Cache Affinity

Keep the process on the same CPU if possible.

Single-Queue Multiprocessor Scheduling (SQMS)

All CPUs pull the next process from a single shared queue.

Advantage:

  1. Simple, reuse the knowledge from single processor

Disadvantage:

  1. Locking is needed on every queue access, introducing synchronization overhead
  2. No cache affinity (jobs bounce between CPUs)

Multi-Queue Multiprocessor Scheduling (MQMS)

Each CPU has its own queue, and each process sticks to one queue.

Advantage:

  1. cache affinity
  2. scalable

Disadvantage:

  1. Load imbalance causes problems -> jobs must be migrated between queues, called work stealing
    • How often to steal is itself a tuning decision.
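A toy version of the work-stealing check between per-CPU queues; the queue contents and threshold are made up:

```python
queues = {0: ["A", "B", "C", "D"], 1: []}

def steal_if_imbalanced(queues, me, peer, threshold=2):
    """If the peer's queue is longer by `threshold`, migrate one job to us."""
    if len(queues[peer]) - len(queues[me]) >= threshold:
        queues[me].append(queues[peer].pop())   # migration costs cache affinity
        return True
    return False

print(steal_if_imbalanced(queues, me=1, peer=0))   # CPU 1 steals a job
print(queues)
```

Checking too often reintroduces SQMS-style synchronization overhead; checking too rarely leaves CPUs idle, hence the tuning decision noted above.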

Linux Multiprocessor Scheduler

  • O(1)
    • MLFQ with multiple queue
  • CFS with multiple queue
  • BF Scheduler
    • Single queue
    • Earliest Eligible Virtual Deadline First