CSE444: Database Systems Internals notes3
整理自CSE 444 Database Internals, Spring 2019 的课程Lectures,课程地址:https://courses.cs.washington.edu/courses/cse444/19sp/
Lecture5-6 Indexing
Searching in a Heap File
Heap File Search Example
10,000 students
10 student records per page
Total number of pages: 1,000 pages
Find student whose sid is 80 – Must read on average 500 pages
Find all students older than 20 – Must read all 1,000 pages
Can we do better?
Sequential File
Sequential File Example
Total number of pages: 1,000 pages
Find student whose sid is 80 – Could do binary search, read log2(1,000) ≈ 10 pages
Find all students older than 20 – Must still read all 1,000 pages
Can we do even better?
Note: Sorted files are inefficient for inserts/deletes
Creating Indexes in SQL
Outline
Indexes
Index: data structure that organizes data records on disk to optimize selections on the search key fields for the index
An index contains a collection of data entries, and supports efficient retrieval of all data entries with a given search key value k
Indexes are also access methods!
- So they provide the same API as we have seen for Heap Files
- And efficiently support scans over tuples matching predicate on search key
Search key = can be any set of fields
- not the same as the primary key, nor a key
Index = collection of data entries
Data entry for key k can be:
- (k, RID)
- (k, list-of-RIDs)
- The actual record with key k: In this case, the index is also a special file organization,Called: “indexed file organization”
Page Format Approach 2
Different Types of Files
For the data inside base relations:
- Heap file (tuples stored without any order)
- Sequential file (tuples sorted on some attribute(s))
- Indexed file (tuples organized following an index)
Then we can have additional index files that store (key,rid) pairs
Index can also be a “covering index”
- Index contains (search key + other attributes, rid)
- Index suffices to answer some queries
Primary Index
Primary Index with Duplicate Keys
Secondary Indexes
Clustered vs. Unclustered Index
Clustered/Unclustered
Primary index = clustered by definition
Secondary indexes = usually unclustered
Index Classification Summary
Large Indexes
What if index does not fit in memory?
Would like to index the index itself
- Hash-based index
- Tree-based index
Hash-Based Index
Tree-Based Index
How many index levels do we need?
Can we create them automatically? Yes!
Can do something even more powerful!
B+ Trees
Search trees
Idea in B Trees
- Make 1 node = 1 page (= 1 block)
• Idea in B+ Trees
- Keep tree balanced in height – dynamic rather than static
- Make leaves into a linked list : facilitates range queries
B+ Trees Basics
Parameter d = the degree
Each node has d <= m <= 2d keys (except root)
Each node also has m+1 pointers
Each leaf has d <= m <= 2d keys:
B+ Trees Properties
For each node except the root, maintain 50% occupancy of keys
Insert and delete must rebalance to maintain constraints
Searching a B+ Tree
B+ Tree Example
B+ Tree Design
B+ Trees in Practice
Insertion in a B+ Tree
Deletion in a B+ Tree
Summary on B+ Trees