Indexing Overview

Definition

An index is a file of (keyVal, tupleID) pairs, e.g.

Types of Indexes

A 1d index is based on the value of a single attribute A.

Some possible properties of A:

May be used to sort data file (or may be sorted on some other field).
Values may be unique (or there may be multiple instances).

Taxonomy of index types, based on properties of index attribute:

Primary: index on unique field, data file may be sorted on A.
Clustering: index on non-unique field, data file sorted on A.
Secondary: file not sorted on A.

A given table may have indexes on several attributes.

Index Structures

Indexes may be structured in several ways:

Dense: every tuple is referenced by an index entry.
Sparse: only some tuples are referenced by index entries.
Single-level: tuples are accessed directly from index file.
Multi-level: may need to access several pages to reach tuple.

Selection with Primary Index

Note: data file might not be ordered, index files are always ordered.

One Queries

ix = # binary search index for entry with key K

if: #nothing found
    return NotFound

b = getPage(pageOf(ix.tid))
t = getTuple(b, offsetOf(tx.tid)) # may require reading overflow pages

return t

Worst case: read log_2(i) index pages + read 1 + Ov data pages.

Thus, Cost_one,prim = log_2(i) + 1 + Ov.

Range Queries

For range queries on primary key:

Use index search to find lower bound.
Read index sequentially until reach upper bound.
Accumulate set of buckets to be examined.
Examine each bucket in turn to check for matches.

For queries not involving primary key, index gives no help.

Algorithm when data file is not sorted:

# E.g. select * from R where a between low and high

pages = {}
results = {}
ixPage = findIndexPage(R.ixf, low)

while ixTuple := getNextIndexTuple(R.ixf):
    if ixTup.key > high:
        break
    pages = pages.union(pageOf(ix.Tup.tid))

for pid in pages:
    while buf = getPage(R.datf, pid):
        for t in buf:
            if low <= t.a and t.a <= high:
                results = results.add(t)

return results

Partial Match Retrieval (PMR) Queries

For PMR queries involving primary key:

Search as if performing one query.

For queries not involving primary key, index gives no help.

Insertion with Primary Index

Overview:

tid = # insert tuple into page P at position p
# find location for new index entry (k, tid) into index file

Problem: order of index entries must be maintained:

Need to avoid overflow pages in index.
So, reorganise index file by moving entries down.

Reorganisation requires, on average, read/write half of index file: Cost_insert,prim = (log_2(i))_r + i / 2 * (1_r + 1_w) + (1 + Ov)_r + (1 + δ)_w.

Deletion with Primary Index

Overview:

# find tuple using index
# mark tuple as deleted
# delete index entry for tuple

If we delete index entries by marking:

Cost_delete,prim = (log_2(i))_r + (1 + Ov)_R + 1_w + 1_w.

If we delete index entry by file reorganisation:

Cost_delete,prim = (log_2(i)) + (1 + Ov)_R + i / 2 * (1_r + 1_w) + 1_w.

Clustering Index

Data file sorted: can use one index entry for each unique key value.

Cost penalty: maintaining both index and data file as sorted.

Secondary Index

Data file not sorted: want one index entry for each key value.

Cost_pmr = (log_2(i)_ix1 + a_ix2 + b_q * (1 + Ov))

Multi-level Indexes

Secondary index used two index files to speed up search:

By keeping the initial index search relatively quick.
ix1 is small (depends on number of unique key values).
ix2 is large (depends on amount of repetition of keys).
Typically b_ix1 << b_ix2 << b.

Could improve further by:

Making ix1 sparse, since ix2 is guaranteed to be ordered.
In this case: b_ix1 = ceil(b_ix2 / c_i).
If ix1 becomes too large, add ix3 and make ix2 sparse.
If data file ordered on key, could make ix3 sparse.

Ultimately, reduce top-level of index hierarchy to one page.

Algorithm for one queries:

xpid = # top level index page
for level in range(1, d + 1):
    # read index entry xpid
    # search index page for Jth entry where index[J].key <= K <= index[J + 1].key
    if J == -1:
        return NotFound
    xpid = index[J].page

pid = xpid
# search page pid and its overflow pages

Cost_one,mli = (d + 1 + Ov)_r.

Note that d = ceil(log_ci(r)) and c_i is large because index entries are small.