Linear Hashing
Overview
Linear Hashing uses a systematic method of growing data file:
- Hash function adapts to changing address range (via
sp
andd
). - Systematic splitting controls length of overflow chains.
Advantage: does not require auxiliary storage for directory. Disadvantage: requires overflow pages (doesn’t split on full pages).
Method
File grows linearly (one page at a time, at regular intervals). Has phases of expansion; over each phase, b
doubles.
Selection with Linear Hashing
Power of 2 Case
If b
is a power of 2, the file behaves exactly like standard hashing. The d
bits of the hash are used to compute the page address.
# select * from R where k = val
h = hash(val)
P = bits(d, h) # lower order bits
for t in P
if t.k == val:
return t
for t in overflowPages(P):
if t.k == val:
return t
Average Cost_one = 1 + Ov
.
Non-Power of 2 Case
If b
is not a power of 2, treat different parts of the file different.
Parts A and C are treated as if part of a file of size 2^(d + 1)
. Part B is treated as if part of a file of size 2^d
. Part D does not yet exist (tuples in B may eventually move into it).
h = hash(val)
pid = bits(d, h)
# only change is to use more bits if page is before split pointer
if pid < sp:
pid - bits(d + 1, h)
for t in P
if t.k == val:
return t
for t in overflowPages(P):
if t.k == val:
return t
Insertion with Linear Hashing
pid = bits(d, hash(val))
if pid < sp:
pid = bits(d + 1, hash(val))
P = getPage(f, pid)
for page in bucket(P):
if space in page:
page.insert(newTuple)
break
if not insertion:
newOverflowPage = addNewOverflowPage(P)
newOverflowPage.insert(newTuple)
if needSplit:
# partition tuples from bucket sp into buckets sp and sp + 2^d
sp += 1
if isPowerOf2(sp):
d += 1
sp = 0
Splitting
Two approaches to triggering a split:
- Split every time a tuple is inserted into full page.
- Split when load factor reaches threshold (every
k
inserts).
Note: always split page sp
, even if not full or “current”.
Systematic splitting:
- Eventually reduces length of every overflow chain.
- Helps to maintain short average overflow chain length.
newsp = sp + 2 ** d
oldsp = sp
for t in P[oldsp] and overFlowPages(P[oldsp]):
p = bits(d + 1, hash(t.k))
if p == newp:
bucket[newsp].add(t)
else:
bucket[oldsp].add(t)
sp += 1
if isPowerOf2(sp):
d += 1
sp = 0
Insertion Cost
If no split required, same as standard hashing:
Cost_insert:
Best:1_r + 1_w
, Average:(1 + Ov)_r + 1w
, Worst:(1 + max(Ov))_r + 2_W
.
If split occurs, incurs Cost_insert
plus cost of splitting:
- Read page
sp
(plus all of its overflow pages). - Write page
sp
(and its new overflow pages). - Write page
sp + 2^d
(and its new overflow pages).
On average, Cost_split = (1 + Ov)_r + (2 + Ov)_w
.
Deletion with Linear Hashing
Deletion is similar to ordinary static hash file. BUt might wish to contract file when enough tuples removed.
Method:
- Remove last bucket in data file (contracts linearly).
- Merge tuples from bucket with its buddy page (using
d - 1
hash bits).