SIMC Indexing
Overview
In a SIMC indexing scheme:
- Tuple descriptor formed by overlaying attribute codewords.
- Each codeword is
m
-bits long and hask
-bits set to 1.
A tuple descriptor desc(t)
is:
- A bit-string,
m
-bits long, wherej <= nk
bits are set to 1. desc(t) = cw(A_1) OR cw(A_2) OR .. OR cw(A_n)
.
def generateSignature(attributes: List[Attribute], m: int, k: int) -> bits:
desc: bits = 0
numAttributes: int = len(attributes)
for i in range(numAttributes):
cw: bits = codeword(attributes[i], m, k)
desc = desc | cw
Queries
To answer query q
in SIMC:
- Generate
desc(q)
by OR-ing codewords for known attributes.- Unknown attributes will not contribute to descriptor.
- Attempt to match
desc(q)
against all signatures in signature file.
def matches(sig: bits, qdesc) -> bool:
# Ensures all bits set in qdesc are set in sig.
# sig is allowed to have additional set bits.
# Possible because it was an unknown attribute in the query.
# Or false positive match.
return (sig & qdesc) == qdesc
def query(r: Rel, q: Query):
pagesToCheck = set()
for descriptor in r.signatureFile():
if matches(descriptor, desc(q)):
pid = pageOf(tupleID(descriptor))
pagesToCheck = pagesToCheck.union(pid)
# scan b_sq = b+q + δ pages to check for matches.
SIMC Parameters
Let p_F
denote the likelihood of a false match.
Ways to reduce false matches:
- Use different hash function for each attribute (
h_i
forA_i
). - Increase descriptor size (
m
).- Tradeoff: larger
m
means larger signature file ==> read more signature data.
- Tradeoff: larger
- Choose
k
so that ~=** half of bits are set**.- Tradeoff:
- High
k
==> increased overlapping. - Low
k
==> increased hash collisions.
- High
- Tradeoff:
Choosing Optimal m
and k
- Start by choosing acceptable
p_F
. - Follow these formulas:
k = 1/ln(2) * ln(1/p_F)
.m = (1/ln(2))^2 * n * ln(1/p_F)
.
Query Cost for SIMC
Cost to answer PMR query: Cost_pmr = b_d + b_sq
.
- Read
r
descriptors onb_D
descriptor pages. - Then read
b_sq
data pages for matches.
b_D = ceil(r / c_D)
where c_D = floor(B / ceil(m / 8))
.
- Recall:
r
is total number of tuples.c_D
is capacity of signatures per page.
b_sq
includes pages with r_q
matching tuples and r_F
false matches.
- Expected false matches =
r_F = (r - r_q) * p_F ~= r * p_F
ifr_q << r
. - Example:
- Worst:
b_sq = r_q + r_f
.- All matching tuples on different pages and all false matches on different pages.
- Best:
b_sq = 1
. - Average:
b_sq = ceil(b(r_q + r_f) / r)
.
- Worst:
Page-level SIMC
Alternative SIMC approach: one descriptor for each data page.
- Every attribute of every tuple in page contributes to descriptor.
- Potentially more efficient.
Size of page descriptor (PD):
- Use previous formulas for
m
andk
but multiply by number of tuples per page.
File organisation:
- Note: each signature now corresponds to a page of tuples.
PMR Query Algorithm
def query(r: Rel, q: Query) -> List[Tuple]:
pagesToCheck = set()
for i in range(len(r.signatureFile())):
descriptor = r.signatureFile()[i]
if matches(descriptor[i], desc(q)):
pid = i
pagesToCheck = pagesToCheck.union(pid)
# read and scan b_sq data pages
for pid in pagesToCheck:
# check matching tuples.
return # matching tuples
Bit-sliced SIMC
Improvement: store b
m
-bit page descriptors as m
b
-bit bit-slices.
PMR Query Algorithm
matches = ~0 // all ones
// scan m r-bit slices
for each bit i set to 1 in desc(q) {
slice = fetch bit-slice i
matches = matches & slice
}
for each bit i set to 1 in matches {
fetch page i
scan page for matching records
}
Effective because desc(q)
typically has less than half bits set to 1.
Comparison of Approaches
Tuple-based:
r
signatures,m
-bit signatures,k
bits per attribute.- Read all pages of signature file in filtering for a query.
Page-based:
b
signatures,m_p
-bit signatures,k
bits per attribute.- Read all pages of signature file in filtering for a query.
Bit-sliced:
m
-signatures,b
-bit slices,k
bits per attribute.- Read less than half of the signature file in filtering for a query.
All signature files are roughly the same size, for a given p_F
.