Multi-Dimensional Hashing
Hashing and PMR
For a PMR query like:
select * from R where a_1 = C_1 and ... a_n = C_n
- If one
a_i
is the hash key, query is very efficient. - If no
a_i
is the hash key, need to use linear search.
Can be alleviated using multi-attribute hashing (MAH).
- Form a composite hash value involving all attributes.
- At query time, some components of composite hash are known.
- Allows us to limit number of data pages needed to be checked.
- For unknown components, generate all possibilities for pages.
MAH works in conjunction with any dynamic hashing scheme.
Hashing Parameters
- File size:
b = 2^d
pages. So used
-bit hash values. - Relation has
n
attributes:a_1, a_2, ... , a_n
. - Attribute
a_i
has hash functionh_i
. - Attribute
a_i
contributesd_i
bits to the combined hash value. - Total bits:
d = sum(d_i for i=1..=n)
. - Choice vector specifies for all
k in 0..=d - 1
bitj
fromh_i(a_i)
contributes bitk
in combined hash value.
MA.Hashing Example
Consider relation:
Deposit(branch, acctNo, name, amount)
Assume 8 main data pages (plus overflows) with the hash parameters:
d = 3, d_1 = 1, d_2 = 1, d_3 = 1, d_4 = 0
.- Attribute 4 (
amount
) is ignored as we assume we never do equality conditions on this field.
- Attribute 4 (
Choice vector:
- Bit 0 in hash comes from bit 0 of
h_1(a_1)
. - Bit 1 in hash comes from bit 0 of
h_2(a_2)
. - Etc.
Consider the tuple:
branch | acctNo | name | amount |
---|---|---|---|
Downtown | 101 | Johnston | 512 |
Hash Function Implementation
HashVal = # unsigned int
@dataclass
class CVElem:
attr: int
bit: int
ChoiceVec = List[CVElem]
ith_bit = lambda i, val: # get ith bit from val
def hash(t: Tuple, cv: ChoiceVec, d: int) -> HashVal:
hashed_attributes: HashVal = # list of size nattr(t).
res: HashVal = 0
for i in range(1, nAttr(t) + 1):
hashed_attributes[i] = hash_any(attrVal(t, i))
for i in range(0, d):
a = cv[i].attr
b = cv[i].bit
res = res | (ith_bit(i=b, val=hashed_attributes[a]) << i)
return res
Queries with MAH
In a partial match query:
- Values of some attributes are known.
- Values of other attributes are unknown.
Consider query:
select amount from Deposit where name='Green'
Matching tuples must be in pages: 100, 101, 110 or 111
.
Query Algorithm
def partialHash(q: Query) -> Tuple[int, int]:
numStars = 0
for i in q.attributes:
if hasValue(q, i):
# set d[i] bits in composite hash using choice vector and hash(q, i).
else:
# set d[i] *'s in composite hash using choice vector
numStars += d[i]
return d, numStars
def findTuples(r: Relation, q: Query) -> Set[Tuple]:
results = set()
compositeHash, numStars = partialHash(q)
for i in range(0, 2 ** numStars):
filledCompositeHash = # replace *'s in compositeHash using i and choice vector.
buf = getPage(fileOf(r), filledCompositeHash)
for t in buf:
if satisfiesQuery(t, q):
results.add(t)
return results
Query Cost
Cost(Q) = 2^s
where s
is the number of stars (attributes not appearing in the query).
Query distribution gives probability p_Q
of asking each query type Q
.
Min query cost occurs when all attributes are used in query:
Cost_min,pmr = 1
.
Max query occurs when no attributes are specified:
Cost_max,pmr = 2^d = b
.
Average cost is given by weighted sum over all query types:
Optimising MA.Hashing Cost
For a given application, useful to minimise Cost_pmr
. Can be achieved by choosing appropriate values for d_i
(choice vector).
Heuristics:
- Distribution of query types (more bits to frequently used attributes).
- Size of attribute domain (<= bits to represent all values in domain).
- Discriminatory power (more bits to highly discriminating attributes).
Trade-off: making Q_j
more efficient makes Q_k
less efficient.
This is a combinatorial optimisation problem.