Hashing and PMR

For a PMR query like:

select * from R where a_1 = C_1 and ... a_n = C_n
  • If one a_i is the hash key, query is very efficient.
  • If no a_i is the hash key, need to use linear search.

Can be alleviated using multi-attribute hashing (MAH).

  • Form a composite hash value involving all attributes.
  • At query time, some components of composite hash are known.
    • Allows us to limit number of data pages needed to be checked.
    • For unknown components, generate all possibilities for pages.

MAH works in conjunction with any dynamic hashing scheme.

Hashing Parameters

  • File size: b = 2^d pages. So use d-bit hash values.
  • Relation has n attributes: a_1, a_2, ... , a_n.
  • Attribute a_i has hash function h_i.
  • Attribute a_i contributes d_i bits to the combined hash value.
  • Total bits: d = sum(d_i for i=1..=n).
  • Choice vector specifies for all k in 0..=d - 1bit j from h_i(a_i) contributes bit k in combined hash value.

MA.Hashing Example

Consider relation:

Deposit(branch, acctNo, name, amount)

Assume 8 main data pages (plus overflows) with the hash parameters:

  • d = 3, d_1 = 1, d_2 = 1, d_3 = 1, d_4 = 0.
    • Attribute 4 (amount) is ignored as we assume we never do equality conditions on this field.

Choice vector:

  • Bit 0 in hash comes from bit 0 of h_1(a_1).
  • Bit 1 in hash comes from bit 0 of h_2(a_2).
  • Etc.

Consider the tuple:

branch acctNo name amount
Downtown 101 Johnston 512

Hash Function Implementation

HashVal = # unsigned int

@dataclass
class CVElem:
    attr: int
    bit: int

ChoiceVec = List[CVElem]

ith_bit = lambda i, val: # get ith bit from val

def hash(t: Tuple, cv: ChoiceVec, d: int) -> HashVal:
    hashed_attributes: HashVal = # list of size nattr(t).
    res: HashVal = 0

    for i in range(1, nAttr(t) + 1):
        hashed_attributes[i] = hash_any(attrVal(t, i))

    for i in range(0, d):
        a = cv[i].attr
        b = cv[i].bit
        res = res | (ith_bit(i=b, val=hashed_attributes[a]) << i)

    return res

Queries with MAH

In a partial match query:

  • Values of some attributes are known.
  • Values of other attributes are unknown.

Consider query:

select amount from Deposit where name='Green'

Matching tuples must be in pages: 100, 101, 110 or 111.

Query Algorithm

def partialHash(q: Query) -> Tuple[int, int]:
    numStars = 0

    for i in q.attributes:
        if hasValue(q, i):
            # set d[i] bits in composite hash using choice vector and hash(q, i).
        else:
            # set d[i] *'s in composite hash using choice vector
            numStars += d[i]

    return d, numStars

def findTuples(r: Relation, q: Query) -> Set[Tuple]:
    results = set()
    compositeHash, numStars = partialHash(q)

    for i in range(0, 2 ** numStars):
        filledCompositeHash = # replace *'s in compositeHash using i and choice vector.
        buf = getPage(fileOf(r), filledCompositeHash)
        for t in buf:
            if satisfiesQuery(t, q):
                results.add(t)

    return results

Query Cost

Cost(Q) = 2^s where s is the number of stars (attributes not appearing in the query).

Query distribution gives probability p_Q of asking each query type Q.

Min query cost occurs when all attributes are used in query:

  • Cost_min,pmr = 1.

Max query occurs when no attributes are specified:

  • Cost_max,pmr = 2^d = b.

Average cost is given by weighted sum over all query types:

Optimising MA.Hashing Cost

For a given application, useful to minimise Cost_pmr. Can be achieved by choosing appropriate values for d_i (choice vector).

Heuristics:

  • Distribution of query types (more bits to frequently used attributes).
  • Size of attribute domain (<= bits to represent all values in domain).
  • Discriminatory power (more bits to highly discriminating attributes).

Trade-off: making Q_j more efficient makes Q_k less efficient.

This is a combinatorial optimisation problem.