Signature-based Indexing
Indexing with Signatures
Signature based indexing:
- Designed for PMR queries.
- Does not try to achieve better than O(n) performance.
- Attempts to provide efficient linear search.
Each tuple is associated with a signature:
- A compact lossy descriptor for the tuple.
- Formed by combining information from multiple attributes.
- Stored in a signature file, parallel to data file.
Pre-filtering is done via signatures.
File Organisation with Signatures

One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do NOT determine record placement.
- Can use with other indexing.
Signatures
Signatures summarises the data from a tuple.
A tuple consists of n attribute values A_1, ... A_n.
A codeword cw(A_i) is:
- A bit-string,
mbits long, wherekbits are set to 1 (k << m). - Derived from the value of a single attribute
A_i.
A tuple descriptor (signature) is built
- By combining
cw(A_i)fori = 1..=n. - Aim to have roughly half of the bits set to 1.
Generating Codewords
def codeword(attribute_value: bytes, m: int, k: int) -> bits:
numBits: int = 0
codeword: bits = 0
seed_random(hash(attr_value))
while numBits < k:
i = random() % m
if ((1 << i) & codeword) == 0:
codeword |= 1 << i
numBits += 1
return codeword # m-bits with k 1-bits and m-k 0-bits.
Superimposed Codewords (SIMC)
In a SIMC indexing scheme, tuple descriptors are formed by overlaying attribute codewords (bitwise-or).

A SIMC tuple descriptor desc(t) is:
mbit long bit string, wherej <= nkbits are set to 1.desc(t) = cw(A_1) OR cw(A_2) OR ... OR cw(A_n)
Concatenated Codewords (CATC)
In a CATC indexing scheme, tuple descriptors are formed by concatenating attribute codewords.

A CATC tuple descriptor is:
mbit long bit string, wherej = nkbits are set to 1.desc(t) = cw(A_1) + cw(A_2) + ... + cw(A_n).
Each codeword is p = m / n bits long, with k = p / 2 bits set to 1.
Queries Using Signatures
To answer query q with a signature based index:
- Generate a query descriptor
q. - Scan the signature file using the query descriptor.
- If
sig_imatchesdesc(q), then tupleimay be a match.
desc(q) is formed from codewords of known attributes. Effectively, any unknown attribute A_i has cw(A_i) = 0.
False Matches
Both SIMC and CATC can produce false matches:
matches(D[i], desc(q))is true, butTup[i]is not a solution forq.- Natural result of hash functions not being injective.
For SIMC, overlaying could also product unfortunate bit combinations.
- Need to choose good
mandkvalue.
SIMC vs CATC
Both build m-bit wide signatures with ~1/2 bits set to 1.
Both have codewords with ~m/(2n) bits set to 1.
CATC: codewords are m/n = p-bits wide.
- Shorter codewords ==> more hash collisions.
- Also has option of different length codeword
p_ifor eachA_iwithsum(p_i for i in nattributes) == m.
SIMC: codewords are also m-bits wide.
- Longer codewords ==> less hash collisions but also has overlay collisions.