2. Key Terms

Terms
Term	Definition
Entity	a record in a dataset, e.g. {"name": "John Adams", "addr": "28 chery st"}
Attributes	the fields that will be used for deduplication, e.g. ("name", "address")
comparison pairs	pairs of entity IDs that will be compared, e.g. given {28: "john", 24: "sarah"}, (24, 28) is a comparison pair
block scheme	a function used for blocking, e.g. first_two_characters(firstname)
signature	the output of a block scheme applied to a field, e.g. first_two_characters("john") = "jo"
forward index	a mapping from entity to signature; the mappings for several block schemes can be combined into a dataframe in which rows are entities, columns are block schemes, and values are signatures
inverted index	a mapping from signature to an array of entities that share the signature, e.g. given {28: "john", 30: "joe", 24: "sarah"}, the inverted index is {"jo": [28, 30], "sa": [24]}
block conjunction	a conjunction of block schemes, e.g. "first 2 characters of name" AND "exact match on postcode" AND "common acronym"
reduction ratio	the number of comparisons that blocking omits, divided by the total number of comparisons that would be made without blocking
coverage	“positive coverage” is the percentage of samples labeled “match” that are covered by the block conjunction; a sample is covered when applying the block conjunction yields comparison pairs that include it. “Negative coverage” is defined the same way for samples labeled “not a match”
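
The terms above can be tied together with a small sketch in Python. It reuses the toy entities from the table and is illustrative only: the helper names (build_forward_index, build_inverted_index, comparison_pairs, reduction_ratio, coverage) are hypothetical, the labeled pairs are made up for the example, and the sketch assumes deduplication within a single dataset, so that n entities give n(n-1)/2 possible comparisons.

```python
from itertools import combinations

# Toy dataset from the table above: entity ID -> name.
entities = {28: "john", 30: "joe", 24: "sarah"}


def first_two_characters(value):
    """Block scheme: map a field value to its signature."""
    return value[:2]


def build_forward_index(records, block_scheme):
    """Forward index: entity ID -> signature (hypothetical helper)."""
    return {eid: block_scheme(value) for eid, value in records.items()}


def build_inverted_index(forward_index):
    """Inverted index: signature -> sorted list of entity IDs sharing it."""
    inverted = {}
    for eid, signature in forward_index.items():
        inverted.setdefault(signature, []).append(eid)
    return {sig: sorted(ids) for sig, ids in inverted.items()}


def comparison_pairs(inverted_index):
    """Comparison pairs: every unordered pair of IDs within the same block."""
    pairs = set()
    for ids in inverted_index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def reduction_ratio(n_entities, n_pairs_after_blocking):
    """Share of the n*(n-1)/2 possible comparisons that blocking omits."""
    total = n_entities * (n_entities - 1) // 2
    return (total - n_pairs_after_blocking) / total


def coverage(labeled_pairs, candidate_pairs):
    """Fraction of labeled pairs found among the candidate pairs.

    Multiply by 100 for a percentage. Returns 0.0 for an empty label set.
    """
    normalized = {tuple(sorted(pair)) for pair in labeled_pairs}
    if not normalized:
        return 0.0
    return len(normalized & candidate_pairs) / len(normalized)


forward = build_forward_index(entities, first_two_characters)
inverted = build_inverted_index(forward)
pairs = comparison_pairs(inverted)
ratio = reduction_ratio(len(entities), len(pairs))

# Made-up labels, for illustration only.
positive_coverage = coverage([(30, 28)], pairs)  # pair labeled "match"
negative_coverage = coverage([(24, 28)], pairs)  # pair labeled "not a match"
```

With these three entities, blocking on the first two characters keeps only the pair (28, 30) out of three possible pairs, so two of the three comparisons are omitted. A good block conjunction aims for high positive coverage and a high reduction ratio at the same time.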