Blocker

block.blocking

class oagdedupe.block.blocking.Blocking(repo: oagdedupe.db.base.BaseRepositoryBlocking, conj: oagdedupe.block.base.BaseConjunctions = <class 'oagdedupe.block.learner.Conjunctions'>, forward: oagdedupe.block.base.BaseForward = <class 'oagdedupe.block.forward.Forward'>, pairs: oagdedupe.block.base.BasePairs = <class 'oagdedupe.block.pairs.Pairs'>, optimizer: typing.Optional[oagdedupe.block.base.BaseConjunctions] = None)

General interface for blocking:

  • forward: constructs forward indices

  • conjunctions: learns best conjunctions

  • pairs: generates pairs from inverted indices
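To illustrate the pairs step, here is a minimal sketch of generating candidate pairs from an inverted index. It is not oagdedupe's implementation: the function names, the scheme name first_3_chars, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def forward_to_inverted(forward, scheme):
    """Group entity ids by the signature they have under one block scheme."""
    inverted = defaultdict(list)
    for entity_id, signatures in forward.items():
        inverted[signatures[scheme]].append(entity_id)
    return dict(inverted)


def pairs_from_inverted(inverted):
    """Return every unordered pair of entities that share a signature."""
    pairs = set()
    for ids in inverted.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy forward index: entity id -> {block scheme: signature} (hypothetical data)
forward = {
    1: {"first_3_chars": "ann"},
    2: {"first_3_chars": "ann"},
    3: {"first_3_chars": "bob"},
    4: {"first_3_chars": "ann"},
}

inverted = forward_to_inverted(forward, "first_3_chars")
candidate_pairs = pairs_from_inverted(inverted)
```

Entity 3 is alone in its block, so it contributes no pairs; only entities that share a signature are compared.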

__init__(repo: oagdedupe.db.base.BaseRepositoryBlocking, conj: oagdedupe.block.base.BaseConjunctions = <class 'oagdedupe.block.learner.Conjunctions'>, forward: oagdedupe.block.base.BaseForward = <class 'oagdedupe.block.forward.Forward'>, pairs: oagdedupe.block.base.BasePairs = <class 'oagdedupe.block.pairs.Pairs'>, optimizer: typing.Optional[oagdedupe.block.base.BaseConjunctions] = None) -> None
__post_init__()
_check_rr(stats: StatsDict) -> bool

Checks whether a new block scheme falls below the minimum reduction ratio.
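As a rough illustration of such a check, the sketch below uses the common definition of reduction ratio: one minus the share of all possible pairs that a scheme still compares. This is an assumption, not the library's code. SchemeStats is a hypothetical stand-in for StatsDict, and n_records and min_rr are passed explicitly here, whereas the real method takes only stats.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SchemeStats:
    """Hypothetical stand-in for StatsDict; field names are assumptions."""
    scheme: Tuple[str, ...]
    n_pairs: int


def reduction_ratio(n_pairs: int, n_records: int) -> float:
    """1 - (pairs compared / all possible pairs among n_records)."""
    total = n_records * (n_records - 1) // 2
    if total == 0:
        # Fewer than two records: nothing to compare, so nothing to reduce.
        return 1.0
    return 1.0 - n_pairs / total


def below_min_rr(stats: SchemeStats, n_records: int, min_rr: float) -> bool:
    """True if the scheme prunes too few pairs to be worth using."""
    return reduction_ratio(stats.n_pairs, n_records) < min_rr


# 100 records give 4950 possible pairs.
loose = SchemeStats(scheme=("first_3_chars",), n_pairs=495)
tight = SchemeStats(scheme=("first_3_chars", "zip_code"), n_pairs=99)
```

With a minimum reduction ratio of 0.95, the loose scheme (ratio 0.9) would be rejected and the tight scheme (ratio 0.98) would be kept.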

conj

alias of Conjunctions

forward

alias of Forward

optimizer: BaseConjunctions = None
pairs

alias of Pairs

repo: BaseRepositoryBlocking
save(full: bool = False)

Saves comparison pairs using the conjunctions list.

If using the sample, builds all forward indices first; otherwise, builds each forward index as needed.

save_comparisons(table: str, n_covered: int) -> None

Iterates through the conjunctions from best to worst.

For each conjunction, appends comparisons to “comparisons” or “full_comparisons” (if using full data).

Stops if (a) a subsequent conjunction yields a reduction ratio below the minimum rr setting or (b) the number of comparison pairs gathered exceeds n_covered.

Parameters
  • table (str) – table used to get pairs (either blocks_train for sample or blocks_df for full df)

  • n_covered (int) – number of records that the conjunctions should cover
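The stopping logic described above can be sketched in plain Python. This is an in-memory approximation under stated assumptions, not the library's database-backed implementation: the function name gather_comparisons, the dict keys "rr" and "pairs", and the toy data are hypothetical.

```python
def gather_comparisons(conjunctions, min_rr, n_covered):
    """Collect pairs from conjunctions that are already sorted best first.

    Stops when (a) the next conjunction's reduction ratio is below min_rr,
    or (b) the number of gathered pairs exceeds n_covered.
    """
    gathered = set()
    for conj in conjunctions:
        if conj["rr"] < min_rr:
            break  # criterion (a): this conjunction prunes too little
        gathered |= conj["pairs"]  # set union also removes repeated pairs
        if len(gathered) > n_covered:
            break  # criterion (b): enough pairs collected
    return gathered


# Hypothetical conjunctions, sorted from best to worst.
conjunctions = [
    {"rr": 0.99, "pairs": {(1, 2), (3, 4)}},
    {"rr": 0.97, "pairs": {(1, 2), (5, 6)}},
    {"rr": 0.90, "pairs": {(7, 8)}},
]

stopped_by_rr = gather_comparisons(conjunctions, min_rr=0.95, n_covered=10)
stopped_by_count = gather_comparisons(conjunctions, min_rr=0.95, n_covered=1)
```

In the first call the third conjunction is skipped because its reduction ratio is too low; in the second, the loop ends after the first conjunction because two pairs already exceed n_covered.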

block.forward

This module contains objects used to construct blocks by creating a forward index.

class oagdedupe.block.forward.Forward(repo: BaseRepositoryBlocking, settings: Settings)

Used to build forward indices. A forward index is a table where rows are entities, columns are block schemes, and values contain signatures.
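As a small in-memory illustration of that layout (the library itself builds these tables through the repository), the sketch below maps each entity to one signature per block scheme. The scheme names, the signature functions, and the records are hypothetical.

```python
def build_forward_index(records, schemes):
    """Return {entity id: {scheme name: signature}} for every record."""
    return {
        entity_id: {name: fn(record) for name, fn in schemes.items()}
        for entity_id, record in records.items()
    }


# Hypothetical block schemes: each turns a record into a signature string.
schemes = {
    "first_3_chars_name": lambda r: r["name"][:3].lower(),
    "last_word_name": lambda r: r["name"].split()[-1].lower(),
}

# Hypothetical entities (rows of the forward index).
records = {
    1: {"name": "Anna Lee"},
    2: {"name": "Anne Li"},
    3: {"name": "Bob Ray"},
}

forward_index = build_forward_index(records, schemes)
```

Entities 1 and 2 share the signature "ann" under first_3_chars_name, so that scheme would place them in the same block, while last_word_name would not.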

settings

Type: Settings

repo

Type: BaseRepositoryBlocking

__init__(repo: BaseRepositoryBlocking, settings: Settings) -> None
build_forward_indices(rl: str = '', full: bool = False, conjunction: Optional[Tuple[str]] = None) -> None

Builds forward indices for the train or full dataset.

repo: BaseRepositoryBlocking
settings: Settings

block.learner

This module contains objects used to learn the best block scheme conjunctions and to use these to generate comparison pairs.

class oagdedupe.block.learner.Conjunctions(optimizer: BaseOptimizer, settings: Settings)

For each block scheme, gets the best block scheme conjunctions of lengths 1 to k using a greedy dynamic programming approach.
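One way to read this description is as a greedy extension: starting from a single scheme, each step reuses the best conjunction of the previous length and appends the one scheme that scores highest. The sketch below illustrates that idea only. It is an assumption about the approach, not the library's algorithm, and the scoring rule, scheme names, and data are hypothetical.

```python
def best_conjunctions(start, schemes, score, k):
    """Greedily grow conjunctions of lengths 1..k from one starting scheme."""
    current = (start,)
    results = [current]
    while len(current) < k:
        # Extend the best conjunction found so far by one unused scheme.
        candidates = [current + (s,) for s in schemes if s not in current]
        if not candidates:
            break
        current = max(candidates, key=score)
        results.append(current)
    return results


# Hypothetical candidate pairs produced by each single scheme.
pairs_by_scheme = {
    "a": {(1, 2), (1, 3), (2, 3), (4, 5)},
    "b": {(1, 2), (4, 5), (4, 6)},
    "c": {(1, 2), (2, 3)},
}
positives = {(1, 2), (4, 5)}  # hypothetical labelled duplicate pairs


def score(conjunction):
    """Prefer more positives covered, then fewer pairs to compare."""
    pairs = set.intersection(*(pairs_by_scheme[s] for s in conjunction))
    return (len(pairs & positives), -len(pairs))


from_a = best_conjunctions("a", list(pairs_by_scheme), score, k=2)
from_c = best_conjunctions("c", list(pairs_by_scheme), score, k=2)
```

A conjunction keeps only the pairs that all of its schemes agree on, so adding a scheme can only shrink the candidate set; the score decides which shrinkage loses the fewest true duplicates.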

optimizer

Type: BaseOptimizer

settings

Type: Settings

__init__(optimizer: BaseOptimizer, settings: Settings) -> None
property conjunctions_list: List[StatsDict]

Flattens, dedupes, and sorts the list of conjunctions.

Return type

List[StatsDict]
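A minimal sketch of a flatten, dedupe, and sort step follows. The details are assumptions rather than the property's actual behavior: conjunctions are plain dicts here, duplicates are detected by treating a scheme as an unordered set, and sorting is by reduction ratio, highest first.

```python
from itertools import chain


def flatten_dedupe_sort(nested):
    """Flatten per-scheme result lists, drop repeats, and sort best first."""
    unique = {}
    for stats in chain.from_iterable(nested):
        # Assumption: ("a", "b") and ("b", "a") are the same conjunction.
        key = frozenset(stats["scheme"])
        if key not in unique:
            unique[key] = stats
    # Assumption: "best" means the highest reduction ratio.
    return sorted(unique.values(), key=lambda s: s["rr"], reverse=True)


# Hypothetical results: one inner list per starting block scheme.
nested = [
    [{"scheme": ("a",), "rr": 0.90}, {"scheme": ("a", "b"), "rr": 0.99}],
    [{"scheme": ("b",), "rr": 0.95}, {"scheme": ("b", "a"), "rr": 0.99}],
]

conjunctions_list = flatten_dedupe_sort(nested)
```

The two orderings of the a-and-b conjunction collapse into one entry, leaving three conjunctions ordered from best to worst.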

optimizer: BaseOptimizer
settings: Settings

block.sql