Cluster

class oagdedupe.cluster.cluster.ConnectedComponents(repo: BaseRepository, settings: Settings)

Uses a graph to retrieve connected components.

__init__(repo: BaseRepository, settings: Settings) → None
get_connected_components(scores: DataFrame) → DataFrame

Builds a graph from the “matched” candidate pairs, with edges weighted by p(match).

TODO: the edge weights are not yet considered when generating connected components.

Parameters

scores (pd.DataFrame) – dataframe with pair indices and match scores

Returns

dataframe mapping cluster index to entity index

Return type

pd.DataFrame
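
As a rough illustration of the idea described above, the following sketch groups entity indices into clusters with a union-find structure. It is not the library's implementation: the tuple layout `(idx_a, idx_b, score)`, the function name `connected_components`, and the `threshold` used to decide which pairs count as “matched” are all assumptions made for this example, and plain tuples stand in for the dataframes. As in the method's own note, the scores are used only to filter pairs, not to weight the components.

```python
from collections import defaultdict


def connected_components(pairs, threshold=0.5):
    """Return (cluster, entity) rows from scored candidate pairs.

    pairs: iterable of (idx_a, idx_b, score) tuples (hypothetical layout).
    threshold: hypothetical cutoff; pairs scoring below it are ignored.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        if score >= threshold:  # keep only "matched" pairs as edges
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each component under its root.
    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)

    # Number clusters in a deterministic order and flatten to rows.
    ordered = sorted(sorted(members) for members in groups.values())
    return [(cluster, entity)
            for cluster, members in enumerate(ordered)
            for entity in members]


scores = [(1, 2, 0.9), (2, 3, 0.8), (4, 5, 0.95), (5, 6, 0.1)]
rows = connected_components(scores)
```

In this sketch, entity 6 does not appear in the result because its only candidate pair falls below the assumed threshold, so it never enters the graph.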

For record linkage:

Builds a graph from the “matched” candidate pairs, with edges weighted by p(match).

Keeps track of whether each index comes from the left or the right dataframe.

TODO: the edge weights are not yet considered when generating connected components.

Parameters

scores (pd.DataFrame) – dataframe with pair indices and match scores

Returns

dataframe mapping cluster index to entity index

Return type

pd.DataFrame
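
For the record linkage case, one way to keep track of which dataframe an index came from is to tag each graph node with its side, so that index 0 on the left and index 0 on the right remain distinct nodes. The sketch below illustrates that idea only. It is not the library's implementation: the `(left_idx, right_idx, score)` layout, the `"l"`/`"r"` tags, the function name `linked_components`, and the `threshold` are assumptions made for this example.

```python
from collections import defaultdict


def linked_components(pairs, threshold=0.5):
    """Return (cluster, side, index) rows for record linkage pairs.

    pairs: iterable of (left_idx, right_idx, score) tuples (hypothetical).
    threshold: hypothetical cutoff; pairs scoring below it are ignored.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left_idx, right_idx, score in pairs:
        if score < threshold:
            continue  # not a "matched" pair, so no edge
        # Tag each index with its side so left 0 and right 0 do not collide.
        left_node, right_node = ("l", left_idx), ("r", right_idx)
        root_l, root_r = find(left_node), find(right_node)
        if root_l != root_r:
            parent[root_r] = root_l

    # Collect the members of each component under its root.
    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)

    # Number clusters in a deterministic order and flatten to rows.
    ordered = sorted(sorted(members) for members in groups.values())
    return [(cluster, side, idx)
            for cluster, members in enumerate(ordered)
            for side, idx in members]


scores = [(0, 0, 0.9), (1, 0, 0.7), (2, 3, 0.2), (2, 2, 0.85)]
rows = linked_components(scores)
```

Here left records 0 and 1 both match right record 0, so the three land in one cluster, while left 2 and right 2 form a second cluster.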

get_df_cluster(**kwargs)
repo: BaseRepository
settings: Settings