4. Record Linkage

The example below links df to df2 on the attribute columns specified in settings. Both dataframes must contain these columns.
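
Since both dataframes need the same attribute columns, it can help to check for missing columns before training. The following is a minimal sketch: missing_attributes is a hypothetical helper that is not part of oagdedupe, and the column names are invented for illustration.

```python
import pandas as pd


def missing_attributes(df, df2, attributes):
    """Return, for each dataframe, the attributes absent from its columns."""
    return {
        "df": [a for a in attributes if a not in df.columns],
        "df2": [a for a in attributes if a not in df2.columns],
    }


# Illustrative data; these column names are assumptions, not the real schema.
left = pd.DataFrame({"givenname": ["ann"], "surname": ["lee"], "suburb": ["apex"]})
right = pd.DataFrame({"givenname": ["anne"], "surname": ["lee"]})

report = missing_attributes(left, right, ["givenname", "surname", "suburb"])
print(report)  # {'df': [], 'df2': ['suburb']}
```

In the example from this section, the call would presumably be missing_attributes(df, df2, settings.attributes), with both lists expected to be empty.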

4.1. train model

import glob
import pandas as pd
from oagdedupe.api import RecordLinkage

# `settings` is the project's settings object, defined beforehand;
# settings.attributes lists the columns used for linkage.

# left dataset: first file in the folder
files = glob.glob(
    "/mnt/Research.CF/References & Training/Satchel/dedupe_rl/baseline_datasets/north_carolina_voters/*"
)[:1]
df = pd.concat([pd.read_csv(f) for f in files]).reset_index(drop=True)
for attr in settings.attributes:
    df[attr] = df[attr].astype(str)

# right dataset: second file in the folder
files2 = glob.glob(
    "/mnt/Research.CF/References & Training/Satchel/dedupe_rl/baseline_datasets/north_carolina_voters/*"
)[1:2]
df2 = pd.concat([pd.read_csv(f) for f in files2]).reset_index(drop=True)
for attr in settings.attributes:
    df2[attr] = df2[attr].astype(str)

# work on a reproducible sample of each dataset
df = df.sample(100_000, random_state=1234)
df2 = df2.sample(100_000, random_state=1234)

d = RecordLinkage(settings=settings)
d.initialize(df=df, df2=df2, reset=True)

# pre-process the data, then store the pre-processed data, comparisons,
# and ID matrices in the SQLite db
d.fit_blocks()
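
To see what the blocking step wrote, one option is to list the tables in the SQLite file with Python's standard sqlite3 module. This is only a sketch: list_tables is a hypothetical helper, and the database location and table names used by oagdedupe are not stated here, so the demo builds a throwaway database with an invented comparisons table.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo on a temporary database; the table below is invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    with closing(sqlite3.connect(demo_path)) as conn:
        conn.execute("CREATE TABLE comparisons (left_id INTEGER, right_id INTEGER)")
        conn.commit()
    tables = list_tables(demo_path)

print(tables)  # ['comparisons']
```

Pointing list_tables at the database file inside your project folder should show the tables actually created.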

4.2. start FastAPI

Run

export DEDUPER_NAME="<project name>"
export DEDUPER_FOLDER="<project folder>"
python -m dedupe.fastapi.main

replacing <project name> and <project folder> with your project settings (for the example above, test and ./.dedupe).
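
The two variables have to be present in the environment of the Python process that runs the service. As an alternative to the shell, the sketch below builds that environment in Python and passes it to a child process. service_env and start_service are hypothetical helpers, and the module path is copied from the command above.

```python
import os
import subprocess
import sys


def service_env(name, folder, base_env=None):
    """Copy an environment and add the two variables the service expects."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEDUPER_NAME"] = name
    env["DEDUPER_FOLDER"] = folder
    return env


def start_service(name, folder):
    """Launch the service module as a child process (not called in this sketch)."""
    return subprocess.Popen(
        [sys.executable, "-m", "dedupe.fastapi.main"],
        env=service_env(name, folder),
    )


# Values from the example above; an empty base keeps the printed output short.
env = service_env("test", "./.dedupe", base_env={})
print(env)  # {'DEDUPER_NAME': 'test', 'DEDUPER_FOLDER': './.dedupe'}
```

Calling start_service("test", "./.dedupe") would then start the service with both variables set, inheriting the rest of the current environment.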

4.3. label-studio

Return to label-studio and start labelling. When the queue falls below 5 tasks, FastAPI updates the model with the labelled samples and then sends more tasks for review.
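
The refill behaviour can be pictured as a simple threshold check. The sketch below is a toy simulation of that idea, not the service's actual implementation: refill_if_low, retrain, sample_tasks, and the batch size of 10 are all assumptions made for illustration.

```python
from collections import deque

QUEUE_THRESHOLD = 5  # refill when fewer than this many tasks remain
BATCH_SIZE = 10      # hypothetical number of tasks sent per refill


def refill_if_low(queue, labelled, retrain, sample_tasks):
    """Retrain on the labels so far and top up the queue when it runs low."""
    if len(queue) >= QUEUE_THRESHOLD:
        return False
    model = retrain(labelled)
    queue.extend(sample_tasks(model, BATCH_SIZE))
    return True


# Stand-ins for the real model update and task sampling.
def retrain(labelled):
    return {"n_labels": len(labelled)}


def sample_tasks(model, n):
    return [f"new-{model['n_labels']}-{i}" for i in range(n)]


queue = deque(f"pair-{i}" for i in range(6))
labelled = []
refills = 0
for _ in range(4):
    # The reviewer labels one task, then the service checks the queue length.
    labelled.append((queue.popleft(), "match"))
    refills += refill_if_low(queue, labelled, retrain, sample_tasks)

print(refills, len(queue), len(labelled))  # 1 12 4
```

Here the queue drops to 4 tasks after the second label, which triggers one retrain and one refill of 10 tasks.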

4.4. predictions

To get predictions, call the predict() method.

d = RecordLinkage(settings=Settings(name="test", folder="./.dedupe"))
d.predict()

See run.py for the full working example.