4. Record Linkage
The example below links df to df2 on the attribute columns specified in settings; both dataframes must contain these columns.
4.1. train model
import glob
import pandas as pd
from oagdedupe.api import Dedupe, RecordLinkage

# `settings` (used below) is assumed to be the project's Settings object, created beforehand.

# first dataset: first file in the folder
files = glob.glob(
    "/mnt/Research.CF/References & Training/Satchel/dedupe_rl/baseline_datasets/north_carolina_voters/*"
)[:1]
df = pd.concat([pd.read_csv(f) for f in files]).reset_index(drop=True)
# cast the attribute columns to strings
for attr in settings.attributes:
    df[attr] = df[attr].astype(str)

# second dataset: second file in the folder
files2 = glob.glob(
    "/mnt/Research.CF/References & Training/Satchel/dedupe_rl/baseline_datasets/north_carolina_voters/*"
)[1:2]
df2 = pd.concat([pd.read_csv(f) for f in files2]).reset_index(drop=True)
for attr in settings.attributes:
    df2[attr] = df2[attr].astype(str)

# draw a reproducible sample of 100,000 rows from each dataset
df = df.sample(100_000, random_state=1234)
df2 = df2.sample(100_000, random_state=1234)

d = RecordLinkage(settings=settings)
d.initialize(df=df, df2=df2, reset=True)
# %%
# pre-processes data and stores pre-processed data, comparisons, ID matrices in SQLite db
d.fit_blocks()
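Since both dataframes must share the attribute columns, it can help to check this before calling initialize(). The following is a minimal sketch, not part of the library: the helper missing_attributes and the attribute and column names are hypothetical and used for illustration only.

```python
from typing import Iterable, List, Sequence


def missing_attributes(
    attributes: Sequence[str], *column_sets: Iterable[str]
) -> List[List[str]]:
    """For each collection of column names, list the attributes it lacks."""
    report = []
    for columns in column_sets:
        present = set(columns)
        report.append([attr for attr in attributes if attr not in present])
    return report


# Hypothetical attribute and column names, for illustration only.
attributes = ["first_name", "last_name", "zip_code"]
df_columns = ["first_name", "last_name", "zip_code", "age"]
df2_columns = ["first_name", "last_name", "county"]

missing_df, missing_df2 = missing_attributes(attributes, df_columns, df2_columns)
if missing_df or missing_df2:
    print(f"df is missing {missing_df}; df2 is missing {missing_df2}")
```

With the dataframes from the example, the call would presumably be missing_attributes(settings.attributes, df.columns, df2.columns).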
4.2. start fastAPI
Run
DEDUPER_NAME="<project name>";
DEDUPER_FOLDER="<project folder>";
python -m dedupe.fastapi.main
replacing <project name> and <project folder> with your project settings (for the example above, test and ./.dedupe).
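If the service reads DEDUPER_NAME and DEDUPER_FOLDER from its environment (an assumption; the text does not say how they are consumed), it could also be launched from Python with both variables passed explicitly. In a POSIX shell, a plain VAR=value; assignment is only visible to child processes if the variable is exported, so passing the environment directly avoids that pitfall. The helper names service_env and launch_service below are hypothetical; only the module path is taken from the command above.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def service_env(
    name: str, folder: str, base: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Copy the base environment and add the two project variables."""
    env = dict(os.environ if base is None else base)
    env["DEDUPER_NAME"] = name
    env["DEDUPER_FOLDER"] = folder
    return env


def launch_service(name: str, folder: str) -> subprocess.Popen:
    """Start the service as a child process (not called in this sketch)."""
    return subprocess.Popen(
        [sys.executable, "-m", "dedupe.fastapi.main"],
        env=service_env(name, folder),
    )


# Values from the example above; an empty base keeps the output short.
env = service_env("test", "./.dedupe", base={})
print(env)
```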
4.3. label-studio
Return to Label Studio and start labelling. When the queue falls below 5 tasks, the FastAPI service updates the model with the labelled samples and then sends more tasks for review.
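The refill behaviour described above can be pictured with a small sketch. This is not the service's actual code: refill_if_low, the callbacks, and batch_size are hypothetical names chosen for illustration, and only the threshold of 5 comes from the text.

```python
from typing import Callable, List

QUEUE_THRESHOLD = 5  # from the text: refill when fewer than 5 tasks remain


def refill_if_low(
    queue: List[str],
    labelled: List[tuple],
    retrain: Callable[[List[tuple]], None],
    next_tasks: Callable[[int], List[str]],
    batch_size: int = 10,  # hypothetical; the text gives no batch size
) -> bool:
    """Retrain on the labelled samples and top up the queue when it runs low."""
    if len(queue) >= QUEUE_THRESHOLD:
        return False
    retrain(labelled)
    queue.extend(next_tasks(batch_size))
    return True


# Stub callbacks standing in for the model update and the task sampler.
trained: List[List[tuple]] = []
queue = ["t1", "t2", "t3"]
labelled = [("t0", True)]

first = refill_if_low(
    queue, labelled, trained.append, lambda n: [f"new{i}" for i in range(n)], batch_size=4
)
second = refill_if_low(
    queue, labelled, trained.append, lambda n: [f"more{i}" for i in range(n)], batch_size=4
)
print(first, second, len(queue))
```

The first call retrains and adds tasks because only three remain; the second call does nothing because the queue is back above the threshold.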
4.4. predictions
To get predictions, call the predict() method.
d = Dedupe(settings=Settings(name="test", folder="./.dedupe"))
d.predict()
See run.py for the full working example.