3. Generate Training Samples

First, we load the dataframe to shema.df.

Second, we generate training samples which consist of three parts, which we name “positive”, “negative” and “unlabelled” samples:

positive samples: a single sample repeated 4 times
negative samples: 10 random samples
unlabeled samples: a sample size of settings.model.n

These n + 10 + 4 records are loaded into schema.train.

Third, we create schema.labels, which contains nC2 comparison pairs generated using the positive and negative samples. The pairs from positive samples are labeled as a match, while the pairs from negative samples are labeled as a non-match.

Finally, compute distances between comparison pairs. If there are 3 attributes (e.g. name, address, age), there would be 3 separate distance computations.