Assurance - moj-analytical-services/splink_demos

Assessing the quality of linkage results is an essential step before linked data is used for analysis. Three complementary approaches are covered here: visual inspection, statistical tests, and benchmarking. Each is illustrated with an example that draws on the documentation and code snippets in the splink_demos project.

Visual Inspection

Visual inspection is a simple yet effective way to assess the quality of linkage results. By examining the linked data, you can spot potential issues such as duplicate records, missing values, or incorrect matches. The following code snippet, attributed to tutorials/scv.html, plots the linked records on a scatter plot and colors each point by its cluster. It assumes that scv and df_linked are already defined:

# Visualize the linkage results using a scatter plot, colored by cluster.
# Assumes scv and df_linked are already in scope, and that df_linked has
# the columns 'x', 'y', and 'cluster_id'.
scv.scatter_plot(df_linked, 'x', 'y', c='cluster_id')
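Inspection does not have to be graphical. As a minimal sketch of the checks described above (missing values and possibly incorrect matches), the following groups linked records by cluster and flags clusters that deserve a manual look. The function name summarise_clusters, the field names, and the sample records are all hypothetical and are not taken from splink_demos.

```python
from collections import defaultdict


def summarise_clusters(records, id_key="cluster_id"):
    """Group records by cluster and flag fields that are missing or disagree."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record[id_key]].append(record)

    report = {}
    for cluster_id, members in clusters.items():
        fields = [key for key in members[0] if key != id_key]
        # Fields that are empty in at least one member of the cluster.
        missing = sorted(
            {f for m in members for f in fields if m.get(f) in (None, "")}
        )
        # Fields whose non-empty values differ within the cluster.
        conflicting = sorted(
            f for f in fields
            if len({m.get(f) for m in members if m.get(f) not in (None, "")}) > 1
        )
        report[cluster_id] = {
            "size": len(members),
            "missing": missing,
            "conflicting": conflicting,
        }
    return report


# Hypothetical linked records, for illustration only.
linked = [
    {"cluster_id": 1, "first_name": "John", "surname": "Doe", "dob": "1980-01-01"},
    {"cluster_id": 1, "first_name": "Jon", "surname": "Doe", "dob": None},
    {"cluster_id": 2, "first_name": "Jane", "surname": "Doe", "dob": "1975-05-05"},
]

for cluster_id, summary in sorted(summarise_clusters(linked).items()):
    print(cluster_id, summary)
```

A conflicting field is not necessarily an error (here "John" and "Jon" may well be the same person), so the report is a prompt for manual review rather than a verdict.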

Statistical Tests

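Statistical measures can be used to evaluate linkage results by quantifying how similar two records are. Jaro-Winkler similarity is a string comparison metric that ranges from 0 (no similarity) to 1 (identical strings) and gives extra weight to a shared prefix. Because it compares two strings rather than whole records, it is applied field by field. The following sketch, adapted from the snippet attributed to tutorials/07_Quality_assurance.ipynb, assumes the duckdb package is installed and uses DuckDB's built-in jaro_winkler_similarity SQL function:

import duckdb  # assumption: DuckDB supplies the Jaro-Winkler function

# Jaro-Winkler compares two strings, so score the records field by field
record1 = ('John', 'Doe')
record2 = ('Jane', 'Doe')
query = 'SELECT jaro_winkler_similarity(?, ?)'
for value1, value2 in zip(record1, record2):
    similarity_score = duckdb.execute(query, [value1, value2]).fetchone()[0]
    print(f'Jaro-Winkler similarity ({value1!r}, {value2!r}): {similarity_score:.4f}')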
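To make the metric itself concrete, here is a minimal pure-Python sketch of the textbook Jaro-Winkler similarity. It is an illustration of the standard definition (prefix weight 0.1, prefix capped at 4 characters), not the implementation used by Splink or its backends, and it scores single strings rather than whole records.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_similarity(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)


for left, right in [("John", "Jane"), ("Doe", "Doe")]:
    print(f"{left!r} vs {right!r}: {jaro_winkler_similarity(left, right):.4f}")
```

Note that the result is a similarity, not a distance: higher values mean the strings are more alike. A distance, where needed, is commonly taken as one minus the similarity.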

Benchmarking

Benchmarking involves comparing the linkage results against a ground truth dataset to evaluate the accuracy of the linkage process. The following code snippet from tutorials/07_Quality_assurance.ipynb demonstrates how to calculate the precision, recall, and F1 score of the linkage results:

from sklearn.metrics import precision_score, recall_score, f1_score

# Calculate the precision, recall, and F1 score of the linkage results.
# Each element is one record pair: 1 = match, 0 = non-match.
y_true = [0, 1, 1, 0, 1, 0]  # Ground truth labels
y_pred = [0, 1, 1, 1, 1, 0]  # Predicted labels
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 score: {f1:.4f}')
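To show where these numbers come from, the following sketch recomputes the same metrics from the confusion counts using only the standard library. The function name pairwise_metrics is hypothetical; the labels are the same ones used above, read as 1 = match and 0 = non-match.

```python
def pairwise_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 from binary match labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correct matches
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false links
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed links

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


y_true = [0, 1, 1, 0, 1, 0]  # Ground truth labels
y_pred = [0, 1, 1, 1, 1, 0]  # Predicted labels
metrics = pairwise_metrics(y_true, y_pred)
print(metrics)
```

With these labels there are 3 true positives, 1 false positive, and no false negatives, so precision is 3/4 = 0.75, recall is 3/3 = 1.0, and F1 is 2 × 0.75 × 1.0 / 1.75 ≈ 0.8571. In other words, every true match was found, but one in four predicted links is wrong.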

In conclusion, the quality of linkage results can be assessed through visual inspection, statistical tests, and benchmarking. Used together, these methods help you judge how accurate and reliable your linkage results are and show where the linkage needs further work.