API Endpoints for moj-analytical-services/splink_demos

Overview

This document describes the routes defined in the moj-analytical-services/splink_demos codebase. Each route corresponds to a Jupyter notebook and covers one stage of a record linkage workflow: data exploration, blocking, model estimation, or linkage analysis.

Defined Routes

Exploratory Analysis

The exploratory analysis is defined in the Jupyter notebook file located at tutorials/02_Exploratory_analysis.ipynb. The exploration includes insights about data entry errors and unique identifiers:

- Looking at the "Bottom 5 values by value count", we can see typos in the data in most fields. This tells us this information was possibly entered by hand, or using Optical Character Recognition, giving us an insight into the type of data entry errors we may see.

- Email is a much more uniquely-identifying field than any others, with a maximum value count of 6. It's likely to be a strong linking variable.
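
The two observations above rest on per-column value counts. The following is a minimal sketch of that kind of profiling in plain Python, not the charting code the notebook itself uses. The records, the column names, and the helper functions `value_counts`, `bottom_values` and `max_value_count` are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy records; the tutorial notebook works on a much larger dataset.
records = [
    {"first_name": "john", "email": "john@example.com"},
    {"first_name": "john", "email": "john@example.com"},
    {"first_name": "jhon", "email": "j.smith@example.com"},
    {"first_name": "mary", "email": "mary@example.com"},
    {"first_name": "john", "email": None},
]


def value_counts(rows, column):
    """Count how often each non-null value occurs in one column."""
    return Counter(row[column] for row in rows if row[column] is not None)


def bottom_values(rows, column, n=5):
    """Return the n least frequent values, rarest first (ties sorted by value)."""
    counts = value_counts(rows, column)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[:n]


def max_value_count(rows, column):
    """Return the count of the most frequent value, or 0 for an empty column."""
    counts = value_counts(rows, column)
    return max(counts.values()) if counts else 0


for column in ("first_name", "email"):
    print(column, "rarest:", bottom_values(records, column))
    print(column, "max value count:", max_value_count(records, column))
```

Rare values such as "jhon" are where typos tend to surface, and a low maximum value count suggests a column that identifies records almost uniquely.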

Model Estimation

The notebook tutorials/04_Estimating_model_parameters.ipynb addresses model estimation with a specific focus on detecting unlinkable records:

## Detecting unlinkable records

An interesting application of our trained model that is useful to explore before making any predictions is to detect 'unlinkable' records.

Unlinkable records are those which do not contain enough information to be linked. A simple example would be a record containing only 'John Smith', and null in all other fields.
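
One way to make this idea concrete is to score each record against itself: if even a perfect self-comparison does not produce a high match probability, the record carries too little information to be linked. The sketch below illustrates that reasoning with hypothetical match weights and a hypothetical prior. It assumes that null fields contribute no evidence, and it is a simplified illustration, not Splink's implementation.

```python
# Hypothetical match weights (log2 Bayes factors) for an exact match on each field.
MATCH_WEIGHTS = {"first_name": 3.0, "surname": 4.0, "dob": 6.0, "email": 8.0}

# Hypothetical prior: log2 odds that two randomly chosen records are a match.
PRIOR_MATCH_WEIGHT = -10.0


def self_match_probability(record):
    """Match probability when a record is compared with an exact copy of itself.

    Assumption: a null field adds a match weight of 0, i.e. no evidence either way.
    """
    weight = PRIOR_MATCH_WEIGHT + sum(
        field_weight
        for field, field_weight in MATCH_WEIGHTS.items()
        if record.get(field) is not None
    )
    odds = 2.0 ** weight
    return odds / (1.0 + odds)


sparse = {"first_name": "John", "surname": "Smith", "dob": None, "email": None}
full = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1980-01-01",
    "email": "john@example.com",
}

for name, record in (("sparse", sparse), ("full", full)):
    print(name, round(self_match_probability(record), 4))
```

Under these assumed weights, the record holding only a name stays well below any reasonable match threshold even against itself, while the fully populated record scores close to 1.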

Blocking Rules

Blocking rules determine which pairs of records are compared. They are covered in tutorials/03_Blocking.ipynb, which stresses their effect on linkage performance:

**Blocking rules are the most important determinant of the performance of your linkage job**.

When deciding on your blocking rules, you're trading off accuracy for performance.
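
The trade-off can be illustrated by counting candidate pairs. The sketch below uses hypothetical records and a hypothetical helper, `pairs_under_blocking`, to compare the number of pairs generated with no blocking against the number generated when records must agree on a key. It is plain Python for illustration and does not use Splink's own blocking machinery.

```python
from collections import Counter

# Hypothetical toy records. The two "mary" rows are intended to be the same
# person, with a typo in one surname.
records = [
    {"first_name": "john", "surname": "smith"},
    {"first_name": "jon", "surname": "smith"},
    {"first_name": "mary", "surname": "jones"},
    {"first_name": "mary", "surname": "jonse"},
    {"first_name": "anne", "surname": "smith"},
    {"first_name": "john", "surname": "brown"},
]


def total_pairs(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


def pairs_under_blocking(rows, key):
    """Number of pairs compared when only records sharing a key are paired."""
    block_sizes = Counter(key(row) for row in rows)
    return sum(total_pairs(size) for size in block_sizes.values())


print("no blocking:", total_pairs(len(records)))
print("block on surname:", pairs_under_blocking(records, lambda r: r["surname"]))
print("block on first_name:", pairs_under_blocking(records, lambda r: r["first_name"]))
```

Blocking on surname cuts the number of comparisons sharply but never compares the two "mary" rows, because the typo puts them in different blocks. Blocking on first_name does compare them. Tighter rules run faster and miss more true matches, which is the trade-off described above.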

Data Visualization

Charts appear in several notebooks, for example examples/duckdb/deduplicate_50k_synthetic.ipynb. The notebook output embeds them with Vega, Vega-Lite and vega-embed, which are loaded by a script such as:

if (typeof define === "function" && define.amd) {
  requirejs.config({paths});
  require(["vega-embed"], displayChart, err => showError(`Error loading script: ${err.message}`));
} else {
  maybeLoadScript("vega", "5")
    .then(() => maybeLoadScript("vega-lite", "5.8.0"))
    .then(() => maybeLoadScript("vega-embed", "6"))
    .catch(showError)
    .then(() => displayChart(vegaEmbed));
}

Linkage Estimation

Model parameters are estimated with the Expectation-Maximisation (EM) algorithm, as in examples/duckdb/link_only.ipynb. Each call runs a separate training session with a different blocking rule:

# The argument is the blocking rule used to generate record pairs for that training session.
session_dob = linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")
session_email = linker.estimate_parameters_using_expectation_maximisation("l.email = r.email")
session_first_name = linker.estimate_parameters_using_expectation_maximisation("l.first_name = r.first_name")
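
To show what an Expectation-Maximisation training session estimates, the sketch below runs EM on a toy set of binary agreement vectors, one entry per compared field. It is a simplified textbook version under stated assumptions: fields agree independently given match status, and the data, starting values and function names are hypothetical. It is not Splink's implementation.

```python
def clamp(p, eps=1e-6):
    """Keep a probability away from 0 and 1 so likelihoods never vanish."""
    return min(max(p, eps), 1.0 - eps)


def likelihood(gamma, probs):
    """P(gamma | class), assuming fields agree independently given the class."""
    result = 1.0
    for agrees, p in zip(gamma, probs):
        result *= p if agrees else (1.0 - p)
    return result


def run_em(pairs, lam=0.2, iterations=50):
    """Estimate lam = P(match), m = P(agree | match), u = P(agree | non-match)."""
    n_fields = len(pairs[0])
    m = [0.9] * n_fields  # hypothetical starting guess
    u = [0.1] * n_fields  # hypothetical starting guess
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match.
        posteriors = []
        for gamma in pairs:
            match = lam * likelihood(gamma, m)
            non_match = (1.0 - lam) * likelihood(gamma, u)
            posteriors.append(match / (match + non_match))
        # M-step: re-estimate the parameters from the weighted pairs.
        total_match = sum(posteriors)
        total_non_match = len(pairs) - total_match
        lam = clamp(total_match / len(pairs))
        m = [
            clamp(sum(p * g[k] for p, g in zip(posteriors, pairs)) / total_match)
            for k in range(n_fields)
        ]
        u = [
            clamp(sum((1.0 - p) * g[k] for p, g in zip(posteriors, pairs)) / total_non_match)
            for k in range(n_fields)
        ]
    return lam, m, u


# Hypothetical agreement vectors: 1 means the field agrees for that record pair.
pairs = (
    [(1, 1, 1)] * 18 + [(1, 1, 0)] * 4 + [(1, 0, 1)] * 3
    + [(0, 0, 0)] * 60 + [(0, 0, 1)] * 10 + [(0, 1, 0)] * 5
)

lam, m, u = run_em(pairs)
print("P(match):", round(lam, 3))
print("m:", [round(x, 3) for x in m])
print("u:", [round(x, 3) for x in u])
```

The loop alternates between scoring every pair with the current parameters and re-estimating the parameters from those scores, so no labelled matches are needed.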

Conclusion

The routes in this codebase cover exploratory data analysis, blocking, model estimation, detection of unlinkable records, and visualization. Together they walk through the stages of a linkage job in the moj-analytical-services/splink_demos repository.
