Splink Library

This repository contains interactive notebooks with demonstrations and tutorials for version 3 of the Splink record linkage library, whose homepage is here: https://github.com/moj-analytical-services/splink.

Splink Library Overview

The Splink library is a powerful tool for performing record linkage and deduplication. It provides a comprehensive set of features for blocking, comparing, and linking records.

Getting Started

To use Splink, you need to install the library and its dependencies:

  • Download Java for your operating system from here: https://www.oracle.com/java/technologies/downloads/ (Java is required by PySpark).
  • Check the installation succeeded by running java -version in a terminal. It should print details of your Java installation.
  • Clone this repository: git clone https://github.com/moj-analytical-services/splink_demos.git
  • Create a virtual environment using: python3 -m venv venv
  • Activate the virtual environment: source venv/bin/activate
  • Install the package list (which includes pyspark) with: pip3 install -r requirements.txt
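
Once installed, a quick sanity check confirms the environment works. The following is a minimal sketch, assuming the steps above completed and the virtual environment is active; it only verifies that Splink and PySpark import correctly.

    import splink
    import pyspark

    # Both imports succeeding means the core dependencies are installed
    print(splink.__version__)   # should report a 3.x release
    print(pyspark.__version__)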

Input Data Requirements

Before you can use Splink, your input datasets must meet several prerequisites:

  • Unique ID Column: Each input dataset must have a unique ID column, whose values are unique within that dataset. By default, Splink assumes this column is called unique_id, but this can be changed with the unique_id_column_name key in your Splink settings (see the sketch after this list). The unique ID is essential because it enables Splink to keep track of each row correctly.
  • Conformant Datasets: Input datasets must be conformant, meaning they share the same column names and data formats. For instance, if one dataset has a “date of birth” column and another has a “dob” column, rename them to match. Ensure data type and number formatting are consistent across both columns. The order of columns in input dataframes is not important.
  • Cleaning: Ensure your data is clean before using Splink. This includes handling missing values, removing special characters, and standardizing data formats.
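
The sketch below shows how these requirements might be handled, assuming two hypothetical CSV files and pandas for the renaming; the file names, column names, and the customer_id identifier are illustrative, and unique_id_column_name only needs setting when your ID column is not called unique_id.

    import pandas as pd

    # Rename columns so both datasets are conformant (same names, same formats)
    df_left = pd.read_csv("dataset_a.csv")
    df_right = pd.read_csv("dataset_b.csv").rename(columns={"dob": "date_of_birth"})

    # Tell Splink which column holds the within-dataset unique identifier
    settings = {
        "link_type": "link_only",
        "unique_id_column_name": "customer_id",
    }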

The Linkage Workflow

The Splink library follows a typical record linkage workflow (a minimal end-to-end sketch follows the list):

  1. Exploratory Analysis: Analyze your data to understand its distribution and identify potential challenges for linkage. For example, look at the frequency of different values in each field, the number of missing values, and the data types.

  2. Choosing Blocking Rules: Blocking rules are essential for making linkage computationally feasible. They reduce the number of comparisons Splink needs to perform by only generating candidate pairs that satisfy at least one of the rules. Blocking rules are specified as SQL expressions.

  3. Estimating Model Parameters: Splink uses a probabilistic model to score candidate pairs and determine which ones should be linked. To build this model, you define how the information in the input records should be compared, using Comparisons, and then estimate the model's parameters from the data, for example with the expectation-maximisation algorithm.

  4. Prediction: Once the model is trained, Splink can score candidate record pairs, producing a match weight and match probability for each. The library offers several options for visualizing these predictions, including waterfall charts, the comparison viewer, and cluster studio visualizations.

  5. Quality Assurance: Perform quality assurance checks on your predictions to evaluate the performance of your model. Splink provides various metrics and visualization tools to identify false positives and false negatives.
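
The sketch below strings these steps together, assuming the DuckDB backend, Splink 3’s settings dictionary, and a pandas dataframe df with columns unique_id, first_name, surname, dob and city. The comparisons, blocking rules and thresholds are illustrative, and exact import paths vary slightly between 3.x releases.

    import splink.duckdb.comparison_library as cl
    from splink.duckdb.linker import DuckDBLinker

    settings = {
        "link_type": "dedupe_only",
        # Blocking rules: only pairs that agree on one of these are compared
        "blocking_rules_to_generate_predictions": [
            "l.first_name = r.first_name",
            "l.surname = r.surname",
        ],
        # Comparisons: how each field contributes to the match score
        "comparisons": [
            cl.exact_match("first_name"),
            cl.levenshtein_at_thresholds("surname", 2),
            cl.exact_match("dob"),
            cl.exact_match("city"),
        ],
    }

    linker = DuckDBLinker(df, settings)

    # Exploratory analysis, e.g. how much data is missing in each column
    linker.missingness_chart()

    # Estimate the model parameters
    linker.estimate_u_using_random_sampling(1e6)
    linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")

    # Prediction: score candidate pairs, keeping those above a probability threshold
    df_predictions = linker.predict(threshold_match_probability=0.9)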

Example Notebooks

The repository contains example notebooks demonstrating how to use Splink for record linkage and deduplication. They cover a range of use cases and illustrate the different functionalities of the library.

  • Introductory Tutorial: A five-part tutorial demonstrating how to de-duplicate a small dataset using simple settings.
  • Real Time Linkage: Demonstrates Splink’s incremental and real-time linkage capabilities using the linker.compare_two_records and linker.find_matches_to_new_records methods (see the sketch after this list).
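
A minimal sketch of those two methods, assuming a trained linker such as the one in the workflow sketch above; the records and the match weight threshold are illustrative.

    # Two hypothetical records with the same columns as the input data
    record_1 = {"unique_id": 1, "first_name": "Lucas", "surname": "Smith", "dob": "1984-01-02", "city": "London"}
    record_2 = {"unique_id": 2, "first_name": "Lucas", "surname": "Smyth", "dob": "1984-01-02", "city": "London"}

    # Score a single pair of records on the fly
    df_pair = linker.compare_two_records(record_1, record_2).as_pandas_dataframe()

    # Search the data already loaded into the linker for matches to a new record
    df_matches = linker.find_matches_to_new_records(
        [record_1], blocking_rules=[], match_weight_threshold=-4
    ).as_pandas_dataframe()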

Visualization Tools

Splink provides several visualization tools to help you explore your data and understand the results of your linkage models. These tools include the following (a brief usage sketch follows the list):

  • Comparison Viewer: A tool for interactively exploring the results of a linkage model.
  • Cluster Studio: A tool for visualizing clusters of linked records.
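
A brief usage sketch, assuming a trained linker and the df_predictions table from the workflow sketch above; the output file names and the clustering threshold are illustrative.

    # Interactive dashboard for exploring scored comparisons
    linker.comparison_viewer_dashboard(df_predictions, "comparison_viewer.html", overwrite=True)

    # Cluster the pairwise predictions, then visualize the resulting clusters
    df_clusters = linker.cluster_pairwise_predictions_at_threshold(
        df_predictions, threshold_match_probability=0.95
    )
    linker.cluster_studio_dashboard(
        df_predictions, df_clusters, "cluster_studio.html", overwrite=True
    )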

Further Resources

Top-Level Directory Explanations

examples/ - This directory likely contains examples or sample code for using the project’s components.

examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.

examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.

examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical (OLAP) database.

examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.

examples/sqlite/ - This subdirectory may contain examples using SQLite, a popular open-source database management system.

examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.

tutorials/ - This directory may contain tutorials or guides for using the project.