Jupyter Notebook Development

This repository provides a collection of Jupyter notebooks demonstrating Splink’s capabilities for data linking.

Prerequisites:

  • Python 3.10
  • Java (required for pyspark)
  • pip
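
The prerequisites above can be checked before installing anything. The following is a minimal sketch using only the standard library; the function name check_prerequisites is hypothetical, and treating "Python 3.10" as a minimum version rather than an exact one is an assumption.

```python
import shutil
import sys

# From the prerequisites list; treated here as a minimum version (assumption).
REQUIRED_PYTHON = (3, 10)


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    return {
        # Compare only the major and minor version numbers.
        "python": tuple(version_info[:2]) >= REQUIRED_PYTHON,
        # pyspark needs a Java runtime on the PATH.
        "java": which("java") is not None,
        # Either executable name is accepted for pip.
        "pip": which("pip") is not None or which("pip3") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

The version_info and which parameters exist only so the check can be exercised without changing the machine; calling check_prerequisites() with no arguments inspects the current interpreter and PATH.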

Installation:

  1. Clone the repository:
    git clone https://github.com/moj-analytical-services/splink_demos
    cd splink_demos
              
  2. Create a virtual environment:
    python3 -m venv venv
    source venv/bin/activate
              
  3. Install required packages:
    pip3 install -r requirements.txt
              
  4. Install a Jupyter kernel for the virtual environment:
    python -m ipykernel install --user --name=splink_demos
              
  5. Launch Jupyter Lab:
    jupyter lab
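
After the steps above, it can help to confirm from inside a notebook cell that the kernel is really running in the virtual environment and that the expected packages are importable. This is a sketch under assumptions: the helper names are hypothetical, and the module names in the expected list are guesses, since the contents of requirements.txt are not shown here.

```python
import importlib.util
import sys


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return prefix != base_prefix


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names; adjust to match requirements.txt.
    expected = ["splink", "ipykernel", "jupyterlab"]
    print("virtual environment active:", in_virtualenv())
    print("missing packages:", missing_packages(expected) or "none")
```

If the first line prints False inside Jupyter Lab, the notebook is probably using a different kernel than the splink_demos one registered in step 4.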
              

Notebooks:

  • Tutorials: The tutorials directory contains step-by-step notebooks that walk through a Splink workflow in order.
    • 01_Prerequisites.ipynb: Sets up the environment and introduces basic Splink concepts.
    • 02_Exploratory_analysis.ipynb: Explores data and identifies potential linking columns.
    • 03_Blocking.ipynb: Defines blocking rules to efficiently compare records.
    • 04_Estimating_model_parameters.ipynb: Estimates model parameters for accurate linking.
  • Examples: The examples directory contains end-to-end examples showcasing various Splink use cases.
    • duckdb: Demonstrates using Splink with the DuckDB database.
    • quick_and_dirty_persons.ipynb: Provides a basic example of linking person data.

Key Features:

  • Interactive Exploration: Jupyter notebooks provide a flexible and interactive environment for exploring data and experimenting with Splink’s features.
  • Step-by-Step Guidance: Tutorials and examples make it easy to learn and apply Splink concepts.
  • Comprehensive Documentation: The notebooks include in-depth explanations and comments, making it easier to understand the code and workflow.
  • Efficient Data Handling: Splink can run on pyspark for large-scale processing, which makes it practical to link very large datasets. The examples also cover other backends (DuckDB, SQLite, and Athena).

Using Notebooks:

  • Running Notebooks: You can run each notebook interactively in Jupyter Lab.
  • Modifying Notebooks: Feel free to modify and experiment with the provided code to learn more about Splink and its capabilities.

Contributing:

This repository welcomes contributions. If you find a bug, have a feature request, or would like to share your own Splink notebooks, please submit a pull request.

Top-Level Directory Explanations

examples/ - This directory likely contains end-to-end example notebooks, grouped into one subdirectory per backend.

examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.

examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.

examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.

examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.

examples/sqlite/ - This subdirectory may contain examples using SQLite, a widely used embedded, file-based SQL database engine.

examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.

tutorials/ - This directory may contain the numbered tutorial notebooks for learning Splink step by step.
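
Because the directory descriptions above are tentative, the actual layout can be confirmed by listing the notebooks in a local clone. The following sketch uses only the standard library; the function name notebooks_by_directory is hypothetical, and it assumes it is run from the repository root.

```python
from pathlib import Path


def notebooks_by_directory(root):
    """Map each directory under `root` to the sorted notebook names it holds.

    Keys are directory paths relative to `root`, written with forward slashes.
    """
    root = Path(root)
    inventory = {}
    for path in sorted(root.rglob("*.ipynb")):
        # Skip the autosave copies Jupyter keeps in .ipynb_checkpoints.
        if ".ipynb_checkpoints" in path.parts:
            continue
        directory = path.parent.relative_to(root).as_posix()
        inventory.setdefault(directory, []).append(path.name)
    return inventory


if __name__ == "__main__":
    for directory, names in notebooks_by_directory(".").items():
        print(f"{directory}/ ({len(names)} notebooks)")
        for name in names:
            print(f"  {name}")
```

Since the tutorial file names start with a two-digit number, the sorted listing also shows the order in which the tutorials are meant to be read.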