Jupyter Notebook Development
This repository provides a collection of Jupyter notebooks demonstrating Splink’s capabilities for data linking.
Prerequisites:
- Python 3.10
- Java (required for pyspark)
- pip
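The prerequisites above can be checked from Python before installing anything. The following is a minimal sketch using only the standard library. The helper name `check_prerequisites` is hypothetical and not part of this repository, and treating any version at or above 3.10 as acceptable is an assumption, since the list names Python 3.10 only.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite to True or False."""
    return {
        # Assumption: any Python at or above min_python is acceptable.
        "python": sys.version_info[:2] >= min_python,
        # Only checks that a `java` executable is on PATH, not its version.
        "java": shutil.which("java") is not None,
        # pip may be exposed as either `pip` or `pip3`.
        "pip": any(shutil.which(name) for name in ("pip", "pip3")),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A "missing" result for Java only means no `java` executable was found on PATH; a Java installation in a non-standard location would not be detected by this check.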
Installation:
- Clone the repository:
git clone https://github.com/moj-analytical-services/splink_demos
- Create a virtual environment:
python3 -m venv venv
source venv/bin/activate
- Install required packages:
pip3 install -r requirements.txt
- Install a Jupyter kernel for the virtual environment:
python -m ipykernel install --user --name=splink_demos
- Launch Jupyter Lab:
jupyter lab
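Before installing the required packages, it can help to confirm that the virtual environment is actually active, so that packages do not end up in the system Python. A small sketch, assuming an environment created with the standard `venv` module (the function name `in_virtualenv` is illustrative only):

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
```

If this reports that no virtual environment is active, re-run the activate command from the installation steps in the current shell.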
Notebooks:
- Tutorials: The tutorials directory contains step-by-step tutorials that guide you through the process of using Splink.
  - 01_Prerequisites.ipynb: Sets up the environment and introduces basic Splink concepts.
  - 02_Exploratory_analysis.ipynb: Explores data and identifies potential linking columns.
  - 03_Blocking.ipynb: Defines blocking rules to efficiently compare records.
  - 04_Estimating_model_parameters.ipynb: Estimates model parameters for accurate linking.
- Examples: The examples directory contains end-to-end examples showcasing various Splink use cases.
  - duckdb: Demonstrates Splink usage with the DuckDB database.
  - quick_and_dirty_persons.ipynb: Provides a basic example of linking person data.
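To give a feel for what the blocking tutorial covers, the sketch below shows the general idea in plain Python: instead of comparing every record with every other record, records are grouped by a blocking key and only pairs within the same group are compared. This illustrates the concept only and is not Splink's API. The records, the field names, and the helper `candidate_pairs` are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

# Made-up records for illustration only.
records = [
    {"id": 1, "first_name": "john", "surname": "smith"},
    {"id": 2, "first_name": "jon", "surname": "smith"},
    {"id": 3, "first_name": "mary", "surname": "jones"},
    {"id": 4, "first_name": "maria", "surname": "jones"},
    {"id": 5, "first_name": "alan", "surname": "turing"},
]


def candidate_pairs(rows, key):
    """Return id pairs for rows that share the same blocking key value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row["id"])
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are paired up.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Every possible pair, for comparison with the blocked result.
all_pairs = list(combinations([r["id"] for r in records], 2))
blocked = candidate_pairs(records, "surname")
print(len(all_pairs), "pairs without blocking")
print(len(blocked), "pairs when blocking on surname:", blocked)
```

Blocking on surname cuts the comparisons from 10 to 2 in this toy data, at the cost of missing any true match whose surname was recorded differently.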
Key Features:
- Interactive Exploration: Jupyter notebooks provide a flexible and interactive environment for exploring data and experimenting with Splink’s features.
- Step-by-Step Guidance: Tutorials and examples make it easy to learn and apply Splink concepts.
- Comprehensive Documentation: The notebooks include in-depth explanations and comments, making it easier to understand the code and workflow.
- Efficient Data Handling: Splink leverages pyspark for large-scale data processing, allowing for efficient linking of massive datasets.
Using Notebooks:
- Running Notebooks: You can run each notebook interactively in Jupyter Lab.
- Modifying Notebooks: Feel free to modify and experiment with the provided code to learn more about Splink and its capabilities.
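Because a notebook file is a JSON document, it can also be inspected from a script before running or modifying it. The sketch below counts cells by type using only the standard library. The helper name `count_cells` is hypothetical, and the path in the usage section is taken from the tutorials list in this README, so adjust it to match your checkout.

```python
import json
from collections import Counter
from pathlib import Path


def count_cells(notebook_path):
    """Return a Counter of cell types (such as 'code' or 'markdown')."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    # nbformat 4 notebooks keep their cells in a top-level "cells" list.
    return Counter(cell["cell_type"] for cell in notebook.get("cells", []))


if __name__ == "__main__":
    path = Path("tutorials/01_Prerequisites.ipynb")
    if path.exists():
        print(dict(count_cells(path)))
    else:
        print(f"{path} not found; run this from the repository root")
```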
Contributing:
This repository welcomes contributions. If you find a bug, have a feature request, or would like to share your own Splink notebooks, please submit a pull request.
Top-Level Directory Explanations
examples/ - This directory likely contains examples or sample code for using the project’s components.
examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.
examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.
examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.
examples/sqlite/ - This subdirectory may contain examples using SQLite, a popular open-source database management system.
examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.
tutorials/ - This directory may contain tutorials or guides for using the project.
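Since the descriptions above are tentative, the actual layout is easy to confirm from a checkout. The following sketch groups notebook files by the directory that contains them. `notebooks_by_directory` is a hypothetical helper, and the usage section assumes it is run from the repository root.

```python
from pathlib import Path


def notebooks_by_directory(root="."):
    """Map each directory (relative to root) to its sorted notebook names."""
    root = Path(root)
    grouped = {}
    for path in sorted(root.rglob("*.ipynb")):
        # Skip autosave copies that Jupyter keeps in .ipynb_checkpoints.
        if ".ipynb_checkpoints" in path.parts:
            continue
        folder = path.parent.relative_to(root).as_posix()
        grouped.setdefault(folder, []).append(path.name)
    return grouped


if __name__ == "__main__":
    for folder, names in notebooks_by_directory().items():
        print(f"{folder}/")
        for name in names:
            print(f"  {name}")
```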