Documentation & Communication
This repository provides a collection of interactive notebooks that demonstrate Splink version 3, a record linkage library, and serve as tutorials for it.
Jupyter Notebooks
Tutorials:
tutorials/00_Tutorial_Introduction.ipynb
: Introduces the tutorial series, covering the project’s objectives, key concepts, and prerequisites.
tutorials/01_Prerequisites.ipynb
: Outlines essential steps before using Splink, including data preparation, ensuring consistent data representation, and understanding null values.
tutorials/02_Exploratory_analysis.ipynb
: Guides users through exploring the data to understand its characteristics, its peculiarities, and the challenges they pose for data linking. It covers data visualization, summarizing the dataset, identifying key variables for linking, and interpreting the findings of basic exploratory analysis.
tutorials/04_Estimating_model_parameters.ipynb
: Explains how to specify and estimate a linkage model. It introduces “Comparisons,” which define how data from different columns is compared, and shows how similarity is assessed using SQL expressions.
tutorials/06_Visualising_predictions.ipynb
: Introduces Splink’s tools for visualizing predictions, enabling users to gain insights into the model’s behavior and identify potential issues.
tutorials/07_Quality_assurance.ipynb
: Focuses on quality assurance of prediction results, including visualization methods and formal accuracy analysis. It aims to help users understand the likelihood of false positives and false negatives.
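The idea that a Comparison is an ordered list of SQL conditions can be sketched without installing Splink. The dictionary below follows the style used for comparisons in Splink 3 settings, but treat the keys as an assumption and check them against the Splink documentation. The column name `first_name`, the chosen levels, and the helpers `levels_to_case_sql` and `classify_pairs` are hypothetical and exist only for this illustration. Splink generates its own SQL; this sketch uses Python's built-in `sqlite3` to show how each record pair falls into the first level whose condition holds.

```python
import sqlite3

# Hypothetical comparison for a "first_name" column. The _l and _r suffixes
# refer to the left and right record of each candidate pair.
first_name_comparison = {
    "output_column_name": "first_name",
    "comparison_levels": [
        {
            "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "first_name_l = first_name_r",
            "label_for_charts": "Exact match",
        },
        {
            "sql_condition": "substr(first_name_l, 1, 1) = substr(first_name_r, 1, 1)",
            "label_for_charts": "Same first letter",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}


def levels_to_case_sql(comparison):
    """Turn the ordered levels into one CASE expression (illustration only)."""
    branches = []
    for level in comparison["comparison_levels"]:
        label = level["label_for_charts"].replace("'", "''")
        if level["sql_condition"] == "ELSE":
            branches.append(f"ELSE '{label}'")
        else:
            branches.append(f"WHEN {level['sql_condition']} THEN '{label}'")
    return "CASE " + " ".join(branches) + " END"


def classify_pairs(pairs, comparison):
    """Label each (left, right) pair with the first level whose condition holds."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pairs (first_name_l TEXT, first_name_r TEXT)")
    con.executemany("INSERT INTO pairs VALUES (?, ?)", pairs)
    sql = f"SELECT {levels_to_case_sql(comparison)} FROM pairs ORDER BY rowid"
    labels = [row[0] for row in con.execute(sql)]
    con.close()
    return labels


example_pairs = [("Anna", "Anna"), ("Anna", "Anne"), ("Anna", None), ("Anna", "Bob")]
print(classify_pairs(example_pairs, first_name_comparison))
```

The order of the levels matters: the null level comes first so that missing values are never mistaken for a mismatch, and the catch-all `ELSE` level comes last.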
Examples:
examples/duckdb/transactions.ipynb
: Provides a practical example of linking banking transactions. This notebook demonstrates how to perform a one-to-one link on fake data with specific features, such as time delays, hidden fees, and truncated memos.
Documentation
README.md
: Provides a brief overview of the repository, links to the Splink homepage, and highlights its focus on demonstrating and providing tutorials for Splink version 3.
tutorials/scv.html
: Contains code snippets showcasing the use of Splink’s visual utilities and tools.
examples/duckdb/dashboards/comparison_viewer_transactions.html
: Contains code snippets related to data visualization and user interface elements.
Contributing
Please refer to the CONTRIBUTING.md file for details on how to contribute to this project.
Notes
- Ensure data consistency by standardizing formats and handling invalid data before using Splink.
- Represent null values as true nulls (not empty strings) for accurate matching.
- Utilize Splink’s visualization tools to gain insights into model behavior and identify potential issues.
- Conduct quality assurance measures, including visual inspection and formal accuracy analysis, to assess the performance of your linkage models.
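The first two notes can be illustrated with a small cleaning step applied before records reach Splink. This is a minimal sketch using only the Python standard library, not a Splink API. The helper names `clean_value` and `clean_records`, the sample records, and the list of placeholder strings are all hypothetical and should be adapted to the markers that actually occur in your data.

```python
# Hypothetical set of strings that stand in for "missing" in the raw data.
PLACEHOLDERS = {"", "n/a", "na", "null", "none", "unknown"}


def clean_value(value):
    """Trim and lower-case strings; map empty or placeholder strings to None."""
    if not isinstance(value, str):
        # Leave numbers, dates, and existing None values untouched.
        return value
    text = value.strip().lower()
    if text in PLACEHOLDERS:
        return None
    return text


def clean_records(records):
    """Apply clean_value to every field of every record."""
    return [
        {key: clean_value(val) for key, val in record.items()}
        for record in records
    ]


raw = [
    {"first_name": " Anna ", "city": ""},
    {"first_name": "N/A", "city": "LONDON"},
]
cleaned = clean_records(raw)
print(cleaned)
```

After this step, the empty string and the "N/A" marker are both true nulls, so a null-aware comparison can treat them as missing rather than as two values that happen to differ.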
Top-Level Directory Explanations
examples/ - Contains example notebooks that apply Splink to sample datasets, organized into subdirectories by SQL backend.
examples/athena/ - Likely holds examples that run on Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
examples/athena/dashboards/ - Likely holds dashboard files for the Athena examples.
examples/duckdb/ - Holds examples that run on DuckDB, an open-source in-process analytical database, including transactions.ipynb.
examples/duckdb/dashboards/ - Holds dashboard files for the DuckDB examples, such as comparison_viewer_transactions.html.
examples/sqlite/ - Likely holds examples that run on SQLite, a popular open-source database engine.
examples/sqlite/dashboards/ - Likely holds dashboard files for the SQLite examples.
tutorials/ - Holds the tutorial notebooks listed above, along with scv.html.