Documentation & Communication

This repository provides a collection of interactive notebooks that demonstrate and teach Splink version 3, a library for record linkage.

Jupyter Notebooks

  • Tutorials:

    • tutorials/00_Tutorial_Introduction.ipynb: Introduces the tutorial series, covering the project’s objectives, key concepts, and prerequisites.
    • tutorials/01_Prerequisites.ipynb: Outlines essential steps before using Splink, including data preparation, ensuring consistent data representation, and understanding null values.
    • tutorials/02_Exploratory_analysis.ipynb: Guides users through exploring the data to understand its characteristics, its peculiarities, and the challenges they pose for data linking. It covers data visualization, summarizing the dataset, identifying key variables for linking, and interpreting the findings.
    • tutorials/04_Estimating_model_parameters.ipynb: Explains how to specify and estimate a linkage model. It introduces “Comparisons,” which define how data from different columns is compared, and shows how similarity is assessed using SQL expressions.
    • tutorials/06_Visualising_predictions.ipynb: Introduces Splink’s tools for visualizing predictions, which help users understand the model’s behavior and identify potential issues.
    • tutorials/07_Quality_assurance.ipynb: Covers quality assurance of prediction results, including visualization methods and formal accuracy analysis. It aims to help users understand the likelihood of false positives and false negatives.
  • Examples:

    • examples/duckdb/transactions.ipynb: Provides a practical example of linking banking transactions. This notebook demonstrates how to perform a one-to-one link on fake data with specific features, such as time delays, hidden fees, and truncated memos.
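
The idea of a “Comparison” assessed with SQL expressions, as described for tutorials/04_Estimating_model_parameters.ipynb, can be illustrated with a minimal sketch. It does not use Splink itself: the table name pairs, the _l/_r column suffixes, the level numbering, and the helper comparison_levels are all assumptions made for illustration, and the standard-library sqlite3 module stands in for whichever SQL backend a real project would use.

```python
import sqlite3

# Hypothetical comparison for a "first_name" column, written as one SQL CASE
# expression. Each WHEN branch is a similarity level; the numbering
# (2 = exact, 1 = same first three letters, 0 = other, -1 = null) is an
# assumption chosen for this sketch.
COMPARISON_SQL = """
SELECT
    CASE
        WHEN first_name_l IS NULL OR first_name_r IS NULL THEN -1
        WHEN first_name_l = first_name_r THEN 2
        WHEN substr(first_name_l, 1, 3) = substr(first_name_r, 1, 3) THEN 1
        ELSE 0
    END AS level_first_name
FROM pairs
ORDER BY pair_id
"""


def comparison_levels(pairs):
    """Return the similarity level for each (left, right) pair of names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pairs (pair_id INTEGER, first_name_l TEXT, first_name_r TEXT)"
    )
    conn.executemany(
        "INSERT INTO pairs VALUES (?, ?, ?)",
        [(i, left, right) for i, (left, right) in enumerate(pairs)],
    )
    levels = [row[0] for row in conn.execute(COMPARISON_SQL)]
    conn.close()
    return levels


example_pairs = [
    ("robert", "robert"),  # exact match
    ("robert", "robin"),   # same first three letters
    ("robert", "alice"),   # no similarity
    ("robert", None),      # missing value on one side
]
print(comparison_levels(example_pairs))  # [2, 1, 0, -1]
```

Treating nulls as their own level, rather than as a mismatch, keeps missing data from counting as evidence against a match.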

Documentation

  • README.md: Gives a brief overview of the repository, links to the Splink homepage, and states the repository’s focus on demos and tutorials for Splink version 3.
  • tutorials/scv.html: Contains code snippets showcasing the use of Splink’s visual utilities and tools.
  • examples/duckdb/dashboards/comparison_viewer_transactions.html: Contains code snippets related to data visualization and user interface elements.

Contributing

Please refer to the CONTRIBUTING.md file for details on how to contribute to this project.

Notes

  • Ensure data consistency by standardizing formats and handling invalid data before using Splink.
  • Represent null values as true nulls (not empty strings) for accurate matching.
  • Use Splink’s visualization tools to gain insights into model behavior and identify potential issues.
  • Carry out quality assurance, including visual inspection and formal accuracy analysis, to assess the performance of your linkage models.
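
The first two notes can be sketched in plain Python. This is only an illustration under stated assumptions, not a Splink API: the helper names normalise_value and normalise_record and the list of placeholder strings are hypothetical, and a real project would adapt them to its own data.

```python
# Hypothetical set of strings that stand in for "missing" in the raw data.
# Adjust this to the placeholders that actually occur in your dataset.
NULL_PLACEHOLDERS = {"", "n/a", "null", "unknown"}


def normalise_value(value):
    """Trim whitespace and turn empty or placeholder strings into None."""
    if isinstance(value, str):
        cleaned = value.strip()
        if cleaned.lower() in NULL_PLACEHOLDERS:
            return None  # a true null, not an empty string
        return cleaned
    return value  # non-string values (including None) pass through unchanged


def normalise_record(record):
    """Apply normalise_value to every field of a record (a dict)."""
    return {key: normalise_value(value) for key, value in record.items()}


raw = {"first_name": "  Alice ", "dob": "", "city": "N/A", "age": 42}
print(normalise_record(raw))
# {'first_name': 'Alice', 'dob': None, 'city': None, 'age': 42}
```

Running a step like this before linking means that two records with an empty date of birth are treated as having missing data, rather than as agreeing on the value "".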

Top-Level Directory Explanations

examples/ - Contains example notebooks, grouped into one subdirectory per SQL backend.

examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.

examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.

examples/duckdb/ - Contains examples that use DuckDB, an open-source, in-process analytical database, including transactions.ipynb.

examples/duckdb/dashboards/ - Contains HTML dashboard files for the DuckDB examples, such as comparison_viewer_transactions.html.

examples/sqlite/ - This subdirectory may contain examples using SQLite, a lightweight, embedded SQL database engine.

examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.

tutorials/ - Contains the tutorial notebooks for Splink version 3.