Database and Data Source Integration
Splink offers flexible integration with various data sources and databases, allowing record linkage tasks to run on the data management system that best fits the size of the job. Supported options include:
- DuckDB: a fast, in-process analytical database that is well-suited to smaller datasets. This backend is often recommended for datasets of up to around 1 million records. See tutorials/00_Tutorial_Introduction.ipynb, examples/duckdb/real_time_record_linkage.ipynb, examples/duckdb/febrl4.ipynb
- SQLite: a file-based database commonly used for simple data storage and retrieval. See examples/sqlite/dashboards/50k_cluster.html, examples/sqlite/dashboards/50k_deterministic_cluster.html, tutorials/scv.html, examples/duckdb/dashboards/50k_cluster.html, examples/duckdb/dashboards/comparison_viewer_transactions.html, tutorials/cluster_studio.html, examples/athena/dashboards/50k_cluster.html
- Spark: a distributed data processing engine suited to large-scale record linkage, offering a more scalable option for datasets exceeding 1 million records (a minimal configuration sketch follows this list). See examples/spark/febrl4.ipynb
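To give a sense of how a backend is selected, here is a minimal sketch of configuring Splink against Spark. It assumes the Splink 3.x SparkLinker interface and uses hypothetical input paths (input_records.parquet, trained_linkage_model.json); the next section walks through the DuckDB equivalent in more detail.
import json
from pyspark.sql import SparkSession
from splink.spark.linker import SparkLinker  # Splink 3.x import path (assumed)

spark = SparkSession.builder.appName("splink_spark_example").getOrCreate()

# Hypothetical inputs: a Parquet file of records and a saved Splink settings/model file
df = spark.read.parquet("input_records.parquet")
with open("trained_linkage_model.json", "r") as f:
    settings = json.load(f)

# Create a linker that runs all of Splink's SQL against the Spark cluster
linker = SparkLinker(df, settings)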
Example
Let’s examine how Splink can be configured for DuckDB using the examples/duckdb/real_time_record_linkage.ipynb notebook.
import json

import duckdb
import pandas as pd
from splink.duckdb.linker import DuckDBLinker  # Splink 3.x import path (assumed)

# Load a pre-trained linkage model (a Splink settings dictionary) from a file
with open("trained_linkage_model.json", "r") as f:
    settings = json.load(f)

# Hypothetical input dataset to deduplicate or link
df = pd.read_csv("input_records.csv")

# Create a linker with the DuckDB backend; an existing DuckDB connection can be
# passed via the connection argument (it defaults to an in-memory database)
con = duckdb.connect()
linker = DuckDBLinker(df, settings, connection=con)
In this example, we load a trained linkage model from a JSON file and create a linker object using the DuckDB backend (shown here with the Splink 3.x DuckDBLinker; later Splink versions achieve the same by passing a DuckDB-backed database API object to the Linker constructor). Passing an existing DuckDB connection via the connection argument configures Splink to run all of its SQL against that DuckDB database.
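Since the referenced notebook focuses on real-time linkage, the typical next step is scoring a pair of records on the fly. The sketch below assumes the Splink 3.x compare_two_records method; the column names are illustrative and would need to match those used in the trained model.
# Two hypothetical records with columns matching the trained model
record_1 = {"unique_id": 1, "first_name": "lucas", "surname": "smith", "dob": "1984-01-02", "city": "London"}
record_2 = {"unique_id": 2, "first_name": "lucas", "surname": "smyth", "dob": "1984-01-02", "city": "London"}

# Score the pair using the trained model; returns a Splink DataFrame
df_pair = linker.compare_two_records(record_1, record_2)
print(df_pair.as_pandas_dataframe())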
Important Considerations
Before applying Splink, it’s essential to ensure your datasets meet certain requirements for proper integration:
- Unique IDs: Every dataset must contain a column of unique IDs. By default, Splink expects this column to be named unique_id, but this can be customized with the unique_id_column_name setting. See tutorials/01_Prerequisites.ipynb
- Conformant Data: Datasets should be “conformant”, meaning they share identical column names and data formats. See tutorials/01_Prerequisites.ipynb
- Data Consistency: Ensure consistent data representation across all datasets: standardize date formats, match text cases, and handle invalid data appropriately (see the preparation sketch below). See tutorials/01_Prerequisites.ipynb
- Null Values: Represent null values using true nulls, not empty strings. Splink differentiates between the two, ensuring proper matching. See tutorials/01_Prerequisites.ipynb
By adhering to these guidelines, you ensure seamless integration and effective record linkage with Splink.
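As a rough illustration, the preparation described above can often be done with a few lines of pandas before the data is handed to Splink. The input file and column names (dob, first_name, surname) below are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("input_records.csv")  # hypothetical input file

# Unique IDs: add a unique_id column if the data does not already have one
if "unique_id" not in df.columns:
    df["unique_id"] = range(len(df))

# Data consistency: standardize date formats and text case
df["dob"] = pd.to_datetime(df["dob"], errors="coerce").dt.strftime("%Y-%m-%d")
for col in ["first_name", "surname"]:
    df[col] = df[col].str.strip().str.lower()

# Null values: replace empty strings with true nulls so Splink treats them as missing
df = df.replace("", np.nan)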
Top-Level Directory Explanations
examples/ - Examples and sample code for using the project’s components.
examples/athena/ - Examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
examples/athena/dashboards/ - Athena dashboard files.
examples/duckdb/ - Examples using DuckDB, an open-source in-process analytical database.
examples/duckdb/dashboards/ - DuckDB dashboard files.
examples/sqlite/ - Examples using SQLite, a popular open-source database management system.
examples/sqlite/dashboards/ - SQLite dashboard files.
tutorials/ - Tutorials and guides for using the project.