Database and Data Source Integration

Splink integrates with several databases and data sources, so record linkage tasks can run on the data management system that already holds the data. The examples/ directory has subdirectories for DuckDB, Amazon Athena, and SQLite.

Example

Let’s examine how Splink can be configured for DuckDB using the examples/duckdb/real_time_record_linkage.ipynb notebook.

    import json

    import duckdb
    import pandas as pd
    from splink import DuckDBAPI, Linker  # assumes Splink 4 or later

    # Load the pre-trained linkage model from a file
    with open("trained_linkage_model.json", "r") as f:
        settings = json.load(f)

    # Input records to link; the file name is a placeholder
    df = pd.read_csv("input_records.csv")

    # Create a Splink linker object with the DuckDB backend
    con = duckdb.connect()
    db_api = DuckDBAPI(connection=con)
    linker = Linker(df, settings, db_api=db_api)

In this example, we load the linkage model from a JSON file, wrap a DuckDB connection in a DuckDBAPI object, and pass both to the Linker together with the input data. Splink then runs all of its operations through DuckDB. The import shown assumes Splink 4; in Splink 3 the equivalent entry point is DuckDBLinker from splink.duckdb.linker.
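Before handing the loaded settings to a linker, it can help to confirm that the file contains what you expect. The following is a minimal sketch using only the standard library. summarize_settings is a hypothetical helper, not part of Splink, and the sample JSON is a made-up stand-in for the contents of trained_linkage_model.json. It reads the link_type, unique_id_column_name, and comparisons keys, falling back to unique_id when no ID column is configured.

```python
import json


def summarize_settings(settings: dict) -> dict:
    """Return a small summary of a Splink-style settings dictionary.

    Hypothetical helper for illustration; it only reads keys and never
    calls Splink itself.
    """
    return {
        # Fall back to the default column name when none is configured.
        "unique_id_column": settings.get("unique_id_column_name", "unique_id"),
        "link_type": settings.get("link_type"),
        "comparison_count": len(settings.get("comparisons", [])),
    }


# Made-up stand-in for the contents of trained_linkage_model.json.
sample_json = """
{
  "link_type": "dedupe_only",
  "comparisons": [
    {"output_column_name": "first_name"},
    {"output_column_name": "dob"}
  ]
}
"""

summary = summarize_settings(json.loads(sample_json))
print(summary)
```

A check like this is cheap to run before creating the linker, and it surfaces a mismatched ID column name early rather than partway through a linkage job.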

Important Considerations

Before applying Splink, make sure your datasets meet the following requirements, which are described in tutorials/01_Prerequisites.ipynb:

  • Unique IDs: Every dataset must contain a column of unique IDs. By default, Splink expects this column to be named unique_id; use the unique_id_column_name setting to change it.
  • Conformant Data: Datasets should be “conformant”, meaning they share identical column names and data formats.
  • Data Consistency: Represent data the same way across all datasets: standardize date formats, match text case, and handle invalid data appropriately.
  • Null Values: Represent missing values as true nulls, not empty strings. Splink treats the two differently, so true nulls are needed for proper matching.

By adhering to these guidelines, you ensure seamless integration and effective record linkage with Splink.
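The requirements above can be checked and enforced before any data reaches Splink. The following is a minimal sketch in plain Python, independent of any Splink API. The function names, the dob column, and the two accepted date formats are assumptions made for illustration. It turns empty strings into true nulls, lower-cases text, rewrites dates in ISO format, rejects missing or duplicate IDs, and compares column names across two datasets.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def clean_value(value):
    """Trim and lower-case text; turn empty strings into a true null (None)."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return text or None


def clean_date(value):
    """Return an ISO 8601 date string, or None if the value cannot be parsed."""
    text = clean_value(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # treat invalid dates as missing


def prepare(records, id_column="unique_id", date_columns=("dob",)):
    """Clean every record and fail if an ID is missing or duplicated."""
    seen, cleaned = set(), []
    for record in records:
        row_id = record.get(id_column)
        if row_id is None or row_id in seen:
            raise ValueError(f"missing or duplicate {id_column}: {row_id!r}")
        seen.add(row_id)
        row = {
            key: clean_date(val) if key in date_columns else clean_value(val)
            for key, val in record.items()
            if key != id_column
        }
        row[id_column] = row_id  # keep the ID exactly as supplied
        cleaned.append(row)
    return cleaned


def is_conformant(left, right):
    """True if both cleaned datasets expose the same column names."""
    return set(left[0]) == set(right[0])


raw_a = [
    {"unique_id": 1, "first_name": " Alice ", "dob": "03/02/1990"},
    {"unique_id": 2, "first_name": "", "dob": "not a date"},
]
raw_b = [{"unique_id": 1, "first_name": "ALICE", "dob": "1990-02-03"}]

clean_a = prepare(raw_a)
clean_b = prepare(raw_b)
print(clean_a)
print(is_conformant(clean_a, clean_b))
```

Mapping unparseable dates to None rather than keeping the raw text is a design choice in this sketch: it keeps invalid values from being compared as if they were real data.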

Top-Level Directory Explanations

examples/ - This directory likely contains examples or sample code for using the project’s components.

examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.

examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.

examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.

examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.

examples/sqlite/ - This subdirectory may contain examples using SQLite, a lightweight, embedded SQL database engine.

examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.

tutorials/ - This directory may contain tutorials or guides for using the project.