Database and Data Source Integration
Splink offers flexible integration with various data sources and databases, allowing record linkage tasks to run on the data management system that best fits the size of the job. Supported options include:
- DuckDB: a fast, in-process analytical database that is well-suited to smaller datasets. This backend is often recommended for datasets of up to around 1 million records. See tutorials/00_Tutorial_Introduction.ipynb, examples/duckdb/real_time_record_linkage.ipynb, examples/duckdb/febrl4.ipynb
- SQLite: a file-based database commonly used for simple data storage and retrieval. See examples/sqlite/dashboards/50k_cluster.html, examples/sqlite/dashboards/50k_deterministic_cluster.html, tutorials/scv.html, examples/duckdb/dashboards/50k_cluster.html, examples/duckdb/dashboards/comparison_viewer_transactions.html, tutorials/cluster_studio.html, examples/athena/dashboards/50k_cluster.html
- Spark: a distributed data processing engine suited to large-scale record linkage, offering a more scalable option for datasets exceeding 1 million records (a minimal configuration sketch follows this list). See examples/spark/febrl4.ipynb
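To give a sense of how a backend is selected, here is a minimal sketch of configuring Splink against Spark. It assumes the Splink 3.x SparkLinker interface and uses hypothetical input paths (input_records.parquet, trained_linkage_model.json); the next section walks through the DuckDB equivalent in more detail.
import json
from pyspark.sql import SparkSession
from splink.spark.linker import SparkLinker  # Splink 3.x import path (assumed)

spark = SparkSession.builder.appName("splink_spark_example").getOrCreate()

# Hypothetical inputs: a Parquet file of records and a saved Splink settings/model file
df = spark.read.parquet("input_records.parquet")
with open("trained_linkage_model.json", "r") as f:
    settings = json.load(f)

# Create a linker that runs all of Splink's SQL against the Spark cluster
linker = SparkLinker(df, settings)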
Example
Let’s examine how Splink can be configured for DuckDB using the examples/duckdb/real_time_record_linkage.ipynb notebook.
import json

import duckdb
import pandas as pd
from splink.duckdb.linker import DuckDBLinker  # Splink 3.x import path (assumed)

# Load a pre-trained linkage model (a Splink settings dictionary) from a file
with open("trained_linkage_model.json", "r") as f:
    settings = json.load(f)

# Hypothetical input dataset to deduplicate or link
df = pd.read_csv("input_records.csv")

# Create a linker with the DuckDB backend; an existing DuckDB connection can be
# passed via the connection argument (it defaults to an in-memory database)
con = duckdb.connect()
linker = DuckDBLinker(df, settings, connection=con)
In this example, we load a trained linkage model from a JSON file and create a linker object using the DuckDB backend (shown here with the Splink 3.x DuckDBLinker; later Splink versions achieve the same by passing a DuckDB-backed database API object to the Linker constructor). Passing an existing DuckDB connection via the connection argument configures Splink to run all of its SQL against that DuckDB database.
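Since the referenced notebook focuses on real-time linkage, the typical next step is scoring a pair of records on the fly. The sketch below assumes the Splink 3.x compare_two_records method; the column names are illustrative and would need to match those used in the trained model.
# Two hypothetical records with columns matching the trained model
record_1 = {"unique_id": 1, "first_name": "lucas", "surname": "smith", "dob": "1984-01-02", "city": "London"}
record_2 = {"unique_id": 2, "first_name": "lucas", "surname": "smyth", "dob": "1984-01-02", "city": "London"}

# Score the pair using the trained model; returns a Splink DataFrame
df_pair = linker.compare_two_records(record_1, record_2)
print(df_pair.as_pandas_dataframe())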
Important Considerations
Before applying Splink, it’s essential to ensure your datasets meet certain requirements for proper integration:
- Unique IDs: Every dataset must contain a column of unique IDs. By default, Splink expects this column to be named unique_id, but this can be customized with the unique_id_column_name setting. See tutorials/01_Prerequisites.ipynb
- Conformant Data: Datasets should be “conformant”, meaning they share identical column names and data formats. See tutorials/01_Prerequisites.ipynb
- Data Consistency: Ensure consistent data representation across all datasets: standardize date formats, match text cases, and handle invalid data appropriately (see the preparation sketch below). See tutorials/01_Prerequisites.ipynb
- Null Values: Represent null values using true nulls, not empty strings. Splink differentiates between the two, ensuring proper matching. See tutorials/01_Prerequisites.ipynb
By adhering to these guidelines, you ensure seamless integration and effective record linkage with Splink.
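As a rough illustration, the preparation described above can often be done with a few lines of pandas before the data is handed to Splink. The input file and column names (dob, first_name, surname) below are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("input_records.csv")  # hypothetical input file

# Unique IDs: add a unique_id column if the data does not already have one
if "unique_id" not in df.columns:
    df["unique_id"] = range(len(df))

# Data consistency: standardize date formats and text case
df["dob"] = pd.to_datetime(df["dob"], errors="coerce").dt.strftime("%Y-%m-%d")
for col in ["first_name", "surname"]:
    df[col] = df[col].str.strip().str.lower()

# Null values: replace empty strings with true nulls so Splink treats them as missing
df = df.replace("", np.nan)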
Top-Level Directory Explanations
examples/ - Examples and sample code for using the project’s components.
examples/athena/ - Examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
examples/athena/dashboards/ - Athena dashboard files.
examples/duckdb/ - Examples using DuckDB, an open-source in-process analytical database.
examples/duckdb/dashboards/ - DuckDB dashboard files.
examples/sqlite/ - Examples using SQLite, a popular open-source database management system.
examples/sqlite/dashboards/ - SQLite dashboard files.
tutorials/ - Tutorials and guides for using the project.