In the splink_demos project, various datasets are used to demonstrate different features of Splink, a probabilistic record linkage tool. Here are some of the datasets used and their relevance:

FEBRL datasets: The FEBRL (Freely Extensible Biomedical Record Linkage) datasets are used in several examples to demonstrate probabilistic record linkage. They contain synthetic person records with varying levels of missingness and errors. The files used include source.txt, dataset3.csv, dataset4a.csv, and dataset4b.csv, and the examples cover data cleaning, standardization, and linkage with Splink. (Source)

Synthetic datasets: Used in the sqlite and duckdb examples. These datasets are generated with the splink_generate tool and contain synthetic records with varying levels of missingness and errors. The files used include fake_1000_combined.json, fake_1000_combined.csv, and fake_1000_combined.parquet, which show linkage with Splink across different data formats and databases. (Source)

Real-world datasets: Used in the athena and duckdb examples. These datasets are obtained from public sources and contain real-world records with varying levels of missingness and errors. The files used include openml_credit_approval.csv and openml_credit_approval.parquet. (Source)

Pre-linked datasets: Used in the duckdb examples to demonstrate data analysis and visualization with Splink. These datasets are obtained by linking the FEBRL datasets using Splink and contain linked records with matching and non-matching pairs. The files used include 50k_cluster.csv, 50k_deterministic_cluster.csv, and 50k_transactions.csv. (Source)

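
To make "synthetic records with varying levels of missingness and errors" more concrete, here is a minimal sketch of how noisy duplicates of a clean record could be produced. It is an illustration only and not how the FEBRL or splink_demos data were actually generated: the function name corrupt_record, the field names, and the probabilities are all hypothetical.

```python
import random


def corrupt_record(record, rng, p_missing=0.2, p_typo=0.2):
    """Return a copy of `record` with some fields blanked or misspelled.

    Hypothetical helper for illustration; the probabilities are arbitrary.
    """
    noisy = {}
    for field, value in record.items():
        roll = rng.random()
        if roll < p_missing:
            # Simulate a missing value.
            noisy[field] = None
        elif roll < p_missing + p_typo and len(value) > 1:
            # Simulate a typo by swapping two adjacent characters
            # (a no-op if the two characters happen to be identical).
            i = rng.randrange(len(value) - 1)
            noisy[field] = value[:i] + value[i + 1] + value[i] + value[i + 2:]
        else:
            noisy[field] = value
    return noisy


# Made-up example record; the field names are assumptions, not a real schema.
rng = random.Random(0)
clean = {"first_name": "amelia", "surname": "jones", "dob": "1987-03-14"}
duplicates = [corrupt_record(clean, rng) for _ in range(3)]

for row in duplicates:
    print(row)
```

Records like these refer to the same entity but no longer match exactly on every field, which is the situation a probabilistic linkage tool is designed to handle.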
Together, these datasets cover data cleaning, standardization, linkage, analysis, and visualization. By using datasets with different characteristics, the splink_demos project gives a broad overview of Splink's capabilities.