Datasets - moj-analytical-services/splink_demos

The splink_demos project uses several datasets to demonstrate the features of Splink, a probabilistic record linkage tool. The main datasets and their roles are:

  1. FEBRL Datasets: The FEBRL (Freely Extensible Biomedical Record Linkage) datasets are used in several examples to demonstrate probabilistic record linkage. They contain synthetic person records with varying levels of missingness and errors. The files used include source.txt, dataset3.csv, dataset4a.csv, and dataset4b.csv, and the examples built on them cover data cleaning, standardization, and linkage with Splink. (Source)

  2. Synthetic Datasets: Synthetic datasets are used in the SQLite and DuckDB examples. They are generated using the splink_generate tool and contain synthetic records with varying levels of missingness and errors. The files used include fake_1000_combined.json, fake_1000_combined.csv, and fake_1000_combined.parquet, so the same records can be linked from different file formats and database backends. (Source)

  3. Real-world Datasets: Real-world datasets are used in the Athena and DuckDB examples. They are obtained from public sources and contain real records with varying levels of missingness and errors. The files used include openml_credit_approval.csv and openml_credit_approval.parquet, which show how Splink behaves on data that was not generated for the purpose. (Source)

  4. Pre-linked Datasets: Pre-linked datasets are used in the DuckDB examples to demonstrate analysis and visualization of linkage results. They are produced by linking the FEBRL datasets with Splink and contain linked records with matching and non-matching pairs. The files used include 50k_cluster.csv, 50k_deterministic_cluster.csv, and 50k_transactions.csv. (Source)
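
To make the idea of "matching pairs" in a pre-linked dataset concrete, here is a minimal sketch that derives them from cluster assignments. It uses only the Python standard library and a tiny inline sample. The column names `unique_id` and `cluster_id` and the sample rows are assumptions for illustration and are not taken from the actual files in the repository.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Hypothetical sample standing in for a clustered output file;
# the column names `unique_id` and `cluster_id` are assumptions.
SAMPLE_CSV = """unique_id,cluster_id
a1,c1
a2,c1
a3,c1
b1,c2
b2,c2
d1,c3
"""


def matching_pairs(rows):
    """Return the set of record-ID pairs that share a cluster."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["cluster_id"]].append(row["unique_id"])
    pairs = set()
    for members in clusters.values():
        # Every unordered pair inside one cluster counts as a match.
        pairs.update(combinations(sorted(members), 2))
    return pairs


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
pairs = matching_pairs(rows)
# Number of unordered pairs that could be formed from all records.
total_pairs = len(rows) * (len(rows) - 1) // 2

print(sorted(pairs))
print(f"{len(pairs)} matching pairs out of {total_pairs} possible pairs")
```

Any pair of records not in `pairs` is a non-matching pair under this clustering. On real data the number of possible pairs grows quadratically with the number of records, so enumerating them like this is only practical for small samples.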

Together, these datasets cover data cleaning, standardization, linkage, analysis, and visualization. Because they differ in origin, file format, and data quality, the splink_demos project can show a broad range of Splink's capabilities.
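
The cleaning and standardization step mentioned above, along with the notion of "levels of missingness", can be sketched as follows. This is a simplified illustration using only the Python standard library, not the approach taken in the demo notebooks. The sample records, the column names, and the helper functions `standardize` and `missingness` are all hypothetical.

```python
import csv
import io

# Hypothetical sample records; the columns are assumptions and do not
# reproduce the schema of any dataset in the repository.
SAMPLE_CSV = """unique_id,first_name,surname,dob
1, Julia ,TAYLOR,1985-03-12
2,julia,,1985-03-12
3,,Tayler,
"""


def standardize(value):
    """Trim whitespace and lowercase; map empty strings to None."""
    cleaned = value.strip().lower()
    return cleaned or None


def missingness(rows):
    """Return the fraction of missing (None) values per column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


raw = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rows = [{k: standardize(v) for k, v in r.items()} for r in raw]
rates = missingness(rows)

for column, rate in rates.items():
    print(f"{column}: {rate:.0%} missing")
```

Representing missing values as `None` rather than empty strings matters for linkage: two empty strings would otherwise compare as equal and look like agreement between records.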