Data Manipulation & Analysis
This outline provides an overview of the data manipulation and analysis functionalities within the Splink project, focusing on the “tutorials” directory. The primary motivation is to empower developers in understanding how to prepare, manipulate, and analyze datasets used for record linkage. This involves tasks like data cleaning, transformation, and feature engineering.
1. Exploratory Analysis
The “02_Exploratory_analysis.ipynb” notebook serves as an illustrative example of how to perform basic exploratory analysis. This analysis aims to understand the data and identify characteristics relevant to data linking. The notebook demonstrates reading in data and performing initial data exploration steps.
1.1. Reading In Data
The notebook uses a synthetic dataset named “fake_20000.csv” located in the “data” directory. The data consists of 20,000 rows with columns including unique_id
, first_name
, surname
, dob
, city
, email
, and cluster
. It is a synthetic dataset with duplicates, designed for demonstration purposes. The first few rows are shown below.
unique_id first_name surname dob city email cluster
0 0 Robert Alan 1971-06-24 NaN NaN 0
1 1 Robert Allen 1971-05-24 NaN NaN 1
2 2 Rob Allen 1971-06-24 London NaN 1
3 3 Robert Alen 1971-06-24 Lonon NaN 0
4 4 Grace NaN 1997-04-26 Hull NaN 0
2. Data Preparation
The data needs to be prepared for record linkage, which often involves:
Data Cleaning: Handling missing values, removing inconsistent data, and correcting typos. For example, the
city
column in the “fake_20000.csv” dataset exhibits inconsistent spellings (e.g., “oLondon”, “oLndon”, “Lndoo”, “Mancesthr”).Data Transformation: Applying transformations like converting date formats, standardizing text, and creating new features.
3. Feature Engineering
Feature engineering plays a crucial role in record linkage. Splink employs various string similarity measures and other features to identify potential matches. Examples of features:
- Jaro-Winkler Similarity: The
case_expression
in the “fake_1000_combined.json” file demonstrates the use of Jaro-Winkler similarity for comparingsurname
values.
{
"col_name": "surname",
"num_levels": 3,
"term_frequency_adjustments": true,
"gamma_index": 1,
"data_type": "string",
"fix_u_probabilities": false,
"fix_m_probabilities": false,
"case_expression": "case\n when surname_l is null or surname_r is null then -1\n when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n else 0 end as gamma_surname",
"m_probabilities": [
0.38219314199949384,
0.12219952769923849,
0.49560733030126775
],
"u_probabilities": [
0.9926988136057959,
0.007301186394204133
],
"tf_adjustment_weights": [
0.0,
1.0
]
}
- Exact Matching: The
case_expression
in the “fake_1000_combined.json” file illustrates the use of exact matching for comparingemail
values.
{
"col_name": "email",
"num_levels": 2,
"term_frequency_adjustments": true,
"gamma_index": 1,
"data_type": "string",
"fix_u_probabilities": false,
"fix_m_probabilities": false,
"case_expression": "case\n when email_l is null or email_r is null then -1\n when email_l = email_r then 1\n else 0 end as gamma_email",
"m_probabilities": [
0.33765358450383004,
0.6623464154961699
],
"u_probabilities": [
0.9990162350531908,
0.000983764946809028
],
"tf_adjustment_weights": [
0.0,
1.0
]
}
4. Data Analysis
Clustering: The notebook uses the
cluster
column to identify groups of potentially linked records.Visualization: The notebook uses visualizations to gain insights into the data distribution, such as histograms for numeric features and bar charts for categorical features.
Statistical Summaries: The notebook calculates various statistical summaries, such as mean, standard deviation, and frequency distributions, to understand the data’s characteristics.
5. Additional Configurations
The “fake_1000_combined.json” file also provides additional configuration parameters for Splink. For example:
em_convergence
: The convergence criterion for the Expectation-Maximization (EM) algorithm, which helps find the most likely linkage.source_dataset_column_name
: Specifies the name of the column that identifies the source dataset for each record.unique_id_column_name
: Indicates the name of the column containing unique identifiers for each record.
6. Conclusion
Understanding the data manipulation and analysis processes within Splink is essential for effective record linkage. The “tutorials” directory, particularly the “02_Exploratory_analysis.ipynb” notebook, offers a practical starting point for developers to familiarize themselves with these functionalities.
## Top-Level Directory Explanations
<a class='local-link directory-link' data-ref="data/" href="#data/">data/</a> - This directory likely contains data used by the project. The specific contents of this directory may vary.
<a class='local-link directory-link' data-ref="examples/" href="#examples/">examples/</a> - This directory likely contains examples or sample code for using the project's components.
<a class='local-link directory-link' data-ref="examples/athena/" href="#examples/athena/">examples/athena/</a> - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
<a class='local-link directory-link' data-ref="examples/athena/dashboards/" href="#examples/athena/dashboards/">examples/athena/dashboards/</a> - This subdirectory may contain Athena dashboard files.
<a class='local-link directory-link' data-ref="examples/duckdb/" href="#examples/duckdb/">examples/duckdb/</a> - This subdirectory may contain examples using DuckDB, an open-source in-memory analytic database.
<a class='local-link directory-link' data-ref="examples/duckdb/dashboards/" href="#examples/duckdb/dashboards/">examples/duckdb/dashboards/</a> - This subdirectory may contain DuckDB dashboard files.
<a class='local-link directory-link' data-ref="examples/sqlite/" href="#examples/sqlite/">examples/sqlite/</a> - This subdirectory may contain examples using SQLite, a popular open-source database management system.
<a class='local-link directory-link' data-ref="examples/sqlite/dashboards/" href="#examples/sqlite/dashboards/">examples/sqlite/dashboards/</a> - This subdirectory may contain SQLite dashboard files.
<a class='local-link directory-link' data-ref="tutorials/" href="#tutorials/">tutorials/</a> - This directory may contain tutorials or guides for using the project.