Data Manipulation & Analysis @ moj-analytical-services/splink_demos

Data Manipulation & Analysis

This outline provides an overview of the data manipulation and analysis functionalities within the Splink project, focusing on the “tutorials” directory. The primary motivation is to empower developers in understanding how to prepare, manipulate, and analyze datasets used for record linkage. This involves tasks like data cleaning, transformation, and feature engineering.

1. Exploratory Analysis

The “02_Exploratory_analysis.ipynb” notebook serves as an illustrative example of how to perform basic exploratory analysis. This analysis aims to understand the data and identify characteristics relevant to data linking. The notebook demonstrates reading in data and performing initial data exploration steps.

1.1. Reading In Data

The notebook uses a synthetic dataset named “fake_20000.csv” located in the “data” directory. The data consists of 20,000 rows with columns including unique_id, first_name, surname, dob, city, email, and cluster. It is a synthetic dataset with duplicates, designed for demonstration purposes. The first few rows are shown below.

   unique_id first_name surname         dob    city                    email  cluster
          0          0     Robert    Alan  1971-06-24     NaN                      NaN       0
          1          1     Robert   Allen  1971-05-24     NaN                      NaN       1
          2          2        Rob   Allen  1971-06-24  London                      NaN       1
          3          3     Robert    Alen  1971-06-24   Lonon                      NaN       0
          4          4      Grace     NaN  1997-04-26    Hull                      NaN       0

2. Data Preparation

The data needs to be prepared for record linkage, which often involves:

Data Cleaning: Handling missing values, removing inconsistent data, and correcting typos. For example, the city column in the “fake_20000.csv” dataset exhibits inconsistent spellings (e.g., “oLondon”, “oLndon”, “Lndoo”, “Mancesthr”).
Data Transformation: Applying transformations like converting date formats, standardizing text, and creating new features.

3. Feature Engineering

Feature engineering plays a crucial role in record linkage. Splink employs various string similarity measures and other features to identify potential matches. Examples of features:

Jaro-Winkler Similarity: The case_expression in the “fake_1000_combined.json” file demonstrates the use of Jaro-Winkler similarity for comparing surname values.

{
              "col_name": "surname",
              "num_levels": 3,
              "term_frequency_adjustments": true,
              "gamma_index": 1,
              "data_type": "string",
              "fix_u_probabilities": false,
              "fix_m_probabilities": false,
              "case_expression": "case\n    when surname_l is null or surname_r is null then -1\n    when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n    when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n    else 0 end as gamma_surname",
              "m_probabilities": [
                  0.38219314199949384,
                  0.12219952769923849,
                  0.49560733030126775
              ],
              "u_probabilities": [
                  0.9926988136057959,
                  0.007301186394204133
              ],
              "tf_adjustment_weights": [
                  0.0,
                  1.0
              ]
          }

Exact Matching: The case_expression in the “fake_1000_combined.json” file illustrates the use of exact matching for comparing email values.

{
              "col_name": "email",
              "num_levels": 2,
              "term_frequency_adjustments": true,
              "gamma_index": 1,
              "data_type": "string",
              "fix_u_probabilities": false,
              "fix_m_probabilities": false,
              "case_expression": "case\n    when email_l is null or email_r is null then -1\n    when email_l = email_r then 1\n    else 0 end as gamma_email",
              "m_probabilities": [
                  0.33765358450383004,
                  0.6623464154961699
              ],
              "u_probabilities": [
                  0.9990162350531908,
                  0.000983764946809028
              ],
              "tf_adjustment_weights": [
                  0.0,
                  1.0
              ]
          }

4. Data Analysis

Clustering: The notebook uses the cluster column to identify groups of potentially linked records.
Visualization: The notebook uses visualizations to gain insights into the data distribution, such as histograms for numeric features and bar charts for categorical features.
Statistical Summaries: The notebook calculates various statistical summaries, such as mean, standard deviation, and frequency distributions, to understand the data’s characteristics.

5. Additional Configurations

The “fake_1000_combined.json” file also provides additional configuration parameters for Splink. For example:

em_convergence: The convergence criterion for the Expectation-Maximization (EM) algorithm, which helps find the most likely linkage.
source_dataset_column_name: Specifies the name of the column that identifies the source dataset for each record.
unique_id_column_name: Indicates the name of the column containing unique identifiers for each record.

6. Conclusion

Understanding the data manipulation and analysis processes within Splink is essential for effective record linkage. The “tutorials” directory, particularly the “02_Exploratory_analysis.ipynb” notebook, offers a practical starting point for developers to familiarize themselves with these functionalities.


          ## Top-Level Directory Explanations
          
          <a class='local-link directory-link' data-ref="data/" href="#data/">data/</a> - This directory likely contains data used by the project. The specific contents of this directory may vary.
          
          <a class='local-link directory-link' data-ref="examples/" href="#examples/">examples/</a> - This directory likely contains examples or sample code for using the project's components.
          
          <a class='local-link directory-link' data-ref="examples/athena/" href="#examples/athena/">examples/athena/</a> - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
          
          <a class='local-link directory-link' data-ref="examples/athena/dashboards/" href="#examples/athena/dashboards/">examples/athena/dashboards/</a> - This subdirectory may contain Athena dashboard files.
          
          <a class='local-link directory-link' data-ref="examples/duckdb/" href="#examples/duckdb/">examples/duckdb/</a> - This subdirectory may contain examples using DuckDB, an open-source in-memory analytic database.
          
          <a class='local-link directory-link' data-ref="examples/duckdb/dashboards/" href="#examples/duckdb/dashboards/">examples/duckdb/dashboards/</a> - This subdirectory may contain DuckDB dashboard files.
          
          <a class='local-link directory-link' data-ref="examples/sqlite/" href="#examples/sqlite/">examples/sqlite/</a> - This subdirectory may contain examples using SQLite, a popular open-source database management system.
          
          <a class='local-link directory-link' data-ref="examples/sqlite/dashboards/" href="#examples/sqlite/dashboards/">examples/sqlite/dashboards/</a> - This subdirectory may contain SQLite dashboard files.
          
          <a class='local-link directory-link' data-ref="tutorials/" href="#tutorials/">tutorials/</a> - This directory may contain tutorials or guides for using the project.

Data Manipulation & Analysis

Explanation

Graph

Symbols

We couldn't identify any entrypoints. If you believe this to be incorrect then please contact support.