Model Training & Evaluation

Splink's model training and evaluation code uses an expectation-maximization (EM) algorithm to estimate model parameters, chiefly the m probabilities, which describe how likely each comparison outcome is when two records are a true match. The estimation is iterative: parameters are refined repeatedly until they stabilize.

How It Works:

  1. Blocking: The training data is divided into smaller blocks based on blocking rules. This reduces the number of comparisons needed during the estimation process.
  2. EM Algorithm: Within the blocked record pairs, the EM algorithm estimates the m_probability values (the probability of observing each comparison level given that two records are a match) and probability_two_random_records_match (the prior probability that two random records are a match).
  3. Iterative Refinement: The EM algorithm iteratively refines these probabilities until they converge to a stable point.
  4. Convergence Criteria: The algorithm stops when the change in parameter estimates is below a certain threshold.
  5. Model Training: The converged m_probability values for each comparison level form the trained model parameters.
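The iterative refinement and convergence check described above can be sketched with a toy Fellegi-Sunter-style mixture. This is not Splink's implementation, just a minimal illustration of the E-step/M-step loop for a single binary comparison (1 = values agree, 0 = disagree), with made-up data and starting values:

```python
# Toy EM sketch (illustrative, not Splink's code): fit m, u, and the
# match prior for one binary comparison across ten hypothetical pairs.
gamma = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]  # 1 = agreement, 0 = disagreement

m, u, lam = 0.9, 0.2, 0.5  # initial guesses: m prob, u prob, match prior
for iteration in range(100):
    # E-step: posterior probability that each pair is a match
    post = []
    for g in gamma:
        p_match = lam * (m if g else 1 - m)
        p_non = (1 - lam) * (u if g else 1 - u)
        post.append(p_match / (p_match + p_non))

    # M-step: re-estimate the parameters from the posteriors
    new_lam = sum(post) / len(post)
    new_m = sum(p * g for p, g in zip(post, gamma)) / sum(post)
    new_u = sum((1 - p) * g for p, g in zip(post, gamma)) / (len(post) - sum(post))

    # Convergence criterion: stop when the largest parameter change is tiny
    delta = max(abs(new_lam - lam), abs(new_m - m), abs(new_u - u))
    lam, m, u = new_lam, new_m, new_u
    if delta < 1e-6:
        break
```

Splink's training log reports exactly this kind of per-iteration "largest change" figure and the iteration count at convergence.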

Parameter Estimation Passes:

Splink can perform multiple passes of parameter estimation. Each pass focuses on estimating parameters for a subset of columns, based on the blocking rules.

  • First Pass:
    • Blocking: The blocking rule used in this pass can be specified by the user. For example, "l.first_name = r.first_name and l.surname = r.surname".
    • Parameter Estimation: Estimates are made for comparisons that are not used in the blocking rule. For instance, if the blocking rule uses first_name and surname, the parameter estimates will be for columns like dob, city, and email. [Source: tutorials/04_Estimating_model_parameters.ipynb]
  • Subsequent Passes:
    • Blocking: Each subsequent pass typically blocks on a different column or set of columns. This helps estimate parameters for comparisons that were not possible to estimate in previous passes. [Source: tutorials/04_Estimating_model_parameters.ipynb]

Example:

  • In a first pass, we might block on dob, estimating parameters for first_name and surname. In a second pass, we could block on first_name and surname, estimating parameters for dob, city, and email. [Source: tutorials/04_Estimating_model_parameters.ipynb]
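The division of labor between passes follows directly from the blocking rule: a pass can only estimate parameters for comparisons it does not block on. A minimal sketch (the helper name and column list are illustrative, not Splink's API):

```python
# Hypothetical helper: the comparisons a given EM pass can train are those
# NOT fixed by that pass's blocking rule.
all_comparisons = ["first_name", "surname", "dob", "city", "email"]

def trainable_in_pass(blocking_columns):
    return [c for c in all_comparisons if c not in blocking_columns]

print(trainable_in_pass(["dob"]))
# → ['first_name', 'surname', 'city', 'email']   (first pass: block on dob)
print(trainable_in_pass(["first_name", "surname"]))
# → ['dob', 'city', 'email']                     (second pass)
```

Taken together, the two passes cover every comparison, which is why Splink typically needs more than one estimation pass to fully train a model.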

Important Considerations:

  • Training Data Quality: The quality of the training data is crucial for accurate parameter estimation.
  • Blocking Rules: The choice of blocking rules significantly affects the efficiency and accuracy of the training process.
  • Comparison Levels: Each comparison is split into multiple levels, typically defined by similarity thresholds. For example, Jaro_winkler_similarity Username >= 0.88 is one level of the email comparison.
  • Missing Estimates: If a comparison level is not observed in the training data, the m_probability for that level cannot be trained. Splink will issue a warning and use default values for these missing estimates. [Source: tutorials/04_Estimating_model_parameters.ipynb]
  • Model Evaluation: Once the model is trained, it can be evaluated using a labeled dataset with known matches. [Source: tutorials/07_Quality_assurance.ipynb]
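The "missing estimates" behavior above (warn, then fall back to a default) can be sketched as follows. The variable names and the default value are illustrative assumptions, not Splink's internals:

```python
# Hypothetical sketch of the fallback described above: comparison levels
# never observed during training get a default m probability plus a warning.
import warnings

trained_m = {"dob": 0.93, "city": 0.61, "email": None}  # None = never observed
DEFAULT_M = 0.5  # illustrative default, not Splink's actual value

final_m = {}
for comparison, m in trained_m.items():
    if m is None:
        warnings.warn(f"m probability not trained for {comparison}; "
                      f"using default {DEFAULT_M}")
        final_m[comparison] = DEFAULT_M
    else:
        final_m[comparison] = m
```

In practice, the fix for such warnings is usually to add another estimation pass whose blocking rule lets the missing comparison level be observed.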

Model Training Output:

Splink provides informative output messages during the training process, including:

  • Convergence Information: the number of EM iterations and the largest change in parameter estimates at each iteration.
  • Trained Parameters: List of m_probability values that were trained.
  • Missing Estimates: List of m_probability values that could not be trained due to missing observations.

Example Output Messages:

  • “EM converged after 6 iterations”
  • “m probability not trained for email - Jaro_winkler_similarity Username >= 0.88 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.”
  • “Your model is not yet fully trained. Missing estimates for:\n - first_name (no m values are trained).” [Source: tutorials/04_Estimating_model_parameters.ipynb]

Model Evaluation:

  • Splink includes tools for visualizing model predictions. This helps assess model performance and identify potential issues. [Source: tutorials/06_Visualising_predictions.ipynb]
  • Accuracy analysis can be performed using labeled datasets to evaluate model performance metrics. [Source: tutorials/07_Quality_assurance.ipynb]
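As a toy illustration of accuracy analysis against a labeled dataset (the pairs below are made up, and this is standard precision/recall arithmetic rather than Splink's evaluation code):

```python
# Toy sketch: score predicted matches against known labels.
# Each pair is (record_id_l, record_id_r); all data here is hypothetical.
predicted = {(1, 2), (3, 4), (5, 6)}
actual = {(1, 2), (3, 4), (7, 8)}

tp = len(predicted & actual)   # true positives
fp = len(predicted - actual)   # false positives
fn = len(actual - predicted)   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # 2/3 each for this toy data
```

Splink's quality-assurance tooling builds metrics like these (and threshold-dependent variants) from a labeled dataset, rather than requiring you to compute them by hand.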

Note: This outline is based on information from the provided code files and URLs.

Top-Level Directory Explanations

demo_settings/ - This directory may contain settings or configurations for various demos in the project.

examples/ - This directory likely contains examples or sample code for using the project’s components.

examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.

examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.

tutorials/ - This directory may contain tutorials or guides for using the project.