Model Training & Evaluation
Model training and evaluation in Splink uses an expectation-maximization (EM) algorithm to estimate model parameters, which describe how likely each comparison outcome is when two records are a true match. These estimates are refined over a series of iterative steps.
How It Works:
- Blocking: The training data is divided into smaller blocks based on blocking rules. This reduces the number of comparisons needed during the estimation process.
- EM Algorithm: Using the blocked record pairs, the EM algorithm estimates the probability of a comparison resulting in a match (`m_probability`) and the probability of two random records being a match (`probability_two_random_records_match`).
- Iterative Refinement: The EM algorithm iteratively refines these probabilities until they converge to a stable point.
- Convergence Criteria: The algorithm stops when the change in parameter estimates falls below a certain threshold.
- Model Training: The `m_probability` values for each comparison level are used to train the model.
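The loop described above can be sketched in plain Python. This is a toy illustration, not Splink's implementation: a Fellegi-Sunter style model with two columns, binary agree/disagree comparisons, and conditional independence between columns. All data, starting values, and thresholds here are invented for illustration.

```python
def em_estimate(vectors, m, u, lam, tol=1e-8, max_iter=500):
    """vectors: list of agreement tuples, one 0/1 flag per column.
    m[j] plays the role of m_probability for column j; lam plays the role
    of probability_two_random_records_match."""
    n, k = len(vectors), len(m)
    for it in range(1, max_iter + 1):
        # E-step: posterior match probability for each record pair.
        post = []
        for v in vectors:
            pm, pu = lam, 1.0 - lam
            for j, a in enumerate(v):
                pm *= m[j] if a else 1.0 - m[j]
                pu *= u[j] if a else 1.0 - u[j]
            post.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the posteriors.
        total = sum(post)
        new_lam = total / n
        new_m = [sum(p * v[j] for p, v in zip(post, vectors)) / total
                 for j in range(k)]
        new_u = [sum((1 - p) * v[j] for p, v in zip(post, vectors)) / (n - total)
                 for j in range(k)]
        # Convergence criterion: largest absolute change in any parameter.
        delta = max(abs(x - y) for x, y in
                    zip(new_m + new_u + [new_lam], m + u + [lam]))
        m, u, lam = new_m, new_u, new_lam
        if delta < tol:
            break
    return m, u, lam, it

# Deterministic toy data: counts (per 1000 pairs) of each agreement pattern.
patterns = {(1, 1): 170, (1, 0): 90, (0, 1): 90, (0, 0): 650}
vectors = [v for v, count in patterns.items() for _ in range(count)]

m, u, lam, iters = em_estimate(vectors, m=[0.7, 0.7], u=[0.3, 0.3], lam=0.5)
print(f"finished after {iters} EM iterations")
print("m =", [round(x, 2) for x in m], "u =", [round(x, 2) for x in u],
      "lam =", round(lam, 2))
```

Because matching pairs agree far more often than non-matching pairs in this data, the estimated `m` values end up well above the `u` values, and the loop stops once no parameter moves by more than `tol` between iterations.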
Parameter Estimation Passes:
Splink can perform multiple passes of parameter estimation. Each pass focuses on estimating parameters for a subset of columns, based on the blocking rules.
- First Pass:
  - Blocking: The blocking rule used in this pass can be specified by the user, for example `"l.first_name = r.first_name and l.surname = r.surname"`.
  - Parameter Estimation: Estimates are made for comparisons that are not used in the blocking rule. For instance, if the blocking rule uses `first_name` and `surname`, the parameter estimates will be for columns like `dob`, `city`, and `email`. [Source: tutorials/04_Estimating_model_parameters.ipynb]
- Subsequent Passes:
- Blocking: Each subsequent pass typically blocks on a different column or set of columns. This helps estimate parameters for comparisons that were not possible to estimate in previous passes. [Source: tutorials/04_Estimating_model_parameters.ipynb]
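The effect of a blocking rule can be sketched in plain Python (this is not Splink's API; the records and column names are invented). Candidate pairs are generated only within groups of records that agree exactly on the blocking key, so far fewer comparisons are made than in the full cross product:

```python
from itertools import combinations
from collections import defaultdict

records = [
    {"id": 1, "first_name": "ann", "surname": "lee",   "dob": "1990-01-01"},
    {"id": 2, "first_name": "ann", "surname": "lee",   "dob": "1990-01-01"},
    {"id": 3, "first_name": "bob", "surname": "lee",   "dob": "1985-05-05"},
    {"id": 4, "first_name": "ann", "surname": "smith", "dob": "1990-01-01"},
]

def blocked_pairs(records, key_cols):
    """Yield candidate pairs that agree exactly on every key column,
    mimicking a rule like 'l.first_name = r.first_name and l.surname = r.surname'."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[tuple(rec[c] for c in key_cols)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

all_pairs = list(combinations(records, 2))                            # no blocking
name_pairs = list(blocked_pairs(records, ["first_name", "surname"]))  # pass 1
dob_pairs = list(blocked_pairs(records, ["dob"]))                     # pass 2
print(len(all_pairs), len(name_pairs), len(dob_pairs))  # prints: 6 1 3
```

Each blocking rule admits a different subset of pairs, which is why each estimation pass can train parameters for a different set of columns.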
Example:
- In a first pass, we might block on `dob`, estimating parameters for `first_name` and `surname`. In a second pass, we could block on `first_name` and `surname`, estimating parameters for `dob`, `city`, and `email`. [Source: tutorials/04_Estimating_model_parameters.ipynb]
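The bookkeeping behind this two-pass example can be written out as a small sketch (plain Python, not Splink's API): each pass estimates parameters for every comparison column not used in that pass's blocking rule, and together the passes cover all columns.

```python
comparison_cols = {"first_name", "surname", "dob", "city", "email"}
passes = [{"dob"}, {"first_name", "surname"}]  # blocking columns per pass

# Columns whose parameters each pass can estimate.
estimated_per_pass = [sorted(comparison_cols - blocked) for blocked in passes]
print(estimated_per_pass)

# Check that every comparison column is estimated in at least one pass.
covered = set().union(*(comparison_cols - blocked for blocked in passes))
print(covered == comparison_cols)  # prints: True
```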
Important Considerations:
- Training Data Quality: The quality of the training data is crucial for accurate parameter estimation.
- Blocking Rules: The choice of blocking rules significantly affects the efficiency and accuracy of the training process.
- Comparison Levels: Each comparison has multiple levels based on the similarity threshold used for comparison. For example, `Jaro_winkler_similarity Username >= 0.88` is one level of comparison for the `email` column.
- Missing Estimates: If a comparison level is not observed in the training data, the `m_probability` for that level cannot be trained. Splink will issue a warning and use default values for these missing estimates. [Source: tutorials/04_Estimating_model_parameters.ipynb]
- Model Evaluation: Once the model is trained, it can be evaluated using a labeled dataset with known matches. [Source: tutorials/07_Quality_assurance.ipynb]
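The interplay of comparison levels and missing estimates can be sketched in plain Python (this is not Splink's internals; the thresholds, scores, and fallback values are assumptions for illustration). A similarity score is mapped to a comparison vector value, and any level never observed in training falls back to a default:

```python
def comparison_vector_value(similarity, thresholds):
    """Return the highest level whose threshold the score meets (0 = none).
    E.g. with thresholds (0.88, 0.95): 0.91 -> level 1, 0.97 -> level 2."""
    level = 0
    for i, t in enumerate(sorted(thresholds), start=1):
        if similarity >= t:
            level = i
    return level

thresholds = (0.88, 0.95)
training_scores = [0.96, 0.40, 0.97]  # toy data: level 1 never occurs
observed_levels = {comparison_vector_value(s, thresholds) for s in training_scores}

m_probability = {}
warnings = []
for level in range(len(thresholds) + 1):
    if level in observed_levels:
        m_probability[level] = 0.5  # placeholder for an EM-trained estimate
    else:
        warnings.append(f"m probability not trained for level {level}")
        m_probability[level] = 0.1  # assumed default fallback value
print(warnings)
```

Here level 1 is never seen in the training data, so its `m` value cannot be estimated and a warning is recorded, mirroring the behaviour described above.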
Model Training Output:
Splink provides informative output messages during the training process, including:
- Convergence Information: Number of EM iterations, largest change in parameter estimates during each iteration.
- Trained Parameters: List of `m_probability` values that were trained.
- Missing Estimates: List of `m_probability` values that could not be trained due to missing observations.
Example Output Messages:
- “EM converged after 6 iterations”
- “m probability not trained for email - Jaro_winkler_similarity Username >= 0.88 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.”
- “Your model is not yet fully trained. Missing estimates for:\n - first_name (no m values are trained).” [Source: tutorials/04_Estimating_model_parameters.ipynb]
Model Evaluation:
- Splink includes tools for visualizing model predictions. This helps assess model performance and identify potential issues. [Source: tutorials/06_Visualising_predictions.ipynb]
- Accuracy analysis can be performed using labeled datasets to evaluate model performance metrics. [Source: tutorials/07_Quality_assurance.ipynb]
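A minimal sketch of accuracy analysis against a labeled dataset (plain Python, not Splink's evaluation tooling): predicted match probabilities are thresholded into decisions and compared with the known labels. The 0.5 threshold and all the numbers are invented for this toy.

```python
labels = [1, 1, 0, 0, 1, 0]                    # known match (1) / non-match (0)
probs = [0.95, 0.40, 0.10, 0.70, 0.85, 0.05]   # model's predicted probabilities

preds = [1 if p >= 0.5 else 0 for p in probs]  # assumed decision threshold
tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
precision = tp / (tp + fp)   # of predicted matches, how many were real
recall = tp / (tp + fn)      # of real matches, how many were found
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Sweeping the threshold instead of fixing it at 0.5 gives the precision/recall trade-off curves typically used in quality assurance.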
Note: This outline is based on information from the provided code files and URLs.
Top-Level Directory Explanations
demo_settings/ - This directory may contain settings or configurations for various demos in the project.
examples/ - This directory likely contains examples or sample code for using the project’s components.
examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.
examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.
tutorials/ - This directory may contain tutorials or guides for using the project.