Model Deployment & Pipelining
This section outlines the design and implementation of model deployment and pipelining within the Splink project. The focus is on providing developers with a comprehensive understanding of how to deploy trained models and automate record linkage processes.
Deployment Options
The project offers various deployment options for trained Splink models, catering to different use-case requirements.
1. Saving a Model to a JSON File:
- Purpose: Enables model persistence and reuse in subsequent sessions or applications.
- Implementation: Use the save_model_to_json method of the Linker object.
- Example:
settings = linker.save_model_to_json("../demo_settings/saved_model_from_demo.json", overwrite=True)
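Because save_model_to_json writes a plain JSON settings dictionary, the file can be reloaded with the standard library in a later session and handed to a new linker. A minimal sketch of that round trip, using a hypothetical stripped-down settings dictionary in place of a real trained model:

```python
import json
from pathlib import Path

# Hypothetical, stripped-down stand-in for the JSON that
# save_model_to_json writes (a real file also carries the trained
# m/u parameters on each comparison).
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
    "comparisons": [],
}

path = Path("saved_model_from_demo.json")
path.write_text(json.dumps(settings, indent=2))

# In a subsequent session: reload the dictionary and pass it to a new
# linker to reuse the trained model.
reloaded = json.loads(path.read_text())
assert reloaded == settings
path.unlink()  # remove the demo file
```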
2. Deploying Models for Real-time Record Linkage:
- Purpose: Provides a framework for deploying models for real-time record linkage applications.
- Implementation: Use the real_time_record_linkage.ipynb notebook within the demo_settings directory.
- Example: Refer to the real_time_record_linkage.ipynb notebook for a step-by-step implementation.
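As a hedged sketch of what real-time scoring looks like in Splink 3's API (assuming a linker that has already been trained as in the earlier examples), a single candidate pair can be scored on demand with compare_two_records, without building the full blocked comparison table:

```python
# Sketch only: `linker` is assumed to be an already-trained Splink 3 linker.
record_1 = {"unique_id": 1, "given_name": "john", "surname": "smith"}
record_2 = {"unique_id": 2, "given_name": "jon", "surname": "smith"}

# Scores one pair on demand; returns a Splink dataframe holding the
# match weight and match probability for this comparison.
df_scored = linker.compare_two_records(record_1, record_2)
df_scored.as_pandas_dataframe()
```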
Pipelining with DuckDB
The project leverages DuckDB as a robust database engine for efficient pipelining and data manipulation.
1. Link Only Example:
- Purpose: Demonstrates a streamlined record linkage workflow using DuckDB.
- Implementation: The link_only.ipynb notebook provides a concise example.
- Example:
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)
2. FEBRL4 Example:
- Purpose: Presents a more comprehensive example involving data exploration, model definition, and prediction generation.
- Implementation: Explore the febrl4.ipynb notebook.
- Example:
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_level_library as cll
simple_model_settings = {
**basic_settings,
"blocking_rules_to_generate_predictions": blocking_rules,
"comparisons": [
cl.exact_match("given_name", term_frequency_adjustments=True),
cl.exact_match("surname", term_frequency_adjustments=True),
cl.exact_match("street_number", term_frequency_adjustments=True),
],
"retain_intermediate_calculation_columns": True,
}
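To show where a settings dictionary like the one above fits, here is a hedged sketch of the remaining FEBRL4 flow (Splink 3 DuckDB API; dfs, basic_settings, and blocking_rules are assumed to be defined earlier in the notebook):

```python
# Sketch only: assumes `dfs` is a list of the two FEBRL4 dataframes.
from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(dfs, simple_model_settings, input_table_aliases=["df_a", "df_b"])

# Estimate u probabilities, then generate pairwise predictions above a
# chosen probability threshold.
linker.estimate_u_using_random_sampling(max_pairs=1e6)
df_predictions = linker.predict(threshold_match_probability=0.9)
```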
3. Transactions Example:
- Purpose: Illustrates how to handle transaction data effectively within the record linkage pipeline.
- Implementation: Examine the transactions.ipynb notebook.
- Example:
linker.estimate_u_using_random_sampling(max_pairs=1e6)
Model Training and Parameter Estimation
Splink employs the Expectation-Maximization (EM) algorithm to estimate model parameters during training.
1. Estimating Model Parameters:
- Purpose: Demonstrates the process of estimating model parameters using the EM algorithm.
- Implementation: The 04_Estimating_model_parameters.ipynb notebook guides this process.
- Example:
linker.estimate_u_using_random_sampling(max_pairs=5e6)
2. Parameter Estimation Passes:
- Purpose: Highlights the use of multiple estimation passes to refine model parameters.
- Implementation: The 04_Estimating_model_parameters.ipynb notebook includes examples of such passes.
- Example:
linker.estimate_u_using_random_sampling(max_pairs=5e6)
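Note that estimate_u_using_random_sampling estimates only the u probabilities; the EM passes themselves run separately, each blocked on a different rule so that every comparison's m probabilities are estimated in at least one pass. A hedged sketch (Splink 3 API, existing linker assumed, column names illustrative):

```python
# Sketch only: `linker` is an existing Splink 3 linker.
# u probabilities from random sampling...
linker.estimate_u_using_random_sampling(max_pairs=5e6)

# ...then m probabilities from EM, one pass per blocking rule. A column
# used for blocking in one pass has its own parameters estimated in the
# other pass, so between them all comparisons are trained.
linker.estimate_parameters_using_expectation_maximisation("l.given_name = r.given_name")
linker.estimate_parameters_using_expectation_maximisation("l.surname = r.surname")
```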
Visualizing Predictions
The project provides tools for visualizing predictions, enhancing model understanding and confidence.
1. Visualizing Predictions Notebook:
- Purpose: Provides a comprehensive guide to visualizing predictions with Splink.
- Implementation: Refer to the 06_Visualising_predictions.ipynb notebook.
- Example:
linker.m_u_parameters_chart()
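Beyond the parameter chart above, a common way to inspect individual predictions is a waterfall chart over a handful of scored pairs. A hedged sketch (Splink 3 API; df_predictions is assumed to come from an earlier linker.predict() call):

```python
# Sketch only: assumes `linker` and `df_predictions` already exist.
records_to_plot = df_predictions.as_record_dict(limit=5)

# Shows, comparison by comparison, how each pair's final match weight
# is built up from the individual comparison levels.
linker.waterfall_chart(records_to_plot)
```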
Conclusion
This section has presented the core features and functionalities of model deployment and pipelining within the Splink project. By leveraging these tools and techniques, developers can effectively deploy trained models, automate record linkage processes, and build robust solutions for real-world applications.
Top-Level Directory Explanations
examples/ - This directory likely contains examples or sample code for using the project’s components.
examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.
examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-memory analytic database.
examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.
examples/sqlite/ - This subdirectory may contain examples using SQLite, a popular open-source database management system.
examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.
tutorials/ - This directory may contain tutorials or guides for using the project.