Model Deployment & Pipelining

This section outlines the design and implementation of model deployment and pipelining within the Splink project. The focus is on providing developers with a comprehensive understanding of how to deploy trained models and automate record linkage processes.

Deployment Options

The project offers various deployment options for trained Splink models, catering to different use-case requirements.

1. Saving Model to a JSON File:

  • Purpose: Enables model persistence and re-use in subsequent sessions or applications.
  • Implementation: Use the save_model_to_json method of the Linker object.
  • Example:
settings = linker.save_model_to_json("../demo_settings/saved_model_from_demo.json", overwrite=True)

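To reuse a saved model in a later session, the JSON file can be passed back in when the linker is constructed. A minimal sketch, assuming a Splink 3 DuckDBLinker (the import path varies across Splink versions) and a hypothetical input file:

import pandas as pd
from splink.duckdb.linker import DuckDBLinker

df = pd.read_csv("your_records.csv")  # hypothetical input file

# In Splink 3 the settings argument accepts a settings dictionary or a
# path to a settings JSON, so the saved model can be passed straight in
linker = DuckDBLinker(df, "../demo_settings/saved_model_from_demo.json")

# The restored model can score candidate pairs without retraining
df_predictions = linker.predict(threshold_match_probability=0.9)
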
2. Deploying Models for Real-time Record Linkage:

  • Purpose: Provides a framework for deploying models for real-time record linkage applications.
  • Implementation: Utilize the real_time_record_linkage.ipynb demo notebook.
  • Example: Refer to the real_time_record_linkage.ipynb notebook for a step-by-step implementation.
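
In outline, real-time linkage means scoring records on demand against an already-trained linker. A minimal sketch, assuming a trained Splink 3 linker; the record fields are illustrative:

# Score a single pair of records on demand
record_1 = {"unique_id": 1, "given_name": "lucas", "surname": "smith"}
record_2 = {"unique_id": 2, "given_name": "lukas", "surname": "smith"}

df_pair = linker.compare_two_records(record_1, record_2)
print(df_pair.as_pandas_dataframe())

# Or search for matches to an incoming record within the input dataset
df_matches = linker.find_matches_to_new_records([record_1])
print(df_matches.as_pandas_dataframe())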

Pipelining with DuckDB

The project leverages DuckDB, an in-process analytical database engine, for efficient pipelining and data manipulation.

1. Link Only Example:

  • Purpose: Demonstrates a streamlined record linkage workflow using DuckDB.
  • Implementation: The link_only.ipynb notebook provides a concise example.
  • Example:
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)
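
For the linkage itself, the key setting is a link_type of "link_only", with the linker constructed over both input datasets. A minimal sketch, assuming Splink 3; the blocking rule, comparisons, and table aliases are illustrative:

import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_only",  # link across datasets, no deduplication
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
    ],
    "comparisons": [
        cl.exact_match("given_name"),
        cl.exact_match("surname"),
    ],
}

# df_a and df_b are the two input DataFrames to be linked
linker = DuckDBLinker([df_a, df_b], settings, input_table_aliases=["a", "b"])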

2. FEBRL4 Example:

  • Purpose: Presents a more comprehensive example involving data exploration, model definition, and prediction generation.
  • Implementation: Explore the febrl4.ipynb notebook.
  • Example:
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_level_library as cll

# basic_settings and blocking_rules are defined earlier in the notebook
simple_model_settings = {
    **basic_settings,
    "blocking_rules_to_generate_predictions": blocking_rules,
    "comparisons": [
        cl.exact_match("given_name", term_frequency_adjustments=True),
        cl.exact_match("surname", term_frequency_adjustments=True),
        cl.exact_match("street_number", term_frequency_adjustments=True),
    ],
    "retain_intermediate_calculation_columns": True,
}
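
Once these settings are loaded into a linker and the model is trained, prediction generation follows the usual pattern. A minimal sketch, assuming a trained Splink 3 linker built from simple_model_settings:

# Generate pairwise predictions above a match probability threshold
df_predictions = linker.predict(threshold_match_probability=0.9)

# Inspect a sample of the scored pairs
print(df_predictions.as_pandas_dataframe(limit=10))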

3. Transactions Example:

  • Purpose: Illustrates how to handle transaction data effectively within the record linkage pipeline.
  • Implementation: Examine the transactions.ipynb notebook.
  • Example:
linker.estimate_u_using_random_sampling(max_pairs=1e6)
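
After u-estimation, a typical next step is one or more EM training passes blocked on transaction attributes. A minimal sketch; the column and blocking rule are hypothetical:

# Illustrative EM pass; "amount" is a hypothetical transaction column
linker.estimate_parameters_using_expectation_maximisation(
    "l.amount = r.amount"
)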

Model Training and Parameter Estimation

Splink employs the Expectation-Maximization (EM) algorithm for parameter estimation during model training.

1. Estimating Model Parameters:

  • Purpose: Demonstrates how model parameters are estimated, combining random sampling (for u probabilities) with EM training passes (for m probabilities).
  • Implementation: The 04_Estimating_model_parameters.ipynb notebook guides this process.
  • Example:
linker.estimate_u_using_random_sampling(max_pairs=5e6)
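
Random-sampling u-estimation is usually paired with an estimate of the prior probability that two random records match. A minimal sketch, assuming Splink 3; the deterministic rules and recall value are illustrative:

# Deterministic rules approximate the share of true matches; recall is
# the assumed fraction of matches these rules recover
deterministic_rules = [
    "l.given_name = r.given_name and l.surname = r.surname",
]
linker.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.7
)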

2. Parameter Estimation Passes:

  • Purpose: Highlights the use of multiple estimation passes to refine model parameters.
  • Implementation: The 04_Estimating_model_parameters.ipynb notebook includes examples of such passes.
  • Example (illustrative blocking rules for two EM passes):
linker.estimate_parameters_using_expectation_maximisation(
    "l.given_name = r.given_name and l.surname = r.surname"
)
linker.estimate_parameters_using_expectation_maximisation(
    "l.date_of_birth = r.date_of_birth"
)

Visualizing Predictions

The project provides tools for visualizing predictions, enhancing model understanding and confidence.

1. Visualizing Predictions Notebook:

  • Purpose: Provides a comprehensive guide to visualizing predictions with Splink.
  • Implementation: Refer to the 06_Visualising_predictions.ipynb notebook.
  • Example:
linker.m_u_parameters_chart()
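
For visualizing individual predictions, the waterfall chart breaks a pair's match weight down comparison by comparison. A minimal sketch, assuming a trained Splink 3 linker:

# Score pairs, then chart how each comparison contributes to the
# match weight of the first few predicted pairs
df_predictions = linker.predict()
records_to_view = df_predictions.as_record_dict(limit=5)
linker.waterfall_chart(records_to_view, filter_nulls=False)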

Conclusion

This outline has presented the core features and functionalities of model deployment and pipelining within the Splink project. By leveraging these tools and techniques, developers can effectively deploy trained models, automate record linkage processes, and build robust solutions for real-world applications.

Top-Level Directory Explanations

examples/ - This directory likely contains examples or sample code for using the project’s components.

examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.

examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.

examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.

examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.

examples/sqlite/ - This subdirectory may contain examples using SQLite, a popular open-source database management system.

examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.

tutorials/ - This directory may contain tutorials or guides for using the project.