Workflows and Best Practices - moj-analytical-services/splink_demos

This guide provides an overview of the workflows and best practices for the splink_demos project, which is hosted on GitHub. The project is designed to be simple, consistent, and repeatable, following the KISS principle. The main dependencies include pyspark, splink, pyarrow, scikit-learn, and duckdb.

Project Setup

To get started with the project, follow these steps:

  1. Download and unzip the project.
  2. Add products to the project’s installs directory.
  3. Run init.sh (for Unix) or init.bat (for Windows) to install the project.

The project template structure includes:

  • docs/: project documentation and screenshots.
  • notebooks/: Jupyter notebooks for demos and examples.
  • scripts/: utility scripts and code snippets.
  • data/: sample datasets for demos and examples.
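A quick way to confirm that a checkout matches the layout above is to check for the four directories. The following is a minimal sketch, not part of the project itself: the helper name missing_dirs and the constant EXPECTED_DIRS are hypothetical, and the script assumes it is run from the project root.

```python
from pathlib import Path

# Directory names taken from the template structure listed above.
EXPECTED_DIRS = ("docs", "notebooks", "scripts", "data")


def missing_dirs(project_root):
    """Return the expected top-level directories absent from project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the current working directory is the project root.
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Project layout looks complete.")
```

An empty result means all four directories are present.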

Jupyter Notebooks and Interactive Widgets

The splink_demos project makes extensive use of Jupyter Notebooks and JupyterLab for interactive data exploration and analysis. The ipywidgets library is used to create interactive widgets, which facilitate user input and customization.

Here’s an example of using ipywidgets to create a simple slider:

import ipywidgets as widgets
from IPython.display import display

slider = widgets.FloatSlider(value=7.5, min=0.1, max=10.0, step=0.1)
display(slider)
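To act on the user's input rather than only display the control, a handler can be attached with the widget's observe method. The sketch below assumes the same slider settings as above; the handler on_value_change and the list selected_values are hypothetical names used only for illustration.

```python
import ipywidgets as widgets

slider = widgets.FloatSlider(value=7.5, min=0.1, max=10.0, step=0.1)

# Hypothetical store for the values the user selects.
selected_values = []


def on_value_change(change):
    # `change` is a dict-like object; "new" holds the updated slider value.
    selected_values.append(change["new"])


# Call the handler only when the `value` trait changes.
slider.observe(on_value_change, names="value")
```

In a notebook, calling display(slider) renders the control, and each movement of the slider appends the new value to selected_values.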

Testing and Continuous Integration

The project uses pytest for testing, with the nbmake plugin executing the Jupyter notebooks as part of the continuous integration (CI) process. The CI workflow is triggered on every push to the main branch and checks that the code and examples run as expected.

To run tests and build notebooks locally, use the following commands:

pytest
pytest --nbmake notebooks/
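nbmake runs as a pytest plugin, enabled with the --nbmake flag, so the notebook check can also be scripted. The sketch below builds such a command for every notebook under a directory; the function name nbmake_command is hypothetical, and the default notebooks directory is an assumption based on the project structure described earlier.

```python
import sys
from pathlib import Path


def nbmake_command(notebook_dir="notebooks"):
    """Build a pytest command that runs every notebook under notebook_dir."""
    notebooks = sorted(
        str(path)
        for path in Path(notebook_dir).rglob("*.ipynb")
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" not in path.parts
    )
    # Use the current interpreter so the right environment's pytest is picked up.
    return [sys.executable, "-m", "pytest", "--nbmake", *notebooks]
```

The returned list can be passed to subprocess.run(command, check=True). If no notebooks are found, the list contains no file arguments, so it is worth checking for that case before running it.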

GitOps Workflow

The splink_demos project follows a GitOps workflow for managing infrastructure and deployments. In this approach, Git serves as the single source of truth for both code and infrastructure configuration.

The project includes an example GitOps workflow using Argo CD and Linkerd. This workflow demonstrates how to securely generate and manage Linkerd’s mTLS private keys and certificates using Sealed Secrets and cert-manager. It also shows how to integrate the auto proxy injection feature into the workflow.

To learn more about the GitOps workflow with Linkerd and Argo CD, refer to the official Linkerd documentation.

Browser Testing and WebdriverIO

The project uses WebdriverIO for browser-based integration tests. This allows quick sanity checks of UI changes in a fast-moving development environment.

Here’s an example of using WebdriverIO to open a webpage and take a screenshot:

const { remote } = require('webdriverio');

// Connection options for a WebDriver server, assumed to be running locally.
const opts = {
  path: '/wd/hub',
  port: 4444,
  capabilities: {
    browserName: 'chrome'
  }
};

(async () => {
  const browser = await remote(opts);
  try {
    // Placeholder URL; replace it with the page under test.
    await browser.url('https://example.com');
    console.log(`Current URL is ${await browser.getUrl()}`);
    // Save the screenshot to a file in the working directory.
    await browser.saveScreenshot('./screenshot.png');
    console.log('Screenshot taken');
  } catch (err) {
    console.error(err);
  } finally {
    // Always close the browser session, even after an error.
    await browser.deleteSession();
  }
})();

Best Practices

  • Keep it simple: focus on explicit, clear, well-defined, bounded, understandable, and introspectable behavior.
  • Minimize resource requirements: Linkerd should impose as minimal a performance and resource cost as possible.
  • Just work: Linkerd should not break existing applications, nor should it require complex configuration to get started or to do something simple.
  • Use a GitOps workflow for managing infrastructure and deployments.
  • Use WebdriverIO browser tests for integration testing and for sanity-checking UI changes.

Resources