Data Visualization
This outline covers the data visualization aspects of the Splink project, with a focus on the visualizations used in the tutorials and examples.
Visualization Techniques
Splink utilizes a variety of visualization techniques, including:
- Bar Charts: Used to display value counts and frequency distributions, these charts help show the data’s shape and identify potential outliers. Example: tutorials/02_Exploratory_analysis.ipynb, where a bar chart shows the “Top 10 values by value count.”
- Heatmaps: Used to visualize the missingness of data, these maps help identify columns with a high proportion of null values. Example: tutorials/02_Exploratory_analysis.ipynb, where a heatmap depicts the “Missingness” of columns.
- Stacked Bar Charts: Representing the distribution of comparison vector values across different columns, these charts help analyze how different columns contribute to the overall linkage process. Example: tutorials/06_Visualising_predictions.ipynb, where a stacked bar chart shows the distribution of “comparison vector value” across different columns.
- Text Overlays: These overlays are used on bar charts to provide additional information for each bar, like the specific values being compared. Example: tutorials/06_Visualising_predictions.ipynb, where text overlays show the values being compared for each column.
- Interactive Charts: Enabled by libraries like Altair, interactive features such as brushing and filtering let users explore data dynamically. Example: scv.html, which demonstrates interactive brushing and filtering to analyze comparison vector distributions and probabilities.
Visualization Examples
Exploratory Analysis
In the tutorials/02_Exploratory_analysis.ipynb notebook, the following visualizations are used:
- Value Counts: Bar charts show the value counts for different columns. Example: Top 10 values by value count in the dob column, Bottom 5 values by value count in the surname column.
- Missingness: Heatmaps visualize the proportion of null values in each column.
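The two summaries above can be computed before any charting takes place. The following is a minimal sketch using only the Python standard library, not the code the notebook itself runs. The records, column values, and the helper names `top_value_counts`, `missingness`, and `bar_chart_spec` are hypothetical. Each chart is returned as a Vega-Lite specification in a plain dictionary, which is the JSON format that Altair charts compile to. For brevity, missingness is drawn as bars here rather than as the heatmap described above.

```python
from collections import Counter

# Hypothetical records standing in for the tutorial's input data.
records = [
    {"first_name": "amy", "surname": "smith", "dob": "1990-01-01"},
    {"first_name": "amy", "surname": None, "dob": "1990-01-01"},
    {"first_name": "bob", "surname": "jones", "dob": None},
    {"first_name": None, "surname": "smith", "dob": "1985-06-30"},
]


def top_value_counts(rows, column, n=10):
    """Return the n most common non-null values in a column with their counts."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return [{"value": v, "count": c} for v, c in counts.most_common(n)]


def missingness(rows):
    """Return the proportion of null values in each column."""
    return [
        {
            "column": col,
            "null_proportion": sum(r[col] is None for r in rows) / len(rows),
        }
        for col in rows[0].keys()
    ]


def bar_chart_spec(values, x, y, title):
    """Build a Vega-Lite bar chart specification as a plain dictionary."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "title": title,
        "data": {"values": values},
        "mark": "bar",
        "encoding": {
            "x": {"field": x, "type": "nominal", "sort": "-y"},
            "y": {"field": y, "type": "quantitative"},
        },
    }


dob_counts = top_value_counts(records, "dob", n=10)
dob_chart = bar_chart_spec(dob_counts, "value", "count", "Top 10 values by value count")
null_chart = bar_chart_spec(
    missingness(records), "column", "null_proportion", "Missingness"
)
```

The resulting dictionaries can be serialised with `json.dumps` and rendered by any Vega-Lite viewer.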
Visualizing Predictions
The tutorials/06_Visualising_predictions.ipynb notebook provides a detailed view of how Splink’s prediction results can be visualized.
- Comparison Vector Values: Stacked bar charts show the distribution of comparison vector values across different columns.
- Text Overlays: Overlays are added to each bar to display specific values being compared.
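As a rough illustration of the idea, the sketch below tallies comparison vector values per column and expresses the result as a layered Vega-Lite specification: one bar layer and one text layer sharing the same stacked encoding. It is an assumption-laden example, not the notebook's code. The prediction rows are hypothetical, the `gamma_` column prefix is an assumption about the prediction output format, and the helper `gamma_distribution` is invented for this example. The text layer shows the count in each segment rather than the values being compared.

```python
from collections import Counter

# Hypothetical prediction rows; the gamma_ prefix for comparison vector
# columns is an assumption about the prediction output format.
predictions = [
    {"gamma_first_name": 2, "gamma_surname": 1, "gamma_dob": 0},
    {"gamma_first_name": 2, "gamma_surname": 0, "gamma_dob": 0},
    {"gamma_first_name": 0, "gamma_surname": 1, "gamma_dob": 1},
]


def gamma_distribution(rows, prefix="gamma_"):
    """Count how often each comparison vector value occurs in each column."""
    counts = Counter(
        (name[len(prefix):], value)
        for row in rows
        for name, value in row.items()
        if name.startswith(prefix)
    )
    return [
        {"column": col, "gamma": gamma, "count": n}
        for (col, gamma), n in sorted(counts.items())
    ]


distribution = gamma_distribution(predictions)

stacked_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": distribution},
    # Encoding shared by both layers: one stacked bar per column.
    "encoding": {
        "x": {"field": "column", "type": "nominal"},
        "y": {"field": "count", "type": "quantitative", "stack": "zero"},
        "order": {"field": "gamma", "type": "ordinal"},
    },
    "layer": [
        # Bar segments coloured by comparison vector value.
        {"mark": "bar", "encoding": {"color": {"field": "gamma", "type": "ordinal"}}},
        # Text overlay labelling each segment with its count.
        {
            "mark": {"type": "text", "color": "white", "dy": 10},
            "encoding": {
                "detail": {"field": "gamma", "type": "ordinal"},
                "text": {"field": "count", "type": "quantitative"},
            },
        },
    ],
}
```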
Comparison Vector Visualization (SCV)
scv.html uses interactive charts:
- Comparison Vector Frequency: A bar chart with brushing and filtering capabilities shows the frequency of different comparison vector values.
- Comparison Vector Probability: An interactive heatmap displays the match probabilities for different comparison vector values, color-coded for easy visualization.
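The brushing-and-filtering pattern described above can be sketched as a Vega-Lite specification with two vertically concatenated views: an interval selection defined on the frequency chart filters the rows shown in the probability heatmap. This is an illustrative sketch, not a reconstruction of scv.html. The summary rows, field names, and the parameter name `brush` are all hypothetical.

```python
import json

# Hypothetical summary rows: one per distinct comparison vector.
vectors = [
    {"comparison_vector": "2,1,0", "count": 120, "match_probability": 0.85},
    {"comparison_vector": "2,0,0", "count": 340, "match_probability": 0.40},
    {"comparison_vector": "0,1,1", "count": 75, "match_probability": 0.10},
]

# Interval selection (a brush) restricted to the x axis.
brush = {"name": "brush", "select": {"type": "interval", "encodings": ["x"]}}

# Bar chart of comparison vector frequency; the brush is drawn here.
frequency_view = {
    "params": [brush],
    "mark": "bar",
    "encoding": {
        "x": {"field": "comparison_vector", "type": "nominal"},
        "y": {"field": "count", "type": "quantitative"},
    },
}

# Heatmap of match probability, limited to the brushed comparison vectors.
probability_view = {
    "transform": [{"filter": {"param": "brush"}}],
    "mark": "rect",
    "encoding": {
        "x": {"field": "comparison_vector", "type": "nominal"},
        "color": {"field": "match_probability", "type": "quantitative"},
    },
}

scv_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": vectors},
    "vconcat": [frequency_view, probability_view],
}

# Serialise the specification so it can be embedded in an HTML page.
spec_json = json.dumps(scv_spec, indent=2)
```

Keeping the selection as a named parameter means that any further view can reuse the same brush by adding the same filter transform.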
Visualization Libraries
Splink utilizes various libraries for data visualization:
- Altair: A declarative statistical visualization library for creating interactive charts. It is used frequently in the tutorials and examples. Example: scv.html and tutorials/02_Exploratory_analysis.ipynb.
- Pandas: A data manipulation library that also offers plotting functionality. Example: tutorials/06_Visualising_predictions.ipynb utilizes Pandas to create stacked bar charts.
Project Context
These visualizations support three tasks in data linking: understanding data characteristics, exploring potential relationships, and interpreting prediction results. They give developers and analysts a view of the quality and consistency of the data, which can in turn help improve the accuracy of the linkage.
Top-Level Directory Explanations
examples/ - This directory likely contains examples or sample code for using the project’s components.
examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.
examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.
examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.
examples/sqlite/ - This subdirectory may contain examples using SQLite, a popular open-source database management system.
examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.
tutorials/ - This directory may contain tutorials or guides for using the project.