Learn using Shoulder.dev

Shoulder.dev transforms codebases into tailored learning experiences. The categories below organize the codebase by topic to help you choose an initial focus.

Record Linkage & Deduplication

Reason: Understanding the core concepts of record linkage and deduplication, including techniques like blocking, comparison, and clustering, which form the foundation of Splink. Example: Analyzing the “deduplicate_50k_synthetic.ipynb” notebook within the “duckdb” directory.
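
To make the three terms concrete, here is a minimal standard-library sketch of blocking, comparison, and clustering on a toy dataset. It is not Splink's implementation: the records, field names, the 0.8 threshold, and the helper functions are all hypothetical, and the string-similarity score stands in for the probabilistic scoring a real linkage model would use.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"id": 1, "first_name": "john", "surname": "smith", "postcode": "AB1"},
    {"id": 2, "first_name": "jon", "surname": "smith", "postcode": "AB1"},
    {"id": 3, "first_name": "jane", "surname": "doe", "postcode": "CD2"},
    {"id": 4, "first_name": "john", "surname": "smyth", "postcode": "AB1"},
]


def block(rows, key):
    """Blocking: yield only pairs of records that share the blocking key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    for group in groups.values():
        yield from combinations(group, 2)


def similarity(a, b):
    """Comparison: average string similarity over the name fields."""
    fields = ("first_name", "surname")
    scores = [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]
    return sum(scores) / len(scores)


def cluster(ids, links):
    """Clustering: connected components of the link graph via union-find."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Keep pairs whose similarity clears an arbitrary, illustrative threshold.
links = [
    (a["id"], b["id"])
    for a, b in block(records, "postcode")
    if similarity(a, b) >= 0.8
]
result = cluster([r["id"] for r in records], links)
print(result)
```

Blocking on postcode means record 3 is never compared with the others, which is the point of the technique: it trades a little recall for far fewer pairwise comparisons.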

Splink Library

Reason: Learning the functionalities and API of the Splink library, including its specific features for record linkage and deduplication. Example: Exploring the “splink” package in “requirements.txt” and examining how it’s used within various notebooks.

Data Manipulation & Analysis

Reason: Understanding how to prepare, manipulate, and analyze datasets used for record linkage. This includes tasks like data cleaning, transformation, and feature engineering. Example: Studying the “02_Exploratory_analysis.ipynb” notebook in the “tutorials” directory.
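
As a rough illustration of the cleaning step, the sketch below standardizes a free-text name field before linkage. The function name `clean_name` and the specific rules (strip accents, lowercase, drop punctuation, collapse whitespace, treat empty strings as missing) are assumptions chosen for the example, not rules taken from the notebooks.

```python
import re
import unicodedata


def clean_name(value):
    """Standardize a name string; return None for missing or empty input."""
    if value is None:
        return None
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and keep only ASCII letters and whitespace.
    text = re.sub(r"[^a-z\s]", "", text.lower())
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


print(clean_name("  José  O'Neil "))
print(clean_name("   "))
```

Mapping empty strings to `None` matters in practice because a linkage model should treat "unknown" differently from "known and different".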

Data Formats & Storage

Reason: Understanding the various data formats used in the codebase (e.g., CSV, Parquet) and their implications for data storage and processing. Example: Examining the data files within the “data” directory and analyzing how they’re loaded and processed in notebooks.

Data Visualization

Reason: Developing skills to visualize data effectively, particularly for tasks like exploring relationships between records and visualizing clustering results. Example: Analyzing the various HTML dashboards generated by the “duckdb” examples, such as “50k_cluster.html” or “comparison_viewer_transactions.html”.

Database & Data Source Integration

Reason: Understanding how Splink integrates with various data sources and databases like DuckDB, SQLite, and Spark. Example: Studying the “duckdb” and “spark” directories and observing how Splink is configured for different data sources.
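
One way to see why the choice of backend matters is that candidate pairs can be generated inside the database with a SQL self-join. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table name, columns, and blocking rule are hypothetical, and this is not how any particular Splink backend is configured.

```python
import sqlite3

# In-memory database with a hypothetical "people" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (unique_id INTEGER, first_name TEXT, dob TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "john", "1990-01-01"),
        (2, "jon", "1990-01-01"),
        (3, "jane", "1985-05-05"),
    ],
)

# Self-join on the blocking key (dob); the id inequality avoids
# self-pairs and keeps each pair only once.
pairs = conn.execute(
    """
    SELECT l.unique_id, r.unique_id
    FROM people AS l
    JOIN people AS r
      ON l.dob = r.dob
     AND l.unique_id < r.unique_id
    ORDER BY l.unique_id, r.unique_id
    """
).fetchall()
conn.close()

print(pairs)
```

Because the pairing is expressed as SQL, the same idea carries over to other engines, with differences mainly in dialect and scale.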

Jupyter Notebook Development

Reason: Developing proficiency with Jupyter notebooks for data exploration, analysis, and code execution. Example: Working through the interactive notebooks in the “tutorials” and “examples” directories.

Model Training & Evaluation

Reason: Understanding how to train record linkage models, assess their performance, and refine model parameters for optimal results. Example: Studying the “04_Estimating_model_parameters.ipynb” notebook in the “tutorials” directory.
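
The parameters being estimated are, in the Fellegi-Sunter framework that Splink builds on, the m and u probabilities for each comparison level. The sketch below shows how they combine into a match weight and a match probability; the function names and the numeric values are illustrative assumptions, not output from the notebook.

```python
import math


def match_weight(m, u):
    """Match weight for one comparison level: log2 of the Bayes factor m/u.

    m: probability of observing this level among true matches.
    u: probability of observing this level among non-matches.
    """
    return math.log2(m / u)


def match_probability(prior, bayes_factors):
    """Combine a prior match probability with per-comparison Bayes factors."""
    odds = prior / (1 - prior)
    for bf in bayes_factors:
        odds *= bf
    return odds / (1 + odds)


# Illustrative values only: a level seen in 90% of matches and 10% of non-matches.
m, u = 0.9, 0.1
weight = match_weight(m, u)
probability = match_probability(prior=0.1, bayes_factors=[m / u])
print(round(weight, 2), round(probability, 2))
```

A positive weight is evidence for a match and a negative weight is evidence against, so refining the model amounts to getting better estimates of m and u for each level.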

Model Deployment & Pipelining

Reason: Learning how to deploy trained models for real-world applications and develop automated pipelines for record linkage processes. Example: Examining the “demo_settings” directory and exploring the “real_time_record_linkage.ipynb” notebook.

Testing & Debugging

Reason: Acquiring skills to write and execute tests for record linkage models, diagnose issues, and troubleshoot code. Example: Studying how unit tests could be implemented for specific components of Splink or examining the “recreate_venv.sh” script for potential testing and debugging practices.
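
As one possible starting point, a unit test for a small comparison helper might look like the sketch below. The helper `exact_match_level` and its return convention (1 for equal, 0 for different, -1 when either value is missing) are hypothetical and are not part of Splink or the demo repository.

```python
import unittest


def exact_match_level(left, right):
    """Hypothetical helper: -1 if either value is missing, else 1 or 0."""
    if left is None or right is None:
        return -1
    return 1 if left == right else 0


class ExactMatchLevelTest(unittest.TestCase):
    def test_equal_values(self):
        self.assertEqual(exact_match_level("smith", "smith"), 1)

    def test_different_values(self):
        self.assertEqual(exact_match_level("smith", "smyth"), 0)

    def test_missing_value(self):
        self.assertEqual(exact_match_level(None, "smith"), -1)


# Run the tests programmatically so the sketch also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExactMatchLevelTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Small, deterministic helpers like this are the easiest place to begin, since end-to-end linkage results depend on estimated parameters and are harder to assert on exactly.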

CI/CD

Reason: Understanding concepts of continuous integration and continuous delivery (CI/CD) for efficient development and deployment of record linkage solutions. Example: Investigating how CI/CD processes could be implemented for the “splink_demos” project, potentially using tools like GitHub Actions.

Security

Reason: Learning how to secure data and code within the record linkage process, considering aspects like data access control and preventing data breaches. Example: Investigating the “requirements.txt” file for potential security-related dependencies, or considering additional security measures for handling sensitive data.

Software Version Control

Reason: Understanding Git commands for version control, enabling collaboration, tracking changes, and managing different code versions. Example: Analyzing the project’s commit history on GitHub, reviewing the “README.md” instructions for cloning and working with the repository.

Documentation & Communication

Reason: Learning how to effectively document code and explain its functionality, ensuring clear communication and maintainability. Example: Reviewing the “README.md” file, the “tutorials” directory, and the “scv.html” file for examples of documentation and communication within the project.
