Learn using Shoulder.dev
Shoulder.dev transforms codebases into tailored learning experiences. The categories below organize the codebase to help you choose where to focus first.
Record Linkage & Deduplication
Reason: Understanding the core concepts of record linkage and deduplication, including techniques like blocking, comparison, and clustering, which form the foundation of Splink. Example: Analyzing the “deduplicate_50k_synthetic.ipynb” notebook within the “duckdb” directory.
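The three techniques named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Splink's implementation: the toy records, the choice of "dob" as a blocking key, the string similarity measure, and the 0.6 threshold are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; not taken from the repository's data files.
records = [
    {"id": 1, "first_name": "john", "surname": "smith", "dob": "1980-01-02"},
    {"id": 2, "first_name": "jon", "surname": "smith", "dob": "1980-01-02"},
    {"id": 3, "first_name": "mary", "surname": "jones", "dob": "1975-06-30"},
    {"id": 4, "first_name": "marie", "surname": "jones", "dob": "1975-06-30"},
    {"id": 5, "first_name": "john", "surname": "smythe", "dob": "1991-11-11"},
]


def block(records, key):
    """Blocking: group records by a key so only records in the same group are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def similarity(a, b):
    """Comparison: score two records by the similarity of their first names (0.0 to 1.0)."""
    return SequenceMatcher(None, a["first_name"], b["first_name"]).ratio()


def cluster(records, threshold=0.6):
    """Clustering: merge records connected by a pair scoring at or above the threshold."""
    parent = {record["id"]: record["id"] for record in records}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in block(records, "dob").values():
        for a, b in combinations(group, 2):
            if similarity(a, b) >= threshold:
                parent[find(a["id"])] = find(b["id"])

    clusters = defaultdict(set)
    for record in records:
        clusters[find(record["id"])].add(record["id"])
    return sorted(sorted(members) for members in clusters.values())


print(cluster(records))
```

Blocking on "dob" means record 5 is never compared with records 1 and 2, even though it shares a first name with record 1. That trade-off between fewer comparisons and possible missed matches is the reason blocking rules need careful choice.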
Reason: Learning the functionalities and API of the Splink library, including its specific features for record linkage and deduplication. Example: Exploring the “splink” package in “requirements.txt” and examining how it’s used within various notebooks.
Reason: Understanding how to prepare, manipulate, and analyze datasets used for record linkage. This includes tasks like data cleaning, transformation, and feature engineering. Example: Studying the “02_Exploratory_analysis.ipynb” notebook in the “tutorials” directory.
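A small sketch of the kind of cleaning step this refers to, assuming records arrive as dictionaries of raw strings. The function names and the specific rules (trim, collapse whitespace, lowercase, treat empty strings as missing) are illustrative choices, not taken from the notebooks.

```python
def clean_value(value):
    """Normalize one raw value: collapse whitespace, lowercase, and map blanks to None."""
    if value is None:
        return None
    text = " ".join(str(value).split()).lower()
    # An empty string usually means "missing", so represent it as None
    # rather than letting two blank values compare as equal.
    return text or None


def clean_record(record):
    """Apply the same normalization to every field of a record."""
    return {key: clean_value(value) for key, value in record.items()}


raw = {"first_name": "  John ", "surname": "", "city": "New   York"}
print(clean_record(raw))
```

Mapping blanks to None matters for linkage: two records that both lack a surname should not be treated as agreeing on surname.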
Reason: Understanding the various data formats used in the codebase (e.g., CSV, Parquet) and their implications for data storage and processing. Example: Examining the data files within the “data” directory and analyzing how they’re loaded and processed in notebooks.
Reason: Developing skills to visualize data effectively, particularly for exploring relationships between records and inspecting clustering results. Example: Analyzing the HTML dashboards generated by the “duckdb” examples, such as “50k_cluster.html” or “comparison_viewer_transactions.html”.
Database & Data Source Integration
Reason: Understanding how Splink integrates with various data sources and databases like DuckDB, SQLite, and Spark. Example: Studying the “duckdb” and “spark” directories and observing how Splink is configured for different data sources.
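Because the work is pushed down to a SQL backend, it can help to see what a blocking step looks like as a query. The sketch below uses Python's built-in sqlite3 module with a hypothetical "people" table. It is a hand-written illustration of the idea, not the SQL that Splink generates.

```python
import sqlite3

# In-memory database with a hypothetical table; names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, dob TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "john", "1980-01-02"),
        (2, "jon", "1980-01-02"),
        (3, "mary", "1975-06-30"),
        (4, "marie", "1975-06-30"),
        (5, "john", "1991-11-11"),
    ],
)

# Blocking as a self-join: only rows sharing a date of birth become candidate
# pairs, and l.id < r.id keeps each unordered pair exactly once.
candidate_pairs = conn.execute(
    """
    SELECT l.id, r.id
    FROM people AS l
    JOIN people AS r
      ON l.dob = r.dob AND l.id < r.id
    ORDER BY l.id, r.id
    """
).fetchall()
conn.close()

print(candidate_pairs)
```

The same join shape carries over to other SQL engines, which is one reason a linkage workflow can be moved between backends with limited changes.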
Reason: Developing proficiency with Jupyter notebooks for data exploration, analysis, and code execution. Example: Working through the interactive notebooks in the “tutorials” and “examples” directories.
Reason: Understanding how to train record linkage models, assess their performance, and refine model parameters for optimal results. Example: Studying the “04_Estimating_model_parameters.ipynb” notebook in the “tutorials” directory.
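The parameters being estimated are, in the Fellegi-Sunter model that Splink is built on, the m and u probabilities of each comparison level. A minimal sketch of how such parameters turn into a match probability follows. The numeric values are hypothetical, and the calculation assumes the comparisons are conditionally independent.

```python
import math


def match_weight(m, u):
    """Match weight of one comparison level: log2 of the Bayes factor m / u.

    m is the probability of observing this level among true matches,
    u is the probability of observing it among true non-matches.
    """
    return math.log2(m / u)


def match_probability(prior, comparisons):
    """Combine a prior match probability with (m, u) pairs via Bayes' rule on odds."""
    odds = prior / (1 - prior)
    for m, u in comparisons:
        odds *= m / u
    return odds / (1 + odds)


# Hypothetical parameters: (m, u) for agreement on surname and on date of birth.
comparisons = [(0.9, 0.1), (0.8, 0.01)]
weights = [match_weight(m, u) for m, u in comparisons]
probability = match_probability(0.01, comparisons)

print(weights, probability)
```

A level with m much larger than u pushes a pair strongly toward a match, while m close to u carries little evidence either way. Checking estimated parameters against that intuition is a quick sanity test of a trained model.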
Reason: Learning how to deploy trained models for real-world applications and develop automated pipelines for record linkage processes. Example: Examining the “demo_settings” directory and exploring the “real_time_record_linkage.ipynb” notebook.
Reason: Acquiring skills to write and execute tests for record linkage models, diagnose issues, and troubleshoot code. Example: Studying how unit tests could be implemented for specific components of Splink or examining the “recreate_venv.sh” script for potential testing and debugging practices.
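As one possible shape for such a unit test, the sketch below uses the standard unittest module against a small helper. The helper "exact_match_level" is hypothetical and written here only to have something to test. It is not part of Splink or of the repository.

```python
import unittest


def exact_match_level(left, right):
    """Hypothetical comparison helper: 1 if both values are present and equal, else 0."""
    if left is None or right is None:
        return 0
    return 1 if left == right else 0


class ExactMatchLevelTest(unittest.TestCase):
    def test_equal_values_match(self):
        self.assertEqual(exact_match_level("smith", "smith"), 1)

    def test_different_values_do_not_match(self):
        self.assertEqual(exact_match_level("smith", "smythe"), 0)

    def test_missing_values_do_not_match(self):
        # Two missing values must not count as agreement.
        self.assertEqual(exact_match_level(None, None), 0)
```

Saved to a file, the tests can be run with "python -m unittest". The missing-value case is the one most worth pinning down, since treating two nulls as equal is an easy way to create false links.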
Reason: Understanding concepts of continuous integration and continuous delivery (CI/CD) for efficient development and deployment of record linkage solutions. Example: Investigating how CI/CD processes could be implemented for the “splink_demos” project, potentially using tools like GitHub Actions.
Reason: Learning how to secure data and code within the record linkage process, considering aspects like data access control and preventing data breaches. Example: Investigating the “requirements.txt” file for potential security-related dependencies, or considering additional security measures for handling sensitive data.
Reason: Understanding Git commands for version control, enabling collaboration, tracking changes, and managing different code versions. Example: Analyzing the project’s commit history on GitHub, reviewing the “README.md” instructions for cloning and working with the repository.
Reason: Learning how to effectively document code and explain its functionality, ensuring clear communication and maintainability. Example: Reviewing the “README.md” file, the “tutorials” directory, and the “scv.html” file for examples of documentation and communication within the project.