Data Formats & Storage
This section outlines the data formats used in the tutorials and examples of the Splink record linkage library, along with their advantages and disadvantages. The goal is to guide developers on data storage and processing.
Data Formats:
CSV (Comma-Separated Values): A widely used plain-text format for tabular data. Each row represents a record, and each column represents a field.
- Advantages: Easy to read and write; human-readable format; supported by numerous tools and libraries.
- Disadvantages: Can be slow for large datasets; not suitable for complex data structures; can be prone to errors during parsing.
- Examples:
  - data/synthetic_1000.csv (used in tutorials/02_Exploratory_analysis.ipynb)
  - The “aapl,” “alphabet,” and “cars” datasets (used in tutorials/scv.html)
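The parsing risk noted above can be illustrated with a minimal sketch using only Python's standard `csv` module. The sample data and column names (`unique_id`, `first_name`, `address`) are assumptions made for illustration and are not taken from the Splink datasets.

```python
import csv
import io

# Hypothetical two-record sample; the column names are illustrative only.
raw = 'unique_id,first_name,address\n1,Ann,"12 High St, Leeds"\n2,Bob,\n'

# Naive splitting on commas breaks the quoted address into two pieces.
naive_fields = raw.splitlines()[1].split(",")

# csv.DictReader honours quoting, so each record keeps one value per column.
records = list(csv.DictReader(io.StringIO(raw)))

print(len(naive_fields))            # 4 fields instead of the expected 3
print(records[0]["address"])        # 12 High St, Leeds
print(repr(records[1]["address"]))  # '' (an empty string, not None)
```

Note that the missing address in the second record is read as an empty string rather than a null, which matters for the null-handling point under Additional Considerations.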
Parquet: A columnar format designed for efficient storage and retrieval of large datasets. It’s a more compact and performant alternative to CSV.
- Advantages: Efficient storage and retrieval of data; optimized for analytical workloads; suitable for large datasets.
- Disadvantages: Binary format, so files are not human-readable; reading and writing requires a Parquet-aware library or tool.
- Example: splink_demos/examples/duckdb/real_time_record_linkage.ipynb, which uses Parquet files in the ‘data’ folder.
SQLite Database: A lightweight, embedded database that stores data in a file.
- Advantages: Simple and efficient; suitable for smaller datasets; can be easily integrated into applications.
- Disadvantages: Not as scalable as client-server databases; less feature-rich than full database servers.
- Examples:
  - examples/sqlite/deduplicate_50k_synthetic.ipynb uses SQLite for the deduplication of a synthetic dataset.
  - tutorials/scv.html utilizes a SQLite client (SQLiteDatabaseClient).
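As a rough sketch of what storing records in SQLite looks like, the following uses Python's built-in `sqlite3` module directly. It does not use Splink's own API; the table name `people`, its columns, and the exact-match query are assumptions chosen only to show the idea of finding candidate duplicates in an embedded database.

```python
import sqlite3

# In-memory database for the sketch; pass a file path instead to persist to disk.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE people (unique_id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT)"
)

# Hypothetical records; record 3 has a true NULL surname.
rows = [(1, "ann", "smith"), (2, "ann", "smith"), (3, "bob", None)]
con.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
con.commit()

# Candidate duplicate pairs: identical first name and surname, different ids.
pairs = con.execute(
    """
    SELECT l.unique_id, r.unique_id
    FROM people AS l
    JOIN people AS r
      ON l.first_name = r.first_name
     AND l.surname = r.surname
     AND l.unique_id < r.unique_id
    """
).fetchall()

print(pairs)  # [(1, 2)]
con.close()
```

The `l.unique_id < r.unique_id` condition keeps each pair once and avoids comparing a record with itself. A real linkage job would use probabilistic comparisons rather than exact equality.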
Data Storage:
File System: The simplest approach, storing data in separate files.
- Advantages: Easy to manage; suitable for smaller datasets; can be used with various data formats.
- Disadvantages: Can be inefficient for large datasets; may require manual file management.
Database: Provides a structured way to store and manage data.
- Advantages: Enhanced data integrity and consistency; efficient data access and retrieval; supports complex relationships between data.
- Disadvantages: Can be complex to set up and maintain; may require specialized skills.
Additional Considerations
- Data Consistency and Cleaning: Before using Splink, ensure data consistency by standardizing formats, matching text case, and handling null values.
- Data Type: Splink treats null values differently from empty strings, so represent missing data as true nulls rather than empty strings to help ensure proper matching across datasets.
- Performance: The choice of data format and storage method can significantly impact performance. For larger datasets, consider using optimized formats like Parquet and storing data in a database.
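The cleaning steps listed above (standardizing case, trimming, and converting blanks to nulls) might be sketched as follows. The helper names `clean_value` and `clean_record` are hypothetical and not part of Splink; this is one possible pre-processing pass under those assumptions, not a prescribed method.

```python
from typing import Optional


def clean_value(value: Optional[str]) -> Optional[str]:
    """Trim whitespace, lower-case, and map blank strings to None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    # An empty result means the field carried no information: use a true null.
    return cleaned or None


def clean_record(record: dict) -> dict:
    """Apply clean_value to every field of a record of string values."""
    return {key: clean_value(val) for key, val in record.items()}


# Hypothetical input record with mixed case, padding, a blank, and a null.
raw = {"first_name": "  ANN ", "surname": "", "city": None}
print(clean_record(raw))
# {'first_name': 'ann', 'surname': None, 'city': None}
```

Applying the same function to every input dataset keeps case and null handling consistent before the data is loaded into the chosen backend.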
Sources:
tutorials/01_Prerequisites.ipynb
tutorials/02_Exploratory_analysis.ipynb
tutorials/03_Blocking.ipynb
examples/duckdb/real_time_record_linkage.ipynb
examples/sqlite/deduplicate_50k_synthetic.ipynb
README.md
tutorials/scv.html
examples/sqlite/dashboards/50k_cluster.html
examples/duckdb/dashboards/50k_deterministic_cluster.html
This documentation provides a comprehensive overview of data formats and storage considerations for developers working with Splink.
Top-Level Directory Explanations
examples/ - This directory likely contains examples or sample code for using the project’s components.
examples/athena/ - This subdirectory may contain examples using Amazon Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL.
examples/athena/dashboards/ - This subdirectory may contain Athena dashboard files.
examples/duckdb/ - This subdirectory may contain examples using DuckDB, an open-source in-process analytical database.
examples/duckdb/dashboards/ - This subdirectory may contain DuckDB dashboard files.
examples/sqlite/ - This subdirectory may contain examples using SQLite, a popular open-source database management system.
examples/sqlite/dashboards/ - This subdirectory may contain SQLite dashboard files.
tutorials/ - This directory may contain tutorials or guides for using the project.