{
“cells”: [
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Linking a dataset of real historical persons with Deterrministic Rules\n”,
“\n”,
“While Splink is primarily a tool for probabilistic records linkage, there is functionality to perform deterministic (i.e. rules based) linkage.\n”,
“\n”,
“In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors introduced. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons
.”
]
},
{
“cell_type”: “code”,
“execution_count”: 1,
“metadata”: {},
“outputs”: [
{
“data”: {
“text/html”: [
“
\n”,
“\n”,
“\n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
“
| unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
---|
0 | Q2296770-1 | Q2296770 | thomas clifford, 1st baron clifford of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
---|
1 | Q2296770-2 | Q2296770 | thomas of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
---|
2 | Q2296770-3 | Q2296770 | tom 1st baron clifford of chudleigh | tom chudleigh | tom | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
---|
3 | Q2296770-4 | Q2296770 | thomas 1st chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8hu | None | politician |
---|
4 | Q2296770-5 | Q2296770 | thomas clifford, 1st baron chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | None | politician |
---|
\n”,
“
”
],
“text/plain”: [
” unique_id cluster full_name \n”,
“0 Q2296770-1 Q2296770 thomas clifford, 1st baron clifford of chudleigh \n”,
“1 Q2296770-2 Q2296770 thomas of chudleigh \n”,
“2 Q2296770-3 Q2296770 tom 1st baron clifford of chudleigh \n”,
“3 Q2296770-4 Q2296770 thomas 1st chudleigh \n”,
“4 Q2296770-5 Q2296770 thomas clifford, 1st baron chudleigh \n”,
“\n”,
” first_and_surname first_name surname dob birth_place \n”,
“0 thomas chudleigh thomas chudleigh 1630-08-01 devon \n”,
“1 thomas chudleigh thomas chudleigh 1630-08-01 devon \n”,
“2 tom chudleigh tom chudleigh 1630-08-01 devon \n”,
“3 thomas chudleigh thomas chudleigh 1630-08-01 devon \n”,
“4 thomas chudleigh thomas chudleigh 1630-08-01 devon \n”,
“\n”,
” postcode_fake gender occupation \n”,
“0 tq13 8df male politician \n”,
“1 tq13 8df male politician \n”,
“2 tq13 8df male politician \n”,
“3 tq13 8hu None politician \n”,
“4 tq13 8df None politician ”
]
},
“execution_count”: 1,
“metadata”: {},
“output_type”: “execute_result”
}
],
“source”: [
“from splink.duckdb.linker import DuckDBLinker\n”,
“import altair as alt\n”,
“alt.renderers.enable(‘html’)\n”,
“\n”,
“import pandas as pd \n”,
“pd.options.display.max_rows = 1000\n”,
“df = pd.read_parquet(“../../data/historical_figures_with_errors_50k.parquet”)\n”,
“df.head()”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“When defining the settings object, simply pass your deterministic rules into blocking_rules_to_generate_predictions
. \n”,
“\n”,
“For a deterministic linkage, the entire linkage methodology is based on these rules, so there is no need to define comparisons
nor any other parameters required for model training in a probabilistic model.”
]
},
{
“cell_type”: “code”,
“execution_count”: 2,
“metadata”: {},
“outputs”: [],
“source”: [
“# Simple settings dictionary will be used for exploratory analysis\n”,
“settings = {\n”,
” “link_type”: “dedupe_only”,\n”,
” “blocking_rules_to_generate_predictions”: [\n”,
” “l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob”,\n”,
” “l.surname = r.surname and l.dob = r.dob and l.postcode_fake = r.postcode_fake”,\n”,
” “l.first_name = r.first_name and l.dob = r.dob and l.occupation = r.occupation”,\n”,
” ],\n”,
” “retain_matching_columns”: True,\n”,
” “retain_intermediate_calculation_columns”: True,\n”,
“}\n”,
“linker = DuckDBLinker(df, settings)\n”,
“\n”,
“linker.debug_mode = False”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Once the linker
object is defined, you can profile the dataset columns.”
]
},
{
“cell_type”: “code”,
“execution_count”: 3,
“metadata”: {},
“outputs”: [
{
“data”: {
“text/html”: [
“\n”,
“\n”,
“
\n”,
“”
],
“text/plain”: [
“alt.VConcatChart(…)”
]
},
“execution_count”: 3,
“metadata”: {},
“output_type”: “execute_result”
}
],
“source”: [
“linker.profile_columns(\n”,
” [“first_name”, “surname”, “substr(dob, 1,4)”], top_n=10, bottom_n=5\n”,
“)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“In a deterministic linkage, the blocking rules chart shows how many records have been matched by each of the deterministic rules.”
]
},
{
“cell_type”: “code”,
“execution_count”: 4,
“metadata”: {},
“outputs”: [
{
“data”: {
“text/html”: [
“\n”,
“\n”,
“
\n”,
“”
],
“text/plain”: [
“alt.Chart(…)”
]
},
“execution_count”: 4,
“metadata”: {},
“output_type”: “execute_result”
}
],
“source”: [
“linker.cumulative_num_comparisons_from_blocking_rules_chart()”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“The results of the linkage can be viewed with the deterministic_link
function.”
]
},
{
“cell_type”: “code”,
“execution_count”: 5,
“metadata”: {},
“outputs”: [
{
“data”: {
“text/html”: [
“
\n”,
“\n”,
“\n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
“
| unique_id_l | unique_id_r | dob_l | dob_r | postcode_fake_l | postcode_fake_r | occupation_l | occupation_r | surname_l | surname_r | first_name_l | first_name_r | match_key | match_probability |
---|
0 | Q2296770-1 | Q2296770-6 | 1630-08-01 | 1630-08-01 | tq13 8df | tq13 8df | politician | politician | chudleigh | chudleigh | thomas | thomas | 0 | 1.0 |
---|
1 | Q2296770-2 | Q2296770-6 | 1630-08-01 | 1630-08-01 | tq13 8df | tq13 8df | politician | politician | chudleigh | chudleigh | thomas | thomas | 0 | 1.0 |
---|
2 | Q2296770-3 | Q2296770-7 | 1630-08-01 | 1630-08-01 | tq13 8df | tq13 8df | politician | NaN | chudleigh | chudleigh | tom | tom | 0 | 1.0 |
---|
3 | Q2296770-4 | Q2296770-6 | 1630-08-01 | 1630-08-01 | tq13 8hu | tq13 8df | politician | politician | chudleigh | chudleigh | thomas | thomas | 0 | 1.0 |
---|
4 | Q2296770-5 | Q2296770-6 | 1630-08-01 | 1630-08-01 | tq13 8df | tq13 8df | politician | politician | chudleigh | chudleigh | thomas | thomas | 0 | 1.0 |
---|
\n”,
“
”
],
“text/plain”: [
” unique_id_l unique_id_r dob_l dob_r postcode_fake_l \n”,
“0 Q2296770-1 Q2296770-6 1630-08-01 1630-08-01 tq13 8df \n”,
“1 Q2296770-2 Q2296770-6 1630-08-01 1630-08-01 tq13 8df \n”,
“2 Q2296770-3 Q2296770-7 1630-08-01 1630-08-01 tq13 8df \n”,
“3 Q2296770-4 Q2296770-6 1630-08-01 1630-08-01 tq13 8hu \n”,
“4 Q2296770-5 Q2296770-6 1630-08-01 1630-08-01 tq13 8df \n”,
“\n”,
” postcode_fake_r occupation_l occupation_r surname_l surname_r \n”,
“0 tq13 8df politician politician chudleigh chudleigh \n”,
“1 tq13 8df politician politician chudleigh chudleigh \n”,
“2 tq13 8df politician NaN chudleigh chudleigh \n”,
“3 tq13 8df politician politician chudleigh chudleigh \n”,
“4 tq13 8df politician politician chudleigh chudleigh \n”,
“\n”,
” first_name_l first_name_r match_key match_probability \n”,
“0 thomas thomas 0 1.0 \n”,
“1 thomas thomas 0 1.0 \n”,
“2 tom tom 0 1.0 \n”,
“3 thomas thomas 0 1.0 \n”,
“4 thomas thomas 0 1.0 ”
]
},
“execution_count”: 5,
“metadata”: {},
“output_type”: “execute_result”
}
],
“source”: [
“df_predict = linker.deterministic_link()\n”,
“df_predict.as_pandas_dataframe().head()”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Which can be used to generate clusters. \n”,
“\n”,
“Note, for deterministic linkage, each comparison has been assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1
in the cluster_pairwise_predictions_at_threshold
function.”
]
},
{
“cell_type”: “code”,
“execution_count”: 6,
“metadata”: {},
“outputs”: [
{
“name”: “stderr”,
“output_type”: “stream”,
“text”: [
“Completed iteration 1, root rows count 94\n”,
“Completed iteration 2, root rows count 10\n”,
“Completed iteration 3, root rows count 0\n”
]
}
],
“source”: [
“clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=1)”
]
},
{
“cell_type”: “code”,
“execution_count”: 7,
“metadata”: {},
“outputs”: [
{
“data”: {
“text/html”: [
“
\n”,
“\n”,
“\n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
” \n”,
“
| cluster_id | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
---|
0 | Q21461054-1 | Q43139177-6 | Q43139177 | fred valter | fred valter | fred | valter | 1850-01-01 | NaN | NaN | NaN | NaN |
---|
1 | Q8021225-2 | Q8021225-2 | Q8021225 | williamson john reynolds | williamson reynolds | williamson | reynolds | 1861-10-21 | chesterfield | s41 0ee | male | composer |
---|
2 | Q8021877-4 | Q8021877-4 | Q8021877 | william ward | william ward | william | ward | 1857-01-01 | gateshead | NaN | male | dancer |
---|
3 | Q8032443-14 | Q8032443-14 | Q8032443 | woodbutn | woodbutn | woodbutn | NaN | 1701-01-01 | penarth | cf64 3jj | NaN | cricketer |
---|
4 | Q8032443-1 | Q8032443-2 | Q8032443 | woodburn | woodburn | woodburn | NaN | 1701-01-01 | vale of glamorgan | cf64 3jj | male | cricketer |
---|
\n”,
“
”
],
“text/plain”: [
” cluster_id unique_id cluster full_name \n”,
“0 Q21461054-1 Q43139177-6 Q43139177 fred valter \n”,
“1 Q8021225-2 Q8021225-2 Q8021225 williamson john reynolds \n”,
“2 Q8021877-4 Q8021877-4 Q8021877 william ward \n”,
“3 Q8032443-14 Q8032443-14 Q8032443 woodbutn \n”,
“4 Q8032443-1 Q8032443-2 Q8032443 woodburn \n”,
“\n”,
” first_and_surname first_name surname dob birth_place \n”,
“0 fred valter fred valter 1850-01-01 NaN \n”,
“1 williamson reynolds williamson reynolds 1861-10-21 chesterfield \n”,
“2 william ward william ward 1857-01-01 gateshead \n”,
“3 woodbutn woodbutn NaN 1701-01-01 penarth \n”,
“4 woodburn woodburn NaN 1701-01-01 vale of glamorgan \n”,
“\n”,
” postcode_fake gender occupation \n”,
“0 NaN NaN NaN \n”,
“1 s41 0ee male composer \n”,
“2 NaN male dancer \n”,
“3 cf64 3jj NaN cricketer \n”,
“4 cf64 3jj male cricketer ”
]
},
“execution_count”: 7,
“metadata”: {},
“output_type”: “execute_result”
}
],
“source”: [
“clusters.as_pandas_dataframe(limit=5)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“These results can then be passed into the Cluster Studio Dashboard
.”
]
},
{
“cell_type”: “code”,
“execution_count”: 8,
“metadata”: {},
“outputs”: [
{
“data”: {
“text/html”: [
“\n”,
” \n”,
” ”
],
“text/plain”: [
“”
]
},
“execution_count”: 8,
“metadata”: {},
“output_type”: “execute_result”
}
],
“source”: [
“linker.cluster_studio_dashboard(df_predict, clusters, “dashboards/50k_deterministic_cluster.html”, sampling_method=’by_cluster_size’, overwrite=True)\n”,
“\n”,
“from IPython.display import IFrame\n”,
“\n”,
“IFrame(\n”,
” src=”./dashboards/50k_deterministic_cluster.html”, width=”100%”, height=1200\n”,
“) ”
]
}
],
“metadata”: {
“kernelspec”: {
“display_name”: “Python 3 (ipykernel)”,
“language”: “python”,
“name”: “python3”
},
“language_info”: {
“codemirror_mode”: {
“name”: “ipython”,
“version”: 3
},
“file_extension”: “.py”,
“mimetype”: “text/x-python”,
“name”: “python”,
“nbconvert_exporter”: “python”,
“pygments_lexer”: “ipython3”,
“version”: “3.9.2”
}
},
“nbformat”: 4,
“nbformat_minor”: 4
}