Author: Olivier Vitrac, PhD, HDR | olivier.vitrac@adservio.fr | Adservio | 2025-12-11
# Matching column names between two tables
## 1. VISUAL DEMO: Hungarian Algorithm for Matching Column Names

We illustrate a small example:

- A has 4 columns
- B has 5 columns
- We compute a similarity matrix → cost matrix → Hungarian assignment
### 1-1. Similarity Matrix (A × B)

A greater cosine score means higher similarity.

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.95 | 0.10 | 0.05 | 0.02 | 0.01 |
| FName | 0.05 | 0.10 | 0.92 | 0.03 | 0.01 |
| LName | 0.02 | 0.88 | 0.15 | 0.01 | 0.01 |
| TotHrs | 0.03 | 0.05 | 0.02 | 0.97 | 0.01 |
### 1-2. Convert to Cost Matrix = (1 - similarity)

Lower cost = better match.

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.05 | 0.90 | 0.95 | 0.98 | 0.99 |
| FName | 0.95 | 0.90 | 0.08 | 0.97 | 0.99 |
| LName | 0.98 | 0.12 | 0.85 | 0.99 | 0.99 |
| TotHrs | 0.97 | 0.95 | 0.98 | 0.03 | 0.99 |
### 1-3. Hungarian Algorithm: Visual Steps

#### Step 1 - Subtract row minima

We make each row have a zero at its best match.

Row minima: [0.05, 0.08, 0.12, 0.03]. After subtracting the row minima:

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.00 | 0.85 | 0.90 | 0.93 | 0.94 |
| FName | 0.87 | 0.82 | 0.00 | 0.89 | 0.91 |
| LName | 0.86 | 0.00 | 0.73 | 0.87 | 0.87 |
| TotHrs | 0.94 | 0.92 | 0.95 | 0.00 | 0.96 |
#### Step 2 - Subtract column minima

Make each column also have a zero.

Column minima: [0.00, 0.00, 0.00, 0.00, 0.87]. After subtracting the column minima:

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.00 | 0.85 | 0.90 | 0.93 | 0.07 |
| FName | 0.87 | 0.82 | 0.00 | 0.89 | 0.04 |
| LName | 0.86 | 0.00 | 0.73 | 0.87 | 0.00 |
| TotHrs | 0.94 | 0.92 | 0.95 | 0.00 | 0.09 |
Now we have multiple zeros appearing in meaningful places.
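Both reductions are plain row-wise and column-wise minimum subtractions. A minimal NumPy sketch on the demo cost matrix (values copied from the tables above):

```python
import numpy as np

# Demo cost matrix (1 - similarity) from the tables above.
# Rows = A columns (EmpID, FName, LName, TotHrs),
# columns = B columns (id_emp, surname, given_nm, hours, dept).
cost = np.array([
    [0.05, 0.90, 0.95, 0.98, 0.99],
    [0.95, 0.90, 0.08, 0.97, 0.99],
    [0.98, 0.12, 0.85, 0.99, 0.99],
    [0.97, 0.95, 0.98, 0.03, 0.99],
])

# Step 1: subtract each row's minimum so every row gets at least one zero.
reduced = cost - cost.min(axis=1, keepdims=True)

# Step 2: subtract each column's minimum so every column gets at least one zero.
reduced = reduced - reduced.min(axis=0, keepdims=True)

print(reduced.round(2))  # zeros mark the promising candidate matches
```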
#### Step 3 - Cover all zeros with the minimum number of lines

We draw horizontal/vertical lines to cover all 0 entries with as few lines as possible.

| A \ B | id_emp | surname | given_nm | hours | dept | covered by |
|---|---|---|---|---|---|---|
| EmpID | **0.00** | 0.85 | 0.90 | 0.93 | 0.07 | column line on id_emp |
| FName | 0.87 | 0.82 | **0.00** | 0.89 | 0.04 | row line |
| LName | 0.86 | **0.00** | 0.73 | 0.87 | **0.00** | row line |
| TotHrs | 0.94 | 0.92 | 0.95 | **0.00** | 0.09 | row line |

Minimum number of covering lines = 4 = number of rows → a solution exists.
### 1-4. Choose one zero per row and per column (optimal matching)

Since we can cover all zeros with m = 4 lines, the Hungarian conditions say:

- a perfect assignment exists among the zeros;
- no more matrix adjustments are needed.
We now select non-conflicting zeros:
Optimal zero choices:

- EmpID → id_emp (0)
- FName → given_nm (0)
- LName → surname (0)
- TotHrs → hours (0)
### 1-5. Final Semantic Alignment (maximizing total similarity)

| A column | B column | similarity |
|---|---|---|
| EmployeeID | id_employee | 0.95 |
| FirstName | given_name | 0.92 |
| LastName | surname | 0.88 |
| TotalHours | hours_total_month | 0.97 |

Global semantic matching = OPTIMAL:

- no two A-columns are matched to the same B-column;
- the matches represent the maximum possible global similarity;
- all ambiguity is resolved using global optimization.
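The whole demo can be reproduced in a few lines with SciPy's `linear_sum_assignment`; a minimal sketch where the similarity values are simply copied from the demo tables above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cols_a = ["EmpID", "FName", "LName", "TotHrs"]
cols_b = ["id_emp", "surname", "given_nm", "hours", "dept"]

# Similarity matrix from the demo (rows = A, columns = B).
sim = np.array([
    [0.95, 0.10, 0.05, 0.02, 0.01],
    [0.05, 0.10, 0.92, 0.03, 0.01],
    [0.02, 0.88, 0.15, 0.01, 0.01],
    [0.03, 0.05, 0.02, 0.97, 0.01],
])

# Hungarian assignment on the cost matrix (1 - similarity).
row_ind, col_ind = linear_sum_assignment(1.0 - sim)

for i, j in zip(row_ind, col_ind):
    print(f"{cols_a[i]} -> {cols_b[j]}  (similarity = {sim[i, j]:.2f})")
# Expected pairs: EmpID->id_emp, FName->given_nm, LName->surname, TotHrs->hours
```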
## 2. ALGORITHM: Goal in this problem

Below is an intuitive, context-specific explanation of the Hungarian algorithm for the problem of matching the column names of table A to the column names of table B using semantic similarity.
We have:

- table A with m column names;
- table B with n column names, where n ≥ m.

We want to match each column of A to exactly one distinct column of B so that:

- matched pairs are as semantically similar as possible;
- total similarity is maximized globally (not column by column independently).
This is exactly a linear assignment problem, solved optimally by the Hungarian algorithm.
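Formally, with $S_{ij}$ denoting the semantic similarity between column $i$ of A and column $j$ of B (notation introduced here for clarity, not taken from the project), the objective is

$$
\max_{\sigma}\; \sum_{i=1}^{m} S_{i,\sigma(i)}
\qquad \text{over injective maps } \sigma:\{1,\dots,m\}\to\{1,\dots,n\},
$$

which the Hungarian algorithm solves by minimizing the equivalent total cost $\sum_{i=1}^{m}\bigl(1 - S_{i,\sigma(i)}\bigr)$.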
### Step 1 - Build a similarity matrix (semantic scores)

Compute a matrix `similarity[i][j] = semantic_similarity(A_column_i, B_column_j)`, where i indexes the m columns of A, j indexes the n columns of B, and scores lie in [0, 1] (higher = more similar).

Example:
| A \ B | id_employee | surname | given_name | hours_total_month |
|---|---|---|---|---|
| EmployeeID | 0.95 | 0.10 | 0.05 | 0.02 |
| FirstName | 0.05 | 0.10 | 0.92 | 0.03 |
| LastName | 0.02 | 0.88 | 0.15 | 0.01 |
| TotalHours | 0.03 | 0.05 | 0.02 | 0.97 |
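For a quick feel of what such a matrix looks like in code, here is a crude character-level baseline using only the standard library (`difflib.SequenceMatcher`); it is a stand-in similarity, not the TF-IDF approach used in the full example later:

```python
import difflib
import numpy as np

cols_a = ["EmployeeID", "FirstName", "LastName", "TotalHours"]
cols_b = ["id_employee", "surname", "given_name", "hours_total_month"]

def crude_similarity(a: str, b: str) -> float:
    # Character-level ratio on lowercased names; a rough stand-in for real semantics.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# sim[i][j] = similarity between the i-th column of A and the j-th column of B
sim = np.array([[crude_similarity(a, b) for b in cols_b] for a in cols_a])
print(sim.round(2))
```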
### Step 2 - Convert similarity to cost

The Hungarian algorithm minimizes total cost, but we want to maximize total similarity. So define `cost[i][j] = 1 - similarity[i][j]`.

Thus:

- high similarity → low cost
- low similarity → high cost

This ensures the Hungarian algorithm will select high-semantic-similarity pairs.
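A small sketch of this conversion (the similarity matrix here is random placeholder data). Building `cost = 1 - sim` is one line, and recent SciPy versions (≥ 1.4) can alternatively maximize the similarity directly via `maximize=True`, which is equivalent because the two objectives differ only by a constant:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

sim = np.random.rand(4, 6)        # placeholder similarity matrix (rows = A, cols = B)

cost = 1.0 - sim                  # high similarity -> low cost
rows, cols = linear_sum_assignment(cost)

# Equivalent formulation without an explicit cost matrix (SciPy >= 1.4):
rows_max, cols_max = linear_sum_assignment(sim, maximize=True)

# Both calls describe the same optimal assignment
# (up to ties between equally good matchings).
print(list(zip(rows, cols)))
print(list(zip(rows_max, cols_max)))
```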
### Step 3 - Interpret the problem as a bipartite graph

Imagine:

- left-side nodes = columns of A;
- right-side nodes = columns of B;
- every possible pair (Aᵢ, Bⱼ) has a cost.

We want to draw m edges, each connecting one column of A to one distinct column of B, such that the total cost of the selected edges is minimal (equivalently, the total similarity is maximal); the brute-force sketch below makes this concrete.
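The sketch enumerates every injective assignment on the demo similarity values from section 1 (copied here as plain numbers); it only works for tiny tables, which is exactly why a polynomial-time method such as the Hungarian algorithm is needed:

```python
from itertools import permutations
import numpy as np

# Demo similarity matrix from section 1 (rows = A columns, columns = B columns).
sim = np.array([
    [0.95, 0.10, 0.05, 0.02, 0.01],
    [0.05, 0.10, 0.92, 0.03, 0.01],
    [0.02, 0.88, 0.15, 0.01, 0.01],
    [0.03, 0.05, 0.02, 0.97, 0.01],
])
m, n = sim.shape

def total_similarity(assignment):
    # assignment[i] = index of the B column matched to the i-th A column
    return sum(sim[i, j] for i, j in enumerate(assignment))

# Every ordered choice of m distinct B columns is one candidate set of edges.
best = max(permutations(range(n), m), key=total_similarity)
print(best, round(total_similarity(best), 2))   # (0, 2, 1, 3) with total 3.72
```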
### Step 4 - Why greedy matching fails

If you pick for each Aᵢ the Bⱼ with the highest similarity individually, you may get a collision:
Example:
| A | Best match in B | Similarity |
|---|---|---|
| FirstName | given_name | 0.92 |
| LastName | given_name | 0.88 |
The greedy strategy tries to match both A columns to the same B column → forbidden.
You then need a global strategy that resolves conflicts while preserving the optimal total similarity.
That is exactly what the Hungarian algorithm solves optimally.
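A tiny runnable contrast between the greedy row-wise argmax and the Hungarian assignment; the similarity numbers are invented so that the collision above occurs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cols_a = ["FirstName", "LastName"]
cols_b = ["given_name", "surname"]

# Invented similarities: both A columns individually prefer "given_name".
sim = np.array([
    [0.92, 0.10],   # FirstName
    [0.88, 0.85],   # LastName
])

# Greedy: each row picks its own best column -> both pick "given_name" (collision).
greedy = sim.argmax(axis=1)
print("greedy picks:", [cols_b[j] for j in greedy])   # ['given_name', 'given_name']

# Hungarian: resolves the conflict globally.
rows, cols = linear_sum_assignment(1.0 - sim)
print("hungarian:", [(cols_a[i], cols_b[j]) for i, j in zip(rows, cols)])
# [('FirstName', 'given_name'), ('LastName', 'surname')]
```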
### Step 5 - Hungarian algorithm (intuitive explanation)

Think of it as a smart system that assigns each A-column to a unique B-column so that the total assignment cost is minimal.
The Hungarian algorithm performs these conceptual operations:

1. **Subtract row minima.** Reduce each row so that its smallest cost becomes 0; this identifies potential good matches.
2. **Subtract column minima.** Ensure each column also has a 0 reachable, improving fairness.
3. **Cover zeros with a minimal number of lines.** Zeros represent "promising candidate matches"; covering them identifies whether an optimal assignment exists among them.
4. **Adjust the matrix until an optimal set of zeros appears.** If not enough zeros exist to assign all m columns: reduce uncovered elements, increase covered ones, and generate new zeros.
5. **Select exactly one zero per row to create the optimal assignment.** This gives the globally best matching.
### Final result from the algorithm

You receive pairs like:

- A: EmployeeID → B: id_employee (similarity = 0.95)
- A: FirstName → B: given_name (similarity = 0.92)
- A: LastName → B: surname (similarity = 0.88)
- A: TotalHours → B: hours_total_month (similarity = 0.97)
Properties:

- Every column in A is matched exactly once.
- No two A columns share the same B column.
- The total similarity is globally maximized, not heuristically chosen.
### Why the Hungarian algorithm is perfect for the HCM project

- **Ensures globally optimal matching.** No heuristic, no local maximum: the result is guaranteed optimal.
- **Resolves ambiguity.** If multiple A columns are semantically close to the same B column, the algorithm distributes them optimally.
- **Works with any similarity metric.** TF-IDF, embeddings, Jaccard, Levenshtein… all fine.
- **Scales well.** Even 200×400 similarity matrices run in milliseconds.
- **Clean fallback.** If similarity < threshold → "no good match".
## 3. CODE EXAMPLE

### 3-1. Core idea

Here is a concrete, reasonably robust pattern we can adopt in Python:
- A and B are tables (e.g. pandas DataFrames) or, more generally, two lists (I do not recommend sets) storing the column names.
- We build a similarity matrix between column names using TF-IDF + cosine similarity (simple, language-agnostic, decent semantics).
- We convert similarity → cost and run the Hungarian algorithm (`linear_sum_assignment`) to get the best one-to-one mapping.
- B can have more columns than A: we match every column of A to a distinct column of B.
For column names:

1. Normalize (lowercase, strip punctuation, split on `_`, etc.).
2. Turn all names into TF-IDF vectors.
3. Compute cosine similarity between each column of A and each column of B.
4. Build a cost matrix = 1 - similarity.
5. Use the Hungarian algorithm to find the minimal total cost (i.e. maximal total similarity).
6. Optionally, drop matches below a `min_similarity` threshold.
### 3-2. Full example code (ready to drop in a module)

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Dict, Tuple

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# ---------- Data structures ----------

@dataclass
class ColumnMatch:
    """Represents the match of a column in A to a column in B."""
    col_a: str
    col_b: Optional[str]  # None if no reasonable match found
    similarity: float
    index_a: int
    index_b: Optional[int]


# ---------- Text normalization utilities ----------

def normalize_name(name: str) -> str:
    """
    Normalize a column name for semantic comparison.

    Operations:
    - lowercasing
    - replace separators (_, -, .) by space
    - remove non-alphanumeric chars
    - collapse multiple spaces
    """
    s = name.lower()
    s = s.replace("_", " ").replace("-", " ").replace(".", " ")
    s = re.sub(r"[^a-z0-9\s]", " ", s)  # keep only letters, digits, spaces
    s = re.sub(r"\s+", " ", s).strip()
    return s


def apply_synonyms(name: str, synonyms: Dict[str, str]) -> str:
    """
    Very simple synonym expansion: replace full tokens using a dict.
    E.g. {"surname": "last name", "forename": "first name"}.
    """
    tokens = name.split()
    replaced = [synonyms.get(tok, tok) for tok in tokens]
    return " ".join(replaced)


# ---------- Similarity matrix construction ----------

def build_similarity_matrix(
    cols_a: Sequence[str],
    cols_b: Sequence[str],
    synonyms: Optional[Dict[str, str]] = None,
) -> np.ndarray:
    """
    Build an |A| x |B| similarity matrix between column names
    using TF-IDF + cosine similarity.

    Parameters
    ----------
    cols_a : list of str
        Column names of table A (to be matched).
    cols_b : list of str
        Column names of table B (reference / superset).
    synonyms : dict, optional
        Optional mapping token -> replacement token applied after normalization.

    Returns
    -------
    sim : np.ndarray of shape (len(cols_a), len(cols_b))
        Similarity scores in [0, 1].
    """
    # Normalize and optionally apply synonyms
    def preprocess_list(cols: Sequence[str]) -> List[str]:
        out = []
        for c in cols:
            s = normalize_name(c)
            if synonyms:
                s = apply_synonyms(s, synonyms)
            out.append(s)
        return out

    norm_a = preprocess_list(cols_a)
    norm_b = preprocess_list(cols_b)

    # Fit TF-IDF on all names together
    all_names = norm_a + norm_b
    vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # 1-3 grams to catch short phrases
    tfidf = vectorizer.fit_transform(all_names)

    tfidf_a = tfidf[: len(norm_a), :]
    tfidf_b = tfidf[len(norm_a):, :]

    # Cosine similarity A x B
    sim = cosine_similarity(tfidf_a, tfidf_b)
    return sim


# ---------- Hungarian-based matcher ----------

def best_column_matching(
    cols_a: Sequence[str],
    cols_b: Sequence[str],
    synonyms: Optional[Dict[str, str]] = None,
    min_similarity: float = 0.3,
) -> Tuple[List[ColumnMatch], np.ndarray]:
    """
    Compute best one-to-one matching between columns of A and B
    using the Hungarian algorithm on a semantic similarity matrix.

    Parameters
    ----------
    cols_a : list of str
        Column names of table A (to be matched).
    cols_b : list of str
        Column names of table B (reference / superset).
    synonyms : dict, optional
        Optional mapping token -> replacement token (e.g. "surname" -> "last name").
    min_similarity : float
        Minimum similarity required to accept a match. If the best similarity
        for a column in A is below this threshold, the match is considered "None".

    Returns
    -------
    matches : list of ColumnMatch
        One entry per column in A.
    sim_matrix : np.ndarray
        Full similarity matrix (shape |A| x |B|).
    """
    if len(cols_a) == 0:
        return [], np.zeros((0, len(cols_b)))

    sim = build_similarity_matrix(cols_a, cols_b, synonyms=synonyms)
    # Hungarian solves a MIN cost assignment, so we transform:
    cost = 1.0 - sim

    # Hungarian algorithm
    row_ind, col_ind = linear_sum_assignment(cost)
    # Note: row_ind is [0..len(cols_a)-1], col_ind is the chosen subset of B

    matches: List[ColumnMatch] = []
    for i_a, i_b in zip(row_ind, col_ind):
        s = float(sim[i_a, i_b])
        if s >= min_similarity:
            matches.append(
                ColumnMatch(
                    col_a=cols_a[i_a],
                    col_b=cols_b[i_b],
                    similarity=s,
                    index_a=i_a,
                    index_b=i_b,
                )
            )
        else:
            # no acceptable match
            matches.append(
                ColumnMatch(
                    col_a=cols_a[i_a],
                    col_b=None,
                    similarity=s,
                    index_a=i_a,
                    index_b=None,
                )
            )

    # Ensure matches are ordered by index_a (same order as A)
    matches.sort(key=lambda m: m.index_a)

    return matches, sim


# ---------- Convenience wrapper for Pandas ----------

def match_dataframe_columns(
    df_a,
    df_b,
    synonyms: Optional[Dict[str, str]] = None,
    min_similarity: float = 0.3,
) -> List[ColumnMatch]:
    """
    Convenience wrapper when A and B are pandas.DataFrame.

    Returns a list of ColumnMatch.
    """
    cols_a = list(df_a.columns)
    cols_b = list(df_b.columns)
    matches, _ = best_column_matching(
        cols_a, cols_b, synonyms=synonyms, min_similarity=min_similarity
    )
    return matches


# ---------- Example usage as a script ----------

if __name__ == "__main__":
    # EXAMPLE WITHOUT PANDAS (just lists of names)
    cols_a = ["EmployeeID", "FirstName", "LastName", "TotalHours"]
    cols_b = [
        "id_employee",
        "hours_total_month",
        "surname",
        "given_name",
        "department_code",
        "creation_date",
    ]

    synonyms = {
        # token-level synonym expansion (after normalization)
        "firstname": "first name",
        "lastname": "last name",
        "surname": "last name",
        "given": "first",  # crude but illustrates the idea
        "employeeid": "employee id",
    }

    matches, sim_matrix = best_column_matching(
        cols_a, cols_b, synonyms=synonyms, min_similarity=0.25
    )

    print("Similarity matrix (A x B):")
    print(sim_matrix)

    print("\nBest matches:")
    for m in matches:
        if m.col_b is None:
            print(
                f"A: {m.col_a:<15} -> NO GOOD MATCH (best similarity = {m.similarity:.3f})"
            )
        else:
            print(
                f"A: {m.col_a:<15} -> B: {m.col_b:<20} (similarity = {m.similarity:.3f})"
            )
```

### 3-3. How can we adapt/generalize this

**Swap TF-IDF for embeddings** for deeper semantics (e.g. `sentence-transformers`):
- Replace the TF-IDF part of `build_similarity_matrix` by a call to a pre-loaded embedding model.
- Similarity remains the cosine between A-vectors and B-vectors (a sketch follows below).
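A hedged sketch of that swap, assuming the `sentence-transformers` package is installed and using `all-MiniLM-L6-v2` as an example model (both the usage pattern and the model choice are assumptions, not project requirements); it reuses `normalize_name` from the module above:

```python
# Sketch only: requires `pip install sentence-transformers`; the model name is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

_model = SentenceTransformer("all-MiniLM-L6-v2")  # load once, reuse across calls

def build_similarity_matrix_embeddings(cols_a, cols_b) -> np.ndarray:
    """Same contract as build_similarity_matrix, with TF-IDF replaced by embeddings."""
    norm_a = [normalize_name(c) for c in cols_a]   # reuse the normalization defined above
    norm_b = [normalize_name(c) for c in cols_b]
    emb_a = _model.encode(norm_a)                  # shape (|A|, d)
    emb_b = _model.encode(norm_b)                  # shape (|B|, d)
    return cosine_similarity(emb_a, emb_b)         # shape (|A|, |B|), roughly in [-1, 1]
```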
**Tune `min_similarity`:**

- Higher (0.5–0.7) if the labels are relatively clean and English-like.
- Lower (0.2–0.3) if labels are noisy, multilingual, or very short.
**Use the `ColumnMatch` structure to:**

- automatically rename A's columns to the matched ones from B;
- flag columns with `col_b is None` for manual inspection (see the sketch below).
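For instance, a small illustrative usage sketch on top of `match_dataframe_columns` (the frames and threshold are invented for the example):

```python
import pandas as pd

# Tiny illustrative frames; real tables would come from elsewhere.
df_a = pd.DataFrame(columns=["EmployeeID", "FirstName", "LastName", "TotalHours"])
df_b = pd.DataFrame(columns=["id_employee", "surname", "given_name",
                             "hours_total_month", "department_code"])

matches = match_dataframe_columns(df_a, df_b, min_similarity=0.25)

# Rename A's columns to B's vocabulary where a match was accepted...
rename_map = {m.col_a: m.col_b for m in matches if m.col_b is not None}
df_a_renamed = df_a.rename(columns=rename_map)

# ...and flag the rest for manual inspection.
unmatched = [m.col_a for m in matches if m.col_b is None]
print("renamed:", rename_map)
print("needs manual review:", unmatched)
```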