Matching column names between two tables

Author: Olivier Vitrac, PhD, HDR | olivier.vitrac@adservio.fr | Adservio | 2025-12-11


1– 🔍 VISUAL DEMO — Hungarian Algorithm for Matching Column Names

We illustrate, step by step:

- the similarity matrix between the columns of A and B;
- its conversion into a cost matrix;
- the reduction and covering steps of the Hungarian algorithm;
- the final, globally optimal alignment.


1–1. Similarity Matrix (A × B)

A greater cosine value means higher similarity.


1–2. Convert to Cost Matrix = (1 - similarity)

Lower cost = better match.


1–3. Hungarian Algorithm — Visual Steps


Step 1 — Subtract row minima

We make each row have a zero at its best match.


Step 2 — Subtract column minima

Make each column also have a zero.

Now we have multiple zeros appearing in meaningful places.


Step 3 — Cover all zeros with the minimum number of lines

We draw horizontal/vertical lines to cover all 0 entries with as few lines as possible.

Since we can cover all zeros with m = 4 lines, the Hungarian stopping condition is satisfied:

- A perfect assignment exists among the zeros.

- No further matrix adjustments are needed.


1–4. Choose one zero per row and per column (optimal matching)

We now select a set of non-conflicting zeros: exactly one per row and one per column.


1–5. Final Semantic Alignment (maximizing total similarity)

✔ No two A-columns are matched to the same B-column
✔ Matches represent the maximum possible global similarity
✔ All ambiguity is resolved by global optimization
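The demo above can be reproduced end to end with SciPy, whose `linear_sum_assignment` implements the Hungarian method. A sketch, assuming NumPy and SciPy are available; the similarity values are those of the EmployeeID/FirstName/LastName/TotalHours example used in this document:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Similarity matrix (rows = columns of A, columns = columns of B).
a_cols = ["EmployeeID", "FirstName", "LastName", "TotalHours"]
b_cols = ["id_employee", "surname", "given_name", "hours_total_month"]
sim = np.array([
    [0.95, 0.10, 0.05, 0.02],
    [0.05, 0.10, 0.92, 0.03],
    [0.02, 0.88, 0.15, 0.01],
    [0.03, 0.05, 0.02, 0.97],
])

cost = 1.0 - sim                          # step 1-2: similarity -> cost
rows, cols = linear_sum_assignment(cost)  # steps 1-3 and 1-4, done internally

for i, j in zip(rows, cols):
    print(f"{a_cols[i]} -> {b_cols[j]} (sim={sim[i, j]:.2f})")
print("total similarity:", sim[rows, cols].sum())
```

Here each A-column's best B-column is distinct, so the optimal assignment simply picks each row's maximum, with total similarity 0.95 + 0.92 + 0.88 + 0.97 = 3.72.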


Below is an intuitive, context-specific explanation of the Hungarian algorithm, applied precisely to the problem of matching the column names of table A to the column names of table B using semantic similarity.


🎯 2–ALGORITHM | Goal in this problem

We have:

- the columns of table A (m names);
- the columns of table B (n names, possibly more than m).

We want to match each column of A to exactly one distinct column of B so that the total semantic similarity of the matched pairs is maximal.

This is exactly a linear assignment problem, solved optimally by the Hungarian algorithm.
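In symbols, with σ ranging over the injective assignments of A-columns to B-columns, the goal reads (a standard statement of the linear assignment problem):

```latex
\min_{\sigma}\ \sum_{i=1}^{m} \operatorname{Cost}\bigl(i,\sigma(i)\bigr)
\qquad\text{with}\qquad
\operatorname{Cost}(i,j) = 1 - \operatorname{Sim}(i,j),
```

so that minimizing the total cost is the same as maximizing the total similarity \(\sum_i \operatorname{Sim}(i,\sigma(i))\).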


🧠 Step 1 — Build a similarity matrix (semantic scores)

Compute the matrix

(1) Sim(i,j) = semantic similarity between column i of table A and column j of table B,

where i runs over the columns of A and j over the columns of B.

Example:

| A \ B | id_employee | surname | given_name | hours_total_month |
| --- | --- | --- | --- | --- |
| EmployeeID | 0.95 | 0.10 | 0.05 | 0.02 |
| FirstName | 0.05 | 0.10 | 0.92 | 0.03 |
| LastName | 0.02 | 0.88 | 0.15 | 0.01 |
| TotalHours | 0.03 | 0.05 | 0.02 | 0.97 |

🔄 Step 2 — Convert similarity to cost

The Hungarian algorithm minimizes total cost.

But we want to maximize total similarity.

So define:

(2) Cost(i,j) = 1 − Sim(i,j)

Thus, a high similarity yields a low cost and a low similarity yields a high cost.

This ensures the Hungarian algorithm, which minimizes total cost, will select high-semantic-similarity pairs.


πŸ—ΊοΈ Step 3 β€” Interpret the problem as a bipartite graph

Imagine a bipartite graph:

- the columns of A as nodes on the left;
- the columns of B as nodes on the right.

We want to draw m edges, each connecting one A-node to a distinct B-node, such that:

(3) Total cost is minimal ⇔ Total similarity is maximal.

πŸ… Step 4 β€” Why greedy matching fails

If you pick for each Aᵢ the Bⱼ with the highest similarity individually, you may get a collision:

Example (with illustrative values):

| A | Best match in B | Similarity |
| --- | --- | --- |
| FirstName | given_name | 0.92 |
| LastName | given_name | 0.88 |

The greedy strategy tries to match both A columns to the same B column, which is forbidden.

You then need a global strategy that resolves conflicts while preserving the optimal total similarity.

That is exactly what Hungarian solves optimally.
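The collision is easy to demonstrate. A minimal sketch with NumPy and SciPy, where the similarity values are hypothetical and chosen to trigger exactly the conflict described above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

a_cols = ["FirstName", "LastName"]
b_cols = ["surname", "given_name"]
sim = np.array([
    [0.10, 0.92],   # FirstName: prefers given_name
    [0.80, 0.88],   # LastName: also prefers given_name (hypothetical values)
])

# Greedy: each row independently picks its argmax -> both pick given_name.
greedy = [int(np.argmax(row)) for row in sim]
print("greedy picks:", [b_cols[j] for j in greedy])

# Hungarian: resolves the conflict globally on cost = 1 - similarity.
rows, cols = linear_sum_assignment(1.0 - sim)
for i, j in zip(rows, cols):
    print(f"{a_cols[i]} -> {b_cols[j]}")
```

The Hungarian solution keeps FirstName → given_name (0.92) and moves LastName to surname (0.80), for a total of 1.72 instead of the infeasible greedy choice.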


βš™οΈ Step 5 β€” Hungarian algorithm (intuitive explanation)

Think of it as:

A smart system that assigns each A-column to a unique B-column so that the total assignment cost is minimal.

Hungarian performs these conceptual operations:

1. Subtract row minima

Reduces each row so that the smallest cost becomes 0. This identifies potential good matches.

2. Subtract column minima

Ensures each column also has a 0 reachable, improving fairness.

3. Cover zeros with minimal number of lines

Zeros represent “promising candidate matches”. Covering them identifies whether an optimal assignment exists among them.

4. Adjust the matrix until an optimal set of zeros appears

If not enough zeros exist to assign all m columns, subtract the smallest uncovered value from every uncovered entry and add it to every entry covered by two lines; this creates new zeros, and the covering step is repeated.

5. Select exactly one zero per row and per column to create the optimal assignment

This gives the globally best matching.
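Steps 1 and 2 (the reductions) are easy to visualize with NumPy. A sketch on the running example's cost matrix, for intuition only (a full solver also performs the covering and adjustment steps):

```python
import numpy as np

# Cost matrix from the running example (cost = 1 - similarity).
cost = 1.0 - np.array([
    [0.95, 0.10, 0.05, 0.02],
    [0.05, 0.10, 0.92, 0.03],
    [0.02, 0.88, 0.15, 0.01],
    [0.03, 0.05, 0.02, 0.97],
])

# Step 1: subtract each row's minimum -> a zero appears in every row.
reduced = cost - cost.min(axis=1, keepdims=True)
# Step 2: subtract each column's minimum -> a zero appears in every column.
reduced -= reduced.min(axis=0, keepdims=True)

print((reduced == 0).any(axis=1))  # every row has a zero
print((reduced == 0).any(axis=0))  # every column has a zero
```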


🎁 Final result from the algorithm

You receive pairs such as (EmployeeID, id_employee), (FirstName, given_name), (LastName, surname), (TotalHours, hours_total_month).

Properties:

- the matching is one-to-one: no B-column is used twice;
- the total similarity of the selected pairs is globally maximal;
- pairs below a similarity threshold can be flagged as having no good match.


🧩 Why the Hungarian algorithm is perfect for the HCM project

✔ Ensures globally optimal matching

No heuristics, no local maxima: the result is guaranteed optimal.

✔ Resolves ambiguity

If multiple A columns are semantically close to the same B column, the algorithm distributes them optimally.

✔ Works with any similarity metric

TF-IDF, embeddings, Jaccard, Levenshtein… all work.

✔ Scales well

Even 200 × 400 similarity matrices are solved in milliseconds.

✔ Clean fallback

If similarity < threshold → “no good match”.


3–CODE EXAMPLE

Here's a concrete, reasonably robust pattern we can adopt in Python:


3–1. Core idea

For column names:

  1. Normalize (lowercase, strip punctuation, split on _, etc.).

  2. Turn all names into TF-IDF vectors.

  3. Compute cosine similarity between each column of A and each column of B.

  4. Build a cost matrix = 1 - similarity.

  5. Use the Hungarian algorithm to find the minimal total cost (i.e. the maximal total similarity).

  6. Optionally, drop matches below a min_similarity threshold.
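Step 1 (normalization) might look like this; `normalize` is a hypothetical helper, and the exact rules (camelCase splitting, separator handling) are design choices, not a fixed specification:

```python
import re

def normalize(name: str) -> str:
    """Lowercase a column name, split camelCase, and unify separators."""
    # Insert a space before interior capitals: "EmployeeID" -> "Employee ID"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Replace underscores, hyphens, and dots with spaces
    name = re.sub(r"[_\-.]+", " ", name)
    # Collapse whitespace and lowercase
    return re.sub(r"\s+", " ", name).strip().lower()

print(normalize("EmployeeID"))         # employee id
print(normalize("hours_total_month"))  # hours total month
```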


3–2. Full example code (ready to drop in a module)
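A possible module, assuming scikit-learn (TF-IDF vectors, cosine similarity) and SciPy (Hungarian solver) are available; the names `normalize` and `match_columns` are illustrative, not an established API:

```python
"""Match the column names of table A to the column names of table B."""
import re

from scipy.optimize import linear_sum_assignment
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def normalize(name: str) -> str:
    """Lowercase a column name, split camelCase, and unify separators."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)  # EmployeeID -> Employee ID
    name = re.sub(r"[_\-.]+", " ", name)                 # id_employee -> id employee
    return re.sub(r"\s+", " ", name).strip().lower()


def match_columns(cols_a, cols_b, min_similarity=0.2):
    """Return {column of A: column of B or None}, one-to-one, globally optimal."""
    norm_a = [normalize(c) for c in cols_a]
    norm_b = [normalize(c) for c in cols_b]

    # Character n-grams tolerate abbreviations, typos, and word reordering.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    tfidf = vectorizer.fit_transform(norm_a + norm_b)
    sim = cosine_similarity(tfidf[: len(cols_a)], tfidf[len(cols_a):])

    # Hungarian algorithm on cost = 1 - similarity (step 5 of the core idea).
    rows, cols = linear_sum_assignment(1.0 - sim)

    matches = {}
    for i, j in zip(rows, cols):
        # Clean fallback: reject weak matches instead of forcing them.
        matches[cols_a[i]] = cols_b[j] if sim[i, j] >= min_similarity else None
    return matches


if __name__ == "__main__":
    a = ["EmployeeID", "FirstName", "LastName", "TotalHours"]
    b = ["id_employee", "surname", "given_name", "hours_total_month"]
    for col_a, col_b in match_columns(a, b).items():
        print(f"{col_a} -> {col_b}")
```

The `min_similarity` threshold implements the clean fallback discussed above; any other similarity metric (embeddings, Levenshtein, Jaccard) can replace the TF-IDF/cosine step without touching the assignment logic.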


3–3. How to adapt/generalize this