Author: Olivier Vitrac, PhD, HDR | olivier.vitrac@adservio.fr | Adservio | 2025-12-11
# Matching column names between two tables
## 1. VISUAL DEMO: Hungarian Algorithm for Matching Column Names

We illustrate a small example:

- A has 4 columns
- B has 5 columns
- We compute a similarity matrix → cost matrix → Hungarian assignment
### 1-1. Similarity Matrix (A × B)

A greater cosine score means higher similarity.

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.95 | 0.10 | 0.05 | 0.02 | 0.01 |
| FName | 0.05 | 0.10 | 0.92 | 0.03 | 0.01 |
| LName | 0.02 | 0.88 | 0.15 | 0.01 | 0.01 |
| TotHrs | 0.03 | 0.05 | 0.02 | 0.97 | 0.01 |
### 1-2. Convert to Cost Matrix = (1 - similarity)

Lower cost = better match.

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.05 | 0.90 | 0.95 | 0.98 | 0.99 |
| FName | 0.95 | 0.90 | 0.08 | 0.97 | 0.99 |
| LName | 0.98 | 0.12 | 0.85 | 0.99 | 0.99 |
| TotHrs | 0.97 | 0.95 | 0.98 | 0.03 | 0.99 |
### 1-3. Hungarian Algorithm: Visual Steps

#### Step 1 - Subtract row minima

We make each row have a zero at its best match.

Row minima: [0.05, 0.08, 0.12, 0.03]. After subtracting the row minima:

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.00 | 0.85 | 0.90 | 0.93 | 0.94 |
| FName | 0.87 | 0.82 | 0.00 | 0.89 | 0.91 |
| LName | 0.86 | 0.00 | 0.73 | 0.87 | 0.87 |
| TotHrs | 0.94 | 0.92 | 0.95 | 0.00 | 0.96 |
#### Step 2 - Subtract column minima

Make each column also have a zero.

Column minima: [0.00, 0.00, 0.00, 0.00, 0.87]. After subtracting the column minima:

| A \ B | id_emp | surname | given_nm | hours | dept |
|---|---|---|---|---|---|
| EmpID | 0.00 | 0.85 | 0.90 | 0.93 | 0.07 |
| FName | 0.87 | 0.82 | 0.00 | 0.89 | 0.04 |
| LName | 0.86 | 0.00 | 0.73 | 0.87 | 0.00 |
| TotHrs | 0.94 | 0.92 | 0.95 | 0.00 | 0.09 |
Now we have multiple zeros appearing in meaningful places.
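Both reductions are plain row-wise and column-wise minimum subtractions. A minimal NumPy sketch on the demo cost matrix (values copied from the tables above):

```python
import numpy as np

# Demo cost matrix (1 - similarity) from the tables above.
# Rows = A columns (EmpID, FName, LName, TotHrs),
# columns = B columns (id_emp, surname, given_nm, hours, dept).
cost = np.array([
    [0.05, 0.90, 0.95, 0.98, 0.99],
    [0.95, 0.90, 0.08, 0.97, 0.99],
    [0.98, 0.12, 0.85, 0.99, 0.99],
    [0.97, 0.95, 0.98, 0.03, 0.99],
])

# Step 1: subtract each row's minimum so every row gets at least one zero.
reduced = cost - cost.min(axis=1, keepdims=True)

# Step 2: subtract each column's minimum so every column gets at least one zero.
reduced = reduced - reduced.min(axis=0, keepdims=True)

print(reduced.round(2))  # zeros mark the promising candidate matches
```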
#### Step 3 - Cover all zeros with the minimum number of lines

We draw horizontal/vertical lines to cover all 0 entries with as few lines as possible.

| A \ B | id_emp | surname | given_nm | hours | dept | covered by |
|---|---|---|---|---|---|---|
| EmpID | **0.00** | 0.85 | 0.90 | 0.93 | 0.07 | column line on id_emp |
| FName | 0.87 | 0.82 | **0.00** | 0.89 | 0.04 | row line |
| LName | 0.86 | **0.00** | 0.73 | 0.87 | **0.00** | row line |
| TotHrs | 0.94 | 0.92 | 0.95 | **0.00** | 0.09 | row line |

Minimum number of covering lines = 4 = number of rows → a solution exists.
### 1-4. Choose one zero per row and per column (optimal matching)

Since we can cover all zeros with m = 4 lines, the Hungarian conditions say:

- a perfect assignment exists among the zeros;
- no more matrix adjustments are needed.
We now select non-conflicting zeros:
Optimal zero choices:

- EmpID → id_emp (0)
- FName → given_nm (0)
- LName → surname (0)
- TotHrs → hours (0)
### 1-5. Final Semantic Alignment (maximizing total similarity)

| A column | B column | similarity |
|---|---|---|
| EmployeeID | id_employee | 0.95 |
| FirstName | given_name | 0.92 |
| LastName | surname | 0.88 |
| TotalHours | hours_total_month | 0.97 |

Global semantic matching = OPTIMAL:

- no two A-columns are matched to the same B-column;
- the matches represent the maximum possible global similarity;
- all ambiguity is resolved using global optimization.
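The whole demo can be reproduced in a few lines with SciPy's `linear_sum_assignment`; a minimal sketch where the similarity values are simply copied from the demo tables above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cols_a = ["EmpID", "FName", "LName", "TotHrs"]
cols_b = ["id_emp", "surname", "given_nm", "hours", "dept"]

# Similarity matrix from the demo (rows = A, columns = B).
sim = np.array([
    [0.95, 0.10, 0.05, 0.02, 0.01],
    [0.05, 0.10, 0.92, 0.03, 0.01],
    [0.02, 0.88, 0.15, 0.01, 0.01],
    [0.03, 0.05, 0.02, 0.97, 0.01],
])

# Hungarian assignment on the cost matrix (1 - similarity).
row_ind, col_ind = linear_sum_assignment(1.0 - sim)

for i, j in zip(row_ind, col_ind):
    print(f"{cols_a[i]} -> {cols_b[j]}  (similarity = {sim[i, j]:.2f})")
# Expected pairs: EmpID->id_emp, FName->given_nm, LName->surname, TotHrs->hours
```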
## 2. ALGORITHM: Goal in this problem

Below is an intuitive, context-specific explanation of the Hungarian algorithm for the problem of matching the column names of table A to the column names of table B using semantic similarity.
We have:

- table A with m column names;
- table B with n column names, where n ≥ m.

We want to match each column of A to exactly one distinct column of B so that:

- matched pairs are as semantically similar as possible;
- total similarity is maximized globally (not column by column independently).
This is exactly a linear assignment problem, solved optimally by the Hungarian algorithm.
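Formally, with $S_{ij}$ denoting the semantic similarity between column $i$ of A and column $j$ of B (notation introduced here for clarity, not taken from the project), the objective is

$$
\max_{\sigma}\; \sum_{i=1}^{m} S_{i,\sigma(i)}
\qquad \text{over injective maps } \sigma:\{1,\dots,m\}\to\{1,\dots,n\},
$$

which the Hungarian algorithm solves by minimizing the equivalent total cost $\sum_{i=1}^{m}\bigl(1 - S_{i,\sigma(i)}\bigr)$.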
### Step 1 - Build a similarity matrix (semantic scores)

Compute a matrix `similarity[i][j] = semantic_similarity(A_column_i, B_column_j)`, where i indexes the m columns of A, j indexes the n columns of B, and scores lie in [0, 1] (higher = more similar).

Example:
| A \ B | id_employee | surname | given_name | hours_total_month |
|---|---|---|---|---|
| EmployeeID | 0.95 | 0.10 | 0.05 | 0.02 |
| FirstName | 0.05 | 0.10 | 0.92 | 0.03 |
| LastName | 0.02 | 0.88 | 0.15 | 0.01 |
| TotalHours | 0.03 | 0.05 | 0.02 | 0.97 |
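For a quick feel of what such a matrix looks like in code, here is a crude character-level baseline using only the standard library (`difflib.SequenceMatcher`); it is a stand-in similarity, not the TF-IDF approach used in the full example later:

```python
import difflib
import numpy as np

cols_a = ["EmployeeID", "FirstName", "LastName", "TotalHours"]
cols_b = ["id_employee", "surname", "given_name", "hours_total_month"]

def crude_similarity(a: str, b: str) -> float:
    # Character-level ratio on lowercased names; a rough stand-in for real semantics.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# sim[i][j] = similarity between the i-th column of A and the j-th column of B
sim = np.array([[crude_similarity(a, b) for b in cols_b] for a in cols_a])
print(sim.round(2))
```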
### Step 2 - Convert similarity to cost

The Hungarian algorithm minimizes total cost, but we want to maximize total similarity. So define `cost[i][j] = 1 - similarity[i][j]`.

Thus:

- high similarity → low cost
- low similarity → high cost

This ensures the Hungarian algorithm will select high-semantic-similarity pairs.
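A small sketch of this conversion (the similarity matrix here is random placeholder data). Building `cost = 1 - sim` is one line, and recent SciPy versions (≥ 1.4) can alternatively maximize the similarity directly via `maximize=True`, which is equivalent because the two objectives differ only by a constant:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

sim = np.random.rand(4, 6)        # placeholder similarity matrix (rows = A, cols = B)

cost = 1.0 - sim                  # high similarity -> low cost
rows, cols = linear_sum_assignment(cost)

# Equivalent formulation without an explicit cost matrix (SciPy >= 1.4):
rows_max, cols_max = linear_sum_assignment(sim, maximize=True)

# Both calls describe the same optimal assignment
# (up to ties between equally good matchings).
print(list(zip(rows, cols)))
print(list(zip(rows_max, cols_max)))
```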
### Step 3 - Interpret the problem as a bipartite graph

Imagine:

- left-side nodes = columns of A;
- right-side nodes = columns of B;
- every possible pair (Aᵢ, Bⱼ) has a cost.

We want to draw m edges, each connecting one column of A to one distinct column of B, such that the total cost of the selected edges is minimal (equivalently, the total similarity is maximal); the brute-force sketch below makes this concrete.
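The sketch enumerates every injective assignment on the demo similarity values from section 1 (copied here as plain numbers); it only works for tiny tables, which is exactly why a polynomial-time method such as the Hungarian algorithm is needed:

```python
from itertools import permutations
import numpy as np

# Demo similarity matrix from section 1 (rows = A columns, columns = B columns).
sim = np.array([
    [0.95, 0.10, 0.05, 0.02, 0.01],
    [0.05, 0.10, 0.92, 0.03, 0.01],
    [0.02, 0.88, 0.15, 0.01, 0.01],
    [0.03, 0.05, 0.02, 0.97, 0.01],
])
m, n = sim.shape

def total_similarity(assignment):
    # assignment[i] = index of the B column matched to the i-th A column
    return sum(sim[i, j] for i, j in enumerate(assignment))

# Every ordered choice of m distinct B columns is one candidate set of edges.
best = max(permutations(range(n), m), key=total_similarity)
print(best, round(total_similarity(best), 2))   # (0, 2, 1, 3) with total 3.72
```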
### Step 4 - Why greedy matching fails

If you pick for each Aᵢ the Bⱼ with the highest similarity individually, you may get a collision:
Example:
| A | Best match in B | Similarity |
|---|---|---|
| FirstName | given_name | 0.92 |
| LastName | given_name | 0.88 |
The greedy strategy tries to match both A columns to the same B column → forbidden.
You then need a global strategy that resolves conflicts while preserving the optimal total similarity.
That is exactly what the Hungarian algorithm solves optimally.
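A tiny runnable contrast between the greedy row-wise argmax and the Hungarian assignment; the similarity numbers are invented so that the collision above occurs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cols_a = ["FirstName", "LastName"]
cols_b = ["given_name", "surname"]

# Invented similarities: both A columns individually prefer "given_name".
sim = np.array([
    [0.92, 0.10],   # FirstName
    [0.88, 0.85],   # LastName
])

# Greedy: each row picks its own best column -> both pick "given_name" (collision).
greedy = sim.argmax(axis=1)
print("greedy picks:", [cols_b[j] for j in greedy])   # ['given_name', 'given_name']

# Hungarian: resolves the conflict globally.
rows, cols = linear_sum_assignment(1.0 - sim)
print("hungarian:", [(cols_a[i], cols_b[j]) for i, j in zip(rows, cols)])
# [('FirstName', 'given_name'), ('LastName', 'surname')]
```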
### Step 5 - Hungarian algorithm (intuitive explanation)

Think of it as a smart system that assigns each A-column to a unique B-column so that the total assignment cost is minimal.
The Hungarian algorithm performs these conceptual operations:

1. **Subtract row minima.** Reduce each row so that its smallest cost becomes 0; this identifies potential good matches.
2. **Subtract column minima.** Ensure each column also has a 0 reachable, improving fairness.
3. **Cover zeros with a minimal number of lines.** Zeros represent "promising candidate matches"; covering them identifies whether an optimal assignment exists among them.
4. **Adjust the matrix until an optimal set of zeros appears.** If not enough zeros exist to assign all m columns: reduce uncovered elements, increase covered ones, and generate new zeros.
5. **Select exactly one zero per row to create the optimal assignment.** This gives the globally best matching.
### Final result from the algorithm

You receive pairs like:

- A: EmployeeID → B: id_employee (similarity = 0.95)
- A: FirstName → B: given_name (similarity = 0.92)
- A: LastName → B: surname (similarity = 0.88)
- A: TotalHours → B: hours_total_month (similarity = 0.97)
Properties:

- Every column in A is matched exactly once.
- No two A columns share the same B column.
- The total similarity is globally maximized, not heuristically chosen.
### Why the Hungarian algorithm is perfect for the HCM project

- **Ensures globally optimal matching.** No heuristic, no local maximum: the result is guaranteed optimal.
- **Resolves ambiguity.** If multiple A columns are semantically close to the same B column, the algorithm distributes them optimally.
- **Works with any similarity metric.** TF-IDF, embeddings, Jaccard, Levenshtein… all fine.
- **Scales well.** Even 200×400 similarity matrices run in milliseconds.
- **Clean fallback.** If similarity < threshold → "no good match".
## 3. CODE EXAMPLE

### 3-1. Core idea

Here is a concrete, reasonably robust pattern we can adopt in Python:
- A and B are tables (e.g. pandas DataFrames) or, more generally, two lists (I do not recommend sets) storing the column names.
- We build a similarity matrix between column names using TF-IDF + cosine similarity (simple, language-agnostic, decent semantics).
- We convert similarity → cost and run the Hungarian algorithm (`linear_sum_assignment`) to get the best one-to-one mapping.
- B can have more columns than A: we match every column of A to a distinct column of B.
For column names:

1. Normalize (lowercase, strip punctuation, split on `_`, etc.).
2. Turn all names into TF-IDF vectors.
3. Compute cosine similarity between each column of A and each column of B.
4. Build a cost matrix = 1 - similarity.
5. Use the Hungarian algorithm to find the minimal total cost (i.e. maximal total similarity).
6. Optionally, drop matches below a `min_similarity` threshold.
### 3-2. Full example code (ready to drop in a module)

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Dict, Tuple

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# ---------- Data structures ----------

@dataclass
class ColumnMatch:
    """Represents the match of a column in A to a column in B."""
    col_a: str
    col_b: Optional[str]  # None if no reasonable match found
    similarity: float
    index_a: int
    index_b: Optional[int]


# ---------- Text normalization utilities ----------

def normalize_name(name: str) -> str:
    """
    Normalize a column name for semantic comparison.

    Operations:
    - lowercasing
    - replace separators (_, -, .) by space
    - remove non-alphanumeric chars
    - collapse multiple spaces
    """
    s = name.lower()
    s = s.replace("_", " ").replace("-", " ").replace(".", " ")
    s = re.sub(r"[^a-z0-9\s]", " ", s)  # keep only letters, digits, spaces
    s = re.sub(r"\s+", " ", s).strip()
    return s


def apply_synonyms(name: str, synonyms: Dict[str, str]) -> str:
    """
    Very simple synonym expansion: replace full tokens using a dict.
    E.g. {"surname": "last name", "forename": "first name"}.
    """
    tokens = name.split()
    replaced = [synonyms.get(tok, tok) for tok in tokens]
    return " ".join(replaced)


# ---------- Similarity matrix construction ----------

def build_similarity_matrix(
    cols_a: Sequence[str],
    cols_b: Sequence[str],
    synonyms: Optional[Dict[str, str]] = None,
) -> np.ndarray:
    """
    Build an |A| x |B| similarity matrix between column names
    using TF-IDF + cosine similarity.

    Parameters
    ----------
    cols_a : list of str
        Column names of table A (to be matched).
    cols_b : list of str
        Column names of table B (reference / superset).
    synonyms : dict, optional
        Optional mapping token -> replacement token applied after normalization.

    Returns
    -------
    sim : np.ndarray of shape (len(cols_a), len(cols_b))
        Similarity scores in [0, 1].
    """
    # Normalize and optionally apply synonyms
    def preprocess_list(cols: Sequence[str]) -> List[str]:
        out = []
        for c in cols:
            s = normalize_name(c)
            if synonyms:
                s = apply_synonyms(s, synonyms)
            out.append(s)
        return out

    norm_a = preprocess_list(cols_a)
    norm_b = preprocess_list(cols_b)

    # Fit TF-IDF on all names together
    all_names = norm_a + norm_b
    vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # 1-3 grams to catch short phrases
    tfidf = vectorizer.fit_transform(all_names)

    tfidf_a = tfidf[: len(norm_a), :]
    tfidf_b = tfidf[len(norm_a):, :]

    # Cosine similarity A x B
    sim = cosine_similarity(tfidf_a, tfidf_b)
    return sim


# ---------- Hungarian-based matcher ----------

def best_column_matching(
    cols_a: Sequence[str],
    cols_b: Sequence[str],
    synonyms: Optional[Dict[str, str]] = None,
    min_similarity: float = 0.3,
) -> Tuple[List[ColumnMatch], np.ndarray]:
    """
    Compute best one-to-one matching between columns of A and B
    using the Hungarian algorithm on a semantic similarity matrix.

    Parameters
    ----------
    cols_a : list of str
        Column names of table A (to be matched).
    cols_b : list of str
        Column names of table B (reference / superset).
    synonyms : dict, optional
        Optional mapping token -> replacement token (e.g. "surname" -> "last name").
    min_similarity : float
        Minimum similarity required to accept a match. If the best similarity
        for a column in A is below this threshold, the match is considered "None".

    Returns
    -------
    matches : list of ColumnMatch
        One entry per column in A.
    sim_matrix : np.ndarray
        Full similarity matrix (shape |A| x |B|).
    """
    if len(cols_a) == 0:
        return [], np.zeros((0, len(cols_b)))

    sim = build_similarity_matrix(cols_a, cols_b, synonyms=synonyms)
    # Hungarian solves a MIN cost assignment, so we transform:
    cost = 1.0 - sim

    # Hungarian algorithm
    row_ind, col_ind = linear_sum_assignment(cost)
    # Note: row_ind is [0..len(cols_a)-1], col_ind is the chosen subset of B

    matches: List[ColumnMatch] = []
    for i_a, i_b in zip(row_ind, col_ind):
        s = float(sim[i_a, i_b])
        if s >= min_similarity:
            matches.append(
                ColumnMatch(
                    col_a=cols_a[i_a],
                    col_b=cols_b[i_b],
                    similarity=s,
                    index_a=i_a,
                    index_b=i_b,
                )
            )
        else:
            # no acceptable match
            matches.append(
                ColumnMatch(
                    col_a=cols_a[i_a],
                    col_b=None,
                    similarity=s,
                    index_a=i_a,
                    index_b=None,
                )
            )

    # Ensure matches are ordered by index_a (same order as A)
    matches.sort(key=lambda m: m.index_a)

    return matches, sim


# ---------- Convenience wrapper for Pandas ----------

def match_dataframe_columns(
    df_a,
    df_b,
    synonyms: Optional[Dict[str, str]] = None,
    min_similarity: float = 0.3,
) -> List[ColumnMatch]:
    """
    Convenience wrapper when A and B are pandas.DataFrame.

    Returns a list of ColumnMatch.
    """
    cols_a = list(df_a.columns)
    cols_b = list(df_b.columns)
    matches, _ = best_column_matching(
        cols_a, cols_b, synonyms=synonyms, min_similarity=min_similarity
    )
    return matches


# ---------- Example usage as a script ----------

if __name__ == "__main__":
    # EXAMPLE WITHOUT PANDAS (just lists of names)
    cols_a = ["EmployeeID", "FirstName", "LastName", "TotalHours"]
    cols_b = [
        "id_employee",
        "hours_total_month",
        "surname",
        "given_name",
        "department_code",
        "creation_date",
    ]

    synonyms = {
        # token-level synonym expansion (after normalization)
        "firstname": "first name",
        "lastname": "last name",
        "surname": "last name",
        "given": "first",  # crude but illustrates the idea
        "employeeid": "employee id",
    }

    matches, sim_matrix = best_column_matching(
        cols_a, cols_b, synonyms=synonyms, min_similarity=0.25
    )

    print("Similarity matrix (A x B):")
    print(sim_matrix)

    print("\nBest matches:")
    for m in matches:
        if m.col_b is None:
            print(
                f"A: {m.col_a:<15} -> NO GOOD MATCH (best similarity = {m.similarity:.3f})"
            )
        else:
            print(
                f"A: {m.col_a:<15} -> B: {m.col_b:<20} (similarity = {m.similarity:.3f})"
            )
```

### 3-3. How can we adapt/generalize this

**Swap TF-IDF for embeddings** for deeper semantics (e.g. `sentence-transformers`):
- Replace the TF-IDF part of `build_similarity_matrix` by a call to a pre-loaded embedding model.
- Similarity remains the cosine between A-vectors and B-vectors (a sketch follows below).
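A hedged sketch of that swap, assuming the `sentence-transformers` package is installed and using `all-MiniLM-L6-v2` as an example model (both the usage pattern and the model choice are assumptions, not project requirements); it reuses `normalize_name` from the module above:

```python
# Sketch only: requires `pip install sentence-transformers`; the model name is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

_model = SentenceTransformer("all-MiniLM-L6-v2")  # load once, reuse across calls

def build_similarity_matrix_embeddings(cols_a, cols_b) -> np.ndarray:
    """Same contract as build_similarity_matrix, with TF-IDF replaced by embeddings."""
    norm_a = [normalize_name(c) for c in cols_a]   # reuse the normalization defined above
    norm_b = [normalize_name(c) for c in cols_b]
    emb_a = _model.encode(norm_a)                  # shape (|A|, d)
    emb_b = _model.encode(norm_b)                  # shape (|B|, d)
    return cosine_similarity(emb_a, emb_b)         # shape (|A|, |B|), roughly in [-1, 1]
```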
**Tune `min_similarity`:**

- Higher (0.5–0.7) if the labels are relatively clean and English-like.
- Lower (0.2–0.3) if labels are noisy, multilingual, or very short.
**Use the `ColumnMatch` structure to:**

- automatically rename A's columns to the matched ones from B;
- flag columns with `col_b is None` for manual inspection (see the sketch below).
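For instance, a small illustrative usage sketch on top of `match_dataframe_columns` (the frames and threshold are invented for the example):

```python
import pandas as pd

# Tiny illustrative frames; real tables would come from elsewhere.
df_a = pd.DataFrame(columns=["EmployeeID", "FirstName", "LastName", "TotalHours"])
df_b = pd.DataFrame(columns=["id_employee", "surname", "given_name",
                             "hours_total_month", "department_code"])

matches = match_dataframe_columns(df_a, df_b, min_similarity=0.25)

# Rename A's columns to B's vocabulary where a match was accepted...
rename_map = {m.col_a: m.col_b for m in matches if m.col_b is not None}
df_a_renamed = df_a.rename(columns=rename_map)

# ...and flag the rest for manual inspection.
unmatched = [m.col_a for m in matches if m.col_b is None]
print("renamed:", rename_map)
print("needs manual review:", unmatched)
```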