ILCSS — Iterative Longest Common SubStrings Similarity

A string similarity measure specifically designed for approximate matching of multi-part personal names from heterogeneous sources.

Preprint: https://doi.org/10.5281/zenodo.19462182

Overview

ILCSS addresses a problem that defeats character-level measures such as Levenshtein distance and Jaro-Winkler: matching names whose components appear in different orders or with a different number of parts across administrative databases, criminal intelligence systems, or bibliographic records.

The algorithm iteratively extracts the longest common substrings shared by two strings, consuming matched blocks at each step and applying a positional penalty that attenuates with block length, thereby rewarding component-level agreement while retaining sensitivity to ordering conventions.

Example: abdullah abdel aziz vs aziz abdullah abdel → ILCSS = 0.93

Performance

Dataset	F1	Score gap (same vs different person)
Synthetic (295 pairs, 5 naming traditions)	0.998	0.729
DBLP-ACM (bibliographic records)	0.994	0.758
FEBRL (civil registry, name swaps)	0.989	0.816

Competing measures (Levenshtein, Jaro-Winkler, Ratcliff/Obershelp) achieve score gaps of 0.262–0.361 on the same synthetic benchmark.

Repository Structure

ILCSS_sim.pl          Perl reference implementation
ILCSS_sim.cr          Crystal implementation
t_LCS.cr              LCS with full matrix (Crystal)
t_LCS_opt.cr          Space-optimised LCS, returns all maximal blocks (Crystal)
ILCSS_paper.pdf       Published preprint
ILCSS_paper.tex       LaTeX source
eval/
  eval_ilcss.py       Synthetic evaluation (295 pairs, 5 naming traditions)
  eval_real.py        Real-world evaluation (DBLP-ACM, FEBRL)
  results.csv         Synthetic results
  results_dblp_acm.csv
  results_febrl.csv
  FEBRL_dataset1.csv  FEBRL benchmark dataset (BSD-3-Clause)
  DBLP-ACM/           Leipzig benchmark dataset (CC BY 4.0)
literature/
  references.bib      BibTeX references (29 entries)

Algorithm

function ilcss_sim(s1, s2):
    remove spaces from s1, s2
    L = max(len(s1), len(s2))
    score = 0
    loop:
        block = longest_common_substring(s1, s2)
        if len(block) < 3: break
        pos1 = s1.find(block)
        pos2 = s2.find(block)
        r_len     = len(block) / L
        r_dist    = |pos1 - pos2| / L
        r_penalty = r_dist * (1 - r_len)
        score    += len(block) * (1 - r_penalty)
        mask block in s1 and s2
    return score / L             # normalise to [0, 1]

Recommended matching threshold: 0.40

Running the Evaluation

Requires Python 3.8+ with jellyfish and difflib (standard library):

pip install jellyfish
python eval/eval_ilcss.py    # synthetic evaluation
python eval/eval_real.py     # DBLP-ACM + FEBRL evaluation

Citation

@misc{calao2026ilcss,
  author    = {Calao, Raoul Sotero},
  title     = {{ILCSS}: An Iterative Longest Common SubStrings Similarity
               Algorithm for Multi-Part Personal Name Matching},
  year      = {2026},
  doi       = {10.5281/zenodo.19462182},
  url       = {https://doi.org/10.5281/zenodo.19462182},
  note      = {Preprint}
}

License

Code: MIT License

Datasets:

DBLP-ACM: CC BY 4.0 — Köpcke et al., Leipzig University
FEBRL: BSD-3-Clause — Christen & Goiser

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ILCSS — Iterative Longest Common SubStrings Similarity

Overview

Performance

Repository Structure

Algorithm

Running the Evaluation

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
eval		eval
literature		literature
ILCSS_paper.pdf		ILCSS_paper.pdf
ILCSS_paper.tex		ILCSS_paper.tex
ILCSS_sim.cr		ILCSS_sim.cr
ILCSS_sim.pl		ILCSS_sim.pl
ILCSS_sim.py		ILCSS_sim.py
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

ILCSS — Iterative Longest Common SubStrings Similarity

Overview

Performance

Repository Structure

Algorithm

Running the Evaluation

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages