NIST

Special Project

Compute Visual Similarity of Top-Level Domains

Compute the visual similarity between a possible new Generic Top-Level Domain (TLD) and other proposed TLDs, current TLDs, and reserved words.

This web site describes experimental software developed at the National Institute of Standards and Technology (NIST). No algorithms, code, or descriptions in whole or in part are recommended, used, or endorsed by the Internet Corporation for Assigned Names and Numbers (ICANN) or any other entity.


Compare Against All

Compare a string to other proposed TLDs, current TLDs, including country codes, and reserved words.

Proposed Top-Level Domain:

Note: All characters other than 0-9, a-z, A-Z, and hyphen (-) are removed.


Compare Two Strings

Compare two strings to each other.

String 1:
String 2:

Note: All characters other than 0-9, a-z, A-Z, and hyphen (-) are removed.


Background

Computers connect the Internet with numbers, IP addresses. However, people use strings ending in short segments like .edu, .uk, .tv, and .com, called Top-Level Domains (TLDs) to navigate the World Wide Web. Ensuring the ongoing security and stability of the Domain Name System is one activity of the Internet Corporation for Assigned Names and Numbers (ICANN). With the growth of the Web, there is a possibility of a lot of new TLDs. As one method to implement the recommendation that, "Strings must not be confusingly similar to an existing top-level domain ...", this web page invokes an algorithm developed at NIST "to provide an open, objective, and predictable mechanism for assessing the degree of visual confusion" between proposed or existing TLDs.

The algorithm takes into account varying degrees of similarity between some 60 pairs, like 0 and O (zero and oh), 1 and l (one and L), Z and 2, h and n, rn (R and N) and m, and w and vv (v v). Even insertions or deletions may cause confusion, for example .aaaah and .aaaaah look very much alike. The task is all the more challenging because domain names are not case sensitive.

Does such confusion happen in real life? Here's a piece of text from my screen.

I actually read the second line as "w a m s", not "w a r n s", even though "wams" is not a word!

Note that the algorithm is not meant to consider phonetic similarity. For example, "fish", "phish", and "fiche" sound alike, but are visually distinct and unlikely to be confused.

Web Interface

The first form has the algorithm compare a string against proposed generic TLDs, existing TLDs, including country codes, and reserved words. It reports exact matches, near matches, and best matches if there are no near matches.

The second form uses the algorithm to compare two strings against each other.

Only alphabetic characters (a-z and A-Z), numerals (0-9), and hyphens (-) are allowed in strings. See RFC 1035 2.3.1.

The Code and The Algorithm

The score is an enhanced Levenshtein distance that is adjusted for length and normalized. Other possibilities for distance measures are Jaro-Winkler, Damerau-Levenshtein, cosine distance, and many others.

The code is written in Python. The interface to the algorithm itself is a single function, howConfusableAre(). It takes two parameters: the two strings to be compared.

HowConfusableAre() calls levenshtein() to compute a form of edit difference, then normalizes the score and accounts for string lengths.

Levenshtein() takes two strings. It is an enhanced Levenshtein distance algorithm that accounts for substituting two characters by one (or vice versa), inserting an additional repeated character, and transposition, as well as the usual insertion, deletion, and substitution. The result is roughly the number of visual differences between the strings. I say "roughly" because substituting O (upper case letter "o") for 0 (zero) is a much smaller difference than substituting, say, w for t. Levenshtein() calls two routines to find similarity, and hence cost, for substituting or transposing characters or "digraph" (character pairs): characterSimilarity() and digraphSimilarity().

Both characterSimilarity() and digraphSimilarity() take two strings (single characters in the case of characterSimilarity()). Both work much the same way: look up the passed strings in a table. If they are in the table, return the value in the table. If not, use the default: 0 means completely different and 1 means identical.

The code, data, and automatic test material are in the file visualSimilarity.tar, which is a tar file.

Cautions

The algorithm, consisting of the distance measure, the scoring formula, and the character similarities are mostly just my estimations. Several people have looked at resulting scores. I adjusted the formula and a few weights to correspond with feedback. Still the algorithm has not been systematically or independently validated.


Web page
Created Tue Feb 5 16:49:10 2008
by Paul E. Black  (paul.black@nist.gov)
Updated Mon Jun 9 13:47:03 2014
by Paul Black  (paul.black@nist.gov)

Contact:  webmaster@nist.gov
PRIVACY/SECURITY ISSUES
FOIADisclaimerUSA.gov
Information Technology Laboratory
Software and Systems Division
NIST is an agency of the U.S. Commerce Department