[JustForFunPython] Levenshtein Distance to quantify difference between two words

A Ydobon
5 min readJan 27, 2020

--

‘Levenshtein Distance’ is also known as ‘Edit Distance’. Simply put, it counts how many character operations (Replace/ Delete/ Insert) do we need to do to change one word to the other . For further and more thorough explanation, we have Wikipedia!

In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

Google search result!

For example, the Levenshtein Distance between ‘hello’ and ‘Hello’ is 1 as we need 1 replace operation between ‘h’ to ‘H’. From ‘Google’ to ‘GOOG’, the distance is 5, as we have to replace 2 lower letter ‘o’s to capital ‘O’s, another replacement from lower g to capital G, two deletions on the last two letters ‘l’ and ‘e’.

Google → GOogle → GOOgle → GOOGle → GOOGl → GOOG

(5 steps different, 5 Levenshtein distance)

Ready? Let’s dive into coding this interesting distance calculator!

def L_distance(a,b):
if a == b: return 0

1–1. If we happen to measure the difference between two exact words, then the distance should be 0.

def L_distance(a,b):
if a == b: return 0
a_len = len(a)
b_len = len(b)
if a == "": return b_len
if b == "": return a_len

2–1. We will store each input word’s length in different variables.

2–2. If one of the given is an empty space, then the distance should be the other word’s length.

def L_distance(a, b):
if a == b: return 0
a_len = len(a)
b_len = len(b)
if a == "": return b_len
if b == "": return a_len
matrix = [[] for i in range(a_len+1)]
for i in range(a_len +1):
matrix[i] = [0 for j in range(b_len +1)]

3–1. We prepare a 2 dimensional matrix initialized with zeros.

def L_distance(a, b):
if a == b: return 0
a_len = len(a)
b_len = len(b)
if a == "": return b_len
if b == "": return a_len
matrix = [[] for i in range(a_len+1)]
for i in range(a_len +1):
matrix[i] = [0 for j in range(b_len +1)]
for i in range(a_len+1):
matrix[i][0] = i
for j in range(b_len+1):
matrix[0][j] = j

4–1. The distance calculator uses a 2D matrix. For the first component of each i-th row, we put the row index number ‘i’.

4–2. For the first component of each j-th column, we put the column index number ‘j’.

def L_distance(a, b):
if a == b: return 0
a_len = len(a)
b_len = len(b)
if a == "": return b_len
if b == "": return a_len
matrix = [[] for i in range(a_len+1)]
for i in range(a_len +1):
matrix[i] = [0 for j in range(b_len +1)]
for i in range(a_len+1):
matrix[i][0] = i
for j in range(b_len+1):
matrix[0][j] = j
for i in range(1, a_len+1):
a_ = a[i-1]
for j in range(1, b_len+1):
b_ = b[i-1]
cost = 0 if (a_ == b_) else 1
matrix[i][j] = min([
matrix[i-1][j] +1,
matrix[i][j-1] +1,
matrix[i-1][j-1] + cost
])
return matrix[a_len][b_len]

5–1. Looks a bit complicated, right? This part is a code implementation of this formula.

Let me explain in pictures. I will draw them by myself. (I am somewhat renowned for poor drawings in my lab.)

Let’s consider measuring distance between ‘Google’ and ‘GOOG’.

Pic — 1

Pic — 1. When comparing blank(“”) with the word “Google”, the number of operation we need is 6, which is the same as the Levenshtein distance.

Pic — 2

Pic — 2. When comparing the first letter of the two, they are exactly same. Therefore, we don’t need any operation, thereby the distance is 0.

Pic — 3

Pic — 3. Between ‘Go’ and ‘G’, 1 distance away.

Can you see the pattern in terms of filling this table?

Pic — 4

Pic — 4. Take a look in the green box of each picture. Out of three figures, we select one that is minimum, and add 1.

Pic — 5

Pic — 5. This, we just coded.

    for i in range(1, a_len+1):
a_ = a[i-1]
for j in range(1, b_len+1):
b_ = b[i-1]
cost = 0 if (a_ == b_) else 1
matrix[i][j] = min([
matrix[i-1][j] +1,
matrix[i][j-1] +1,
matrix[i-1][j-1] + cost
])
return matrix[a_len][b_len]

When we fill up the table,

Pic — 6

We can see that the Levenshtein distance between “GOOG” and “Google” is 5.

You know what, this can work in any other language, try to put some Korean words in them and check out the distance between “신촌” and “산촌”. The distance will certainly be 1, as the only first letters are different.

Easy-Peasy!

Happy learning, and see you around! ☕️ 🍰

--

--

A Ydobon
A Ydobon

No responses yet