[JustForFunPython] N-gram to quantify similarity between sentences

A Ydobon
2 min read · Jan 27, 2020

N-gram: I like this word, because the word itself looks cute to me. More to the point, the method is useful for things like plagiarism checks, since it measures similarity by splitting each text into chunks of N consecutive characters and comparing those chunks one by one.

Yes, I drew it myself.

How does it work? Well, we are about to build one right now!

def ngram(sentence, num):
    tmp = []
    sent_len = len(sentence) - num + 1
    for i in range(sent_len):
        tmp.append(sentence[i:i+num])
    return tmp

1–1. The ‘ngram’ function takes two inputs: the sentence and the ‘N’ of N-gram. For example, if we feed the sentence ‘A cat sat on a mat.’ into this function with N = 2, the decomposed result is ['A ', ' c', 'ca', 'at', 't ', ' s', 'sa', 'at', 't ', ' o', 'on', 'n ', ' a', 'a ', ' m', 'ma', 'at', 't.'].
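As a quick check of the description above, here is a small, self-contained sketch. It restates the same logic as a list comprehension (my own shorthand, not the function as written in this post) and confirms that a sentence of length L yields L - N + 1 chunks.

```python
def ngram(sentence, num):
    # Slide a window of width `num` over the sentence, one character at a time.
    return [sentence[i:i + num] for i in range(len(sentence) - num + 1)]

sentence = 'A cat sat on a mat.'
bigrams = ngram(sentence, 2)

print(len(sentence), len(bigrams))  # 19 characters give 18 bigrams
print(bigrams[:4])                  # ['A ', ' c', 'ca', 'at']
print(ngram(sentence, 30))          # [] when N is longer than the sentence
```

Note that spaces and the final period are treated like any other character, which is why chunks such as 'A ' and 't.' show up.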

def diff_ngram(sent_a, sent_b, num):
    a = ngram(sent_a, num)
    b = ngram(sent_b, num)
    common = []
    cnt = 0
    for i in a:
        for j in b:
            if i == j:
                cnt += 1
                common.append(i)
    return cnt/len(a), common

2–1. Next, we build another function that uses ‘ngram()’, the function we just wrote.

2–2. Using ‘ngram()’, we decompose each input sentence into its N-grams and store the results in the variables ‘a’ and ‘b’.

2–3. Now we compare every element of ‘a’ against every element of ‘b’. Each time a pair matches, the counter ‘cnt’ goes up by 1 and the matching element is appended to the list ‘common’.

2–4. To quantify similarity, we divide ‘cnt’ by the length of the list ‘a’. We also return the list of common elements.
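One thing to keep in mind about the counting and dividing steps: because every element of ‘a’ is compared with every element of ‘b’, an N-gram that repeats in both sentences is counted more than once, so the score can go above 1. The sketch below is one possible variant, not the function from this post. The name ‘overlap_ratio’ is hypothetical; it counts each element of ‘a’ at most once by checking membership in a set, and it assumes that 0.0 is an acceptable score when ‘a’ is empty.

```python
def ngram(sentence, num):
    return [sentence[i:i + num] for i in range(len(sentence) - num + 1)]

def overlap_ratio(sent_a, sent_b, num):
    # Hypothetical variant: each n-gram of sent_a counts at most once.
    a = ngram(sent_a, num)
    b_set = set(ngram(sent_b, num))
    common = [gram for gram in a if gram in b_set]
    if not a:  # assumption: an empty n-gram list scores 0.0
        return 0.0, common
    return len(common) / len(a), common

ratio, common = overlap_ratio('aaa', 'aaa', 1)
print(ratio, common)  # 1.0 ['a', 'a', 'a']
```

With the nested loops, the same call ('aaa' against 'aaa' with N = 1) finds 9 matching pairs for 3 elements and returns 3.0. When no matching N-gram repeats in the second sentence, both approaches give the same score.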

To test this, I made up three sentences.

c = 'A cat sat on a mat.'
d = 'A cat singing in the rain.'
e = 'A dog sat on a mat.'

r2, word2 = diff_ngram(c, d, 2)
r3, word3 = diff_ngram(c, e, 3)
print("2-gram: ", r2, word2)
print("3-gram: ", r3, word3)

And the result is the following.

2-gram: 0.5555555555555556 ['A ', ' c', 'ca', 'at', 't ', ' s', 'at', 't ', 'n ', 'at']
3-gram: 0.7647058823529411 ['at ', ' sa', 'sat', 'at ', 't o', ' on', 'on ', 'n a', ' a ', 'a m', ' ma', 'mat', 'at.']
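Note that the score depends on which sentence comes first, because the match count is divided by the number of N-grams in the first sentence only. As a small illustration, the sketch below restates the two functions compactly and calls ‘diff_ngram’ in both directions. The strings ‘night’ and ‘nightly’ are my own made-up examples, not part of the test above.

```python
def ngram(sentence, num):
    return [sentence[i:i + num] for i in range(len(sentence) - num + 1)]

def diff_ngram(sent_a, sent_b, num):
    a = ngram(sent_a, num)
    b = ngram(sent_b, num)
    # Same pairwise comparison as the nested loops, written as a comprehension.
    common = [i for i in a for j in b if i == j]
    return len(common) / len(a), common

forward, _ = diff_ngram('night', 'nightly', 2)   # 4 matches / 4 bigrams
backward, _ = diff_ngram('nightly', 'night', 2)  # 4 matches / 6 bigrams
print(forward, backward)  # 1.0 0.6666666666666666
```

So a short text that is fully contained in a longer one scores 1.0 in one direction and less in the other. If you need a single number, it may be worth deciding up front which sentence is the reference.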

Easy-Peasy!

Happy learning and see you around! 🏃 😎 🌻
