[JustForFunPython] Classifying languages based on the frequency of each alphabet letter

A Ydobon
4 min read · Jan 25, 2020

I heard a strikingly interesting fact the other day: linguists have long known that languages can be told apart by how often each letter of the alphabet occurs in their texts.

Based on this, we can assume that by drawing a histogram of letter frequencies for each language, we can distinguish French from other languages such as English or Indonesian.

So, today we will build a genius language classifier with help from scikit-learn. If you are unfamiliar with this popular Python library, take a look at its official documentation and install it on your machine first.

My training dataset consists of 15 text files: five each for French, English, and Indonesian. You can scrape any web pages to build such a dataset yourself. Each file is named like ‘en-1’, ‘en-2’, ‘id-3’, or ‘fr-7’. I mention this because the upcoming frequency-checking function will use the filename to label the languages.

(Screenshot: inside my training folder)

The test dataset works the same way, with two files each for French, English, and Indonesian. The two sets live in separate ‘train’ and ‘test’ folders.
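If you want to double-check your layout before moving on, a quick glob sanity check (assuming the ./train and ./test paths used later in this post) could look like this:

import glob

# Sanity check: these patterns should list the 15 training and 6 test files.
print(sorted(glob.glob('./train/*.txt')))
print(sorted(glob.glob('./test/*.txt')))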

So, let’s dive in!

First of all, we will import the libraries that we will use.

from sklearn import svm, metrics
import glob, os.path, re, json

Second, we will make a function that counts the frequency of each alphabet character.

def check_freq(fname):
    name = os.path.basename(fname)
    lang = re.match(r'^[a-z]{2,}', name).group()
    with open(fname, 'r', encoding='utf-8') as f:
        text = f.read()
        text = text.lower()

2–1. Files of different languages are grouped together, and by using a regular expression, I can label them based on the first two letters of the filename: English is en, French is fr, and Indonesian is id.

2–2. After reading each text file, I convert it to lower case only, using the string method lower().
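To see what the regular expression actually extracts, here is a minimal standalone check with a few sample filenames:

import re

# The match grabs the leading run of lowercase letters, i.e. the language code.
for name in ['en-1.txt', 'fr-7.txt', 'id-3.txt']:
    print(re.match(r'^[a-z]{2,}', name).group())  # prints: en, fr, id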

And third, I will make a counter, named ‘cnt’ in the function below.

def check_freq(fname):
    name = os.path.basename(fname)
    lang = re.match(r'^[a-z]{2,}', name).group()
    with open(fname, 'r', encoding='utf-8') as f:
        text = f.read()
        text = text.lower()
    cnt = [0 for n in range(0, 26)]
    code_a = ord('a')
    code_z = ord('z')
    for ch in text:
        n = ord(ch)
        if code_a <= n <= code_z:
            cnt[n - code_a] += 1

3–1. Our counter ‘cnt’ is initialized to a list of 26 zeros, one per letter.

3–2. The ord() function returns a character’s Unicode code point (which matches ASCII for plain letters). We store the smallest value in ‘code_a’ using ord(‘a’), and the largest in ‘code_z’ using ord(‘z’).

3–3. For each of the 26 letters, we add 1 to its slot every time we spot that letter in the text; any other character is ignored.
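Here is the same counting logic run on a toy string, so you can see exactly what ends up in ‘cnt’:

# Toy run of the counting loop; non-letters like '!' are skipped.
cnt = [0] * 26
code_a, code_z = ord('a'), ord('z')  # 97 and 122
for ch in 'abba!':
    n = ord(ch)
    if code_a <= n <= code_z:
        cnt[n - code_a] += 1
print(cnt[:3])  # [2, 2, 0]: two a's, two b's, no c's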

Next, in the fourth step, we will normalize the counted frequencies for each language, as the volume of text differs from file to file.

def check_freq(fname):
    name = os.path.basename(fname)
    lang = re.match(r'^[a-z]{2,}', name).group()
    with open(fname, 'r', encoding='utf-8') as f:
        text = f.read()
        text = text.lower()
    cnt = [0 for n in range(0, 26)]
    code_a = ord('a')
    code_z = ord('z')
    for ch in text:
        n = ord(ch)
        if code_a <= n <= code_z:
            cnt[n - code_a] += 1
    total = sum(cnt)
    freq = list(map(lambda n: n / total, cnt))
    return (freq, lang)

4–1. We sum up the values in ‘cnt’ and divide each letter’s count by that total.

4–2. The normalized frequencies are stored in ‘freq’, and the language label in ‘lang’. For example, the return value looks like this: ([0.164, 0.026, …, 0.00460, 0.00046, 0.0148, 0.00058], ‘fr’)
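If you want to see the normalization step in isolation, here is a tiny example with made-up counts:

# Made-up counts for 'a', 'b', 'c' only, to show the division step.
cnt = [2, 2, 0]
total = sum(cnt)
freq = list(map(lambda n: n / total, cnt))
print(freq)  # [0.5, 0.5, 0.0]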

The main part is done, and now we will make another function to load the text files.

def load_files(path):
    freqs = []
    labels = []
    file_list = glob.glob(path)
    for fname in file_list:
        r = check_freq(fname)
        freqs.append(r[0])
        labels.append(r[1])
    return {"freqs": freqs, "labels": labels}

5–1. “freqs” holds all the normalized frequency lists, and “labels” holds the corresponding language labels.

5–2. The final return value is a dictionary, because we will store the result in JSON format later.

Next, we will feed our prepared data to the load_files function we just built.

data = load_files("./train/*.txt")
test = load_files("./test/*.txt")
with open("freq.json", "w", encoding='utf-8') as f:
    json.dump([data, test], f)
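The JSON file also works as a cache: on a later run, you can reload the frequencies instead of re-reading all the texts. A minimal sketch, assuming the freq.json written above:

import json

# Reload the cached frequencies written by json.dump above.
with open("freq.json", "r", encoding='utf-8') as f:
    data, test = json.load(f)
print(len(data['freqs']), len(test['freqs']))  # 15 and 6 with the files above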

And then, we will train the model with the Support Vector Machine classifier provided by scikit-learn.

clf = svm.SVC(gamma='auto')
clf.fit(data['freqs'], data['labels'])

6–1. We create a classifier object by calling svm.SVC(), and then feed it our prepared dataset using the fit() method. Note that the key is ‘freqs’, matching the dictionary returned by load_files.

At this point, we should check how our machine performs on unseen data, also known as the ‘test’ dataset.

predict = clf.predict(test['freqs'])

7–1. To train the model, we use the fit() method; to test it, we use the predict() method in scikit-learn.
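The same pair works for any new text: run it through check_freq first, then hand the frequencies to predict(). A sketch with a hypothetical file path:

# Hypothetical new file; check_freq returns (frequencies, label-from-filename).
freq, true_label = check_freq('./test/fr-1.txt')
print(clf.predict([freq]))  # e.g. ['fr']; predict expects a list of samples
print(true_label)           # label parsed from the filename, for comparison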

Now, done!

Let’s check the performance in concrete numeric figures.

accuracy_score = metrics.accuracy_score(test['labels'], predict) 
classification_report = metrics.classification_report(test['labels'], predict)

I printed the two results out.
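For reference, that is just two print calls:

print("Accuracy:", accuracy_score)
print(classification_report)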

Oh, by the way, if you are not familiar with interpreting a classification report, don’t worry, I already posted something to lessen your stress with the numbers!

So, how was it?

Easy-Peasy!

Happy learning and see you around! 🍰 🏃
