[JustForFunPython] How to tune for the best parameters using Scikit Learn
Previously, we built a classifier to distinguish different languages by counting the frequency of each alphabet character. At that time, we did not tune the parameters inside the model but used the defaults. It showed okay performance, but in some tasks, tuning the parameters generates a humongous positive difference.
So, how do we tune? For example, in the Support Vector Machine provided by Scikit Learn, we can choose from many different kernel functions: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’, or a callable.
So, to test which kernel fits our dataset best, do we have to manually modify the value and run the training again and again? No, never worry about that. The library is already equipped with what we want: GridSearchCV.
GridSearchCV automatically inspects which parameter combination works best with our dataset. It is very easy and convenient.
First, set up some lists or dictionaries of candidate parameter values.
Second, feed the selection of parameters into the model.
Third, wait until the training ends.
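The three steps above can be sketched end to end. This is a minimal, self-contained example using scikit-learn's built-in iris dataset as a stand-in for our language dataset (our actual data loading comes from the previous post):

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Step 1: set up a dictionary of candidate parameter values.
params = {"C": [1, 10, 100], "kernel": ["linear", "rbf"]}

# Step 2: feed the selection of parameters into the model via GridSearchCV.
clf = GridSearchCV(svm.SVC(gamma="auto"), params)

# Step 3: train; GridSearchCV tries every combination with cross-validation.
clf.fit(iris.data, iris.target)
print(clf.best_params_)
```

The fitted `clf` then behaves like a regular classifier trained with the winning parameters.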
Take a look at the previous post (link below) about coding the classifier in Python.
We will modify it a bit, adding a few lines of code and wrapping one function call with another, to use the provided GridSearchCV method and aim for higher performance of our model.
First of all, we need to import GridSearchCV from the package. I added the following line in the very first part of my code.
from sklearn.model_selection import GridSearchCV
The next part we will modify is the following training code lines.
clf = svm.SVC(gamma = 'auto')
clf.fit(data['freqs'], data['labels'])
We will change the above into the new following code lines.
params = {"C": [1, 10, 100], 'kernel': ['linear', 'rbf', 'poly', 'sigmoid']}
clf = GridSearchCV(svm.SVC(gamma = 'auto'), params, n_jobs = 1)
clf.fit(data['freqs'], data['labels'])
And the rest is the same.
1–1. The params value is a dictionary of parameters that GridSearchCV will go over to find the best parameter combination for our model.
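To see exactly what "go over" means, scikit-learn's ParameterGrid can enumerate the combinations that GridSearchCV will try for this params dictionary (3 values of C times 4 kernels):

```python
from sklearn.model_selection import ParameterGrid

params = {"C": [1, 10, 100], "kernel": ["linear", "rbf", "poly", "sigmoid"]}

# Expand the dictionary into every concrete parameter combination.
grid = list(ParameterGrid(params))

print(len(grid))  # 3 x 4 = 12 combinations, each fitted with cross-validation
```

Keep in mind that each combination is trained several times (once per cross-validation fold), so large grids get expensive quickly.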
1–2. Do you see “n_jobs” in the GridSearchCV call? This parameter controls parallel processing in Python. The default value is 1, and if you set it to -1, it will automatically work in parallel depending on the number of cores of your machine. But, unless your dataset is small, I do not recommend changing this value, as it may cause a memory error.
To check the best combination of parameters, use the “.best_estimator_” attribute.
print(clf.best_estimator_)
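Besides best_estimator_, the fitted GridSearchCV object also exposes best_params_ (just the winning combination) and best_score_ (its mean cross-validated score). A small self-contained sketch on the iris dataset, standing in for our own data:

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
params = {"C": [1, 10], "kernel": ["linear", "rbf"]}

clf = GridSearchCV(svm.SVC(gamma="auto"), params)
clf.fit(iris.data, iris.target)

print(clf.best_params_)  # the winning parameter combination
print(clf.best_score_)   # mean cross-validated accuracy of that combination
```

best_score_ is handy for a quick sanity check that tuning actually improved on the default settings.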
According to mine, the best kernel function this time is ‘poly’. Considering that the default is ‘rbf’, I can be assured that GridSearchCV works properly.
(But, the performance did not advance further from the default one.)
Anyhow, we learned another new thing today!
Easy-Peasy huh?
GridSearch sounds intimidating if we have no prior knowledge about it, but now that we are equipped with this new skill, there is nothing to be scared of!
So, happy learning! And see you around! 🏃 😎