Machine learning has become a stepping stone into the future of computing. It provides various algorithms for solving different problems, one of which is classification. Classification recognizes, understands, and groups ideas and objects into categories or classes. Now, let’s see some types of classification algorithms in machine learning.
1. Logistic Regression
Logistic regression is a supervised machine learning technique used for classification problems. Here, we predict a categorical dependent variable from the given independent variables. The predicted outcome is binary: yes or no, 0 or 1, etc. Logistic regression statistically analyzes the relationship between the dependent and independent variables, much as linear regression fits a regression line to data, but it passes the result through the sigmoid (aka logistic) function to produce a prediction. It sits at the top of the types of classification algorithms in machine learning because its working principle is nearly as simple as linear regression’s. The sigmoid function is a mathematical S-shaped curve used to convert values into probabilities.
Mathematically, we define the probabilities of outcomes and events by measuring the impact of multiple variables in the given data. Logistic regression plays an important role in machine learning because of its ability to provide probabilities and classification of new data using historic continuous or discrete data. There are two assumptions for logistic regression:
- The nature of the dependent variable must be categorical
- No multi-collinearity in independent variables
The equation for logistic regression is,
log[y / (1 − y)] = b0 + b1x1 + b2x2 + b3x3 + …
where, log[y / (1 − y)] is the logarithm of the odds of the dependent variable
b0 is the y-intercept
x1, x2, x3… are the independent variables
b1, b2, b3… are the slope coefficients
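The role of the sigmoid in this equation can be sketched in a few lines of Python. The coefficients b0 and b1 below are made up for illustration, not fitted to any data:

```python
import math

def sigmoid(z):
    """Map any real-valued log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration: log-odds = b0 + b1*x
b0, b1 = -3.0, 1.5
for x in [0, 1, 2, 3, 4]:
    p = sigmoid(b0 + b1 * x)
    print(f"x={x}: P(y=1) = {p:.3f}")
```

As x grows, the log-odds grow linearly, but the sigmoid squashes them into a probability that sweeps from near 0 to near 1.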
Based on the number of outcome categories, logistic regression is of three types.
- Binomial – There are only two possible outcomes, 0 or 1, yes or no, etc.
- Multinomial – When there are three or more unordered possible outcomes. For example, categories of water bodies, ‘sea’, ‘lake’, or ‘river’.
- Ordinal – When there can be three or more ordered possible outcomes, such as ‘good’, ‘better’, or ‘best’.
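A minimal binomial logistic regression can be fitted with scikit-learn. The toy dataset below (hours studied versus pass/fail) is purely illustrative:

```python
from sklearn.linear_model import LogisticRegression

# Toy binary data: hours studied -> pass (1) / fail (0); values are illustrative
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5], [5.5]]))   # predicted classes
print(clf.predict_proba([[3.5]]))    # probabilities for each class
```

`predict_proba` exposes the probabilities the model computes via the sigmoid, which is exactly what makes logistic regression useful beyond a hard yes/no answer.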
2. K-nearest Neighbor
K-nearest neighbor, or KNN, is a non-parametric supervised learning classifier. It is one of the simplest techniques in machine learning and is used for both regression and classification problems. The algorithm performs neighbors-based classification, a type of lazy learning: it does not learn from the training data immediately but stores it until the prediction stage. KNN identifies a new data point by finding the similarity between it and the stored training set, and puts it into the category it finds most similar to the available data. The classification is decided by a majority vote of the k nearest neighbors of the data point. Mathematically, the Euclidean distance between the new data point and the training data points is calculated, and the new point is assigned to the class that accounts for most of its k nearest neighbors.
What decides the ‘k’ in KNN?
‘K’ indicates the number of neighbors considered for each data point. It is a hyperparameter in KNN, which has to be decided beforehand to get the most suitable fit for the data set. When k is small, the fit adjusts closely to the data, giving low bias but high variance and sensitivity to noise. When k has a higher value, the model is more robust to outliers and has lower variance but higher bias. There is no single right way to find the best value of ‘k’; it depends on the dataset, but the most commonly preferred starting value is 5.
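The voting mechanism is easy to see with scikit-learn on a tiny made-up dataset of two well-separated 2-D clusters; k is set to 3 here only because the toy set has so few points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters in 2-D (illustrative data)
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)  # the k = 3 nearest neighbors vote
knn.fit(X, y)                              # "lazy": just stores the data
print(knn.predict([[2, 2], [8, 7]]))
```

Each query point is assigned the majority class among its 3 nearest stored neighbors by Euclidean distance, the default metric.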
3. Decision Tree
Decision tree is a supervised learning algorithm that uses a tree representation to solve the problem by producing a sequence of rules that can classify the data. Among the types of classification algorithms in machine learning, this one is easy to visualize, making it simple to understand. A decision tree requires little data preparation and can handle both numerical & categorical data. The algorithm builds a flowchart-like tree structure in which each internal node represents a test on an attribute and each leaf node corresponds to a class label. A decision tree can be defined as a graphical representation of all possible solutions to a problem based on the given conditions.
A decision tree works by predicting the class of a given record. It starts from the root node and proceeds toward a leaf node. The algorithm compares the value of the root attribute with the corresponding attribute of the record and, based on this comparison, follows a branch and jumps to the next node. This comparison continues until a leaf node is reached. To find the best attribute in the dataset, an attribute selection measure (ASM) is used. ASM is a technique for selecting the best attribute in the given dataset, performed using either information gain or the Gini index.
The information gain is the change in entropy after segmentation of a dataset, or it is the measure of how much information an attribute provides about a class. The objective of the decision tree is to maximize the information gain, and the node with the highest information gain is split first.
The formula calculates information gain,
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
And the formula for entropy is,
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
where, S = the set of samples, P(yes) = probability of yes, and P(no) = probability of no
Gini Index is a metric measuring impurity or purity of an element. It calculates the amount of probability of a specific attribute classified incorrectly when selected at random. The decision tree prefers a low Gini index attribute to create binary splits. The formula for calculating the Gini index is,
Gini Index = 1 − ∑j (Pj)²
where Pj is the probability of an element being classified into the j-th distinct class.
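Both measures can be computed directly from the formulas above. The node below, with 9 “yes” and 5 “no” samples, is a made-up example for illustration:

```python
import math

def entropy(p_yes, p_no):
    """Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)."""
    terms = [p for p in (p_yes, p_no) if p > 0]  # 0*log(0) is taken as 0
    return -sum(p * math.log2(p) for p in terms)

def gini(probs):
    """Gini Index = 1 - sum_j (P_j)^2."""
    return 1 - sum(p * p for p in probs)

# A node with 9 'yes' and 5 'no' samples
print(f"Entropy: {entropy(9/14, 5/14):.3f}")  # 0.940
print(f"Gini:    {gini([9/14, 5/14]):.3f}")   # 0.459
```

A pure node (all one class) scores 0 on both measures, while a 50/50 node scores the maximum: entropy 1.0 and Gini 0.5. The tree splits on the attribute that reduces these impurity scores the most.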
4. Support Vector Machine
Support vector machine (SVM) is one of the most widely used supervised learning classification methods because of its significant accuracy with less computation power. The objective of SVM is to fit a hyperplane to the data points in an N-dimensional space that distinctly classifies data points and helps to categorize a new data point. Hyperplanes are decision boundaries that can segregate the N-dimension space into classes. The dimension of the hyperplane depends on the number of features present in the given dataset.
SVM chooses extreme points, or support vectors, that help create the hyperplane; hence the algorithm’s name. Support vectors are defined as the data points closest to the hyperplane, and they influence its position and orientation. Using these support vectors, SVM tries to maximize the margin of the classifier: the margin is the distance between the hyperplane and the closest data points of each class, and the best hyperplane is the one whose margin between the two classes is largest.
There can be two types of SVM:
- Linear SVM – When the dataset is linearly separable, that is, the data can be classified into two classes by a single straight line, the classifier is called a linear SVM.
- Non-linear SVM – When the dataset is non-linearly separable, the data cannot be classified using a straight line. Then, the classifier is called a non-linear SVM.
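A linear SVM on a small linearly separable toy set (the points are invented for illustration) shows the support vectors the classifier selects:

```python
from sklearn.svm import SVC

# Linearly separable toy data: two clusters in 2-D
X = [[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

linear_svm = SVC(kernel="linear").fit(X, y)
print(linear_svm.support_vectors_)           # the points that define the margin
print(linear_svm.predict([[2, 2], [6, 6]]))
```

For data that no straight line can separate, swapping in a kernel, e.g. `SVC(kernel="rbf")`, gives a non-linear SVM with the same interface.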
5. Naive Bayes
Naive Bayes is a probabilistic supervised learning algorithm based on Bayes’ theorem, used to solve classification problems. It makes predictions based on the probability of an object. The Naive Bayes classifier is widely used in text classification, spam filtration, and sentiment analysis. It is one of the simplest yet fast, accurate, and reliable algorithms in machine learning. Bayes’ theorem, or Bayes’ law, is the basis of the algorithm; it is used to calculate the probability of a hypothesis with prior knowledge and works on conditional probability. Conditional probability is a measure of the probability of an event occurring, given that another event has occurred.
The formula for Bayes’ theorem is,
P(A|B) = [P(B|A) × P(A)] / P(B)
where,
P(A|B) = Posterior probability, i.e. the probability of hypothesis A given the observed event B.
P(B|A) = Likelihood probability, i.e. the probability of the evidence B given that hypothesis A is true.
P(A) = Prior probability, the probability of the hypothesis before observing the evidence.
P(B) = Marginal probability, the probability of the evidence.
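The theorem can be worked through with concrete numbers. The probabilities below are made up for illustration, not taken from any dataset:

```python
# Worked example: A = "email is spam", B = "email contains the word 'offer'"
p_A = 0.2               # prior: 20% of all email is spam
p_B_given_A = 0.6       # likelihood: 60% of spam contains 'offer'
p_B_given_not_A = 0.05  # 5% of non-spam contains 'offer'

# Marginal probability of the evidence, P(B), by total probability
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
posterior = p_B_given_A * p_A / p_B
print(f"P(spam | 'offer') = {posterior:.2f}")  # 0.75
```

Even though only 20% of mail is spam, observing the word “offer” raises the probability of spam to 75%, which is exactly the update Bayes’ theorem formalizes.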
The fundamental assumption of the Naive Bayes classification model is that each feature makes an independent and equal contribution to the outcome. Note that this assumption rarely holds in real-world situations; in fact, the independence assumption is almost never strictly correct, yet the model often works well in practice.
There are three types of Naive Bayes classification models,
- Gaussian – When the predictors take continuous values instead of discrete ones, the model assumes the dataset’s features follow the normal distribution. This Naive Bayes model is called a Gaussian model.
- Multinomial – When the data is multinomially distributed, the classifier uses the frequencies of words as the predictors to assign a category. This is called the multinomial model.
- Bernoulli – Similar to the multinomial model, except the predictor values are independent boolean variables.
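A Gaussian Naive Bayes classifier in scikit-learn, fitted on a tiny invented dataset with one continuous feature, looks like this:

```python
from sklearn.naive_bayes import GaussianNB

# Tiny illustrative dataset: one continuous feature, two classes
X = [[1.0], [1.2], [0.9], [3.0], [3.2], [2.9]]
y = [0, 0, 0, 1, 1, 1]

nb = GaussianNB().fit(X, y)          # fits a normal distribution per class
print(nb.predict([[1.1], [3.1]]))
```

Under the hood the model estimates a mean and variance of the feature for each class, then applies Bayes’ theorem to pick the class with the highest posterior probability.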
Machine learning provides various classification techniques, and we have discussed five of the simplest and most fundamental types of classification algorithms. The algorithms above are straightforward to implement yet give good accuracy. They use mathematically & statistically proven methods & laws to perform analytical tasks.