Machine Learning Interview Questions & Answers
ML Interview Questions Part II
Q16). Explain how a ROC curve works?
The ROC curve is a graphical representation of the trade-off between the true positive rate and the false positive rate at different classification thresholds.
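As a rough sketch (the labels and scores below are made up for illustration), each threshold on a model's scores yields one (false positive rate, true positive rate) point, and scikit-learn's roc_curve traces them all out:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground-truth labels and predicted scores (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

# Each threshold yields one (FPR, TPR) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The area under the curve summarises the whole trade-off in one number.
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Plotting fpr against tpr (for example with plt.plot(fpr, tpr)) gives the ROC curve itself; a curve hugging the top-left corner indicates a better classifier.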
Q17). What is the DBSCAN clustering algorithm, and how is it implemented?
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an unsupervised learning algorithm that identifies distinct groups or clusters in a dataset, on the assumption that a cluster is a contiguous region of high point density separated from other clusters by regions of low density. It can find clusters of different shapes and sizes in a large dataset, even one that contains noise and outliers.
Below is the code for the DBSCAN clustering algorithm. Let's understand, with the help of an example, how to implement DBSCAN clustering in Python (the code can be run in a Jupyter notebook).
Step 1: Import the Python modules
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline
The StandardScaler transformer removes the mean and scales each feature to unit variance.
Step 2: Input data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=550, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)
This code generates the sample data with make_blobs and standardises it.
Step 3: Compute DBSCAN
db = DBSCAN(eps=0.2, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# DBSCAN labels noise points as -1, so exclude them from the cluster count.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))
This fits DBSCAN to the data, assigns each point a cluster label (with -1 marking noise), and reports several clustering metrics.
Step 4: Visualise the dataset
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=10)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=5)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
In the plot, black is reserved for noise points rather than being used for any cluster; core samples are drawn larger than border samples.
Output: DBSCAN clustering
Q18). Define precision and recall.
The true positive rate is also known as recall: the number of positives your model identifies relative to the actual number of positives in the data. Precision, also known as the positive predictive value, measures the number of correct positives relative to the total number of positives the model claims. Recall and precision are easier to grasp with an example: suppose you predicted that there were 10 apples and 5 oranges in a basket that actually contains 10 apples. You would have perfect recall (there really are 10 apples, and you predicted all 10) but only 66.7 per cent precision, because only 10 (the apples) of the 15 items you predicted are correct.
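The apples-and-oranges example works out like this in code (a minimal sketch, with the counts taken from the example above):

```python
# 10 actual apples; the model predicts 15 apples
# (the 10 real ones plus 5 oranges mislabelled as apples).
tp = 10  # apples correctly predicted as apples
fp = 5   # oranges wrongly predicted as apples
fn = 0   # apples the model failed to find

recall = tp / (tp + fn)     # 10 / 10 = 1.0
precision = tp / (tp + fp)  # 10 / 15 = 0.667
print(recall, round(precision, 3))
```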
Q19). Why do we need a training set, a validation set and a test set?
When constructing a model we split the data into three separate categories:

Training set:
We use the training set for building the model and adjusting its variables. However, we cannot rely on the training set alone to judge the model's correctness: on new inputs, the model could still produce incorrect outputs.

Validation set:
To check how the model responds to samples that do not exist in the training dataset, we use a validation set, and we tune the hyperparameters based on the model's performance on the validation data.
When we repeatedly evaluate the model against the validation set, we indirectly train the model on it. This can result in the model overfitting to that particular data, so it may not be robust enough to give the desired answers on real-world data.

Test set:
The test dataset is a subset of the original dataset that has not been used for model training, so it is unknown to the model. By using the test dataset, we can measure the response of the trained model on unseen data and evaluate its efficiency.
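One common way to obtain all three sets is to call scikit-learn's train_test_split twice; the 60/20/20 proportions below are illustrative, not prescribed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy samples with 2 features each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First hold out 20% as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then take 25% of the remainder (20% of the total) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```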
Q20). When is an algorithm considered independent in machine learning?
A machine learning method is characterised as independent when its mathematical foundations do not depend on any specific classifier or learning algorithm.
Q21). What are the advantages of Naive Bayes?
The Naïve Bayes classifier can converge faster than discriminative models such as logistic regression, so you need less training data. The main limitation is that it cannot learn interactions between features, because it assumes they are conditionally independent.
Q22). Which one is your favourite algorithm and can you explain it?
This kind of question measures your understanding of how complex technical nuances can be conveyed with poise, and your ability to summarise rapidly and accurately. Make sure that you have a choice, and that you can explain various algorithms so clearly and simply that the fundamentals could be grasped by a five-year-old!
Q23). Differentiate between the flat kernel and Gaussian Kernel.
The difference between the flat kernel and Gaussian Kernel are as follows:

Flat kernel:
This kernel does not guarantee that the densest points are around the centre. A single kernel centred at a point may even cover two or more clusters within its edge.

Gaussian kernel:
This kernel ensures that the centre has the densest points. For this kernel, the standard deviation will function as a bandwidth parameter.
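A minimal sketch of the two weightings, as used in mean-shift style density estimation (the function names and the bandwidth parameter here are illustrative):

```python
import numpy as np

def flat_kernel(distance, bandwidth):
    # Every point within the bandwidth counts equally; points outside count zero,
    # so nothing forces the densest points to sit near the centre.
    return np.where(distance <= bandwidth, 1.0, 0.0)

def gaussian_kernel(distance, bandwidth):
    # Weight decays smoothly with distance; the bandwidth plays the role of
    # the standard deviation, so the densest points dominate near the centre.
    return np.exp(-0.5 * (distance / bandwidth) ** 2)

d = np.array([0.0, 0.5, 1.0, 2.0])
print(flat_kernel(d, 1.0))      # [1. 1. 1. 0.]
print(gaussian_kernel(d, 1.0))
```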
Q24). What is the Confusion Matrix?
The confusion matrix is used to describe the performance of a classification model and offers a summary of its predictions, making it easy to see which classes the model confuses with one another.
Q25). Explain false negative, false positive, true negative, and true positive with a simple example.
True Positive (TP):
When the condition is correctly predicted by the Machine Learning model, it is said to have a True Positive value.
True Negative (TN):
If the Machine Learning model correctly predicts the negative condition or class, it is said to have a True Negative value.
False Positive (FP):
When a negative class or condition is incorrectly predicted by the Machine Learning model, then it is said to have a False Positive value.
False Negative (FN):
When a positive class or condition is incorrectly predicted by the Machine Learning model, then it is said to have a false negative value.
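All four quantities can be read directly off a binary confusion matrix; a small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a binary problem.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() flattens the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```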
Q26). What do you mean by Association rule mining (ARM)?
It is an unsupervised learning method that checks the dependency of one data item on another and maps the items accordingly so that the relationship can be exploited profitably. It attempts to find important relationships or correlations between the variables of a dataset, and uses various types of rules to identify the strongest links between them. Market-basket analysis, for example, is one of the essential techniques in association rule mining: it examines a broad dataset to discover connections between items, which helps retailers understand which goods customers regularly purchase together. For instance, a customer who comes in to buy bread may also purchase butter, eggs or milk, which is why such products are often arranged on the same shelf.
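The two core quantities behind association rules, support and confidence, can be computed by hand; the transactions below are a made-up sketch of the bread-and-butter example:

```python
# Each transaction is the set of items in one customer's basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

# Rule under consideration: {bread} -> {butter}.
n_bread = sum("bread" in t for t in transactions)
n_bread_butter = sum({"bread", "butter"} <= t for t in transactions)

# Support: how often the whole itemset appears at all.
support = n_bread_butter / len(transactions)  # 2 / 4 = 0.5
# Confidence: how often butter appears given that bread was bought.
confidence = n_bread_butter / n_bread         # 2 / 3 = 0.67
print(support, round(confidence, 2))
```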
Q27). Demonstrate the K-means clustering algorithm.
Now, let's understand, with the help of an example, how to implement K-means clustering in Python (the code can be run in a Jupyter notebook).
Step 1: Import the Python modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline
Here, the pandas library is used to read and write spreadsheets, NumPy is used for numerical computation on the dataset, and Matplotlib is used for data visualisation.
Step 2: Input data and calculate
X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50,2)
X[50:100, :] = X1
The above code generates random data in a two-dimensional space, with the second half of the points shifted to form a second cluster.
Step 3: Visualise the dataset
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
This draws a scatter plot of the existing dataset.
Step 4: Use scikit-learn
from sklearn.cluster import KMeans
Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
Here, k (n_clusters) is arbitrarily given a value of two.
Step 5: Find the centroid
Kmean.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()
Here, the above code finds the centres of the clusters and plots them as squares on top of the data.
Step 6: Algorithm testing
Kmean.labels_
sample_test = np.array([3.0, 3.0])
second_test = sample_test.reshape(1, -1)
Kmean.predict(second_test)
The code retrieves the labels_ property of the fitted model, showing how the data points are categorised into the two clusters, and then predicts the cluster of a new sample point.
Output: K-means clustering
Q28). Explain the differences between random forest and gradient boosting algorithm.
Random Forest uses bagging techniques, while GBM uses boosting techniques.
Random forests primarily aim to decrease variance, while GBM primarily decreases a model's bias.
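A minimal sketch showing the two scikit-learn estimators side by side on the same made-up data (the parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Toy binary classification data.
X, y = make_classification(n_samples=200, random_state=0)

# Bagging: each tree is trained independently on a bootstrap sample.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Boosting: each tree is trained sequentially on the previous trees' errors.
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.score(X, y), gbm.score(X, y))
```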
Q29). Name four common non-linear machine learning algorithms.
Four common non-linear algorithms are:
 KNearest Neighbours
 Decision tree
 Naïve Bayes
 Random forest
Q30). Illustrate the random forest.
Now, let's understand, with the help of an example, how to implement a random forest in Python (the code can be run in a Jupyter notebook).
Step 1: Import the Python module and data
from sklearn import datasets
iris = datasets.load_iris()
print(iris.target_names)
print(iris.feature_names)
print(iris.data[0:5])
print(iris.target)
To build a random forest model, use the load_iris() function, which is built into sklearn. The dataset contains sepal (length and width) and petal (length and width) measurements for three classes of iris flower: setosa, versicolor, and virginica. Printing the target and feature names lets you confirm that you have the correct dataset; the code then prints the first five rows of the dataset and the target variable for the whole dataset.
Step 2: Create a dataframe
import pandas as pd
data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width': iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})
data.head()
A DataFrame is a two-dimensional labelled data structure whose columns can hold different data types.
Step 3: Split the dataset
from sklearn.model_selection import train_test_split
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
This splits the columns into independent variables (X) and the dependent variable (y), and then divides the data into a training set (70%) and a test set (30%).
Step 4: Train the model
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
After splitting, train the model based on the training set and predict the performance on the test dataset.
Step 5: Check the accuracy
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Output: approximately 0.9777 (the exact value depends on the random split)
Step 6: Predict type of flower
clf.predict([[3, 5, 4, 2]])
Step 7: Create a random forests model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
# Refit the new model so that feature_importances_ is available in the next step.
clf.fit(X_train, y_train)
Step 8: View the feature importance scores
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
feature_imp
Step 9: Visualise the dataset
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
For the visualisation, matplotlib and seaborn are combined: seaborn is built on top of the matplotlib library and provides several customised themes and extra plot types.