#### ML Interview questions Part II

##### Q16). Explain how a ROC curve works?

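The ROC (receiver operating characteristic) curve plots the true positive rate against the false positive rate at different classification thresholds. It shows how the share of correctly detected positives trades off against the share of false alarms as the threshold moves.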
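As a rough illustration of how the curve is built, the sketch below computes the points of a ROC curve with scikit-learn. The labels and scores are made-up toy values, not results from any real model.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels and predicted scores (illustrative only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.2f}")
```

Plotting `fpr` against `tpr` gives the curve itself; the area under it (AUC) summarises the curve in a single number.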

##### Q17). What is the DBSCAN clustering algorithm, and how is it implemented?

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an unsupervised learning algorithm that identifies distinct groups or clusters in a dataset, based on the assumption that a cluster is a contiguous region of high point density separated from other clusters by regions of low point density. It can find clusters of different shapes and sizes in a large dataset that contains noise and outliers.

Below is the code for the DBSCAN clustering algorithm. Let’s work through an example of how to implement DBSCAN clustering in Python using a Jupyter notebook.

##### Step 1: Import the python modules

import numpy as np

from sklearn.cluster import DBSCAN

from sklearn import metrics

from sklearn.datasets import make_blobs

from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

%matplotlib inline

Here, “StandardScaler” removes the mean and scales each feature to unit variance.

##### Step 2: Input data

centers = [[1, 1], [-1, -1], [1, -1]]

X, labels_true = make_blobs(n_samples=550, centers=centers, cluster_std=0.4, random_state=0)

X = StandardScaler().fit_transform(X)

The code above generates sample data around three centres with make_blobs and then standardises it.

##### Step 3: Calculate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10).fit(X)

labels = db.labels_

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)

print('Estimated number of noise points: %d' % n_noise_)

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))

print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))

print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))

print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

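This step runs DBSCAN and stores the cluster assignment of each point in labels, where noise points get the label -1. It then counts the clusters while ignoring noise, counts the noise points, and prints several clustering quality metrics.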

##### Step 4: Visualise dataset

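core_samples_mask = np.zeros_like(labels, dtype=bool)  # True for core samples

core_samples_mask[db.core_sample_indices_] = True

unique_labels = set(labels)

colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):

    if k == -1:

        # Black used for noise.

        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core samples of this cluster are drawn with larger markers.

    xy = X[class_member_mask & core_samples_mask]

    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=10)

    # Non-core points (border points and noise) are drawn with smaller markers.

    xy = X[class_member_mask & ~core_samples_mask]

    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=5)

plt.title('Estimated number of clusters: %d' % n_clusters_)

plt.show()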

In the plot, black is removed from the colour map and used for noise points instead.

##### Q18). Define precision and recall.

Recall is also known as the true positive rate: the number of positives your model identifies relative to the actual number of positives in the data. Precision is also known as the positive predictive value: the number of correct positives your model claims relative to the total number of positives it claims. It can be easier to think of recall and precision in terms of a case where you predicted that there were 10 apples and 5 oranges in a case of 10 apples. You would have perfect recall (there are 10 apples, and you predicted there would be 10) but 66.7 per cent precision, because only 10 (the apples) of the 15 events you predicted are correct.
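The arithmetic in the apple example can be checked with a short sketch. The helper `precision_recall` is a hypothetical name introduced here for illustration, and the counts assume the 10 apples are true positives and the 5 oranges are false positives.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw prediction counts."""
    precision = tp / (tp + fp)  # correct positives / all predicted positives
    recall = tp / (tp + fn)     # correct positives / all actual positives
    return precision, recall


# Assumed counts for the apple example: 10 correct, 5 wrong, none missed.
precision, recall = precision_recall(tp=10, fp=5, fn=0)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```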

##### Q19). Why do we need a training set, a validation set and a test set?

When constructing a model we split the data into three separate categories:

• ##### Training set:

We use the training set to build the model and adjust its parameters. However, we cannot judge the correctness of the model from the training set alone: when fed new inputs, the model could still produce incorrect outputs.

• ##### Validation set:

To see how the model responds to samples that are not in the training dataset, we use a validation set. We then tune the hyperparameters based on the model’s estimated performance on the validation data.

When we repeatedly evaluate the model on the validation set and adjust it in response, we indirectly train the model on the validation set. This can overfit the model to that particular data, so the validation score may overstate how well the model will handle real-world data.

• ##### Test set:

The test dataset is a subset of the original dataset that has not been used for training or tuning, so it is unknown to the model. By using the test dataset, we can measure the response of the finished model on unseen data and evaluate its performance.
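One common way to obtain the three sets is to call scikit-learn's `train_test_split` twice. The sketch below uses toy arrays and an assumed 60/20/20 split; the proportions are an illustrative choice, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (illustrative only).
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split off 40% of the data, then halve it into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The model is fitted on the training part, tuned against the validation part, and scored once on the test part at the end.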

##### Q20). When is an algorithm considered independent in machine learning?

A machine learning technique is considered algorithm independent when its mathematical foundations do not depend on any specific classifier or learning algorithm.

##### Q21). What are the advantages of Naive Bayes?

The Naïve Bayes classifier can converge faster than discriminative models such as logistic regression, so you need less training data. Its key drawback is that it cannot learn interactions between features.

##### Q22). Which one is your favourite algorithm and can you explain it?

This type of question tests how well you can communicate complex and technical nuances with poise, and your ability to summarise quickly and accurately. Make sure that you have a favourite, and make sure that you can explain different algorithms so simply and clearly that a five-year-old could grasp the basics!

##### Q23). Differentiate between the flat kernel and Gaussian Kernel.

The differences between the flat kernel and the Gaussian kernel are as follows:

• ##### Flat kernel:

This kernel weights every point inside its window equally, so it does not guarantee that the densest points are around the centre. A single window may therefore settle on a centre while covering the edges of two or more clusters.

• ##### Gaussian kernel:

This kernel gives the most weight to points near the centre, which ensures that the centre has the densest points. For this kernel, the standard deviation acts as the bandwidth parameter.
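The contrast between the two kernels can be seen by comparing the weights they give to points at increasing distances from the centre. The functions `flat_kernel` and `gaussian_kernel` below are hypothetical helpers written for this sketch, and the Gaussian is left unnormalised for simplicity.

```python
import numpy as np


def flat_kernel(distance, bandwidth):
    """Weight 1 inside the window, 0 outside it."""
    return (np.abs(distance) <= bandwidth).astype(float)


def gaussian_kernel(distance, bandwidth):
    """Weight decays smoothly with distance; bandwidth is the standard deviation."""
    return np.exp(-0.5 * (distance / bandwidth) ** 2)


distances = np.array([0.0, 0.5, 1.0, 2.0])
flat_weights = flat_kernel(distances, bandwidth=1.0)
gaussian_weights = gaussian_kernel(distances, bandwidth=1.0)

print("flat:    ", flat_weights)
print("gaussian:", np.round(gaussian_weights, 3))
```

The flat kernel treats a point at distance 1.0 the same as one at the centre, while the Gaussian kernel steadily reduces the weight as the distance grows.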

##### Q24). What is the Confusion Matrix?

The confusion matrix is used to describe the performance of a classification model and gives a summary of its predictions against the actual classes. It helps to show where the model confuses one class with another.

##### True Positive (TP):

When the model correctly predicts the positive condition or class, the prediction is a True Positive.

##### True Negative (TN):

When the model correctly predicts the negative condition or class, the prediction is a True Negative.

##### False Positive (FP):

When the model predicts the positive class for a case that is actually negative, the prediction is a False Positive.

##### False Negative (FN):

When the model predicts the negative class for a case that is actually positive, the prediction is a False Negative.
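The four counts can be read off scikit-learn's `confusion_matrix`. The labels below are made-up toy values chosen for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted classes (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```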

##### Q26). What do you mean by Association rule mining (ARM)?

Association rule mining is an unsupervised learning method that checks for the dependency of one data item on another and maps the items accordingly, so that the relationship can be used more profitably. It tries to find interesting relationships or correlations between the variables of the dataset, using various rules to determine the strongest links between them. Market basket analysis, for example, is one of the key applications of association rule mining. It uses a large dataset of transactions to reveal connections between items, which helps retailers understand which goods customers regularly purchase together. For example, a customer who comes in to buy bread may also purchase butter, eggs or milk, so a store can place these products on the same shelf or close together.
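Two measures commonly used to rank association rules are support and confidence. The sketch below computes them in plain Python for the rule "bread → butter"; the transactions are invented for illustration and the helper names are hypothetical.

```python
# Hypothetical shopping baskets (illustrative only).
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the share that also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain butter.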

##### Q27). Demonstrate the K-Means clustering algorithm.

Let’s work through an example of how to implement K-means clustering in Python using a Jupyter notebook.

##### Step 1: Import the python modules

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

%matplotlib inline

Here, the “Pandas” library is used to read and write tabular data such as spreadsheets, “NumPy” is used for numerical array computation, and “Matplotlib” is used for data visualization.

##### Step 2: Input data and calculate

X= -2 * np.random.rand(100,2)

X1 = 1 + 2 * np.random.rand(50,2)

X[50:100, :] = X1

The code above generates 100 random two-dimensional points: the first 50 lie between -2 and 0 on both axes, and the last 50 lie between 1 and 3.

##### Step 3: Visualise the dataset

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

plt.show()

The scatter function plots the values of the dataset as a scatter plot.

##### Step 4: Use of Scikit-Learn

from sklearn.cluster import KMeans

Kmean = KMeans(n_clusters=2)

Kmean.fit(X)

Here, k (n_clusters) is arbitrarily set to two.

##### Step 5: Find the centroid

centers = Kmean.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

plt.scatter(centers[0, 0], centers[0, 1], s=200, c='g', marker='s')

plt.scatter(centers[1, 0], centers[1, 1], s=200, c='r', marker='s')

plt.show()

The code above retrieves the centres of the clusters and plots them as coloured squares over the data.

##### Step 6: Algorithm testing

Kmean.labels_

sample_test=np.array([-3.0,-3.0])

second_test=sample_test.reshape(1, -1)

Kmean.predict(second_test)

Kmean.labels_ returns the label of each point in the example dataset, that is, how the data points are categorized into the two clusters. The predict call then assigns a new test point to one of those clusters.

##### Q28). Explain the differences between random forest and gradient boosting algorithm.

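Random forest uses bagging: it trains many trees independently on bootstrap samples and combines their predictions. Gradient boosting machines (GBM) use boosting: trees are added one after another, each correcting the errors of the ensemble so far.

Random forests primarily aim to reduce variance, while GBM reduces both the bias and the variance of a model.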
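Both approaches are available in scikit-learn and can be compared side by side. The sketch below is only a quick illustration on the iris dataset with an assumed 5-fold cross-validation; the scores on such a small dataset say little about which method is better in general.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: independent trees whose predictions are combined.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# Boosting: trees added sequentially to correct earlier errors.
gb = GradientBoostingClassifier(n_estimators=100, random_state=0)

rf_score = cross_val_score(rf, X, y, cv=5).mean()
gb_score = cross_val_score(gb, X, y, cv=5).mean()
print(f"random forest: {rf_score:.3f}  gradient boosting: {gb_score:.3f}")
```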

##### Q29). Name four common algorithms for non-linear regression models.

Four common algorithms for non-linear regression models are:

• K-Nearest Neighbours
• Decision tree
• Naïve Bayes
• Random forest
##### Q30). Illustrate the random forest.

Let’s work through an example of how to implement Random Forest in Python using a Jupyter notebook.

##### Step 1: Import the python module and data

from sklearn import datasets

iris = datasets.load_iris()

print(iris.target_names)

print(iris.feature_names)

print(iris.data[0:5])

print(iris.target)

To build the random forest model, load the data with the ‘load_iris()’ function, which is built into sklearn. The dataset contains the sepal length and width and the petal length and width of iris flowers. The flowers are divided into three classes: setosa, versicolor, and virginica. Print the target and feature names to make sure that you have the correct dataset. The code also prints the first five rows of the dataset and the target variable for the whole dataset.

##### Step 2: Create a dataframe

import pandas as pd

data = pd.DataFrame({

    'sepal length': iris.data[:, 0],

    'sepal width': iris.data[:, 1],

    'petal length': iris.data[:, 2],

    'petal width': iris.data[:, 3],

    'species': iris.target

})

A DataFrame is a two-dimensional labelled data structure whose columns can hold different types.

##### Step 3: Split the dataset

from sklearn.model_selection import train_test_split

X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]

y = data['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

This step separates the columns into the independent variables (X) and the dependent variable (y), then splits them into a training set (70%) and a test set (30%).

##### Step 4: Train the model

from sklearn.ensemble import RandomForestClassifier

clf=RandomForestClassifier(n_estimators=100)

clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)

After splitting, train the model on the training set and predict the labels of the test set.

##### Step 5: Check the accuracy

from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output (the exact value varies with the random split): 0.9777

##### Step 6: Predict type of flower

clf.predict([[3, 5, 4, 2]])

##### Step 7: Create a random forest model

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)  # the model must be fitted before its feature importances can be read

##### Step 8: View the feature importance scores

import pandas as pd

feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)

feature_imp

##### Step 9: Visualise the dataset

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

sns.barplot(x=feature_imp, y=feature_imp.index)  # bar plot of the importance scores

plt.xlabel('Feature Importance Score')

plt.ylabel('Features')

plt.title("Visualizing Important Features")

plt.show()

For visualization, matplotlib and seaborn are used together: seaborn is built on top of the matplotlib library and provides several customized themes and extra plot types.