    Introduction to Unsupervised Learning

    Unsupervised learning is a machine learning technique in which the model does not need to be trained by users on labelled examples; its aim is to deal with unlabelled data. The model works on the data by itself in order to discover patterns that were not previously identified. Such algorithms let users perform more complex tasks, which also makes unsupervised learning less predictable than other learning approaches. Examples include clustering and neural networks. The figure shows the working of unsupervised learning:

    [Figure: Working of unsupervised learning]

    The algorithm first examines the raw input data present in the dataset and recognises various patterns. The identified patterns are then used to extract useful information from the given unlabelled dataset. Finally, the model is able to make sense of this knowledge by itself.

    Types of unsupervised learning

    In unsupervised learning, the focus is on identifying items rather than predicting an output. It consists of two main types of methods: clustering and association. These types are described in detail below.

    [Figure: Types of unsupervised learning]

    Let's understand the clustering type and its algorithms in detail.

    • Clustering algorithms:

      In this method, the algorithm divides the dataset into different groups, so that data points which fall in a specific group are more similar to one another than to points in other groups. Different algorithms are used to build clusters, such as the K-means, mean shift, DBSCAN, and hierarchical clustering algorithms. Some of the tasks that can be performed using cluster analysis are as follows:

      • Classify objects based on their features
      • In a library, arrange books according to their genre
      • Identify user groups based on similar behaviour
    • K-Means clustering algorithm:

      It is an unsupervised learning algorithm in which no labelled data are present for clustering. Its aim is to minimise the sum of distances between the data points and their respective cluster centroids. It divides objects into clusters: objects with similar characteristics are kept together, while dissimilar objects are placed in other clusters. It is a centroid-based algorithm, where every cluster is linked with a centroid.

      The algorithm takes an unlabelled dataset as input and distributes the data into k clusters, repeating the procedure until the best clusters are identified. The value of k should be predefined, and the algorithm consists of two main tasks, as follows:

      • Calculate the best positions for the k centre points.
      • Assign every data point to its nearest k-centre; a cluster is formed by the data points that are close to a specific k-centre.
      Hence, every resulting cluster contains data points that share some similarities and are away from the other clusters. The figure shows the working of the K-means clustering algorithm, and a from-scratch sketch of the two tasks follows it:

      [Figure: Working of the K-means clustering algorithm]
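      Before moving to the library example, the two tasks above can be made concrete with a minimal from-scratch sketch. This is only an illustration: the helper name kmeans, the sample data, and k = 2 are assumptions for this sketch, not part of the library example that follows.

      import numpy as np

      def kmeans(X, k, n_iters=100, seed=0):
          rng = np.random.default_rng(seed)
          # Task 1: start with k centre points (picked randomly from the data).
          centres = X[rng.choice(len(X), size=k, replace=False)]
          for _ in range(n_iters):
              # Task 2: assign every data point to its nearest centre.
              dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
              labels = dists.argmin(axis=1)
              # Move each centre to the mean of its assigned points
              # (assumes no cluster ends up empty, which is fine for a sketch).
              new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
              if np.allclose(new_centres, centres):
                  break  # centres stopped moving, so the clusters are stable
              centres = new_centres
          return centres, labels

      # Illustrative data: two loose groups of 50 points each in 2D.
      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
      centres, labels = kmeans(X, k=2)
      print(centres)  # one centre near (0, 0), the other near (5, 5)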

      Now, let's understand with the help of an example how to implement K-means clustering in Python using a Jupyter notebook.

      Step 1: Import the python modules

      import pandas as pd

      import numpy as np

      import matplotlib.pyplot as plt

      from sklearn.cluster import KMeans

      %matplotlib inline

      Here, the "pandas" library is used to read and write spreadsheet-style data, "NumPy" is used for numerical computation on the dataset, and "Matplotlib" is used for data visualisation. The %matplotlib inline magic displays the plots inside the notebook.

      Step 2: Input data and calculate

      X= -2 * np.random.rand(100,2)

      X1 = 1 + 2 * np.random.rand(50,2)

      X[50:100, :] = X1

      The above code generates 100 random data points in two-dimensional space; the last 50 rows are overwritten with a second, shifted group so that the data forms two loose clusters.

      Step 3: Visualise the dataset

      plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

      plt.show()

      Now, the scatter plot displays the values of the existing dataset.

      Step 4: Use of Scikit-Learn

      from sklearn.cluster import KMeans

      Kmean = KMeans(n_clusters=2)

      Kmean.fit(X)

      Here, k (n_clusters) is arbitrarily given a value of two.

      Step 5: Find centroid

      Kmean.cluster_centers_

      plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

      plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')

      plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')

      plt.show()

      Here, the above code displays the centres of the two clusters; the coordinates passed to the two scatter calls are the values that cluster_centers_ returned for this particular run.

      Step 6: Algorithm testing

      Kmean.labels_

      sample_test=np.array([-3.0,-3.0])

      second_test=sample_test.reshape(1, -1)

      Kmean.predict(second_test)

      The code reads the labels_ property of the fitted model, which shows how the data points are categorised into the two clusters, and then predicts the cluster of a new sample point.

      [Output: K-means clustering]
    • Mean shift clustering algorithm:

      It is a common type of unsupervised learning algorithm. Its working is based on a method called kernel density estimation (KDE), and it is also known as a mode-seeking algorithm because it moves points towards the nearest maximum-density point, or mode. The kernel is a statistical weighting function applied to the data points. This algorithm is commonly used for computer vision and image segmentation. A kernel function must fulfil the following conditions:

      • The kernel density estimate must be normalised.
      • The kernel must be symmetric in space.

      Further, it uses two popular kernel functions, the flat kernel and the Gaussian kernel; a sketch of one mean-shift iteration follows their descriptions.

      • Flat Kernel:

        This kernel does not guarantee that the densest points lie around the centre. The centre can be associated with a single kernel which might cover two or more cluster centres at its edge.

      • Gaussian Kernel:

        This kernel guarantees that the densest points lie around the centre. Its standard deviation acts as the bandwidth parameter.
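      As a rough illustration of mode seeking, the sketch below runs mean-shift iterations with a Gaussian kernel on a one-dimensional sample. The data values, the starting point, and the bandwidth are assumptions made for this sketch only.

      import numpy as np

      def shift_once(x, data, bandwidth):
          # Gaussian kernel weights: nearby points pull harder than distant ones.
          weights = np.exp(-((data - x) ** 2) / (2 * bandwidth ** 2))
          # Move x to the weighted mean of the data, i.e. towards the local mode.
          return np.sum(weights * data) / np.sum(weights)

      # Illustrative 1D data with two dense regions (modes) near 0 and 5.
      rng = np.random.default_rng(0)
      data = np.concatenate([rng.normal(0, 0.5, 100), rng.normal(5, 0.5, 100)])

      x = 3.9  # starting point for the search
      for _ in range(50):
          x_new = shift_once(x, data, bandwidth=0.5)
          if abs(x_new - x) < 1e-6:
              break  # converged on a mode
          x = x_new
      print(round(x, 2))  # settles near the mode at 5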

      Now, let's understand with the help of an example how to implement mean shift clustering in Python using a Jupyter notebook.

      Step 1: Import the python modules

      import numpy as np

      from sklearn.cluster import MeanShift

      import matplotlib.pyplot as plt

      from matplotlib import style

      %matplotlib inline

      style.use("ggplot")

      from sklearn.datasets import make_blobs

      Here, "NumPy" is used for numerical computation and "Matplotlib" for data visualisation. style.use("ggplot") applies the ggplot plotting style. The make_blobs function, part of sklearn.datasets, generates sample datasets as blobs of points.

      Step 2: Input data

      centers = [[1,1,1],[1,2,2],[3,8,8]]

      X, _ = make_blobs(n_samples = 500, centers = centers, cluster_std = 0.5)

      plt.scatter(X[:,0],X[:,1])

      plt.show()

      By using make_blobs, sample values are generated around the given centres and then plotted. The figure below shows the 2D scatter of the three generated blobs.

      [Output: 2D dataset]
      Step 3: Calculate and Visualise dataset

      ms = MeanShift()

      ms.fit(X)

      labels = ms.labels_

      cluster_centers = ms.cluster_centers_

      print(cluster_centers)

      n_clusters_ = len(np.unique(labels))

      print("Estimated clusters:", n_clusters_)

      colors = 10*['r.', 'g.', 'b.', 'c.', 'k.', 'y.', 'm.']

      for i in range(len(X)):
          plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=3)

      plt.scatter(cluster_centers[:,0], cluster_centers[:,1],
                  marker=".", color='b', s=10, linewidths=5, zorder=10)

      plt.show()

      Mean shift is applied to the generated 2D dataset to estimate the clusters and their centres.

      [Output: Mean shift clustering]
    • DBSCAN clustering algorithm:

      DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This unsupervised learning algorithm identifies particular groups or clusters in the dataset based on the assumption that a cluster is a continuous region of high point density, separated from other clusters by regions of low point density. It can determine clusters of different shapes and sizes from a huge dataset which may contain noise and outliers. Generally, it uses two basic parameters:

      • minPts:

        The minimum number of points that must be clustered together for a region to be considered dense.

      • eps (ε):

        A distance measure used to locate points in the neighbourhood of any point.

      Further, these parameters are understood through the concepts of density reachability and density connectivity, shown in the figure below; a small sketch after the figure illustrates how eps and minPts classify points.

      • Density reachability:

        A point is reachable from another point if it lies within a certain distance (eps) of it.

      • Density connectivity:

        A transitivity-based chaining approach that determines whether points lie in a specific cluster.

      [Figure: Density reachability and density connectivity]
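      Before the library example, here is a small sketch of how the two parameters classify points as core, border, or noise. The helper name classify_points, the sample coordinates, and the parameter values are assumptions made for this sketch.

      import numpy as np

      def classify_points(X, eps, min_pts):
          # Pairwise distances between all points.
          dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
          # A point is 'core' if its eps-neighbourhood holds at least min_pts
          # points (the count includes the point itself).
          core = (dists <= eps).sum(axis=1) >= min_pts
          labels = []
          for i in range(len(X)):
              if core[i]:
                  labels.append('core')
              elif np.any(core & (dists[i] <= eps)):
                  labels.append('border')  # not dense itself, but near a core point
              else:
                  labels.append('noise')   # unreachable from any core point
          return labels

      # Four points packed together and one far away.
      X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
      print(classify_points(X, eps=0.2, min_pts=3))
      # ['core', 'core', 'core', 'core', 'noise']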

      Now, let's understand with the help of an example how to implement DBSCAN clustering in Python using a Jupyter notebook.

      Step 1: Import the python modules

      import numpy as np

      from sklearn.cluster import DBSCAN

      from sklearn import metrics

      from sklearn.datasets import make_blobs

      from sklearn.preprocessing import StandardScaler

      import matplotlib.pyplot as plt

      %matplotlib inline

      Here, the "StandardScaler" function removes the mean and scales each feature to unit variance.

      Step 2: Input data

      centers = [[1, 1], [-1, -1], [1, -1]]

      X, labels_true = make_blobs(n_samples=550, centers=centers, cluster_std=0.4,

      random_state=0)

      X = StandardScaler().fit_transform(X)

      The code generates sample data using make_blobs and standardises the features with StandardScaler.

      Step 3: Calculate DBSCAN

      db = DBSCAN(eps=0.2, min_samples=10).fit(X)

      core_samples_mask = np.zeros_like(db.labels_, dtype=bool)

      core_samples_mask[db.core_sample_indices_] = True

      labels = db.labels_

      n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

      n_noise_ = list(labels).count(-1)

      print('Estimated number of clusters: %d' % n_clusters_)

      print('Estimated number of noise points: %d' % n_noise_)

      print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))

      print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))

      print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))

      print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))

      print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, labels))

      print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

      This code fits DBSCAN, counts the clusters found in labels while ignoring noise points (which are labelled -1), and prints several clustering evaluation metrics.

      Step 4: Visualise dataset

      unique_labels = set(labels)

      colors = [plt.cm.Spectral(each)
                for each in np.linspace(0, 1, len(unique_labels))]

      for k, col in zip(unique_labels, colors):
          if k == -1:
              # Black used for noise.
              col = [0, 0, 0, 1]

          class_member_mask = (labels == k)

          xy = X[class_member_mask & core_samples_mask]
          plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                   markeredgecolor='k', markersize=10)

          xy = X[class_member_mask & ~core_samples_mask]
          plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                   markeredgecolor='k', markersize=5)

      plt.title('Estimated number of clusters: %d' % n_clusters_)

      plt.show()

      For the visualisation of the data, each cluster gets its own colour, while black is reserved for noise points; core points are drawn with larger markers than border points.

      [Output: DBSCAN clustering]
    • Hierarchical clustering algorithm:

      It is a common type of unsupervised learning algorithm used to cluster unlabelled data points. Like K-means clustering, it groups together data points with similar characteristics; in some cases, the outcomes of hierarchical and K-means clustering may even coincide. The algorithm is classified into two main types, as follows:

      • Agglomerative hierarchical clustering:

        It is a "bottom-up" technique in which every observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

      • Divisive hierarchical clustering:

        It is a "top-down" technique in which all observations start in a single cluster, which is then split into the two least similar clusters; the process continues until each observation has its own cluster. The figure shows the difference between these two types, and a short dendrogram sketch follows it:

      [Figure: Agglomerative vs. divisive hierarchical clustering]
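      A dendrogram makes the bottom-up merging of agglomerative clustering easy to see. The sketch below draws one with SciPy; the sample points are an assumption made for this sketch.

      import numpy as np
      import matplotlib.pyplot as plt
      from scipy.cluster.hierarchy import dendrogram, linkage

      # Illustrative data: two small groups of 2D points.
      X = np.array([[1, 2], [2, 2], [2, 3], [8, 8], [8, 9], [9, 8]])

      # Ward linkage merges, at each step, the pair of clusters whose merge
      # increases the total within-cluster variance the least.
      Z = linkage(X, method='ward')

      dendrogram(Z)  # the tree shows which points merge first (bottom-up)
      plt.show()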

      Now, let's understand with the help of an example how to implement hierarchical clustering in Python using a Jupyter notebook.

      Step 1: Import the python modules

      import matplotlib.pyplot as plt

      import pandas as pd

      %matplotlib inline

      import numpy as np

      from sklearn.cluster import AgglomerativeClustering

      The workflow for hierarchical clustering is similar to that of the other unsupervised machine learning algorithms: first, import the required libraries.

      Step 2: Input data

      X = np.array([[2,3], [11,14],[13,15],[20,10],[18,25],[60,68],[73,80],[65,88],[45,50],[70,95],])

      Input the data points that are to be clustered.

      Step 3: Calculate the dataset

      cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')

      cluster.fit_predict(X)

      print(cluster.labels_)

      The fit_predict method predicts the cluster to which each data point belongs. The "AgglomerativeClustering" class from "sklearn.cluster" takes several parameters: the number of clusters is set with the n_clusters parameter, the affinity is set to "euclidean", and the linkage parameter is set to "ward", which minimises the variance between the clusters being merged.

      Step 4: Visualise the dataset

      plt.scatter(X[:,0], X[:,1], c=cluster.labels_, cmap='rainbow')

      Finally, plot the clustered data points to get the output.

      [Output: Hierarchical clustering]

    Now, let's understand the association type in detail. By using certain rules, association helps to discover relationships between different data points present in huge datasets. For example, online shopping websites generally use it to make recommendations to customers based on their previous browsing history. It uses the association rule mining technique to form associations.

    • Association rule mining (ARM):

      It is a type of unsupervised learning method that checks the dependence of one data item on another data item and maps them accordingly so that the result can be more profitable. It attempts to find interesting relationships or correlations between the variables of the dataset, using different types of rules to identify the relationships between the variables present in it.

      For example, market basket analysis is one of the most important approaches used in association rule mining. It uses a large dataset to show associations between data items and allows retailers to recognise relationships between items that customers frequently buy together. For instance, if a customer buys bread, he may also buy butter, eggs, or milk, because these products are often placed on nearby shelves.


      Now, let's understand in detail the measures that association rule algorithms use; a worked example follows the three definitions below.

      • Support:

        It calculates how frequently an itemset occurs in the dataset. The support of X with respect to the transactions T is defined as the proportion of transactions in the dataset that contain the itemset X. The equation of support is:

        Support(X) = Frequency of itemset X / Total number of transactions T

      • Confidence:

        It indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset, given that X has already occurred. It is defined as the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X. The equation of confidence is:

        Confidence(X → Y) = Frequency of X and Y / Frequency of X
      • Lift:

        It defines the strength of a rule as the ratio between the observed support of X and Y together and the support that would be expected if X and Y were independent. The equation of lift is:

        Lift(X → Y) = Support of X and Y / (Support of X × Support of Y)
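      To make these three measures concrete, here is a small worked sketch over a hypothetical list of transactions; the items and counts are assumptions made for this sketch.

      # Five hypothetical transactions from a shop.
      transactions = [
          {'bread', 'butter', 'milk'},
          {'bread', 'butter'},
          {'bread', 'eggs'},
          {'butter', 'milk'},
          {'bread', 'butter', 'eggs'},
      ]
      n = len(transactions)

      def support(itemset):
          # Fraction of transactions that contain every item in the itemset.
          return sum(itemset <= t for t in transactions) / n

      # Rule: bread -> butter
      s_bread = support({'bread'})            # 4/5 = 0.8
      s_both = support({'bread', 'butter'})   # 3/5 = 0.6
      confidence = s_both / s_bread           # 0.6 / 0.8 = 0.75
      lift = s_both / (s_bread * support({'butter'}))  # 0.6 / (0.8 * 0.8) = 0.9375
      print(s_both, confidence, lift)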

      Further, association rule learning is classified into three algorithms, as follows:

    • Apriori Algorithm:

      This algorithm uses the association rule mining technique described above and is applied for mining frequent itemsets. For example, in a shop, customers tend to buy related products together, like bread, butter, and milk.
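      A minimal sketch of the Apriori idea, reusing the hypothetical transactions from the worked example above: itemsets grow level by level, candidates containing an infrequent subset are pruned, and only candidates meeting the support threshold survive. The min_support value is an assumption for this sketch.

      from itertools import combinations

      transactions = [
          {'bread', 'butter', 'milk'},
          {'bread', 'butter'},
          {'bread', 'eggs'},
          {'butter', 'milk'},
          {'bread', 'butter', 'eggs'},
      ]
      min_support = 0.4  # an itemset must appear in at least 40% of transactions

      def support(itemset):
          return sum(itemset <= t for t in transactions) / len(transactions)

      # Level 1: frequent single items.
      items = sorted({i for t in transactions for i in t})
      frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
      k = 1
      while frequent:
          print(k, [sorted(s) for s in frequent])
          # Candidates of size k+1 from unions of frequent size-k itemsets...
          candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
          # ...pruned unless every size-k subset is itself frequent.
          candidates = {c for c in candidates
                        if all(frozenset(s) in set(frequent) for s in combinations(c, k))}
          frequent = [c for c in candidates if support(c) >= min_support]
          k += 1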

    • Eclat Algorithm:

      The Eclat algorithm is applied to mine frequent itemsets, which makes it possible to find periodic patterns in the dataset. For example, if a customer goes to a shop to buy butter, he may also buy eggs. The aim of this algorithm is to use set relationships to calculate the support of an itemset; it works with a depth-first search over the columns of a vertical data layout, which makes it faster than the Apriori algorithm.
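      The set-relationship idea can be sketched with a vertical layout, where each item maps to the set of transaction IDs (TIDs) that contain it; the support of an itemset is then the size of the intersection of its TID sets. The data reuses the hypothetical transactions above.

      transactions = [
          {'bread', 'butter', 'milk'},
          {'bread', 'butter'},
          {'bread', 'eggs'},
          {'butter', 'milk'},
          {'bread', 'butter', 'eggs'},
      ]

      # Vertical layout: item -> set of transaction IDs that contain it.
      tidsets = {}
      for tid, t in enumerate(transactions):
          for item in t:
              tidsets.setdefault(item, set()).add(tid)

      # Support of {bread, butter} = size of the intersection of the TID sets.
      both = tidsets['bread'] & tidsets['butter']
      print(both)                           # {0, 1, 4}
      print(len(both) / len(transactions))  # 0.6, the same support as before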

    • F-P Growth Algorithm:

      This algorithm works on databases rather than streams. While the Apriori algorithm requires n+1 scans of the database, where n is the length of the longest pattern, the FP-Growth concept reduces the number of scans over the complete database to two.
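      The two scans can be sketched as follows: the first scan counts item frequencies, and the second scan inserts each transaction, with its items ordered by descending frequency, into a compact prefix tree (the FP-tree). This is a simplified sketch on the same hypothetical transactions; mining the tree is omitted.

      from collections import Counter

      transactions = [
          {'bread', 'butter', 'milk'},
          {'bread', 'butter'},
          {'bread', 'eggs'},
          {'butter', 'milk'},
          {'bread', 'butter', 'eggs'},
      ]

      # Scan 1: count how often each item occurs.
      counts = Counter(item for t in transactions for item in t)

      # Scan 2: insert each transaction, ordered by descending frequency,
      # into a prefix tree so that common prefixes share nodes.
      tree = {}
      for t in transactions:
          node = tree
          for item in sorted(t, key=lambda i: (-counts[i], i)):
              node = node.setdefault(item, {})

      print(counts)
      print(tree)  # nested dicts standing in for FP-tree nodes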
