Clustering in Data Science
Clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, "unsupervised" refers to the problem of finding hidden structure within unlabeled data. Clustering techniques are unsupervised in the sense that the data scientist does not decide, in advance, the labels to apply to the clusters.
Clustering is a method used for exploratory analysis of data. No predictions are made in clustering. Instead, clustering methods find the similarities between objects according to the object attributes and group similar objects into clusters. Clustering techniques are used in marketing, economics, and various branches of science. A popular clustering method is k-means.
Given a set of objects, each with n measurable attributes, k-means is an analytical technique that, for a chosen value of k, identifies k clusters of objects based on each object's closeness to the centers of the k groups. The center of a cluster is the arithmetic average (mean) of the n-dimensional attribute vectors of the objects in that cluster. This section describes the algorithm for determining the k means, as well as how to apply the technique to several use cases.
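To make the definition of a cluster center concrete, the following minimal sketch averages each attribute across the objects in one cluster. The function name `centroid` and the sample values are hypothetical and chosen only for illustration.

```python
def centroid(points):
    """Return the arithmetic mean of a list of equal-length attribute vectors."""
    if not points:
        raise ValueError("cannot compute the centroid of an empty cluster")
    n = len(points[0])      # number of attributes per object
    count = len(points)     # number of objects in the cluster
    # Average attribute j over all objects, for each j in 0..n-1.
    return tuple(sum(p[j] for p in points) / count for j in range(n))


# Hypothetical cluster of three objects, each with n = 2 attributes.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
print(centroid(cluster))  # (3.0, 5.0)
```

The same function works for any number of attributes, as long as every object in the cluster has the same n.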
Clustering is often used as a lead-in to classification. Once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Clustering is primarily an exploratory technique for finding hidden structure in the data, possibly as a prelude to a more focused analysis or decision phase. Some specific applications of k-means are image processing, medicine, and customer segmentation.
Video is one example of the increasing volumes of unstructured data being collected. In each frame of a video, k-means analysis can be used to identify objects in the video. For each frame, the task is to determine which pixels are most similar to each other. The attributes of each pixel can include brightness, color, and location (the x and y coordinates in the frame). With security video images, for example, successive frames are examined to identify any changes to the clusters. Newly identified clusters may indicate unauthorized access to a facility.
Patient attributes such as age, height, weight, systolic and diastolic blood pressure, cholesterol level, and other measurements can be used to identify naturally occurring clusters. These clusters can be used to target individuals for specific preventive measures or clinical trial participation. Clustering is also useful in biology for the classification of plants and animals, as well as in the field of human genetics.
Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns. For example, a wireless provider can look at the following customer attributes: monthly bill, number of text messages, data volume consumed, minutes used during various daily periods, and years as a customer. The wireless company can then look at the naturally occurring clusters and consider tactics to increase sales or reduce the customer churn rate, the proportion of customers who end their relationship with a particular company.
The k-means algorithm for finding k clusters can be described in the following four steps.
- Step 1: Choose the value of k and the k initial guesses for the centroids.
- Step 2: Compute the distance from each data point (x, y) to each centroid, and assign each point to the closest centroid. This association defines the first k clusters. In two dimensions, the distance, d, between any two points (x1, y1) and (x2, y2) in the Cartesian plane is typically expressed using the Euclidean distance measure:
  $$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
- Step 3: Compute the centroid, the center of mass, of each newly defined cluster from Step 2.
- Step 4: Repeat Steps 2 and 3 until the algorithm converges to an answer:
  - Assign each point to the closest centroid computed in Step 3.
  - Compute the centroid of the newly defined clusters.
  - Repeat until the algorithm reaches the final solution.
Convergence is reached when the computed centroids no longer change, or when the centroids and the assigned points oscillate back and forth from one iteration to the next. The latter case can occur when one or more points are equidistant from the computed centroids.
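The four steps and the convergence check can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation. The names `euclidean`, `mean_point`, and `k_means`, the `max_iter` and `seed` parameters, the choice of random data points as initial centroids, and the handling of empty clusters are all assumptions made for this example. The sample data is made up.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: use k distinct data points as the initial guesses (an assumption).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):  # the cap guards against endless oscillation
        # Step 2: assign each point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty (an assumption).
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Made-up data: two visibly separate groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(labels)
```

On this data the first three points share one label and the last three share the other. Which group receives label 0 depends on the random initial centroids. Ties between equidistant centroids are broken here by taking the lowest cluster index, which is one simple way to avoid the oscillation described above.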