Data Science Tutorial
- What is Data Science?
- Life Cycle of Data Analytics
- What is Machine Learning?
- Python Tools in Data Science
- Working with Databases
- Data Science using R
- Hierarchical Indexing
- Data Science Using Scikit
- Clustering in Data Science
- Working with Network Data
- What is Plotting?
- String Manipulation
- What is Text Analysis?
What is Machine Learning?
Machine learning is the process by which a system improves its performance as it collects and learns from the data it is given.
For example, as a user writes more text messages on the phone, the phone learns more about the messages’ common vocabulary and can predict (autocomplete) the user’s words faster and more accurately. In the broader field of science, machine learning is a subfield of artificial intelligence and is closely related to applied mathematics and statistics. Machine learning has many applications in everyday life.
Applications of Machine Learning in Data Science
Regression and classification are of primary importance to a data scientist, and machine learning is one of the main tools used to achieve them. The uses for regression and automatic classification are wide-ranging, as the following examples show:
- Finding oil fields, gold mines, or archaeological sites based on existing sites (classification and regression)
- Finding place names or persons in text (classification)
- Identifying people based on pictures or voice recordings (classification)
- Recognizing birds based on their whistle (classification)
- Identifying profitable customers (regression and classification)
- Proactively identifying car parts that are likely to fail (regression)
- Identifying tumors and diseases (classification)
- Predicting the amount of money a person will spend on product X (regression)
- Predicting the number of eruptions of a volcano in a given period (regression)
- Predicting your company’s yearly revenue (regression)
- Predicting which team will win the Champions League in soccer (classification)
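These tasks split cleanly into predicting a label (classification) and predicting a quantity (regression). A minimal sketch of the distinction, using made-up customer data and stdlib-only stand-ins for real models:

```python
# Toy illustration of classification vs. regression (invented data).

def nearest_label(x, examples):
    """Classification: return the label of the closest known example."""
    return min(examples, key=lambda e: abs(e[0] - x))[1]

def mean_prediction(values):
    """Regression: predict a number, here simply the historical mean."""
    return sum(values) / len(values)

# Classify a customer by annual spend: is a spend of 950 "profitable"?
labeled = [(100, "unprofitable"), (900, "profitable"), (1200, "profitable")]
print(nearest_label(950, labeled))       # classification -> a label

# Predict next month's spend from past months: a quantity, not a label.
print(mean_prediction([180, 220, 200]))  # regression -> a number
```

A real project would use a trained model instead of these stand-ins, but the shape of the answer is the same: a category for classification, a number for regression.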
The modeling phase consists of four steps:
Engineering features and selecting a model
In feature engineering, you devise and create possible predictors for the model. This is one of the most important steps in the process, because the model recombines these features to make its predictions.
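As a sketch of what creating such predictors can look like, the snippet below derives candidate features (a ratio, an average, a flag) from hypothetical raw customer records; the field names and the threshold are illustrative only:

```python
# Hypothetical raw customer records; feature engineering derives
# candidate predictors the model can later recombine.
raw = [
    {"visits": 10, "purchases": 2, "total_spent": 120.0},
    {"visits": 4,  "purchases": 0, "total_spent": 0.0},
]

def engineer_features(record):
    purchases = record["purchases"]
    return {
        # Purchases per visit: captures how often visits convert.
        "conversion_rate": purchases / record["visits"],
        # Average order value; guarded against division by zero.
        "avg_order_value": record["total_spent"] / purchases if purchases else 0.0,
        # Simple boolean flag (threshold of 5 visits is arbitrary).
        "is_active": record["visits"] > 5,
    }

features = [engineer_features(r) for r in raw]
print(features[0])
```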
Training your model
With the right predictors in place and a modeling technique in mind, you can progress to model training. In this phase, you present your model with data from which it can learn. The most common modeling techniques have industry-ready implementations in almost every programming language, including Python.
These enable you to train your models by executing a few lines of code. For more state-of-the-art data science techniques, you’ll probably end up doing heavy mathematical calculations and implementing them with modern computer science techniques. Once a model is trained, it’s time to test whether it extrapolates to reality: model validation.
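As a sketch of such a few-lines training run, assuming scikit-learn is installed (the data here is made up and perfectly linear):

```python
# Minimal model training with an industry-ready implementation.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # one predictor per observation
y = [3, 5, 7, 9]           # target happens to follow y = 2x + 1

model = LinearRegression()
model.fit(X, y)            # the training step: the model learns from the data

print(model.predict([[5]]))  # apply the learned line to an unseen input
```

The fit/predict pattern shown here is the same for most scikit-learn models, which is what makes swapping in a different technique so cheap.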
Validating a model
Data science has many modeling techniques, and the question is which one is the right one to use. A good model has two properties: it has good predictive power, and it generalizes well to data it hasn’t seen. To achieve this, you define an error measure (how wrong the model is) and a validation strategy.
Two standard error measures in machine learning are the classification error rate for classification problems and the mean squared error for regression problems. The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better.
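The classification error rate is simple enough to compute by hand; a minimal sketch with invented labels:

```python
def classification_error_rate(y_true, y_pred):
    """Fraction of test observations the model mislabeled; lower is better."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

actual    = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "ham",  "ham", "spam"]
print(classification_error_rate(actual, predicted))  # -> 0.2 (1 of 5 wrong)
```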
The mean squared error measures how big the average error of your prediction is. Squaring the errors has two consequences. First, a wrong prediction in one direction can’t cancel out an incorrect prediction in the other direction. For example, overestimating next month’s turnover by 5,000 doesn’t cancel out underestimating it by 5,000 the following month. Second, squaring gives bigger errors even more weight than they would otherwise have: small errors remain small or can even shrink (if they are less than 1), whereas large errors are enlarged and will draw your attention.
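Both consequences of squaring show up in a small stdlib-only sketch (the turnover figures are invented):

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared prediction errors; lower is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Opposite-sign errors of 5,000 do not cancel: each contributes 25,000,000.
actual    = [100_000, 100_000]
predicted = [105_000,  95_000]   # +5,000 one month, -5,000 the next
print(mean_squared_error(actual, predicted))  # -> 25000000.0, not 0

# Squaring enlarges big errors and shrinks sub-1 errors.
print(10 ** 2, 0.5 ** 2)  # -> 100 0.25
```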
Predicting new observations
The process of applying your model to new data is called model scoring. Model scoring is something you implicitly did during validation, only now you don’t know the correct outcome. By now, you should trust your model enough to use it for real. Model scoring involves two steps. First, you prepare a data set whose features are precisely as your model defines them. This boils down to repeating the data preparation you did in step one of the modeling process, but for the new data set. Then you apply the model to this new data set, which results in a prediction.
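The two scoring steps can be sketched as follows; the coefficients and field names below are hypothetical stand-ins for a previously trained linear model:

```python
# Stand-in for a trained linear model: an intercept plus per-feature weights.
trained_model = {"intercept": 1.0, "coef": {"visits": 0.5, "purchases": 2.0}}

def prepare(record):
    """Step 1: build exactly the features the model was trained on,
    dropping anything else the new data set happens to contain."""
    return {name: float(record[name]) for name in trained_model["coef"]}

def score(record):
    """Step 2: apply the model; the true outcome is unknown at scoring time."""
    x = prepare(record)
    return trained_model["intercept"] + sum(
        w * x[name] for name, w in trained_model["coef"].items()
    )

# A new record with an extra field the model never saw during training.
new_record = {"visits": 4, "purchases": 3, "signup_date": "2024-01-01"}
print(score(new_record))  # -> 9.0  (1 + 0.5*4 + 2*3)
```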