Data Science Tutorial
Data Science Using Scikit-Learn
Several Python libraries offer solid implementations of a range of machine learning algorithms. One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of standard algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by useful and complete online documentation.
Data Representation in Scikit-Learn
Machine learning is about creating models from data. For that reason, we will start by discussing how data can be represented so that the computer can learn from it. The best way to think about data within Scikit-Learn is in terms of tables of data.
Data as table
A basic table is a two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the columns represent quantities associated with each of these elements. For example, consider the Iris dataset, famously analyzed by Ronald Fisher in 1936. We can download this dataset in the form of a Pandas DataFrame using the Seaborn library:
In: import seaborn as sns
    iris = sns.load_dataset('iris')
    iris.head()  # display the first five rows
Out:    sepal_length  sepal_width  petal_length  petal_width species
     0           5.1          3.5           1.4          0.2  setosa
     1           4.9          3.0           1.4          0.2  setosa
     2           4.7          3.2           1.3          0.2  setosa
     3           4.6          3.1           1.5          0.2  setosa
     4           5.0          3.6           1.4          0.2  setosa
Here, each row of the data refers to a single observed flower, and the number of rows is the total number of flowers in the dataset. In general, we will refer to the rows of the matrix as samples, and the number of rows as n_samples.
Each column of the data refers to a particular quantitative piece of information that describes each sample. In general, we will refer to the columns of the matrix as features, and the number of columns as n_features.
This table layout makes clear that the information can be thought of as a two-dimensional numerical array or matrix, which we will call the features matrix. By convention, this features matrix is often stored in a variable named X.
The features matrix is assumed to be two-dimensional, with shape [n_samples, n_features], and is most often contained in a NumPy array or a Pandas DataFrame, though some Scikit-Learn models also accept SciPy sparse matrices. The samples (i.e., rows) always refer to the individual objects described by the dataset.
For example, a sample can be a flower, a person, a document, an image, a sound file, a video, an astronomical object, or anything else we can describe with a set of quantitative measurements. The features (i.e., columns) always refer to the distinct observations that describe each sample in a quantitative manner. Features are generally real-valued but can be Boolean or discrete-valued in some cases.
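As a minimal sketch of this layout, the snippet below builds a small features matrix from the first three Iris rows shown earlier and checks its shape. The variable names (X, X_sparse) are only illustrative, and the sparse conversion assumes SciPy is installed.

```python
import numpy as np
from scipy import sparse

# Toy features matrix: 3 samples (rows) x 4 features (columns),
# using the first three rows of the Iris table shown earlier
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])

# The shape follows the [n_samples, n_features] convention
n_samples, n_features = X.shape

# The same data stored as a SciPy sparse matrix (compressed sparse row)
X_sparse = sparse.csr_matrix(X)
```

A sparse representation only pays off when most entries are zero, which is not the case for this toy matrix; it is shown here purely to illustrate that the shape convention is the same.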
In addition to the features matrix X, we also generally work with a label or target array, which by convention we will usually call y. The target array is usually one-dimensional, with length n_samples, and is generally contained in a NumPy array or Pandas Series.
The target array can have continuous numerical values or discrete classes/labels. While some Scikit-Learn estimators do handle multiple target values in the form of a two-dimensional [n_samples, n_targets] target array, we will generally be working with the typical case of a one-dimensional target array.
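The different target shapes can be illustrated with a short sketch. All values below are hypothetical and chosen only to show the shapes involved; the names y_class, y_value, and Y_multi are not part of any library.

```python
import numpy as np

# Discrete class labels: a one-dimensional array with one entry per sample
y_class = np.array(['setosa', 'versicolor', 'virginica'])

# Continuous target values (hypothetical numbers, for illustration only)
y_value = np.array([0.2, 1.3, 2.1])

# Two targets per sample, stacked into a [n_samples, n_targets] array
Y_multi = np.column_stack([y_value, 2 * y_value])
```

Here y_class and y_value both have length n_samples = 3, while Y_multi has shape (3, 2).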
For example, in the Iris data above, we may wish to construct a model that can predict the species of a flower based on the other measurements; in this case, the species column would be considered the target array.
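In: X_iris = iris.drop('species', axis=1)  # features: every column except species
    X_iris.shape
Out: (150, 4)
In: y_iris = iris['species']  # target: the species column
    y_iris.shape
Out: (150,)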
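Because sns.load_dataset fetches the data over the network on first use, an offline alternative may be convenient. The sketch below assumes scikit-learn 0.23 or later (for the as_frame option) with Pandas installed, and uses the copy of the Iris data bundled with scikit-learn. Note that this copy uses different column names from the Seaborn version and stores the species as integer codes in a column named target.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris data as a Pandas DataFrame (no download needed)
iris_df = load_iris(as_frame=True).frame

# Features matrix: every column except the target
X_iris = iris_df.drop('target', axis=1)

# Target array: integer species codes, one per sample
y_iris = iris_df['target']

print(X_iris.shape, y_iris.shape)
```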
Basics of the API
Most commonly, the steps in using the Scikit-Learn estimator API are as follows:
- Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
- Choose model hyperparameters by instantiating this class with the desired values.
- Arrange the data into a features matrix and target vector, following the discussion above.
- Fit the model to our data by calling the fit() method of the model instance.
- Apply the model to new data:
  - For supervised learning, we often predict labels for new data using the predict() method.
  - For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
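The steps above can be sketched end to end as follows. This is only one possible illustration: the choice of GaussianNB and PCA, the train/test split, and the random_state value are assumptions made for the example rather than anything prescribed by the text, and the bundled load_iris data is used so that no download is required.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Arrange the data into a features matrix X and target vector y
X, y = load_iris(return_X_y=True)

# Hold out part of the data to play the role of "new" data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised example: choose a model class and instantiate it
# (GaussianNB is used here with its default hyperparameters)
model = GaussianNB()

# Fit the model to the training data
model.fit(X_train, y_train)

# Apply the model to new data: predict labels
y_pred = model.predict(X_test)

# Unsupervised example: choose a model class and set a hyperparameter
pca = PCA(n_components=2)

# Fit to the features only (no target is needed)
pca.fit(X)

# Apply the model: transform the data to two dimensions
X_2d = pca.transform(X)

print(y_pred.shape, X_2d.shape)
```

The same pattern of instantiate, fit, then predict or transform applies to other estimators, which is what makes the API uniform.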