Life Cycle of Data Analytics
The Data Analytics Lifecycle defines analytics process best practices spanning discovery to project completion. The lifecycle draws from established methods in the realm of data analytics and decision science. This synthesis was developed after gathering input from data scientists and consulting established approaches that provided input on pieces of the process. Several of the approaches that were consulted include these:
Scientific method: It provides a solid framework for thinking about and deconstructing problems into their principal parts. One of the most valuable ideas of the scientific method relates to forming hypotheses and finding ways to test ideas.
CRISP-DM: It provides useful input on ways to frame analytics problems and is a popular approach for data mining.
Tom Davenport’s DELTA framework: The DELTA framework offers an approach for data analytics projects, including the context of the organization’s skills, datasets, and leadership engagement.
Doug Hubbard’s Applied Information Economics (AIE) approach: AIE provides a framework for measuring intangibles and includes guidance on developing decision models, calibrating expert estimates, and deriving the expected value of information.
“MAD Skills” by Cohen et al.: It offers input for several of the techniques mentioned in Phases 2–4 that focus on model planning, execution, and critical findings.
The main phases of the Data Analytics Life Cycle are as follows:
In Phase 1 (discovery), the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which it can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Essential activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2 (data preparation) requires an analytic sandbox in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. ELT and ETL are sometimes combined under the abbreviation ETLT.
Data should be transformed in the ETLT process so the team can work with it and analyze it.
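The extract, transform, and load steps can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the CSV content, the `customers` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for the analytic sandbox.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a source system.
RAW_CSV = """customer_id,signup_date,monthly_spend
1,2023-01-15, 120.50
2,2023-02-03,
3,2023-02-20,87.10
"""

def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: cast types and drop rows with a missing spend value."""
    cleaned = []
    for row in rows:
        spend = row["monthly_spend"].strip()
        if not spend:
            continue  # skip incomplete records
        cleaned.append((int(row["customer_id"]), row["signup_date"], float(spend)))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a sandbox table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, signup_date TEXT, monthly_spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()

sandbox = sqlite3.connect(":memory:")  # in-memory database standing in for the sandbox
load(transform(extract(RAW_CSV)), sandbox)
row_count = sandbox.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_count)
```

Running the transform before the load makes this an ETL flow. An ELT variant would load the raw rows first and perform the cleanup inside the sandbox, for example with SQL.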
Data preparation tends to be the most labour-intensive step in the analytics lifecycle. It is common for teams to spend at least 50% of a data science project’s time in this critical phase. If the team cannot obtain enough data of sufficient quality, it may be unable to perform the subsequent steps in the lifecycle process.
Phase 3 is model planning, in which the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
Some of the activities to consider in this phase include the following:
Assess the structure of the datasets. The structure of the datasets is one factor that dictates the tools and analytical techniques for the next phase. Depending on whether the team plans to analyze textual data or transactional data, for example, different tools and approaches are required.
Ensure that the analytical techniques enable the team to meet the business objectives and accept or reject the working hypotheses.
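One simple way to explore relationships between variables during model planning is to rank candidate predictors by their correlation with a target variable. The sketch below assumes numeric data and uses the Pearson coefficient, which captures only linear relationships. The variable names and values are made up for illustration.

```python
import math

# Hypothetical sample: each key is a variable, each list holds one value per record.
data = {
    "ad_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "site_visits": [110.0, 205.0, 290.0, 405.0, 500.0],
    "returns": [5.0, 3.0, 6.0, 2.0, 4.0],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

target = "site_visits"
correlations = {
    name: pearson(values, data[target])
    for name, values in data.items()
    if name != target
}

# Rank candidate predictors by absolute correlation with the target.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
for name in ranked:
    print(f"{name}: {correlations[name]:+.3f}")
```

A ranking like this is only a starting point for variable selection. Domain knowledge and the working hypotheses from Phase 1 should still guide which variables go into the model.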
In Phase 4 (model building), the team develops datasets for testing, training, and production purposes. Also, in this phase, the team builds and executes models based on the work completed in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or if it will require a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
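As a rough sketch of how training and testing datasets might be produced, the following helper shuffles the prepared records and holds out a fraction for testing. The function name `train_test_split`, the 20% test fraction, and the fixed seed are illustrative assumptions. Libraries such as scikit-learn provide comparable utilities.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split them into training and testing lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # placeholder for prepared rows from the sandbox
train, test = train_test_split(records, test_fraction=0.2)
print(len(train), len(test))
```

Keeping the test records out of model fitting lets the team estimate how the model behaves on data it has not seen.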
During this phase, users run models from analytical software packages, such as R or SAS, on file extracts and small datasets for testing purposes. On a small scale, assess the validity of the model and its results. For instance, determine if the model accounts for most of the data and has robust predictive power. At this point, refine the models to optimize the results, such as by modifying variable inputs or reducing correlated variables where appropriate.
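The small-scale validity check described above can be illustrated with a simple linear model fitted on a small extract and scored on held-out points. This sketch uses ordinary least squares with a single input and the coefficient of determination (R²) as the measure of predictive power. The data values are invented, and a real project would typically rely on a package such as R, SAS, or a Python library instead of hand-written formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on the given data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical small extract: fit on the first points, score on the held-out rest.
train_x, train_y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x, test_y = [7.0, 8.0, 9.0], [14.2, 15.9, 18.1]

slope, intercept = fit_line(train_x, train_y)
score = r_squared(test_x, test_y, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} held-out R^2={score:.3f}")
```

A held-out R² close to 1 suggests the model generalizes on this sample. A much lower value would be a signal to revisit the variable inputs or the model choice.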
In Phase 5 (communicate results), the team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and create a narrative to summarize and convey results to stakeholders.
In Phase 6 (operationalize), the team delivers final reports, briefings, code, and technical documents. The team may also run a pilot project to implement the models in a production environment.