    What is Data Science?

    Data science is the discipline of extracting knowledge from data. It draws on computer science (for data structures, algorithms, visualization, big data support, and general programming), statistics (for regression and inference), and domain knowledge (for asking questions and interpreting results).

    Facets of Data

    The main facets of data are as follows:

    1. Structured data

      Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it’s often easy to store structured data in tables within databases or Excel files, as shown in the figure. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.
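
      As a minimal sketch of querying structured data with SQL, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name people, its columns, and the three rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", 34, "BE"), ("Bob", 51, "NL"), ("Cem", 27, "BE")],
)


def average_age_by_country(connection):
    """Return {country: average age} using a plain SQL aggregate."""
    rows = connection.execute(
        "SELECT country, AVG(age) FROM people GROUP BY country ORDER BY country"
    )
    return dict(rows.fetchall())


result = average_age_by_country(conn)
print(result)
```

      Because every record has the same fixed fields, a single declarative query is enough to group and aggregate the data.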

    2. Unstructured data

      Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email, as shown in the figure. Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find, for example, the number of people who have written an email complaint about a specific employee, because so many ways exist to refer to a person. The thousands of different languages and dialects further complicate this.
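
      The contrast between the structured and unstructured parts of an email can be sketched with Python's standard email package. The message text, the employee, and the list of name variants are all hypothetical, and the substring matching is deliberately naive to show why the task is hard.

```python
from email import message_from_string

# Hypothetical raw message, invented for illustration only.
RAW = """\
From: customer@example.com
Subject: Complaint

I want to complain about Mr. J. Smith, who was rude on the phone.
"""

msg = message_from_string(RAW)
sender = msg["From"]        # structured: a fixed header field
subject = msg["Subject"]    # structured: a fixed header field
body = msg.get_payload()    # unstructured: free text

# Hypothetical ways one employee might be referred to.
VARIANTS = ["john smith", "j. smith", "mr. smith", "johnny"]


def mentions_employee(text, variants=VARIANTS):
    """Naive check: does the text contain any known name variant?"""
    lowered = text.lower()
    return any(variant in lowered for variant in variants)


print(sender, mentions_employee(body))
print(mentions_employee("That Smith guy in billing ignored me."))
```

      The headers are read directly, but the body check only finds the variants someone thought to list; the second message refers to the same person and is still missed.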

    3. Natural language

      Natural language is a specific type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics. The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other fields.

    4. Machine-generated data

      Machine-generated data is information that’s automatically created by a computer, process, application, or another machine without human intervention. Machine-generated data is becoming a significant data resource and will continue to be one.

      Wikibon forecast that the market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the integration of complex physical machinery with networked sensors and software) would be approximately $540 billion in 2020. IDC (International Data Corporation) estimated that there would be 26 times more connected things than people in 2020. This network of connected things is known as the Internet of Things.

    5. Graph-based or network data

      “Graph data” can be a confusing term because any data can be shown in a graph. “Graph” in this case refers to mathematical graph theory. In graph theory, a graph is a mathematical structure that models pairwise relationships between objects. Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.

      Graph structures use nodes, edges, and properties to represent and store graph data. Graph-based data is a natural way to describe social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
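
      The two metrics mentioned above can be sketched on a small graph stored as an adjacency list. The friendship network is hypothetical, the shortest path is found with breadth-first search, and the number of direct connections (degree) is used here as one simple, assumed proxy for influence.

```python
from collections import deque

# Hypothetical undirected friendship network as an adjacency list.
FRIENDS = {
    "Ann": ["Bob", "Cem"],
    "Bob": ["Ann", "Dee"],
    "Cem": ["Ann", "Dee"],
    "Dee": ["Bob", "Cem", "Eve"],
    "Eve": ["Dee"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; return one shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def degree(graph, node):
    """Number of direct connections, a crude proxy for influence."""
    return len(graph.get(node, []))


print(shortest_path(FRIENDS, "Ann", "Eve"))
print(degree(FRIENDS, "Dee"))
```

      Breadth-first search suits unweighted graphs like this one; weighted edges would call for a different algorithm such as Dijkstra's.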

    6. Audio, image, and video

      Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers. MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video capture to approximately 7 TB per game for live, in-game analytics. High-speed cameras at stadiums capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines.

    7. Streaming data

      While streaming data can take almost any of the previous forms, it has an extra property: the data flows into the system when an event happens instead of being loaded into a data store in a batch.
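
      The difference between per-event and batch processing can be sketched with Python generators. The event source and its readings are hypothetical stand-ins for something like a sensor feed or message queue; the point is that the mean is updated as each event arrives rather than after everything is stored.

```python
def event_stream():
    """Hypothetical event source; a real one might be a socket or queue."""
    for reading in [3.0, 5.0, 4.0, 8.0]:
        yield reading


def running_mean(events):
    """Yield the mean of all events seen so far, updated per event."""
    count = 0
    mean = 0.0
    for value in events:
        count += 1
        # Incremental update: no need to keep earlier values in memory.
        mean += (value - mean) / count
        yield mean


means = list(running_mean(event_stream()))
print(means)
```

      A batch approach would collect all readings first and compute one mean at the end; the streaming version has a usable answer after every event.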

