
    Interview Questions

    Q1). What do you mean by Big Data?

    Big Data refers to datasets that are too large and complex to be managed by a traditional relational database, which is why special tools and techniques are used to operate on them. Big Data allows businesses to understand their operations better and helps them derive meaningful information on a regular basis from the raw, unstructured data they collect.

    Q2). What are the different types of Big Data?

    The three types of Big Data are as follows:

    • Structured Data:

      Structured data can be stored, processed, and retrieved in a fixed format. It is highly organised data, e.g. phone numbers, social security numbers, ZIP codes, employee records, and wages, which can be quickly analysed and processed.

    • Unstructured Data:

      This refers to data that has no particular structure or shape. Formats such as audio, video, social media posts, digital surveillance data, satellite data, and so on are the most common forms of unstructured data.

    • Semi-structured Data:

      This refers to data that combines aspects of both structured and unstructured formats: it does not conform to a fixed schema, but it carries tags or markers (as in JSON or XML) that give it some organisation.
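
    The three types above can be made concrete with a short sketch. This is an illustrative example only; the records, field names, and JSON snippet below are invented:

```python
import json

# Structured: fixed format/schema, every record has the same fields.
structured = [
    {"employee_id": 101, "zip_code": "10001", "wage": 55000},
    {"employee_id": 102, "zip_code": "94105", "wage": 61000},
]

# Unstructured: raw free text (or audio/video bytes) with no fixed fields.
unstructured = "Loved the product, but delivery took two weeks!"

# Semi-structured: tagged and nestable (here JSON), but not bound to a
# fixed schema -- fields can vary from record to record.
semi_structured = json.loads(
    '{"user": "a01", "tags": ["video", "audio"], "meta": {"lang": "en"}}'
)
```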

    Q3). What are the five V’s of Big Data?

    The five V’s of Big Data are as follows:

    • Volume:

      Volume refers to the amount of data, which is growing at a high rate, i.e. data volumes measured in petabytes.

    • Velocity:

      Velocity is the pace at which data grows. Social media plays a major role in the speed of data growth.

    • Variety:

      Variety refers to the different types of data, i.e. the various data formats such as text, audio, video, etc.

    • Veracity:

      Veracity refers to the uncertainty of available data, which arises because the sheer volume of data brings incompleteness and inconsistency.

    • Value:

      Value refers to turning data into business value. By converting the big data they collect into insights, businesses can generate revenue.

    Q4). How Big Data and Hadoop are related to each other?

    Big Data and Hadoop are closely related but not interchangeable: Hadoop is a framework built specifically for storing and processing big data, and it grew in popularity along with big data itself. Professionals use the platform to analyse big data and help organisations make decisions.

    Q5). How to process Big Data?

    MapReduce is one of the most common processing models. It consists primarily of two phases, called the Map and Reduce phases, with an intermediate Shuffle step between them. A given job proceeds as follows:

    • The input is divided into fixed-size splits, and each split is given to a mapper. The mappers run in parallel, which greatly decreases execution time and produces the output quickly.
    • The input to a mapper is a key-value pair, and its output is a further set of key-value pairs. This intermediate output is shuffled and passed to the reducers, whose output is the final result.
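
    The Map, Shuffle, and Reduce phases described above can be sketched in pure Python. This is a minimal single-machine illustration of the flow, not Hadoop's actual implementation; the word-count job and input splits are invented for the example:

```python
from collections import defaultdict

def mapper(split):
    """Map phase: emit a (word, 1) key-value pair for every word in a split."""
    for word in split.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    """Shuffle phase: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: aggregate the values for one key (here, a sum)."""
    return (key, sum(values))

# The input is divided into fixed-size splits; each split goes to a mapper.
# In Hadoop the mappers would run in parallel on different nodes.
splits = ["big data needs big tools", "data tools process data"]
mapped = [pair for split in splits for pair in mapper(split)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
# counts == {'big': 2, 'data': 3, 'needs': 1, 'tools': 2, 'process': 1}
```
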

    Q6). Name the tools that are used to extract big data.

    There are various tools available for extracting big data, for instance Flume, Kafka, NiFi, Sqoop, Chukwa, Talend, Morphlines, and Scriptella.

    Q7). Explain how missing values are handled in Big Data?

    Missing values are values that are absent for a particular column. If we do not take care of them, they may lead to inaccurate data and incorrect results, so missing values should be handled properly before the big data is processed, to ensure we obtain a correct sample. There are different ways of treating missing values: we may either drop the affected data or replace it through imputation. In general practice, if the number of missing values is small, the affected rows are dropped; if the number of affected instances is larger, imputation is performed. Some statistical techniques for estimating missing values are as follows:

    • Regression
    • Maximum Likelihood Estimation
    • Listwise/pairwise Deletion
    • Multiple data imputation etc.
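
    Two of the techniques listed above can be sketched with the standard library alone: listwise deletion and a simple mean imputation. This is an illustrative example, and the dataset and column names are made up:

```python
from statistics import mean

rows = [
    {"age": 34, "salary": 50000},
    {"age": None, "salary": 62000},   # missing age
    {"age": 29, "salary": None},      # missing salary
]

# Listwise deletion: drop any row that contains a missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Mean imputation: replace each missing value with the column mean
# computed from the observed (non-missing) values.
def impute_mean(rows, column):
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [dict(r, **{column: r[column] if r[column] is not None else fill})
            for r in rows]

imputed = impute_mean(impute_mean(rows, "age"), "salary")
```
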

    Q8). Define the term "fsck"?

    Fsck stands for "file system check". It is a command used by HDFS to search for inconsistencies and check whether there are problems with files. For example, if a file has any missing blocks, HDFS is notified through this command.

    Q9). Why is Hadoop used for data analytics?

    Since data analytics has become one of the key business parameters, businesses deal with a tremendous amount of structured, unstructured, and semi-structured data. Unstructured data is very difficult to analyse, and this is where Hadoop plays a major role with its storage, processing, and data-collection capabilities. Moreover, Hadoop is open source and runs on commodity hardware, so it is a cost-effective option for corporations.

    Q10). Name the command which is used to format the NameNode?

    $ hdfs namenode -format

    Copyright 1999- Ducat Creative, All rights reserved.