Top 10 Data Science Interview Questions

Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge from data in various forms and to make decisions based on that knowledge. It is no surprise that, in the era of Big Data and Machine Learning, data science is in great demand and data scientists are among the highest-paid IT professionals. This blog covers a few of the most frequently asked questions in data science interviews.

Here is a list of the Top 10 Popular Data Science Interview questions:


1. List the differences between Supervised and Unsupervised Learning.

Supervised Learning:
  - Uses known and labeled data as input
  - Has a feedback mechanism
  - Used for prediction
  - Enables classification and regression
  - Most commonly used algorithms: decision trees, logistic regression, and support vector machines

Unsupervised Learning:
  - Uses unlabeled data as input
  - Has no feedback mechanism
  - Used for analysis
  - Enables clustering, density estimation, and dimension reduction
  - Most commonly used algorithms: k-means clustering, hierarchical clustering, and the Apriori algorithm

2. What do you understand by linear regression?

Linear regression is a supervised learning algorithm that helps in understanding the linear relationship between two variables: the predictor (independent variable) and the response (dependent variable). In linear regression, we try to understand how the dependent variable changes with respect to the independent variable. If there is only a single independent variable, it is called simple linear regression, whereas if there is more than one independent variable, it is known as multiple linear regression.
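A minimal sketch of simple linear regression fit by ordinary least squares, in pure Python with toy data (in practice, scikit-learn's LinearRegression or numpy.polyfit would be used):

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data roughly following y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 11.1]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(round(slope, 2), round(intercept, 2))  # close to 2 and 1
```

The fitted slope and intercept recover the underlying linear trend despite the noise in the toy responses.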

3. How is logistic regression done?

Logistic regression determines the relationship between the dependent variable and one or more independent variables by estimating probabilities using the underlying logistic (sigmoid) function.

The sigmoid function maps any real-valued input z to a value in the range (0, 1):

sigmoid(z) = 1 / (1 + e^(-z))
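A minimal Python sketch of the sigmoid function, which logistic regression applies to a linear combination of the inputs to obtain a probability:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5: an input of zero gives an even probability
print(sigmoid(10) > 0.99, sigmoid(-10) < 0.01)  # large inputs saturate toward 1 or 0
```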

4. What is a Confusion Matrix?

The confusion matrix is a 2x2 table that contains the 4 outputs provided by a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, can be derived from it.

  - True Positive: the actual value is true and the predicted value is also true.
  - False Negative: the actual value is true, but the predicted value is false.
  - False Positive: the actual value is false, but the predicted value is true.
  - True Negative: the actual value is false and the predicted value is also false.

The correct predictions are fundamentally the true positives and the true negatives. This is how the confusion matrix operates.
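A minimal pure-Python sketch computing these four counts and a few derived measures on toy labels (scikit-learn's confusion_matrix does this in practice):

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

# Toy labels for illustration
actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)

accuracy  = (tp + tn) / len(actual)   # fraction of correct predictions
precision = tp / (tp + fp)            # of predicted positives, how many are real
recall    = tp / (tp + fn)            # of real positives, how many were found (sensitivity)
print(tp, fp, fn, tn, accuracy)       # 3 1 1 3 0.75
```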


5. How do you build a random forest model?

A Random Forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree in each of the different groups of data, the random forest brings all those trees together.

Steps to build a random forest model:

  1. Randomly select ‘k’ features from a total of ‘m’ features where k << m.
  2. Among the ‘k’ features, calculate the node D using the best split point.
  3. Split the node into daughter nodes using the best split.
  4. Repeat steps two and three until leaf nodes are finalized.
  5. Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees.
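The steps above can be sketched with scikit-learn, assuming it is installed; its RandomForestClassifier performs the feature sampling (step 1) and tree building (steps 2–4) internally:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# n_estimators corresponds to 'n' trees; max_features to the 'k' features
# randomly sampled at each split
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # accuracy on held-out data
```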


6. How is Data Modeling different from Database Design?

Data Modeling: It can be considered the first step towards the design of a database. Data modeling creates a conceptual model based on the relationships between various data models. The process involves moving from the conceptual stage to the logical model to the physical schema, and it involves the systematic application of data modeling techniques.

Database Design: It is the process of designing the database. The output of database design is a detailed data model of the database. Strictly speaking, database design includes the detailed logical model of a database, but it can also include physical design choices and storage parameters.


7. How can you avoid the overfitting of your model?

Overfitting refers to a model that fits the training data too closely, capturing noise rather than the underlying pattern, so it performs poorly on new data. There are three main methods to avoid overfitting:

  1. Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data.
  2. Use cross-validation techniques, such as k-fold cross-validation.
  3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting.
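As an illustration of point 2, a pure-Python sketch of how k-fold cross-validation partitions the data (a hypothetical helper; in practice scikit-learn's KFold handles this):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each sample appears in exactly one test fold, so every data point is
    used for validation once and for training k-1 times.
    """
    # Distribute samples as evenly as possible across the k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds), folds[0])  # 5 folds; first fold tests on samples 0 and 1
```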


8. What do you understand by the term Normal Distribution?

Data is usually distributed in different ways, with a bias to the left or to the right, or it can be jumbled up. However, data can also be distributed around a central value without any bias to the left or right; such data follows a normal distribution, forming a bell-shaped curve.

Properties of a Normal Distribution are as follows:

  1. Unimodal: it has one mode
  2. Symmetrical: the left and right halves are mirror images
  3. Bell-shaped: the maximum height (mode) occurs at the mean
  4. The mean, mode, and median are all located at the center
  5. Asymptotic: the tails approach, but never touch, the horizontal axis
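The symmetry property can be checked empirically; a small sketch using Python's standard library to draw normal samples and compare the mean and median:

```python
import random

random.seed(0)
# Draw many samples from a standard normal distribution (mean 0, std dev 1)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

mean = sum(samples) / len(samples)
median = sorted(samples)[len(samples) // 2]

# For a symmetric distribution, mean and median coincide at the center
print(abs(mean) < 0.05, abs(median) < 0.05)  # True True
```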


9. What is correlation and covariance in statistics?

Covariance and correlation are two mathematical concepts widely used in statistics. Both establish the relationship between, and measure the dependency of, two random variables. Though they do similar work in mathematical terms, they differ from each other.

Correlation: Correlation is considered the best technique for measuring and estimating the quantitative relationship between two variables. Correlation measures how strongly two variables are related, on a standardized scale from -1 to 1.

Covariance: Covariance is a measure that indicates the extent to which two random variables change in tandem. It is a statistical term that explains the systematic relationship between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other.
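A minimal pure-Python sketch of both measures on toy data (in practice, numpy.cov and numpy.corrcoef do this):

```python
def covariance(xs, ys):
    """Sample covariance: average co-movement around the means (n-1 denominator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the unit-free range [-1, 1]."""
    sx = covariance(xs, xs) ** 0.5   # sample standard deviation of xs
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # ys = 2 * xs, a perfect linear relationship
print(covariance(xs, ys), correlation(xs, ys))  # 5.0 1.0
```

Note that the covariance (5.0) depends on the scale of the data, while the correlation (1.0) reports the same perfect linear relationship in unit-free form.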


10. What is a p-value?

When you perform a hypothesis test in statistics, a p-value helps you determine the strength of your results. A p-value is a number between 0 and 1 that indicates how compatible the observed data are with the claim on trial, which is called the null hypothesis.

A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which means we can reject it. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it. A p-value near 0.05 is marginal, and the hypothesis could go either way.
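As a toy illustration, an exact two-sided binomial test for a fair coin in pure Python (the scenario and values are illustrative; in practice scipy.stats provides such tests):

```python
from math import comb

def binomial_p_value(heads, flips, p=0.5):
    """Exact two-sided p-value for H0: the coin is fair.

    Sums the probability of every outcome at least as far from the
    expected count (flips * p) as the one observed.
    """
    observed_deviation = abs(heads - flips * p)
    return sum(
        comb(flips, k) * p**k * (1 - p)**(flips - k)
        for k in range(flips + 1)
        if abs(k - flips * p) >= observed_deviation
    )

# 60 heads in 100 flips: the p-value lands near the 0.05 threshold,
# so the evidence against fairness is marginal
print(round(binomial_p_value(heads=60, flips=100), 3))
```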


For data scientists, the work isn’t easy, but it’s rewarding and there are plenty of available positions out there. These data science interview questions can help you get one step closer to your dream job. So, prepare yourself for the rigors of interviewing and stay sharp with the nuts and bolts of data science.
