 # 13 Tricky Theoretical Data Science Interview Questions & Their Answers

It’s the 21st century: data is the new gold, and data science jobs are in high demand. Interest in data science has risen 326% in the last 5 years, and a massive 9,700% in the last 10, data from Exploding Topics reveals. In 2019, the Data Science and Big Data industry in India was already worth $3.03 billion, as per a study by Analytics India Magazine and Praxis Business School, and this figure is projected to double by 2025.

Data scientists convert mounds of 1s and 0s into meaningful insights and information, helping companies solve complex problems and model scientific solutions. With every company slowly becoming a ‘tech’ company, the need for data scientists is becoming increasingly universal.

However, as an aspiring data scientist, remember that the field requires you to have excellent math and problem-solving skills, alongside the ability to code well. To get a coveted job at a company that prides itself on employing the cream of the crop, a strong theoretical foundation can prove to be the deciding factor. To help you prepare for your next data science interview, here are the answers to 13 tough theoretical questions.

### 1. List out a few assumptions you’d make when implementing linear regression.

• Linear relationship: Between independent and dependent variables
• No autocorrelation: The residual errors are independent of each other
• Residual normality: Residuals (errors) follow a normal distribution, though this matters less as the sample size grows
• Homoscedasticity: Constant/similar variance of error terms across different values of variables
• Low (no) multicollinearity: Independent variables don’t follow a perfectly linear relationship
• Model correctly specified: Important independent variables are not missing
• Additivity: The prediction is the sum of each independent variable multiplied by its coefficient
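As a quick illustration of the linearity and additivity assumptions, here is a minimal sketch (using simulated data, with coefficients I have made up for the example) that fits an ordinary least squares model and confirms that the residuals are centred on zero:

```python
import numpy as np

# Hypothetical data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=x.size)

# Ordinary least squares via the normal equations.
X = np.column_stack([np.ones_like(x), x])       # design matrix with an intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # [intercept, slope]

residuals = y - X @ beta
print(beta)              # close to the true [2.0, 3.0]
print(residuals.mean())  # essentially 0: residuals are centred
```

In practice one would also plot the residuals against the fitted values to eyeball homoscedasticity and normality.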

### 2. Distinguish between Gradient Descent (GD) and Stochastic Gradient Descent (SGD).

In GD, the entire training set is used for each parameter update; in SGD, a single randomly chosen training instance is used per update. GD converges slowly and requires more computation per step, but moves towards the optimum smoothly. SGD is faster per step and computationally lighter, but yields a good, though noisier and not necessarily optimal, solution.
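The difference can be sketched on a toy linear regression problem (all data and step sizes below are illustrative choices, not canonical values): GD computes the gradient over the whole batch, SGD over one random row.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w + rng.normal(0, 0.1, size=200)

def grad(w, Xb, yb):
    # Gradient of the mean squared error over a (mini-)batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient descent: every step uses the entire training set.
w_gd = np.zeros(2)
for _ in range(200):
    w_gd -= 0.1 * grad(w_gd, X, y)

# Stochastic gradient descent: each step uses one random training instance.
w_sgd = np.zeros(2)
for _ in range(2000):
    i = rng.integers(len(y))
    w_sgd -= 0.01 * grad(w_sgd, X[i:i+1], y[i:i+1])

print(w_gd, w_sgd)   # both approach [1.5, -2.0]; the SGD estimate is noisier
```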

### 3. What is regularisation and why is it useful? Explain L1 and L2 regularisation.

Regularisation in ML models helps prevent overfitting (where the model fits the training set too closely and hence performs poorly on new data). Here, the model is made more general by the addition of a tuning parameter, a penalty term. L1 (lasso) and L2 (ridge) regularisation are two common forms of regularisation.

• In case of L1, i.e., Lasso or Least Absolute Shrinkage and Selection Operator regularisation, Lambda times the sum of the absolute magnitude of the coefficients is added to the loss function.
• In case of L2 or Ridge regularisation, Lambda times the sum of the squared magnitude of the coefficients is added to the loss function.

*Lambda denotes the amount of regularisation, with too large a value resulting in underfitting.

### 4. What is Random Forest and how does it work? Can you give a real-life application of it?

Random Forest is an ML method in which an ensemble of Decision Trees, rather than a single tree, is built to perform classification and regression tasks. Random Forests are trained with the bagging (bootstrap aggregating) algorithm, where combining many weaker models improves the stability and accuracy of the overall model. Random Forest also adds feature randomness when growing trees; the idea is to create a forest of decorrelated trees whose combined output is more stable and precise. For classification problems, the majority vote is taken, and for regression problems, the average value. As a real-life application, financial institutions can use Random Forest models to assess credit risk and flag risky customers.
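The core bagging-plus-voting idea can be sketched in a few lines. This is a deliberately simplified toy, not a full Random Forest: it uses depth-1 “trees” (decision stumps) on a made-up one-dimensional dataset and omits feature subsampling, which a real forest would add.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset: the true label is 1 exactly when x > 0.
X = rng.normal(size=300)
y = (X > 0).astype(int)

def fit_stump(Xb, yb):
    """A depth-1 'tree': pick the threshold that best splits the labels."""
    best_t, best_acc = Xb[0], 0.0
    for t in np.unique(Xb):
        acc = np.mean((Xb > t) == yb)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bagging: train each stump on a bootstrap resample of the data.
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    stumps.append(fit_stump(X[idx], y[idx]))

def predict(x):
    votes = [int(x > t) for t in stumps]
    return int(sum(votes) > len(votes) / 2)      # majority vote

print(predict(1.3), predict(-0.7))   # classifies both sides of the boundary
```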

### 5. What happens if you set all the weights of a neural network to zero?

The model won’t learn anything useful, because all neurons in a layer receive identical gradients and therefore evolve symmetrically, learning the same thing. With zero weights, no signal propagates through the connections, and gradient updates cannot break the symmetry. Initialising with random weights of appropriate magnitude breaks the symmetry, which makes for better, faster training.
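The symmetry problem can be demonstrated directly on a tiny hand-rolled network (the input, target, and layer sizes below are arbitrary illustrative choices): after a forward and backward pass from a zero initialisation, every hidden unit still has exactly the same weights.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.2, -0.4, 0.7])   # one input example
t = 1.0                          # target output

# Zero initialisation: every hidden neuron starts identical.
W1 = np.zeros((4, 3))            # input -> hidden
W2 = np.zeros(4)                 # hidden -> output

# One forward/backward pass (squared error loss, sigmoid hidden layer).
h = sigmoid(W1 @ x)              # all 0.5: the neurons are indistinguishable
out = W2 @ h
err = out - t
grad_W2 = err * h                                 # identical for every hidden unit
grad_W1 = np.outer(err * W2 * h * (1 - h), x)     # all zero: no signal reaches W1

W1 -= 0.1 * grad_W1
W2 -= 0.1 * grad_W2

# Symmetry persists after the update: all hidden units share the same weights.
print(np.allclose(W1, W1[0]))   # True
print(np.allclose(W2, W2[0]))   # True
```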

### 6. What is TF-IDF and how do you calculate it?

TF-IDF stands for term frequency–inverse document frequency and is a statistic that tells you how important or relevant a word or term is in a particular document within a corpus. TF-IDF is used in summarisation and information retrieval. A high TF-IDF score means that the word is rarer and hence, potentially more relevant. TF highlights how frequently a term occurs, relative to the length of the document, while IDF puts the spotlight on the rare terms.

TF(w) = (number of times term w occurs in the document) / (total number of terms in the document)

IDF(w) = logₑ(total number of documents / number of documents containing term w)

TF-IDF(w) = TF(w) × IDF(w)
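The formulas above translate directly into code. In this sketch (on a tiny made-up corpus), a common word like “the” scores low while a rare word like “mat” scores higher:

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(word, doc):
    # Term frequency: occurrences of the word over the document length.
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # Inverse document frequency: log of total docs over docs containing the word.
    n_containing = sum(word in doc for doc in corpus)
    return math.log(len(corpus) / n_containing)

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

# "the" appears in most documents -> low score; "mat" is rare -> higher score.
print(tf_idf("the", docs[0], docs))
print(tf_idf("mat", docs[0], docs))
```

Production systems usually apply smoothing to IDF to avoid division by zero for unseen terms.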

### 7. Explain the precision-recall (PR) curve and comment on the area under the curve as a metric

The PR curve is a graph in which you plot the Precision values (TP / (TP + FP)) on the Y axis and the Recall values (TP / (TP + FN)) on the X axis, for different thresholds. The better the algorithm, the greater the area under the PR curve (AUC). High area implies high precision, i.e., low false positives, and high recall, i.e., low false negatives.
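Sweeping the threshold and recomputing precision and recall at each point is all there is to it. A minimal sketch on made-up classifier scores:

```python
import numpy as np

# Hypothetical scores and true labels from a binary classifier.
scores = np.array([0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,    1,   0,   1,   0,    1,   0,   0])

precisions, recalls = [], []
for t in sorted(scores):                 # sweep the decision threshold upwards
    pred = scores >= t
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    precisions.append(tp / (tp + fp))    # TP / (TP + FP)
    recalls.append(tp / (tp + fn))       # TP / (TP + FN)

# Lowest threshold: everything predicted positive -> recall 1, modest precision.
# Highest threshold: only the top prediction kept -> precision 1, low recall.
print(list(zip(recalls, precisions)))
```

Plotting `recalls` against `precisions` gives the PR curve; integrating it (e.g. by the trapezoidal rule) gives the PR AUC.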

### 8. What is dropout in neural networks?

Dropout is a technique used to prevent overfitting: during training, units are temporarily ‘dropped out’ of the neural network, which amounts to training many thinned models and averaging them. Dropout reduces co-dependency amongst neurons. Neurons are zeroed out at random, and the remaining activations are scaled up proportionately (by 1/(1 − p) in the common ‘inverted dropout’ scheme) so that the expected output is unchanged.
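A minimal sketch of inverted dropout (the function name and drop rate are illustrative): roughly a fraction p of units are zeroed, and scaling the survivors keeps the mean activation the same as without dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(activations, p_drop, training=True):
    """Zero each unit with probability p_drop; scale survivors by 1/(1 - p_drop)."""
    if not training:
        return activations               # no dropout at inference time
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1 - p_drop)

a = np.ones(100_000)
out = inverted_dropout(a, p_drop=0.3)

print(np.mean(out == 0))   # ~0.3 of the units are dropped
print(out.mean())          # ~1.0: the expected activation is preserved
```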

### 9. What is the cold start problem and how would you solve it?

The cold start problem in recommender systems refers to a scenario where the system has insufficient data about new users or new items to generate optimal results. Three methods that can be used to tackle this problem are:

• Representative-based: Identifies the user with a set of representative items
• Content-based: Utilises side information about users or items, e.g., profiles, item attributes, or social media data
• Popularity-based: Display items that have high clicks/ sales for a given region/time
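The popularity-based fallback is the simplest of the three and is easy to sketch. Everything below (the interaction log, function name, and the `...` placeholder for the personalised path) is hypothetical scaffolding for illustration:

```python
from collections import Counter

# Hypothetical interaction log of (user, item) click events.
clicks = [("u1", "A"), ("u2", "A"), ("u3", "B"),
          ("u1", "B"), ("u4", "A"), ("u2", "C")]

def recommend(user, known_users, k=2):
    """Fall back to globally popular items for users we have no history for."""
    if user not in known_users:
        popularity = Counter(item for _, item in clicks)
        return [item for item, _ in popularity.most_common(k)]
    ...  # otherwise, defer to the personalised recommender

known = {u for u, _ in clicks}
print(recommend("new_user", known))   # the most-clicked items overall
```

In practice the popularity counts would be segmented by region or time window, as the answer above notes.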

### 10. Is it better to spend 5 days developing a 90% accurate solution or 10 days for 100% accuracy?

Often, it is better to build a model quickly and then refine it based on patterns in new data instances; a claimed 100% accuracy is in any case a red flag for overfitting. However, the context also matters; in banking and fraud detection, for instance, the extra accuracy is worth the extra time. The complexity of the model, and hence the time budget, can be decided based on the learning task.

### 11. With respect to ensembles, do you think 50 small decision trees are better than a large one?

It depends on the practical circumstances but the goal in choosing them would be to:

• Prevent overfitting
• Avoid model biases
• Get a more robust model

It may be advisable to try both and weigh the pros and cons.

### 12. What is the curse of dimensionality and how to overcome it?

The curse of dimensionality (COD) is a term used to describe the phenomena that occur when performing tasks on data in high-dimensional spaces that do not arise in low-dimensional spaces. High noise, data sparsity, and lack of association are examples of issues that occur.

To overcome COD one may use a dimensionality reduction technique such as:

• Principal component analysis (PCA)
• Linear discriminant analysis (LDA)
• t-distributed stochastic neighbour embedding (t-SNE)
• Autoencoder
• Missing Values Ratio
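As an example of the first technique, here is a minimal PCA sketch via the SVD (the data is simulated so that a known 2-D signal is embedded in 5 dimensions; the sizes are arbitrary): two components recover essentially all the variance, so the remaining three dimensions can be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 points in 5-D, but the signal really lives in 2 dimensions.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(0, 0.01, size=(100, 5))

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)      # fraction of variance per component

X_reduced = Xc @ Vt[:2].T            # project onto the top 2 components

print(explained[:2].sum())   # ~1.0: two components capture nearly everything
print(X_reduced.shape)       # (100, 2): dimensionality reduced from 5 to 2
```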

### Bonus Question: What is RankNet, LambdaRank, and LambdaMART? Distinguish between them.

RankNet, LambdaRank, and LambdaMART are three learning-to-rank (LTR) algorithms developed at Microsoft Research by Chris Burges and his colleagues.

• RankNet uses SGD to optimise the cost function, seeking to eliminate the incorrect ordering of results in a ranked list.
• LambdaRank trains the RankNet model without the costs, using the gradients only. Scaling the gradients by the absolute change in Normalized Discounted Cumulative Gain (NDCG) gives good results.
• LambdaMART brings together LambdaRank and MART (Multiple Additive Regression Trees). “LambdaMART is the boosted tree version of LambdaRank”, according to Chris Burges’ paper.
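For intuition, the standard RankNet pairwise cost for a pair where item *i* should rank above item *j* is the cross-entropy of a sigmoid of the score difference, C = log(1 + e^(−σ(sᵢ − sⱼ))). A minimal sketch (function name and scores are my own, for illustration):

```python
import math

def ranknet_pair_cost(s_i, s_j, sigma=1.0):
    """RankNet cost for a pair where item i should be ranked above item j."""
    return math.log(1 + math.exp(-sigma * (s_i - s_j)))

# Correctly ordered pair with a wide margin -> low cost;
# incorrectly ordered pair -> high cost, pushing the scores apart in training.
print(ranknet_pair_cost(3.0, 0.0))   # small
print(ranknet_pair_cost(0.0, 3.0))   # large
```

LambdaRank keeps the gradient of this cost but rescales it by |ΔNDCG| for swapping the pair, and LambdaMART fits those gradients with boosted regression trees.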

As you cover the various theoretical bases of data science, from regression and classification to recommender systems and rank and search, remember to check two more boxes:

1. Work on improving your technical (coding) skills 