
Statistical Techniques Data Scientists Need to Master

Data science sits at the confluence of statistics and coding. Mastery of statistical techniques can help you properly collect, analyze, and derive results from data. Using statistical models is all about unearthing the gold in your data, and one of the first techniques you will encounter as a data scientist is linear regression. In an earlier blog, you learned what simple and multiple linear regression are, as well as how to run linear regression in Python.

Linear regression is a widely used statistical technique, but it isn’t the only one. In real-life situations, the statistical techniques you employ will depend on factors such as the nature of the data, your computational resources, and so on. To help you equip yourself with the right statistical tools, here is a brief introduction to 3 statistical techniques a data scientist must master.

Classification

Think of spam filters: every incoming email is treated as either spam or not spam. This is an example of classification, and in particular, binary classification. When you perform classification on structured or unstructured data, the goal is to map the data to certain classes/categories. So, if your hospital needs to decide, based on vitals, age, comorbidities, and so on, whether a person must be admitted to the ICU, a classification algorithm can help predict whether the patient belongs to the high-risk or low-risk class.

  1. Decision tree: If you’ve ever worked your way through a yes-no flowchart, then you’ve classified data according to rules you set. Should you go on vacation? Yes, if the budget is under Rs.75,000; no otherwise. Then, yes if the trip is at least 3 nights; no otherwise. This is how a decision tree is formed. It is visually appealing and easy to interpret, but overfitting and instability are known issues.
  2. Logistic regression: This predictive analysis technique is a favorite for binary classification problems – is the tumor cancerous (1) or not (0)? – and is based on the logistic (sigmoid) function. It works very well on linearly separable data, but this isn’t necessarily how things pan out in the real world.
  3. K-nearest neighbors: A lazy learner, KNN is a supervised ML algorithm that relies on the age-old adage that birds of a feather flock together. To find the class of a data point, the algorithm looks at the classes of its ‘k’ nearest neighbors and predicts by majority vote. Recommender systems are a well-known application of KNN. All three classifiers are demonstrated in the short sketch after this list.
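
To make this concrete, here is a minimal sketch, assuming scikit-learn is installed, that fits all three classifiers above on a synthetic binary-classification dataset; the data and parameter choices are illustrative only.

```python
# Compare a decision tree, logistic regression, and KNN on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data: 1,000 samples, 10 features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)              # learn from the training split
    accuracy = model.score(X_test, y_test)   # evaluate on held-out data
    print(f"{name}: test accuracy = {accuracy:.3f}")
```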

Here are some other classification techniques to explore: Naïve Bayes, random forest, stochastic gradient descent, artificial neural networks, discriminant analysis, and support vector machines.

Dimensionality reduction

An antidote to the “curse of dimensionality” that plagues high-dimensional feature sets, dimensionality reduction techniques aim to provide a low-dimensional space that retains the “essence” of the data. Suppose you run a shopping center and have a goldmine of data on consumer spending across different categories – clothing, food, beverages, perfumes, furniture, household appliances, and so on. If you want to forecast consumer spending on a new product, say, Greek yoghurt, the data pertaining to food and beverages can help you, while the rest may be extraneous.

Some real-life applications of dimensionality reduction include image compression, noise reduction, data visualization, and speeding up model training.
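
As an illustration, here is a minimal sketch using principal component analysis (PCA), one of the most widely used dimensionality reduction techniques, assuming NumPy and scikit-learn are available; the “spending” data is synthetic and invented purely for this example.

```python
# Reduce 6 synthetic "spending" categories to the few components that
# capture most of the variance, using PCA from scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)

# Simulate 500 shoppers whose spending across 6 categories is driven
# by just 2 hidden "taste" factors, plus a little noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
spending = latent @ mixing + rng.normal(scale=0.1, size=(500, 6))

# Keep however many components are needed to explain 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(spending)

print("Original dimensions:", spending.shape[1])  # 6
print("Reduced dimensions:", reduced.shape[1])    # ~2
print("Explained variance:", pca.explained_variance_ratio_.round(3))
```

Here the low-dimensional space recovers the two hidden factors – exactly the “essence” of the data that dimensionality reduction is after.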

Resampling

The dictionary definition of the word “resample” is to take a sample of or from (something) again. In statistics, resampling involves repeatedly drawing samples from the observed data to estimate the precision of an estimate of a population parameter. It is particularly useful when the distribution of the data is not known.

  1. Bootstrap: You repeatedly draw samples, with replacement, from the dataset, each typically the same size as the original. Bootstrapping is simple and intuitive, and its applications include estimating the variance and bias of a statistic, constructing confidence intervals, and hypothesis testing.
  2. K-fold cross-validation: You first split the data into k smaller folds, use one fold for testing, and keep the remainder for training. The process repeats iteratively until each fold gets a chance to be the test set. Average the scores recorded across the folds to evaluate the model’s performance. Both methods are demonstrated in the sketch after this list.
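
Here is a minimal sketch of both methods, assuming NumPy and scikit-learn are available; the data, the statistic (the mean), and the classifier are illustrative choices, not prescriptions.

```python
# Bootstrap a confidence interval, then run 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=42)

# --- Bootstrap: 95% confidence interval for the mean of skewed data. ---
data = rng.exponential(scale=2.0, size=200)  # distribution unknown in practice
boot_means = [
    rng.choice(data, size=data.size, replace=True).mean()  # resample w/ replacement
    for _ in range(5000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({lo:.2f}, {hi:.2f})")

# --- K-fold cross-validation: score a classifier across 5 folds. ---
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```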

Two other prominent resampling methods you can explore are the jackknife and permutation tests.

These 3 statistical techniques are commonly employed by data scientists, and the data science community is chock-full of more techniques and in-depth research papers you can dive into to build a firm foundation. However, nothing beats hands-on work, and the best way to get exposure to plenty of challenging real-world data science problems is to sign up with Talent500. Once you take our assessment, our smart algorithms will match your strengths with job openings at Fortune 500 companies. Compared to traditional modes of recruitment, the statistics favor our dynamic skill assessment method, which results in 5x faster hiring!


So, as you delve deeper into the math behind data science, make sure to #BeLimitless in your career by being in the population sample that top-tier companies pick from.
