Data Science from a Statistician’s Perspective

3 min read · Oct 1, 2020

Statistics is often defined as the science of data: it deals with the collection, organization, analysis, interpretation, and presentation of data. A newer term, Data Science, has emerged that can be given much the same definition. There is still no consensus on how to define it, but it is generally accepted that Data Science is a multidisciplinary field that combines knowledge from computer science, information technology, business, economics, statistics and other areas. Personally, I’d like to think that data science lets us know how to do things and statistics reminds us why.

In this article, we will not discuss in detail the technicalities of differentiating Statistics from Data Science (labels are not important, except in relationships). Instead, we will discuss how the former is used in the latter.

Lies, Damned Lies and Statistics

The saying “There are three kinds of lies: lies, damned lies and statistics” was popularized by Mark Twain, who attributed it to Benjamin Disraeli. It emphasizes the power of numeric evidence to change people’s views and shift decision-making processes. We live in a time when decisions in politics, business and academia are backed by data. It is therefore important to understand how data can be misrepresented and manipulated for personal interests.

As previously defined, statistics covers the whole process of collecting, organizing, analyzing and interpreting data. Knowledge of statistics helps us avoid common mistakes in Data Science, from data collection through data interpretation. Here are some statistical concepts that Data Scientists should understand:

Levels of Measurement

Measures of Central Tendency, Variability and Variable Relationships

Sensitivity, Specificity and Accuracy

Hypothesis Testing

Probability Distributions

Regression
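To illustrate one item from the list above, here is a minimal sketch of how sensitivity, specificity and accuracy are computed from the four counts of a confusion matrix. The function name and the screening numbers are hypothetical and chosen only for illustration. They show how a headline accuracy figure can look impressive while the test still misses half of the actual cases.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives that were detected
    specificity = tn / (tn + fp)  # share of actual negatives that were cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all cases classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical screening test on 1,000 people, only 10 of whom have the condition.
sens, spec, acc = classification_metrics(tp=5, fp=10, tn=980, fn=5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.3f}")
# prints: sensitivity=0.50 specificity=0.99 accuracy=0.985
```

With a rare condition, accuracy is dominated by the many true negatives, so it should be read together with sensitivity and specificity rather than on its own.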

Garbage in, Garbage out

What statistics and data science share most is data. Reliable insights cannot be drawn from models and tests when the data contain errors. Exploring and cleaning the data is a must before applying statistical tests or building models.

Exploratory Data Analysis (EDA) is the mode of analysis concerned with discovery, exploration, and empirically detecting phenomena in data. It is one approach to data analysis, distinct from the Classical and Bayesian approaches.

EDA reveals patterns and properties of the data and helps us decide which techniques or models are most appropriate. It lets the data reveal itself, along with the tests and models that best fit it. It tells us whether there are outliers and anomalies that will influence our analysis, which variables are important to look at, and how these variables relate to one another. Most of the time, EDA requires few or no assumptions about the data, unlike Classical and Bayesian data analysis.
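As a small illustration of the outlier check described above, the sketch below uses Python's standard `statistics` module and the common 1.5 × IQR rule of thumb. The `iqr_outliers` helper, the cutoff and the `ages` values are assumptions made for this example, not a prescribed procedure.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 250 stands in for a data-entry error.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 250]

print(f"mean={statistics.mean(ages):.1f} median={statistics.median(ages)}")
# prints: mean=54.4 median=31
print(iqr_outliers(ages))
# prints: [250]
```

A single bad value pulls the mean far away from the median here, which is the kind of signal EDA is meant to surface before any test or model is applied.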

Statistical models and tests have their own assumptions, such as normality, homoscedasticity and linearity. These assumptions act like requirements that must be met before you use a model or a test, so it is important to explore the data first and check whether they are satisfied. EDA also shows us how to clean the data: it reveals errors, inaccuracies, incompleteness and inconsistencies. Failing to check these assumptions or to properly clean the data will result in false conclusions.
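One rough way to screen for a departure from normality is to look at sample skewness, which is close to zero for symmetric data. The sketch below is only an informal check under that assumption. It is not a formal normality test, and both the `sample_skewness` helper and the two small datasets are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Population-style skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


symmetric = [2, 4, 4, 5, 5, 5, 6, 6, 8]        # balanced around its mean of 5
right_skewed = [1, 1, 2, 2, 3, 3, 4, 20, 45]   # long right tail

print(f"symmetric:    {sample_skewness(symmetric):.2f}")
print(f"right-skewed: {sample_skewness(right_skewed):.2f}")
```

A skewness well above zero, as in the second dataset, suggests that a method assuming normality may not be appropriate without a transformation or a different technique. A histogram or a formal test would normally follow.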

False conclusions in statistics or data science can be prevented if we understand how our data behaves and how our models and statistical tests are designed. Conducting statistical tests and building models is easier now, thanks to statistical programming languages whose built-in functions implement them for us. However, it is still important to truly understand how these functions work and how our data behaves. If we do not, we might be processing garbage data or using inappropriate methodologies, and therefore producing garbage conclusions.

Written by Keith Monreal
