The many kinds of data professionals

Published on Apr 02, 2021

In 2012, the Harvard Business Review declared that Data Scientist is the sexiest job of the 21st century. This marked the beginning of the rise of the data professional in knowledge work. Modern technology generates vast amounts of data, and data professionals are the people tasked with making sense of it. Data professionals work under many titles, such as Data Scientist, Machine Learning Engineer, Statistician, Data Engineer, and Business Analyst. During a typical workday, each of these professionals gets to wear many different hats and employ many different skills in their goal of extracting information from data.

In this article, we give an overview of the different types of data professionals, focusing on their educational background, daily workload, skillset, and the software tools they use most frequently. We aim to demystify the kinds of work one is expected to do based on a job title. It is our hope that, if working with data is your passion, our article will be a helpful guide for choosing what to study in school and what tools to master.

Before we begin, it is important to note that we focus only on the professions we believe appear most commonly in job advertisements. More specifically, these are Researcher (Computer Science, Machine Learning, and Statistics), Data Scientist, Statistician, Data Engineer, Machine Learning Engineer, and Research Engineer.

There exist complex relationships between all the above professions including, in some cases, considerable overlap in required skills and tools used, for example between a Data Scientist and a Statistician. However, we believe that each professional has a set of skills they excel at and that this set of skills is the differentiating factor.

Moving on, we elaborate on what each data professional does on a given workday and the kinds of tools she uses.

Researcher

Researcher is probably the easiest of the professions to define. This category includes academics working at universities and researchers in industry labs, e.g., Google Brain, Facebook AI, and Microsoft Research. Researchers are certified experts. They focus on keeping up with the state of the art, engaging in peer review, and publishing new methods, algorithms, and theories. Researchers are easy to evaluate using objective criteria such as number of publications, citations, and h-index.
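For readers unfamiliar with the h-index mentioned above: it is the largest number h such that a researcher has at least h papers with at least h citations each. A minimal Python sketch follows; the function name and the citation counts are made up for illustration.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one researcher's five papers.
papers = [10, 8, 5, 4, 3]
print(h_index(papers))
```

With these illustrative numbers, four papers have at least four citations each, but there are not five papers with at least five, so the h-index is 4.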

Once upon a time, most researchers made their home at universities, and many still do, but in the Age of Data (the period from about 2010 onwards) a large percentage now work in industry. If you wish to join this group, you will first need to earn a PhD, preferably from a top-tier university.

Statistician

Statisticians are the original Data Scientists. They are experts at building models, calculating confidence intervals, and interpreting p-values. Statisticians worship Galton, Pearson, and Fisher. They love debating the pros and cons of the frequentist and Bayesian views of statistics.

Statisticians should have been the winning professionals of the Age of Data, but instead they were replaced by Data Scientists (for more on them, see the next section), who focused on data rather than models, were better skilled at handling big data, and favoured predictive accuracy over model interpretability. In 2001, Leo Breiman was right to point out the two cultures of statistics: one that focused on data models and one that focused on algorithms. In a world that was soon after buried in data, statisticians who kept a steady course on data models and ignored algorithms to some degree lost out on the gold rush that ensued.

Today, statisticians are still valuable and certainly not confined to academia. They work in industries such as health care, where experimental design is of the utmost importance. They are experts at working with small data, where algorithmic approaches such as training massive neural networks are not a viable option. They are at home using R, a programming language designed by statisticians for statisticians, and don't find much need for tools such as Power BI and Tableau. Software engineering and deploying models on the cloud are certainly not in their skillset.

Data Scientist

Data Scientist was once a catch-all job title for all data professionals. It came into existence to differentiate the new breed of knowledge workers from old-school Statisticians and Business Analysts. Data Scientists employ statistics and algorithms to analyse data and build predictive models. They differentiated themselves from Business Analysts by being domain generalists and using lower-level data analysis tools such as R and Python instead of spreadsheets such as Excel.

Over the years, the Data Scientist evolved to use an ever-increasing set of data analysis tools, e.g., Spark and Snowflake, but her main focus has shifted to two major tasks: (a) exploratory data analysis with prototype model building, and (b) storytelling.

A Data Scientist uses tools such as SQL to interrogate databases for data and R/Python for basic data cleaning, pre-processing, and model building. This requires knowledge of relational databases and of R/Python libraries such as scikit-learn, NumPy, pandas, Dask, tidyverse, and dplyr. Data Scientists don't specialise in software engineering, and they perform data analysis using Jupyter Notebooks or small scripts in RStudio. For storytelling, Power BI and Tableau are certainly within their technical expertise, but so is PowerPoint.

Data Scientists do not work on productising models and are often not very competent software engineers. Agile development practices are incompatible with their workflow as data analysis tasks do not fit neatly within sprint intervals. They are evaluated on their ability to discover interesting insights in raw data and tell good stories, especially ones that improve their employers' bottom lines.

Data Engineer

In the early days of the Age of Data, the Data Scientist was the Swiss Army knife of data professionals. Neither employers nor employees had a good understanding of the skills required to make sense of data. It was a time of much promise and also exploration. As new tools for storing and accessing vast amounts of (often) real-time data became available, the time and effort required to pull and prepare data for modelling came to dominate a Data Scientist's workday. Employers recognised the need for experts at data wrangling, and so the Data Engineer role emerged.

Data Engineers work on the pipelines that allow data to move in and out of data lakes. They also work on getting data cleaned and ready for model building. Data Engineers are IT professionals who have excellent software engineering skills and any necessary DevOps skills.

Machine Learning Engineer

As it happens, Data Scientists often lack good software engineering and general computer science and IT skills. Early on, Data Scientists were either re-branded Statisticians or other STEM-trained professionals, e.g., physicists, who realised that their strong math skills transferred easily to the new and lucrative profession of data analysis. Later, universities developed Data Science degrees that covered several subjects, including statistics, machine learning, and software engineering. However, these degree programs were short, commonly just one year long, which meant students were trained on a broad set of subjects but with little depth in any of them. Software engineering was one of the subjects glossed over.

In due time, as we explained above, Data Scientists focused on modelling and storytelling, which meant that there was a need for professionals to fill the gaps of productisation and maintenance of machine learning models. The Machine Learning Engineer role exists due to this need.

Machine Learning Engineers have strong software engineering skills. They understand machine learning basics and know how to implement algorithms using low-level frameworks such as TensorFlow and PyTorch. On some occasions they are also asked to fill the DevOps role; most recently, the industry has begun calling this role MLOps and, to some degree, decoupling it from a Machine Learning Engineer's workload.

Machine Learning Engineers know how to produce reliable code that is easy to maintain and well tested. Agile is how they work, code reviews are their peer-review system, and CI/CD is their bread and butter. They speak the language of Docker and Kubernetes and know how to deploy, update, and scale machine learning models on the cloud, e.g., AWS, Azure, and the Google Cloud Platform.

This is the perfect role for a skilled software engineer who wishes to be part of the data gold rush. A Bachelor's degree in Computer Science is the minimum entry requirement, although being self-taught is often acceptable as well. A Master's degree in machine learning is often useful to have, especially if earned from a reputable university.

Machine Learning Research Engineer

The Machine Learning Research Engineer (or just Research Engineer for short) is the new generalist role of the modern data professions. It is the only role that crosses the boundaries of all the other roles.

A Research Engineer keeps up with the state of the art by reading the most recently published peer-reviewed works. She actively engages in research activities, pure and applied, developing new algorithms and submitting these works for peer review. She has above-average software engineering skills that she uses to implement algorithms based on their description in research papers. A deep knowledge of machine learning allows her to apply a large set of methods to solve real business problems. And even though she is rarely tasked with model deployment and maintenance, she knows how to employ DevOps tools such as Docker to build demos and share solutions with colleagues and the wider research community.

Excellent Research Engineers know how to explain complex concepts in plain language and tell stories from data, not unlike Data Scientists. They are the glue that binds together a strong data team because they can understand and speak the language of all other roles. Research Engineers cannot exist in isolation but are an integral part of a well-functioning team. They often hold advanced university degrees, e.g., a PhD, and have studied in combined degree programs. It is not rare to find that a Research Engineer has degrees in Computer Science and Statistics or Physics.

We have described the different roles of the data profession. You should now be able to choose a career path that best suits your interests and also select what subject to study in detail. And when the time comes to find a new job, you should now be able to decipher the job ads targeting data professionals.