Organizations investing in large data projects need a clear picture of the associated roles, such as Data Engineer, Data Scientist and Data Analyst, and of how they differ from each other. Even though these roles share some overlapping tasks, having one work as a long-term proxy for another can seriously hurt the efficiency of the data project. It therefore helps to understand what each role is designed for and where each performs best.
One of the simplest comparisons, used almost everywhere, is the Venn diagram below, which gives an idea of each role's area of expertise and how they overlap. A Data Engineer comes from a technical background, closer to software engineering plus business know-how. A Data Analyst, on the other hand, thinks from the business communication side, with knowledge of reporting tools but fewer programming skills. A Data Scientist combines math and statistical knowledge with a profound business understanding and complements them with programming and big data technologies.
Although we could compare all three roles, we will exclude the Data Analyst, primarily for the sake of a short read and secondarily because Data Analysts appear at the end of a data architecture and do not have the confusing overlap that the other two, i.e. Data Engineer and Data Scientist, have. Also, the goal of this blog is not an actual "vs" to see who could land the better punch, or to weigh their importance and decide who is the better sibling. The objective is rather to understand their core competences and to explain why having them perform each other's tasks could be a bad idea.
Misconception about the two roles
One of the common problems in interpreting the roles is the Venn diagram used before. Although it is correct in terms of overlapping areas, it does not mean that a Data Scientist can build pipelines like a Data Engineer, or that a Data Engineer can make statistically backed decisions like a Data Scientist. It just means that each can do some of the other's work, while the other person does it better. So what is the problem? Why not let the elder sibling do what he likes and the younger one focus on his own skills? Unfortunately, it does not work like that in an organization. Because of the frequent overlap and knowledge transfers, and the fact that they can (but should not often) perform each other's tasks, there are cases where doing so on a permanent basis damages overall efficiency.
To understand why, let's dive deeper into each of the roles and their respective skills.
Termed the "Sexiest Job of the 21st Century", the Data Scientist role is for a person who possesses statistical, programming and math knowledge. Their task is to create advanced analytics and machine learning models. They must have a good business understanding and the skills to communicate complex observations to business owners. In short, a Data Scientist is someone who has strengthened profound statistical knowledge with programming skills. However, they are usually not programmers on the level of software or data engineers; they use programming as a tool to understand massive amounts of data.
The most valuable asset for a Data Scientist is clean, easily accessible raw or curated data. The Data Engineer is the one who ensures that the data used by the Data Scientist is correct and readily accessible. The role demands developing, constructing, testing and maintaining the big data architecture. Data Engineers have a programming background and system-building skills, and can provide solutions to big data problems.
As mentioned before, both can perform each other's tasks, as there is always some overlap when the project demands it. A Data Engineer can perform analysis, but it is not their core expertise, and they will not be as quick and accurate as a Data Scientist. Data Engineers are smart people who can acquire statistical knowledge and work their way up to building machine learning models, but that is a gradual learning process. Until then, having a Data Scientist build the ML models is the faster and more viable option.
Data Scientists are also all brains and can build pipelines, but the main issue is that they have learnt programming and pipeline creation out of necessity, to complement big data analysis. They are used to a limited set of methods and hence might not choose the right tool for their use case. Of course they can gradually perfect those skills, but until that day arrives the risk and ROI might not justify the time and resources invested.
Suppose you used the wrong tool and resources for many pipeline runs. Let us consider Apache Spark being used for "not that big" data, which could easily have been handled by Snowflake. The result is a 15-minute run instead of a 1-minute one. Now imagine that wait time multiplied across 10 other pipelines. Well, when my test pipeline runs for more than 10 minutes, I usually go for a coffee. Doing this the wrong way for an additional 10 pipelines is going to be expensive… in terms of coffees purchased and, of course, also in terms of the computational resources used.
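To put a rough number on that, here is a minimal back-of-the-envelope sketch in Python. It uses the figures from the example above (15-minute versus 1-minute runs, 10 pipelines) and assumes, purely for illustration, that the pipelines run one after another and each runs once. The function name `extra_wait_minutes` is hypothetical, not part of any tool mentioned here.

```python
def extra_wait_minutes(slow_run_min: float, fast_run_min: float, pipelines: int) -> float:
    """Extra minutes spent waiting when every pipeline uses the slower tool.

    Assumes sequential runs and one run per pipeline (an illustrative simplification).
    """
    return (slow_run_min - fast_run_min) * pipelines


# Figures from the example: 15-minute vs 1-minute runs, 10 pipelines.
wasted = extra_wait_minutes(slow_run_min=15, fast_run_min=1, pipelines=10)
print(f"Extra wait per round of runs: {wasted} minutes")
```

Under these assumptions, a single round of runs costs well over two extra hours of waiting, before counting reruns during development or the compute bill.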
Both roles are equally important: a Data Scientist helps improve business decisions, and a Data Engineer ensures that those decisions are based on the most recent, high-quality data. Each role has its own competence, and by understanding the differences, companies can get the most out of their data projects. What is important is that the ratio of Data Engineers to Data Scientists should always be greater than one, with the exact number decided by the size and complexity of the project. Personally, I would be happy if, as part of the "lessons learnt", at least one manager stopped asking their Data Engineer to build an AI algorithm, or stopped asking the Data Scientist why the data quality is bad!