Over the last decade, most companies have made serious efforts towards digital projects. They started treating data as an asset and analysing the data they generated. In the beginning it seemed promising to make more accurate business decisions using additional or more recent data. Because the data being collected and managed was not yet very large, it never raised serious concerns, so everything went well… for a while. Eventually it became clear that many such data projects failed, mostly for similar reasons. In 2017, Gartner reported that 60% of big data projects failed to move past the preliminary stages. The CTO of IBM said that 87% of data projects never make it to production. What is the cause of such massive failure, and what led to this blog in the first place?
While it was the norm to hire Data Scientists to make sense of the data, far less thought was given to having someone who would manage the data and its infrastructure and ensure its quality and availability to the relevant teams. According to Gartner, the prime causes of the failed attempts were a lack of reliable data infrastructure, the inability to manage enormous volumes of data, and high complexity.
One may have the right experts to model the data, visualize it and produce reports. But if the data is not managed and curated systematically, ever-growing data volumes will eventually throw the process into chaos. If an analysis is incorrect, it does not always mean that the Data Scientists or BI Analysts did something wrong; it could instead be the result of incorrectly processed data.

What is missing then?

Consider a MotoGP analogy with Valentino Rossi as the rider. The rider and his skills are important for winning the race, but he also depends on a team of experts and, most importantly… a superbike. Let us compare the rider to the BI Analysts and Data Scientists of a company. Both may be the best in their field, one trying to win the race and the others trying to achieve a business goal, each with exceptional skills. But what good can a world champion do in a race without his motorbike? Businesses were expecting the rider to win the race on a bicycle instead of a motorbike. Too little importance was given to the motorbike itself, i.e. the Data Engineers and the need for Data Engineering.

To highlight the importance of Data Engineering, let us refer to the pyramid below. The Data Science Hierarchy of Needs was published by Monica Rogati in 2017 and indicates the amount of work needed and the complexity of each step as you move up. The general idea is that, to achieve ML or AI capabilities, the basics need to be strong, with a dedicated team ensuring that the data is collected, stored and explored in the best possible way. The analytical capabilities depend entirely on the performance and quality of this data and its architecture. Having the best driver for AI or ML will therefore not work out if you still make them operate a horse carriage.

What is Data Engineering?

Data Engineering, a lesser-known sibling of Data Science, originally consisted mainly of ETL, Business Intelligence and Data Warehousing skills. A decade ago, data volumes were not large enough to demand more, so Data Engineering would usually remain at PoC level and did not make it into production. Today, businesses are creating more and more data at an ever-increasing rate. This data needs to be collected, cleaned and updated regularly or in real time. In parallel, advances in database technologies and storage solutions have made it faster and cheaper to store and use data. Companies are therefore eager to get the most value from their data and are focusing seriously on data management and processes. However, managing the entire data landscape and its pipeline processes is a full-time responsibility and requires a dedicated Data Engineering team.

What does a Data Engineer do?

A Data Engineer is best introduced as a software engineer plus business intelligence engineer with big data capabilities. In a nutshell, a Data Engineer ensures that raw data becomes more useful to the organization. The exact roles and responsibilities of a Data Engineer vary across companies depending on their size. From a small company to a large organization, a Data Engineer could work in a Generalist, Pipeline-centric or Data-centric role. As a Data Engineer you may be involved in projects such as the following:
• Architecture design
• Building ETL pipelines
• Building Data Warehouse/Data Lake
• Machine learning algorithm deployment
• Managing data and metadata
• Tracking pipeline stability
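
To make two of the items above a little more concrete, building an ETL pipeline and keeping an eye on data quality, here is a minimal sketch in Python using only the standard library. The sample data, the `orders` table and the function names are made up for illustration; a real pipeline would read from actual source systems and load into a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (one row has no amount).
RAW_CSV = """order_id,amount,country
1,19.99,de
2,,fr
3,5.00,DE
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: reject incomplete rows, fix types, normalise country codes."""
    clean, rejected = [], []
    for row in rows:
        if not row["amount"]:
            rejected.append(row)  # keep bad rows aside instead of loading them
            continue
        clean.append((int(row["order_id"]), float(row["amount"]), row["country"].upper()))
    return clean, rejected


def load(conn, rows):
    """Load: write the cleaned rows into a (here: in-memory) database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run all three steps and report how many rows were loaded and rejected."""
    clean, rejected = transform(extract(text))
    load(conn, clean)
    return len(clean), len(rejected)


conn = sqlite3.connect(":memory:")
loaded, rejected = run_pipeline(RAW_CSV, conn)
print(f"loaded={loaded} rejected={rejected}")
```

The rejected count is the kind of simple signal a Data Engineer would monitor over time: if it suddenly rises, something upstream has changed before any analyst has drawn a wrong conclusion from the data.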

Should it even matter to your organization?

Yes, and the implications are directly apparent in your financial performance. There are also technical advantages to having a Data Engineering team, but here it is most helpful to understand the importance of Data Engineering in terms of money and time saved, through the ‘1-10-100’ rule.
The 1-10-100 rule comes from Quality Management but is equally relevant to Data Science projects. It is commonly stated as follows: preventing an error at the source costs about 1 unit, correcting it later costs about 10, and leaving it to cause a failure costs about 100. As we have seen, many projects failed for lack of a strong base, and the cost of such a failure is both the money invested and the time spent developing something that gave no return. As the Data Science Hierarchy of Needs suggests, a strong base helps you mitigate correction costs and the costs of failure. Data Engineering ensures that this base is strong and resilient.
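
As a rough illustration of the arithmetic behind the rule, the sketch below takes the commonly quoted ratios of 1, 10 and 100 cost units per faulty record for prevention, correction and failure. The record counts and the split between the stages are invented purely for the example; they are not measurements from any project.

```python
# Illustrative cost units per faulty record, following the commonly quoted
# 1-10-100 ratios. These are assumptions for the sketch, not measured figures.
COST_PREVENT = 1    # the error is caught when the record enters the pipeline
COST_CORRECT = 10   # the error is cleaned up later, further downstream
COST_FAILURE = 100  # the error is never fixed and leads to a failure


def total_cost(prevented, corrected, failed):
    """Total cost in units for faulty records handled at each stage."""
    return (
        prevented * COST_PREVENT
        + corrected * COST_CORRECT
        + failed * COST_FAILURE
    )


# Hypothetical scenario: 1,000 faulty records arrive either way.
without_checks = total_cost(prevented=0, corrected=200, failed=800)
with_checks = total_cost(prevented=900, corrected=90, failed=10)

print(without_checks)  # 82000
print(with_checks)     # 2800
```

In this made-up scenario the same 1,000 faulty records cost roughly thirty times less when most of them are caught at the point of entry, which is exactly the part of the pipeline a Data Engineering team is responsible for.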

Conclusion

Businesses will always need analyses based on the most recent data stored in their data warehouses, where the data is structured, curated and made easily accessible. The way these data warehouses are built will change as database technologies develop and more data comes in. That is why you need Data Engineering, and Data Engineers, to handle these changes.

Peter Schmäling

Tushar Poojary

Tushar Poojary is a Junior Solution Architect at HUBSTER.S