My previous blog post in this section focused on why a team needs a Data Engineer and how the role differs from other data science roles. This last part covers the roles and responsibilities of a Data Engineer and the skill set to look for.
With the explosion of data and database technologies, many tools are now available that offer similar capabilities. The range of these technologies is huge, which leaves a Data Engineer with far too many tools to choose from. Nobody can know all of them, and knowing one well is usually sufficient unless the business requires a specific tool. Even if the business requirement changes, the fundamentals are similar enough that shifting to another technology is quite possible. Expecting a Data Engineer to know them all is like asking for a unicorn: it sounds good to have, but it doesn’t exist.
The day-to-day roles, responsibilities and skill-set needs of a Data Engineer vary with the size of the company and the complexity of the project. It is therefore important to know how to categorize them and where they fit in the organization.
Depending on the company size, a Data Engineer could take one of three roles, whose scope narrows from general to specialized as the team size and database complexity grow.
Generalist: Typical for small teams, where a Data Engineer has to wear many hats. They look after the complete process, from the inception of the data, through managing and maintaining the pipelines, to the analysis itself. The role calls for a good communicator with sound business acumen.
Pipeline-centric: Data Engineers build pipelines for specific use cases, which are later used by Data Scientists or Data Analysts.
Database-centric: When the organizational data is large, the analytical databases must be maintained in addition to the pipelines. Because these databases are so complex, maintaining them becomes a full-time job in itself.
A Data Engineer develops, builds, tests & maintains the complete architecture of a data processing system. As a Data Engineer you will be responsible for the following.
At its core, Data Engineering comprises the architecture design, deployment and maintenance of a data platform. The design must take changing business requirements into account so that the system stays resilient when they change.
Building & maintaining ETL (Extract-Transform-Load) pipelines:
Fundamental to every data architecture, this process involves extracting data from various sources, transforming it and loading it into a data warehouse, where end users use it for analysis.
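To make the three steps concrete, here is a minimal sketch of an ETL run using only the Python standard library. The sample CSV, the orders table and the function names are all assumptions made up for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (made-up data for illustration).
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy customer names, cast types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn, text=RAW_CSV):
    """Run extract, transform and load, then return (row count, total amount)."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    print(run_pipeline(warehouse))
```

Keeping each step in its own function means a step can be tested or swapped on its own, for example when a new source system is added.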
Building & maintaining Data Warehouse/Lake:
In big organizations, building and maintaining a data warehouse is a full-time role. With many databases in existence, governance requires that someone be clearly responsible for them. These engineers take care of the schema, organize the metadata and define the ETL process.
Data can be stored in a warehouse or lake in either a structured or an unstructured way. It is accompanied by metadata (data about data), which helps with documentation and gives quick access to information about a database. A Data Engineer is responsible for managing the stored data and structuring it via database management systems (DBMS) while ensuring proper governance.
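As a small illustration of metadata in practice, the sketch below reads table and column information back from a database's own catalog. It uses SQLite because it ships with Python; the customers and orders tables are hypothetical, and other database systems expose similar information through their own catalog views.

```python
import sqlite3


def describe_tables(conn):
    """Return {table_name: [(column_name, declared_type), ...]} from SQLite's catalog."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = {}
    for (name,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        catalog[name] = [(col[1], col[2]) for col in columns]
    return catalog


def build_example_db():
    """Create a tiny in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
    return conn


if __name__ == "__main__":
    for table, columns in describe_tables(build_example_db()).items():
        print(table, columns)
```

A listing like this can feed a data dictionary, which is one simple way to keep documentation in step with the actual schema.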
Optimization & Scalability
In big data architectures it is common for a pipeline run to take hours, sometimes because it is not configured correctly. This can greatly affect the availability of data and significantly increase cost. A Data Engineer is expected to optimize the existing system while ensuring availability and scalability.
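One common optimization, shown here only as a sketch, is to load records in fixed-size batches so that memory use stays bounded however large the input grows. The events table, the batch size and the function names are assumptions for illustration, and SQLite again stands in for the real target system.

```python
import sqlite3
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items without materialising the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_in_batches(conn, rows, batch_size=1000):
    """Insert rows batch by batch, committing one transaction per batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, value REAL)")
    total = 0
    for batch in chunked(rows, batch_size):
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    source = ((i, i * 0.5) for i in range(2500))  # generator: rows are produced lazily
    print(load_in_batches(connection, source))
```

Because the source is consumed lazily, only one batch is held in memory at a time, and a failed batch rolls back without undoing the batches already committed.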
The skill requirements for Data Engineering are holistic and span many tools and technologies used in combination. If you search for a Data Engineer with complete knowledge of every available Data Engineering tool and technology, you would have a better chance of finding a unicorn. To keep expectations realistic for a Data Engineer of flesh and bone, the skills can be clustered into these 6 categories.
A Data Engineer’s daily tasks include maintaining databases, so they must have good knowledge of database management systems (DBMS), both SQL and NoSQL, and of query languages such as SQL.
Knowing at least one programming language such as Python, Scala or Java is a must for a Data Engineer. It also helps with statistical analysis and modelling. The choice of language depends on the tools and platforms in use, such as MapReduce, AWS, Azure, Apache Spark or Hadoop, but proficiency in at least one language is essential.
Real-time streaming data is another necessity in many organizations where the most recent data brings significant business value. An example of a real-time streaming use case is the price surge in ride sharing, which is driven by demand or weather conditions, or by flight arrival and departure times when you plan a trip to or from the airport.
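The surge-pricing example can be sketched as a sliding window over a stream of ride requests. This is a toy model and not how any real ride-sharing service computes prices: the window length, baseline demand and price cap are assumed values, and request timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SurgeEstimator:
    """Toy surge multiplier based on the number of requests in a recent time window."""

    def __init__(self, window_seconds=300, base_requests=10, max_multiplier=3.0):
        self.window_seconds = window_seconds  # how far back demand is counted
        self.base_requests = base_requests    # demand level priced at 1.0x
        self.max_multiplier = max_multiplier  # upper cap on the multiplier
        self.events = deque()                 # timestamps of recent requests

    def observe(self, timestamp):
        """Record one ride request and return the updated multiplier."""
        self.events.append(timestamp)
        self._evict(timestamp)
        return self.multiplier()

    def _evict(self, now):
        """Drop requests that have fallen out of the window."""
        while self.events and self.events[0] <= now - self.window_seconds:
            self.events.popleft()

    def multiplier(self):
        """Scale the price with demand, never below 1.0x or above the cap."""
        ratio = len(self.events) / self.base_requests
        return min(self.max_multiplier, max(1.0, ratio))


if __name__ == "__main__":
    estimator = SurgeEstimator(window_seconds=60, base_requests=2)
    for second in (0, 10, 20, 30, 100):
        print(second, estimator.observe(second))
```

The point of the sketch is that each event updates the result immediately, instead of waiting for a nightly batch job, which is what makes streaming valuable when the latest data matters most.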
Data warehousing lets you store huge amounts of data from various sources for analytics, which makes it one of the fundamentals. As a Data Engineer you must be proficient in at least one data warehousing tool or platform, such as Snowflake, Oracle, Azure or AWS. In addition, knowledge of operating systems is important when operations depend on a particular operating system.
When looking for the ideal Data Engineer for your team, a keyword match is not enough. An ideal candidate may have only one skill in each category, such as programming, databases and data warehousing, but has a balanced overall skill set, a good understanding of the business, and the ability to steer the project in the right direction.