How Does Maxime Beauchemin, the Influential Face of Data Engineering, See the Future of the Field?
The field of data engineering is developing rapidly, and demand for these IT specialists has made the job market red-hot.
It may surprise you to learn that the job title of data engineer barely existed ten years ago. Organizations' need to respond to a major trend in software engineering brought the role into being, and it has advanced rapidly ever since.
The responsibilities of a data engineer are not fixed; they depend on the company where the specialist works. Still, there is a common set of skills every data engineer must learn for daily work. Beyond those traditional skills, data engineers of the future are expected to develop two more: working with cloud technologies and SaaS products, and spending less time coding and more time monitoring. Let's explore these skills in more detail.
In the world of data engineering, Maxime Beauchemin is a well-known figure. One of the first data engineers at Facebook and Airbnb, he wrote the widely adopted Apache Airflow workflow orchestrator, open-sourced it, and shortly afterward developed Apache Superset, a tool that significantly changed the data exploration and analysis ecosystem. Today, Beauchemin is the CEO and co-founder of Preset, a startup focused on data visualization.
Beauchemin has been one of the most influential figures in data engineering over the last decade. In a 2017 blog post titled The Rise of the Data Engineer, he showed companies why data engineering had become one of the most important jobs in IT. Beauchemin believes that to scale data accurately and deliver precise analytics, data teams need expert data engineers to manage ETL, build data pipelines, and scale data infrastructure.
A data engineer is a member of a data team that focuses primarily on building and optimizing platforms for ingesting, storing, analyzing, visualizing, and effectively using data.
A question occupying the minds of many professionals in this field is where data engineering will be in the next five years and what its practitioners will be doing. How will decentralization unfold? What role will the cloud play? In this article, we examine some of Beauchemin's views and predictions on these questions.
The cloud will play an essential role in changing the tasks of data engineers
Beauchemin points out that not long ago, data engineers had to spend much of their time on work related to Hive, a scalable and fast data warehouse, and on managing the various components of the data pipeline. In other words, data engineering was a tedious, time-consuming process with little appeal. He says, "You had to spend a lot of time on the groundwork of a project, which caused burnout; sometimes you had to work 10 to 12 hours just to complete a basic task." In 2021, data engineers can do big things very quickly thanks to the computing power of BigQuery, Snowflake, Firebolt, Databricks, and other cloud data platforms. Cloud-based, SaaS, and NoSQL database technologies have greatly simplified the work, but that's not the whole story.
"The cloud has indeed simplified things significantly, but you have to be careful with your computing costs," Beauchemin says. "At the end of the month, you may find your balance quickly depleted, because there is no limit on how many resources you can consume, and you may pay for storage or processing power you never actually needed."
Since data engineers are no longer responsible for managing processing power and storage, their duties will shift from building infrastructure toward developing the data stack or filling more specialized roles.
We can already see this change in the emergence of a concept called "data reliability engineering," where the data engineer is responsible for managing, rather than building, the data infrastructure and for monitoring the performance of cloud-based systems.
Consensus on data governance will become more challenging to achieve
Until just a few years ago, data teams were highly centralized, with data engineers and tech-savvy analysts acting as the company's data librarians. Data governance did not mean much, and engineers collected data from various sources without much friction.
Beauchemin says: "Today, we face a concept called distributed governance, which interests companies. Each team has its own domain of analysis, team structures tend to be decentralized, and team members, such as data scientists, simply demand good data. The reality is that the data warehouse mirrors the organization in many ways. We accept that consensus is essential, but it will not necessarily make everything easier. If people don't agree on what to call things in the data warehouse or on how metrics are defined, that lack of consensus can cause problems."
Beauchemin points out that achieving consensus will not be easy, especially when data is pulled from organizational sources in different ways. This leads to redundancy and inconsistency unless teams agree on which data is private and which is shared across different parts of the organization.
Currently, data-driven teams are responsible for all of the company's data; more precisely, they own the data they collect and use. As data is shared by different groups and exposed at a larger scale, it must be prepared more carefully and its application programming interfaces (APIs) developed more rigorously.
Change management is still a problem, but the right tools can help
In 2017, when Beauchemin wrote his first data engineering post, he noted: "When the nature of data changes, it has a dramatic impact on company performance. A lack of forward-looking management will create technical and cultural gaps in such a situation."
When source code or datasets are changed or updated, we see failures in downstream layers such as dashboards, reports, and other data products. If those downstream problems are not fixed, any analysis built on them is effectively invalid. This kind of data corruption is costly for organizations, and much time must be spent resolving it.
Often, breakages arrive without any apparent warning. In such situations, data engineering teams scramble to understand what went wrong, who is affected, and how to fix it. Today, data engineering teams increasingly rely on DataOps and software engineering best practices to build a more robust toolset and an organizational culture that emphasizes two critical metrics: effective communication and data reliability.
"Data observability helps data engineering teams identify and fix problems, and even gain insight into how failures affect people," Beauchemin says. "However, change management is as much cultural as it is technical. Managing change means that team members must closely monitor processes, the central data platform, and workflows."
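As a minimal illustration of the kind of monitoring such teams build, here is a hedged sketch of a simple volume check: it compares the latest day's row count against the recent average and flags large deviations. The table name, threshold, and the use of sqlite3 as a stand-in for a real warehouse are all assumptions made for the example, not anything from the source.

```python
import sqlite3
from statistics import mean

# sqlite3 stands in for a real warehouse here; in production this
# query would run against Snowflake, BigQuery, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_row_counts (day TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO daily_row_counts VALUES (?, ?)",
    [("2021-06-01", 1000), ("2021-06-02", 1040), ("2021-06-03", 120)],
)

rows = conn.execute("SELECT n FROM daily_row_counts ORDER BY day").fetchall()
history, today = [r[0] for r in rows[:-1]], rows[-1][0]

# Flag the load if today's volume deviates more than 50% from the
# trailing average -- a crude but common first observability check.
baseline = mean(history)
if abs(today - baseline) / baseline > 0.5:
    print(f"ALERT: row count {today} deviates from baseline {baseline:.0f}")
```

A check this simple would have caught the silent breakage described above before a stakeholder noticed a broken dashboard.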
If there's no distinction between private and public data, it's hard to know who is using which data and, when the data goes wrong, what caused it. Analyzing the nature of data and respecting the principles of data governance will be one of the success factors for data-driven projects in the future.
While at Airbnb, Beauchemin set out to design the Dataportal to systematize access to data and empower all Airbnb employees to explore, understand, and trust it. While such tools show which employees or parts of an organization are affected by changes in data, they do little to actively manage the data itself.
Data must be immutable. Otherwise, things get out of hand
Tools that perform operations on data borrow their design from software engineering patterns, which is a strength. However, certain constraints complicate working with ETL pipelines.
"If I want to change the name of a column in the database, it's relatively difficult, because we have to re-run our ETL and edit our SQL," Beauchemin says. When data pipelines and data structures change, they affect system performance. In general, such changes are hard to implement and sometimes cause unexpected breakages. For example, suppose you have an incremental process that periodically loads data into a huge table, and you want to deprecate some of that data: you have to stop the data pipeline, deploy the infrastructure a second time while the new columns are created, ship the new business logic, and retire the old one.
Data engineering tools don't help much here, especially as data volumes and workflows grow. The most effective approach is to treat data assets as immutable and avoid changing them; when changes are unavoidable, everything should be documented. The staged migration sketched below shows why even a simple rename is painful.
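Here is a hedged, minimal sketch of the "expand, backfill, contract" approach to the column rename described above, with sqlite3 standing in for a warehouse; the table and column names are invented for the example.

```python
import sqlite3

# sqlite3 stands in for a real warehouse; the table and column names
# are invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, ts TEXT)")
conn.execute("INSERT INTO events VALUES ('alice', '2021-01-01')")

# 1. Expand: add the new column alongside the old one so nothing breaks.
conn.execute("ALTER TABLE events ADD COLUMN event_ts TEXT")

# 2. Backfill: copy historical values into the new column (in a real
#    incremental pipeline this can mean re-running ETL over old partitions).
conn.execute("UPDATE events SET event_ts = ts")

# 3. Switch: deploy new business logic that reads and writes event_ts.
# 4. Contract: only after no reader depends on ts can it be retired --
#    the step that usually forces pipeline downtime and documentation.
print(conn.execute("SELECT user, event_ts FROM events").fetchall())
```

Each stage is safe on its own, which is exactly why the process takes several deployments instead of one.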
Data engineers will make extensive use of cloud technologies and SaaS products
Ten years ago, companies depended on on-premises infrastructure to store their data, and data engineers spent much of their time setting up and configuring machines. That is why the first big data technologies emerged as tools built for on-premises environments.
Then cloud service providers entered the field, promising services that would simplify data management so that data engineers could devote more time to solving business problems.
Cloud service providers and technology companies like Snowflake and Databricks have simplified the process of working with big data. Today's technologies let professionals exercise closer control over data quality, governance, and acquisition, and streamline cross-product integration.
Gone are the days when data engineers used a single Apache Foundation tool to complete their work. Today, they have access to countless tools and are always looking to choose the best one for the task. For this reason, they must know the data engineering ecosystem well and be able to identify the criteria that matter when selecting a tool.
Choosing the right tool for the job is not easy, and integrating those tools into a stable data platform is another challenge data engineers face. Some data engineers already use infrastructure as code to describe and automate infrastructure deployment, and this looks set to become a mandatory skill for data engineers before long. A toy sketch of the idea follows.
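The source names no specific tool, so rather than guess at a real provider's API, here is a hedged, self-contained toy sketch of the core infrastructure-as-code idea: declare the desired state, diff it against the current state, and apply only the difference. Real tools such as Terraform or Pulumi perform this loop against actual cloud APIs; the resource names below are invented.

```python
# Toy infrastructure-as-code engine. Desired state is declared as data,
# not achieved by hand-run commands.
desired = {
    "bucket:raw-events": {"region": "us-east-1"},
    "warehouse:analytics": {"size": "small"},
}
current = {
    "bucket:raw-events": {"region": "us-east-1"},
}

def plan(desired: dict, current: dict) -> list[tuple[str, str]]:
    """Compute create/update/delete actions, in the spirit of `terraform plan`."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions

for action, resource in plan(desired, current):
    print(f"{action}: {resource}")  # a real tool would call the cloud API here
# Output: create: warehouse:analytics
```

Because the whole environment is described in code, it can be reviewed, versioned, and redeployed exactly, which is what makes the skill valuable to data teams.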
Data engineers will spend less time coding and more time monitoring
Before long, data engineers won't have to use specialized tools and languages like Spark and Scala to design and develop complex ETL pipelines.
For extraction, they will have access to technologies such as Airbyte that let them configure, rather than hand-code, the process of pulling data from various sources. Loading data has also become easier than before: Snowflake, for example, simplifies loading a file from blob storage into a table, so a data professional can do it with a single SQL command.
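As a hedged sketch of that single-command load, here is what it can look like with Snowflake's Python connector; the account, stage, and table names are placeholders, and the file format options depend on your data.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Connection parameters are placeholders -- fill in your own account.
conn = snowflake.connector.connect(
    user="YOUR_USER", password="...", account="YOUR_ACCOUNT",
    warehouse="COMPUTE_WH", database="ANALYTICS", schema="RAW",
)

# One SQL statement loads staged blob-storage files into a table.
conn.cursor().execute(
    "COPY INTO raw_events FROM @events_stage "
    "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
)
```

The point is not the specific vendor but the shape of the work: a declarative command replaces what used to be custom ingestion code.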
In the transformation phase, dbt offers a new paradigm: data engineers keep their data in the warehouse and use SQL as the primary language of transformation. More precisely, the data transformation process is shifting from ETL to ELT.
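To make the ETL-to-ELT shift concrete, here is a minimal sketch under assumed names, with sqlite3 standing in for the warehouse: the raw data lands first, untransformed, and the cleanup happens afterward as plain SQL inside the warehouse, the pattern that tools like dbt organize and version at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# E + L: land the raw data in the warehouse first, untransformed.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, "1050"), (2, "230"), (3, None)],
)

# T: transform *inside* the warehouse with SQL -- the ELT step that
# dbt-style tools express as version-controlled models.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(amount_cents AS INTEGER) / 100.0 AS amount_usd
    FROM raw_orders
    WHERE amount_cents IS NOT NULL
""")
print(conn.execute("SELECT * FROM orders").fetchall())
```

Because the transformation is just SQL against data already in the warehouse, an analyst can own it without writing pipeline code.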
In such a situation, deploying a workflow becomes simpler than it is today, and the modern data stack helps here. "Data stack" refers to a set of technologies that aim to reduce the complexity of data workflows and speed up everyday tasks. Modern data stacks let data analysts work independently, without needing data engineers to collect and transform raw data for them. Does this mean data engineers will no longer have a place on data teams? No. The role will lean toward operations, and the next generation of data engineers will focus on improving data reliability.
In the future, a data engineer is expected to have the following responsibilities:
- Monitoring the execution of data workflows and configuring alerts for unforeseen events (see the sketch after this list)
- Preparing the infrastructure on which data will be used
- Building data pipelines with CI/CD to verify code correctness and automate deployment
- Ensuring data quality at all times
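As a hedged sketch of the first responsibility, here is what monitoring plus alerting can look like in Apache Airflow, the tool Beauchemin created; the DAG, task, and notification stub are invented for the example.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def alert_on_failure(context):
    # Invented notification stub -- in practice this would page an
    # on-call channel (Slack, PagerDuty, email, ...).
    ti = context["task_instance"]
    print(f"ALERT: task {ti.task_id} in DAG {ti.dag_id} failed")

def load_daily_events():
    print("loading events...")  # placeholder for the real ingestion logic

with DAG(
    dag_id="daily_events_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": alert_on_failure,  # alert on unforeseen events
    },
) as dag:
    PythonOperator(task_id="load_daily_events", python_callable=load_daily_events)
```

Note where the engineering effort sits: not in the loading logic itself, but in retries, scheduling, and the failure callback, exactly the operational focus described above.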
Similar to what happened in software development a few years ago with the rise of site reliability engineers (SREs), we may see the same trend in the data world. To be more specific, we will see a new job title called "data reliability engineer," responsible for ensuring that data is available and reliable.
In this role, data engineers will be primarily responsible for defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for data, and they will play an essential role in incident response. The pace of change suggests the data engineer's job title will undergo fundamental changes in the future.
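As a hedged sketch of what a data SLI/SLO pair might look like in practice, here is a minimal freshness check; the six-hour objective and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# SLO: the table's data must be under 6 hours old (invented objective).
FRESHNESS_SLO = timedelta(hours=6)

def freshness_sli(last_loaded_at: datetime) -> timedelta:
    """SLI: elapsed time since the table last received data."""
    return datetime.now(timezone.utc) - last_loaded_at

# Pretend the last successful load finished 8 hours ago.
last_loaded_at = datetime.now(timezone.utc) - timedelta(hours=8)
sli = freshness_sli(last_loaded_at)

if sli > FRESHNESS_SLO:
    # SLO breached: this is where incident response begins.
    print(f"SLO breach: table is {sli} stale (objective {FRESHNESS_SLO})")
```

Defining the indicator and the objective explicitly is what turns "the data feels late" into an incident that can be measured, paged on, and resolved.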
The next generation of data engineers will not work on a single data product; they will help data-driven teams be more productive by providing the right set of tools. This is what we will come to know as the "data mesh" paradigm.
So, in the future, when you need dashboards for financial reports, you won't need a team of product owners, data analysts, and data engineers. A data analyst, working independently with the tools the team has prepared, will quickly extract the necessary data and compute the key metrics from that raw data.