"A data pipeline is an artifact of a data engineering process. It transforms raw data into data ready for analytics. These in turn help solve problems, aid support decisions, and make our lives more convenient. In some ways, it can be thought of as the stitch between the OLTP and OLAP systems. Data pipelines are sometimes referred to as ETL, which stands for extract, transform, load, and it has a variation called extract, load, transform (ELT). The main difference between the two is whether the incoming data is first saved to disk and then transformed (data wrangling) or vice versa. The processing is loosely referred to as ETL. Although, it is fair to say ELT is relevant in the context of Data Lakes and unstructured data, whereas ETL is used for Data Warehouses." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
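As an illustration of the ETL versus ELT ordering described in the quote above, here is a minimal sketch in plain Python. It is not taken from the book: the `clean` transform, the record fields, and the in-memory lists standing in for a warehouse and a lake are all hypothetical, chosen only to show where the transform step sits relative to the load step.

```python
from typing import Callable

Record = dict[str, str]


def clean(record: Record) -> Record:
    """Hypothetical transform: trim whitespace and lowercase the email."""
    return {
        "name": record["name"].strip(),
        "email": record["email"].strip().lower(),
    }


def etl(source: list[Record], transform: Callable[[Record], Record]) -> list[Record]:
    """ETL: transform each record in flight, then load only the cleaned rows."""
    warehouse: list[Record] = []  # stand-in for a warehouse table
    for record in source:  # extract
        warehouse.append(transform(record))  # transform, then load
    return warehouse


def elt(
    source: list[Record], transform: Callable[[Record], Record]
) -> tuple[list[Record], list[Record]]:
    """ELT: land the raw rows first, then transform them inside the store."""
    lake_raw = [dict(record) for record in source]  # extract and load as-is
    lake_curated = [transform(record) for record in lake_raw]  # transform later
    return lake_raw, lake_curated
```

In this sketch both paths end with the same curated rows; the difference is that the ELT path also keeps the untouched raw copy, which can be re-transformed later if the business rules change.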
"A data silo is an isolated source of data that is only accessible to a single line of business (LOB) or department. It leads to inefficiencies, wasted resources, and obstacles in the form of incomplete data profiles and the inability to construct deep insights. [...] On the other hand, a data swamp is a large body of data that is ungoverned and unreliable. It is hard to find data and even harder to use it, which is why it's often used out of context. This is the opposite of data silos in the sense that the data is there and has been brought together, but because it has been done without adequate process and policy, it is as good as not being there. That would be a wasted investment." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"A model that has made it into production is a wonderful achievement! However, the journey does not stop there. There is a whole separate pipeline around model management. Over time, the model becomes stale and needs to be retrained. Yet another separate pipeline to monitor drift is needed. Model drift is often on account of data drift and is a signal to trigger a retraining process. This is where the champion model in production is compared against a new challenger version to see whether it is time to be replaced or not. Over time, it is important to be able to query what version exists in production, so that there is no confusion about which is the active one, which is the challenger, and which one needs to be promoted or rolled back. Many people have no idea what version is in production! This is where a central model registry that serves as the single source of truth for the models and their stages and versions is imperative." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
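To make the champion and challenger idea in the quote above concrete, here is a small in-memory sketch of a model registry. It is an assumption-laden toy, not the book's implementation and not the API of any real registry product: the `ModelRegistry` and `ModelVersion` classes, the stage names, and the single "higher is better" metric are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelVersion:
    version: int
    metric: float  # hypothetical quality score, higher is better
    stage: str = "challenger"


@dataclass
class ModelRegistry:
    """Toy single source of truth for model versions and their stages."""

    versions: dict[int, ModelVersion] = field(default_factory=dict)

    def register(self, metric: float) -> ModelVersion:
        """Add a new version; every new version starts as a challenger."""
        model = ModelVersion(version=len(self.versions) + 1, metric=metric)
        self.versions[model.version] = model
        return model

    def champion(self) -> Optional[ModelVersion]:
        """Answer the question: which version is in production right now?"""
        for model in self.versions.values():
            if model.stage == "champion":
                return model
        return None

    def promote_if_better(self, version: int) -> bool:
        """Promote the challenger only if it beats the current champion."""
        challenger = self.versions[version]
        current = self.champion()
        if current is not None and challenger.metric <= current.metric:
            return False  # champion stays in place
        if current is not None:
            current.stage = "archived"  # kept so a rollback remains possible
        challenger.stage = "champion"
        return True
```

Because every stage change goes through one object, a query such as `registry.champion()` always has a single answer, which is the property the quote asks of a central registry.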
"Data-driven organizations exhibit a culture of analytics. This cannot be confined to just a few premiere groups but rather to the entire organization. There are both cultural and technical challenges to overcome and this is where people, processes, and tools need to come together to bring around sustainable changes. Every business needs a strategy for business transformation." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Data engineering is the process of converting raw data into analytics-ready data that is more accessible, usable, and consumable than its raw format. Modern companies are increasingly becoming data-driven, which means they use data to make business decisions to give them better insights into their customers and business operations. They can use these to improve profitability, reduce costs, and give them a competitive edge in the market. Behind the scenes, a series of tasks and processes are performed by a host of data personas who build reliable pipelines to source, transform, and analyze data so that it is a repeatable and mostly automated process." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Data governance refers to aligning all aspects of data strategy, business strategy, and compliance requirements. A three-pronged approach of people, policy, and process will provide oversight for all data operations from the time data touches a system to the point it leaves. Roles and responsibilities dictate who has access to what data, something that needs to be enforced and monitored. Data lineage is tracked to provide accountability for how data has been transformed at various steps. Delta's history functionality provides a good audit trail. A central catalog builds on top of it and provides a central place for defining the rules, enforcing them, and monitoring compliance via audit logs. Some of these catalogs have to be built and stitched together unless a managed platform that has taken care of these aspects is leveraged." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Data lakes have been in existence for a while now, so their need is no longer questioned. What is more relevant is the specifics of the solution's implementation. Consolidating all the siloed data by itself does not constitute a data lake. However, it is a starting point. Layering in governance makes the data consumable and is a step toward a curated data lake. Big data systems provide scale out of the box but force us to make some accommodations for data quality. Age-old aspects of transactional integrity were compromised on a distributed system because it was very hard to maintain ACID compliance. Due to this, BASE properties were favored. All of this was moving the needle in the wrong direction and from pristine data lakes we were moving toward data swamps, where the data could not be trusted and hence insights that were generated on the data could not be trusted either." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Lakehouse is a new architecture and data storage paradigm that combines the characteristics of both data warehouses and data lakes to create a unified basis for all types of use cases to be built on top of it. There is no need to move data around. Data is curated and remains in an open format and serves as the single source of truth (SSOT) for all the consumption layers. A modern data platform has needs that span traditional data warehouses, data lakes, machine learning systems, and streaming systems and there is some overlap among these systems. A Lakehouse offers features that span all four systems [...]" (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Many argue that model drift is best monitored by monitoring the data drift in incoming data and the drift in the generated features. As and when the ground truth is available, it is joined by some primary key criteria with the inference data in a Delta table. Again, the update and merge operation support in Delta makes this a breeze. Now the actual and predicted values of the inference data are computed to see how well the model is doing in terms of the quality of insight generation. The feature engineering pipeline is completely in-house and is easier to monitor for drift. The model interpretability may indicate that some columns contributing to the predictive power are incorrect, and it may be necessary to add or remove features. In such cases, a threshold of tolerance is violated, which signals a need for model retraining." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
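The monitoring loop in the quote above (join ground truth to the inference data on a key, compare actual and predicted values, check a tolerance threshold) can be sketched in plain Python. This is only an illustration under stated assumptions: dictionaries keyed by a hypothetical record id stand in for the Delta tables, mean absolute error is one arbitrary choice of quality measure, and the tolerance value is made up.

```python
def join_on_key(
    predictions: dict[str, float], actuals: dict[str, float]
) -> list[tuple[float, float]]:
    """Pair each prediction with its ground truth where the key exists in both."""
    return [(predictions[key], actuals[key]) for key in predictions if key in actuals]


def mean_absolute_error(pairs: list[tuple[float, float]]) -> float:
    """Average absolute gap between predicted and actual values."""
    if not pairs:
        raise ValueError("no ground truth has arrived for any prediction yet")
    return sum(abs(predicted - actual) for predicted, actual in pairs) / len(pairs)


def needs_retraining(
    predictions: dict[str, float], actuals: dict[str, float], tolerance: float
) -> bool:
    """Signal retraining when the error exceeds the agreed tolerance."""
    return mean_absolute_error(join_on_key(predictions, actuals)) > tolerance
```

Predictions whose ground truth has not arrived yet are simply left out of the join, which mirrors the "as and when the ground truth is available" wording in the quote.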
"Metadata is critical in driving business value. It does this by facilitating innovation and collaboration among data teams, which indirectly helps mitigate risks such as misinterpretation and misrepresentation of data. Not only does it help ML practitioners discover the right datasets to use for their modeling exercises, but it also enables citizen data scientists to access the most valuable datasets, thereby ensuring the generation of timely and accurate insights." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Simply put, 'lakehouse' refers to an open data architecture that combines the best of data lakes and data warehouses on a single platform. At this point, it would be fair to say that a lakehouse is closer to a data lake than a data warehouse. In fact, it is an extension of your data lake to support all use cases, from BI to AI. All data science and ML personas who were shunted into downstream applications because the tools of their trade were so vastly different and can now share the same stage and have access to the same data as other data personas. This eliminates the need to stitch fragile systems together and leads to better data quality and end-to-end latencies since there is no need to copy data across disparate architectures." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Since data engineering is such a crucial field, you may be wondering who the main players are and what skill sets they possess. Building a data product involves several folks, all of whom need to come together with seamless handoffs to ensure a successful end product or service is created. It would be a mistake to create silos and increase both the number and complexity of integration points as each additional integration is a potential failure point." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"The main challenges include relentlessly chasing data issues that include schema and quality changes (data drift). Sometimes, fixing these issues can cause outages and delays to existing jobs. This is tied tightly to the underlying infrastructure, process, and technology and can be vulnerable to any changes there. For example, a temporary glitch in the cloud ecosystem will result in a failure of the data pipeline." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Traditional data lakes provide the necessary scalability, but not the real-time concurrency and latency needed for BI use cases. Delta comes to the rescue once again by providing performance at scale with a host of optimization techniques, such as caching, data compaction, and indexing. Previously, a subset of the curated data would be pushed to a warehouse to satisfy the latency and concurrency requirements of known queries. What this meant was that if a consumer needed a different access pattern or a slightly older dataset that was not available, they would have to request that their IT or data team get involved. This took data democratization a step backward. Ideally, we should allow people to access any data that they have privileges to. Delta Lake goes a step forward and allows BI tools to access data directly from the lake instead of accessing a sliver of the data in their expensive warehouses." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"Understanding modern data architectures and sound data engineering principles and practices are crucial to ensure that your AI and BI strategies are reliable and defensible. Generated insights are going to be as good as the quality of the underlying data, so the upfront effort put into understanding the data, modeling it, and transforming it per the business needs goes a long way to foster innovation, productivity, and agility in your data teams." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
"We are at the interesting conjunction of big data, the cloud, and artificial intelligence (AI), all of which are fueling tremendous innovation in every conceivable industry vertical and generating data exponentially. Data engineering is increasingly important as data drives business use cases in every industry vertical. You may argue that data scientists and machine learning practitioners are the unicorns of the industry, and they can work their magic for business. That is certainly a stretch of the imagination. Simple algorithms and a lot of good reliable data produce better insights than complicated algorithms with inadequate data." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)
