Showing posts sorted by date for query Data science. Sort by relevance Show all posts

17 September 2024

Software Engineering: Mea Culpa (Part V: All-Knowing Developers are Back in Demand?)

Software Engineering Series

I’ve been reading many job descriptions lately related to my experience and curiously or not I observed that many organizations look for developers with Microsoft Dynamics experience in the CRM, respectively Finance and Operations (F&O) and Business Central (BC) areas. It’s a good sign that the adoption of Microsoft solutions for CRM and ERP increases, especially when one considers the progress made in the BI and AI areas with the introduction of Microsoft Fabric, which gives Microsoft a considerable boost. Conversely, it seems that the "developers are good for everything" syntagma is back, at least from what one reads in job descriptions. 

Of course, it’s useful to have an in-house developer who can address all the aspects of an implementation, though that’s a lot to ask considering the different non-programming areas that need to be addressed. It’s true that a developer with experience can handle Requirements, Data and Process Management, respectively Data Migrations and Business Intelligence topics, though one should consider that each of these topics can easily become a full-time job before, during and after a project’s implementation. I’ve been there and I (hopefully) know what the jobs imply. Even if an experienced programmer can easily handle the different aspects, there will also be times when all the topics combined will be too much for one person!

It's not a novelty that job descriptions are treated like Christmas lists, but it’s difficult to differentiate between essential and nonessential skills. I read many job descriptions lately in which, among a huge list of demands, one of the requirements is to program in the F&O framework, a sign that D365 programmers are in high demand. I worked for many years as a programmer and Software Engineer, respectively in the BI area, where SQL and non-SQL code is needed. Even if I can understand the code in F&O, does it make sense to learn now to program in X++ and the whole framework? 

It's never too late to learn new tricks, respectively another programming language and/or framework. It even helps to provide better solutions in other areas, though frankly I would invest my time in other areas, and AI-related topics like AI prompting or Data Science seem to be more interesting in the long term, especially when they are already in demand!

There seems to be a tendency for Data Science professionals to do everything themselves, building their own solutions and ignoring the experience accumulated, respectively the data models built, in the BI and Data Analytics areas, as if the topics and data models were unrelated! It’s also true that AI modeling comes with its own requirements concerning data modeling (e.g. translating non-numeric to numeric values), though I believe that common ground can be found!

Similarly, notebook-based programming seems to replicate logic in each solution, which occasionally makes sense, though personally I wouldn’t recommend it as a practice! The other day, I was looking at code developed in Python to mimic the joining of tables, when a view with the same logic could be more easily (re)used, maintained and read, and would probably be more efficient, even if different engines are used. It will be interesting to see how the mix of spaghetti solutions will evolve over time. There are developers already complaining about the number of objects that result from building logic for each layer of the medallion architecture! Even if it makes sense from an architectural point of view, it will become a nightmare in time.
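To make the point about views more concrete, here is a minimal sketch that uses Python's built-in sqlite3 module. The tables, columns and the view name (customers, orders, v_customer_orders) are hypothetical and chosen only for illustration; the code I was looking at ran on a different engine. The join is defined once as a view, and the Python side only consumes it.

```python
import sqlite3

# Hypothetical tables and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Contoso'), (2, 'Fabrikam');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);

    -- The join logic lives in one named object that can be versioned and documented.
    CREATE VIEW v_customer_orders AS
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id;
""")


def total_per_customer(connection):
    """Aggregate on top of the view instead of re-implementing the join."""
    rows = connection.execute(
        "SELECT name, SUM(amount) FROM v_customer_orders GROUP BY name ORDER BY name"
    ).fetchall()
    return dict(rows)


totals = total_per_customer(conn)
print(totals)
```

Any other notebook or report that needs the same join can query the view, so a change in the join condition has to be made in one place only.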

One can also wonder about the nomenclature used, like Data Engineering or Prompt Engineering for the simple manipulation of data between structures in data transformations, respectively for structuring the prompts for AI. I believe that engineering involves more than this, no matter the context! 


09 April 2024

Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part IV: Making It in the Statistics)

Business Intelligence
Business Intelligence Series

Various sources (e.g., [1], [2], [3]) put the failure rates for data projects somewhere between 70% and 85%, rates which are a bit higher than the failure rates of standard projects, estimated at 60-75%. This means that only 2-3 out of 10 projects will succeed and that’s another reason to plan for failure, respectively embrace the failure.

Unfortunately, the statistics advanced on project failure have no solid foundation and should be regarded with circumspection as long as the methodology and information about the population used for the estimates aren’t shared, though they do reflect an important point: many data projects do fail! It would be foolish to think that your project will not fail just because you’re a big company, and you have the best resources, and you have a proven rate of success, and you took all the precautions for the project not to fail.

Usually at the end of a project the team meets to document the lessons learned in the hope that the next projects will benefit from them. The team did learn something, though as practice shows, even if the team manages to avoid some issues, other issues will impact the next similar project, leading to similar variances. One can summarize this as "on average, the impact of new issues and avoided known issues tends to zero out" or "on average, the plusses and minuses balance each other across projects". It’s probably a question of focus: if organizations focus too much on certain aspects, other aspects are ignored and/or unseen. 

So, your first data project is more likely to fail than to succeed. The question is: what do you do about it? It’s important to be aware of why projects and data projects fail, though starting to consider and monitor each possible issue can prove to be ineffective. One can, however, create a risk register from the list and estimate the rates for each of the potential failures, respectively focus only on the top 3-5 with the highest risk. Of course, one should reevaluate the estimates on a regular basis, though that’s Risk Management 101. 
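As a minimal sketch of such a risk register, one could score each potential failure by likelihood and impact and keep only the top entries. The risk names, the probabilities, the 1-5 impact scale and the "exposure = probability x impact" scoring are all assumptions made for illustration, not values taken from the cited sources.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # assumed likelihood of occurrence, 0..1
    impact: int         # assumed impact on the project, 1 (low) .. 5 (high)

    @property
    def exposure(self) -> float:
        """Simple risk score: likelihood multiplied by impact."""
        return self.probability * self.impact


def top_risks(register, n=3):
    """Return the n risks with the highest exposure."""
    return sorted(register, key=lambda r: r.exposure, reverse=True)[:n]


# Illustrative entries only; the estimates should be reevaluated regularly.
register = [
    Risk("Unclear requirements", 0.6, 5),
    Risk("Poor data quality", 0.7, 4),
    Risk("Key resource unavailable", 0.3, 4),
    Risk("Immature technology", 0.5, 3),
    Risk("Missing executive support", 0.2, 5),
]

for risk in top_risks(register, n=3):
    print(f"{risk.name}: {risk.exposure:.1f}")
```

Rerunning the ranking after each reevaluation shows whether the focus needs to shift to other risks.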

Besides this, one should focus on how the team can make the project succeed. When adopting a technology, methodology or set of processes, it’s recommended to start with a proof-of-concept (PoC). To make the PoC a helpful experience it’s probably important to start with a topic that’s not too big to handle, but that also involves some complexity that would allow the organization to evaluate the targeted set of tools and technologies. It can also be a topic on which other organizations have made important progress, respectively succeeded. The temptation is big to approach the most pressing issues in the organization, respectively to build something big that can have an enormous impact for the organization. Jumping too soon into such topics can just increase the chances of failure. 

One can also formulate the goals, objectives and further requirements in a form that allows the organization to build upon them even if the project fails. A PoC is about learning, building a foundation, doing the groundwork, exploring, mapping the unknown, and identifying what's still missing to make progress, respectively closing the full circle. A PoC is less about overachievement and a big impact, which can happen, though that is a consequence of the good work done in the PoC. 

The bottom line, no matter whether you succeed or fail, once you start a project, you’ll still make it in the statistics! More important is what you’ve learnt after the first data project, respectively how you can use the respective knowledge in further projects to make a difference!


References:
[1] Harvard Business Review (2023) Keep Your AI Projects on Track, by Iavor Bojinov (link)
[2] Cognilytica (2023) The Shocking Truth: 70-80% of AI Projects Fail! (link)
[3] VentureBeat (2019) Why do 87% of data science projects never make it into production? (link)

08 April 2024

Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part III: Failure through the Looking Glass)

Business Intelligence
Business Intelligence Series

There’s a huge volume of material available on project failure: resources that document why individual projects failed, why projects fail in general, and why project members, managers and/or executives think projects fail. There seems to be no more pleasant activity at the end of a project than theorizing about why it failed, the topic occasionally culminating in the blame game. Success may generate applause, though it is failure that attracts attention and stirs the most waves (irony, disapproval, and other similar behavior), and everybody seems to be an expert once the endeavor is over. 

The mere definition of project failure (not fulfilling the project’s objectives within the set budget and timeframe) is a misnomer because budgets and timelines are estimated based on the information available at the beginning of the project, the amount of uncertainty for many projects being considerable, and data projects are no exception. The higher the uncertainty, the less likely the two estimates are to hold. Even simple projects can reveal uncertainty, especially when the broader context of the projects is considered. 

Even if it’s not a common practice, one way to cope with uncertainty is to add a tolerance for the estimates, though even this practice probably will not always accommodate the full extent of the unknown as the tolerances are usually small. The general expectation is to have an accurate and precise landing, which for big or exploratory projects is seldom possible. 

Moreover, the assumptions under which the estimates hold are easily invalidated in practice: resources’ availability, first-time-right delivery, executives’ support in setting priorities, requirements’ quality, technologies’ maturity, etc. If one looks beyond the reasons why projects fail in general, quite often the issues are more organizational than technological, the lack of knowledge and experience being one of the factors. 

Conversely, many projects will not get approved if the estimates don’t look positive, and therefore people are pressured in one way or another to make the numbers fit the expectations. Some projects, given their importance, need to be done even if the numbers don’t look good or can’t be quantified correctly. Other projects represent people’s subsistence on the job, respectively people’s self-occupation to create motion, though they can occasionally also have a positive impact for the organizations. These kinds of aspects almost never make it into statistics or surveys. Neither do the big issues people are afraid to talk about. Not to mention that in the light of politics and the office grapevine the facts get distorted.

Data projects reflect all the symptoms of failure that projects have in general, though when words like AI, Statistics or Machine Learning are used, the chances of failure are even higher given that the respective fields require a higher level of expertise, the appropriate use of technologies and adherence to the scientific process for the results to be valid. If projects can benefit from general recipes, respectively established procedures and methods, their range of applicability decreases when the mentioned areas are involved. 

Many data projects have an exploratory nature (seeing what’s possible) and therefore a considerable percentage will not reach production. Moreover, even those that get that far might end up being stopped or discarded sooner or later if they don’t deliver the expected value, and probably many of the models created in the process are biased, irrelevant, or apply the theory incorrectly. One should add that the mere use of tools and algorithms is not Data Science or Data Analysis. 

The challenge for many data projects is to identify which Project Management (PM) best practices to consider. Following all or no practices at all just increases the risks of failure!


02 March 2024

Business Intelligence: Microsoft Releases for the BI Technology Stack (Timeline)

Business Intelligence
Business Intelligence Series

I started some years back to put together a timeline for the most important events happening in the BI technology stack (work in progress):

2023: Microsoft announces Microsoft Fabric (>>)

  • Synapse Data Warehouse is the next generation of data warehousing in Microsoft Fabric with native support for the delta lake.
  • Data Engineering & Data Science workloads with support for lakehouses, notebooks, Spark Job definitions, models and experiments.
  • Real-Time Analytics is a robust platform tailored to deliver real-time data insights and observability analytics capabilities for a wide range of data types.
  • OneLake provides a single unified storage location for all your data analytics needs.

2022: Microsoft releases SQL Server 2022 (>>)

  • Synapse Link for SQL Server 2022 allows operational data to be replicated seamlessly in near real-time to enable more powerful analytics.
  • Purview is a unified data governance and management service.

2019: Microsoft launches Azure Synapse Analytics service (formerly SQL Data Warehouse), a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics. (>>)

2019: Microsoft releases SQL Server 2019 (>>)

  • Big Data Clusters add-in for SQL Server allows the deployment of scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes (feature to be retired).

2018: Microsoft extends PowerQuery with ETL capabilities. (>>)

2018: Microsoft releases Azure Data Studio, a data management tool for working with SQL Server, Azure SQL DB and SQL DW from Windows, macOS and Linux. (>>)

2017: Microsoft releases Power BI Report Server, an on-premises server that enables Power BI Pro users to publish Power BI reports and distribute them broadly across the enterprise, without requiring report consumers to be licensed individually per use (>>)

2017: Microsoft released SQL Server Data Tools (SSDT), which uses PowerQuery to import and prepare data in SSAS/AAS tabular models.

2017: Microsoft releases SQL Server 2017. (>>)

  • SSRS is no longer available to install through SQL Server setup.
  • Python support added, R Services renamed to Machine Learning Services. (>>)

2016: Microsoft releases SQL Server 2016 (What's new, >>)

  • Query Store allows monitoring and troubleshooting performance issues.
  • SQL Server R Services integrate the R programming language into SQL Server.
  • Direct Query for SSAS.
  • PolyBase for querying the data stored in HDFS. (>>)
  • Support for HDFS in SSIS.
  • Azure SQL Data Warehouse is GA. (>>)
  • Modern reports with SSRS. (>>)
  • Real-Time Operational Analytics. (>>)

2016: SQL Server 2014 Developer Edition becomes free. (>>)

2015: Microsoft announces elastic databases, SQL Data Warehouse & Azure Data Lake. (>>)

  • Elastic databases allow building SaaS applications to manage large numbers of databases that have unpredictable resource demand.
  • Azure SQL Data Warehouse is an elastic data warehouse in the cloud that can dynamically grow, shrink and pause compute in seconds independent of storage.
  • Azure Data Lake is a hyper-scale data store for big data analytic workloads.

2015: Microsoft releases Power BI to the general public.

  • Power BI Designer renamed to Power BI Desktop.

2015: Microsoft releases several Azure services:

  • launches the SQL Server Cloud database.
  • Azure Data Factory (ADF), a fully managed service that does information production by orchestrating data with processing services as managed data pipelines. (>>)
  • Azure Stream Analytics, a fully managed stream processing engine that is designed to analyze and process large volumes of streaming data with sub-millisecond latencies. (>>)

2014: Microsoft released Power BI Designer unifying Power Query, Power Pivot & Power View.

2013: Microsoft announces Power BI for Office 365. (>>)

2012: Microsoft releases SQL Server 2012. (>>)

  • BI Semantic Model for SSAS provides a single, scalable model for BI applications.
  • Parallel Data Warehouse with PolyBase capabilities. 
  • in-memory capabilities. (>>)
  • Windows Azure SQL Reporting service available (>>)
  • SQL Server Data Tools unifies SQL Server and cloud SQL Azure development for both professional database and application developers.

2010: Microsoft released:

  • Power Pivot as part of SQL Server 2008 R2.
  • Azure SQL Database.

2010: Microsoft releases SQL Server 2008 R2.

  • Master Data Services.
  • Power Pivot & Self-service BI capabilities in SSAS.

2008: Microsoft releases SQL Server 2008 (>>)

  • Table compression.
  • Change Data Capture (CDC).

2005: Microsoft releases SQL Server 2005

  • a greatly enhanced version of Analysis Services.
  • SQL Server Integration Services to replace DTS.

2004: Microsoft released SQL Server Reporting Services (SSRS) as add-on to SQL Server 2000.

2000: Microsoft released SQL Server Analysis Services (SSAS) with SQL Server 2000.

1998: Microsoft released SQL Server 7.

  • OLAP services & first MDX specifications.
  • Data Transformation Services (DTS) for ETL workloads.

17 February 2024

Business Intelligence: A Software Engineer's Perspective I (Houston, we have a Problem!)

Business Intelligence Series

One of the criticisms addressed to the BI/Data Analytics, Data Engineering and even Data Science fields is their resistance to applying Software Engineering (SE) methods in practice. SE can be regarded as the application of sound methods, methodologies, techniques, principles, and practices to obtain high-quality, economical software in a reproducible manner. At a minimum, SE techniques and practices proven to work should be applied, for example the use of best practices, reference technologies, standardized processes for requirements gathering and management, etc. This doesn't mean that one should apply the full extent of SE but consider a minimum that makes sense to adopt.

Unfortunately, the creation of data artifacts (queries, reports, data models, data pipelines, data visualizations, etc.) as a process seems to follow the principle of least action, though least action means here the minimum interaction needed to push pieces on a board rather than getting things done. At a high level, the process is as follows: get the requirements, build something, present results, get more requirements, do changes, present the results, and the process is repeated ad infinitum.

Given that the creation of data artifacts finds itself at the intersection of two or more knowledge areas in which knowledge is exchanged in several iterations between the parties involved until a common ground is achieved, this process is totally inefficient from multiple perspectives. First of all, it takes considerably more time than planned to reach a solution, resources being wasted in the process, multiple forms of waste being involved. Secondly, the exchange and retention of knowledge resulting from the process is minimal, mainly on an as-needed basis. This might look like an efficient approach in the short term, but it is inefficient overall.

BI reflects the general issues from SE - most of the issues can be traced back to requirements - if the requirements are incorrect and there's no magic involved in between, then one can't expect the solution to be correct. The bigger the difference between the initial and final requirements elicited in the process, the more resources are wasted. The more time passes between the start of the development phase and the time a solution is presented to the customer, the longer it takes to build the final solution. The time it takes to establish a common ground and the other critical success factors involved in the process have the same impact.

One can address these issues through better requirements elicitation, rapid prototyping, the use of agile methodologies and similar approaches, though the general feeling is that even if they bring improvements, they don't address the root causes - lack of data literacy skills, lack of knowledge about the business, lack of maturity in planning and executing tasks, the absence of well-designed processes and procedures, respectively the lack of an engineering mindset.

These inefficiencies have low impact when building a report occasionally, though they accumulate and tend to create systemic issues in what concerns the overall BI effort. They are addressed locally by experts and in general through a strategic approach like the elaboration of a BI strategy, though organizations seldom pay attention to them. Some organizations consider that they are automatically addressed as part of the data culture, though data culture focuses in general on data literacy and not on the whole set of assumptions mentioned above.

An experienced data professional is more likely to see the inefficiencies and to try to address them locally in his or her interactions with the various stakeholders. He or she can also build a business case for addressing them, though it depends on organizations to recognize that they have a problem, respectively to address the inefficiencies in a strategic and systemic manner!


Business Intelligence: Microsoft Fabric's Notebooks

Business Intelligence Series

When several technologies make their entrance in a data-related field like Data Warehousing, Data Analytics or Data Science, one is forced to understand how the respective technologies can be used or misused, respectively what their place is in the bigger picture. Microsoft Fabric introduces several important technologies that will change the way data are stored, processed and consumed. 

The first important technology is the notebook - a web document-like cell-based container for writing and executing code in a collaborative manner. The concept is not new; Jupyter notebooks have been around for almost a decade. In Microsoft Fabric, notebooks support multiple languages, of which a default one applies to the whole notebook, while at cell level any of the supported languages can be used. 

One can execute a single cell, multiple cells or the entire notebook in a sequential manner, and mix languages for the various operations - load, transform, save, and visualize data when needed. Notebooks can be parametrized and run via the homonymous activity in Data Factory pipelines, thus automating data processing. Probably more functionality is to come. 

Data engineers seem to have great flexibility, though usually flexibility implies constraints and/or mischiefs in other areas. I see for example in presentations the overuse of temporary data objects (mainly views) in Spark SQL as part of complex logic. That's acceptable during prototyping, though such code becomes a danger as soon as the logic is deployed into production. Data objects should be created outside of the logic that uses them and should be treated as artifacts, with version control and proper documentation. It's maybe true that temporary objects reduce the volume of objects in the metastore, though is this the way to go?

Temporary objects tend to lead to reinventing the wheel, or they get duplicated across multiple notebooks, which can easily create a maintenance nightmare. One needs to consider that the business logic changes a lot, the requirements and the data sources change, and in the long term, the cost of maintaining the code can easily outweigh the benefits. 

Notebooks remind me of the beginnings of web programming, when HTML was mixed up with client scripting languages like VBScript or JavaScript, CSS, respectively server-side scripting languages. It was kind of spaghetti code, modified repeatedly by multiple programmers, unendingly duplicated, and through a miracle it worked, until it stopped working unexpectedly in the strangest situations. The strangest part was when removing commented code from a section made the code run again. 

The debugging of another person's code was a nightmare. Code developed by two people for similar purposes looked unrecognizably different in terms of structure, programming techniques and layout. The technical debt was high, increasing in an exponential manner. One was aware that the code needed refactoring, though there were more important things to do or no time allocated for it.

In the meantime the maturity of programming languages, frameworks, methodologies, best practices, and hopefully of programmers improved the overall quality of software (at least on average). Thinking of software from an Engineer's perspective improved the efficiency and effectiveness of a programmer's endeavor. The average programmer is able to write quality code, though there's a considerable minimum of "engineering" knowledge involved beside the mere knowledge of languages and tools. 

Notebooks are good up to a point, beyond which one needs to take a step back, restructure, move the code where it belongs, take a few more steps back and review the good practices and their application, disseminate the knowledge inside the team and use it in the next iterations, respectively refactor the code when needed! Hopefully, people learned from the mistakes of the past. 

Resources:
[1] Microsoft Learn (2023) How to use Microsoft Fabric notebooks (link)

16 February 2024

Business Intelligence: What is a BI Strategy?

Business Intelligence Series

"A BI strategy is a plan to implement, use, and manage data and analytics to better enable your users to meet their business objectives. An effective BI strategy ensures that data and analytics support your business strategy." [1]

The definition is from Microsoft's guide on Power BI implementation planning, a long-awaited resource for those deploying Power BI in their organization. 

I read the definition repeatedly and, even if it looks logically correct, the general feeling is that it falls short, and I'm trying to understand why. A strategy is a plan indeed, even if various theorists use modifiers like unified, comprehensive, integrative, forward-looking, etc. Probably, because it talks about a BI strategy, the definition implies using a strategic plan. Conversely, using "strategic plan" in the definition seems to make the definition redundant, though it would then pull with it everything a strategy is about. 

A business strategy is about enabling users to meet the organization's business objectives, otherwise it would fail by design. Implicitly, an organization's objectives become its employees' objectives. The definition kind of states the obvious. Conversely, it talks only about the users, and not all employees are users. Thus, it refers only to a subset. Shouldn't a BI strategy support everybody? 

Usually, data analytics refers to the procedures and techniques used for exploration and analysis. Isn't it supposed to also consider the visualization of data? Did it forget something else? Ideally, a definition shouldn't define what its terms are about individually, but what they are when used together.

BI as a set of technologies, architectures, methodologies, processes and practices is by definition an enabler if we take these components individually or as a whole. I would play devil's advocate and ask "better than what?". Many of the information systems used in organizations come with a set of reports or functionalities that enable users in their jobs without investing a cent in a BI infrastructure. 

One or two decades ago one of the big words used in sales pitches for BI tools was "competitive advantage". I was asking myself when and where the word disappeared. Is the success of BI technologies so common that the word makes no sense anymore? Did the sellers become more ethical? Or did we recognize that the challenges behind a technology are more of an organizational nature? 

When looking at a business strategy, the hierarchy of business objectives forms its backbone, though there are other important elements that form its foundation: mission, vision, purpose, values or principles. A BI strategy needs to be aligned with the business strategy and the other strategies (e.g. quality, IT, communication, etc.). Being able to trace these kinds of relationships between strategies is essential. 

We talk about BI, Data Analytics, Data Management and, more recently, Data Science. The relationship between them becomes more complex. Therefore, what differentiates a BI strategy from the other strategies? The above definition could apply to the other fields as well. Moreover, does it make sense to include them in one form or another?

Regardless of what the joint field is called, BI and Data Analytics should be about gaining a deeper understanding of the business and disseminating that knowledge within the organization, respectively about exploring courses of action, building the infrastructure, the skillset, the culture and the mindset to approach more complex challenges and not only to enable business goals!

There are no perfect definitions, especially when the concepts used have drifting definitions as well, being caught into a net that makes it challenging to grasp the essence of things. In the end, a definition is good enough if the data professionals can work with it. 

Resources:

[1] Microsoft Learn (2004) Power BI implementation planning: BI strategy (link).

14 February 2024

Business Intelligence: A One Man Show VI (The Lakehouse Perspective)

Business Intelligence Suite

This post continues the ideas on Christopher Laubenthal's article "Why one person can't do everything in the data space" [1] and on why his analogy between a college's functional structure and the core data roles is poorly chosen. In the last post I mentioned as a first argument that the two constructions have different foundations.

Secondly, it's a matter of construction, namely the steps used to arrive from one state to another. Indeed, there's somebody who builds the data warehouse (DWH), somebody who builds the ETL/ELT pipelines for moving the data from the sources to the DWH, somebody who builds the semantic data model that includes business-related logic, respectively people who tap into the data for reporting, data visualizations, data science projects, and whatever is still needed in the organization. On top of this, there should be somebody who manages the DWH. I haven't associated any role with them because one of the core roles can be responsible for more than one step. 

In the case of a lakehouse, it is the data engineer who moves the data from the various data sources to the data lake if that doesn't happen already by design or configuration. As per my understanding, the data engineers are the ones who design and build the new lakehouse, and who move, transform and manage the data as required. The Data Analysts, Data Scientists and maybe some Information Designers can then tap into the data. However, the DWH and the lakehouse(s) are technologies that facilitate their work. They can still do their work if the same data are available by other means.

In what concerns the dorm analogy, the verbs were chosen to match the way data warehouses (DWH) or lakehouses are built, though the congruence of the steps is questionable. One could have compared the number of students with the number of data entities, but not with the data themselves. Usually, students move by themselves and occupy the places. The storytellers, the assistants and researchers are independent of whether the students are hosted in the dorm or not. Therefore, the analogy seems to be a bit forced. 

Frankly, I covered all the steps except the ones related to Data Science by myself for both described scenarios. It helped that I knew the data from the data sources and the transformation rules I had to apply, respectively the techniques needed for moving and transforming the data, and that the volume of data entities was somehow manageable. Conversely, 1-2 more resources in the area of data analysis and visualizations could have helped to bring more value to the business. 

This opens the challenge of scale, which has to do with systems engineering and with how the number of components and the interactions between them increase a system's complexity and the demand for managing the respective components. In the simplest linear models, when the number of components of the same type in the organization is multiplied by a certain factor, the number of resources managing the respective layer grows to some degree by the same factor. E.g. if a data engineer can handle x data entities in a unit of time, then handling n*x entities likely requires at least n data engineers. However, the output of the n engineers is only a fraction of n*x, given the dependencies existing between components and other constraints.
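The linear model above can be sketched in a few lines of Python. This is only an illustration: the function names and the `efficiency` factor (a number between 0 and 1 standing in for the loss caused by dependencies and other constraints) are assumptions made for the example, not measured values.

```python
import math


def engineers_needed(entities: int, capacity_per_engineer: int) -> int:
    """Linear model: headcount grows in proportion to the number of entities."""
    if capacity_per_engineer <= 0:
        raise ValueError("capacity_per_engineer must be positive")
    return math.ceil(entities / capacity_per_engineer)


def effective_output(engineers: int, capacity_per_engineer: int,
                     efficiency: float) -> float:
    """Team output when dependencies consume part of the nominal capacity.

    `efficiency` is a hypothetical factor in [0, 1], not taken from the post.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return engineers * capacity_per_engineer * efficiency


x = 50  # entities one engineer can handle per unit of time (assumed)
n = 4   # multiplier

print(engineers_needed(n * x, x))     # 4 engineers for n*x = 200 entities
print(effective_output(n, x, 0.75))   # 150.0, i.e. only a fraction of 200
```

With an efficiency below 1, covering the full n*x entities requires more than n engineers, which is the gap the paragraph above points at.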

The optimization problem comes down to finding out which data roles to choose to cover an organization's needs. A one-man show can be the best solution for small organizations, though unless there's a good division of labor, bringing in a second person will make the throughput slower before it becomes faster.


Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

13 February 2024

Business Intelligence: A One Man Show V (Focus on the Foundation)

Business Intelligence Suite

I tend to agree that one person can no longer do "everything in the data space", as Christopher Laubenthal put it in his article on the topic [1]. He seems to catch the essence of some of the core data roles found in organizations. Summarizing these roles, data architecture is about designing and building a data infrastructure, data engineering is about moving data, database administration is mainly about managing databases, data analysis is about assisting the business with data and reports, information design is about telling stories, while data science can be about studying the impact of various components on the data. 

However, I find his analogy between a college's functional structure and the core data roles poorly chosen from multiple perspectives, even if both are about building an infrastructure of some type. 

Firstly, the two constructions have different foundations. Data exists in an organization also without data architects, data engineers or database administrators (DBAs)! It's enough to buy one or more information systems functioning as islands and reporting needs will arise. The need for a data architect might come when the systems need to be integrated or maybe when a data warehouse needs to be built, though many organizations are still in business without such constructs. For the others, the more complex the integrations, the bigger the need for a Data Architect. Conversely, some systems can be integrated by design and such capabilities might drive their selection.

Data engineering is needed mainly in the context of the cloud, respectively of data lake-based architectures, where data needs to be moved, processed and prepared for consumption. Conversely, architectures like Microsoft Fabric minimize data movement, the focus being on data processing, on the successive transformations the data needs to undergo in moving from the bronze to the gold layer, respectively on creating an organizational semantic data model. The complexity of the data processing is dependent on the data's structuredness, quality and other characteristics. 

As I mentioned before, modern databases, including the ones in the cloud, reduce the need for DBAs to a considerable degree. Unless the volume of work is big enough to consider a DBA role as an in-house resource, organizations will more likely consider involving a service provider and a contingent to cover the needs. 

Having in-house one or more people acting under the Data Analyst role, people who know and understand the business, respectively the data tools used in the process, can go a long way. Moreover, it's helpful to have an evangelist-like resource in house, a person who is able to raise awareness and knowhow, help diffuse knowledge about tools, techniques, data, results, best practices, respectively act as a mentor for the Data Analyst citizens. From my point of view, these are the people who form the data-related backbone (foundation) of an organization and this is the minimum of what an organization should have!

Once this is established, one can build data warehouses, data integrations and other support architectures, respectively think about BI and Data strategy, Data Governance, etc. Of course, having a Chief Data Officer and a Data Strategy in place can bring more structure in handling the topics at the various levels - strategic, tactical, respectively operational. In constructions one starts with a blueprint and a data strategy can have the same effect, if one knows how to write it and implement it accordingly. However, the strategy is just a tool, while the data-knowledgeable workers are the foundation upon which organizations should build!

The "Build it and they will come" philosophy can work as well, though without knowledgeable and inquisitive people the philosophy is likely to fail.


Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

Business Intelligence: A One Man Show III (The Microsoft Fabric)

Business Intelligence Series

Announced at the end of last year, Microsoft Fabric (MF) became a reality for the data professional, even if there are still many gaps in the overall architecture and some things don't work as they should. The Delta Lake and the various data consumption experiences seem to bring more flexibility but also raise questions on how one can use them adequately in building solutions for Data Analytics and/or Data Science. 

Currently, as it happens with new technologies, data professionals try to explore the functionality, to see what's possible and what's missing, and that's a considerable effort as everybody is more or less on their own. The material released by Microsoft and other professionals should in theory facilitate this effort, though the considerable number of features and the effort needed to review them do the opposite. Some professionals do this as part of their jobs, and exploring the features seems to be a full-time job in each area, while others, like myself, do it in their own time. 

There are organizations that demand from their employees to regularly update their knowledge in their field of activity, respectively to explore how new technologies can be integrated into the organization's architecture. Having a few hours or even a day a week for this can go a long way! Occasionally, I could take 1-2 hours a week during working hours and maybe a few more hours from my own time. Unfortunately, most of the significant progress I made in a certain area (SQL Server, Dynamics 365, Software Engineering, Power BI, and now MF) was made in my own time, which became more and more challenging over time given the pace with which new features and technologies develop.

By comparison, it was relatively easy to locally install SQL Server in its various CTP or community versions, deploy one of the readily-available databases, and start learning. I'm still doing it, playing with a SQL Server 2022 instance whenever I find the time. Similarly, I can use Power BI and a few other tools, depending again on the time available to make progress. However, with MF things slowly start to get blurry. The 60-day trial won't cut it anymore as there are so many things to learn - Spark SQL, PySpark, Delta Lake, KQL, Dataflows, etc. Probably, there will be ways for learning any of these standalone, though not together in an integrated manner. 

The complexity of the tools demands more time, a proper infrastructure and a good project to accommodate them. This doesn't mean that the complexity of the solutions needs to increase as well! Azure Synapse allowed me to reuse many of the techniques I used in the past to build a modern Data Analytics solution, while in other areas I had to accommodate the new. The solution wasn't perfect (only time will tell), though it provided the minimum of what was needed. I expect the same to happen in Microsoft Fabric, even if the number of choices is bigger. 

There's a considerable difference between building a minimal viable solution and exploring, respectively harnessing MF's capabilities. The challenge for many organizations is to determine what that minimum is about, how to build that knowledge into the team, especially when starting from zero. 

Conversely, this doesn't mean that the skillset and effort can't be covered by one person. It might be more challenging, though achievable if the foundation is there, respectively if certain conditions are met. This depends also on the organization's expectations, infrastructure and other characteristics. A whole team is more likely to succeed than one person, but that's not a certainty! 


27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science
Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in what concerns the performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy it is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires besides the language itself also some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, what their advantages and disadvantages are, respectively how they might impact the processing.

Drawing a parallel with writing literature, it's not enough to speak a language; one needs more for becoming a writer, and there are so many levels of mastery! However, in the database world, even if creativity is welcomed, its role is considerably diminished by the constraints existing in the database engine, the problems to be solved, the time and the resources available. More importantly, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables - what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operator algebras, algorithms, and general database technologies and architectures, without much hands-on experience. All is too theoretical and abstract, which is acceptable for research purposes, but not for the contact with the real world out there! Probably some labs offer exposure to real-life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of the art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence from doing some tests, and the eagerness of shouting one’s shaky knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change – the increased complexity of the problems to solve, and, considering the recent technological advances, one can now afford an AI learning buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper into some topics. Moreover, it can help us in time to address the challenges of the ever-increasing complexity of the problems.

02 January 2024

Systems Engineering: Never-Ending Stories in Praxis (Quote of the Day)

Systems Engineering
Systems Engineering Cycle

"[…] the longer one works on […] a project without actually concluding it, the more remote the expected completion date becomes. Is this really such a perplexing paradox? No, on the contrary: human experience, all-too-familiar human experience, suggests that in fact many tasks suffer from similar runaway completion times. In short, such jobs either get done soon or they never get done. It is surprising, though, that this common conundrum can be modeled so simply by a self-similar power law." (Manfred Schroeder, "Fractals, Chaos, Power Laws Minutes from an Infinite Paradise", 1990)

I found the above quote while browsing through Manfred Schroeder's book on fractals, chaos and power laws, a book that also explores related topics like percolation, recursion, randomness, self-similarity, determinism, etc. Unfortunately, when one goes beyond the introductory notes of each chapter, the subjects require more advanced knowledge of Mathematics, respectively further analysis and exploration of the models behind them. Despite this, the book is still an interesting read with ideas to ponder upon.

I found myself a few times in the situation described above - working on a task that didn't seem to end, despite investing more effort, respectively approaching the solution from different angles. The reasons behind such situations were multiple, found typically beyond my direct area of influence and/or decision. In a systemic setup, there are parts of a system that find themselves in opposition, different forces pulling in distinct directions. It can be the case of interests, goals, expectations or solutions which compete or are subject to politics. 

For example, in Data Analytics or Data Science there are high chances that no progress can be made beyond a certain point without addressing first the quality of data or design/architectural issues. The integrations between applications, data migrations and other solutions which heavily rely on data are sensitive to data quality and the architecture's reliability. As long as the source of variability (data, data generators) is not stabilized, providing a stable solution has low chances of success, no matter how much effort is invested, respectively how performant the tools are. 

Some of the issues can be solved by allocating resources to handle their implications. Unfortunately, some organizations attempt to solve such issues by allocating the resources in the wrong areas or by addressing the symptoms instead of taking a step back and looking systemically at the problem, analyzing and modeling it accordingly. Moreover, there are organizations which refuse to recognize they have a problem at all! In the blame game, it's much easier to shift the responsibility on somebody else's shoulders. 

Defining the right problem to solve might prove more challenging than expected and usually this requires several iterations in which the knowledge obtained in the process is incorporated gradually. Other times, one attempts to solve the correct problem by using the wrong methodology, architecture and/or skillset. The difference between right and wrong depends on the context, and even between similar problems and factors the context can make a considerable difference.

The above quote can be corroborated with situations in which perfection is demanded. In IT and management setups, excellence is often confounded with perfection, the latter being impossible to achieve, though many managers take it as the norm. There's a critical point above which the effort invested outweighs the solution's plausibility by an exponential factor.

Another source of unending effort is when requirements change frequently and swiftly - i.e. the rate with which changes occur outweighs the progress made in finding a solution. Unless the requirements are stabilized, the effort spirals outward (in an exponential manner). 

Finally, there are cases with extreme character, in which for example the complexity of the task outweighs the skillset and/or the number of resources available. Moreover, there are problems which accept plausible solutions, though there are also problems (especially systemic ones) which don't have stable or plausible solutions. 

Behind most of such cases lie factors that tend to have chaotic behavior that occurs especially when the environments are far from favorable. The models used to depict such relations are nonlinear, sometimes expressed as power laws - one quantity varying as a power of another, with the variation increasing with each generation. 
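As a hedged illustration of the power laws mentioned above, a power law and its self-similarity under rescaling can be written as follows. The symbols $c$, $\alpha$ and $\lambda$ are assumptions chosen for this sketch, not notation taken from Schroeder's book:

```latex
y(x) = c\,x^{\alpha}, \qquad y(\lambda x) = c\,(\lambda x)^{\alpha} = \lambda^{\alpha}\,y(x)
```

Rescaling $x$ by a factor $\lambda$ only rescales $y$ by the constant factor $\lambda^{\alpha}$, which is why the same shape reappears at every scale.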


Resources:
[1] Manfred Schroeder, "Fractals, Chaos, Power Laws Minutes from an Infinite Paradise", 1990 (quotes)

14 October 2023

Graphical Representation: On Insights II (The Complexity Perspective)

Graphical Representation
Graphical Representation Series

Scientists attempt to discover laws and principles, and for this they conduct experiments, build theories and models rooted in the data they collect. In the business setup, data professionals analyze the data for identifying patterns, trends, outliers or anything else that can lead to new information or knowledge. On one side scientists choose the boundaries of the systems they study, while data professionals, even if the systems are usually given, can make similar choices. 

In theory, scientists are more flexible in what data they collect, though they might have constraints imposed by the boundaries of their experiments and the tools they use. For data professionals most of the data they need is already there, in the systems the business uses, though the constraints reside in the intrinsic and extrinsic quality of the data, whether the data are fit for the purpose. Both parties need to work around limitations, or attempt to improve the experiments, respectively the systems. 

Even if the data might have different characteristics, this doesn't mean that the methods applied by data professionals can't be used by scientists and vice-versa. The closer data professionals move from Data Analytics to Data Science, the higher the overlap between the business and scientific setup. 

Conversely, the problems data professionals meet have different characteristics. Scientists' outlook is directed mainly at the phenomena and processes occurring in nature and society, where randomness, emergence and chaos seem to feel at home. Business processes deal more with predefined controlled structures, cyclicity, higher dependency between processes, feedback and delays. Even if the problems may seem to be different, they can be modeled with system dynamics. 

Returning to data visualization and the problem of insight, there are multiple questions. Can we use simple designs or characterizations to find the answer to complex problems? What characteristics must a piece of information or knowledge have to generate insight? How can a simple visualization generate an insight moment? 

Appealing to complexity theory, there are several general approaches to handling complexity. One approach resides in answering complexity with complexity. This means building complex data visualizations that attempt to model the problem's complexity. For example, this could be done by building a complex model that reflects the problem studied, and building a set of complex visualizations that reflect the different important facets. Many data professionals advise against this approach as it goes against the simplicity principle. On the other hand, starting with something complex and removing the nonessential can prove to be a workable strategy, even if it involves more effort. 

Another approach resides in reducing the complexity of the problem either by relaxing the constraints, or by breaking the problem into simple problems and addressing each one of them with visualizations. Relaxing the constraints allows studying, depending on the case, a more general problem or a linearization of the initial problem. Breaking down the problem into problems that can be solved more easily can help to better understand the general problem, though we might lose sight of emergence and other behavior that characterize complex systems.

Providing simple visualizations for complex problems implies a good understanding of the problem, its solution(s) and the overall context, which frankly is harder to achieve the more complex a problem is. To be understood, a problem requires a minimum of knowledge that needs to be reflected in the visualization(s). Even if some important aspects are assumed as known, they still need to be confirmed by the visualizations, otherwise any deviation from assumptions can lead to a new problem. Therefore, it's questionable that simple visualizations can address the complexity of the problems in a general manner. 


22 October 2022

Data Analytics: Data Lakes/Lakehouses (Just the Quotes)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. [...] The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, "Pentaho, Hadoop, and Data Lakes", 2010) [source] [first known usage]

"A data lake represents an environment that collects and stores large volumes of structured and unstructured datasets, typically in their original, unaltered forms. More than a data depository, the data lake architecture enables the various users and data science teams to conduct data exploration and related analytical activities." (EMC Education Services, "Data Science & Big Data Analytics", 2015)

"A data lake strategy supports the introduction of a separate analytics environment that off-loads the analytics being done today on your overly expensive data warehouse. This separate analytics environment provides the data science team an on-demand, fail-fast environment for quickly ingesting and analyzing a wide variety of data sources in an attempt to address immediate business opportunities independent of the data warehouse's production schedule and service level agreement (SLA) rules." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems', partners', and collaborators' data flows into it and insights spring out. [...] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze." (Beulah S Purra & Pradeep Pasupuleti, "Data Lake Development with Big Data", 2015) 

"Having multiple data lakes replicates the same problems that were created with multiple data warehouses - disparate data siloes and data fiefdoms that don't facilitate sharing of the corporate data assets across the organization. Organizations need to have a single data lake from which they can source the data for their BI/data warehousing and analytic needs. The data lake may never become the 'single version of the truth' for the organization, but then again, neither will the data warehouse. Instead, the data lake becomes the 'single or central repository for all the organization's data' from which all the organization's reporting and analytic needs are sourced." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"[...] the real power of the data lake is to enable advanced analytics or data science on the detailed and complete history of data in an attempt to uncover new variables and metrics that are better predictors of business performance." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"The data lake is not an incremental enhancement to the data warehouse, and it is NOT data warehouse 2.0. The data lake enables entirely new capabilities that allow your organization to address data and analytic challenges that the data warehouse could not address." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Unfortunately, some organizations are replicating the bad data warehouse practice by creating special-purpose data lakes - data lakes to address a specific business need. Resist that urge! Instead, source the data that is needed for that specific business need into an 'analytic sandbox' where the data scientists and the business users can collaborate to find those data variables and analytic models that are better predictors of the business performance. Within the 'analytic sandbox', the organization can bring together (ingest and integrate) the data that it wants to test, build the analytic models, test the model's goodness of fit, acquire new data, refine the analytic models, and retest the goodness of fit." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data warehouse follows a pre-built static structure to model source data. Any changes at the structural and configuration level must go through a stringent business review process and impact analysis. Data lakes are very agile. Consumption or analytical layer can be modified to fit in the model requirements. Consumers of a data lake are not constant; therefore, schema and modeling lies at the liberty of analysts and scientists." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data in the data lake should never get disposed. Data driven strategy must define steps to version the data and handle deletes and updates from the source systems." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data governance policies must not enforce constraints on data - Data governance intends to control the level of democracy within the data lake. Its sole purpose of existence is to maintain the quality level through audits, compliance, and timely checks. Data flow, either by its size or quality, must not be constrained through governance norms. [...] Effective data governance elevates confidence in data lake quality and stability, which is a critical factor to data lake success story. Data compliance, data sharing, risk and privacy evaluation, access management, and data security are all factors that impact regulation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data Lake induces accessibility and catalyzes availability. It warrants data discovery platforms to soak the data trends at a horizontal scale and produce visual insights. It largely cuts down the time that goes into data preparation and exhaustive data analysis." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data Lake is a single window snapshot of all enterprise data in its raw format, be it structured, semi-structured, or unstructured. Starting from curating the data ingestion pipeline to the transformation layer for analytical consumption, every aspect of data gets addressed in a data lake ecosystem. It is supposed to hold enormous volumes of data of varied structures." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data swamp, on the other hand, presents the devil side of a lake. A data lake in a state of anarchy is nothing but turns into a data swamp. It lacks stable data governance practices, lacks metadata management, and plays weak on ingestion framework. Uncontrolled and untracked access to source data may produce duplicate copies of data and impose pressure on storage systems." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data warehousing, as we are aware, is the traditional approach of consolidating data from multiple source systems and combining into one store that would serve as the source for analytical and business intelligence reporting. The concept of data warehousing resolved the problems of data heterogeneity and low-level integration. In terms of objectives, a data lake is no different from a data warehouse. Both are primary advocates of terms like 'single source of truth' and 'central data repository'." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"A data lakehouse is an amalgamation of the best components from both data lakes and data warehouses. A data lakehouse implements data structure and data management features from data warehouses into a cost-effective storage like a data lake. It tries to combine the best from both worlds - data lake - based Big Data analytics and a data warehouse." (Bhadresh Shiyal, "Beginning Azure Synapse Analytics: Transition from Data Warehouse to Data Lakehouse", 2021) 

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"At first, we threw all of this data into a pit called the 'data lake'. But we soon discovered that merely throwing data into a pit was a pointless exercise. To be useful - to be analyzed - data needed to (1) be related to each other and (2) have its analytical infrastructure carefully arranged and made available to the end user. Unless we meet these two conditions, the data lake turns into a swamp, and swamps start to smell after a while. [...] In a data swamp, data just sits there and no one uses it. In the data swamp, data just rots over time." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Data lake architecture suffers from complexity and deterioration. It creates complex and unwieldy pipelines of batch or streaming jobs operated by a central team of hyper-specialized data engineers. It deteriorates over time. Its unmanaged datasets, which are often untrusted and inaccessible, provide little value. The data lineage and dependencies are obscured and hard to track." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Once you combine the data lake along with analytical infrastructure, the entire infrastructure can be called a data lakehouse. [...] The data lake without the analytical infrastructure simply becomes a data swamp. And a data swamp does no one any good." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"With the data lakehouse, it is possible to achieve a level of analytics and machine learning that is not feasible or possible any other way. But like all architectural structures, the data lakehouse requires an understanding of architecture and an ability to plan and create a blueprint." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Delta Lake is a transactional storage software layer that runs on top of an existing data lake and adds RDW-like features that improve the lake’s reliability, security, and performance. Delta Lake itself is not storage. In most cases, it’s easy to turn a data lake into a Delta Lake; all you need to do is specify, when you are storing data to your data lake, that you want to save it in Delta Lake format (as opposed to other formats, like CSV or JSON)." (James Serra, "Deciphering Data Architectures", 2024)

DBMS: Data Warehouse/Mart (Just the Quotes)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"There are four levels of data in the architected environment - the operational level, the atomic (or the data warehouse) level, the departmental (or the data mart) level, and the individual level. These different levels of data are the basis of a larger architecture called the corporate information factory (CIF). The operational level of data holds application-oriented primitive data only and primarily serves the high-performance transaction-processing community. The data-warehouse level of data holds integrated, historical primitive data that cannot be updated. In addition, some derived data is found there. The departmental or data mart level of data contains derived data almost exclusively. The departmental or data mart level of data is shaped by end-user requirements into a form specifically suited to the needs of the department. And the individual level of data is where much heuristic analysis is done." (William H Inmon, "Building the Data Warehouse" 4th Ed., 2005)

"Having a purposeless or poorly performing dashboard is more common than not. This happens when the underlying architecture is not designed properly to support the needs of dashboard interaction. There is an obvious disconnect between the design of the data warehouse and the design of the dashboards. The people who design the data warehouse do not know what the dashboard will do; and the people who design the dashboards do not know how the data warehouse was designed, resulting in a lack of cohesion between the two. A similar disconnect can also exist between the dashboard designer and the business analyst, resulting in a dashboard that may look beautiful and dazzling but brings very little business value." (Nils H Rasmussen et al, "Business Dashboards: A visual catalog for design and deployment", 2009)

"Having multiple data lakes replicates the same problems that were created with multiple data warehouses - disparate data siloes and data fiefdoms that don't facilitate sharing of the corporate data assets across the organization. Organizations need to have a single data lake from which they can source the data for their BI/data warehousing and analytic needs. The data lake may never become the 'single version of the truth' for the organization, but then again, neither will the data warehouse. Instead, the data lake becomes the 'single or central repository for all the organization's data' from which all the organization's reporting and analytic needs are sourced." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"There are, however, many problems with independent data marts. Independent data marts: (1) Do not have data that can be reconciled with other data marts (2) Require their own independent integration of raw data (3) Do not provide a foundation that can be built on whenever there are future analytical needs." (William H Inmon & Daniel Linstedt, "Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault", 2015)

"Unfortunately, some organizations are replicating the bad data warehouse practice by creating special-purpose data lakes - data lakes to address a specific business need. Resist that urge! Instead, source the data that is needed for that specific business need into an 'analytic sandbox' where the data scientists and the business users can collaborate to find those data variables and analytic models that are better predictors of the business performance. Within the 'analytic sandbox', the organization can bring together (ingest and integrate) the data that it wants to test, build the analytic models, test the model's goodness of fit, acquire new data, refine the analytic models, and retest the goodness of fit." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Data quality in warehousing and BI is typically defined in terms of the 4 C’s - is the data clean, correct, consistent, and complete? When it comes to big data, there are two schools of thought that have different views and expectations of data quality. The first school believes that the gold standard of the 4 C’s must apply to all data (big and little) used for clinical care and performance metrics. The second school believes that in big data environments, a stringent data quality standard is impossible, too costly, or not required. While diametrically opposite opinions may play well in panel discussions, they do little to reconcile the realities of healthcare data quality." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"Data warehousing has always been difficult, because leaders within an organization want to approach warehousing and analytics as just another technology or application buy. Viewed in this light, they fail to understand the complexity and interdependent nature of building an enterprise reporting environment." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data warehouse follows a pre-built static structure to model source data. Any changes at the structural and configuration level must go through a stringent business review process and impact analysis. Data lakes are very agile. Consumption or analytical layer can be modified to fit in the model requirements. Consumers of a data lake are not constant; therefore, schema and modeling lies at the liberty of analysts and scientists." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data warehousing, as we are aware, is the traditional approach of consolidating data from multiple source systems and combining into one store that would serve as the source for analytical and business intelligence reporting. The concept of data warehousing resolved the problems of data heterogeneity and low-level integration. In terms of objectives, a data lake is no different from a data warehouse. Both are primary advocates of terms like 'single source of truth' and 'central data repository'." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

19 October 2022

Performance Management: First Time Right (The Aim toward Operational Excellence)

Rooted in Six Sigma methodology as a step toward operational excellence, First Time Right (FTR) implies that any procedure is performed in the right manner the first time and every time. It equates to minimizing waste in its various forms (inventory, motion, overprocessing, overproduction, waiting, transportation, defects). Like many quality concepts from the manufacturing industry, the concept was carried over into the software development process as a principle, process, goal and/or metric. Thus, it became part of Software Engineering, Project Management, Data Science, and other similar endeavors whose outcome is a software product.

Besides the quality aspect, FTR is also rooted in the economic imperative – the need to achieve something in the minimum amount of time with the minimum of effort. It’s about being efficient in delivering a product or achieving a given target. It can be associated with continuous improvement, learning and mastery, the aim being to make FTR part of the organization’s culture.

Even if not explicitly declared, FTR lurks in each planned task. It seems to have become common practice to plan with FTR in mind; however, between this theoretical aim and practice there is, as usual, an important gap. Unfortunately, planners, managers and even the people performing the tasks often forget that mistakes are made and that several iterations are needed to get the job done. It starts with the communication between people in clarifying the requirements and ends with the formal sign-off. All the deviations from FTR add up to the difference between expected and actual effort, though probably more important are the deviations from the plan and all the consequences deriving from them. Especially in complex projects, these add up to a spiral of issues that can easily reinforce one another.
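To make the gap between plan and practice more tangible, here is a minimal sketch of how FTR could be tracked as a metric. It assumes, purely for illustration, that a task counts as "first time right" when it needed a single iteration, and that the deviation is the difference between actual and planned hours. The `Task` structure, the function names and the sample figures are hypothetical and not taken from any specific methodology or tool.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    planned_hours: float  # effort estimated with FTR in mind
    actual_hours: float   # effort actually spent, rework included
    iterations: int       # 1 means the task was right the first time


def ftr_rate(tasks: list[Task]) -> float:
    """Share of tasks completed in a single iteration (assumed FTR definition)."""
    if not tasks:
        return 0.0
    return sum(1 for t in tasks if t.iterations == 1) / len(tasks)


def effort_deviation(tasks: list[Task]) -> float:
    """Total actual effort minus total planned effort, in hours."""
    return sum(t.actual_hours - t.planned_hours for t in tasks)


# Hypothetical sample data, for illustration only
tasks = [
    Task("requirements workshop", 8, 8, 1),
    Task("data migration script", 16, 28, 3),
    Task("report layout", 4, 6, 2),
    Task("sign-off document", 2, 2, 1),
]

print(f"FTR rate: {ftr_rate(tasks):.0%}")
print(f"Effort deviation: {effort_deviation(tasks):+.1f} hours")
```

Even such a crude measure shows how a few tasks that need rework can dominate the deviation from the plan, though in practice what counts as an "iteration" would need to be agreed upon by the team.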

Many of the jobs that involve creativity, innovation, research or exploration require at least several iterations to get the job done, and this is independent of the participants’ professionalism and experience. Moreover, the more quality one needs, the higher the effort, the 80/20 rule being sometimes a good approximation of the effort needed. In extremis, aiming for perfection instead of excellence can turn certain tasks into a never-ending story.

Achieving FTR requires practice - the greater the novelty, the complexity, or the communication and synchronization needs, the more practice is needed. It starts with the individual mastering the individual tasks and ends with the team, where communication, synchronization and other aspects need to be considered. The practice is usually acquired through hands-on work as part of the daily duties, project work, and so on. Unfortunately, it’s based primarily on individual experience, and seldom cultivated in advance, as preparation for future tasks. That’s why, when efficiency is needed in performing critical complex tasks, one also needs to consider the learning curve in achieving the required quality.

Of course, many organizations demand experience from job applicants and, when possible, hire people with experience; however, the diversity, complexity and changing nature of tasks require further practice. This aspect is to some degree recognized in organizations' implementations of the various forms of DevOps, though how many organizations adopt and enforce it on a regular basis? Moreover, a major requirement of today's businesses is to be agile, and besides the mere application of methodologies, being agile also means having an FTR mindset.

FTR starts with the wish for mastery at individual and team level. With the right management attention - allocating time for learning and self-development in the important areas, providing relevant feedback, and building an infrastructure for knowledge sharing and harnessing - FTR can become part of the organization’s culture. It’s up to each of us to do it!

About Me

IT Professional with more than 24 years of experience in IT, covering the full life cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.