31 October 2020

Data Warehousing: Data Lakes & other Puddles

Data Warehousing

One can consider a data lake as a repository of all of an organization’s data found in raw form, however this constraint might be too harsh as the data found at different levels of processing can be imported as well, for example the results of data mining or other Data Science techniques/methods can be considered as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. It’s contrasted thus the objective-oriented, limited and single-purposed role of the data mart/warehouse in respect to the flow of data in nature that could be tapped and harnessed as desired. These are though metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, in which the data mart or warehouse is integrant part. Imposing further constrains seem to have no benefit.  

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road and requires specific knowledge about the techniques used, as well about organization’s processes and data. Without this data lake-based solutions can lead to erroneous results, same as mixing several ingredients without having knowledge about their usage can lead to cooking experiments aloof from the art of cooking.

A characteristic of data is that they go through continuous change and have different timeliness, respectively degrees of quality in respect to the data quality dimensions implied and sources considered. Data need to reflect the reality at the appropriate level of detail and quality required by the processing application(s), this applying to data warehouses/marts as well data lake-based solutions.

Data found in raw form don’t necessarily represent the true/truth and don’t necessarily acquire a good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of the data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping the data integrity at various levels and layers can be maybe the most important demand upon solutions.

Snapshots as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, prove to be often the best mechanisms in keeping data integrity when this aspect is essential to their processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader span of the solutions over the sources, the more difficult it become to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most of the business scenarios in which the characteristics of data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means integrity is harder to obtain, especially when there’s no timestamp to allow data partitioning on a time scale, the handling of data integrity becoming thus in extremis a programmer’s task. In addition, getting snapshots of the data as they are changed can be a costly and futile task.

Further on, maintaining data integrity can prove to be a matter of design in respect not only to the processing of data, but also in respect to the source applications and the business processes they implement. The mastery of the underlying principles, techniques, patterns and methodologies, helps in the process of designing the right solutions.

Written as answer to a Medium post on data lakes and batch processing in data warehouses. 

30 October 2020

Data Science: Generalists vs Specialists in the Field of Data Science

Data Science

Division of labor favorizes the tasks done repeatedly, where knowledge of the broader processes is not needed, where aspects as creativity are needed only at a small scale. Division invaded the IT domains as tools, methodologies and demands increased in complexity, and therefore Data Science and BI/Analytics make no exception from this.

The scale of this development gains sometimes humorous expectations or misbelieves when one hears headhunters asking potential candidates whether they are upfront or backend experts when a good understanding of both aspects is needed for providing adequate results. The development gains tragicomical implications when one is limited in action only to a given area despite the extended expertise, or when a generalist seems to step on the feet of specialists, sometimes from the right entitled reasons. 

Headhunters’ behavior is rooted maybe in the poor understanding of the domain of expertise and implications of the job descriptions. It’s hard to understand how people sustain of having knowledge about a domain just because they heard the words flying around and got some glimpse of the connotations associated with the words. Unfortunately, this is extended to management and further in the business environment, with all the implications deriving from it. 

As Data Science finds itself at the intersection between Artificial Intelligence, Data Mining, Machine Learning, Neurocomputing, Pattern Recognition, Statistics and Data Processing, the center of gravity is hard to determine. One way of dealing with the unknown is requiring candidates to have a few years of trackable experience in the respective fields or in the use of a few tools considered as important in the respective domains. Of course, the usage of tools and techniques is important, though it’s a big difference between using a tool and understanding the how, when, why, where, in which ways and by what means a tool can be used effectively to create value. This can be gained only when one’s exposed to different business scenarios across industries and is a tough thing to demand from a profession found in its baby steps. 

Moreover, being a good data scientist involves having a deep insight into the businesses, being able to understand data and the demands associated with data – the various qualitative and quantitative aspects. Seeing the big picture is important in defining, approaching and solving problems. The more one is exposed to different techniques and business scenarios, with right understanding and some problem-solving skillset one can transpose and solve problems across domains. However, the generalist will find his limitations as soon a certain depth is reached, and the collaboration with a specialist is then required. A good collaboration between generalists and specialists is important in complex projects which overreach the boundaries of one person’s knowledge and skillset. 

Complexity is addressed when one can focus on the important characteristic of the problem, respectively when the models built can reflect the demands. The most important skillset besides the use of technical tools is the ability to model problems and root the respective problems into data, to elaborate theories and check them against reality. 

Complex problems can require specialization in certain fields, though seldom one problem is dependent only on one aspect of the business, as problems occur in overreaching contexts that span sometimes the borders of an organization. In addition, the ability to solve problems seem to be impacted by the diversity of the people involved into the task, sometimes even with backgrounds not directly related to organization’s activity. As in evolution, a team’s diversity is an important factor in achievement and learning, most gain being obtained when knowledge gets shared and harnessed beyond the borders of teams.

Written as answer to a Medium post on Data Science generalists vs specialists.

Data Science: Big Data vs. Business Strategies

Data Science

A strategy, independently on whether applied to organizations, chess, and other situations, allows identifying the moves having the most promising results from a range of possible moves that can change as one progresses into the game. Typically, the moves compete for same or similar resources, each move having at the respective time a potential value expressed in quantitative and/or qualitative terms, while the values are dependent on the information available about one’s and partners’ positions into the game. Therefore, a strategy is dependent on the decision-making processes in place, the information available about own business, respective the concurrence, as well about the game.

Big data is not about a technology but an umbrella term for multiple technologies that support in handling data with high volume, veracity, velocity or variety. The technologies attempt helping organizations in harnessing what is known as Big data (data having the before mentioned characteristics), for example by allowing answering to business questions, gaining insight into the business or market, improving decision-making. Through this Big data helps delivering value to businesses, at least in theory.

Big-data technologies can harness all data of an organization though this doesn’t imply that all data can provide value, especially when considered in respect to the investments made. Data bring value when they have the potential of uncovering hidden trends or (special) patterns of behavior, when they can be associated in new meaningful ways. Data that don’t reflect such characteristics are less susceptible of bringing value for an organization no matter how much one tries to process the respective data. However, looking at the data through multiple techniques can help organization get a better understanding of the data, though here is more about the processes of attempting understanding the data than the potential associated directly with the data.

Through active effort in understanding the data one becomes aware of the impact the quality of data have on business decisions, on how the business and processes are reflected in its data, how data can be used to control processes and focus on what matters. These are aspects that can be corroborated with the use of simple BI capabilities and don’t necessarily require more complex capabilities or tools. Therefore allowing employees the time to analyze and play with the data, can in theory have a considerable impact on how data are harnessed within an organization.

If an organization’s decision-making processes is dependent on actual data and insight (e.g. stock market) then the organization is more likely to profit from it. In opposition, organizations whose decision-making processes hand handle hours, days or months of latency in their data, then more likely the technologies will bring little value. Probably can be found similar examples for veracity, variety or similar characteristics consider under Big data.

The Big data technologies can make a difference especially when the extreme aspects of their characteristics can be harnessed. One talks about potential use which is different than the actual use. The use of technologies doesn’t equate with results, as knowledge about the tools and the business is mandatory to harness the respective tools. For example, insight doesn’t necessarily imply improved decision-making because it relies on people’s understanding about the business, about the numbers and models used.

That’s maybe one of the reasons why organization fail in deriving value from Big data. It’s great that companies invest in their Big data, Analytics/BI infrastructures, though without working actively in integrating the new insights/knowledge and upgrading people’s skillset, the effects will be under expectations. Investing in employees’ skillset is maybe one of the important decisions an organization can make as part of its strategy.

Written as answer to a Medium post on Big data and business strategies. 

14 October 2020

Strategic Management: Simplicity VI (ERP Implementations' Story II)

Besides the witty sayings and theories advanced in defining what simplicity is about, life shows that there’s a considerable gap between theory and praxis. In the attempt of a definition one is forced to pull more concepts like harmony, robustness, variety, balance, economy or proportion, which can be grouped under organic unity or similar concepts. However, intuitionally one can advance the idea that from a cybernetic perspective simplicity is achieved when the information flows are not disrupted and don’t meet unnecessary resistance. By information here are considered the various data aggregations – data, information, knowledge, and eventually wisdom (aka DIKW pyramid) – though it can be extended to encompass materials, cash and vital energy.

One can go further and say that an organization is healthy when the various flows mentioned above run smoothly through the organization nourishing it. The comparison with the human body can go further and say that a blockage in the flow can cause minor headaches or states that can take a period of convalescence to recover from them. Moreover, the sustained effort applied by an organization can result in fatigue or more complex ailments or even diseases if the state is prolonged. 

For example, big projects like ERP implementations tend to suck the vital energy of an organization to the degree that it will take months to recover from the effort, while, the changes in the other types of flow can lead to disruptions, especially when the change is not properly managed. Even if ERP implementations provide standard solutions for the value-added processes, they represent vendors’ perspective into the respective processes, which don’t necessarily fit an organization’s needs. One is forced then to make compromises either by keeping close to the standard or by expanding the standard processes to close the gap. Either way processual changes are implied, which affect the information flow, especially for the steps where further coordination is needed, respectively the data flow in respect to implementation or integration with the further systems. A new integration as well a missing integration have the potential of disrupting the data and information flows.

The processual changes can imply changes in the material flow as the handling of the materials can change, however the most important impact is caused maybe by the processual bottlenecks, which can cause serious disruptions (e.g. late deliveries, production is stopped), and upon case also in the cash-flow (e.g. penalties for late deliveries, higher inventory costs). The two flows can be impacted by the data and information flows independently of the processual changes (e.g. when they have poor quality, when not available, respectively when don’t reach the consumer in timely manner). 

With a new ERP solution, the organization needs to integrate the new data sources into the existing BI infrastructure, or when not possible, to design and implement a new one by taking advantage of the technological advancements. Failing to exploit this potential will impact the other flows, however the major disruptions appear when the needed knowledge about business processes is not available in-house, in explicit and/or implicit form, before, during and after the implementation. 

Independently on how they are organized – in center of excellence or ad-hoc form – is needed a group of people who can manage the various flows and ideally, they should have the appropriate level of empowerment. Typically, the responsibility resides with key users, IT and one or two persons from the management. Without a form of ‘organization’ to manage the flows, the organization will reside only on individual effort, which seldom helps reaching the potential. Independently of the number of resources involved, simplicity is achieved when the activities flow naturally.  

Related Posts Plugin for WordPress, Blogger...