13 January 2021

🔏MS Office: Excel for SQL Developers V (Formatting Output to HTML)

Some years back I found a tool for formatting the SQL and VB.Net code in posts (see hilite.me), a tool which made blogging much easier, as I didn't have to format the code manually myself. However, showing the output of the queries in blog posts came down mainly to taking screenshots, which is unproductive and wastes space from my quota. Therefore, today I took the time to create a small Excel macro which formats an MS Excel range as an HTML table. The macro is pretty basic, looping through the range's cells:

'formats a range as an HTML table (the first row is considered the header)
Public Function GetTable(rng As Range) As String
  Dim retval As String
  Dim i As Long, j As Long
  
  retval = "<table style=""width: 90%;color:black;border-color:black;font-size:10px;"" border=""0"" cellpadding=""1"">"
  
  For i = 1 To rng.Rows.Count
    'the header row gets a distinct background color and bold font
    retval = retval & "<tr style=""background-color:" & IIf(i = 1, "#b0c4de", "white") & ";font-weight:" & IIf(i = 1, "bold", "normal") & """>"
    
    For j = 1 To rng.Columns.Count
       'numeric values are right-aligned, the other values left-aligned
       retval = retval & "<td align=""" & IIf(IsNumeric(rng.Cells(i, j).Value), "right", "left") & """>" & rng.Cells(i, j).Value & "</td>"
    Next
    
    retval = retval & "</tr>" & vbCrLf
  Next
  
  retval = retval & "</table>"
  
  GetTable = retval
End Function

Just copy the GetTable macro into a new module in Excel and provide the range containing the data as parameter (e.g. by entering =GetTable(A1:D10) in a free cell).

Unfortunately, copying the macro's output to a text file introduces two double quotes where just one was supposed to be.

This requires, as an intermediary step, replacing the two double quotes with one in Notepad (e.g. via the Replace functionality), respectively removing manually the first and last double quotes.

Notes:
1. Feel free to use and improve the macro. 
2. Further formatting can be added afterwards as deemed fit. 

Happy coding!

💠🛠️SQL Server: Administration (Monitoring the Database Logs)

One of the aspects to monitor on a SQL Server instance is the size of the logs available for each database, respectively the degree to which the logs are used. Starting with SQL Server 2005 this can be achieved by using the 'Log File(s) Used Size (KB)' and 'Log File(s) Size (KB)' counters via the sys.dm_os_performance_counters DMV as follows:

-- log files - size (kb)
SELECT lfu.instance_name database_name
, lfu.cntr_value size_kb
, Cast(lfu.cntr_value/1024.00 as decimal (18,2)) size_MB
FROM sys.dm_os_performance_counters lfu 
WHERE lfu.counter_name LIKE  'Log File(s) Size (KB)%' 
  AND lfu.object_name LIKE 'SQLServer:Databases%'
  AND lfu.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfu.instance_name

-- log files - used size (kb)
SELECT lfs.instance_name database_name
, lfs.cntr_value used_size_kb
, Cast(lfs.cntr_value/1024.00 as decimal (18,2)) used_size_MB
FROM sys.dm_os_performance_counters lfs
WHERE lfs.counter_name LIKE  'Log File(s) Used Size (KB)%' 
  AND lfs.object_name LIKE 'SQLServer:Databases%'
  AND lfs.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfs.instance_name

The two queries can be combined into one as follows:

-- database log space allocation (SQL Server 2005+)
SELECT db.name database_name
, db.log_reuse_wait_desc 
, Cast(lfs.cntr_value/1024.00 as decimal(28,2)) size_MB
, Cast(lfu.cntr_value/1024.00 as decimal(28,2)) AS used_MB
, Cast(100.00*lfu.cntr_value/lfs.cntr_value as decimal(10,2)) used_percent 
, CASE WHEN CAST(lfu.cntr_value AS float) / CAST(lfs.cntr_value AS float) > .5 THEN 
   CASE 
    WHEN db.name = 'tempdb' AND log_reuse_wait_desc NOT IN ('CHECKPOINT', 'NOTHING') THEN 'WARNING'  
    WHEN db.name <> 'tempdb' THEN 'WARNING' 
    ELSE 'OK' 
    END 
  ELSE 'OK' END log_status 
FROM sys.databases db 
     JOIN sys.dm_os_performance_counters lfs 
       ON db.name = lfs.instance_name 
      AND lfs.counter_name LIKE 'Log File(s) Size (KB)%' 
     JOIN sys.dm_os_performance_counters lfu 
       ON db.name = lfu.instance_name 
      AND lfu.counter_name LIKE  'Log File(s) Used Size (KB)%' 
WHERE db.name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY db.name 

Output:
database_name  log_reuse_wait_desc  size_MB  used_MB  used_percent  log_status
master         NOTHING                 1.99     1.11         55.78  WARNING
model          NOTHING                 7.99     1.74         21.80  OK
msdb           NOTHING                28.80     1.96          6.81  OK
tempdb         NOTHING               999.99     0.69          0.07  OK

Starting with SQL Server 2012 the same information can be obtained via the sys.dm_db_log_space_usage DMV, however the view returns information only for the current database:

-- getting the log space only for a database (SQL Server 2012+)
SELECT db_name(database_id) database_name 
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2)) size_MB
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2)) used_MB
, Cast(used_log_space_in_percent as decimal(28,2)) used_percent
FROM sys.dm_db_log_space_usage

Output:
database_name  size_MB  used_MB  used_percent
master            1.99     1.19         59.61
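
To gather the same information for all databases on the instance, the query can be run in the context of each database in turn. Below is a minimal sketch, assuming the undocumented sp_MSforeachdb procedure is available (a cursor over sys.databases would work as well):

-- log space for all databases via sys.dm_db_log_space_usage (SQL Server 2012+)
CREATE TABLE #log_space_usage (
  database_name sysname
, size_MB decimal(28,2)
, used_MB decimal(28,2)
, used_percent decimal(28,2)
);

-- run the query in the context of each database (sp_MSforeachdb is undocumented)
EXEC sp_MSforeachdb N'USE [?];
INSERT INTO #log_space_usage
SELECT db_name(database_id)
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_percent as decimal(28,2))
FROM sys.dm_db_log_space_usage;';

SELECT *
FROM #log_space_usage
ORDER BY database_name;

DROP TABLE #log_space_usage;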

With less flexibility one can obtain the size in MB and the used percentage by using the DBCC utility as follows:

-- retrieving the log usage for all databases
DBCC SQLPERF(LOGSPACE); 

Output:
Database Name         Log Size (MB)  Log Space Used (%)  Status
master                     1,992188           33,13726        0
tempdb                     999,9922         0,06352589        0
model                      7,992188           21,11437        0
msdb                       28,80469           6,617846        0
AdventureWorks2014         33,99219           14,76672        0
AdventureWorksDW2014       17,99219            22,6878        0
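
When the DBCC output needs to be filtered or sorted, it can be captured into a temporary table via INSERT ... EXEC, as in the following minimal sketch:

-- capturing the DBCC SQLPERF output for further filtering/sorting
CREATE TABLE #log_space (
  database_name sysname
, log_size_MB float
, log_space_used_percent float
, [status] int
);

INSERT INTO #log_space
EXEC ('DBCC SQLPERF(LOGSPACE)');

SELECT *
FROM #log_space
ORDER BY log_space_used_percent DESC;

DROP TABLE #log_space;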

Notes:
1. All the mentioned objects require VIEW SERVER STATE permissions.
2. The solution based on the performance counters returns slightly different values than the other solutions, though the differences are negligible. 

Resources: 
[1] SQL Docs (2017) sys.dm_os_performance_counters [source]
[2] SQL Docs (2017) DBCC SQLPERF [source]
[3] SQL Docs (2017) sys.dm_db_log_space_usage [source]

09 January 2021

🧮ERP: Planning (Part I: It’s all about Planning - An Introduction)

ERP Implementation

Ideally, the total volume of work would be uniformly distributed over the project's whole duration, though in practice the curve representing the effort has the form of a wave or an aggregation of waves that tends to reach its peak shortly before or during the Go-Live(s). More importantly, higher fluctuations occur in the various areas of the project over the whole duration, as there are dependencies between the functional areas, as one needs to wait for decisions to be made, people are not available, etc. Typically, the time wasted on waiting, research or other non-value-added activities has the potential of leading to such peaks. Therefore, the knowledge must be available, and decisions must be taken when required, which can be challenging but not impossible to achieve. 

To bridge the waiting time, the team members need to work on other topics. While on the customer's side the resources can perhaps handle other activities, on the vendor's side the costs can be high and proportional to the volume of waiting. Therefore, the vendor's resources must be involved in at least two projects or work on other areas in advance, which is not always possible. However, when the vendor's resources are involved in two or more projects, unless the planning is perfect or each resource can handle the work as it comes, further waiting times are added. The customer is then forced either to book the resources exclusively, or to wait and carry the associated costs. 

On the other side, ERP implementations tend to become exploration projects, especially when the team has only partial knowledge about the system, or the requirements have a certain degree of specialization that deviates from the standard processes. The more unknowns an ERP implementation has, the more difficult it is to plan. To be able to plan, one must know the activities ahead and how long they take, and of course, one must adhere to the delivery dates, because each delay can have a cascading effect that can impact the project's schedule considerably. 

Probably the best approach to planning is to group the activities into packages and plan the packages, leaving it to each subteam to handle the planning within its package, while managing at the upper level only the open issues, risks or opportunities. This distributes the burden from the Project Manager's shoulders across the project. Moreover, even if in theory the plan can consider each activity, it becomes obsolete as soon as it's updated, given the considerable volume of work required to maintain it. Periodically, one can still revise the whole plan to identify opportunities and risks. What the team can do is plan for a certain time interval (e.g. 4-6 weeks) and build from there. This allows focusing on the most important activities. 

To further shift the burden, activities like Data Migration, Data Cleaning or the integrations with third-party systems should be treated, when possible, as subprojects. Despite having interdependencies with the main project (e.g. parameters, master data, decisions) and sharing the same resources, they have their own schedules whose deadlines need to be aligned with the main project's milestones. 

Unless the team and the business put all their effort into respecting the plan, and as long as the plan is realistic, the initial plan can seldom be respected – it's anyway just a sketch of the road ahead that can change as the project progresses – and this aspect needs to be understood by the business, otherwise it will lead to false expectations. On the other side, the team must try to respect the deadlines and communicate in time any inability to do so. It's an interplay in which communication is more important than ever.

Previous Post <<||>> Next Post


🧮ERP: Planning (Part I: It’s all about Partnership - An Introduction)

ERP Implementation
ERP Implementations Series

Unless the organization (customer) implementing an ERP system has a strong IT team and the knowledge required for the implementation is already available in-house, the resources need to be acquired from the market, and probably the right thing to do is to identify a certified implementer (partner) which can fill the knowledge and skillset gaps, respectively help split the risks associated with such an implementation.

In theory, the customer provides knowledge about its processes, while the partner comes with expertise about the system to be implemented and further technologies, industry best practices, project methodologies, etc. Further on, the mix is leveraged to harness the knowledge and reach the project's objectives. 

In practice, however, finding an implementer which can act as a partner might be more challenging than expected. This is because the implementer needs to understand the customer's business and where it's heading, bridge the gap between the functional requirements and the system's functionality, advise on areas of improvement, prepare the customer for the project and lead it through the changes, respectively establish a basis for the future. Some of these implications are seldom made explicit even if they are implied by what the project needs. 

Technology is seldom the issue in an ERP implementation; the challenges reside in handling the change and the logistics required. There are so many aspects to be considered and handled, and this can be challenging for any implementer, no matter how long it has been on the market or how experienced its resources are. Somebody needs to lead the change, and the customer seldom has the knowledge to handle it. In some cases, the implementer must make the customer aware of the implications, while in others it needs to take the initiative and lead the change, though the customer needs to play along, which can also be challenging. 

Many aspects need to be handled at management level on the customer's side, from a strategic point of view. It starts with assuring that the most important aspects of the business were considered, that the goals and objectives are clear, and that the proper environment is created, and ends with timely decision-making, with assuring that the resources are available when needed, that the needed organizational structures and roles are in place, that the required knowledge is available before, during and after the implementation, and that the potential brought by the ERP system is harnessed for the years to come. 

A partnership allows, in theory, splitting the implementation risks, as ERP implementations have a high rate of failure. Quite often the outcomes of such projects don't meet the expectations, the systems being in extremis unusable or a bottleneck for the organization. Ideally, one should work with the partner(s) and attempt to solve the issues, eventually split the incurred cost overruns, and find a middle way. Most of the time it's preferable to find a solution together rather than to resort to litigation. 

Given the complex dependencies existing between the various parts of the project, the causes that lead to poor implementations are difficult to prove, as there are almost always grey areas. Moreover, litigations can require considerable time and resources to settle. These may be just extreme situations, and as long as one has a good partner, there's no need to think that far. On the other side, even if undesirable, one must also be prepared for such outcomes, even if the countermeasures involve additional effort. Therefore, one must address such issues in contracts by establishing the areas of accountability/responsibility for each party, documenting adequately the requirements and further (important) communication, making sure that the deliverables have the expected quality, etc.

Previous Post <<||>> Next Post

05 January 2021

🧮ERP: Planning (Part II: It’s all about Scope - Nonfunctional Requirements & MVP)

ERP Implementation

Nonfunctional Requirements

In contrast to functional requirements (FRs), nonfunctional requirements (NFRs) have no direct impact on the system's behavior but affect end-users' experience with the system, covering thus topics like performance, usability, reliability, compatibility, security, monitoring, maintainability, testability, respectively other constraints and quality attributes. Even if these requirements are in general addressed by design, the changes made to the system have the potential of impacting users' experience negatively.  

Moreover, the NFRs are usually difficult to quantify, and probably that's why they are seldom made explicit in a formal document or are considered only at a high level. However, one can still find a basis for comparison against compliance requirements, general guidelines, standards, best practices or the legacy system(s) (e.g. the performance should not be worse than in the legacy system, the volume of effort for carrying out the various activities should not increase). Even if they can't be adequately described, it's recommended to list the NFRs in general terms in a formal document (e.g. the implementation contract). Failing to do so can open or widen the risk exposure one has, especially when the system lacks important support in the respective areas. In addition, these requirements need to be considered during testing and sign-off as well. 

Minimum Viable Product (MVP)

Besides the consideration of gaps in respect to FRs, it's sometimes important to consider whether the whole functionality is mandatory, especially given the various activities that need to be carried out (parametrization, Data Migration).

For example, one can target implementing a minimum viable product (MVP) - a version of the product which has just enough features to cover the mandatory or the most important FRs. The MVP is based on the idea that implementing about 80% of the needed functionality has in theory the potential of providing a usable product earlier and with a minimum of effort (quick wins), of assuring that the project's goals and objectives are met, respectively of providing a basis for further development. In case of cost overruns, the MVP assures that the business has a workable product and the opportunity of deciding whether it's worth investing more into the project now or later. 

The MVP also allows getting users' feedback early and integrating it into further enhancements and developments. Often the users understand the capabilities of a system, respectively of an implementation, only when they are able to use the system. As this is a learning process, the learning period can take up to a few months until adequate feedback is available. Therefore, postponing the implementation's continuation by a few months can in theory have a positive impact, however it can also come with drawbacks (e.g. the resources are not available anymore). 

A sketch of the MVP usually results from the requirements' prioritization, however the requirements then need to be regarded holistically, as there can be different levels of dependencies between them. In addition, different costs can be incurred if the requirements are handled later, and other constraints may apply as well. Considering an MVP approach can be a double-edged sword. In the worst-case scenario, the business will get only the MVP, with its good and bad characteristics. The business will then be forced to fill the gaps by working outside the system, which can lead to further effort and, in extremis, to poor acceptance of the system. In general, users expect to have their processes fully implemented in the system, an expectation which is not always economically grounded.

After establishing an MVP, one can consider the further requirements (including improvement suggestions) on a cost-benefit basis and implement them accordingly as part of a continuous improvement initiative, even if more time may be required for implementing them.

Previous Post <<||>> Next Post

🧮ERP: Planning (Part II: It’s all about Scope I - Functional Requirements)

ERP Implementation

Introduction

ERP (Enterprise Resource Planning) implementations tend to be expensive projects, the actual costs often overrunning the expectations by an important factor. The causes for this are multiple, the most important ones ranging from the completeness and complexity of the requirements and the impact they have on the organization to the availability of internal and external skilled resources, the project methodology, the project implementation, the organization's maturity in running projects, etc.

The most important decision in an ERP implementation is deciding what one needs, respectively what will be considered for the implementation, aspects reflected in a set of functional and nonfunctional requirements.

Functional Requirements 

The functional requirements (FRs) reflect the expected behavior of the system in respect to the inputs and outputs – what the system must do. Typically, they encompass end-users' requirements in the area of processes, interfaces and data processing, though they are not limited to them. 

The FRs are important because they reflect the future behavior of the system as perceived by the business, serving further as a basis for identifying the project's scope, the gaps between the end-users' requirements and the system's functionality, respectively for estimating the project's duration and areas of focus. Further, they are used as a basis for validating the system's behavior and getting the sign-off for the system. Therefore, the FRs need to have an adequate level of detail and be complete, clear, comprehensible and implementable, otherwise any gaps in the requirements can impact the project in adverse ways. To achieve this state, they need to go through several iterations in which the requirements are reevaluated, enhanced, checked for duplication, relevance or any other important aspect. In the process it makes sense to categorize the requirements and provide further metadata needed for their appraisal (e.g. process, procedure, owner, status, priority). 

Once brought close to a final form, the FRs are checked against the functionality available in the targeted system, or systems when more than one system is considered for evaluation. Ideally, all the requirements can be implemented with the proper parametrization of the system, though that's seldom the case, as each business has certain specifics. The gaps need to be understood, their impact evaluated, and it must be decided whether the gaps need to be implemented. In general, it's recommended to remain close to the standard functionality, as each further gap requires further changes to the system, changes that in time can generate further quality-related and maintenance costs. 

It can become a tedious effort, as in the process an impact and cost-benefit analysis needs to be performed for each gap. Therefore, the gaps' estimation needs to occur earlier or intermixed with their justification. Once the list of FRs is finalized and frozen, it will be used for estimating the final costs of the project, identifying the work packages, respectively planning the further work. Once the FRs are frozen, any new requirements or changes to requirements (including taking out a requirement) need to go through the Change Management process with all the consequences deriving from it – additional effort, costs, delays, etc. This can again trigger an impact and cost-benefit analysis. 

The FRs are documented in a specification document (aka functional requirement specification), which is supposed to track all the FRs through their lifetime. When evaluating the FRs against the system's functionality, it's recommended to provide general information on how they will be implemented, respectively which system function(s) will be used for that purpose. Besides providing transparency, this information can be used as a basis for further discussions. 

Seldom will all the FRs be defined upfront or be complete. Moreover, some requirements will become obsolete during the project's execution, or gaps will be reclassified as standard and vice versa. Therefore, it's important to expect the unexpected.

Previous Post <<||>> Next Post

03 January 2021

🤝Governance: Responsibility (Just the Quotes)

"Weak character coupled with honored place, meager knowledge with large plans, limited powers with heavy responsibility, will seldom escape disaster." ("I Ching" ["Book of Changes"], cca. 600 BC)

"The only way for a large organization to function is to decentralize, to delegate real authority and responsibility to the man on the job. But be certain you have the right man on the job." (Robert E Wood, 1951)

"[...] authority - the right by which superiors are able to require conformity of subordinates to decisions - is the basis for responsibility and the force that binds organization together. The process of organizing encompasses grouping of activities for purposes of management and specification of authority relationships between superiors and subordinates and horizontally between managers. Consequently, authority and responsibility relationships come into being in all associative undertakings where the superior-subordinate link exists. It is these relationships that create the basic character of the managerial job." (Harold Koontz & Cyril O Donnell, "Principles of Management", 1955)

"[...] authority for given tasks is limited to that for which an individual may properly held responsible." (Harold Koontz & Cyril O Donnell, "Principles of Management", 1955)

"If charts do not reflect actual organization and if the organization is intended to be as charted, it is the job of effective management to see that actual organization conforms with that desired. Organization charts cannot supplant good organizing, nor can a chart take the place of spelling out authority relationships clearly and completely, of outlining duties of managers and their subordinates, and of defining responsibilities." (Harold Koontz & Cyril O Donnell, "Principles of Management", 1955)

"Responsibility cannot be delegated. While a manager may delegate to a subordinate authority to accomplish a service and the subordinate in turn delegate a portion of the authority received, none of these superiors delegates any of his responsibility. Responsibility, being an obligation to perform, is owed to one's superior, and no subordinate reduces his responsibility by assigning the duty to another. Authority may be delegated, but responsibility is created by the subordinate's acceptance of his assignment." (Harold Koontz & Cyril O Donnell, "Principles of Management", 1955)

"Viewed internally with respect to the enterprise, responsibility may be defined as the obligation of a subordinate, to whom a superior has assigned a duty, to perform the service required. The essence of responsibility is, then, obligation. It has no meaning except as it is applied to a person." (Harold Koontz & Cyril O Donnell, "Principles of Management", 1955)

"You can delegate authority, but you can never delegate responsibility by delegating a task to someone else. If you picked the right man, fine, but if you picked the wrong man, the responsibility is yours - not his." (Richard E Krafve, The Boston Sunday Globe, 1960)

"Modern organization makes demands on the individual to learn something he has never been able to do before: to use organization intelligently, purposefully, deliberately, responsibly [...] to manage organization [...] to make [...] his job in it serve his ends, his values, his desire to achieve." (Peter F Drucker, The Age of Discontinuity, 1968)

"[Management by objectives is] a process whereby the superior and the subordinate managers of an enterprise jointly identify its common goals, define each individual's major areas of responsibility in terms of the results expected of him, and use these measures as guides for operating the unit and assessing the contribution of each of its members." (Robert House, "Administrative Science Quarterly", 1971)

"'Management' means, in the last analysis, the substitution of thought for brawn and muscle, of knowledge for folkways and superstition, and of cooperation for force. It means the substitution of responsibility for obedience to rank, and of authority of performance for authority of rank. (Peter F Drucker, "People and Performance", 1977)

"[...] the first criterion in identifying those people within an organization who have management responsibility is not command over people. It is responsibility for contribution. Function rather than power has to be the distinctive criterion and the organizing principle." (Peter F Drucker, "People and Performance", 1977)

"The productivity of work is not the responsibility of the worker but of the manager." (Peter F Drucker, "Management in Turbulent Times", 1980)

"By assuming sole responsibility for their departments, managers produce the very narrowness and self-interest they deplore in subordinates. When subordinates are relegated to their narrow specialties, they tend to promote their own practical interests, which then forces other subordinates into counter-advocacy. The manager is thereby thrust into the roles of arbitrator, judge, and referee. Not only do priorities become distorted, but decisions become loaded with win/lose dynamics. So, try as the manager might, decisions inevitably lead to disgruntlement and plotting for the next battle." (David L Bradford & Allan R Cohen, "Managing for Excellence", 1984)

"The man who delegates responsibilities for running the company, without knowing the intimate details of what is involved, runs the enormous risk of rendering himself superfluous." (Harold Geneen, "Managing", 1984)

"Leadership is the total effect you have on the people and events around you. This effect is your influence. Effective leading is being consciously responsible for your organizational influence. [...] The essence of leadership is knowing that YOU CAN NEVER NOT LEAD. You lead by acts of commission and acts of omission." (Kenneth Schatz & Linda Schatz, "Managing by Influence", 1986)

"Looking for differences between the more productive and less productive organizations, we found that the most striking difference is the number of people who are involved and feel responsibility for solving problems." (Michael McTague, "Personnel Journal", 1986)

"Management has a responsibility to explain to the employee how the routine job contributes to the business's objectives. If management cannot explain the value of the job, then it should be eliminated and the employee reassigned." (Douglas M Reid, Harvard Business Review, 1986)

"A systematic effort must be made to emphasize the group instead of the individual. [...] Group goals and responsibilities can usually overcome any negative reactions to the individual and enforce a standard of cooperation that is attainable by persuasion or exhortation." (Eugene Raudsepp, MTS Digest, 1987)

"An individual without information cannot take responsibility; an individual who is given information cannot help but take responsibility." (Jan Carlzon, "Moments of Truth", 1987)

"Executives have to start understanding that they have certain legal and ethical responsibilities for information under their control." (Jim Leeke, PC Week, 1987)

"If responsibility - and particularly accountability - is most obviously upwards, moral responsibility also reaches downwards. The commander has a responsibility to those whom he commands. To forget this is to vitiate personal integrity and the ethical validity of the system." (Roger L Shinn, "Military Ethics", 1987)

"[...] quality assurance is the job of the managers responsible for the product. A separate group can't 'assure' much if the responsible managers have not done their jobs properly. [...] Managers should be held responsible for quality and not allowed to slough off part of their responsibility to a group whose name sounds right but which cannot guarantee quality if the responsible managers have not been able to do so." (Philip W Metzger, "Managing Programming People", 1987)

"Responsibility is a unique concept [...] You may share it with others, but your portion is not diminished. You may delegate it, but it is still with you. [...] If responsibility is rightfully yours, no evasion, or ignorance or passing the blame can shift the burden to someone else. Unless you can point your finger at the man who is responsible when something goes wrong, then you have never had anyone really responsible." (Hyman G Rickover, "The Rickover Effect", 1992)

"If you treat people as though they are responsible, they tend to behave that way." (James P Lewis, "Project Planning, Scheduling, and Control" 3rd Ed., 2001)

"You can’t delegate responsibility without giving a person authority commensurate with it." (James P Lewis, "Project Planning, Scheduling, and Control" 3rd Ed., 2001)

"What do people do today when they don’t understand 'the system'? They try to assign responsibility to someone to fix the problem, to oversee 'the system', to coordinate and control what is happening. It is time we recognized that 'the system' is how we work together. When we don’t work together effectively putting someone in charge by its very nature often makes things worse, rather than better, because no one person can understand 'the system' well enough to be responsible. We need to learn how to improve the way we work together, to improve 'the system' without putting someone in charge, in order to make things work." (Yaneer Bar-Yam, "Making Things Work: Solving Complex Problems in a Complex World", 2004)

"In order to cultivate a culture of accountability, first it is essential to assign it clearly. People ought to clearly know what they are accountable for before they can be held to it. This goes beyond assigning key responsibility areas (KRAs). To be accountable for an outcome, we need authority for making decisions, not just responsibility for execution. It is tempting to refrain from the tricky exercise of explicitly assigning accountability. Executives often hope that their reports will figure it out. Unfortunately, this is easier said than done." (Sriram Narayan, "Agile IT Organization Design: For Digital Transformation and Continuous Delivery", 2015)

"Any software project must have a technical leader, who is responsible for all technical decisions made by the team and have enough authority to make them. Responsibility and authority are two mandatory components that must be present in order to make it possible to call such a person an architect." (Yegor Bugayenko, "Code Ahead", 2018)

"Responsibility means an inevitable punishment for mistakes; authority means full power to make them." (Yegor Bugayenko, "Code Ahead", 2018)

31 December 2020

📊Graphical Representation: Graphics We Live by (Part V: Pie Charts in MS Excel)

Graphical Representation

From business dashboards to newspapers and other forms of content that capture the attention of average readers, pie charts seem to be one of the most used forms of graphical representation. Unfortunately, their characteristics make them inappropriate for displaying certain types of data and prone to misuse. Therefore, there are many voices who advise against using them for any form of display.

It’s hard to agree with radical statements like ‘avoid (using) pie charts’ or ’pie charts are bad’. Each form of graphical representation (aka graphical tool, graphic) has advantages and disadvantages, which makes it appropriate or inappropriate for displaying data having certain characteristics. In addition, each tool can be easily misused, especially when basic representational practices are ignored. Avoiding one representational tool doesn’t mean that the use of another tool will be correct. Therefore, it’s important to make people aware of these aspects and let them decide which tools they should use. 

A graphical tool is expected to represent and describe a dataset in a small area without distorting the reality, while encouraging the reader to compare the different pieces of information, when possible at different levels of detail [1], or how they change over time. As a form of communication, graphics encode information and meaning; the reader needs to be able to read, understand and think critically about graphics and data – what is known as graphical/data literacy.

A pie chart consists of a circle split into wedge-shaped slices (aka wedges, segments), each slice representing a group or category (aka component). It resembles the spokes of a wheel, though with a few exceptions the slices are seldom equidistant. The size of each slice is proportional to the percentage of the component when compared to the whole. Therefore, pie charts are ideal for displaying percentages or values that can be converted into percentages. Thus, the percentages must sum up to 100% (at least that's the reader's expectation).

Within or beside the slices are displayed the components' names and sometimes the percentages or other numeric or textual information associated with them (Fig. 1-4). The percentages become important when the slices seem to be of equal size. As long as the slices have the same radius, comparing the different components comes down to comparing arcs of circles or the chords defined by them, which is not always straightforward. 3-dimensional displays can, depending on the case, make the comparison even more difficult.

Pie Chart Examples

The comparison increases in difficulty as the number of slices increases beyond a certain point. Therefore, it's not recommended to display more than 5-10 components within the same chart. If the components exceed this limit, the exceeding components can be summed up into an “other” component. 

Within a graphic one needs a reference point that can be used as a starting point for exploration. Typically, for categorical data this reference point is the biggest or the smallest value, the other values being sorted in ascending, respectively descending order, which facilitates comparing the values. For pie charts, this would mean sorting the slices based on their sizes, except for the “others” slice, which is typically placed last.

The slices can optionally be filled with meaningful colors or (hatching) patterns. When the same color palette is used, the size can be reflected in the colors' hue, however this can generate confusion when not applied adequately. It's recommended to provide further (textual) information when the graphical elements can lead to misinterpretations. 

Pie charts can be used occasionally for comparing the changes of the same components between different points in time, geographies (Fig. 5-6) or other types of segmentation. Having the charts displayed beside each other and marking each component with a characteristic color or pattern facilitates the comparison. 

Pie Charts - Geographies

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

Data Warehousing

As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed. 

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, thus taking advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.
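
For illustration, a minimal sketch of such SQL-based logic, pushing a data type conversion, a lookup and an aggregation directly into the extraction query (the tables belong to the AdventureWorks sample database):

-- conversions, lookup and aggregation handled directly in the extraction query
SELECT POH.PurchaseOrderID 
, Convert(date, POH.OrderDate) order_date          -- data type conversion
, IsNull(VEN.Name, 'n/a') vendor_name              -- lookup with default value
, Sum(POD.LineTotal) order_amount                  -- aggregation
FROM Purchasing.PurchaseOrderHeader POH
     JOIN Purchasing.PurchaseOrderDetail POD
       ON POH.PurchaseOrderID = POD.PurchaseOrderID
     LEFT JOIN Purchasing.Vendor VEN
       ON POH.VendorID = VEN.BusinessEntityID
GROUP BY POH.PurchaseOrderID
, POH.OrderDate
, VEN.Name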

Unfortunately, SQL can replace only the transformations that address sequential logic and not the workflow-related logic (e.g. conditional splits, merges, multicasts, slowly changing dimensions) or logic that includes a certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic, or simply by using COTS solutions, when they prove capable of filling the gap. 

Copying the data 1:1 at table or entity level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The usual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it's important to consider environment-agnostic data types when possible (e.g. strings). 

Other sources of concern are the user-defined data types which don't have equivalents between the systems, thus needing additional transformations for further handling, respectively the invalid values which need to be handled accordingly. Besides the data from the source system(s) and the derived values, depending on the case one needs to consider the parameter-based or hardcoded metadata created in the process. 
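
A minimal sketch for handling such conversion issues and invalid values, assuming SQL Server 2012+ (the values are illustrative):

-- TRY_CAST/TRY_CONVERT return NULL instead of raising an error for invalid values
SELECT TRY_CAST('1234.56' as decimal(18,2)) valid_amount                      -- returns 1234.56
, TRY_CAST('1,234.56' as decimal(18,2)) invalid_amount                        -- returns NULL
, TRY_CONVERT(date, '20210113', 112) valid_date                               -- ISO yyyymmdd format
, IsNull(TRY_CONVERT(date, '31.02.2021', 104), '1900-01-01') defaulted_date   -- invalid date replaced with a default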

Independently of the purpose of the ETL packages, it is usually required to document the data flow associated with them and the rules applied in the transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes. Even if it's almost impossible to document everything, at a minimum one needs to provide the source and destination tables, the attributes considered in the mappings, respectively the most important rules the business should be aware of. Otherwise, the technical people can always go back to the SQL queries when needed. 

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not each rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic required to move the data could in theory be handled with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic into several steps). Such considerations can also apply to simple ETL packages. Therefore, it's important to recognize such scenarios, weigh the choices and choose what fits best. However, unless one knows what one's doing, it's recommended to use the methods one knows best. 

Previous Post <<||>> Next Post

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

Data Warehousing

As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). It covers in theory the final steps of the data pipeline and in most cases it matches the definition of the query used for data extraction, though this also depends on the transformations used in the solution.

A commonly used approach is dumping the data into an intermediary table in the staging area, a table with no constraints that matches only the data types from the source. Once the data are loaded, they are further copied into the production table. This approach allows minimizing the unavailability of the production table, as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center, however loading the data first into a staging table facilitates troubleshooting and testing. This approach also allows dropping or disabling the indexes on the production table before loading the data and recreating them afterwards. In practice, this proves to be an efficient method for improving the loads' performance.
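
A minimal sketch of the staging-to-production copy, assuming a staging schema (stg), a production table (dbo.FactPurchase) and a nonclustered index on it (the names are illustrative):

-- disable the nonclustered index(es) before the load
ALTER INDEX IX_FactPurchase_OrderDate ON dbo.FactPurchase DISABLE;

-- copy the data from the staging table into the production table
INSERT INTO dbo.FactPurchase (PurchaseOrderID, OrderDate, VendorName, OrderAmount)
SELECT PurchaseOrderID
, OrderDate
, VendorName
, OrderAmount
FROM stg.FactPurchase;

-- rebuild the disabled index afterwards
ALTER INDEX IX_FactPurchase_OrderDate ON dbo.FactPurchase REBUILD;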

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and, depending on the case, also additional metadata (e.g. an identifier for the source system, a unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except for the unique identifier, these metadata can however also be saved in a separate table, either with the same granularity as the load table (1:1) or with one record for each load per table and system, storing a reference to the respective record in the load table. There are seldom logical arguments for using the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be considered directly in the load table(s).
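
A minimal sketch of an incremental load driven by such a timestamp (the names are again illustrative):

-- import only the records created/changed since the last load
DECLARE @last_load datetime = (SELECT Max(LoadTimestamp) FROM dbo.FactPurchase);

INSERT INTO dbo.FactPurchase (PurchaseOrderID, OrderDate, VendorName, OrderAmount, LoadTimestamp)
SELECT PurchaseOrderID
, OrderDate
, VendorName
, OrderAmount
, GetDate() LoadTimestamp            -- the refresh date
FROM stg.FactPurchase STG
WHERE STG.ModifiedDate > IsNull(@last_load, '1900-01-01');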

A special approach is the one considered by the Data Vault methodology for Data Warehousing, which seems to gain increasing acceptance, especially for addressing the various compliance requirements for tracking the changes in records at the most granular level. To achieve this, the fact and dimension tables are split into several tables – the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, the satellite tables store the descriptions of the business keys (the other attributes except the business key), while the reference tables store the dropdown values. Besides the tables' decomposition, there are several other constraints that apply. The decomposition of the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables – the format is already given - one thus needing to accommodate the data into the existing tables or to extend the respective tables to accommodate more attributes. It's the case for load tables storing data from multiple systems with a similar purpose (e.g. financial data from different ERP systems needed for consolidations).

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

 

Data Warehousing
Data Warehousing Series

As part of the ETL process, with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios, the Extract subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is typically done based on SQL queries, as long as one deals with relational databases or any OLEDB- or ODBC-based data repositories, including flat or MS Office files.

One can consider the preparation of the extraction logic as a separate design subprocess of the targeted solution. Even if high-level design decisions are considered at the respective level, the low-level design needs to be considered at ETL package level. As part of the process one identifies the source of the data in terms of systems, tables and attributes to be imported, as well as the joins, business and transformation rules that need to be applied on the data. This can involve reengineering the logic from the source system(s), as well as data profiling, discovery or exploration activities.

A common practice is to copy the source tables 1:1 into the solution, possibly considering only the needed attributes to minimize the necessary space, loading time and content's complexity, even if this adds more effort in the design phase to identify only the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is negligible, copying the table 1:1 could prove to be the best strategy.

A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables for design or other purposes. Thus, the extraction query will attempt to model the entity. This approach reduces to some degree the number of tables in the targeted solution, as well as the number of ETL packages involved, while providing a clear depiction of the entities.

A third approach is to extract the data as needed by the target system, possibly as a mix of master and transaction data, which could easily lead to data redundancy with different timeliness and all the consequences resulting from this. This approach is usually met in solutions which require fast data availability to the detriment of design.

Unfortunately, there can be design constraints or choice considerations that lead to a mix of these approaches. While the impact caused by mixing the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long as the considered data are disconnected from other data.

To reduce data redundancy, it's recommended to set as a goal the creation of a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally, there should be no overlaps. On the other side, overlaps can be acceptable when the same data are available in more systems and the solution requires all the data to be available.

While the above approaches consider the vertical partitioning of the data, there can also be horizontal partitioning needs, especially when only a subset of the data is needed or when the data must be partitioned based on a set of values. In addition, one might be forced to also include transformation rules directly in the extraction logic, for example to handle conversion issues or minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.
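
A minimal sketch combining vertical and horizontal partitioning with a simple transformation rule (the table belongs to the AdventureWorks sample database):

-- extraction with vertical partitioning (only the needed attributes),
-- horizontal partitioning (only the active records) and a simple transformation rule
SELECT ITM.ProductID 
, ITM.ProductNumber
, ITM.Name
, IsNull(ITM.Color, 'n/a') Color      -- transformation rule pushed into the extraction
FROM Production.Product ITM
WHERE ITM.SellEndDate IS NULL         -- only the products still being sold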

Previous Post <<||>> Next Post

🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing
Data Warehousing Series

One of the interesting concepts that seems to gain adherents in Data Warehousing is the Data Vault – a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, which evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes occurring in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between the 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture, respectively a series of standards, methods and best practices supposed to facilitate its use.
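
For illustration, a minimal sketch of how a business entity (e.g. a Vendor and its Purchase Orders) might be decomposed into hub, link and satellite tables (the names and attributes are illustrative, not part of the standard):

-- hub: the business key plus load metadata
CREATE TABLE dwh.HubVendor (
  VendorHashKey char(32) NOT NULL PRIMARY KEY   -- hash of the business key
, VendorNumber nvarchar(20) NOT NULL            -- the business key
, LoadDate datetime NOT NULL
, RecordSource nvarchar(50) NOT NULL
);

-- satellite: the descriptive attributes of the business key, historized by LoadDate
CREATE TABLE dwh.SatVendor (
  VendorHashKey char(32) NOT NULL REFERENCES dwh.HubVendor(VendorHashKey)
, LoadDate datetime NOT NULL
, VendorName nvarchar(100) NULL
, PaymentTerms nvarchar(20) NULL
, RecordSource nvarchar(50) NOT NULL
, PRIMARY KEY (VendorHashKey, LoadDate)
);

-- link: the relationship between two business keys
CREATE TABLE dwh.LinkVendorPurchaseOrder (
  LinkHashKey char(32) NOT NULL PRIMARY KEY
, VendorHashKey char(32) NOT NULL REFERENCES dwh.HubVendor(VendorHashKey)
, PurchaseOrderHashKey char(32) NOT NULL        -- references a HubPurchaseOrder (not shown)
, LoadDate datetime NOT NULL
, RecordSource nvarchar(50) NOT NULL
);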

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, thus allowing one to address the compliance requirements of Sarbanes-Oxley, HIPAA and Basel II by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce or NoSQL can be integrated within its architecture, though the same can be said about other approaches as long as there's compatibility between the considered technologies. In fact, except for the business entities' decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against them, as more joins are required to gather the data from the various tables. The further decomposition of the tables can also lead to higher data storage needs, though this can be negligible compared with the volume of additional objects that need to be created in the DWH. For an ERP system with a few hundred meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, thus increasing the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault are associated, as usual, with a poor understanding of the methodology, a poor level of skills or the attempt to implement the methodology without allowing some flexibility when required. Unless one knows what one is doing, bringing more complexity into a field which is already complex can easily impact the projects' outcomes negatively.

Previous Post <<||>> Next Post

References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

27 November 2020

🧊Data Warehousing: ETL (Part II: An Introduction)

 


ETL (Extract, Transform, Load) processes, technologies or tools are about extracting data from one or more data sources via a set of queries, performing changes on the data via conversions, aggregations, mappings or other types of transformations, respectively loading the data into target tables or other types of repositories. Thus, an ETL process allows moving and transforming data between predefined data structures on an ad-hoc basis or as part of stable repetitive processes, which makes ETL ideal for data warehousing, data integrations, data migrations or similar scenarios. 

ETL Data Flow

Extract: The extraction of data is typically done based on SQL queries from relational databases or any OLEDB- or ODBC-based data repositories, including flat or MS Office files, though modern ETL tools can support other types of queries (CAML, XQuery, DAX) or even NoSQL architectures (Hadoop). This allows addressing a wide range of requirements, the complexity of the logic depending on the functionality provided by the query languages, respectively on the extraction functionality available.  

Transform: The transformation logic can be implemented based on the functionality provided by the ETL tool, and can involve, depending on the case, any combination of aggregates, conditional splits, merges, lookups, multicasts, pivoting/unpivoting, cleansing, data conversions, sampling, mapping or any other transformations that can be performed on an in-transit dataset. On the other side, quite often the same can be achieved with the help of SQL-based manipulations directly in the extraction logic or later in the process. SQL can occasionally prove to be faster and more flexible than the transformations provided by the ETL tool, however, despite the overlaps, the two approaches can complement each other when used adequately. 

Load: The load is usually just a dump of the data into one or more final or intermediary tables with predefined structures. Unless the data don't match the data types, formats or further defined constraints, the load seldom involves further challenges, as long as the solution was designed adequately. 

Within the logical model, extract, transform and load can be considered as processes by themselves. Within the object model provided by the ETL tool, they are considered in the mentioned sequence within a data flow, which, within a set of workflow constraints, defines how the data move through the pipeline – the sequence of processing steps considered. The basic unit of work is the data flow and the workflow it belongs to, a unit that can be encapsulated in one container for easier management or simply for convenience. Several containers can be linked within a workflow to create more complex behavior. 

The data flows and workflow constraints, together with the supporting connections and containers form an ETL package, the main unit of work for encapsulating and running ETL logic. ETL packages are scheduled and run as fit for the purpose.

With the right design, these building blocks allow enough flexibility in handling ad-hoc requests or in building complex solutions. This involves decisions on how to partition the ETL packages, respectively the data flows, in which order they should be run, where and in which sequence the data should be transformed, how to handle exceptions, how to build eventual intermediary data repositories, how to handle audit requirements, and so on. Each of these choices can prove to be important. 

Knowledge of the ETL architecture and functionality is quintessential in providing the right solution for the problem considered, however once the basics are understood, the challenges typically reside in understanding the source and/or target structures, the logical and physical entities available, identifying the ways the data can be partitioned horizontally or vertically, respectively what types of transformations are required for moving the data, as required by the solution. 

Previous Post <<||>> Next Post

07 November 2020

⛁DBMS: Event Streaming Databases (More of a Kafka’s Story)

Database Management

Event streaming architectures are architectures in which data are generated by different sources, and then processed, stored, analyzed, and acted upon in real-time by the different applications tapping into the data streams. An event streaming database is then a database that assures that its data are continuously up-to-date, providing specific functionality like the management of connectors, materialized views and running queries on data-in-motion (rather than on static data). 
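
As a rough illustration of how such functionality can look, the sketch below uses ksqlDB-like SQL to define a stream over a hypothetical 'orders' topic and a materialized view kept continuously up-to-date as events arrive; topic, columns and exact syntax are assumptions for illustration rather than a definitive implementation:

-- stream over a Kafka topic (hypothetical topic and columns)
CREATE STREAM orders (order_id VARCHAR, customer_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC = 'orders', VALUE_FORMAT = 'JSON');

-- materialized view, continuously updated as new events arrive
CREATE TABLE sales_per_customer AS
  SELECT customer_id
  , SUM(amount) AS total_amount
  FROM orders
  GROUP BY customer_id
  EMIT CHANGES;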

Reading about this type of technologies one can easily start fantasizing about the Web as a database in which intelligent agents process streams of data in real-time, in which knowledge is derived and propagated over the networks in an infinite and ever-growing flow whose limits are hardly perceptible, in which the agents act as a mind disconnected in the end from the human intent. One is struck by the fusion of realism and the fantastic, much like in one of Kafka’s stories, in which the metamorphosis of technologies and social aspects can easily lead to absurd implications.

The link to Kafka was somehow suggested by Apache Kafka, an open-source distributed event streaming platform, which seems to lead the trends within this newly developing market. Kafka provides database functionality and guarantees the ACID (atomicity, consistency, isolation, durability) properties of transactions while tapping into data streams. 

Data streaming is an appealing concept, though it has some important challenges like data overload or over-flooding, the complexity derived from building specific (business) and integrity rules for processing the data, and the difficulty of keeping data consistency and truth within the ever-growing and ever-changing flows. 

Data overload or over-flooding occurs when applications are not able to keep pace with the volume of data or events fired with each change. Imagine raindrops falling on a wide surface in which each millimeter or micrometer has its own rules for handling the raindrops, and this at large scale. The raindrops must infiltrate into the surface to be processed and find their way to the water flows beneath, aggregating up to streams that could nurture seas or even oceans. The same metaphor can be applied to data events, in which the data pervade applications, accumulating in streams processed by engines to derive value. However, heavy rains can easily lead to floods in which water aggregates at the surface. 

Business applications rely on predefined paths in which the flow of data is tidily linked to specific actions, themselves part of process sequences that reflect the material or cash flow. Any variation in the data flow from expectations will lead to inefficiencies and ultimately to chaos. Some benefit might be derived from data integrations between the business applications; however, applications must be designed for this purpose and handle extreme behaviors like over-flooding. 

Data streams are maybe ideal for social media networks, in which one broadcasts data through the networks and any consumer that can tap into the network can further use the respective data. We can see, however, the problems of today's social media – data, better said information, flow through the networks being changed to fit purposes that can easily diverge from the initial intent. Moreover, information gets denatured, misused, overused to the degree that it overloads the networks, making it more and more difficult to distinguish between reliable and unreliable information. While common sense helps humans in handling such information, the same cannot be said about machines or applications. 

It will be challenging to deal with the vastness, vagueness, uncertainty, inconsistency, and deceit of the networks of the future; however, data streaming will more likely have a future as long as it can address such issues by design. 


06 November 2020

🧭Business Intelligence: Perspectives (Part VI: Data Soup - Reports vs. Data Visualizations)

Business Intelligence Series

Considering visualizations, John Tukey remarked that ‘the greatest value of a picture is when it forces us to notice what we never expected to see’, which is not always the case for many of the graphics and visualizations available in organizations, typically in the form of simple charts and dashboards, quite often with no aesthetics or meaning behind them.

In general, reports are needed as a source for operational activities, in which the details in the form of raw or aggregated data are important. As one moves further to the tactical or strategic aspects of a business, visualizations gain in importance, especially when they allow encoding data and information (variations, trends or relations) in less space with minimal loss of information.

There are also different aspects of visualizations that need to be considered. Modern tools allow rapid visualization and interactive navigation of data across different variables, which is great as long as one knows what one is searching for, which is not always the case.

There are junk charts in which the data drowns in graphical elements that bring no value to the reader, in extremis even distorting the message/meaning.

There are graphics/visualizations that attempt to bring together and encode multiple variables in respect to a theme, and for which a ‘project’ is typically needed, as the data are not readily available, don't have the desired quality, or need further transformations to be ready for consumption. Good quality graphics/visualizations require time and a good understanding of the business, which are not necessarily available within the BI/Analytics teams, and unfortunately few organizations do something in that direction, typically ignoring such needs. In this type of environment the stress falls on the rapid availability of data for decision-making or action-relevant insight, which typically depends on the consumer.

The story-telling capabilities of graphics/visualizations are often exaggerated. Yes, they can tell a story, though stories need to be framed within a context/problem, and some background and further references need to be provided; without detailed data, the graphics/visualizations are just nice representations from which each consumer understands what they can.

In an ideal world the consumer and the ‘designer’ would work together to identify the data important for the theme considered, to find the appropriate level of detail, and to choose the best form of encoding. Such attempts can stop at table-based representations (aka reports) or go further to basic or richer forms of graphical representation. One can consider reports as an early stage of the visualization process, with the potential to derive more value when the data allow meaningful graphical representations. Unfortunately, the time, data and knowledge available seldom make this achievable.

In addition, a well-designed report can be used as a basis for multiple purposes, while a graphic/visualization can impose more limitations. Ideally, multiple forms of representation (including reports) would be combined to harness the value of data. Navigating from visualizations to detailed data can be useful to understand what happens, learning and understanding the various aspects being an iterative process.

It’s also difficult to demonstrate the value of insight derived from visualizations, especially when graphical literacy lags behind numeracy and statistical literacy, many consumers lacking the skills needed to evaluate numbers and statistics adequately. If for a good art film one needs some assistance to enjoy the show and understand the message(s) behind it, the same can be said about good graphics/visualizations. Moreover, this requires creativity, abstraction-based thinking, and other capabilities to harness the value of representations.

Given the considerable volume of requirements related to the need for basis data, reports will continue to be in high demand in organizations. In exchange, visualizations can complement them by providing insights otherwise not available.

Initially published on Medium as an answer to a post on Reporting and Visualizations. 

