
30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)


As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed. 

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, thus taking advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic, when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.
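
A minimal sketch of this encapsulation, assuming a SQL Server source with hypothetical Sales.Orders and Sales.Customers tables; the view hides the joins, conversions and derived values, so the ETL package only issues a simple SELECT:

    -- Encapsulate the preparation logic in a view, maintained independently of the ETL package
    CREATE VIEW dbo.vSalesOrdersExtract
    AS
    SELECT ord.OrderId
         , cst.CustomerNumber
         , CAST(ord.OrderDate AS date) AS OrderDate      -- normalize to date only
         , CAST(ord.Amount AS decimal(19,4)) AS Amount   -- align precision with the destination
         , UPPER(LTRIM(RTRIM(cst.CountryCode))) AS CountryCode
         , CASE WHEN ord.Amount < 0 THEN 'Credit' ELSE 'Invoice' END AS DocumentType  -- derived value
      FROM Sales.Orders ord
      JOIN Sales.Customers cst
        ON ord.CustomerId = cst.CustomerId;

    -- The package's extraction query then reduces to:
    -- SELECT * FROM dbo.vSalesOrdersExtract;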

Unfortunately, SQL can replace only the transformations that address sequential logic, and not the workflow-related logic (e.g. conditional splits, merges, multicasts, slowly changing dimensions) or logic that involves a certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic, or simply by using COTS solutions, when they prove capable of filling the gap.

Copying the data 1:1 at table or entity level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The usual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it’s important to consider environment-agnostic data types when possible (e.g. strings).
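
A hedged sketch of such defensive conversions, assuming SQL Server 2012 or later (TRY_CONVERT and TRY_CAST return NULL instead of failing on invalid values) and a hypothetical staging table with string-typed columns:

    -- Convert string-staged values defensively; invalid values become NULL for later handling
    SELECT stg.RecordId
         , TRY_CONVERT(date, stg.PostingDate, 104) AS PostingDate  -- style 104 = dd.mm.yyyy source format
         , TRY_CAST(REPLACE(stg.Amount, ',', '.') AS decimal(19,4)) AS Amount  -- decimal comma to decimal point
      FROM dbo.StagingInvoices stg;

Staging the values as strings and converting them explicitly keeps the load environment-agnostic, the conversion rules being visible and testable in one place.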

Other sources of concern are the user-defined data types which don’t have equivalents between the systems, needing thus additional transformations for further handling, respectively the invalid values which need to be handled accordingly. Besides the data from the source system(s) and the derived values, depending on the case one needs to consider the parameter-based or hardcoded metadata created in the process.

Independently of the purpose of the ETL packages, it is usually required to document the data flow associated with them and the rules applied in the transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes. Even if it’s almost impossible to document everything, at minimum one needs to provide the source and destination tables, the attributes considered in the mappings, respectively the most important rules the business should be aware of. Otherwise, the technical people can always turn back to the SQL queries, when needed.
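
A minimal excerpt of such a mapping document, with hypothetical source and destination objects, showing the level of detail usually sufficient for the business:

    Source attribute              Destination attribute         Rule
    Sales.Orders.OrderDate        dwh.FactSales.OrderDate       Converted to date (time part truncated)
    Sales.Orders.Amount           dwh.FactSales.Amount          Cast to decimal(19,4); negative amounts denote credits
    Sales.Customers.CountryCode   dwh.DimCustomer.CountryCode   Trimmed and uppercased; unknown codes default to 'XX'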

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not every rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic required to move the data could in theory be handled with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic into several steps). Such considerations can apply also to simple ETL packages. Therefore, it’s important to recognize such scenarios, weigh the choices and choose what fits best. However, unless one knows what one’s doing, it’s recommended to use the methods one knows best.


27 November 2020

🧊Data Warehousing: ETL (Part II: An Introduction)

ETL (Extract, Transform, Load) processes, technologies or tools are about extracting data from one or more data sources via a set of queries, performing changes on the data via conversions, aggregations, mappings or other types of transformations, and loading the data into target tables or other types of repositories. Thus, an ETL process allows moving and transforming data between predefined data structures, on an ad-hoc basis or as part of stable repetitive processes, which makes ETL ideal for data warehousing, data integrations, data migrations or similar scenarios.

ETL Data Flow

Extract: The extraction of data is done typically based on SQL queries from relational databases or any OLEDB- or ODBC-based data repositories, including flat or MS Office files, though modern ETL tools can support other types of queries (CAML, XQuery, DAX) or even NoSQL architectures (Hadoop). This allows addressing a wide range of requirements, the complexity of the logic depending on the functionality provided by the query languages, respectively on the extraction functionality available.

Transform: The transformation logic can be implemented based on the functionality provided by the ETL tool, and can involve, as the case may be, any combination of aggregates, conditional splits, merges, lookups, multicasts, pivoting/unpivoting, cleansing, data conversions, sampling, mapping or any other transformations that can be performed on an in-transit dataset. On the other hand, quite often the same can be achieved with the help of SQL-based manipulations directly in the extraction logic or later in the process. SQL can occasionally prove to be faster and more flexible than the transformations provided by the ETL tool; however, despite the overlaps, the two approaches can complement each other when used adequately.

Load: The load is usually just a dump of the data into one or more final or intermediary tables with predefined structures. Unless the data mismatch the data types, formats or further defined constraints, the load seldom involves further challenges, as long as the solution was designed adequately.
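
When source and target are reachable from the same SQL Server instance (directly or via a linked server), the three subprocesses can be sketched in a single set-based statement; the tables and columns below are hypothetical:

    -- Extract and transform in the SELECT, load via the INSERT into a predefined staging table
    INSERT INTO dwh.StagingSales (OrderId, OrderDate, Amount, CountryCode)
    SELECT ord.OrderId
         , CAST(ord.OrderDate AS date)                    -- data type conversion
         , CAST(ord.Amount AS decimal(19,4))              -- precision alignment
         , UPPER(LTRIM(RTRIM(cst.CountryCode)))           -- cleansing
      FROM Sales.Orders ord
      JOIN Sales.Customers cst                            -- lookup expressed as a join
        ON ord.CustomerId = cst.CustomerId
     WHERE ord.ModifiedDate >= DATEADD(day, -1, GETDATE());  -- incremental extraction window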

Within the logical model, extract, transform and load can be considered as processes by themselves. Within the object model provided by the ETL tool, they are considered in the mentioned sequence within a data flow, which, within a set of workflow constraints, defines how the data move through the pipeline – the sequence of processing steps considered. The basic unit of work is the data flow and the workflow it belongs to, a unit that can be encapsulated in one container for easier management or simply for convenience. Several containers can be linked within a workflow to create more complex behavior.

The data flows and workflow constraints, together with the supporting connections and containers, form an ETL package, the main unit of work for encapsulating and running ETL logic. ETL packages are scheduled and run as fits the purpose.

With the right design, these building blocks allow enough flexibility in handling ad-hoc requests or building complex solutions. This involves decisions on how to partition the ETL packages, respectively the data flows, in which order they should be run, where and in which sequence the data should be transformed, how to handle exceptions, whether to build intermediary data repositories, how to handle audit requirements, and so on. Each of these choices can prove to be important.

Knowledge of the ETL architecture and functionality is quintessential in providing the right solution for the problem considered. However, once the basics are understood, the challenges typically reside in understanding the source and/or target structures and the logical and physical entities available, in identifying the ways the data can be partitioned horizontally or vertically, respectively in determining what types of transformations are required for moving the data, as required by the solution.


04 May 2019

🧊Data Warehousing: Architecture (Part I: Push vs. Pull)


In data integrations, data migrations and data warehousing there is the need to move data between two or more systems. In the simplest scenario there are only two systems involved, a source and a target system, though there can be complex scenarios in which data from multiple sources need to be available in a common target system (as in the case of data warehouses/marts or data migrations), or data from one source (e.g. ERP systems) need to be available in other systems (e.g. Web shops, planning systems), or there can be complex cases in which there is a many-to-many relationship (e.g. data from two ERP systems are consolidated in other systems).  

The data can flow in one direction, from the source systems to the target systems (aka unidirectional flow), though there can be situations in which, once the data are modified in the target system, they need to flow back to the source system (aka bidirectional flow), as in the case of planning or product development systems. In complex scenarios the communication may occur multiple times within the same process until a final state is reached.

Independently of the number of systems and the type of communication involved, data need to flow between the systems as smoothly as possible, ensuring that the data are consistent between the various systems and available when needed. The architectures responsible for moving data between the sources are based on two simple mechanisms - push vs. pull - or combinations of them.

With a push mechanism the data are pushed from the source system into the target system(s), the source system being responsible for the operation. Typically the push can happen as soon as an event occurs in the source system, an event that leads to or follows a change in the data. There can also be cases when it’s preferred to push the data at regular points in time (e.g. hourly, daily), especially when the changes aren’t needed immediately. This latter scenario still allows making changes to the data in the source until they are sent to the other system(s). When the ability to make changes is critical, this can be controlled via specific business rules.

With a pull mechanism the data are pulled from the source system into the target system, the target system being responsible for the operation. This usually happens at regular points in time or on demand; however, the target system has to check whether the data have been changed.
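
A minimal sketch of an incremental pull, assuming the source table exposes a ModifiedDate column and the target keeps a watermark of the last successful run (the table and column names are hypothetical):

    -- Pull only the records changed since the last successful run
    DECLARE @LastRun datetime;
    SELECT @LastRun = MAX(wtm.LastRunDate)
      FROM dwh.EtlWatermarks wtm
     WHERE wtm.SourceName = 'Sales.Orders';

    SELECT ord.OrderId, ord.Amount, ord.ModifiedDate
      FROM Sales.Orders ord
     WHERE ord.ModifiedDate > @LastRun;  -- the target checks what changed

    -- On success, advance the watermark for the next run
    UPDATE dwh.EtlWatermarks
       SET LastRunDate = GETDATE()
     WHERE SourceName = 'Sales.Orders';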

Hybrid scenarios may involve a middleware that sits between the systems, being responsible for pulling the data from the source systems and pushing them into the target systems. Another hybrid scenario is when the source system pushes the data to an intermediary repository, the target system(s) pulling the data on a need basis. The repository can reside on the source, on the target, or in-between. A variation of it is when the source informs the target that a change happened, and it’s up to the target to decide whether it needs the data or not.

The main differentiators between the various methods are the timeliness, completeness and consistency of the data. Timeliness refers to the urgency with which data need to be available in the target system(s), completeness refers to the degree to which the data are ready to be sent, while consistency refers to the degree to which the data from the source are consistent with the data from the target systems.

Based on their characteristics, integrations seem to favor push methods, while data migrations and data warehousing favor the pull methods, though which method suits best depends entirely on the business needs under consideration.

29 April 2019

🗄️Data Management: Data Integration (Part I: From Disintegration to Integration)


No matter how tight the integration between the various systems or processes, there will always be gaps that need to be addressed in one way or another. The problems are in general caused by design errors rooted in the complexity of the logic from the integration layer or from the systems integrated. The errors can range from missing or incorrect validation rules, mappings and parameters to data quality issues.

A unidirectional integration involves distributing data from one system (aka publisher) to one or more systems (aka subscribers), while in bidirectional integrations systems can act as both publishers and subscribers, thus resulting in complex data flows with multiple endpoints. In the simplest integrations the records flow one-to-one between systems, though more complex scenarios can involve logic based on business rules, mappings and other types of transformations. The challenge is to reflect the states as needed by the systems, with minimal involvement from the users.

Typically, it falls within the responsibility of application/process owners or key users to make sure that the integration works smoothly. When the integration makes use of interface or staging tables, these can be used as a starting point for the troubleshooting; however, even then the troubleshooting can be tedious and involve a considerable manual effort. When possible, the data can be exported manually from the various systems and matched in Excel or similar solutions. This often leads to personal or departmental solutions that are hard to maintain, control and support.

A better approach is to automate the process by importing the data from the integrated systems at regular points in time into the same database (much like in a data warehouse), modeling the entities and the needed logic in there, and reporting the differences. Even if this approach involves a small investment in the beginning and some optimization in logic or performance over time, it can become a useful tool for troubleshooting the differences. Such solutions can be used successfully in multiple integration scenarios (e.g. web shop or ERP integrations).
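
A hedged sketch of such a difference report, assuming the records from both systems were already imported into staging tables sharing a business key (all object names are hypothetical):

    -- Report records that differ or exist on only one side
    SELECT COALESCE(erp.OrderNumber, shp.OrderNumber) AS OrderNumber
         , erp.Amount AS ErpAmount
         , shp.Amount AS ShopAmount
      FROM stg.ErpOrders erp
      FULL OUTER JOIN stg.ShopOrders shp
        ON erp.OrderNumber = shp.OrderNumber
     WHERE erp.OrderNumber IS NULL       -- record missing in the ERP system
        OR shp.OrderNumber IS NULL       -- record missing in the web shop
        OR erp.Amount <> shp.Amount;     -- amounts out of sync

The FULL OUTER JOIN makes both one-sided and two-sided differences visible in a single pass.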

A set of reports for each entity can help identify the differences between the various systems. Starting from the reported differences, the users can identify, categorize and devise specific countermeasures for the various issues. The best time to have such a solution is shortly before or during UAT. This allows making sure that the integration layer really works, and helps correct the issues while they still have a small impact on the systems. Some integration issues might even lead to a postponement of the Go-Live. The second best time is when the first important issues are found, as the issues can be used as support for a Business Case for implementing this type of solution.

In general, it’s recommended to fix the problems in the integration layer and use the reports only for troubleshooting and for assuring that the integration runs smoothly. There are however situations in which the integration problems can’t be fixed without creating more issues. It’s the case in which multiple systems are involved and integrated over an integration bus.

One extreme approach, not advisable though, is to build a second integration to correct the issues of the first. This solution might work in theory; however, the risk of multiplying the issues is really high, and the complexity of troubleshooting increases with the degree of dependency between the two integrations. It would be more advisable to rebuild the integration anew, though this approach too has its advantages and disadvantages.

Bottom line is that integration issues should be addressed while they are small, and that an automated solution for comparing the data can help in the process.

16 January 2017

⛏️Data Management: Data Flow (Definitions)

"The sequence in which data transfer, use, and transformation are performed during the execution of a computer program."  (IEEE," IEEE Standard Glossary of Software Engineering Terminology", 1990)

"A component of a SQL Server Integration Services package that controls the flow of data within the package." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"Activities of a business process may exchange data during the execution of the process. The data flow graph of the process connects activities that exchange data and - in some notations - may also represent which input/output parameters of the activities are involved." (Cesare Pautasso, "Compiling Business Process Models into Executable Code", 2009)

"Data dependency and data movement between process steps to ensure that required data is available to a process step at execution time." (Christoph Bussler, "B2B and EAI with Business Process Management", 2009)

[logical data flow:] "A data flow diagram that describes the flow of information in an enterprise without regard to any mechanisms that might be required to support that flow." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

[physical data flow:] "A data flow diagram that identifies and represents data flows and processes in terms of the mechanisms currently used to carry them out." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"The fact that data, in the form of a virtual entity class, can be sent from a party, position, external entity, or system process to a party, position, external entity, or system process." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"An abstract representation of the sequence and possible changes of the state of data objects, where the state of an object is any of: creation, usage, or destruction [Beizer]." (International Qualifications Board for Business Analysis, "Standard glossary of terms used in Software Engineering", 2011)

"Data flow refers to the movement of data from one purpose to another; also the movement of data through a set of systems, or through a set of transformations within one system; it is a nontechnical description of how data is processed. See also Data Chain." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The movement of data through a group of connected elements that extract, transform, and load data." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A path that carries packets of information of known composition; a roadway for data. Every data flow’s composition is recorded in the data dictionary." (James Robertson et al, "Complete Systems Analysis: The Workbook, the Textbook, the Answers", 2013)

"the path, in information systems or otherwise, through which data move during the active phase of a study." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"The lifecycle movement and storage of data assets along business process networks, including creation and collection from external sources, movement within and between internal business units, and departure through disposal, archiving, or as products or other outputs." (Kevin J Sweeney, "Re-Imagining Data Governance", 2018)

"A graphical model that defines activities that extract data from flat files or relational tables, transform the data, and load it into a data warehouse, data mart, or staging table." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"An abstract representation of the sequence and possible changes of the state of data objects, where the state of an object is any of: creation, usage, or destruction." (Software Quality Assurance)

13 February 2014

🕸Systems Engineering: System Dynamics (Definitions)

"A field of study that includes a methodology for constructing computer simulation models to achieve better under-standing of social and corporate systems. It draws on organizational studies, behavioral decision theory, and engineering to provide a theoretical and empirical base for structuring the relationships in complex systems." (Virginia Anderson & Lauren Johnson, "Systems Thinking Basics: From Concepts to Casual Loops", 1997) 

"A methodology for studying and managing complex feedback systems, such as one finds in business and other social systems." (Lars O Petersen, "Balancing the Capacity in Health Care", 2008)

"System dynamics is a top-down approach for modelling system changes over time. Key state variables that define the behaviour of the system have to be identified and these are then related to each other through coupled, differential equations." (Peer-Olaf Siebers & Uwe Aickelin, "Introduction to Multi-Agent Simulation", 2008) 

"A continuous simulation of systems exhibiting feedback loops. The feedbacks can either intensify activities of the system (positive feedback) or slow them down and stabilize the system (negative feedback)." (Nikola Vlahovic & Vlatko Ceric, "Multi-Agent Simulation in Organizations: An Overview", 2009)

"Is a scientific tool which embodies principles from biology, ecology, psychology, mathematics, and computer science to model complex and dynamic systems." (Kambiz E Maani, "Systems Thinking and the Internet from Independence to Interdependence", 2009)

"System dynamics is an approach to understanding the behaviour of over time. It deals with internal feedback loops and time delays that affect the behaviour of the entire system. It also helps the decision maker untangle the complexity of the connections between various policy variables by providing a new language and set of tools to describe. Then it does this by modeling the cause and effect relationships among these variables." (Raed M Al-Qirem & Saad G Yaseen, "Modelling a Small Firm in Jordan Using System Dynamics", 2010)

[system dynamics simulation:] "A dynamic form of visualization that combines causal loop diagrams and stock and flow diagrams to create a simulation of the workings of a system from one point in time to another." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An approach for capturing the complex inter- and intra- dependencies that characterize systems, including feedback over time." (Howard Passell, "Collaborative, Stakeholder-Driven Resource Modeling and Management", 2011)

"This studies the non-linear interaction of systems of many connected equations. The approach is based on differential equations. It describes the dynamical properties of a whole system using internal negative and positive feedback loops as well as the use of stocks and flows." (Martin Neumann, "An Epistemological Gap in Simulation Technologies and the Science of Society", 2011)

"A simulation-modelling approach to understand the structure and behaviour of complex dynamic systems over time." (Jaime A Palma-Mendoza, "Hybrid SD/DES Simulation for Supply Chain Analysis", 2014)

"A systems simulation methodology to study complex dynamic behavior of industrial and social systems based on control engineering and cybernetics." (Michael Mutingi & Charles Mbohwa, 2014)

[system dynamics:] "The interactions of connected and interdependent components, which may cause change over time and give rise to interconnected risks; emerging, unforeseeable issues; and unclear, disproportional cause-and-effect relationships." (Project Management Institute, "Navigating Complexity: A Practice Guide", 2014)

"A continuous simulation of systems exhibiting feedback loops. The feedbacks can either intensify activities of the system (positive feedback) or slow them down and stabilize the system (negative feedback)." (Nikola Vlahovic & Vlatko Ceric, "An Overview of Multi-Agent Simulation in Organizations", 2015)

"System Dynamics is a dynamic modelling approach at system level which is primarily used to understand interconnected systems and their evolution over time. Basic elements to represent the systems are internal feedback loops as well as stocks and flows." (Catalina Spataru et al, "Multi-Scale, Multi-Dimensional Modelling of Future Energy Systems", 2015)

"System dynamics [...] uses models and computer simulations to understand behavior of an entire system, and has been applied to the behavior of large and complex national issues. It portrays the relationships in systems as feedback loops, lags, and other descriptors to explain dynamics, that is, how a system behaves over time. Its quantitative methodology relies on what are called 'stock-and-flow diagrams' that reflect how levels of specific elements accumulate over time and the rate at which they change. Qualitative systems thinking constructs evolved from this quantitative discipline." (Karen L Higgins, "Economic Growth and Sustainability: Systems Thinking for a Complex World", 2015)

"A simulation technique based on the solution of differential equations, in which the status variables of a system vary with continuity." (Lorenzo Damiani et al, "Different Approaches for Studying Interruptible Industrial Processes: Application of Two Different Simulation Techniques", 2016)

"A technique que allow to obtain models to explore possible futures or scenarios and ask 'what if' questions in complex situations." (Ruth R Gallegos, "Using Modeling and Simulation to Learn Mathematics", Handbook of Research on Driving STEM Learning With Educational Technologies, 2017)

"A method through which the dynamic behaviour of a complex system over time can be better understood by taking into account internal feedback and time delays." (Henry Xu & Renae Agrey, "Major Techniques and Current Developments of Supply Chain Process Modelling", 2018)

"Computer-aided methodology able to represent the causal structure of a system through stock-and-flow feedback structures and computer simulations regarding the accumulation of materials, information, people, and money." (Francesca Costanza, "Governing Patients' Mobility to Pursue Public Value: A System Dynamic Approach to Improve Healthcare Performance Management", 2018)

"The basis of system dynamics is to understand how system structures cause system behavior and system events." (Arzu E Şenaras, "A Suggestion for Energy Policy Planning System Dynamics", 2018)

15 February 2013

🔦Process Management: Process model (Definitions)

"A formal, detailed description of a process that covers policies, activities, work products, roles, and responsibilities. Typically contains standards and procedures and identifies methods and tools as well. Contrast with process architecture." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"A formal description of a business process. The definition is performed via a process definition language (PDL), which in most cases is WfMS-dependent." (C Combi & G Pozzi, "Workflow Management Systems for Healthcare Processes", 2008)

"Any description of a process (not necessarily formal), that shows a series of steps aimed at accomplishing some goal." (Harry S Delugach, "Formal Analysis of Workflows in Software Development", 2009)

"A means of representing the interrelated processes of a system at any level of detail with a graphic network of symbols, showing data flows, data stores, data processes, and data sources/destinations. Process modeling techniques are used to represent processes graphically for clearer understanding, communication, and refinement." (Anthony D Giordano, "Data Integration Blueprint and Modeling", 2010)

"Processes models (PM) are processes of the same nature that are classified together into a model. It involves the description and/or prescription of processes by the instantiation of levels to define process procedures and fuzzes." (Oluwole A Olatunji & William D Sher, "The Applications of Building Information Modelling in Facilities Management", 2010)

"(1) A framework wherein processes of the same nature are classified into an overall model, e.g. a test improvement model. (2) A method-independent process description of development processes." (IQBBA, "Standard glossary of terms used in Software Engineering", 2011)

"A model of the functions, activities, and procedures performed in any organization. A business process model may consist of: 1.A context diagram showing the relationship of the overall process to those outside the model’s scope, along with the inputs to and outputs from the overall process, 2.One or more functional decomposition diagram showing how the overall process is made up of contributing processes at lower levels (a “vertical view”), 3.One or more process flow diagrams showing how the outputs of one process serve as the inputs to other process (a “horizontal view”). The process flow may be cross-functional or within a single function, 4.One or more business process model diagrams, each depicting the inputs, outputs, start and end events, component activities, roles, and metrics of a single process, 5.The business definition of each process, and 6.The value chain analysis of the process, identifying relationships to data, organizations, roles, and systems." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A detailed workflow diagram that expands upon a process map by including detailed descriptions of subprocesses, activities, and tasks including all input, output, decisions, and exceptions, as well as measurements of the resources consumed (such as time, FTEs, material, capital, systems, etc.) during the execution of the process. Supports analysis via drill-down examination and can provide the metrics necessary for use by software capable of process simulation and what-if scenario testing of alternative variables." (Carl F Lehmann, "Strategy and Business Process Management", 2012)

[Process Modeling and Analysis:] "The tools and techniques used to (1) map a workflow diagram illustrating the activities and tasks associated with a business process; (2) add complete detail necessary to identify and measure all the resources consumed during the execution of the processes; (3) measure performance outcomes; (4) simulate changes to activities, tasks, sequences, resources, assumptions, and so on using what-if scenarios to test and recalculate performance outcomes; (5) conclude the best combination of adjustments or changes necessary to optimize performance outcome of the process." (Carl F Lehmann, "Strategy and Business Process Management", 2012)

"A model showing the processes carried out by a system and the data interfaces between those processes; same as a data flow model." (James Robertson et al, "Complete Systems Analysis: The Workbook, the Textbook, the Answers", 2013)

14 August 2011

📈Graphical Representation: Data Flow Diagram (Definitions)

"A diagram that shows the data flows in an organization, including sources of data, where data are stored, and processes that transform data." (Jan L Harrington, "Relational Database Dessign: Clearly Explained" 2nd Ed., 2002)

"A diagram of the data flow from sources through processes and files to users. A source or user is represented by a square; a data file is represented by rectangles with missing righthand edges; a process is represented by a circle or rounded rectangle; and a data flow is represented by an arrow." (Jens Mende, "Data Flow Diagram Use to Plan Empirical Research Projects", 2009)

"A diagram used in functional analysis which specifies the functions of the system, the inputs/outputs from/to external (user) entities, and the data being retrieved from or updating data stores. There are well-defined rules for specifying correct DFDs, as well as for creating hierarchies of interrelated DFDs." (Peretz Shoval & Judith Kabeli, "Functional and Object-Oriented Methodology for Analysis and Design", 2009)

[Control Data Flow Graph (CDFG):] "Represents the control flow and the data dependencies in a program." (Alexander Dreweke et al, "Text Mining in Program Code", 2009)

"A graphic method for documenting the flow of data within an organization." (Jan L Harrington, "Relational Database Design and Implementation: Clearly explained" 3rd Ed., 2009)

"A graphic representation of the interactions between different processes in an organization in terms of data flow communications among them. This may be a physical data flow diagram that describes processes and flows in terms of the mechanisms involved, a logical data flow diagram that is without any representation of the mechansm, or an essential data flow diagram that is a logical data flow diagram organized in terms of the processes that respond to each external event." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Data-flow diagrams (DFDs) are system models that show a functional perspective where each transformation represents a single function or process. DFDs are used to show how data flows through a sequence of processing steps." (Ian Sommerville, "Software Engineering" 9th Ed., 2011)

"A model of the system that shows the system’s processes, the data that flow between them (hence the name), and the data stores used by the processes. The data flow diagram shows the system as a network of processes, and is thought to be the most easily recognized of all the analysis models." (James Robertson et al, "Complete Systems Analysis: The Workbook, the Textbook, the Answers", 2013)

"A picture of the movement of data between external entities and the processes and data stores within a system." (Jeffrey A Hoffer et al, "Modern Systems Analysis and Design" 7th Ed., 2014)

"A schematic indicating the direction of the movement of data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A Data Flow Diagram (DFD) is a graphical representation of the 'flow' of data through an information system, modeling its process aspects. Often it is a preliminary step used to create an overview of the system that can later be elaborated." (Henrikvon Scheel et al, "Process Concept Evolution", 2015)

"Data flow maps are tools that graphically represent the results of a comprehensive data assessment to illustrate what information comes into an organization, for what purposes that information is used, and who has access to that information." (James R Kalyvas & Michael R Overly, "Big Data: A Business and Legal Guide", 2015)

"A graphical representation of the logical or conceptual movement of data within an existing or planned system." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"a visual depiction using standard symbols and conventions of the sources of, movement of, operations on, and storage of data." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"A data-flow diagram is a way of representing a flow of data through a process or a system (usually an information system). The DFD also provides information about the outputs and inputs of each entity and the process itself." (Wikipedia) [source]

"A graphical representation of the sequence and possible changes of the state of data objects, where the state of an object is any of: creation, usage, or destruction." (IQBBA)

15 February 2010

🕋Data Warehousing: Control Flow (Definitions)

"A component of a SQL Server Integration Services package that controls the flow of tasks within the package. The high-level workflow of a package. Often contains one or more data flow tasks." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Concept to define causal dependency between process steps to enforce a specific execution order." (Christoph Bussler, "B2B and EAI with Business Process Management", 2009)

"The flow of control defines a partial order relationship between the activities of a business process model, specifying in which temporal order they will be executed." (Cesare Pautasso, "Compiling Business Process Models into Executable Code", 2009)

"A group of connected control flow elements that perform tasks." (Microsoft, "SQL Server 2012 Glossary", 2012)

"An abstract representation of all possible sequences of events (paths) during execution of a component or system. Often represented in graphical form, see control flow graph." (Tilo Linz et al, "Software Testing Foundations" 4th Ed., 2014)

"The control-flow perspective focuses on the control-flow, i.e., the ordering of activities. The goal of mining this perspective is to find a good characterization of all possible paths. Other popular perspectives are the organizational perspective and the case perspective." (Pavlos Delias et al, "Applying Process Mining to the Emergency Department", Encyclopedia of Business Analytics and Optimization, 2014)

"In DB2 data warehousing, a graphical model that sequences data flows and mining flows, integrates external commands, programs, and stored procedures, and provides conditional processing logic for a data warehouse application." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"A sequence of events (paths) in the execution through a component or system." (ISTQB)

"The ordered workflow in an Integration Services package that performs tasks." (Microsoft Technet)

23 January 2010

🧊🎡Data Warehousing: ETL (SSIS packages vs. SQL code)


I’ve been working with DTS (Data Transformation Services) packages for the past 7-8 years, finding SQL Server 2000’s functionality pretty useful in handling ETL (Extract, Transform, Load) tasks, especially when importing and exporting data between multiple data sources. The functionality provided by the DTS platform was basic, though it could be extended in custom packages with code developed in VB, or by using ActiveX tasks in VBScript/JScript, when the SQL-based logic was not enough. In addition, all three types of coding could make calls to .dlls, and thus operating system, vendor or in-house built libraries or simple functions could be (re)used.

Starting with SQL Server 2005, DTS was replaced by the SSIS (SQL Server Integration Services) component, becoming, together with SSRS (SQL Server Reporting Services) and SSAS (SQL Server Analysis Services), an integral part of Microsoft’s data platform. Besides providing a new architecture, SSIS extended the functionality previously available, bringing more flexibility in constructing the packages, their elements and data manipulation. With each new version of SQL Server (2008, 2012, 2014, 2016, 2017, 2019), SSIS became a more powerful ETL tool that can be easily used for Data Warehousing, Data Migration or Data Integration projects.

As far as ETL is concerned, one could say there are two main philosophies - have as much of the business logic as possible in the (ETL) package, or use the package mainly for loading data from the various sources and keep the business logic in the database as SQL-based code. As always, each of the two philosophies has its own advantages and disadvantages, though I would consider a third philosophy - design for performance, reuse and maintainability - resulting thus in a hybrid between the first two.

There are several other factors that need to be considered when building ETL solutions - synchronization, testability, security, stability, scalability, complexity, learning curve, etc. I would say that there is no perfect recipe, no architecture matching all requirements, each solution coming with its own needs and constraints; sometimes it’s a good idea to go one level of abstraction above the requirements, while at other times it’s better to stick to the requirements and the problem at hand.

Performance

Designing for performance comes down to choosing the architecture/methods that provide the best performance individually and as a whole - either using package-based functionality, SQL-based functionality or a combination of both. In general, SQL code is best suited for query-based data manipulation, while packages are better suited for sequential or workflow-based processing of data, though there can be exceptions. Often it’s a good idea to test the performance of the alternative approaches via a prototype, even if in time the developer arrives at a good knowledge of the methods that suit best from a performance point of view.

Reuse

Each of the two architectures allows a lower or higher degree of reuse via parameters, variables and the compartmentalization of code, maybe with a plus for SQL code, which has in theory a greater maturity and flexibility than the package-based functionality, allowing a wider range of reuse resulting from the compartmentalization of code in the various supported objects (stored procedures, functions, views or tables).
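
A small sketch of SQL-side reuse, assuming an inline table-valued function that encapsulates a common piece of logic so that several extraction queries or packages can share it (all names are hypothetical):

    -- Reusable, parameterized extraction logic
    CREATE FUNCTION dbo.GetActiveCustomers (@CountryCode nchar(2))
    RETURNS TABLE
    AS
    RETURN (
        SELECT cst.CustomerId
             , cst.CustomerNumber
             , UPPER(LTRIM(RTRIM(cst.CountryCode))) AS CountryCode
          FROM Sales.Customers cst
         WHERE cst.IsActive = 1
           AND (@CountryCode IS NULL OR cst.CountryCode = @CountryCode)  -- optional filter
    );

    -- Called from several extraction queries or packages:
    -- SELECT * FROM dbo.GetActiveCustomers(N'DE');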

Maintainability

Maintainability, the ease of modifying packages or code, is an important factor because few are the cases in which the logic doesn’t change over time, many projects having to deal with changes in requirements, sometimes implying a 180-degree change of the overall perspective. Therefore it’s important to be able to modify the code/package and even change the architecture with a minimum of effort.

Refactoring

Refactoring comes down to modifying the code without changing its functional behavior, in order to improve performance, readability, maintainability or extensibility, minimize the use of resources, remove redundant code, or adhere to standards, best practices and naming conventions. It is said that there is always room for improvement, performance being the dimension with the most important impact in the world of databases. Refactoring is not always necessary, though it’s a good practice for achieving high quality solutions.

Synchronization

I would say there are three types of synchronization - of scope, business logic and data - the first targeting to synchronize the filters used to select the data, the second to synchronize the logic used in data processing, when it’s the case, and the third to work with the same unaltered copy of the data. Considering the example of Invoices - Headers and Lines - the synchronization of scope would come down to applying the same constraints for the two, assuring that there is no Invoice Header without Lines, and vice-versa; the data synchronization between the two refers to the fact that the data between the two datasets should be consistent, with no change in Invoice Headers not reflected also at Line level (e.g. total amount matching between Lines and Header). Business logic synchronization refers typically to the use of the same set of data for similar purposes: if several transformations were used for Invoice Headers, they should be reflected accordingly also at Invoice Line level. Synchronization is actually quite an important topic; therefore I will revisit it in a further post.
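
A minimal sketch of a data synchronization check for the Invoice example, comparing the header totals against the summed lines (the table and column names are hypothetical):

    -- Headers without lines, respectively headers whose total doesn't match the sum of their lines
    SELECT hdr.InvoiceId
         , hdr.TotalAmount
         , SUM(lns.Amount) AS LinesAmount
      FROM dbo.InvoiceHeaders hdr
      LEFT JOIN dbo.InvoiceLines lns
        ON hdr.InvoiceId = lns.InvoiceId
     GROUP BY hdr.InvoiceId, hdr.TotalAmount
    HAVING SUM(lns.Amount) IS NULL               -- Invoice Header without Lines
        OR hdr.TotalAmount <> SUM(lns.Amount);   -- amounts out of sync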

Testing & Troubleshooting

I find that the business logic implemented in SQL code is much easier to test than the logic implemented in packages, because in the first situation each object and step can in theory be tested individually and progressively, being thus much easier to troubleshoot. Troubleshooting package logic can be quite complex because it’s not always possible to view the input/output of each intermediary step.

Complexity

Complexity is reflected in the ease of understanding the logic, broken down to pieces and as a whole. Packages are highly visual, making it in theory easier to identify and understand the steps and their flow, even by non-technical people, while SQL code might need auxiliary representations for the same purpose (e.g. a data flow diagram) and a higher level of expertise.

Security

Security is always a sensitive and complex topic, and in general it comes down to how securely the code and sensitive information (e.g. user names & passwords, data) are stored, who has access to execute it, and the context in which the code is run. This can easily become quite a complex topic, being highly dependent on the architecture used.

Stability

We can discuss platform and design stability, which can often be a matter of perception and experience. Both SSIS and the database engines can be considered stable development environments, the latter having in theory a greater maturity and flexibility - flexibility that can easily be taken to the extreme, creating bad coding monsters (e.g. looped calls to .dll libraries), thus impacting the design’s stability, which is correlated with the adequate use of functionality, techniques and resources - each technology has its dos and don’ts, strengths and weaknesses.

Deployment
 
Deployment of business logic on the server can be quite easy or quite complex, depending on the overall architecture, the number of configurable items in scope, and the complexity of the dependencies existing between them. The deployment usually comes down to copying the code from one location to another, installing the necessary dependencies and configuring the objects for use.

Scalability

Scalability in this context refers mainly to the degree to which the business logic can cope with an increased volume of records, and not necessarily with the number of requests, though this aspect can be considered too, as the case may be. SSIS and the database engines are designed to be highly scalable, though there are good and bad architectures, good uses and misuses of techniques. Designing for performance in theory equates with good scalability, unless the requirements make it difficult to have a scalable solution.

Learning curve

The learning curve of technologies is always an important factor that needs to be considered in development, as it reflects how much time an average developer needs to master the basic/average/complex functionality provided by the respective technology. ETL development generally requires an average knowledge of both the SSIS architecture, respectively SQL-based programming, though it’s not easy to acquire both problem-solving mindsets. A SSIS developer usually attempts to use as much as possible the functionality provided by SSIS, while a SQL developer the SQL-based functionality. In the end, it’s important to know how to balance between the two.

31 December 2007

🏗️Software Engineering: Architecture (Just the Quotes)

"The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation." (Gene Amdahl et al, "Architecture of the IBM System", IBM Journal of Research and Development. Vol 8 (2), 1964)

"In computer design three levels can be distinguished: architecture, implementation and realisation; for the first of them, the following working definition is given: The architecture of a system can be defined as the functional appearance of the system to the user, its phenomenology. […] The inner structure of a system is not considered by the architecture: we do not need to know what makes the clock tick, to know what time it is. This inner structure, considered from a logical point of view, will be called the implementation, and its physical embodiment the realisation." (Gerrit A Blaauw, "Computer Architecture", 1972)

"There always is an architecture, whether it is defined in advance - as with modern computers - or found out after the fact - as with many older computers. For architecture is determined by behavior, not by words. Therefore, the term architecture, which rightly implies the notion of the arch, or prime structure, should not be understood as the vague overall idea. Rather, the product of the computer architecture, the principle of operations manual, should contain all detail which the user can know, and sooner or later is bound to know." (Gerrit A Blaauw, "Computer Architecture", 1972)

"The design of a digital system starts with the specification of the architecture of the system and continues with its implementation and its subsequent realisation... the purpose of architecture is to provide a function. Once that function is established, the purpose of implementation is to give a proper cost-performance and the purpose of realisation is to build and maintain the appropriate logical organisation." (Gerrit A Blaauw, "Specification of Digital Systems", Proc. Seminar in Digital Systems Design, 1978)

"With increasing size and complexity of the implementations of information systems, it is necessary to use some logical construct (or architecture) for defining and controlling the interfaces and the integration of all of the components of the system." (John Zachman, "A Framework for Information Systems Architecture", 1987)

"Every software system needs to have a simple yet powerful organizational philosophy (think of it as the software equivalent of a sound bite that describes the system's architecture). [A] step in [the] development process is to articulate this architectural framework, so that we might have a stable foundation upon which to evolve the system's function points." (Grady Booch, "Object-Oriented Design: with Applications", 1991)

"As the size of software systems increases, the algorithms and data structures of the computation no longer constitute the major design problems. When systems are constructed from many components, the organization of the overall system - the software architecture - presents a new set of design problems. This level of design has been addressed in a number of ways including informal diagrams and descriptive terms, module interconnection languages, templates and frameworks for systems that serve the needs of specific domains, and formal models of component integration mechanisms." (David Garlan & Mary Shaw, "An introduction to software architecture", Advances in software engineering and knowledge engineering Vol 1, 1993)

"Software architecture involves the description of elements from which systems are built, interactions among those elements, patterns that guide their composition, and constraints on these patterns. In general, a particular system is defined in terms of a collection of components and interactions among those components. Such a system may in turn be used as a (composite) element in a larger system design." (Mary Shaw & David Garlan,"Characteristics of Higher-Level Languages for Software Architecture", 1994)

"If a project has not achieved a system architecture, including its rationale, the project should not proceed to full-scale system development. Specifying the architecture as a deliverable enables its use throughout the development and maintenance process." (Barry Boehm, 1995)

"Our experience with designing and analyzing large and complex software-intensive systems has led us to recognize the role of business and organization in the design of the system and in its ultimate success or failure. Systems are built to satisfy an organization's requirements (or assumed requirements in the case of shrink-wrapped products). These requirements dictate the system's performance, availability, security, compatibility with other systems, and the ability to accommodate change over its lifetime. The desire to satisfy these goals with software that has the requisite properties influences the design choices made by a software architect." (Len Bass et al, "Software Architecture in Practice", 1998)

"Generically, an architecture is the description of the set of components and the relationships between them. […] A software architecture describes the layout of the software modules and the connections and relationships among them. A hardware architecture can describe how the hardware components are organized. However, both these definitions can apply to a single computer, a single information system, or a family of information systems. Thus 'architecture' can have a range of meanings, goals, and abstraction levels, depending on who’s speaking." (Frank J Armour et al, "A big-picture look at enterprise architectures", IT professional Vol 1 (1), 1999)

"An architecture framework is a tool which can be used for developing a broad range of different architectures [architecture descriptions]. It should describe a method for designing an information system in terms of a set of building blocks, and for showing how the building blocks fit together. It should contain a set of tools and provide a common vocabulary. It should also include a list of recommended standards and compliant products that can be used to implement the building blocks." (TOGAF, 2002)

"The aim of architectural design is to prepare overall specifications, derived from the needs and desires of the user, for subsequent design and construction stages. The first task for the architect in each design project is thus to determine what the real needs and desires of the user are […]" (George J Klir & Doug Elias, "Architecture of Systems Problem Solving" 2nd Ed, 2003)

"The software architecture of a system or a family of systems has one of the most significant impacts on the quality of an organization's enterprise architecture. While the design of software systems concentrates on satisfying the functional requirements for a system, the design of the software architecture for systems concentrates on the nonfunctional or quality requirements for systems. These quality requirements are concerns at the enterprise level. The better an organization specifies and characterizes the software architecture for its systems, the better it can characterize and manage its enterprise architecture. By explicitly defining the systems software architectures, an organization will be better able to reflect the priorities and trade-offs that are important to the organization in the software that it builds." (James McGovern et al, "A Practical Guide to Enterprise Architecture", 2004)

"The traditional view on software architecture suffers from a number of key problems that cannot be solved without changing our perspective on the notion of software architecture. These problems include the lack of first-class representation of design decisions, the fact that these design decisions are cross-cutting and intertwined, that these problems lead to high maintenance cost, because of which design rules and constraints are easily violated and obsolete design decisions are not removed." (Jan Bosch, "Software architecture: The next step", 2004)

"As a noun, design is the named (although sometimes unnamable) structure or behavior of a system whose presence resolves or contributes to the resolution of a force or forces on that system. A design thus represents one point in a potential decision space. A design may be singular (representing a leaf decision) or it may be collective (representing a set of other decisions). As a verb, design is the activity of making such decisions. Given a large set of forces, a relatively malleable set of materials, and a large landscape upon which to play, the resulting decision space may be large and complex. As such, there is a science associated with design (empirical analysis can point us to optimal regions or exact points in this design space) as well as an art (within the degrees of freedom that range beyond an empirical decision; there are opportunities for elegance, beauty, simplicity, novelty, and cleverness). All architecture is design but not all design is architecture. Architecture represents the significant design decisions that shape a system, where significant is measured by cost of change." (Grady Booch, "On design", 2006)

"The goal for our software architecture is to provide the key mechanisms that are required to implement a wide variety of cross-layer adaptations described by our taxonomy. Our strategy for developing such an architecture is actually to create two architectures, a 'conceptual' one, followed by a 'concrete' one." (Soon H Choi, "A Software Architecture for Cross-layer Wireless Networks", 2008)

"A good system design is based on a sound conceptual model (architecture). A system design that has no conceptual structure and little logic to its organization is ultimately going to be unsuccessful. Good architecture will address all the requirements of the system at the right level of abstraction." (Vasudeva Varma, "Software Architecture: A Case Based Approach", 2009)

"Architecting is both an art and a science - both synthesis and analysis, induction and deduction, and conceptualization and certification - using guidelines from its art and methods from its science. As a process, it is distinguished from systems engineering in its greater use of heuristic reasoning, lesser use of analytics, closer ties to the client, and particular concern with certification of readiness for use."  (Mark W Maier, "The Art Systems of Architecting" 3rd Ed., 2009)

"Architecting is creating and building structures - that is, 'structuring'. Systems architecting is creating and building systems. It strives for fit, balance, and compromise among the tensions of client needs and resources, technology, and multiple stakeholder interests."  (Mark W Maier, "The Art Systems of Architecting" 3rd Ed., 2009)

"Taking a systems approach means paying close attention to results, the reasons we build a system. Architecture must be grounded in the client’s/user’s/customer’s purpose. Architecture is not just about the structure of components. One of the essential distinguishing features of architectural design versus other sorts of engineering design is the degree to which architectural design embraces results from the perspective of the client/user/customer. The architect does not assume some particular problem formulation, as “requirements” is fixed. The architect engages in joint exploration, ideally directly with the client/user/customer, of what system attributes will yield results worth paying for."  (Mark W Maier, "The Art Systems of Architecting" 3rd Ed., 2009)

"An architecture is the response to the integrated collections of models and views within the problem area being examined." (Charles D Tupper, "Data Architecture: From Zen to Reality", 2011)

"An architecture represents combined perspectives in a structured format that is easily viewable and explains the context of the area being analyzed to all those viewing it." (Charles D Tupper, "Data Architecture: From Zen to Reality", 2011)

"Using architecture leads to foundational stability, not rigidity. As long as the appropriate characteristics are in place to ensure positive architectural evolution, the architecture will remain a living construct. Well-developed architectures are frameworks that evolve as the business evolves." (Charles D Tupper, "Data Architecture: From Zen to Reality", 2011)

"A software architecture encompasses the significant decisions about the organization of the software system, the selection of structural elements and interfaces by which the system is composed, and determines their behavior through collaboration among these elements and their composition into progressively larger subsystems. Hence, the software architecture provides the skeleton of a system around which all other aspects of a system revolve." (Muhammad A Babar et al, "Agile Software Architecture Aligning Agile Processes and Software Architectures", 2014)

"Good architecture is all about splitting stuff reliably into self-contained parcels that allow work on them to continue relatively independently in parallel (often these days in different locations)." (Richard Hopkins & Stephen Harcombe, "Agile Architecting: Enabling the Delivery of Complex Agile Systems Development Projects", 2014)

"Good architecture provides good interfaces that separate the shear layers of its implementation: a necessity for evolution and maintenance. Class-oriented programming puts both data evolution and method evolution in the same shear layer: the class. Data tend to remain fairly stable over time, while methods change regularly to support new services and system operations. The tension in these rates of change stresses the design." (James O Coplien & Trygve Reenskaug, "The DCI Paradigm: Taking Object Orientation into the Architecture World", 2014)

"In more ways than one, architecture is all about avoiding bottlenecks. In architecture, the term bottleneck typically refers to a design problem that is preventing processing from occurring at full speed. [...] A good architecture will avoid bottlenecks in both." (Richard Hopkins & Stephen Harcombe, "Agile Architecting: Enabling the Delivery of Complex Agile Systems Development Projects", 2014)

"There is a tendency to believe that good architecture leads to systems that perform better and are more secure, but such claims relate less to any given architectural principle than to the timing of big-picture deliberations in the design cycle and to the proper engagement of suitable stakeholders." (James O Coplien & Trygve Reenskaug, "The DCI Paradigm: Taking Object Orientation into the Architecture World", 2014)

"Any software project must have a technical leader, who is responsible for all technical decisions made by the team and have enough authority to make them. Responsibility and authority are two mandatory components that must be present in order to make it possible to call such a person an architect." (Yegor Bugayenko, "Code Ahead", 2018)

"Just by making the architect role explicit, a team can effectively resolve many technical conflicts." (Yegor Bugayenko, "Code Ahead", 2018)

"To make technical decisions, a result-oriented team needs a strong architect and a decision making process, not meetings." (Yegor Bugayenko, "Code Ahead", 2018)

"[...] an architect’s work [...] comprises the set of strategic and technical models that create a context for position (capabilities), velocity (directedness, ability to adjust), and potential (relations) to harmonize strategic business and technology goals." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

"Software architecture is the process and product of creating technical systems designs. Architecture may include a specification of resources, patterns, conventions, and communication protocols, among other details." (Morgan Evans, "Engineering Manager's Handbook", 2023)

"Systems architecture is the portion of any project over which the engineering team has the most control. This design is usually less of a collaboration between different functions and more clearly in the domain of engineers. As such, engineering managers have a high responsibility to own this process and its decisions. To produce the best decisions possible, you must have the right decision-building blocks: complete information to work from and a structured methodology to guide you." (Morgan Evans, "Engineering Manager's Handbook", 2023)

"Architecture begins where engineering ends." (Walter Gropius, [speech])

"Architecture is the tension between coupling and cohesion." (Neal Ford)

"Programming without an overall architecture or design in mind is like exploring a cave with only a flashlight: You don't know where you've been, you don't know where you're going, and you don't know quite where you are." (Danny Thorpe)

"The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution." (ANSI/IEEE Std 1471: 2000)
