SQL Troubles: SSIS

17 March 2024

🧭Business Intelligence: Data Products (Part II: The Complexity Challenge)

Business Intelligence Series

Creating data products within a data mesh comes down to "partitioning" a given set of inputs, outputs and transformations to create something that looks like a Lego structure, in which each Lego piece represents a data product. The word partition is used loosely, as there can be overlap in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

Even if the complexity of inputs and outputs can be neglected, however large their number, the same can't be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built in the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them.

When designing a data warehouse or data mart, one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum, to avoid unnecessary duplication of code and data. Code duplication usually becomes an issue when the logic needs to be changed, and in business contexts that can happen often enough to create other challenges. Data duplication becomes an issue when the copies are not in sync, which results from code that is not synchronized or from different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There were many attempts to provide non-SQL operators for the same purpose (in SSIS, Power Query), though the solutions built on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data meshes, the complexity also increases with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges of developing data marts supposed to be domain-specific).

To growing complexity organizations answer with complexity. On one side the teams of developers, business users and other members of the governance teams who together with the solution create an ecosystem. On the other side, the inherent coordination and organization meetings, managing proposals, the negotiation of scope for data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior of the parties involved. Ecosystems are challenging to monitor and manage.

The more complex the architecture, the higher the chances for failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail in addressing basic analytical requirements, so why would one think that they are capable of handling increased complexity? Even if one breaks the complexity of a data warehouse into more manageable units, the complexity is just moved to other levels that are more difficult to manage as a whole.

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can easily get lost in the bigger picture. Thus, a global observability framework is needed that allows monitoring the performance and health of each data product within the aggregate. Besides that, event brokers and other mechanisms are needed to handle failure, availability, security, etc.

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.

07 March 2024

📦Data Migrations (DM): The SQL Server Perspective (Licensing Costs and Edition Choices)

Data Migration Series

A Data Migration (DM) moves all or a subset of the data available from one or more system(s) into other system(s). For this purpose, especially in ERP Implementations, one can use a SQL Server as intermediate layer, where SSIS can be used for the data extraction and exporting, SSRS for reporting the errors, while the database engine for the heavy processing. Master Data and Data Quality Services can be used as well in certain scenarios. Therefore, SQL Server allows by design to address the various challenges related to a DM. At high level the architecture can be depicted as follows:

Data Migration Architecture

Once the decision to go with SQL Server for the DM layer is made, one needs to define which edition to use. If the DM doesn't have special requirements, one can use for it an available SQL Server instance, as long as the cumulated workloads don't create major issues. Therefore, in the past I used existing licensed versions of SQL Server to build solutions for DMs in ERP implementations, though I evaluated in each project whether it's possible to reduce the costs and remain compliant with the license requirements.

Of course, there's always the alternative of using SQL Server Express, which supports databases with a maximum size of 10 GB. That should be enough for most DMs, though the edition also has further limitations (see [2]). There are also ways of working around the existing limitations, like splitting the logic across multiple databases.
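As a rough illustration of splitting the logic across multiple databases, the sketch below greedily assigns migration entities to databases so that each database stays under the size cap. The entity names and sizes are hypothetical and the function is not part of any SQL Server tooling; a real split would also need to account for indexes, intermediary results and cross-database queries.

```python
from typing import Dict, List

EXPRESS_DB_LIMIT_GB = 10.0  # per-database cap mentioned in the text


def split_across_databases(entity_sizes_gb: Dict[str, float],
                           limit_gb: float = EXPRESS_DB_LIMIT_GB) -> List[Dict[str, float]]:
    """Assign entities to databases (first-fit, largest first) without exceeding the cap."""
    databases: List[Dict[str, float]] = []
    ordered = sorted(entity_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        if size > limit_gb:
            raise ValueError(f"{name} ({size} GB) does not fit into a single database")
        for db in databases:
            if sum(db.values()) + size <= limit_gb:
                db[name] = size  # fits into an existing database
                break
        else:
            databases.append({name: size})  # open a new database
    return databases


# Hypothetical estimated sizes in GB, for illustration only
sizes = {"Customers": 3.0, "Vendors": 1.5, "Items": 4.0, "OpenOrders": 6.5, "Prices": 2.0}
plan = split_across_databases(sizes)
for number, db in enumerate(plan, start=1):
    print(f"DM_{number}: {sorted(db)} ({sum(db.values())} GB)")
```

A single entity larger than the cap can't be handled this way, which is why the function raises an error instead of guessing how to partition it.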

Then there's the SQL Server Developer edition, which involves no license costs, has the full SQL Server functionality available, and can be used to build and test applications. In a recent post [1], Bob Ward, principal architect at Microsoft, made several clarifications on the licenses for the Developer edition, which is "licensed for development, test, and demonstration purposes only" and "may not be used in a production environment". Bob Ward makes the following clarifications:
(1) "Production environments include any system that is accessed by end-users for anything more than acceptance testing, environments that connects to production systems (such as Linked servers), disaster recovery or backups of production systems, and environments that are 'rotated' into production at any point in time." [1]
(2) One "cannot use Developer edition to build test data and move that same data into production" [1].
(3) One can "restore a production set of data backup for testing purposes" [1].

There are three impediments to using the Developer edition for the whole DM. Firstly, at least during Go Live and UAT, one needs to work with data coming directly from the various production environments. Secondly, the data generated by the solution are used primarily for UAT and in a second step for Production, which seems to be against rule (2), or at least it's a grey area (which might be overlooked by Microsoft). Thirdly, some data from the production environment might need to be imported back into the DM layer for validation or for enhancing the entities with data generated in the target systems.

Regarding the first issue, the DM solution can always point to the test environments used as source, with the databases from production being copied into the test environments during UAT. This might be necessary anyway for other purposes. Otherwise, the effort might be considerable, and not working with up-to-date data in the last phases might raise other concerns.

The second issue is a matter of interpretation. The UAT phase makes sure that the data generated by the DM solution meet the criteria for Go Live. If there are no issues, the same data can be used for Go Live. If another licensed edition is required for this, then an environment can be built only for UAT and Go Live, project phases which usually span a couple of weeks, unless multiple migrations need to be performed at different time intervals. If the environments are in the cloud, the instances can probably be turned on and off on an as-needed basis.

One can plan for different environments between Production and Development and the environments can be on the same SQL Server as distinct databases, respectively use the Developer edition for Development, and use a different licensed edition for UAT and Production. This approach involves additional overhead in synchronizing the logic between environments. Conversely, in the case of the DM layer, the same environment can be used from beginning to the end, while the code should/must be backed-up periodically. For multiple migrations based on the same data, one should archive the data after each migration or important phase.

For the scenarios in which the data are copied back to the DM solution after migration, it's enough to have these steps performed against the UAT target system(s). This should work as long as there are no differences in configuration between UAT and Production. There are however exceptions, e.g. data generated by the target systems, for which the values differ between Prod and UAT. At least in Dynamics 365 one can attempt to generate the values in the DM layer and import them as they are into the target system. It worked for many scenarios, though there can be exceptions here as well.

A more complex scenario is when data from the DM layer needs to be exported to Data Warehouses or similar solutions that can be considered as Production systems. Here a licensed edition seems to be mandatory. For other scenarios in which Master Data and/or Data Quality Services are needed, there's only the option to use the Enterprise or Developer editions.

To summarize, to reduce the overall costs for the DM, consider using an existing licensed SQL Server instance for building the solution. If separate environments need to be built, the Express edition has some limitations, though it can prove to be a viable solution in many cases. Otherwise, consider the above workarounds for using the Developer edition, including the scenario in which distinct environments are used for Production and Development.

Resources:
[1] Microsoft Data Platform (2024) How SQL developers can maximize savings, by Bob Ward (link)
[2] Microsoft Learn (2024) Editions and supported features of SQL Server 2022 (link)
[3] Microsoft Learn (2023) Master Data Services and Data Quality Services Features Support (link)

02 March 2024

🧭Business Intelligence: Microsoft Releases for the BI Technology Stack (Timeline)

Business Intelligence Series

I started some years back to put together a timeline for the most important events happening in the BI technology stack (work in progress):

2023: Microsoft announces Microsoft Fabric (>>)

Synapse Data Warehouse is the next generation of data warehousing in Microsoft Fabric with native support for the delta lake.
Data Engineering & Data Science workloads with support for lakehouses, notebooks, Spark Job definitions, models and experiments.
Real-Time Analytics is a robust platform tailored to deliver real-time data insights and observability analytics capabilities for a wide range of data types.
OneLake provides a single unified storage location for all your data analytics needs.

2022: Microsoft releases SQL Server 2022 (>>)

Synapse Link for SQL Server 2022 allows to seamlessly replicate operational data in near real-time to be able to have more powerful analytics.
Purview is a unified data governance and management service.

2019: Microsoft launches Azure Synapse Analytics service (formerly SQL Data Warehouse), a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics. (>>)

2019: Microsoft releases SQL Server 2019 (>>)

Big Data Clusters add-in for SQL Server allows to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes (feature to be retired)

2018: Microsoft extends PowerQuery with ETL capabilities. (>>)

2018: Microsoft releases Azure Data Studio, a data management tool that enables to work with SQL Server, Azure SQL DB and SQL DW from Windows, macOS and Linux. (>>)

2017: Microsoft releases Power BI Report Server, an on-premises server that enables Power BI Pro users to publish Power BI reports and distribute them broadly across the enterprise, without requiring report consumers to be licensed individually per use (>>)

2017: Microsoft released SQL Server Data Tools (SSDT), which uses PowerQuery to import and prepare data in SSAS/AAS tabular models.

2017: Microsoft releases SQL Server 2017. (>>)

SSRS is no longer available to install through SQL Server setup.
Python support added, R Services renamed to Machine Learning Services. (>>)

2016: Microsoft releases SQL Server 2016 (What's new, >>)

Query Store allows to monitor and troubleshoot performance issues.
SQL Server R Services integrate the R programming language into SQL Server.
Direct Query for SSAS.
PolyBase for querying the data stored in HDFS. (>>)
Support for HDFS in SSIS.
Azure SQL Data Warehouse is GA. (>>)
modern reports with SSRS. (>>)
Real-Time Operational Analytics. (>>)

2016: SQL Server 2014 Developer Edition becomes free. (>>)

2015: Microsoft announces elastic databases SQL Data Warehouse & Azure Data Lake. (>>)

Elastic databases allow building SaaS applications that manage large numbers of databases with unpredictable resource demand.
Azure SQL Data Warehouse is an elastic data warehouse in the cloud that can dynamically grow, shrink and pause compute in seconds independent of storage.
Azure Data Lake is a hyper-scale data store for big data analytic workloads.

2015: Microsoft releases Power BI to the general public.

Power BI Designer renamed to Power BI Desktop.

2015: Microsoft releases several Azure services:

launches the SQL Server Cloud database.
Azure Data Factory (ADF), a fully managed service that does information production by orchestrating data with processing services as managed data pipelines. (>>)
Azure Stream Analytics, a fully managed stream processing engine that is designed to analyze and process large volumes of streaming data with sub-millisecond latencies. (>>)

2014: Microsoft released Power BI Designer unifying Power Query, Power Pivot & Power View.

2013: Microsoft announces Power BI for Office 365. (>>)

2012: Microsoft releases SQL Server 2012. (>>)

BI Semantic Model for SSAS provides a single, scalable model for BI applications.
Parallel Data Warehouse with PolyBase capabilities.
in-memory capabilities. (>>)
Windows Azure SQL Reporting service available (>>)
SQL Server Data Tools unifies SQL Server and cloud SQL Azure development for both professional database and application developers.

2010: Microsoft released

Power Pivot as part of SQL Server 2008 R2.
Azure SQL Database.

2010: Microsoft releases SQL Server 2008 R2.

Master Data Services.
Power Pivot & Self-service BI capabilities in SSAS.

2008: Microsoft releases SQL Server 2008 (>>)

Table compression.
Change Data Capture (CDC).

2005: Microsoft releases SQL Server 2005

a greatly enhanced version of Analysis Services.
SQL Server Integration Services to replace DTS.

2004: Microsoft released SQL Server Reporting Services (SSRS) as add-on to SQL Server 2000.

2000: Microsoft released SQL Server Analysis Services (SSAS) with SQL Server 2000.

1998: Microsoft released SQL Server 7.

OLAP services & first MDX specifications.
Data Transformation Services (DTS) for ETL workloads.

03 March 2023

🧊Data Warehousing: Architecture (Part IV: Building a Modern Data Warehouse with Azure Synapse)

Introduction

When building a data warehouse (DWH) several key words or derivatives of them appear in requirements: secure, flexible, simple, scalable, reliable, performant, non-redundant, modern, automated, real-timed, etc. As it proves in practice, all these requirements are sometimes challenging to address with the increased complexity of the architecture chosen. There are so many technologies on the DWH market promising all these at low costs, low effort and high ROI, though DWH projects continue to fail addressing the business and technical requirements.

At a basic level, building a DWH requires a data storage layer and an ETL (Extract, Transform, Load) tool responsible for the data movement between the various source systems and the DWH, and eventually within the DWH itself. After that, each technology added to the landscape tends to increase the overall complexity (and should be regarded with a critical eye in what concerns the advantages and disadvantages).

Data Warehouse Architecture (on-premise)

A Reference Architecture

When building a DWH or a data migration solution, which has many of the characteristics of a DWH, from the many designs, I prefer to keep things as simple as possible. An approach based on a performant database engine like SQL Server as storage layer and SSIS (SQL Server Integration Services) as ETL proved to be the best choice until now, allowing to address most of the technical requirements by design. Then come the choices on how and where to import and transform the data, at what level of granularity, on how the semantic layer is built, how the data are accessed, etc.

Being able to pull (see extract subprocess) the data from the data sources on an as-needed basis offers the most flexible approach. However, there are cases in which direct access to the source data is not possible, and one has to rely on a push approach, where data are dumped regularly to a given location (e.g. FTP folder structure), to be picked up as needed. It's actually a hybrid between push and pull, because a full push approach would mean pushing the data directly to the DWH, which can also be acceptable, though it might offer less control over the data's movement and involve a few other challenges (e.g. permissions, concurrency).
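The pull side of this hybrid approach can be sketched as a small pick-up routine: the source system dumps files into a landing folder, and the DWH side loads only the files it hasn't processed yet. The folder, the file names and the in-memory bookkeeping set are assumptions made for illustration; in practice the list of loaded files would typically be kept in a logging table.

```python
from pathlib import Path
from typing import List, Set
import tempfile


def pick_up_new_files(landing_dir: Path, already_loaded: Set[str],
                      pattern: str = "*.csv") -> List[Path]:
    """Return the dumped files that were not loaded yet, ordered by file name."""
    candidates = sorted(p for p in landing_dir.glob(pattern) if p.is_file())
    return [p for p in candidates if p.name not in already_loaded]


# Simulate a source system pushing extracts into a landing folder
with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp)
    for name in ("customers_20240301.csv", "customers_20240302.csv", "readme.txt"):
        (landing / name).write_text("id;name\n1;Test\n")

    loaded = {"customers_20240301.csv"}  # bookkeeping kept on the DWH side
    pending = [p.name for p in pick_up_new_files(landing, loaded)]
    print(pending)
```

Putting the date into the file name keeps the pick-up order deterministic without relying on file system timestamps.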

Data can be prepared for the DWH in the source systems (e.g. exposed via data objects or API calls), anywhere in between via ETL-based transformations (see transform subprocess) or directly in the DWH. I prefer importing the data (see load subprocess) 1:1 without any transformations from the various sources via SSIS (or similar technologies) into a set of tables designated as the staging area. It's true that in this way the ETL technology is used to a minimum, though unless there's a major benefit in using it for data transformations, using the DWH's capabilities and SQL for data processing can provide better performance and flexibility.

Besides the selection of the columns in scope (typically columns with meaningful values), it's important not to do any transformations in the extraction layer because the data is imported faster (eventually using fast load options as in SSIS) and it assures a basis for troubleshooting (as the data don't change between loads). Some filters can be applied only when the volume of data is high, and the subset of the data could be identified clearly (e.g. when data are partitioned based on a key like business unit, legal entity or creation date).

For better traceability, the staging schemas can reflect the systems they come from, the tables and the columns should have the same names, respectively same data types. On such tables no constraints are applied and no indexes are needed. They can be constructed however on the production tables (aka base tables) - copy of the tables from production.

Some DWH architects try replicating the constraints from the source systems and/or add more constraints on top to define the various business rules. Rigor is good in some scenarios, though it can involve a considerable effort and it might be challenging to maintain over time, especially when considering the impact of big data on DWH architectures. Instead of using constraints, building a set of SQL scripts that pinpoint the issues as reports allows more flexibility, with the risk of having inconsistencies running wild through the reports. The data should be cleaned in the source system and, when that's not possible, properly addressed in the DWH. Applying constraints will make the data unavailable for reporting until they are corrected, while being more permissive would allow dirty data. Thus, both approaches have advantages and disadvantages, though the latter seems to be more appropriate.
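A minimal sketch of such issue-reporting scripts is given below. It uses Python's built-in sqlite3 module as a stand-in for the DWH engine, and the tables, columns and checks are hypothetical; the point is only that each check returns the offending rows instead of blocking the load the way a constraint would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the DWH engine
conn.executescript("""
    CREATE TABLE Customers (CustomerId TEXT, Name TEXT);
    CREATE TABLE Orders (OrderId TEXT, CustomerId TEXT, Amount REAL);
    INSERT INTO Customers VALUES ('C1', 'Contoso'), ('C2', NULL);
    INSERT INTO Orders VALUES ('O1', 'C1', 100.0), ('O2', 'C9', 50.0), ('O3', 'C2', -5.0);
""")

# Each check lists the records that violate a rule; the data stay available
checks = {
    "orders_without_customer": """
        SELECT o.OrderId
        FROM Orders o
             LEFT JOIN Customers c ON c.CustomerId = o.CustomerId
        WHERE c.CustomerId IS NULL""",
    "customers_without_name": "SELECT CustomerId FROM Customers WHERE Name IS NULL",
    "orders_with_negative_amount": "SELECT OrderId FROM Orders WHERE Amount < 0",
}

report = {name: [row[0] for row in conn.execute(sql)] for name, sql in checks.items()}
for name, keys in report.items():
    print(f"{name}: {keys}")
```

The same queries could be exposed as views and consumed by a data quality report, so the issues become visible to the people who can correct them in the source system.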

Indexes on the production schema should reflect the characteristics of the queries run on the data and shouldn't replicate the indexes from the source environments, even if some overlaps might exist. In practice, dropping the non-clustered indexes on the production tables before loading the data from staging, and recreating them afterwards proves to provide faster loading (see load optimization techniques).
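The drop-load-recreate pattern can be sketched as follows. SQLite (via Python's sqlite3 module) is used here only as a stand-in, the table and index names are hypothetical, and the sketch shows the sequence of steps rather than the performance gain, which depends on the engine and on the data volume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the DWH engine
conn.executescript("""
    CREATE TABLE stg_Orders (OrderId INTEGER, CustomerId TEXT, Amount REAL);
    CREATE TABLE Orders (OrderId INTEGER, CustomerId TEXT, Amount REAL);
    CREATE INDEX IX_Orders_CustomerId ON Orders (CustomerId);
""")
conn.executemany("INSERT INTO stg_Orders VALUES (?, ?, ?)",
                 [(i, f"C{i % 10}", float(i)) for i in range(1000)])

# Index definitions are kept in one place so they can be recreated after the load
INDEX_DDL = {
    "IX_Orders_CustomerId": "CREATE INDEX IX_Orders_CustomerId ON Orders (CustomerId)",
}


def reload_orders(conn: sqlite3.Connection) -> None:
    """Drop the secondary indexes, reload the table from staging, recreate the indexes."""
    for name in INDEX_DDL:
        conn.execute(f"DROP INDEX IF EXISTS {name}")
    conn.execute("DELETE FROM Orders")
    conn.execute("INSERT INTO Orders SELECT OrderId, CustomerId, Amount FROM stg_Orders")
    for ddl in INDEX_DDL.values():
        conn.execute(ddl)
    conn.commit()


reload_orders(conn)
row_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'Orders'")]
print(row_count, index_names)
```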

The production tables are used for building a "semantic" data model or something similar. Several levels of views, table-valued functions and/or indexed/materialized views allow building the dimension and fact tables, the latter incorporating the business logic needed by the reports. Depending on the case, stored procedures, physical or temporary tables, or table variables can be used to prepare the data, though they tend to break the "free" flow of data as the steps in between need to be run. On the other side, in certain scenarios their use is unavoidable.

The first level of views (aka base views) is based on the base tables without any joins, though they include only the fields in use (needed by the business), ordered and "grouped" together based on their importance or certain characteristics. The views can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. Based on these "base" views the second level is built, which attempts to define the dimension and fact tables at the lowest granularity. These views include joins between tables coming from the same or different systems, respectively mappings of values defined in tables, and whatever it takes to build such entities. However, transformations on individual fields are pushed, when possible, to the lower level to minimize logic redundancy. For similar reasons, the logic could be broken down over two or more "helper" views when visible benefits could be obtained from it (e.g. troubleshooting, reuse, maintenance). It's important to find a balance between creating too many helper views and encapsulating too much logic in a view.
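The two levels of views can be sketched as below, with SQLite (via Python's sqlite3 module) standing in for the database engine. All table, view and column names are hypothetical; the field-level transformations sit in the base view, while the second-level view only joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the DWH engine
conn.executescript("""
    -- base tables: 1:1 copies of hypothetical source tables
    CREATE TABLE Items (ItemId TEXT, ItemName TEXT, ItemType INTEGER,
                        CreatedOn TEXT, InternalFlag TEXT);
    CREATE TABLE ItemGroups (ItemId TEXT, GroupName TEXT);
    INSERT INTO Items VALUES ('I1', 'Bolt', 0, '2024-03-01 10:15:00', 'x'),
                             ('I2', 'Setup', 1, '2024-03-02 08:00:00', 'y');
    INSERT INTO ItemGroups VALUES ('I1', 'Hardware'), ('I2', 'Services');

    -- level 1: base views without joins, only the fields in use,
    -- with data type conversions and translations of codes
    CREATE VIEW vItems AS
    SELECT ItemId
         , ItemName
         , CASE ItemType WHEN 0 THEN 'Item' WHEN 1 THEN 'Service' ELSE 'Unknown' END AS ItemType
         , date(CreatedOn) AS CreatedDate
    FROM Items;

    CREATE VIEW vItemGroups AS
    SELECT ItemId, GroupName
    FROM ItemGroups;

    -- level 2: dimension at the lowest granularity, built only from base views
    CREATE VIEW vDimItem AS
    SELECT i.ItemId, i.ItemName, i.ItemType, i.CreatedDate, g.GroupName
    FROM vItems i
         LEFT JOIN vItemGroups g ON g.ItemId = i.ItemId;
""")

dim_rows = conn.execute("SELECT * FROM vDimItem ORDER BY ItemId").fetchall()
for row in dim_rows:
    print(row)
```

Because the code translation lives only in vItems, a change to the mapping is made once and is picked up by every view built on top of it.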

One of the design principles used in building the entities is to minimize the redundancy of the fields used, ideally without having columns duplicated between entities at this level. This would facilitate the traceability of columns to the source tables within the "semantic" layer (typically at the cost of a few more joins). In practice, one is forced to replicate some columns to simplify some parts of the logic.

Further views can be built based on the dimension and fact entities to define the logic needed by the reports. Only these objects are used and no direct reference to the "base" tables or views is made. Moreover, to offer better performance, the views can be materialized or, when there's an important benefit, physically saved as tables (e.g. having multiple indexes for different scenarios). It's the case of entities with considerable data volume that are called over and over.
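Saving a view physically as a table can be sketched as a snapshot that is rebuilt on demand. The example uses SQLite through Python's sqlite3 module as a stand-in, with hypothetical object names; SQLite has no materialized views, so the snapshot table plus an index plays that role here, and it stays stale until the next refresh.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the DWH engine
conn.executescript("""
    CREATE TABLE Sales (OrderId INTEGER, CustomerId TEXT, Amount REAL);
    INSERT INTO Sales VALUES (1, 'C1', 100.0), (2, 'C1', 50.0), (3, 'C2', 75.0);

    CREATE VIEW vFactSalesByCustomer AS
    SELECT CustomerId, COUNT(*) AS NoOrders, SUM(Amount) AS TotalAmount
    FROM Sales
    GROUP BY CustomerId;
""")


def refresh_snapshot(conn: sqlite3.Connection) -> None:
    """Rebuild the physical copy of the view and its index."""
    conn.executescript("""
        DROP TABLE IF EXISTS FactSalesByCustomer;
        CREATE TABLE FactSalesByCustomer AS SELECT * FROM vFactSalesByCustomer;
        CREATE INDEX IX_FactSalesByCustomer_CustomerId ON FactSalesByCustomer (CustomerId);
    """)


QUERY = "SELECT TotalAmount FROM FactSalesByCustomer WHERE CustomerId = 'C2'"

refresh_snapshot(conn)
conn.execute("INSERT INTO Sales VALUES (4, 'C2', 25.0)")
stale = conn.execute(QUERY).fetchone()[0]  # the snapshot doesn't see the new row yet
refresh_snapshot(conn)
fresh = conn.execute(QUERY).fetchone()[0]  # after the refresh the new row is included
print(stale, fresh)
```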

This approach of building the entities is usually flexible enough to address most of the reporting requirements, independently of whether the technical solution has the characteristics of a DWH, data mart or data migration layer. Moreover, the overall architectural approach can be used on-premise as well as in cloud architectures, where Azure SQL Server and ADF (Azure Data Factory) provide similar capabilities. Compared with standard SQL Server, some features might not be available, while other features might bring further benefits, though the gaps should be negligible.

Data Management topics like Master Data Management (MDM), Data Quality Management (DQM) and/or Metadata Management can be addressed as well by using third-party tools or tools from the Microsoft stack - Master Data Services (MDS) and Data Quality Services (DQS) in combination with SSIS help addressing a wide range of scenarios - however these are optional.

Moving to the Cloud

Within the context of big data, characterized by (high/variable) volume, value, variety, velocity, veracity, and further less important V's, the technical requirements mentioned before still apply, however within a cloud environment the overall architecture becomes more complex. Each component becomes a service. There are thus various services for data ingestion, storage, processing, sharing, collaboration, etc. The way data are processed also involves several important shifts: ETL becomes ELT, FTP and local storage are replaced by Data Lakes, data packages by data pipelines, stateful by stateless, SMP (Symmetric Multi-Processing) by MPP (Massively Parallel Processing), and so on.

As file storage is less expensive than database storage, there's an increasing trend of dumping business critical data into the Data Lake via data pipelines or features like Link to Data Lake or Export to Data Lake (*), which synchronize the data between source systems and Data Lake in near real-time at table or entity level. Either saved as csv, parquet, delta lake or any other standard file format, in single files or partitions, the data can be used directly or indirectly for analytics.

Cloud-native warehouses allow addressing topics like scalability, elasticity, fault-tolerance and performance by design, though further challenges appear as compute needs to be decoupled from storage, the workloads need to be estimated for assuring the performance, data may be distributed across data centers spanning geographies, the infrastructure is exposed to attacks, etc.

Azure Synapse

If one wants to take advantage of the MPP architecture's power, Microsoft provides an analytical architecture based on Azure Synapse, an analytics service that brings together data integration, enterprise DWH, and big data analytics. Besides two types of SQL-based data processing services (dedicated vs serverless SQL pools) it comes also with a Spark pool for in-memory cluster computing.

A DWH based on Azure Synapse is not that different from the reference architecture described above for an on-premise solution. Actually, a DWH based on a dedicated SQL pool (aka a physical data warehouse) involves the same steps mentioned above.

Data Warehouse Architecture with Dedicated SQL Pool

The data can be imported via ETL/ELT pipelines in the DWH, though there are also mechanisms for consuming the data directly from the files stored in the Data Lake or Azure storage. CETAS (aka Create External Table as Select) can be defined on top of the data files, the external tables acting as "staging" or "base" tables in the architecture described above. When using a dedicated SQL pool it makes sense to use the CETAS as "staging" tables, the processed data being then dumped to "optimized" physical tables for consumption and refreshed periodically. However, when this happens the near real-time character of the data is lost. Using the CETAS as base tables would keep this characteristic as long as the data aren't saved physically in tables or files, maybe to the detriment of performance.

Using a dedicated SQL pool for direct reporting can become expensive as the pool needs to be available at least during business hours for incoming user requests, or at least for importing the data and refreshing the datasets. When using the CETAS as base tables, a serverless (aka on-demand) SQL pool, which uses a pay-per-use billing model, could prove to be more cost-effective and flexible in many scenarios. By design, it helps to keep the near real-time character of the data. Moreover, even if the data are actually moved from the source tables into the Data Lake, this architecture has the characteristics of a logical data warehouse:

Data Warehouse Architecture with Serverless SQL Pool

Unfortunately, unless one uses Spark tables, misuses views or adds an Azure SQL database to the architecture, there are no physical tables or materialized views in a serverless SQL pool. There's still the option to use data pipelines for regularly exporting intermediary data to files (incl. over partitions or folders), even if this involves more overhead as it's not possible to export data over SQL syntax to files more than once (though this might change in the future). For certain scenarios it could be useful to store data in an Azure SQL Server or similar database, including a dedicated SQL pool.

Choosing between serverless and dedicated SQL pools is not an exclusive choice: both, or all 3 types of pools (if we also consider the Spark pool), can be used in the architecture for addressing specific challenges, especially when we consider that there are important differences between the features available in each of the pools. Moreover, one can start the PoC based on the serverless SQL pool and, when the solution becomes mature enough and is used across the enterprise, parts of the logic or all of it can be migrated to a dedicated SQL pool. This would allow saving costs at the beginning at the expense of further effort later.

Talking about the physical storage, data engineers recommend defining within a Data Lake several layers (aka regions, zones) labeled as bronze, silver and gold (and probably platinum will join the club anytime soon). The bronze layer refers to the raw data available in the Data Lake, including the files on which the initial CETAS are defined. The silver layer refers to transformed, cleaned, enriched and integrated data, i.e. data resulting from the second layer of views described above. The gold layer refers to the data to which business logic was applied and which are prepared for consumption, i.e. data resulting from the final layer of views. Of course, data pipelines can be used to prepare the data at these stages, though a view-based approach offers more flexibility and is easier to troubleshoot, manage and reuse than data pipelines.

Ideally the gold data should involve no or minimal further transformation before reaching the users, though that's not realistic. Building a DWH takes a considerable time and the business can't usually wait until everything is in place. Therefore, reports based on DWH will continue to coexist with reports directly accessing the source data, which will lead to controversies. Enforcing a single source of truth will help to minimize the gap, though will not eliminate it completely.

Closing Notes

These are just outlines of a minimal reference architecture. There's more to consider, as there are several alternatives (see [1] [2] [3] [4]) for each of the steps considered here, each technology, new feature or mechanism opening new opportunities. The advantages and disadvantages should always be considered against the business needs and requirements. One approach, even if recommended, might not work for all, though unless there's an important requirement or an opportunity associated with an additional technology, deviating from reference architectures might not be such a good idea after all.

Note:

(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.

Resources:

[1] Microsoft Learn (2022) Modern data warehouse for small and medium business (link)

[2] Microsoft Learn (2022) Data warehousing and analytics (link)

[3] Microsoft Learn (2022) Enterprise business intelligence (link)

[4] Microsoft Learn (2022) Serverless Modern Data Warehouse Sample using Azure Synapse Analytics and Power BI (link)

[5] Coursera (2023) Data Warehousing with Microsoft Azure Synapse Analytics (link) [course, free to audit]

[6] SQLBits (2020) Mahesh Balija's Building Modern Data Warehouse with Azure Synapse Analytics (link)

[7] Matt How (2020) The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform (Amazon)

[8] James Serra's blog (2022) Data lake architecture (link)

[9] SQL Stijn (2022) SQL Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake (link)

[10] Solliance (2022) Azure Synapse Analytics Workshop 400 (link) [GitHub repository]

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

Microsoft Azure: Azure Data Factory (ADF)

{definition} pay-per-use serverless cloud-based data integration service that orchestrates and automates the movement and transformation of both cloud-based and on-premises data sources [1]
- ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
- ⇐ Microsoft Azure PaaS offering for ETL/ELT workloads found at its second generation [11]
- allows creating data-driven flows to orchestrate movement of data between supported data stores and processing of data using compute services in other regions or in an on-premises environment
{benefit} easy-to-use
- {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
- {feature} uses JSON to describe each of its entities
{benefit} cost-effective
- pay-as-you-go model against the Azure subscription with no up-front costs
- low price-to-performance ratio
  - ⇐ cost effective and performant at the same time
- fully managed serverless cloud service that scales on demand [2]
  - ⇒ requires zero hardware maintenance [1]
  - ⇒ can easily scale beyond what was originally anticipated [1]
- does not store any data [1]
- provides additional cost-saving functionality [11]
  - {feature} it takes care of the provisioning and teardown of the cluster once the job has executed [11]
{benefit} powerful
- allows ingesting on-premises and cloud-based data sources
- high-performance hybrid connectivity
  - over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
- orchestrate at scale
  - on-demand compute
  - Big Data workloads are scaled over multiple nodes to chunk data in parallel [11]
- {feature} [ADFv2] monitoring
  - richer and natively integrated with Azure Monitor and OMS [11]
    - includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1]
- {feature} [ADFv2] control flow functionality
  - allows defining complex workflows using programmatic or UI mechanisms
    - allows defining parameters at pipeline level [11]
    - includes custom state passing and looping containers [11]
    - pipelines can be authored via additional tools
      - e.g. PowerShell, .NET, Python, REST APIs
      - ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
{benefit} intelligent
- autonomous ETL allows unlocking operational efficiencies and enabling citizen integrators [2]
{benefit} enterprise-grade security
- provides same security standards as any other Microsoft service [11]
{benefit} monthly release cycle
- {feature} via auto-update
- improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
{benefit} backwards compatibility
- {feature} [ADFv2] allows rehosting SSIS solutions [2]
  - ⇒ helpful for modernizing data warehouse solutions
{prerequisite} an Azure subscription with the contributor role assigned to at least one resource group
{limitation} availability
- the service isn’t available in all regions
  - an instance can be made available in another region to trigger the job on the customer’s compute environment [1]
    - ⇐ the time for executing the job on the compute environment doesn’t change [1]
{concept} activity
- the unit of orchestration in ADF [1]
- defines the actions to perform on data [1]
- takes zero or more datasets as inputs and produces one or more datasets as outputs [1]
- activity types
  - data movement activities
  - data transformation activities
  - control activities
    - control how the pipeline works and interacts with the data [10]
    - allow executing pipelines [10]
    - allow running a foreach statement or Lookup activities [10]
{concept} pipeline
- logical grouping of activities that together perform a task [1]
  - the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
  - two activities can be chained by setting the output data set of one activity as the input dataset of the other activity
- allows building ETL/ELT workloads
- scheduled by scheduler triggers [10]
- data in a pipeline is referred to by different names
  - ⇐ based on the amount of modification that has been performed
  - raw data
    - data with no processing applied [10]
      - ⇒ does not yet have a schema applied
    - stored in the message encoding format used to send tracking events (e.g. JSON)
    - can be organized into meaningful data stores and data lakes [10]
      - ⇐ further used in decision-making
    - it's common to send all tracking events as raw events
      - ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
  - processed data
    - raw data that has been decoded in the event-specific formats with the schema applied
      - e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
    - usually stored in different event tables and destination in a data pipeline [10]
  - cooked data
    - processed data that has been aggregated or summarized [10]
- {concept} pipeline parameters
  - similar to SSIS package parameters
    - ⇐ need to be set from outside packages
  - can be passed from the parent pipeline
{concept} dataset
- named references/pointers to the data used as an input or an output of an activity [1]
- identifies data structures within different (linked) data stores [1]
  - ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
  - once created, it can be used with activities in a pipeline [10]
    - e.g. a dataset can be an input or output dataset of a copy activity
{concept} linked service
- defines the information needed by ADF to connect to external resources at runtime
  - much like connection strings which define the connection information [10]
- used to represent
  - {concept} data store
    - holds the input-output data to the ADF
    - e.g. tables, files, folders, and documents
  - {concept} compute resource
    - can host the execution of an activity [1]
{concept} scheduler triggers
- allow pipelines to be triggered on a wall-clock schedule [10]
  - pipelines and triggers have an n-m relationship
    - multiple triggers can kick off a single pipeline
    - the same trigger can kick off multiple pipelines
  - manual triggers trigger pipelines on demand [10]
- once defined, it must be started to begin triggering the pipeline [10]
- comes into effect only after publishing the solution to ADF [10]
  - ⇐ not when saving the trigger in the UI [10]
- to run a pipeline, a pipeline reference must be included in trigger definition [10]
- there is a cost associated with each pipeline run
  - {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
  - {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]


Acronyms:

Azure Data Factory (ADF)

Continuous Integration/Continuous Deployment (CI/CD)

Extract Load Transform (ELT)

Extract Transform Load (ETL)

Independent Software Vendors (ISVs)

Operations Management Suite (OMS)

pay-as-you-go (PAYG)

SQL Server Integration Services (SSIS)

Resources:

[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College

[2] Microsoft (2021) Azure Data Factory [source]

[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]

[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]

[10] Coursera (2021) Data Processing with Azure [source]

[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"

06 July 2020

💠🛠️🪄SQL Server: Undocumented (Part III: SQL Server CPU Utilization via the Ring Buffer)

Introduction

If no proper monitoring solution for the SQL Server and the hosting server is in place to review the CPU utilization, one can use the Scheduler Monitor buffer provided by the undocumented sys.dm_os_ring_buffers dynamic management view (DMV). Introduced with SQL Server 2005, the DMV provides a significant amount of diagnostic memory information in XML form via several buffers: Resource Monitor, Out-of-Memory, Memory Broker, Buffer Pool and Scheduler Monitor [2]. A ring buffer is a recorded response to a notification [1].

The view changed between the various versions of SQL Server, and with the introduction of Always On availability groups in SQL Server 2012 further ring buffers were made available (see [5]).

Warning:

According to Microsoft (see [4]), sys.dm_os_ring_buffers is provided for informational purposes only, future compatibility beyond SQL Server 2019 not being guaranteed!

Querying the Scheduler Monitor Buffer

Within the Scheduler Monitor buffer, the DMV stores a history of about 4 hours of uptime with minute-by-minute data points (256 entries in total) containing the CPU utilization of the SQL Server and of the other processes, as well as the system idle time, all as percentages. This allows identifying the peaks in CPU utilization and thus determining the intervals to focus on for further troubleshooting. As the data are stored within an XML structure, the values can be queried via the XQuery syntax as follows:

-- cpu utilization for SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

SELECT DAT.record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
ORDER BY DAT.record_id DESC;

If the SQL Server is not busy at all, the SQL Server utilization time may tend to 0%, while the system idle time to 90%. (It's the case of my SQL Server lab.)

CPU Utilization for my home SQL Server lab

Notes:

If the server was restarted within the last 4 hours, then the points will have a gap between two readings corresponding to the downtime interval.

The query is supposed to run also on Linux machines, though the SystemIdle time will be 0. One can thus consider the SQL and non-SQL CPU utilization.

Storing the History

The above query can be run on a regular basis (e.g. every 3-4 hours) via an SSIS package that pushes the data into a table for historical purposes. Because a continuous history of the readings is needed, it's better if the gap between runs is smaller than the 4 hours covered by the buffer. Whatever approach is used, it's better to check for overlaps when storing the data:

-- dropping the table
-- DROP TABLE IF EXISTS dbo.T_RingBufferReadings 

-- reinitializing the history
-- TRUNCATE TABLE dbo.T_RingBufferReadings

-- creating the table
CREATE TABLE dbo.T_RingBufferReadings (
  Id bigint IDENTITY (1,1) NOT NULL
, RecordId bigint 
, EventTime datetime2(3) NOT NULL
, SQLProcessUtilization int NOT NULL
, SystemIdle int NOT NULL
, OtherUtilization int NOT NULL
)


-- reviewing the data
SELECT *
FROM dbo.T_RingBufferReadings 
ORDER BY EventTime DESC

If there are many records, one can also create an index to improve the performance; the index can include the reading points as well:

-- creating a unique index with an include 
CREATE UNIQUE NONCLUSTERED INDEX [UI_T_RingBufferReadings_EventTime] ON dbo.T_RingBufferReadings
(
	EventTime ASC,
    RecordId ASC
) INCLUDE (SQLProcessUtilization, SystemIdle, OtherUtilization)
GO
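As a sketch of the kind of query the index is meant to support, the following summary per hour can be answered from the index alone, since all referenced columns are either key or included columns. The 7-day window and the output column names are arbitrary choices made for illustration.

```sql
-- hourly summary of the stored readings (window chosen only for illustration)
DECLARE @from datetime2(3) = DATEADD(day, -7, SYSDATETIME());

SELECT CAST(RBR.EventTime AS date) EventDate
, DATEPART(hour, RBR.EventTime) EventHour
, AVG(CAST(RBR.SQLProcessUtilization AS decimal(5,2))) AvgSQLProcessUtilization
, MAX(RBR.SQLProcessUtilization) MaxSQLProcessUtilization
, MAX(RBR.OtherUtilization) MaxOtherUtilization
, COUNT(*) NoReadings
FROM dbo.T_RingBufferReadings RBR
WHERE RBR.EventTime >= @from
GROUP BY CAST(RBR.EventTime AS date)
, DATEPART(hour, RBR.EventTime)
ORDER BY EventDate DESC
, EventHour DESC;
```

A NoReadings value well below 60 for a full hour hints at a gap in the history, e.g. a restart or a missed run of the load.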

The above query based on the DMV becomes:

-- cpu utilization by SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

INSERT INTO dbo.T_RingBufferReadings (RecordId, EventTime, SQLProcessUtilization, SystemIdle, OtherUtilization)
SELECT DAT.record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
	LEFT JOIN dbo.T_RingBufferReadings RBR
	  ON DAT.record_id = RBR.RecordId 
WHERE RBR.RecordId IS NULL
ORDER BY DAT.record_id DESC;

Note:
A ServerName column can be added to the table if the values for different SQL Servers need to be stored. The LEFT JOIN then has to consider the newly added column.
Either of the two queries can be used to display the data points within a chart via SSRS, Power BI or any reporting tool available.
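The ServerName variant mentioned in the note could look as follows. This is only a sketch: the column's data type, the use of @@SERVERNAME and the assumption that the rows already stored belong to the current server are not part of the original scripts.

```sql
-- adding the column (existing rows are assumed to belong to the current server)
ALTER TABLE dbo.T_RingBufferReadings ADD ServerName sysname NULL;
GO

UPDATE dbo.T_RingBufferReadings
SET ServerName = @@SERVERNAME
WHERE ServerName IS NULL;
GO

-- inserting only the readings not yet stored for the current server
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

INSERT INTO dbo.T_RingBufferReadings (RecordId, EventTime, SQLProcessUtilization, SystemIdle, OtherUtilization, ServerName)
SELECT DAT.record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
, @@SERVERNAME ServerName
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
	LEFT JOIN dbo.T_RingBufferReadings RBR
	  ON DAT.record_id = RBR.RecordId 
	 AND RBR.ServerName = @@SERVERNAME -- the newly added column is part of the join
WHERE RBR.RecordId IS NULL;
```

If the unique index on EventTime and RecordId is kept, ServerName would likely need to become part of its key as well, as two servers can produce the same combination of values.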

Happy coding!

References:
[1] Grant Fritchey (2014) SQL Server Query Performance Tuning: Troubleshoot and Optimize Query Performance in SQL Server 2014, 4th Ed.
[2] Sunil Agarwal et al (2005), Troubleshooting Performance Problems in SQL Server 2005, Source: TShootPerfProbs.docx
[3] Sunil Agarwal et al (2008), Troubleshooting Performance Problems in SQL Server 2008, Source: TShootPerfProbs2008.docx
[4] Microsoft SQL Docs (2018) Related Dynamic Management Views, Source
[5] Microsoft SQL Docs (2017) Use ring buffers to obtain health information about Always On availability groups, Source


12 June 2020

🎡SSIS Project: Covid-19 Data

Introduction

I was exploring the Covid-19 data provided by Johns Hopkins University and, as usual, I stumbled on several data issues. Therefore I thought I could share some scripts and ideas in a post.

The data from the downloaded files cover the timeframe between the 22nd of January and the 10th of June 2020 and reflect the number of confirmed cases, deaths and recoveries per day, country and state. It seems the data are updated on a daily basis. Unfortunately, the data are spread over several files (one for each indicator), which makes their consumption more difficult than expected, though the challenges are minor. To begin with, I downloaded the following files from the above link:

time_series_covid19_recovered_global_narrow.csv
time_series_covid19_confirmed_global_narrow.csv
time_series_covid19_deaths_global_narrow.csv

Before attempting anything with the files, it's recommended to look over them to check whether column names are provided and properly named, how the columns and the rows are delimited, and whether anything else stands out during a first review. For example, I needed to delete the second line from each file.

Data Loading

When starting this kind of project, it's useful to first check the project's feasibility in terms of whether the data are usable. Therefore, I usually import the data first via the 'Import Data' wizard and decide later whether it makes sense to build a project for this. Right click on the database in which you'd like to import the data, then from 'Tasks' choose 'Import Data' to use the wizard:

As data source we will consider the first data file, therefore in the 'Choose a Data Source' step, browse for the file:

The 'Locale' might appear different for you, therefore leave the default value, however you'll have to make sure that the following values are like in the above screenshot. Look over the 'Columns' section to check whether the formatting was applied correctly. The preview offers a first overview of the data:

It's useful to review the default data types defined by the wizard for each column within the 'Advanced' section. For example the first two fields could have more than the default of 50 characters. The length or the data type can be modified as needed:

One can attempt in theory to get all the column definitions right from the beginning, though for the first attempt this is less important. In addition, without knowing data's definition, there's always the possibility for something to go wrong. Personally, I prefer loading the data as text and doing later the needed conversions, if necessary.

In the next step one can define the 'Destination', the database where the data will be loaded:

After defining the source and destination, the mapping between the two needs to be defined. One can go with the table definition provided by the wizard or modify the table's name directly in the wizard:

By clicking on 'Edit Mappings' one can review the mappings. As there's a one-to-one import, one can in theory skip this step. However, if there are already data in the table, one can either delete the rows from the destination table or append the new rows to it. Here the first radio button is selected, as the table will be created as well:

With this being done, one can run the package as it is - just click 'Next':

If everything is ok, each step from the package will appear in green, and the number of records is shown at the end.

It may look like many steps, though once one got used to using the wizard, the data are loaded in less than 5 minutes.

Data Discovery

Once the data are loaded, it's time for data discovery - looking at the structure of the data and trying to understand their meaning. Before further using the data it's important to identify the attributes which uniquely identify a record within your dataset - in this case the combination of State, Country and Date. At least it should be unique, because the second query returned some duplicates for the Country 'Korea', respectively 'Sint Eustatius and Saba', which seems to be a region within 'Netherlands'.


-- looking at the data
SELECT top 1000 *
FROM [dbo].[time_series_covid19_recovered_global_narrow]


-- checking for duplicates 
SELECT [Province State]
, [Country Region]
, Date 
, count(*) NoRecords 
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1

-- checking for distinct values  
SELECT [Country Region]
, count(*) NoRecords 
, SUM(TRY_CAST(Value as int)) NoCases
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Country Region]
HAVING count(*)>1
ORDER BY [Country Region]

The two duplicates are caused by the fact that a comma was used in a country, respectively a province's name, when the comma is actually used as delimiter. (Therefore it's better to use a sign like "|" as delimiter, as the chances are small for the sign to be used anywhere. Another solution would be to use quotes for alphanumeric values.)  Fortunately, the problem can be easily fixed with an update which uses the dbo.CutLeft, respectively dbo.CutRight functions defined in a previous post. Therefore the functions need to be created first within the same database before running the scripts.


-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region]  LIKE '%int Eustatius and Saba%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = [Province State] + ', ' + [Country Region] 
, [Country Region] = [Lat]
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = dbo.CutRight([Intermediate Region Code], ',', 0)
 WHERE [Country Region]  LIKE '%Sint Eustatius and Saba%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%int Eustatius and Saba%'

A similar solution is used for the second problem:


-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = '' 
, [Country Region] = Replace( [Country Region] + ', ' + [Lat], '"', '')
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = Replace(dbo.CutRight([Intermediate Region Code], ',', 0), '"', '')
WHERE [Country Region] LIKE '%Korea%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'

With this the data from the first file are ready to use. The data from the other two files can be loaded following the same steps as above. The tables not only have a similar structure, but also the same issues. One can just replace the name of the table in the scripts to correct the issues.
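Before combining the tables, one can check whether any shifted rows are left. The following sketch assumes that all three tables were loaded with the column names used above and lists the rows in which the numeric columns can't be converted; the Indicator column is added only for illustration. Rows with empty or missing values may show up as well, so the output needs to be reviewed rather than taken as a list of errors.

```sql
-- rows whose numeric columns fail conversion (e.g. shifted by a stray delimiter)
SELECT 'confirmed' Indicator
, [Province State]
, [Country Region]
, [Lat]
, [Long]
, [Date]
, [Value]
FROM [dbo].[time_series_covid19_confirmed_global_narrow]
WHERE TRY_CAST([Lat] as decimal(10,4)) IS NULL
   OR TRY_CAST([Long] as decimal(10,4)) IS NULL
   OR TRY_CAST([Value] as int) IS NULL
UNION ALL
SELECT 'deaths' Indicator
, [Province State]
, [Country Region]
, [Lat]
, [Long]
, [Date]
, [Value]
FROM [dbo].[time_series_covid19_deaths_global_narrow]
WHERE TRY_CAST([Lat] as decimal(10,4)) IS NULL
   OR TRY_CAST([Long] as decimal(10,4)) IS NULL
   OR TRY_CAST([Value] as int) IS NULL
UNION ALL
SELECT 'recovered' Indicator
, [Province State]
, [Country Region]
, [Lat]
, [Long]
, [Date]
, [Value]
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE TRY_CAST([Lat] as decimal(10,4)) IS NULL
   OR TRY_CAST([Long] as decimal(10,4)) IS NULL
   OR TRY_CAST([Value] as int) IS NULL
```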

Putting All Together

As we need the data from all three tables for analysis, we could create a query that joins them together and encapsulate it within a view. The volume of data is negligible, and even without an index the query can perform acceptably. However, as soon as the volume of data increases, it's useful to have only one table for consumption. Independently of the approach considered, the query is similar. As we made sure that the key is unique across all the data, we could write the query as follows:


-- combining the data together 
SELECT [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) [Lat]
, TRY_CAST([Long] as decimal(10,4)) [Long]
, [Date]
, Sum(Confirmed) Confirmed
, Sum(Death) Death
, Sum(Recovered) Recovered
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]
INTO [dbo].[time_series_covid19_global_narrow]
FROM (
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , TRY_CAST([Value] as int) Confirmed
 , 0 Death 
 , 0 Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_confirmed_global_narrow]
 UNION ALL
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , 0 Confirmed
 , TRY_CAST([Value] as int) Death 
 , 0 Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_deaths_global_narrow]
 UNION ALL
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , 0 Confirmed
 , 0 Death 
 , TRY_CAST([Value] as int) Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_recovered_global_narrow]
  ) DAT
GROUP BY [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) 
, TRY_CAST([Long] as decimal(10,4)) 
, [Date]
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]

-- reviewing the data
SELECT *
FROM [dbo].[time_series_covid19_global_narrow]

-- checking for duplicates 
SELECT [Province State]
, [Country Region]
, Date 
, count(*) NoRecords 
FROM [dbo].[time_series_covid19_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1

If everything went smoothly, the last query will return no records. As the latitude was not given consistently across the files, the values had to be converted and cut to 4 decimals. A few adjustments are needed before using the data. As the data are cumulative, each value adding to the one of the previous date, it's useful to calculate the increase between two consecutive days. This can be done via the LAG window function:

-- preparing the data for analysis in a view
CREATE VIEW dbo.v_time_series_covid19
AS
SELECT * 
, LAG(Confirmed,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevConfirmed
, LAG(Death,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevDeath
, LAG(Recovered,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevRecovered
, Confirmed-LAG(Confirmed,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseConfirmed
, Death-LAG(Death,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseDeath
, Recovered - LAG(Recovered,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseRecovered
FROM [dbo].[time_series_covid19_global_narrow]

With this the data are ready for consumption:

-- sample query
SELECT *
FROM dbo.v_time_series_covid19
WHERE [Country Region]  LIKE '%China%'
  AND [Province State] LIKE '%Hubei%'
ORDER BY DATE

Of course, one can increase the value of this dataset by pulling further information, like the size of the population, the average density, or any other factors that could have impact on the propagation of the disease.
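As a sketch of such an enrichment, the view could be joined with a population table to calculate the confirmed cases per 100,000 inhabitants. The table dbo.T_CountryPopulation, its columns and the assumption that the ISO code identifies the country in both sources are hypothetical; the population data would have to be loaded from another source.

```sql
-- hypothetical reference table with the population per country
CREATE TABLE dbo.T_CountryPopulation (
  ISOCode3 varchar(3) NOT NULL PRIMARY KEY
, Population bigint NOT NULL
);
GO

-- confirmed cases per 100,000 inhabitants
SELECT DAT.[Country Region]
, DAT.[Date]
, DAT.Confirmed
, POP.Population
, CAST(DAT.Confirmed * 100000.0 / NULLIF(POP.Population, 0) AS decimal(18,2)) ConfirmedPer100K
FROM (
	-- aggregating to country level, as the population is assumed to be given per country
	SELECT [Country Region]
	, [ISO 3166-1 Alpha 3-Codes]
	, [Date]
	, SUM(Confirmed) Confirmed
	FROM dbo.v_time_series_covid19
	GROUP BY [Country Region]
	, [ISO 3166-1 Alpha 3-Codes]
	, [Date]
	) DAT
	LEFT JOIN dbo.T_CountryPopulation POP
	  ON DAT.[ISO 3166-1 Alpha 3-Codes] = POP.ISOCode3
ORDER BY DAT.[Country Region]
, DAT.[Date];
```

The LEFT JOIN keeps the countries for which no population is available, in which case the ratio is NULL.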

Instead of loading the data via the wizard, one can create an SSIS project. However, some of the corrections still need to be done manually, unless the corrections are included in the package's logic as well.

Happy coding!
