
24 April 2025

💎🏭SQL Reloaded: Microsoft Fabric's Lakehouses at Work (Part I: Proof-of-Concept)


Introduction

One way to work with the data files existing in an organization is to import them into a lakehouse and build a data model based on them that can be reused in various solutions (incl. Power BI). As a reminder, a lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location.

The tutorials available on lakehouses are pretty useful for getting an idea of how to start, though probably each seasoned professional has his/her own way of doing things, at least for testing the capabilities before building a proper solution. The target is thus to create the minimum needed for testing the capabilities of a proof-of-concept solution. 

The files used in this post are available on Microsoft's GitHub. Download the files and store them where they are easily available for the next steps. The following files were considered for the current post: customers.csv, orders.csv and products.csv.

Create the Lakehouse

It's recommended to have a naming convention in place for the various items created in a workspace, e.g. a combination of item type (lakehouse, warehouse), system type (Prod, UAT, Dev, PoC) and eventually department (e.g. FIN, SCM, HR, etc.). One should try to balance meaning and usefulness. Ideally, one should use 2 to maximum 4 letters for each part encoded in the name. For example, the following scripts were created in the LH_SCM_PoC lakehouse. 

More complex naming conventions can include the system (e.g. D365, CRM, EBS) or the company. The target is to easily identify the systems, independently of how complex the rules are. Given that it can become challenging to rename the schemas afterwards, the naming convention should ideally be in place from the early stages. 
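For illustration, a few hypothetical names following such a convention (the abbreviations below are only examples, not part of the solution built in this post):

-- example item names (hypothetical)
-- LH_SCM_PoC      -- lakehouse, Supply Chain Management, proof-of-concept
-- WH_FIN_Dev      -- warehouse, Finance, development
-- LH_HR_D365_Prod -- lakehouse, Human Resources, D365 as source system, production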

Create the Schema

A lakehouse comes with a dbo schema available by default, though it's recommended to create meaningful schema(s) as needed. The schemas should ideally reflect the domain of the data (e.g. departments or other key areas), and they shouldn't change when the objects are deployed between the different environments. If needed, one should consider creating multiple schemas that reflect the model's key areas. The names should be simple and suggestive.

-- create schema
CREATE SCHEMA Orders;

Create a Staging Area

The next step is to create a staging area where the files in scope can be made available and then further loaded into the lakehouse. One needs to compromise between creating a deep hierarchical structure that reflects the business structure and the need to easily identify, respectively manage the files. A hierarchical structure with 1-2 levels could provide the needed compromise, though each additional level tends to increase the complexity. 

One should also consider rules for archiving or even deleting the files.

Upload the Files

Microsoft Fabric allows users to upload multiple files in a single step. Ideally the files should have proper names for each column, otherwise overheads deriving from this may appear later in the process. 

When the files are available in multiple folders in a complex hierarchical structure, a set of shortcuts could help in their management.

Load the Data

A file's data can be loaded into the lakehouse on the fly by providing a valid table name:
Files >> SCM_Orders >> (select file) >> Load to Tables >> new table >> Load file to new table >> (provide information) >> Load

[Image: Load file to new table dialog]

Of course, the table's name must be unique within the schema, and the further properties must match the file's definition. 

One should consider loading first a couple of tables, performing a rough validation of the imported data (e.g. via quick record counts, as sketched below), and only after that importing the remaining tables. This helps identify the issues that typically lead to data reimports (wrong formatting, invalid column names, duplicated files, etc.) or rework.
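A minimal sketch for such a rough validation is a record count per table, compared afterwards against the number of records in the source files (the table names match the ones created in this post):

-- quick record counts for the imported tables
SELECT 'products' TableName
, count(*) RecordCount 
FROM orders.products
UNION ALL
SELECT 'customers'
, count(*) 
FROM orders.customers
UNION ALL
SELECT 'orders'
, count(*) 
FROM orders.orders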

If the files have different characteristics (e.g. delimiters, number of attributes/records, special data types), one should consider this in the initial scope and have at least one example from each category. 

Review the Metadata

Once the files are made available, there's the tendency to start directly with the development without analyzing the data or, equally important, the metadata available. To review the metadata of the newly created tables, one can use the objects from the standard INFORMATION_SCHEMA (see post):

-- retrieve the list of tables
SELECT * 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'orders'
ORDER BY TABLE_NAME  

Further on, one can review columns' definition:
 
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'orders'
ORDER BY TABLE_NAME
, ORDINAL_POSITION

It's a good idea to save the metadata to a file and use it later for reviews, respectively for metadata management, when no other solution (e.g. Purview) is in place for this purpose. That's useful also for people with limited or no access to the workspace. 

Alternatively, one can use a notebook with the following SQL-based commands: 

%%sql

DESCRIBE TABLE LH_SCM_PoC.orders.sales;

DESCRIBE TABLE EXTENDED LH_SCM_PoC.orders.sales;

One can even provide meaningful descriptions for each table and its attributes via scripts like the ones below, however it might be a good idea to do this in the later phases of the PoC, when the logic becomes stable:

%%sql

-- modify a table's COMMENT
COMMENT ON TABLE LH_SCM_PoC.orders.sales IS 'Customer sales orders';

-- modify columns' COMMENT for an existing table
ALTER TABLE LH_SCM_PoC.orders.sales  
ALTER COLUMN SalesOrderNumber COMMENT 'Sales Order Number';

Data Validation

Before diving into building any business logic, besides identifying the primary keys, foreign keys and further attributes used in bringing the data together, it's recommended to get an overview of the data's intrinsic and extrinsic characteristics relevant to the analysis. Some of the rules typically used for studying the quality of data apply to some extent here as well, though one needs to prioritize accordingly, otherwise one replicates the effort that's typically part of the Data Management initiatives. 

In addition, it's important to check how much the identified issues impact the business logic, respectively whether the issues can be corrected to match the expectations. Often, no logic can compensate for major data quality issues, and this can also affect the PoC's results as soon as the outcomes are validated against the expectations! 

Data Understanding 

Further on, it makes sense to get a high-level understanding of the data by looking at the distribution of values, respectively at the records participating in the joins. Of course, many similar queries can be built, though again, one should try to focus on the most important aspects!

The analysis could for example consider the following points:

/* validation of Products */

-- review duplicated product names (should be 0)
SELECT ProductName
, count(*) RecordCount
FROM orders.products
GROUP BY ProductName
HAVING count(*)>1

-- review most (in)expensive products
SELECT top 100 ProductID
, ProductName
, Category
, ListPrice 
FROM orders.products
ORDER BY ListPrice DESC --ASC

-- review category distribution
SELECT Category
, count(*) RecordCount 
FROM orders.products
GROUP BY Category
ORDER BY RecordCount DESC

-- review price ranges (by number of digits of the price)
SELECT Len(floor(ListPrice)) RangeCount
, count(*) RecordCount 
FROM orders.products
GROUP BY Len(floor(ListPrice)) 
ORDER BY RangeCount DESC

/* validation of Customers */

-- duplicated email address 
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT EmailAddress
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY EmailAddress 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.EmailAddress = CST.EmailAddress
ORDER BY DUP.RecordCount DESC
, DUP.EmailAddress 

-- duplicated Customer names (not necessarily duplicates)
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT FirstName
	, LastName
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY FirstName
	, LastName 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.FirstName = CST.FirstName
      AND DUP.LastName = CST.LastName
ORDER BY DUP.RecordCount DESC
, DUP.FirstName
, DUP.LastName

/* validation of Orders */

-- review a typical order
SELECT SalesOrderID
, OrderDate
, CustomerID
, LineItem
, ProductID
, OrderQty
, LineItemTotal
FROM orders.orders
WHERE SalesOrderID = 71780
ORDER BY SalesOrderID 
, LineItem

-- review orders' distribution by month
SELECT Year(OrderDate) Year
, Month(OrderDate) Month
, count(*) RecordCount
FROM orders.orders
GROUP BY Year(OrderDate) 
, Month(OrderDate) 
ORDER BY Year
, Month

-- checking for duplicates
SELECT SalesOrderID
, LineItem
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
, LineItem
HAVING count(*)>1

-- checking for biggest orders
SELECT SalesOrderID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
HAVING count(*) > 10
ORDER BY RecordCount DESC

-- checking for most purchased products
SELECT ProductID
, count(*) NoRecords
FROM orders.orders ord 
GROUP BY ProductID
HAVING count(*) > 8
ORDER BY NoRecords DESC

-- checking for most active customers
SELECT CustomerID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY CustomerID
HAVING count(*) > 10
ORDER BY RecordCount DESC

/* join checks */

-- Orders without Product (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ord 
	 LEFT JOIN orders.products prd
	   ON ord.ProductID = prd.ProductID
WHERE prd.ProductID IS NULL

-- Orders without Customer (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ORD 
	 LEFT JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID
WHERE CST.CustomerID IS NULL

-- Products without Orders (153 records)
SELECT count(*) RecordCount
FROM orders.products prd
	 LEFT JOIN orders.orders ord 
	   ON prd.ProductID = ord.ProductID 
WHERE ord.ProductID IS NULL


-- Customers without Orders (815 records)
SELECT count(*) RecordCount
FROM orders.customers CST
	 LEFT JOIN orders.orders ORD
	   ON ORD.CustomerID = CST.CustomerID
WHERE ORD.CustomerID IS NULL

The more tables are involved, the more complex the validation logic can become. One should focus on the most important aspects.

Building the Logic

Once one has an acceptable understanding of the data entities involved and the relations between them, it's time to build the needed business logic by joining the various tables at the various levels of detail. One can focus on the minimum required, respectively attempt to build a general model that can address a broader set of requirements. For a PoC it's usually recommended to start small by addressing the immediate requirements, though some flexibility might be needed for exploring the data and preparing the logic for a broader set of requirements. Independently of the scope, one should consider a set of validations. 

Usually, it makes sense to encapsulate the logic in several views or table-valued functions that reflect the logic for the main purposes and allow a high degree of reuse (see [1]). Of course, one can use the standard approach of modelling the bronze, silver, respectively the gold layers adopted by many professionals. For a PoC, even if that's not mandatory, it might still be a good idea to make steps in the respective direction. 

In this case, dealing with only three tables - a fact table and two dimension tables - there are several perspectives that can be built:

a) all records from fact table + dimension records

The following view provides the lowest level of detail for the fact table, allowing thus to look at the data from different perspectives, as long as the focus is only on the values used in the Sales Orders:

-- create the view
CREATE OR ALTER VIEW orders.vSalesOrders
-- Sales Orders with Product & Customer information
AS
SELECT ORD.SalesOrderID
, ORD.OrderDate
, ORD.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.LineItem
, ORD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.OrderQty
, ORD.LineItemTotal
, PRD.ListPrice 
, ORD.OrderQty * PRD.ListPrice ListPriceTotal
FROM orders.orders ORD 
	 JOIN orders.products PRD
	   ON ORD.ProductID = PRD.ProductID
	 JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID

-- test the view   
SELECT *
FROM orders.vSalesOrders
WHERE SalesOrderID = 71780

One can use inner joins as long as the referenced dimension records are always available; otherwise, left joins should be considered (see the sketch below).  
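If some orders could reference products or customers not (yet) available, a left join-based variant like the following could be used. This is only a sketch for illustration, not part of the solution above:

-- create a variant of the view based on left joins
CREATE OR ALTER VIEW orders.vSalesOrdersAll
-- Sales Orders with optional Product & Customer information
AS
SELECT ORD.SalesOrderID
, ORD.OrderDate
, ORD.CustomerID
, CST.FirstName 
, CST.LastName
, ORD.LineItem
, ORD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.OrderQty
, ORD.LineItemTotal
FROM orders.orders ORD 
	 LEFT JOIN orders.products PRD
	   ON ORD.ProductID = PRD.ProductID
	 LEFT JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID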

b) aggregated data for all dimension combinations

The previous view allows one to aggregate the data at various levels of detail:

-- Sales volume by Customer & Product
SELECT ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
, SUM(ORD.OrderQty) OrderQty
, SUM(ORD.LineItemTotal) LineItemTotal
FROM orders.vSalesOrders ORD 
WHERE ORD.OrderDate >= '2022-06-01'
  AND ORD.OrderDate < '2022-07-01'
GROUP BY ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
ORDER BY ORD.EmailAddress
, ORD.ProductName 

One can comment out the dimensions not needed. The query can be included in a view as well (see the sketch below). 
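For example, a hypothetical view like the one below encapsulates the aggregation over the whole period (only a sketch; when the date range must remain parameterizable, a table-valued function like the ones in the next section is the better fit):

-- create a view encapsulating the aggregation
CREATE OR ALTER VIEW orders.vSalesVolumeCustomerProduct
-- Sales volume by Customer & Product
AS
SELECT ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
, SUM(ORD.OrderQty) OrderQty
, SUM(ORD.LineItemTotal) LineItemTotal
FROM orders.vSalesOrders ORD 
GROUP BY ORD.EmailAddress
, ORD.ProductName 
, ORD.Category

-- test the view
SELECT *
FROM orders.vSalesVolumeCustomerProduct
ORDER BY LineItemTotal DESC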

c) all records from each dimension table + aggregated fact records

Sometimes, it's useful to look at the data from a dimension's perspective, though it might be needed to create such an object for each dimension, like in the examples below. For maximum flexibility, the logic can be included in a table-valued function:

-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfProductsSalesVolume(
    @StartDate date,
    @EndDate date
)
RETURNS TABLE
-- Sales volume by Product
AS
RETURN (
SELECT PRD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesQty, 0) TotalSalesQty 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.products PRD
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.OrderQty) TotalSalesQty
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.ProductID = PRD.ProductID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfProductsSalesVolume('2022-06-01','2022-07-01') PRD
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC


-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfCustomersSalesVolume(
    @StartDate date,
    @EndDate date
)
RETURNS TABLE
-- Sales volume by Customer
AS
RETURN (
SELECT CST.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.customers CST
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.CustomerID = CST.CustomerID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfCustomersSalesVolume('2022-06-01','2022-07-01') CST
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC

When restructuring the queries in similar ways, there's always a compromise between various factors: (re)usability, performance and completeness. 

Further Comments

The above database objects should allow users to address most of the requirements, though, as usual, there can also be exceptions, especially when the data needs to be aggregated at a different level of detail that requires the query to be structured differently.

The number of perspectives can also increase with the number of fact tables used to model a certain entity (e.g. Sales order headers vs. lines). 

In theory, one can also find ways to automate the process of creating database objects (see the sketch below), though one must choose the relevant attributes, respectively include logic that makes sense only within a certain perspective. 
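As a minimal sketch of the idea, assuming STRING_AGG is available in the SQL analytics endpoint, the metadata can be used to generate a basic SELECT statement per table; choosing the relevant attributes would still remain a manual step:

-- generate a SELECT statement per table from the metadata
SELECT 'SELECT ' 
 + STRING_AGG(COLUMN_NAME, ', ') WITHIN GROUP (ORDER BY ORDINAL_POSITION)
 + ' FROM ' + TABLE_SCHEMA + '.' + TABLE_NAME SelectStatement
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'orders'
GROUP BY TABLE_SCHEMA
, TABLE_NAME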

No matter the data, respectively the systems used as source, expect surprises and test your assumptions! For example, in the file used to create the orders.customers table, there seem to be duplicated entities with the same name and email address. One must clarify how such entities must be handled in data analysis, respectively in data modeling. For example, a person can appear twice because of the roles associated with the name, or there can be other legitimate reasons. 

The files in scope of this post are small compared with the files existing in organizations. In many scenarios file sizes could range from GBs to TBs and thus require partitioning and various other strategies. 


References
[1] sql-troubles (2023) Architecture Part IV: Building a Modern Data Warehouse with Azure Synapse [link]

Resources
[1] Microsoft Learn (2024) Fabric: Lakehouse and Delta Lake tables [link]

27 February 2024

🔖Book Review: Rolf Hichert & Jürgen Faisst's International Business Communication Standards (IBCS Version 1.2)

Over the last months I found several references to Rolf Hichert & Jürgen Faisst's booklet on business communication standards [1]. It drew my attention especially because it attempts to provide a standard for reports and data visualizations, which frankly seems like a tremendous endeavor if done right. The two authors founded the IBCS institute 20 years ago, which is a host, training institute, and certification body of the Creative Commons project called IBCS.

The 150-page booklet considers various standardization techniques with the help of more than 180 instructive figures, the overall structure being based on a set of principles and rules rooted in an acronym that spells "SUCCESS" - Say, Unify, Condense, Check, Express, Simplify, Structure. On one side, the principles seem to form a solid foundation; however, the foundation seems to suffer from the rigidity that results from fitting something into a nicely-spelled acronym. 

Say, or conveying a message, reflects the principle that each report should convey a message, otherwise the report is just a data collection. According to this "definition" most of the operational reports are just collections of data. Conversely, a lot of communication in organizations revolves around issues, metrics and decision making, scenarios in which the messages conveyed can be powerful though dependent on the business context. Settling on only one message can make the message fall short.

Unify, or applying semantic notation, reflects the principle that things that have the same meaning should look the same. There are many patterns out there that can be standardized, however it's questionable how much complex visualizations can be standardized, respectively how much liberty of expressing certain aspects the standardization allows. 

Condense, or increasing the information density, reflects the requirement that all information necessary to understanding the content should, if possible, be included on one page. This makes it easier to navigate the content and prioritize what the audience is able to see. The principle however seems to have more to do with the ink-information ratio principle (see [2]). 

Check, or ensuring visual integrity, reflects the principle that the information should be presented in the most truthful and most easily understood way. This is something that many data visualizations out there lack.

Express, or choosing the proper visualizations, is based on the principle that the visuals considered should be as intuitive as possible. In theory, the more intuitive a visual, the easier it is to understand and reuse, however this depends on the "visual vocabulary" and "visual grammar" of each individual. Intuition is something that needs to grow through the interplay of these two areas. Having the expectation of displaying everything in terms of basic elements is unrealistic and suboptimal. 

Simplify, or avoiding clutter, refers to eliminating the unnecessary from a visualization, until there's nothing left to take out without changing its meaning. At least the principle is correctly considered, even if it is in general difficult to apply, because quite often one needs to build something more complex and reduce the complexity through iterative steps until the simple is obtained. 

Structure, or organizing the content, is based on the principle that content should follow a (logically consistent) structure. The interplay between function and structure is an important topic in itself.

Browsing through the many data visualizations given as examples, I'd say that many of the recommendations make sense, though from there to a standardization is still a long way. The reader should evaluate the practices described against his/her own judgement and consider what seems to work. 

The book is available on the IBCS website as PDF, though the Kindle version is 40% cheaper. Overall, it is worth a read. 


Resources:
[1] Rolf Hichert & Jürgen Faisst (2022) "International Business Communication Standards (IBCS Version 1.2): Conceptual, perceptual, and semantic design of comprehensible business reports, presentations, and dashboards" (link)
[2] Edward R Tufte (1983) "The Visual Display of Quantitative Information"
[3] IBCS Institute (2024) About (link)

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part II: Quality Acceptance Criteria II)

Data Migration
Data Migrations Series

Auditability

Auditability is the degree to which the solution allows checking the data for their accuracy, or for their quality in general, respectively the degree to which the DM solution and processes can be audited regarding compliance, security and other types of requirements. All these aspects are important in case an external sign-off from an auditor is mandatory. 

Automation

Automation is the degree to which the activities within a DM can be automated. Ideally all the processes or activities should be automated, though other requirements might be impacted negatively. Therefore, one needs to find the right balance between the various requirements. 

Cohesion

Cohesion is the degree to which the tasks performed by the solution, respectively during the migration, are related to each other. Given the dependencies existing between data, their processing and further project-related activities, DMs imply a high degree of cohesion that needs to be addressed by design. 

Complexity 

Complexity is the degree to which a solution is difficult to understand given the various processing layers and dependencies existing within the data. The complexity of DMs revolves mainly around the data structures and the transformations needed to translate the data between the various data models. 

Compliance 

Compliance is the degree to which a solution is compliant with the internal or external regulations that apply. One should differentiate between mandatory requirements, recommendations and other requirements.

Consistency 

Consistency is the degree to which data conform to an equivalent set of data; in this case the entities considered for the DM need to be consistent with each other. A record referenced in any entity of the migration needs to be considered, respectively made available in the target system(s), either by parametrization or migration. 

During each iteration, the data need to remain consistent, so they can facilitate the troubleshooting. The data are usually reimported between iterations or during the same iteration, typically to reflect the changes that occurred in the source systems, or for other purposes. 

Coupling 

Data coupling is the degree to which different processing areas within a DM share the same data, typically a reflection of the dependencies existing between the data. Ideally, the areas should be decoupled as much as possible. 

Extensibility

Extensibility is the degree to which the solution or parts of the logic can be extended to accommodate further requirements. Typically, this involves changes that deviate from the standard functionality. Extensibility impacts positively the flexibility.

Flexibility 

Flexibility is the degree to which a solution can handle new requirements or ad-hoc changes to the logic. No matter how well everything was planned, there's always something forgotten or new information identified. Having the flexibility to change code or data on the fly can make an important difference. 

Integrity 

Integrity is the degree to which a solution prevents changes to data besides the ones considered by design. Users and processes should not be able to modify the data outside the agreed procedures. This means that the data need to be processed in the agreed sequence. All aspects related to data integrity need to be documented accordingly. 

Interoperability 

Interoperability is the degree to which a solution's components can exchange data and use the respective data. The various layers of a DM solution must be able to process the data, and this should be possible by design. 

Maintainability

Maintainability is the degree to which a solution can be modified to add minor features, change existing code, correct issues, refactor code, improve performance or address changes in the environment. The data required and the transformation rules are seldom known in advance. The data requirements are finalized during the various iterations, with the changes needing to be implemented as the iterations progress. Thus, maintainability is a critical requirement.


25 August 2019

🛡️Information Security: Digital Signature (Definitions)

"A form of electronic authentication of a digital document. Digital signatures are created and verified using public key cryptography and serve to tie the document being signed to the signer." (J P Getty Trust, "Introduction to Metadata" 2nd Ed., 2008)

"Data which proves that a document, message, or other piece of data was not modified since being processed and sent from a particular party." (Mark S Merkow & Lakshmikanth Raghavan, "Secure and Resilient Software Development", 2010)

"cryptographic transformations of data that allow a recipient of the data to prove the source (non-repudiation) and integrity of the data." (Manish Agrawal, "Information Security and IT Risk Management", 2014)

"Data that is appended to a message, made from the message itself and the sender’s private key, to ensure the authenticity of the message" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"Ensuring the authenticity and integrity of a message through the use of hashing algorithms and asymmetric algorithms. The message digest is encrypted with the sender’s private key." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK" 4th Ed., 2015)

"A means of authenticating that a message or data came from a particular source with a known system identity." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"An electronic signature based upon cryptographic methods of originator authentication, computed by using a set of rules and a set of parameters such that the identity of the signer and the integrity of the data can be verified." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide, 8th Ed", 2018)

"An encrypted means of identification that cannot be forged and that enables clients to validate servers and vice versa." (Microfocus)

"The combination of the private key, public key, message and hashing generates a digital signature. A digital signature is unique for every transaction and is a way to prove that the originator of the message has access to the private key." (AICPA)

25 July 2019

🧱IT: Blockchain (Definitions)

"A block chain is a perfect place to store value, identities, agreements, property rights, credentials, etc. Once you put something like a Bit coin into it, it will stay there forever. It is decentralized, disinter mediated, cheap, and censorship-resistant." (Kirti R Bhatele et al, "The Role of Artificial Intelligence in Cyber Security", 2019)

"A system made-up of blocks that are used to record transactions in a peer-to-peer cryptocurrency network such as bitcoins." (Murad Al Shibli, "Hybrid Artificially Intelligent Multi-Layer Blockchain and Bitcoin Cryptology", 2020)

"A chain of blocks containing data that is bundled together. This database is shared across a network of computers (so-called distributed ledger network). Each data block links to the previous block in the blockchain through a cryptographic hash of the previous block, a timestamp, and transaction data. The blockchain only allows data to be written, and once that data has been accepted by the network, it cannot be changed." (Jurij Urbančič et al, "Expansion of Technology Utilization Through Tourism 4.0 in Slovenia", 2020)

"A system in which a record of transactions made in Bitcoin or another cryptocurrency is maintained across several computers that are linked in a peer-to-peer network. Amany M Alshawi, "Decentralized Cryptocurrency Security and Financial Implications: The Bitcoin Paradigm", 2020)

"An encrypted ledger that protects transaction data from modification." (David T A Wesley, "Regulating the Internet, Encyclopedia of Criminal Activities and the Deep Web", 2020)

"Blockchain is a decentralized, immutable, secure data repository or digital ledger where the data is chronologically recorded. The initial block named as Genesis. It is a chain of immutable data blocks what has anonymous individuals as nodes who can transact securely using cryptology. Blockchain technology is subset of distributed ledger technology." (Umit Cali & Claudio Lima, "Energy Informatics Using the Distributed Ledger Technology and Advanced Data Analytics", 2020)

"Blockchain is a meta-technology interconnected with other technologies and consists of several architectural layers: a database, a software application, a number of computers connected to each other, peoples’ access to the system and a software ecosystem that enables development. The blockchain runs on the existing stack of Internet protocols, adding an entire new tier to the Internet to ensure economic transactions, both instant digital currency payments and complicated financial contracts." (Aslı Taşbaşı et al, "An Analysis of Risk Transfer and Trust Nexus in International Trade With Reference to Turkish Data", 2020) 

"Is a growing list of records, called blocks, which are linked using cryptography. Each block contains a cryptographic hash of the previous block a timestamp, and transaction data. (Vardan Mkrttchian, "Perspective Tools to Improve Machine Learning Applications for Cyber Security", 2020)

"This is viewed as a mechanism to provide further protection and enhance the security of data by using its properties of immutability, auditability and encryption whilst providing transparency amongst parties who may not know each other, so operating in a trustless environment." (Hamid Jahankhani & Ionuț O Popescu, "Millennials vs. Cyborgs and Blockchain Role in Trust and Privacy", 2020)

"A blockchain is a data structure that represents the record of each accounting move. Each account transaction is signed digitally to protect its authenticity, and no one can intervene in this transaction." (Ebru E Saygili & Tuncay Ercan, "An Overview of International Fintech Instruments Using Innovation Diffusion Theory Adoption Strategies", 2021)

"A system in which a record of transactions made in bitcoin or another cryptocurrency are maintained across several computers that are linked in a peer-to-peer network." (Silvije Orsag et al, "Finance in the World of Artificial Intelligence and Digitalization", 2021)

"It is a decentralized computation and information sharing platform that enables multiple authoritative domains, who don’t trust each other, to cooperate, coordinate and collaborate in a rational decision-making process." (Vinod Kumar & Gotam Singh Lalotra, "Blockchain-Enabled Secure Internet of Things", 2021)

"A concept consisting of the methods, technologies, and tool sets to support a distributed, tamper-evident, and reliable way to ensure transaction integrity, irrefutability, and non-repudiation. Blockchains are write-once, append-only data stores that include validation, consensus, storage, replication, and security for transactions or other records." (Forrester)

[hybrid blockchain:] "A network with a combination of characteristics of public and private blockchains where a blockchain may incorporate select privacy, security and auditability elements required by the implementation." (AICPA)

[private blockchain:] "A restricted access network controlled by an entity or group which is similar to a traditional centralized network." (AICPA)

"A technology that records a list of records, referred to as blocks, that are linked using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp and transaction data." (AICPA)

[public blockchain:] "An open network where participants can view, read and write data, and no one participant has control (e.g., Bitcoin, Ethereum)." (AICPA)

06 June 2013

🎓Knowledge Management: Ontology (Definitions)

"A data model that represents the entities that are defined and evaluated by its own attributes, and organized according to a hierarchy and a semantic. Ontologies are used for representing knowledge on the whole of a specific domain or on of it." (Gervásio Iwens et al, "Programming Body Sensor Networks", 2008)

"An ontology specifies a conceptualization, that is, a structure of related concepts for a given domain." (Troels Andreasen & Henrik Bulskov, "Query Expansion by Taxonomy", 2008)

"A semantic structure useful to standardize and provide rigorous definitions of the terminology used in a domain and to describe the knowledge of the domain. It is composed of a controlled vocabulary, which describes the concepts of the considered domain, and a semantic network, which describes the relations among such concepts. Each concept is connected to other concepts of the domain through semantic relations that specify the knowledge of the domain. A general concept can be described by several terms that can be synonyms or characteristic of different domains in which the concept exists. For this reason the ontologies tend to have a hierarchical structure, with generic concepts/terms at the higher levels of the hierarchy and specific concepts/terms at the lover levels, connected by different types of relations." (Mario Ceresa, "Clinical and Biomolecular Ontologies for E-Health", Handbook of Research on Distributed Medical Informatics and E-Health, 2009)

"In the context of knowledge sharing, the chapter uses the term ontology to mean a specification of conceptual relations. An ontology is the concepts and relationships that can exist for an agent or a community of agents. The chapter refers to designing ontologies for the purpose of enabling knowledge sharing and re-use." (Ivan Launders, "Socio-Technical Systems and Knowledge Representation", 2009)

 "The systematic description of a given phenomenon, which often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse. Typically, ontology defines data entities, data attributes, relations and possible functions and operations." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"Those things that exist are those things that have a formal representation within the context of a machine. Knowledge commits to an ontology if it adheres to the structure, vocabulary and semantics intrinsic to a particular ontology i.e. it conforms to the ontology definition. A formal ontology in computer science is a logical theory that represents a conceptualization of real world concepts." (Philip D. Smart, "Semantic Web Rule Languages for Geospatial Ontologies", 2009)

"A formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain." (Yong Yu et al, "Social Tagging: Properties and Applications", 2010)

"Is set of well-defined concepts describing a specific domain." (Hak-Lae Kim et al, "Representing and Sharing Tagging Data Using the Social Semantic Cloud of Tags", 2010)

"An ontology is a 'formal, explicit specification of a shared conceptualisation'. It is composed of concepts and relations structured into hierarchies (i.e. they are linked together by using the Specialisation/Generalisation relationship). A heavyweight ontology is a lightweight ontology (i.e. an ontology simply based on a hierarchy of concepts and a hierarchy of relations) enriched with axioms used to fix the semantic interpretation of concepts and relations." (Francky Trichet et al, "OSIRIS: Ontology-Based System for Semantic Information Retrieval and Indexation Dedicated to Community and Open Web Spaces", 2011)

"The set of the things that can be dealt with in a particular domain, together with their relationships." (Steven Woods et al, "Knowledge Dissemination in Portals", 2011) 

"In semantic web and related technologies, an ontology (aka domain ontology) is a set of taxonomies together with typed relationships connecting concepts from the taxonomies and, possibly, sets of integrity rules and constraints defining classes and relationships." (Marcus Spies & Said Tabet, "Emerging Standards and Protocols for Governance, Risk, and Compliance Management", 2012)

"High-level knowledge and data representation structure. Ontologies provide a formal frame to represent the knowledge related with a complex domain, as a qualitative model of the system. Ontologies can be used to represent the structure of a domain by means of defining concepts and properties that relate them." (Lenka Lhotska et al, "Interoperability of Medical Devices and Information Systems", 2013)

"(a) In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between pairs of concepts. It can be used to model a domain and support reasoning about concepts. (b) In philosophy, ontology is the study of the nature of being, becoming, existence , or reality , as well as the basic categories of being and their relations. Traditionally listed as a part of the major branch of philosophy known as metaphysics, ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences." (Ronald J Lofaro, "Knowledge Engineering Methodology with Examples", 2015)

"It is a shared structure which classify and organizes all the entities of a given domain." (T R Gopalakrishnan Nair, "Intelligent Knowledge Systems", 2015)

"The study of how things relate. Used in big data to analyze seemingly unrelated data to discover insights." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"An ontology is a formal, explicit specification of a shared conceptualization." (Fu Zhang et al, "A Review of Answering Queries over Ontologies Based on Databases", 2016)

15 August 2009

🛢DBMS: Lock/Locking (Definitions)

"The process of restricting access to resources in a multi-user environment to maintain security and prevent concurrent access problems. SQL Server automatically applies locks to tables or pages." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"SQL Server issues locks to prevent users from interfering with each other's work." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"A restriction on access to a resource in a multiuser environment. SQL Server locks users out of a specific record, field, or file automatically to maintain security or prevent concurrent data manipulation problems." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A method the DBMS uses to prevent concurrent transactions from interfering with one another. Physically, a lock is one of three things: a latch, a mark on the wall, or a RAM record." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"A method of ensuring concurrency. Locking enables users to temporarily "check out" an object, preventing other users from changing the object, for the purpose of ensuring consistency." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"SQL Server uses locks to prevent multiple users from modifying the same data at the same time." (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"A mechanism used by a concurrent system to prevent data anomalies by isolating transactions from each other." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"A lock is an access restriction placed on part of a database to prevent other users or processes from viewing or modifying data as it is being viewed or modified by one process. Locks can be placed on rows, pages, extents, tables, or databases." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"MongoDB uses locks to ensure that concurrency does not affect correctness. MongoDB uses read locks, write locks and intent locks. For more information, see What type of locking does MongoDB use?." (MongoDb, "Glossary", 2008)

"Used to control access to part of the database. For example, while one user updates a row, the database places a lock on the row so other users cannot interfere with the update. Different databases may lock data by rows, table, or disk page." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"The processing of giving a transaction exclusive rights to view and/or update a database element to prevent problems that arise with interleaved transaction execution." (Jan L Harrington, "SQL Clearly Explained 3rd Ed. ", 2010)

"A DBMS function used to ensure the integrity of data. When a database resource is locked by one process, another process is not permitted to change the locked data. Locking is necessary to enable the DBMS to facilitate the ACID properties of transaction processing." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures" 2nd Ed, 2012)

"A restriction on access to a resource in a multiuser environment." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A means of preventing uncommitted changes made by one application process from being perceived by another application process and for preventing one application process from updating data that is being accessed by another process. A lock ensures the integrity of data by preventing concurrent users from accessing inconsistent data. A means of serializing a sequence of events or serializing access to data." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The process by which a DBMS restricts access to a row in a multiuser environment. The DBMS usually sets a bit on a row or the physical page containing a row that indicates the row or page is locked." (Microsoft) 

14 June 2009

🛢DBMS: Domain Integrity (Definitions)

"Describes the inclusion of attribute rules (for example, maximum discount and minimum order quantity) within the design of a database." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"Integrity that enforces valid entries for a given column. Domain integrity is enforced by restricting the type (through data types), the format (through CHECK constraints and rules), or the range of possible values (through REFERENCE and CHECK constraints, and rules)." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Domain integrity enforces the validity of entries for a given column. The mechanism, such as the CHECK constraint, can restrict the possible data values by data type, format, or range of values allowed." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A relational database integrity mechanism that enforces the validity of data at the column level." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"The validity of entries for a specific column of data." (Microsoft, "SQL Server 2012 Glossary", 2012)

12 June 2009

🛢DBMS: Entity Integrity (Definitions)

"Within a table, each row describes an entity that is a member of the set kept in the table. Entity integrity ensures that each row in the table is uniquely identifiable." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"Integrity that defines a row as a unique entity for a particular table and ensures that the column cannot contain duplicate values. It usually enforces the primary key of a table (through indexes, UNIQUE constraints, or PRIMARY KEY constraints)." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Entity integrity refers to a state in which all the rows in a database have a non-null primary key value, all tables have primary keys, and no table has any duplicate primary key values. Entity integrity ensures there are no duplicate entries for anything represented in the database." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A relational database integrity mechanism that ensures that duplicate rows do not exist in a table. Requiring that all rows in a table have a unique identifier." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Integrity that defines a row as a unique entity for a particular table and ensures that the column cannot contain duplicate values." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Requires that all tables have a primary key. The values in the primary key fields must be non-null and no two records can have the same primary key values." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A constraint on a relation that states that no part of the primary key can be null." (Jan L Harrington, "Relational Database Design and Implementation' 3rd Ed., 2009)

"The property of a relational table that guarantees that each entity has a unique value in a primary key and that there are no null values in the primary key." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"A state in which every row of every table can be uniquely identified." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The most basic level of data integrity provided by relational databases stating that each occurrence of an entity must be uniquely identifiable." (Craig S Mullins, "Database Administration", 2012)

01 April 2009

🛢DBMS: Data Integrity (Definitions)

"The correctness and completeness of data within a database." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A general term that refers to the correctness of the data contained in a database." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"The accuracy and reliability of data. Data integrity is important in both single-user and multiuser environments. In multiuser environments, where data is shared, both the potential for and the cost of data corruption are high. In large-scale relational database management system (RDBMS) environments, data integrity is a primary concern." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Data integrity refers to a state in which all the data values stored in the database are correct." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"The condition that exists when there’s no accidental or intentional destruction, alteration, or loss of data." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The bits of data that are put in storage (via I/O writes) are the same bits of data—order and completeness - that come out (via I/O reads)." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"In a relational database, refers to a condition in which the data in the database is in compliance with all entity and referential integrity constraints." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management 9th Ed", 2011)

"The accuracy of data and its conformity to its expected value, especially after being transmitted or processed." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Refers to the accuracy and quality of the data." (Steve Conger, "Hands-on database : an introduction to database design and development", 2012)

"Data integrity is the state of data being free from corruption." (Vince Buffalo, "Bioinformatics Data Skills", 2015)

"The property that data has not been altered in an authorized manner." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"The degree to which the data is internal or referential/consistent. If the key to refer to a different table is invalid, the join between the two tables cannot be made." (Piethein Strengholt, "Data Management at Scale", 2020)

"(1) In the context of data and network security: The assurance that information can only be accessed or modified by those authorized to do so. (2) In the context of data quality: The assurance the data are clean, traceable, and fit for purpose." (CODATA)

"The degree to which a collection of data is complete, consistent, and accurate. See also: data security; database integrity; integrity." (IEEE 610.5-1990)

13 March 2009

🛢DBMS: Relational Model (Definitions)

"A method of organizing data into two-dimensional tables made up of rows and columns. The model is based on the mathematical theory of relations, a part of set theory." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A model that provides a two-dimensional structure to data. The relational database model more or less throws out the window the concept and restriction of a hierarchical structure, but does not completely abandon data hierarchies. Any table can be accessed directly with having to access all parent objects. Precise data values (such as primary keys) are required to facilitate skirting the hierarchy (to find individual records) in specific tables." (Gavin Powell, "Beginning Database Design", 2006)

"A paradigm for describing the structure of a database in which entities are represented as tables, and relationships between the entities are represented by matching data." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"The relational model, based on mathematical set theory, represents data as independent relations. Each relation (table) is conceptually represented as a matrix of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns)." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"A database model based on first-order predicate logic [...]" (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures", 2012)

"A form of data where data is normalized" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A type of model that aims to identify relationships of interest and quantify the strength of relationship between individuals or entities. Common examples include market basket analysis and social network analysis." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Data represented as a set of related tables or relations." (Jeffrey A Hoffer et al, "Modern Systems Analysis and Design" 7th Ed., 2014)

"A database model in which data and the relationships among them are organized into tables" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"Relational modeling is a popular data modeling technique to reduce the duplication of data and ensure the referential integrity of the data." (Piethein Strengholt, "Data Management at Scale", 2020)

"(1) A data model whose pattern or organization is based on a set of relations, each of which consists of an unordered set of tuples. (2) A data model that provides for the expression of relationships among data elements as formal mathematical relations." (IEEE 610.5-1990)

