
18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

Data Warehousing
Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency, fully managed service that allows replicating data from various systems into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though more database vendors will probably be targeted soon. 

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is of huge importance because they no longer need to worry about making the data available in Microsoft Fabric, which otherwise involves a considerable amount of work. 

Usually, at least for flexibility, transparency, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake, from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which is sometimes a 10-15-minute effort per table when the process is standardized, though in some cases the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed. 

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines and the time needed for importing the data [1]. There's some truth in each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, with all the implications that derive from this - not having current data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done. 

With mirroring, the replication of data between the data sources and the analytics platform is handled in the background: after an initial replication, the changes in the source systems are reflected with near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions that can help the business in many ways - reviewing (and correcting in the data source) records en masse, a faster overview of what's happening in the organization, a faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. From this perspective alone, probably 10-20% of the effort of building an analytics solution is saved.

Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), and stop, restart, or remove a table from mirroring. Moreover, through sharing, users can grant other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]. 

Data professionals and data citizens can then write cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query. This opens a lot of opportunities, especially in what concerns the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.
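
For example, such a cross-database query could look as follows (a minimal sketch; the database, schema, table and column names are hypothetical):

-- combining data from a mirrored database and a lakehouse SQL analytics endpoint
SELECT cst.CustomerId
, cst.CustomerName
, SUM(ord.OrderAmount) total_amount
FROM MirroredSalesDB.dbo.Customers cst -- table from the mirrored database
     JOIN SalesLakehouse.dbo.Orders ord -- table from the lakehouse SQL analytics endpoint
       ON cst.CustomerId = ord.CustomerId
GROUP BY cst.CustomerId
, cst.CustomerName
ORDER BY total_amount DESC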

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.


References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

13 February 2024

🧭Business Intelligence: A One-Man Show (Part IV: Data Roles between Past and Future)

Business Intelligence Series

Databases nowadays are highly secure, reliable and available to a degree that reduces the involvement of DBAs to a minimum. The more databases and servers are available in an organization, and the older they are, the bigger the need for dedicated resources to manage them. The number of DBAs involved tends to be proportional to the volume of work required by the database infrastructure. However, if the infrastructure is in the cloud, managed by the cloud providers, it's enough to have a person in the middle who manages the communication between the cloud provider(s) and the organization. That person doesn't even need to be a DBA, even if some knowledge in the field is usually recommended.

The requirement for a Data Architect comes when there are several systems in place and there are multiple projects to integrate or build around the respective systems. It's also a question of what drives the respective requirement - is it the knowledge of data architectures, the supervision of changes, and/or the review of technical documents? The requirement is thus driven by the projects in progress and those waiting in the pipeline. Conversely, if all the systems are in the cloud and their integration is standardized or doesn't involve much architectural knowledge, the role becomes obsolete or at least not mandatory. 

The Data Engineer role is a bit more challenging to define because it appeared in the context of cloud-based data architectures. It seems to be related to the movement of data via ETL/ELT pipelines, respectively to data processing and preparation for the various needs. Data modeling or data presentation knowledge isn't mandatory, even if ideal. The role seems to overlap with the one of a Data Warehouse professional, be it an architect or a developer. The role's know-how also depends on the tools involved, because it's one thing to build a solution based on a standard SQL Server, and another thing to use dedicated layers and architectures for the various purposes. The number of engineers should be proportional to the number of data entities involved.

Conversely, the existence of solutions that move and process the data as needed can reduce the volume of work. Moreover, the use of AI-driven tools like Copilot might shift the focus from data engineering to prompt engineering. 

The Data Analyst role is kind of a Cinderella - depending on the case, it can involve everything from requirements elicitation to report writing and results' interpretation, respectively from data collection and data modeling to data visualization. If you have a special wish related to your data, just add it to the role! The number of analysts should be related to the number of issues existing in the organization where the collection and processing of data could make a difference. Conversely, the Data Citizen, even if it's not a role but a desirable state of affairs, could in theory absorb the Data Analyst role.

The Data Scientist is supposed to reveal the gems of knowledge hidden in the data by using Machine Learning, Statistics and other magical tools. The more data available, the higher the chances of finding something, even if probably statistically insignificant or incorrect. The role makes sense mainly in the context of big data, even if some opportunities might be available at smaller scales. The number of scientists depends on the number of projects focused on the big questions. Again, one also talks about the Citizen Data Scientist. 

The Information Designer role seems to be more about data visualization and presentation. It makes sense in organizations that rely heavily on visual content. All the other organizations can rely on the default settings of data visualization tools, independently of whether AI is involved or not. 


27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science
Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in what concerns performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy it is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires, besides the language itself, also some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, which are their advantages and disadvantages, respectively how they might impact the processing.

Drawing a parallel with writing literature, it's not enough to speak a language; one needs more to become a writer, and there are so many levels of mastery! However, in the database world, even if creativity is welcomed, its role is considerably diminished by the constraints existing in the database engine, the problems to be solved, and the time and resources available. More importantly, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables - what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operator algebras, algorithms, and on general database technologies and architectures, without much hands-on experience. It's all too theoretical and abstract, which is acceptable for research purposes, but not for the contact with the real world out there! Probably some labs offer exposure to real-life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of the art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence gained from doing a few tests, and the eagerness of shouting one's shaky knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change - the increased complexity of the problems to solve, and, considering the recent technological advances, one can now afford an AI learning buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper into some topics. Moreover, it can help us in time to address the challenges of the ever-increasing complexity of the problems.

04 February 2021

📦Data Migrations (DM): Conceptualization (Part VII: Data Import Layer)

Data Migration
Data Migrations Series

The data requirements for the Data Migration (DM) and Data Quality (DQ) are driven by the processes implemented in the target system(s). Therefore, a good knowledge of these requirements can decrease the effort needed for these two subprojects considerably. The needed knowledge basis starts with the entities and their attributes, the dependencies existing between them and the various rules that apply, and ends with the parametrization requirements, respectively the architecture(s) that can be used to import the data.

The DM process starts with defining the entities in scope and their attributes, respectively identifying the corresponding entities and attributes from the legacy systems. The attributes not having a correspondent in the legacy systems need to be provided by the business and integrated in the DM logic. In addition, one also needs to consider the attributes needed by the business but not available in the target system, some of them more likely available in the legacy systems. For such attributes one needs either to repurpose an attribute of the target system or to extend the target system.

For each entity a data mapping is created that documents the data transformations needed for migrating the data. In the process one also needs to consider the attributes' data types, the (standard) formatting, their domain of definition, as well as the various rules that apply. Their implementation belongs into the DM layer, from which the data are exported in a standard format as needed by the target system.
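
A minimal sketch of how such a mapping could be implemented in the DM layer as a view (the entity, the source and target attribute names, data types and formatting rules are all hypothetical):

-- hypothetical data mapping for the Customers entity in the DM layer
CREATE VIEW dbo.vCustomers_Map
AS
SELECT Cast(src.CustomerNo as nvarchar(20)) CustomerAccount -- key mapping
, Upper(IsNull(src.CountryCode, 'DE')) CountryRegionId -- default value & formatting
, Cast(src.CreditLimit as decimal(18,2)) CreditLimit -- data type alignment
, Convert(nvarchar(10), src.CreatedOn, 104) CreatedDate -- date formatting (dd.mm.yyyy)
FROM dbo.Customers_Legacy src
WHERE src.IsActive = 1 -- only the records in scope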

Exporting the data from the DM layer directly into the target system's tables has in theory the lowest overhead, even if the rejected records are difficult to track, the rejections resulting only from the records' validation against the database schema. For this approach to work, one must have a good knowledge of the database schema and of the business rules implemented into the target system.
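
Assuming direct access to the target tables, such a load could be sketched as follows (the target database, table and column names are hypothetical):

-- hypothetical direct load from the DM layer into a target table
INSERT INTO TargetDB.dbo.CustomerAccounts (AccountNumber, CountryRegionId, CreditLimit)
SELECT dm.CustomerAccount
, dm.CountryRegionId
, dm.CreditLimit
FROM dbo.vCustomers_Map dm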

To solve the issue with errors' logging, systems have a further layer on top of the database model, which also allows running data validation against the target system's business rules. Modern import frameworks allow loading the data via a set of standard files with a predefined structure. The data can thus be imported manually or via load jobs into the system, a log with the issues being generated in the process. Some frameworks even allow the manual editing of failed records, respectively reimporting the data. Unfortunately, calling this layer from the DM layer is not possible from a database, though this would seldom bring a benefit. Some third-party tools attempt to improve the import functionality by calling the target system's import layer.

The import files must be generated from the DM layer in the required structure and with the appropriate formatting. The challenge however resides in identifying all the attributes that should make the scope of the load. It's an iterative process which sometimes is backed by trial-and-error heuristics. Unless the target system's validation rules are known beforehand, the rules need to be discovered in this process, which can prove time-consuming. The discoveries also need to be integrated into the DM logic, and from here results the big number of changes that need to be performed.

Given the dependencies existing between entities, the files need to be generated and loaded in a predefined order. These dependencies are reflected also in the data processing and the validation rules considered in the DM layer.

A quality checkpoint can be implemented between the export from the DM layer and the import, to enforce the four-eyes principle. It's normally the last opportunity for trapping the eventual issues. A further quality check is performed after the import, by validating whether the data were imported as expected.
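
Such a post-import check can be as simple as comparing the exported records against the imported ones (a sketch based on the hypothetical names used above):

-- hypothetical post-import check: records exported from the DM layer but missing in the target
SELECT dm.CustomerAccount
FROM dbo.vCustomers_Map dm
     LEFT JOIN TargetDB.dbo.CustomerAccounts trg
       ON dm.CustomerAccount = trg.AccountNumber
WHERE trg.AccountNumber IS NULL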


📦Data Migrations (DM): Conceptualization (Part VI: Data Migration Layer)

Data Migration
Data Migrations Series

Besides migrating the master and transactional data from the legacy systems, there are usually three additional important business requirements for a Data Migration (DM) - migrate the data within the expected timeline, with minimal disruption for the business, respectively within the expected quality levels. Hence, the DM's timeline must match and synchronize with the main project's timeline in terms of main milestones, though the DM typically needs to be executed within a small timeframe of a few days during the Go-Live. In what concerns the third requirement, even if the data have high quality as available in the source systems or as provided by the business, there are aspects like integration and consistency that rely primarily on the DM logic.

To address these requirements, the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From the project's beginning until UAT, the DM team will integrate the various information iteratively, will need to test the changes several times, and will troubleshoot the deviations from expectations. The volume of effort required for these activities can be overwhelming. It's not only important for the whole solution to be performant, but each step must be designed so that, besides fast execution, the changes and troubleshooting involve a minimum of overhead.

For a better understanding of the importance, imagine a quest game in which the character has to go through a labyrinth with traps. If the player makes a mistake, he'll need to restart from a certain distant point or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back, trying a new option and moving forward. For some it may look like cheating, though in this way one can finish the game relatively quickly. It would be great if executing a DM could allow the same flexibility.

Unfortunately, unless the data are stored between steps or each step is a different package, an ETL solution doesn't provide the flexibility of changing the code, moving one step back, rerunning the step and performing troubleshooting, and this over and over again like in the quest game. To better illustrate the impact of such an approach, let's consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards, each change will probably take only a few minutes to execute. Otherwise, rerunning a whole package can take 5-10 times longer or even more, depending on the package's size and the data volume. For 800 changes, just one additional minute per change equates to 800 minutes (about 13 hours).

By contrast, storing the data for an entity in a database at the important points of the processing and implementing the logic as a succession of SQL scripts allows this flexibility. The most important downside is that the steps need to be executed manually, though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas as in the case of a phased DM.
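
A minimal sketch of the approach (the entity, table and rule names are hypothetical) - the result of each important step is persisted into its own table, so a step can be rerun or troubleshot without reprocessing everything that precedes it:

-- step 1: extract the legacy records in scope (persisted for reuse)
SELECT src.CustomerNo
, src.CustomerName
, src.CountryCode
INTO dbo.Customers_Step1
FROM dbo.Customers_Legacy src
WHERE src.IsActive = 1

-- step 2: apply the transformations on top of the previous step's output
SELECT stp.CustomerNo
, Upper(Trim(stp.CustomerName)) CustomerName
, IsNull(stp.CountryCode, 'DE') CountryCode
INTO dbo.Customers_Step2
FROM dbo.Customers_Step1 stp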

To ensure that the consistency of the data is kept, one needs to build for each entity a set of validation queries that check for duplicates, for special cases, for data integrity, incorrect formats, etc. The queries can be included in the sequence of logic used for the DM. Thus, one can react promptly to each unexpected value. When required, the validation rules can be built within reports and used by users in the data cleaning process, or even logged periodically per entity for tracking the progress.
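
Two such validation queries could look as follows (a sketch built on the hypothetical staging table from the previous example):

-- duplicate check: customers with the same number
SELECT stp.CustomerNo
, count(*) no_records
FROM dbo.Customers_Step2 stp
GROUP BY stp.CustomerNo
HAVING count(*) > 1

-- format check: country codes not matching the expected two-letter pattern
SELECT stp.CustomerNo
, stp.CountryCode
FROM dbo.Customers_Step2 stp
WHERE stp.CountryCode NOT LIKE '[A-Z][A-Z]'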


13 January 2021

💠🛠️SQL Server: Administration (Monitoring the Database Logs)

One of the aspects to monitor on a SQL Server instance is the size of the logs available for each database, respectively the degree to which the logs are used. Starting with SQL Server 2005 this can be achieved by using the 'Log File(s) Used Size (KB)' and 'Log File(s) Size (KB)' counters via the sys.dm_os_performance_counters DMV as follows:

-- log files - size (kb)
SELECT lfs.instance_name database_name
, lfs.cntr_value size_kb
, Cast(lfs.cntr_value/1024.00 as decimal(18,2)) size_MB
FROM sys.dm_os_performance_counters lfs
WHERE lfs.counter_name LIKE 'Log File(s) Size (KB)%' 
  AND lfs.object_name LIKE 'SQLServer:Databases%'
  AND lfs.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfs.instance_name

-- log files - used size (kb)
SELECT lfu.instance_name database_name
, lfu.cntr_value used_size_kb
, Cast(lfu.cntr_value/1024.00 as decimal(18,2)) used_size_MB
FROM sys.dm_os_performance_counters lfu
WHERE lfu.counter_name LIKE 'Log File(s) Used Size (KB)%' 
  AND lfu.object_name LIKE 'SQLServer:Databases%'
  AND lfu.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfu.instance_name

The two queries can be combined into one as follows:

-- database log space allocation (SQL Server 2005+)
SELECT db.name database_name
, db.log_reuse_wait_desc 
, Cast(lfs.cntr_value/1024.00 as decimal(28,2)) size_MB
, Cast(lfu.cntr_value/1024.00 as decimal(28,2)) AS used_MB
, Cast(100.00*lfu.cntr_value/lfs.cntr_value as decimal(10,2)) used_percent 
, CASE WHEN CAST(lfu.cntr_value AS float) / CAST(lfs.cntr_value AS float) > .5 THEN 
   CASE 
    WHEN db.name = 'tempdb' AND log_reuse_wait_desc NOT IN ('CHECKPOINT', 'NOTHING') THEN 'WARNING'  
    WHEN db.name <> 'tempdb' THEN 'WARNING' 
    ELSE 'OK' 
    END 
  ELSE 'OK' END log_status 
FROM sys.databases db 
     JOIN sys.dm_os_performance_counters lfs 
       ON db.name = lfs.instance_name 
      AND lfs.counter_name LIKE 'Log File(s) Size (KB)%' 
     JOIN sys.dm_os_performance_counters lfu 
       ON db.name = lfu.instance_name 
      AND lfu.counter_name LIKE  'Log File(s) Used Size (KB)%' 
WHERE db.name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY db.name 

Output:
database_name  log_reuse_wait_desc  size_MB  used_MB  used_percent  log_status
master         NOTHING                 1.99     1.11         55.78  WARNING
model          NOTHING                 7.99     1.74         21.80  OK
msdb           NOTHING                28.80     1.96          6.81  OK
tempdb         NOTHING               999.99     0.69          0.07  OK

Starting with SQL Server 2012 the same information can be obtained via the sys.dm_db_log_space_usage DMV; however, the view returns information only for the current database:

-- getting the log space only for a database (SQL Server 2012+)
SELECT db_name(database_id) database_name 
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2)) size_MB
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2)) used_MB
, Cast(used_log_space_in_percent as decimal(28,2)) used_percent
FROM sys.dm_db_log_space_usage

Output:
database_name  size_MB  used_MB  used_percent
master            1.99     1.19         59.61
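
To get the same values for all databases, one could, for example, collect the output of the DMV per database into a temporary table (a sketch only; sp_MSforeachdb is undocumented and unsupported, so use it with care):

-- collecting the log space usage for all databases via the undocumented sp_MSforeachdb
CREATE TABLE #log_space (
  database_name sysname
, size_MB decimal(28,2)
, used_MB decimal(28,2)
, used_percent decimal(28,2)
)

EXEC sp_MSforeachdb 'USE [?];
INSERT INTO #log_space
SELECT db_name(database_id)
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_percent as decimal(28,2))
FROM sys.dm_db_log_space_usage;'

SELECT *
FROM #log_space
ORDER BY database_name

DROP TABLE #log_space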

With less flexibility, one can obtain the log size in MB and the used percentage for all databases by using the DBCC utility as follows:

-- retrieving the log usage for all databases
DBCC SQLPERF(LOGSPACE); 

Output:
Database Name         Log Size (MB)  Log Space Used (%)  Status
master                     1,992188            33,13726       0
tempdb                     999,9922          0,06352589       0
model                      7,992188            21,11437       0
msdb                       28,80469            6,617846       0
AdventureWorks2014         33,99219            14,76672       0
AdventureWorksDW2014       17,99219             22,6878       0

Notes:
1. All the mentioned objects require VIEW SERVER STATE permissions.
2. The solution based on the performance counters returns slightly different values than the other solutions, though the differences are negligible. 

Resources: 
[1] SQL Docs (2017) sys.dm_os_performance_counters [source]
[2] SQL Docs (2017) DBCC SQLPERF [source]
[3] SQL Docs (2017) sys.dm_db_log_space_usage [source]

21 August 2019

🛡️Information Security: SQL Injection Attack (Definitions)

"This is a way that hackers can bring a database down. SQL injection attacks can be avoided by using stored procedures with the appropriate configured parameters." (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"An attack on a database made by inserting escape characters or additional commands into a batch, allowing the attacker to run commands on the database server. This exploits poor validation or weak designs in application code that allow extra commands to be submitted to the server." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"An Internet attack against a database accessible via a web page. Automated programs are available to launch attacks, and successful SQL injection attacks can obtain the entire layout of a database and all the data." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"An attack against a database system launched through an application program containing embedded SQL." (Jan L Harrington, "Relational Database Design and Implementation: Clearly explained" 3rd Ed., 2009)

"A type of attack designed to break through database security and access the information. A SQL injection attack “injects” or manipulates SQL code." (Mike Harwood, "Internet Security: How to Defend Against Attackers on the Web 2nd Ed.", 2015)


25 July 2019

💻IT: Blockchain (Definitions)

"A block chain is a perfect place to store value, identities, agreements, property rights, credentials, etc. Once you put something like a Bit coin into it, it will stay there forever. It is decentralized, disinter mediated, cheap, and censorship-resistant." (Kirti R Bhatele et al, "The Role of Artificial Intelligence in Cyber Security", 2019)

"A system made-up of blocks that are used to record transactions in a peer-to-peer cryptocurrency network such as bitcoins." (Murad Al Shibli, "Hybrid Artificially Intelligent Multi-Layer Blockchain and Bitcoin Cryptology", 2020)

"A chain of blocks containing data that is bundled together. This database is shared across a network of computers (so-called distributed ledger network). Each data block links to the previous block in the blockchain through a cryptographic hash of the previous block, a timestamp, and transaction data. The blockchain only allows data to be written, and once that data has been accepted by the network, it cannot be changed." (Jurij Urbančič et al, "Expansion of Technology Utilization Through Tourism 4.0 in Slovenia", 2020)

"A system in which a record of transactions made in Bitcoin or another cryptocurrency is maintained across several computers that are linked in a peer-to-peer network. Amany M Alshawi, "Decentralized Cryptocurrency Security and Financial Implications: The Bitcoin Paradigm", 2020)

"An encrypted ledger that protects transaction data from modification." (David T A Wesley, "Regulating the Internet, Encyclopedia of Criminal Activities and the Deep Web", 2020)

"Blockchain is a decentralized, immutable, secure data repository or digital ledger where the data is chronologically recorded. The initial block named as Genesis. It is a chain of immutable data blocks what has anonymous individuals as nodes who can transact securely using cryptology. Blockchain technology is subset of distributed ledger technology." (Umit Cali & Claudio Lima, "Energy Informatics Using the Distributed Ledger Technology and Advanced Data Analytics", 2020)

"Blockchain is a meta-technology interconnected with other technologies and consists of several architectural layers: a database, a software application, a number of computers connected to each other, peoples’ access to the system and a software ecosystem that enables development. The blockchain runs on the existing stack of Internet protocols, adding an entire new tier to the Internet to ensure economic transactions, both instant digital currency payments and complicated financial contracts." (Aslı Taşbaşı et al, "An Analysis of Risk Transfer and Trust Nexus in International Trade With Reference to Turkish Data", 2020) 

"Is a growing list of records, called blocks, which are linked using cryptography. Each block contains a cryptographic hash of the previous block a timestamp, and transaction data. (Vardan Mkrttchian, "Perspective Tools to Improve Machine Learning Applications for Cyber Security", 2020)

"This is viewed as a mechanism to provide further protection and enhance the security of data by using its properties of immutability, auditability and encryption whilst providing transparency amongst parties who may not know each other, so operating in a trustless environment." (Hamid Jahankhani & Ionuț O Popescu, "Millennials vs. Cyborgs and Blockchain Role in Trust and Privacy", 2020)

"A blockchain is a data structure that represents the record of each accounting move. Each account transaction is signed digitally to protect its authenticity, and no one can intervene in this transaction." (Ebru E Saygili & Tuncay Ercan, "An Overview of International Fintech Instruments Using Innovation Diffusion Theory Adoption Strategies", 2021)

"A system in which a record of transactions made in bitcoin or another cryptocurrency are maintained across several computers that are linked in a peer-to-peer network." (Silvije Orsag et al, "Finance in the World of Artificial Intelligence and Digitalization", 2021)

"It is a decentralized computation and information sharing platform that enables multiple authoritative domains, who don’t trust each other, to cooperate, coordinate and collaborate in a rational decision-making process." (Vinod Kumar & Gotam Singh Lalotra, "Blockchain-Enabled Secure Internet of Things", 2021)

"A concept consisting of the methods, technologies, and tool sets to support a distributed, tamper-evident, and reliable way to ensure transaction integrity, irrefutability, and non-repudiation. Blockchains are write-once, append-only data stores that include validation, consensus, storage, replication, and security for transactions or other records." (Forrester)

[hybrid blockchain:] "A network with a combination of characteristics of public and private blockchains where a blockchain may incorporate select privacy, security and auditability elements required by the implementation." (AICPA)

[private blockchain:] "A restricted access network controlled by an entity or group which is similar to a traditional centralized network." (AICPA)

"A technology that records a list of records, referred to as blocks, that are linked using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp and transaction data." (AICPA)

[public blockchain:] "An open network where participants can view, read and write data, and no one participant has control (e.g., Bitcoin, Ethereum)." (AICPA)

05 July 2019

💻IT: Gateway (Definitions)

"A network software product that allows computers or networks running dissimilar protocols to communicate, providing transparent access to a variety of foreign database management systems (DBMSs). A gateway moves specific database connectivity and conversion processing from individual client computers to a single server computer. Communication is enabled by translating up one protocol stack and down the other. Gateways usually operate at the session layer." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Connectivity software that allows two or more computer systems with different network architectures to communicate." (Sybase, "Glossary", 2005)

"A generic term referring to a computer system that routes data or merges two dissimilar services together." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A software product that allows SQL-based applications to access relational and non-relational data sources." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An entrance point that allows users to connect from one network to another." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

[database gateway:] "Software required to allow clients to access data stored on database servers over a network connection." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures" 2nd Ed., 2012)

"A connector box that enables you to connect two dissimilar networks." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"A node that handles communication between its LAN and other networks" (Nell Dale & John Lewis, "Computer Science Illuminated, 6th Ed.", 2015)

"A system or device that connects two unlike environments or systems. The gateway is usually required to translate between different types of applications or protocols." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

"An application that acts as an intermediary for clients and servers that cannot communicate directly. Acting as both client and server, a gateway application passes requests from a client to a server and returns results from the server to the client." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

31 December 2018

🔭Data Science: Big Data (Just the Quotes)

"If we gather more and more data and establish more and more associations, however, we will not finally find that we know something. We will simply end up having more and more data and larger sets of correlations." (Kenneth N Waltz, "Theory of International Politics Source: Theory of International Politics", 1979)

“There are those who try to generalize, synthesize, and build models, and there are those who believe nothing and constantly call for more data. The tension between these two groups is a healthy one; science develops mainly because of the model builders, yet they need the second group to keep them honest.” (Andrew Miall, “Principles of Sedimentary Basin Analysis”, 1984)

"Big data can change the way social science is performed, but will not replace statistical common sense." (Thomas Landsall-Welfare, "Nowcasting the mood of the nation", Significance 9(4), 2012)

"Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Edd Wilder-James, "What is big data?", 2012) [source]

"The secret to getting the most from Big Data isn’t found in huge server farms or massive parallel computing or in-memory algorithms. Instead, it’s in the almighty pencil." (Matt Ariker, "The One Tool You Need To Make Big Data Work: The Pencil", 2012)

"Big data is the most disruptive force this industry has seen since the introduction of the relational database." (Jeffrey Needham, "Disruptive Possibilities: How Big Data Changes Everything", 2013)

"No subjective metric can escape strategic gaming [...] The possibility of mischief is bottomless. Fighting ratings is fruitless, as they satisfy a very human need. If one scheme is beaten down, another will take its place and wear its flaws. Big Data just deepens the danger. The more complex the rating formulas, the more numerous the opportunities there are to dress up the numbers. The larger the data sets, the harder it is to audit them." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"There is convincing evidence that data-driven decision-making and big data technologies substantially improve business performance. Data science supports data-driven decision-making - and sometimes conducts such decision-making automatically - and depends upon technologies for 'big data' storage and engineering, but its principles are separate." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Our needs going forward will be best served by how we make use of not just this data but all data. We live in an era of Big Data. The world has seen an explosion of information in the past decades, so much so that people and institutions now struggle to keep pace. In fact, one of the reasons for the attachment to the simplicity of our indicators may be an inverse reaction to the sheer and bewildering volume of information most of us are bombarded by on a daily basis. […] The lesson for a world of Big Data is that in an environment with excessive information, people may gravitate toward answers that simplify reality rather than embrace the sheer complexity of it." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"The other buzzword that epitomizes a bias toward substitution is 'big data'. Today’s companies have an insatiable appetite for data, mistakenly believing that more data always creates more value. But big data is usually dumb data. Computers can find patterns that elude humans, but they don’t know how to compare patterns from different sources or how to interpret complex behaviors. Actionable insights can only come from a human analyst (or the kind of generalized artificial intelligence that exists only in science fiction)." (Peter Thiel & Blake Masters, "Zero to One: Notes on Startups, or How to Build the Future", 2014)

"We have let ourselves become enchanted by big data only because we exoticize technology. We’re impressed with small feats accomplished by computers alone, but we ignore big achievements from complementarity because the human contribution makes them less uncanny. Watson, Deep Blue, and ever-better machine learning algorithms are cool. But the most valuable companies in the future won’t ask what problems can be solved with computers alone. Instead, they’ll ask: how can computers help humans solve hard problems?" (Peter Thiel & Blake Masters, "Zero to One: Notes on Startups, or How to Build the Future", 2014)

"As business leaders we need to understand that lack of data is not the issue. Most businesses have more than enough data to use constructively; we just don't know how to use it. The reality is that most businesses are already data rich, but insight poor." (Bernard Marr, Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance, 2015)

"Big data is based on the feedback economy where the Internet of Things places sensors on more and more equipment. More and more data is being generated as medical records are digitized, more stores have loyalty cards to track consumer purchases, and people are wearing health-tracking devices. Generally, big data is more about looking at behavior, rather than monitoring transactions, which is the domain of traditional relational databases. As the cost of storage is dropping, companies track more and more data to look for patterns and build predictive models." (Neil Dunlop, "Big Data", 2015)

"Big Data often seems like a meaningless buzz phrase to older database professionals who have been experiencing exponential growth in database volumes since time immemorial. There has never been a moment in the history of database management systems when the increasing volume of data has not been remarkable." (Guy Harrison, "Next Generation Databases: NoSQL, NewSQL, and Big Data", 2015)

"Dimensionality reduction is essential for coping with big data - like the data coming in through your senses every second. A picture may be worth a thousand words, but it’s also a million times more costly to process and remember. [...] A common complaint about big data is that the more data you have, the easier it is to find spurious patterns in it. This may be true if the data is just a huge set of disconnected entities, but if they’re interrelated, the picture changes." (Pedro Domingos, "The Master Algorithm", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"The human side of analytics is the biggest challenge to implementing big data." (Paul Gibbons, "The Science of Successful Organizational Change", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Underfitting is when a model doesn’t take into account enough information to accurately model real life. For example, if we observed only two points on an exponential curve, we would probably assert that there is a linear relationship there. But there may not be a pattern, because there are only two points to reference. [...] It seems that the best way to mitigate underfitting a model is to give it more information, but this actually can be a problem as well. More data can mean more noise and more problems. Using too much data and too complex of a model will yield something that works for that particular data set and nothing else." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"We are moving slowly into an era where Big Data is the starting point, not the end." (Pearl Zhu, "Digital Master: Debunk the Myths of Enterprise Digital Maturity", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Big data is, in a nutshell, large amounts of data that can be gathered up and analyzed to determine whether any patterns emerge and to make better decisions." (Daniel Covington, Analytics: Data Science, Data Analysis and Predictive Analytics for Business, 2016)

"Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. We have to explicitly embed better values into our algorithms, creating Big Data models that follow our ethical lead. Sometimes that will mean putting fairness ahead of profit." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"While Big Data, when managed wisely, can provide important insights, many of them will be disruptive. After all, it aims to find patterns that are invisible to human eyes. The challenge for data scientists is to understand the ecosystems they are wading into and to present not just the problems but also their possible solutions." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are." (Seth Stephens-Davidowitz, "Everybody Lies: What the Internet Can Tell Us About Who We Really Are", 2017)

"Effects without an understanding of the causes behind them, on the other hand, are just bunches of data points floating in the ether, offering nothing useful by themselves. Big Data is information, equivalent to the patterns of light that fall onto the eye. Big Data is like the history of stimuli that our eyes have responded to. And as we discussed earlier, stimuli are themselves meaningless because they could mean anything. The same is true for Big Data, unless something transformative is brought to all those data sets… understanding." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"The term [Big Data] simply refers to sets of data so immense that they require new methods of mathematical analysis, and numerous servers. Big Data - and, more accurately, the capacity to collect it - has changed the way companies conduct business and governments look at problems, since the belief wildly trumpeted in the media is that this vast repository of information will yield deep insights that were previously out of reach." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Just as they did thirty years ago, machine learning programs (including those with deep neural networks) operate almost entirely in an associational mode. They are driven by a stream of observations to which they attempt to fit a function, in much the same way that a statistician tries to fit a line to a collection of points. Deep neural networks have added many more layers to the complexity of the fitted function, but raw data still drives the fitting process. They continue to improve in accuracy as more data are fitted, but they do not benefit from the 'super-evolutionary speedup'."  (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Apart from the technical challenge of working with the data itself, visualization in big data is different because showing the individual observations is just not an option. But visualization is essential here: for analysis to work well, we have to be assured that patterns and errors in the data have been spotted and understood. That is only possible by visualization with big data, because nobody can look over the data in a table or spreadsheet." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Making big data work is harder than it seems. Statisticians have spent the past two hundred years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster, and cheaper these days, but we must not pretend that the traps have all been made safe. They have not." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Many people have strong intuitions about whether they would rather have a vital decision about them made by algorithms or humans. Some people are touchingly impressed by the capabilities of the algorithms; others have far too much faith in human judgment. The truth is that sometimes the algorithms will do better than the humans, and sometimes they won’t. If we want to avoid the problems and unlock the promise of big data, we’re going to need to assess the performance of the algorithms on a case-by-case basis. All too often, this is much harder than it should be. […] So the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency, and debate." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"The problem is the hype, the notion that something magical will emerge if only we can accumulate data on a large enough scale. We just need to be reminded: Big data is not better; it’s just bigger. And it certainly doesn’t speak for itself." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"[...] the focus on Big Data AI seems to be an excuse to put forth a number of vague and hand-waving theories, where the actual details and the ultimate success of neuroscience is handed over to quasi- mythological claims about the powers of large datasets and inductive computation. Where humans fail to illuminate a complicated domain with testable theory, machine learning and big data supposedly can step in and render traditional concerns about finding robust theories. This seems to be the logic of Data Brain efforts today. (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"We live on islands surrounded by seas of data. Some call it 'big data'. In these seas live various species of observable phenomena. Ideas, hypotheses, explanations, and graphics also roam in the seas of data and can clarify the waters or allow unsupported species to die. These creatures thrive on visual explanation and scientific proof. Over time new varieties of graphical species arise, prompted by new problems and inner visions of the fishers in the seas of data." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)

"Visualizations can remove the background noise from enormous sets of data so that only the most important points stand out to the intended audience. This is particularly important in the era of big data. The more data there is, the more chance for noise and outliers to interfere with the core concepts of the data set." (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

01 June 2018

🔬Data Science: Data Model (Definitions)

"simplified and approximative description of a system or process, based on a finite set of essential variables and their analytically definable behavior." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"(i) An abstract, self-contained logical definition of the data structures and associated operators that make up the abstract machine with which users interact (such as the relational model of data). (ii) A model of the persistent data of some enterprise" (Keith Gordon, "Principles of Data Management", 2007)

"A means of encapsulating the data elements that were decided on by the business experts, in conjunction with the data stewards and IT professionals. The data models reflect an organization’s business as represented in the data." (Tony Fisher, "The Data Asset", 2009)

"An abstraction of how individual data elements relate to each other. It visually depicts how the data is to be organized and stored in a database. A data model provides the mechanism to document and understand how data is organized." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"An organization of data that describes the relationships among primitive and composite data elements." (Toby J. Teorey, "Database Modeling and Design" 4th Ed., 2010)

"A model of the structure (and to some extent the content) of a database and at least some of the rules governing the data therein." (Graham Witt, "Writing Effective Business Rules", 2012)

"A data model is a visual representation of data content and the relationships, created for purposes of understanding how data is or might be organized, and for ensuring the comprehensibility and usability of that way of organizing data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013)

"Represents data objects and their relationships with each other. Data models form the basis for data integration at the conceptual level as well as the improvement of data quality, such as with regard to the reduction of data redundancy. Data models are one component of the data architecture." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"A template formalizing the relationship between an input and an output. Its structure is fixed but it also has parameters that are modifiable; the parameters are adjusted so that the same model with different parameters can be trained on different data to implement different relationships in different tasks." (Ethem Alpaydın, "Machine learning : the new AI", 2016)

"A visual means of depicting data and its relationship to other data." (Gregory Lampshire et al, "The Data and Analytics Playbook", 2016)

"An abstract representation of a subject that looks and/or behaves like all or part of the original." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"In the context of machine learning, a model is a representation of a pattern extracted using machine learning from a data set. Consequently, models are trained, fitted to a data set, or created by running a machine learning algorithm on a data set. Popular model representations include decision trees and neural networks. A prediction model defines a mapping (or function) from a set of input attributes to a value for a target attribute. Once a model has been created, it can then be applied to new instances from the domain. For example, in order to train a spam filter model, we would apply a machine learning algorithm to a data set of historic emails that have been labeled as spam or not spam. Once the model has been trained it can be used to label (or filter) new emails that were not in the original data set." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"An abstract model that describes how data is presented and used." (Piethein Strengholt, "Data Management at Scale", 2020)

"A logical map that represents the inherent properties of the data independent of software, hardware or machine performance considerations. The model shows data elements grouped into records, as well as the association around those records." (Information Management)

"Defines how data is structured, related, and standardized for the purpose of extracting meaningful insight." (Insight Software)

"A data model is an abstract model that organizes elements of data and standardizes how they relate to one another. Their properties are generally governed by properties of the real world entities." (kloudless)

11 May 2018

Application Support: One Database, Two Vendors, No Love Story or Maybe…

Data Warehousing

Introduction

Situation: An organization has several BI tools provisioned with data from the same data warehouse (DW), and the BI infrastructure is supported by a single service provider (vendor). The organization wants to adopt a new BI technology, though this requires bringing another vendor into the picture. The data the new tool requires are already available in the DW, however the DW needs to be extended with logic and other components to support the tool. This means that two vendors will be active in the same DW, and more generally in the same environment.

Question(s): What is the best approach for making this work? What are the challenges of making it work, considering that two vendors will share the same environment?


Preliminary

    When you ask IT people about this situation, many will tell you that it's not a good idea, being circumspect about having two vendors within the same environment. Some will recall previous experiences in which things went really bad or escalated to some degree. They may even show their disagreement through body language or raised voices. Even if they also had good experiences with two vendors supporting the same environment, the negative experiences will prevail. It's the typical reaction to an idea that recalls something that caused considerable trouble. This behavior is understandable, as people generally tend to remember the issues they had rather than the successes. Problems leave deeper marks than success, especially when challenges are seen as burdens.

    Reacting defensively is a result of the "I've been burned once" syndrome. People react adversely and tend to avoid situations in which they were burned, instead of dealing with them, recognizing the circumstances that led to the situation in the first place, and seeing opportunities for healing and rising above the challenges.


   Personally, at first glance, caution would make me advise against having two or more vendors playing in the same playground as well. I've had my share of extreme cases in which something went wrong and the vendors started acting like kids. Parents (and in general people who work with children) know what I'm talking about: children don't like to share their toys, and parents often find themselves in the position of mediating between them. When a toy gets broken, it's easy to blame the other kid for it, just as somebody else must put the toy back in its place because that somebody played with it last. It's a mix between "I'm in charge" and the blame game. Who needs that?

  At second glance, if parents manage it, why wouldn't professionals succeed in making two vendors work together? Sure, parents have more practice in dealing with kids, face such situations on a daily basis, and have fewer variables to think about… I have seen vendors sitting together until they came up with a solution; I've seen vendors open to communication, putting the customer first, even if that meant leaving the ego behind. Where there's a will there's a way.


The Solution Space

    In IT there are seldom general recipes that always lead to success, and whether a solution works or not depends on a series of factors – environment, skills, communication, human behavior and quite often chance, the chance of doing the right thing at the right time. However, a recipe can be used as a starting point, for example to define the best-case scenario, what will happen when everything goes well. At the opposite end there is the worst-case scenario, what will happen when everything goes south. These two opposite scenarios are in general the frame within which a solution can be defined.

    Within this frame one can add several other reference points or paths, made up of the experience of people who have handled and lived through similar situations – what worked, what didn't, what could work, what the challenges are, and so on. In general, people's experience and knowledge prove to be a good estimator in decision-making, and even if relative, they provide some insight into the problem at hand.


    Let's reconsider the parents dealing with children fighting over the same toy, though from the perspective of all the toys available to play with. There are several options: the kids could take (supervised) turns playing with the toys, which could be a win-win situation if they are willing to cooperate. One can take the toys (temporarily) away, though this could lead to other issues. One can reaffirm who owns each toy, each kid being allowed to play only with his own. One could buy a second toy and thus break the bank, even if this won't make the issue entirely go away. There are probably other solutions inventive parents might find.

    Similarly, for the problem stated above, one option, and maybe the best, is having the vendors share ownership of the DW by finding a way to work together. Defining the ownership for each tool can alleviate some of the problems but not all, and the same applies to building a second DW. We can probably all agree that taking the tools away is not the right thing to do; even if it's a solution, it doesn't serve the purpose.


Sharing Ownership

    Complex IT environments like that of a DW depend on the vendors' capability to work together toward the same goal, even if different interests are in play. This presumes the willingness of the parties to relinquish some control and share responsibilities. Unfortunately, not all vendors are willing to do that. That's the point where imaginary obstacles are built, and where effort needs to be put into eliminating them.

    When working together, one of the parties must often play the coordinator role. In theory, this role can be played by either vendor, and it can even change from case to case. Another approach is for the coordinator role to be taken by a person or a board from the customer side. In the case of a data warehouse it can be an IT professional, a Project Manager or a BI Competency Center (BICC). This allows the activities to be coordinated smoothly, and the communication and other challenges faced to be mediated.


    How will ownership sharing work? Let's suppose vendor A wants to change something in the infrastructure. The change is first formulated, briefly reviewed, and approved by both vendors and the customer, and will then be implemented and documented by vendor A as needed. Vendor B is involved in the process by validating the concept and reviewing the documentation, keeping its involvement to a minimum. There can still be some delays in the process, though the overhead is kept reasonably low. There will also be scenarios in which vendor B only needs to be informed that a change has occurred, or it is sometimes enough if the change was properly documented.

    This approach also involves a greater need for documentation, versioning and established processes, whose role is to facilitate the communication and track the changes that occur in the environment.


Splitting Ownership

    Splitting the ownership involves setting clear boundaries and responsibilities within which each vendor can perform its work. One is thus forced to draw a line and say which components or activities belong to which vendor.

    The architecture of existing solutions sometimes makes it hard to split the ownership when it was not designed for that. A solution would be to redesign the whole architecture, though even then it might not be possible to draw a clear line when there are grey areas. One eventually needs to consider the advantages and disadvantages and decide which vendor the responsibility suits best.


    For example, in the context of a DW security can be enforced via schemas within the same or different databases, though there are also objects (e.g. tables with base data) used by multiple applications. One of the vendors (vendor A) gets ownership of those objects, so when vendor B needs a change to one of those tables, it must request the change from vendor A. Once the changes are done, vendor B needs to validate them, and if there are problems further communication occurs. Overall, this approach takes more time than if vendor B had made the changes alone. However, it works, even if it comes with some challenges.
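
    As an illustration of how such a split could look in SQL Server or Azure SQL, here is a minimal T-SQL sketch; the schema, role and table names (VendorA, VendorB, VendorA_Support, VendorB_Support, SalesOrders) are hypothetical, and the exact set of permissions would depend on the customer's security policy.

-- Illustrative sketch only: schema-based split of ownership (all names hypothetical)
CREATE SCHEMA VendorA;       -- objects owned and maintained by vendor A
GO
CREATE SCHEMA VendorB;       -- objects owned and maintained by vendor B
GO
CREATE ROLE VendorA_Support; -- database role for vendor A's support team
CREATE ROLE VendorB_Support; -- database role for vendor B's support team
GO
-- a shared base table that lives in vendor A's schema
CREATE TABLE VendorA.SalesOrders (
    OrderId   INT           NOT NULL PRIMARY KEY,
    OrderDate DATE          NOT NULL,
    Amount    DECIMAL(18,2) NOT NULL
);
GO
-- vendor A can create and change objects in its schema; vendor B can only read them
GRANT CREATE TABLE TO VendorA_Support;
GRANT ALTER, SELECT, INSERT, UPDATE, DELETE ON SCHEMA::VendorA TO VendorA_Support;
GRANT SELECT ON SCHEMA::VendorA TO VendorB_Support;
-- vendor B has the equivalent rights only within its own schema
GRANT CREATE TABLE TO VendorB_Support;
GRANT ALTER, SELECT, INSERT, UPDATE, DELETE ON SCHEMA::VendorB TO VendorB_Support;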

    There's also the possibility of giving vendor B temporary permissions to make the changes, which shortens the effort needed. Vendor A will still be in charge, will have to approve the documentation, and will eventually do some validation as well.
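
    A minimal sketch of such a time-boxed change window, using the same hypothetical names as above; in practice the grant and the revoke would be tied to the approved change request and ideally scheduled and audited.

-- Illustrative sketch only: temporary write access for vendor B on vendor A's schema
GRANT ALTER, INSERT, UPDATE, DELETE ON SCHEMA::VendorA TO VendorB_Support;
-- ... vendor B implements and documents the agreed change ...
-- once vendor A has validated the change, the extra rights are withdrawn
REVOKE ALTER, INSERT, UPDATE, DELETE ON SCHEMA::VendorA FROM VendorB_Support;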


Separating Ownership

    Giving each vendor its own playground is a costly solution, though it can be the only one in certain scenarios: for example, when one architecture is supposed to replace (in time) another, or when the existing architecture has certain limitations. In the context of a DW this involves duplicating the data loads, the data themselves, as well as logic, possibly processes, and so on.

    Pushing this just to solve a communication problem is the wrong thing to do. What happens if a third or a fourth vendor joins the party? Would a new environment be created for each vendor? Hopefully not…

    On the other hand, there are also vendors that don't want to relinquish ownership and will play their cards to avoid it. The overhead of dealing with such issues may, in extreme cases, surpass the cost of having a second environment. In the end the final decision lies with the customer.


Hybrid Approach


    A hybrid between sharing and splitting ownership can give the best of the two scenarios. It's useful and even recommended to define the boundaries of work for each vendor, and then share ownership in the areas where concerns intersect, the grey areas. For sensitive areas there could be some restrictions on cooperation.

    A hybrid solution can also involve splitting some parts of the architecture, though performance and security are mainly the driving factors there.


Conclusion

   With this post I wanted to make the reader question some of the hot-headed decisions made when two or more vendors are involved in supporting an architecture. Even if the problem is framed within the context of a DW, it occurs far beyond this context. We are enablers and problem solvers. Instead of avoiding challenges, we should rather make sure that we're removing or minimizing the risks.

10 April 2018

🔬Data Science: Abstraction (Definitions)

"A broad and general term indicating (1) a less detailed model that conforms to (defines a subset of the properties of) another model, and (2) the process through which a less detailed but conforming model is made, that is, the process of removing details that are not relevant to the purpose of the model." (Anneke Kleppe et al, "MDA Explained: The Model Driven Architecture™: Practice and Promise", 2003)

"The process of ignoring or suppressing levels of detail to provide a simpler, more generalized view." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The process of moving from the specific to the general by neglecting minor differences or stressing common elements. Also used as a synonym for summarization." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"Data abstraction means the storage details of the data are hidden from the user and the user is provided with the conceptual view of the database." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"In data modeling, the redefinition of data entities, attributes, and relationships by removing details to broaden the applicability of data structures to a wider class of situations, often by implementing supertypes rather than subtypes." (DAMA International, "The DAMA Dictionary of Data Management" 1st Ed., 2010)

[horizontal abstraction:] "The process of partitioning a model into smaller subparts for presentation. Used in data modeling to show related areas in a more readable scale." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[vertical abstraction:] "The presentation of all or part of a model detail. Used in data modeling to show higher levels of entities and relationships to illustrate the basic subject area contents." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The separation of the logical view of data from its implementation." (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"The separation of a data type’s logical properties from its implementation." (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

