SQL Troubles

20 December 2017

🗃️Data Management: Versioning (Just the Quotes)

"There are two different methods to detect and collect changes: data versioning, which evaluates columns that identify rows that have changed (e.g., last-update-timestamp columns, version-number columns, status-indicator columns), or by reading logs that document the changes and enable them to be replicated in secondary systems." (DAMA International, "DAMA-DMBOK: Data Management Body of Knowledge" 2nd Ed., 2017)

"Moving your code to modules, checking it into version control, and versioning your data will help to create reproducible models. If you are building an ML model for an enterprise, or you are building a model for your start-up, knowing which model and which version is deployed and used in your service is essential. This is relevant for auditing, debugging, or resolving customer inquiries regarding service predictions." (Christoph Körner and Kaijisse Waaijer, "Mastering Azure Machine Learning". 2020)

"Versioning is a critical feature, because understanding the history of a master data record is vital to maintaining its quality and accuracy over time." (Cédrine MADERA, "Master Data and Reference Data in Data Lake Ecosystems" [in "Data Lake" ed. by Anne Laurent et al, 2020])

"Versioning of data is essential for ML systems as it helps us to keep track of which data was used for a particular version of code to generate a model. Versioning data can enable reproducing models and compliance with business needs and law. We can always backtrack and see the reason for certain actions taken by the ML system. Similarly, versioning of models (artifacts) is important for tracking which version of a model has generated certain results or actions for the ML system. We can also track or log parameters used for training a certain version of the model. This way, we can enable end-to-end traceability for model artifacts, data, and code. Version control for code, data, and models can enhance an ML system with great transparency and efficiency for the people developing and maintaining it." (Emmanuel Raj, "Engineering MLOps Rapidly build, test, and manage production-ready machine learning life cycles at scale", 2021)

"DevOps and Continuous Integration/Continuous Deployment (CI/CD) are vital to any software project that is developed by more than one developer and needs to uphold quality standards. A central code repository that offers versioning, branching, and merging for collaborative development and approval workflows and documentation features is the minimum requirement here." (Patrik Borosch, "Cloud Scale Analytics with Azure Data Services: Build modern data warehouses on Microsoft Azure", 2021)

"Automated data orchestration is a key DataOps principle. An example of orchestration can take ETL jobs and a Python script to ingest and transform data based on a specific sequence from different source systems. It can handle the versioning of data to avoid breaking existing data consumption pipelines already in place." (Sonia Mezzetta, "Principles of Data Fabric: Become a data-driven organization by implementing Data Fabric solutions efficiently", 2023)

"Data products should remain stable and be decoupled from the operational/transactional applications. This requires a mechanism for detecting schema drift, and avoiding disruptive changes. It also requires versioning and, in some cases, independent pipelines to run in parallel, giving your data consumers time to migrate from one version to another." (Piethein Strengholt, "Data Management at Scale: Modern Data Architecture with Data Mesh and Data Fabric" 2nd Ed., 2023)

"When performing experiments, the first step is to determine what compute infrastructure and environment you need.16 A general best practice is to start fresh, using a clean development environment. Keep track of everything you do in each experiment, versioning and capturing all your inputs and outputs to ensure reproducibility. Pay close attention to all data engineering activities. Some of these may be generic steps and will also apply for other use cases. Finally, you’ll need to determine the implementation integration pattern to use for your project in the production environment." (Piethein Strengholt, "Data Management at Scale: Modern Data Architecture with Data Mesh and Data Fabric" 2nd Ed., 2023)

13 December 2017

🗃️Data Management: Data Management (Just the Quotes)

"Metadata provides context for data by describing data about data. It answers 'who, what, when, where, how, and why' about every facet of the data. It is used to facilitate understanding, usage, and management of data." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"Start by reviewing existing data management activities, such as who creates and manages data, who measures data quality, or even who has ‘data’ in their job title. Survey the organization to find out who may already be fulfilling needed roles and responsibilities. Such individuals may hold different titles. They are likely part of a distributed organization and not necessarily recognized by the enterprise. After compiling a list of ‘data people,’ identify gaps. What additional roles and skill sets are required to execute the data strategy? In many cases, people in other parts of the organization have analogous, transferrable skill sets. Remember, people already in the organization bring valuable knowledge and experience to a data management effort." (DAMA International, "DAMA-DMBOK: Data Management Body of Knowledge" 2nd Ed., 2017)

"A big part of data governance should be about helping people (business and technical) get their jobs done by providing them with resources to answer their questions, such as publishing the names of data stewards and authoritative sources and other metadata, and giving people a way to raise, and if necessary escalate, data issues that are hindering their ability to do their jobs. Data governance helps answer some basic data management questions." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"Data governance presents a clear shift in approach, signals a dedicated focus on data management, distinctly identifies accountability for data, and improves communication through a known escalation path for data questions and issues. In fact, data governance is central to data management in that it touches on essentially every other data management function. In so doing, organizational change will be brought to a group is newly - and seriously - engaging in any aspect of data management." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"Indicators represent a way of 'distilling' the larger volume of data collected by organizations. As data become bigger and bigger, due to the greater span of control or growing complexity of operations, data management becomes increasingly difficult. Actions and decisions are greatly influenced by the nature, use and time horizon (e.g., short or long-term) of indicators." (Fiorenzo Franceschini et al, "Designing Performance Measurement Systems: Theory and Practice of Key Performance Indicators", 2019)

"The transformation of a monolithic application into a distributed application creates many challenges for data management." (Piethein Strengholt, "Data Management at Scale: Best Practices for Enterprise Architecture", 2020)

"Data management of the future must build in embracing change, by default. Rigid data modeling and querying languages that expect to put the system in a straitjacket of a never-changing schema can only result in a fragile and unusable analytics system. [...] The data management of the future must support managing and accessing data across multiple hosting platforms, by default." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"I am using ‘data strategy’ as an overarching term to describe a far broader set of capabilities from which sub-strategies can be developed to focus on particular facets of the strategy, such as management information (MI) and reporting; analytics, machine learning and AI; insight; and, of course, data management." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"In short, a monolithic architecture, technology, and organizational structure are not suitable for analytical data management of large-scale and complex organizations." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"In the same vein, data strategy is often a misnomer for a much wider scope of coverage, but the lack of coherence in how we use the language has led to data strategy being perceived to cover data management activities all the way through to exploitation of data in the broadest sense. The occasional use of information strategy, intelligence strategy or even data exploitation strategy may differentiate, but the lack of a common definition on what we mean tends to lead to data strategy being used as a catch-all for the more widespread coverage such a document would typically include. Much of this is due to the generic use of the term ‘data’ to cover everything from its capture, management, governance through to reporting, analytics and insight." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"One of the limitations of data management solutions today is how we have attempted to manage its unwieldy complexity, how we have decomposed an ever-growing monolithic data platform and team to smaller partitions. We have chosen the path of least resistance, a technical partitioning." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

🗃️Data Management: Data Strategy (Just the Quotes)

"Data strategy is one of the most ubiquitous and misunderstood topics in the information technology (IT) industry. Most corporations' data strategy and IT infrastructure were not planned, but grew out of "stovepipe" applications over time with little to no regard for the goals and objectives of the enterprise. This stovepipe approach has produced the highly convoluted and inflexible IT architectures so prevalent in corporations today." (Sid Adelman et al, "Data Strategy", 2005)

"The chaos without a data strategy is not as obvious, but the indicators abound: dirty data, redundant data, inconsistent data, the inability to integrate, poor performance, terrible availability, little accountability, users who are increasingly dissatisfied with the performance of IT, and the general feeling that things are out of control." (Sid Adelman et al, "Data Strategy", 2005)

"The vision of a data strategy that fits your organization has to conform to the overall strategy of IT, which in turn must conform to the strategy of the business. Therefore, the vision should conform to and support where the organization wants to be in 5 years." (Sid Adelman et al, "Data Strategy", 2005)

"Working without a data strategy is analogous to a company allowing each department and each person within each department to develop its own financial chart of accounts. This empowerment allows each person in the organization to choose his own numbering scheme. Existing charts of accounts would be ignored as each person exercises his or her own creativity." (Sid Adelman et al, "Data Strategy" 1st Ed., 2005)

"Data is great, but strategy is better!" (Steven Sinofsky, Harvard Business School, 2013)

"Strategy is everything. Without it, data, big or otherwise, is essentially useless. A bad strategy is worse than useless because it can be highly damaging to the organization. A bad strategy can divert resources, waste time, and demoralize employees. This would seem to be self-evident but in practice, strategy development is not quite so straightforward. There are numerous reasons why a strategy is MIA from the beginning, falls apart mid-project, or is destroyed in a head-on collision with another conflicting business strategy." (Pam Baker, "Data Divination: Big Data Strategies", 2015)

"The overall data strategy should be focused on continuously discovering ways to improve the business through refinement, innovation, and solid returns, both in the short and long terms. Project-specific strategies should lead to a specific measurable and actionable end for that effort. This should be immediately followed with ideas about what can be done from there, which in turn should ultimately lead to satisfying the goals in the overall big data strategy and reshaping it as necessary too." (Pam Baker, "Data Divination: Big Data Strategies", 2015)

"A data strategy should include business plans to use information to competitive advantage and support enterprise goals. Data strategy must come from an understanding of the data needs inherent in the business strategy: what data the organization needs, how it will get the data, how it will manage it and ensure its reliability over time, and how it will utilize it. Typically, a data strategy requires a supporting Data Management program strategy – a plan for maintaining and improving the quality of data, data integrity, access, and security while mitigating known and implied risks. The strategy must also address known challenges related to data management." (DAMA International, "DAMA-DMBOK: Data Management Body of Knowledge" 2nd Ed., 2017)

"A good data strategy is not determined by what data is readily or potentially available - it’s about what your business wants to achieve, and how data can help you get there." (Bernard Marr, "Data Strategy", 2017)

"A sound data strategy requires that the data contained in a company’s single source of truth (SSOT) is of high quality, granular, and standardized, and that multiple versions of the truth (MVOTs) are carefully controlled." (Leandro DalleMule & Thomas H Davenport, "What’s Your Data Strategy?", Harvard Business Review, 2017) [link]

"Companies that have not yet built a data strategy and a strong data-management function need to catch up very fast or start planning for their exit." (Leandro DalleMule & Thomas H Davenport, "What’s Your Data Strategy?", Harvard Business Review, 2017) [link]

"How a company’s data strategy changes in direction and velocity will be a function of its overall strategy, culture, competition, and market." (Leandro DalleMule & Thomas H Davenport, "What’s Your Data Strategy?", Harvard Business Review, 2017) [link]

"[…] if companies want to avoid drowning in data, they need to develop a smart [data] strategy that focuses on the data they really need to achieve their goals. In other words, this means defining the business-critical questions that need answering and then collecting and analysing only that data which will answer those questions." (Bernard Marr, "Data Strategy", 2017)

"In truth, all three of these perspectives - process, technology, and data - are needed to create a good data strategy. Each type of person approaches things differently and brings different perspectives to the table. Think of this as another aspect of diversity. Just as a multicultural team and a team with different educational backgrounds will produce a better result, so will a team that includes people with process, technology and data perspectives." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data strategy is the opportunity to bring data, one of the most important assets your organisation has, to the fore and to drive the future direction of the organisation." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"Data strategy is even less understood [thank business strategy], so the chances of success can be further decreased, simply because you need organisation-wide commitment and buy-in to succeed. Data does not exist in a bubble; it is not the preserve of a function that can fix it for all, detached from touching everyone else. It is core to how you run the organisation, and without a focus on where you are heading, it is going to trip the organisation up at every turn – regulatory compliance; operational effectiveness; financial performance; customer and employee experience; essentially, the efficiency in managing virtually every activity in the organisation." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"It is also important to regard the data strategy as a living document. Do not regard it as a masterpiece, never to be reviewed, amended or critiqued within the time frame it covers, but instead see it as a strategy that can flex to the changing demands of an organisation." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"Many organisations start a data strategy from a need to get data into some sort of organised state in which it is feasible to demonstrate compliance. In my opinion, compliance should be a component of a data strategy, not the data strategy in itself." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"The data strategy should answer the questions: Where are we going? What are we trying to achieve? How does this data strategy fit with the vision, mission and strategy of the organisation? The digital strategy should answer the overarching question: How are we are planning to achieve this?" (Alison Holt [Ed.], Data Governance: Governing data for sustainable business", 2021)

"The key for a successful data strategy is to align it clearly with the corporate strategy. The data strategy is a crucial enabler of the corporate strategy, and the data strategy should clearly call out those components that have a clear line of sight to delivering, or enabling, the corporate goals. If the data strategy does not align to the corporate goals it will be a much more challenging task to get the wider organisation to buy into it, not least because it will fail to have any resonance with the objectives of the organisational leaders and be regarded as optional at best." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"Right now, the biggest challenge for organizations working on their data strategy might not have to do with technology at all. [...] It’s an understandable problem: to a degree that is perpetually underestimated, becoming data-driven is about the ability of people and organizations to adapt to change." (Randy Bean, "Why Becoming a Data-Driven Organization Is So Hard", Harvard Business Review, 2022) [link]

"A data strategy must align with the business goals and overall framework of how data will be used and managed within an organization. It needs to include standards for how data will be discovered, integrated, accessed, shared, and protected. It needs to address how data will meet regulatory compliance policies, Master Data Management, and data democratization. There needs to be an assurance that both data and metadata have a quality control framework in place to achieve data trust. A data strategy needs to have a clear path on how an organization will accomplish data monetization." (Sonia Mezzetta, "Principles of Data Fabric: Become a data-driven organization by implementing Data Fabric solutions efficiently", 2023)

"A data strategy is a living document that needs to be continuously updated to align with business goals. It should have a clear maintenance process with frequent reviews and identification of authors and stakeholders that will contribute to the data strategy. This also includes the handling of exceptions to a data strategy process for any one-off decisions in special circumstances. A data strategy document must always be easily assessable, to the point, and understandable." (Sonia Mezzetta, "Principles of Data Fabric: Become a data-driven organization by implementing Data Fabric solutions efficiently", 2023)

See also the quotes on Strategy and Tactics.

05 December 2017

🗃️Data Management: Quality (Just the Quotes)

"Quality is never an accident; it is always the result of intelligent effort." (John Ruskin, "Seven Lamps of Architecture", 1849)

"It is most important that top management be quality-minded. In the absence of sincere manifestation of interest at the top, little will happen below." (Joseph M Juran, "Management of Inspection and Quality Control", 1945)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"The management of a system has to deal with the generation of the plans for the system, i. e., consideration of all of the things we have discussed, the overall goals, the environment, the utilization of resources and the components. The management sets the component goals, allocates the resources, and controls the system performance." (C West Churchman, "The Systems Approach", 1968)

"When a product is manufactured by workers who find their work meaningful, it will inevitably be a product of high quality." (Pehr G Gyllenhammar, "Management", 1976)

"Quality management is a systematic way of guaranteeing that organized activities happen the way they are planned." (Philip B Crosby, "Quality Is Free: The Art of Making Quality Certain", 1977)

"The problem of quality management is not what people don't know about it. The problem is what they think they do know." (Philip B Crosby, "Quality Is Free: The Art of Making Quality Certain", 1977)

"Uncontrolled variation is the enemy of quality." (W Edwards Deming, 1980)

"Almost all quality improvement comes via simplification of design, manufacturing, layout, processes and procedures." (Tom Peters, "Thriving on Chaos", 1987)

"Quality is a matter of faith. You set your standards, and you have to stick by them no matter what. That's easy when you've got plenty of product on hand, but it's another thing when the freezer is empty and you've got a truck at the door waiting for the next shipment to come off the production line. That's when you really earn your reputation for quality." (Ben Cohen, Inc. Magazine, 1987)

"Quality is very simple. So simple, in fact, that it is difficult for people to understand." (Roger Hale, "Quest for Quality", 1987)

"[...] running numbers on a computer [is] easier than trying to judge quality." (Esther Dyson, Forbes, 1987)

"The [quality control] issue has more to do with people and motivation and less to do with capital and equipment than one would think. It involves a cultural change." (Michael Beer, The Washington Post, 1987)

"Cutting costs without improvements in quality is futile." (W Edwards Deming, Forbes, 1988)

"Quality planning consists of developing the products and processes required to meet customer's needs." (Joseph M Juran, "Juran on planning for quality", 1988)

"Quality means meeting customers' (agreed) requirements, formal and informal, at lowest cost, first time every time." (Robert L Flood, "Beyond TQM", 1993)

"Many quality failures arise because a customer uses the product in a manner different from that intended by the supplier." (Joseph M Juran, "The quality planning process", 1999)

"Quality goals that affect product salability should be based primarily on meeting or exceeding market quality. Because the market and the competition undoubtedly will be changing while the quality planning project is under way, goals should be set so as to meet or beat the competition estimated to be prevailing when the project is completed." (Joseph M Juran, "The quality planning process", 1999)

"'Quality' means freedom from deficiencies - freedom from errors that require doing work over again (rework) or that result in field failures, customer dissatisfaction, customer claims, and so on." (Joseph M Juran, "How to think about quality", 1999)

"‘Quality’ means those features of products which meet customer needs and thereby provide customer satisfaction." (Joseph M Juran, "How to think about quality", 1999)

"The anatomy of 'quality assurance' is very similar to that of quality control. Each evaluates actual quality. Each compares actual quality with the quality goal. Each stimulates corrective action as needed. What differs is the prime purpose to be served. Under quality control, the prime purpose is to serve those who are directly responsible for conducting operations - to help them regulate current operations. Under quality assurance, the prime purpose is to serve those who are not directly responsible for conducting operations but who have a need to know - to be informed as to the state of affairs and, hopefully, to be assured that all is well." (Joseph M Juran, "How to think about quality", 1999)

"To attain quality, it is well to begin by establishing the 'vision' for the organization, along with policies and goals. Conversion of goals into results (making quality happen) is then done through managerial processes - sequences of activities that produce the intended results." (Joseph M Juran, "How to think about quality", 1999)

"Our culture, obsessed with numbers, has given us the idea that what we can measure is more important than what we can't measure. Think about that for a minute. It means that we make quantity more important than quality." (Donella Meadows, "Thinking in Systems: A Primer", 2008)

"A model is a representation in that it (or its properties) is chosen to stand for some other entity (or its properties), known as the target system. A model is a tool in that it is used in the service of particular goals or purposes; typically these purposes involve answering some limited range of questions about the target system." (Wendy S Parker, "Confirmation and Adequacy-for-Purpose in Climate Modelling", Proceedings of the Aristotelian Society, Supplementary Volumes, Vol. 83, 2009)

03 December 2017

🗃️ Data Management: Data Migration (Just the Quotes)

"One problem area for refactoring is databases. Most business applications are tightly coupled to the database schema that supports them. That's one reason that the database is difficult to change. Another reason is data migration. Even if you have carefully layered your system to minimize the dependencies between the database schema and the object model, changing the database schema forces you to migrate the data, which can be a long and fraught task." (Martin Fowler et al, "Refactoring: Improving the Design of Existing Code", 2002)

"Data migration is not just about moving data from one place to another; it should be focused on: realizing all the benefits promised by the new system when you entertained the concept of new software in the first place; creating the improved enterprise performance that was the driver for the project; importing the best, the most appropriate and the cleanest data you can so that you enhance business intelligence; maintaining all your regulatory, legal and governance compliance criteria; staying securely in control of the project." (John Morris, "Practical Data Migration", 2009)

"One of my criticisms of the usual ad hoc approach to data migration is that it does not contain standardized work packages that allow for the migration activity to be tracked at programme level. Breaking data cleansing, the principal activity, down into methods and tasks allows different levels of control by the programme office." (John Morris, "Practical Data Migration", 2009)

"The key to a successful migration is to remember that data migration is a business not a technical problem and data quality is a business not a technical issue. It is for the enterprise to dictate how and where data comes from and goes and what constitutes sufficient data quality. It is our jobs, as handmaidens of progress, to assist with the technical issues of moving data from one place to another, identifying referential integrity and other technical issues, and facilitating the process. But we are the servants not the masters." (John Morris, "Practical Data Migration", 2009)

"Data migration is indeed a complex project. It is common for companies to underestimate the amount of time it takes to complete the data conversion successfully. Data quality usually suffers because it is the first thing to be dropped once the project is behind schedule. Make sure to allocate enough time to complete the task maintaining the highest standards of quality necessary. Migrate now, clean later typically leads to another source of mistrusted data, defeating the whole purpose of MDM." (Dalton Cervo & Mark Allen, "Master Data Management in Practice: Achieving true customer MDM", 2011)

"Data migration, in simple terms, is about physically copying data from one repository into another. The need for a data migration effort is highly dependent on the MDM architecture chosen [...] The goal of MDM is to eliminate redundant systems and have a single system-of-record. Therefore, it seems contradictory to have a temporary ongoing data migration, which is, in essence, maintaining multiple systems and having to write additional software to keep them synchronized." (Dalton Cervo & Mark Allen, "Master Data Management in Practice: Achieving true customer MDM", 2011)

"Companies typically underestimate the importance of metadata management in general, and more specifically during data migration projects. Metadata management is normally postponed when data migration projects are behind schedule because it doesn’t necessarily provide immediate benefit. However, in the long run, it becomes critical. It is common to see data issues later, and without proper metadata or data lineage it becomes difficult to assess the root cause of the problem." (Dalton Cervo & Mark Allen, "Master Data Management in Practice: Achieving true customer MDM", 2011)

"For a metadata management program to be successful, it needs to be accessible to everybody that needs it, either from a creation or a consumption perspective. It should also be readily available to be used as a byproduct of other activities, such as data migration and data cleansing. Remember, metadata is documentation, and the closer it is generated to the activity affecting it, the better." (Dalton Cervo & Mark Allen, "Master Data Management in Practice: Achieving true customer MDM", 2011)

01 December 2017

🗃️Data Management: Data Architecture (Just the Quotes)

"The data architecture is the most important technical aspect of your business intelligence initiative. Fail to build an information architecture that is flexible, with consistent, timely, quality data, and your BI initiative will fail. Business users will not trust the information, no matter how powerful and pretty the BI tools. However, sometimes it takes displaying that messy data to get business users to understand the importance of data quality and to take ownership of a problem that extends beyond business intelligence, to the source systems and to the organizational structures that govern a company’s data." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Data architecture allows strategic development of flexible modular designs by insulating the data from the business as well as the technology process." (Charles D Tupper, "Data Architecture: From Zen to Reality", 2011)

"Data architectures are the heart of business functionality. Given the proper data architecture, all possible functions can be completed within the enterprise easily and expeditiously." (Charles D Tupper, "Data Architecture: From Zen to Reality", 2011)

"The enterprise architecture delineates the data according to the inherent structure within the organization rather than by organizational function or use. In this manner it makes the data dependent on business objects but independent of business processes." (Charles D Tupper, "Data Architecture: From Zen to Reality", 2011)

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Data lake architecture suffers from complexity and deterioration. It creates complex and unwieldy pipelines of batch or streaming jobs operated by a central team of hyper-specialized data engineers. It deteriorates over time. Its unmanaged datasets, which are often untrusted and inaccessible, provide little value. The data lineage and dependencies are obscured and hard to track." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data mesh [...] reduces points of centralization that act as coordination bottlenecks. It finds a new way of decomposing the data architecture without slowing the organization down with synchronizations. It removes the gap between where the data originates and where it gets used and removes the accidental complexities - aka pipelines - that happen in between the two planes of data. Data mesh departs from data myths such as a single source of truth, or one tightly controlled canonical data model." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] "The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data." (Pradeep Menon, "Data Lakehouse in Action", 2022)

"A data architecture needs to have the robustness and ability to support multiple data management and operational models to provide the necessary business value and agility to support an enterprise’s business strategy and capabilities." (Sonia Mezzetta, "Principles of Data Fabric", 2023)

"Data architecture is the process of designing and building complex data platforms. This involves taking a comprehensive view, which includes not only moving and storing data but also all aspects of the data platform. Building a well-designed data ecosystem can be transformative to a business." (Brian Lipp, "Modern Data Architectures with Python", 2023)

"Enterprises have difficulties in interpreting new concepts like the data mesh and data fabric, because pragmatic guidance and experiences from the field are missing. In addition to that, the data mesh fully embraces a decentralized approach, which is a transformational change not only for the data architecture and technology, but even more so for organization and processes. This means the transformation cannot only be led by IT; it’s a business transformation as well." (Piethein Strengholt, "Data Management at Scale: Modern Data Architecture with Data Mesh and Data Fabric" 2nd Ed., 2023)

"A data architecture defines a high-level architectural approach and concept to follow, outlines a set of technologies to use, and states the flow of data that will be used to build your data solution to capture big data. [...] Data architecture refers to the overall design and organization of data within an information system." (James Serra, "Deciphering Data Architectures", 2024)

"A data mesh is a decentralized data architecture with four specific characteristics. First, it requires independent teams within designated domains to own their analytical data. Second, in a data mesh, data is treated and served as a product to help the data consumer to discover, trust, and utilize it for whatever purpose they like. Third, it relies on automated infrastructure provisioning. And fourth, it uses governance to ensure that all the independent data products are secure and follow global rules." (James Serra, "Deciphering Data Architectures", 2024)

"The goal of any data architecture solution you build should be to make it quick and easy for any end user, no matter what their technical skills are, to query the data and to create reports and dashboards." (James Serra, "Deciphering Data Architectures", 2024)

13 August 2017

#️⃣Software Engineering: SQL Reloaded (Patt II: Who Messed with My Data?)

Introduction

"Errors, like straws, upon the surface flow;
He who would search for pearls must dive below."
(John Dryden)

Life of a programmer is full of things that stopped working overnight. What’s beautiful about such experiences is that always there is a logical explanation for such “happenings”. There are two aspects - one is how to troubleshoot such problems, and the second – how to avoid such situations, and this is typically done through what we refer as defensive programming. On one side avoiding issues makes one’s life simpler, while issues make it fuller.

I can say that I had plenty such types of challenges in my life, most of them self-created, mainly in the learning process, but also a good share of challenges created by others. Independently of the time spent on troubleshooting such issues, it’s the experience that counts, the little wins against the “dark” side of programming. In the following series of posts I will describe some of the issues I was confronted directly or indirectly over time. In an ad-hoc characterization they can be split in syntax, logical, data, design and systemic errors.

Syntax Errors

"Watch your language young man!"

(anonymous mother)

Syntax in natural languages like English is the sequence in which words are put together, word’s order indicating the relationship existing between words. Based on the meaning the words carry and the relationships formed between words we are capable to interpret sentences. SQL, initially called SEQUEL (Structured English Query Language) is an English-like language designed to manipulate and retrieve data. Same as natural languages, artificial languages like SQL have their own set of (grammar) rules that when violated lead to runtime errors, leading to interruption in code execution or there can be cases when the code runs further leading to inconsistencies in data. Unlike natural languages, artificial languages interpreters are quite sensitive to syntax errors.

Syntax errors are common to beginners, though a moment of inattention or misspelling can happen to anyone, no matter how versatile one’s coding is. Some are more frequent or have a bigger negative impact than others. Here are some of the typical types of syntax errors:
- missing brackets and quotes, especially in complex formulas;
- misspelled commands, table or column names;
- omitting table aliases or database names;
- missing objects or incorrectly referenced objects or other resources;
- incorrect statement order;
- relying on implicit conversion;
- incompatible data types;
- incorrect parameters’ order;
- missing or misplaced semicolons;
- usage of deprecated syntax.

Typically, syntax errors are easy to track at runtime with minimal testing as long the query is static. Dynamic queries on the other side require sometimes a larger number of combinations to be tested. The higher the number of attributes to be combined and the more complex the logic behind them, the more difficult is to test all combinations. The more combinations not tested, the higher the probability that an error might lurk in the code. Dynamics queries can thus easily become (syntax) error generators.

Logical Errors

"Students are often able to use algorithms to solve numerical problems
without completely understanding the underlying scientific concept."
(Eric Mazur)

One beautiful aspect of the human mind is that it needs only a rough understanding about how a tool works in order to make use of it up to an acceptable level. Therefore often it settles for the minimum of understanding that allows it to use a tool. Aspects like the limits of a tool, contexts of applicability, how it can be used efficiently to get the job done, or available alternatives, all these can be ignored in the process. As the devil lies in details, misunderstanding how a piece of technology works can prove to be our Achilles’ heel. For example, misunderstanding how sets and the different types of joins work, that lexical order differ from logical order and further to order of execution, when is appropriate or inappropriate to use a certain technique or functionality can make us make poor choices.

One of these poor choices is the method used to solve a problem. A mature programming language can offer sometimes two or more alternatives for solving a problem. Choosing the inadequate solution can lead to performance issues in time. This type of errors can be rooted in the lack of understanding of the data, of how an application is used, or how a piece of technology works.

"I suppose it is tempting, if the only tool you have is a hammer,
to treat everything as if it were a nail."
(Abraham Maslow)

Some of the errors derive from the difference between how different programming languages work with data. There can be considerable differences between procedural, relational and vector languages. When jumping from one language to another, one can be tempted to apply the same old techniques to the new language. The solution might work, though (by far) not optimal.

The capital mistake is to be the man of one tool, and use it in all the cases, even when not appropriate. For example. when one learned working with views, attempts to apply them all over the code in order to reuse logic, creating thus chains of views which even prove to be flexible, their complexity sooner or later will kick back. Same can happen with stored procedures and other object types as well. A sign of mastery is when the developer adapts his tools to the purpose.

"For every complex problem there is an answer
that is clear, simple, and wrong."
(Henry L Mencken)

One can build elegant solutions but solve the wrong problem. Misunderstanding the problem at hand is one type of error sometimes quite difficult to identify. Typically, they can be found through thorough testing. Sometimes the unavailability of (quality) data can impede the process of testing, such errors being found late in the process.

At the opposite side, one can attempt to solve the right problem but with logic flaws – wrong steps order, wrong algorithm, wrong set of tools, or even missing facts/assumptions. A special type of logical errors are the programmatic errors, which occur when SQL code encounters a logic or behavioral error during processing (e.g. infinite loop, out of range input). [1]

Data Errors

"Data quality requires certain level of sophistication within a company
to even understand that it’s a problem."
(Colleen Graham)

Poor data quality is the source for all evil, or at least for some of the evil. Typically, a good designed database makes use of a mix of techniques to reduce the chances for inconsistencies: appropriate data types and data granularity, explicit transactions, check constraints, default values, triggers or integrity constraints. Some of these techniques can be too restrictive, therefore in design one has to provide a certain flexibility in the detriment of one of the above techniques, fact that makes the design vulnerable to same range of issues: missing values, missing or duplicate records.

No matter how good a database was designed, sometimes is difficult to cope with users’ ingenuity – misusage of functionality, typically resulting in deviations from standard processes, that can invalidate an existing query. Similar effects have the changes to processes or usage of new processed not addressed in existing queries or reports.

Another topic that have a considerable impact on queries’ correctness is the existence, or better said the inexistence of master data policies and a board to regulate the maintenance of master data. Without proper governance of master data one might end up with a big mess with no way to bring some order in it without addressing the quality of data adequately.

Designed to Fail

"The weakest spot in a good defense is designed to fail."
(Mark Lawrence)

In IT one can often meet systems designed to fail, the occurrences of errors being just a question of time, kind of a ticking bomb. In such situations, a system is only as good as its weakest link(s). Issues can be traced back to following aspects:
- systems used for what they were not designed to do – typically misusing a tool for a purpose for which another tool would be more appropriate (e.g. using Excel as database, using SSIS for real-time, using a reporting tool for data entry);
- poor performing systems - systems not adequately designed for the tasks supposed to handle (e.g. handling large volume of data/transactions);
- systems not coping with user’s inventiveness or mistakes (e.g. not validating adequately user input or not confirming critical actions like deletion of records);
- systems not configurable (e.g. usage of hardcoded values instead of parameters or configurable values);
- systems for which one of the design presumptions were invalidated by reality (e.g. input data don’t have the expected format, a certain resource always exists);
- systems not being able to handle changes in environment (e.g. changing user settings for language, numeric or data values);
- systems succumbing in their own complexity (e.g. overgeneralization, wrong mix of technologies);
- fault intolerant systems – system not handling adequately more or less unexpected errors or exceptions (e.g. division by zero, handling of nulls, network interruptions, out of memory).

Systemic Errors

Systemic errors can be found at the borders of the "impossible", situations in which the errors defy the common sense. Such errors are not determined by chance but are introduced by an inaccuracy inherent to the system/environment.

A systemic error occurs when a SQL program encounters a deficiency or unexpected condition with a system resource (e.g. a program encountered insufficient space in tempdb to process a large query, database/transaction log running out of space). [1]

Such errors are often difficult but not impossible to reproduce. The difficulty resides primarily in figuring out what happened, what caused the error. Once one found the cause, with a little resourcefulness one can come with an example to reproduce the error.

Conclusion

"To err is human; to try to prevent recurrence of error is science."
(Anon)

When one thinks about it, there are so many ways to fail. In the end to err is human and nobody is exempted from making mistakes, no matter how good or wise. The quest of a (good) programmer is to limit errors’ occurrences, and to correct them early in process, before they start becoming a nightmare.

References:
[1] Transact-SQL Programming: Covers Microsoft SQL Server 6.5 /7.0 and Sybase, by Kevin Kline, Lee Gould & Andrew Zanevsky, O’Reilly, ISBN 10: 1565924010, 1999

18 June 2017

💠🛠️SQL Server: Administration (Database Recovery on SQL Server 2017)

I installed today SQL Server 2017 CTP 2.1 on my Lab PC without any apparent problems. It was time to recreate some of the databases I used for testing. As previously I had an evaluation version of SQL Server 2016, it expired without having a backup for one of the databases. I could recreate the database from scripts and reload the data from various text files. This would have been a relatively laborious task (estimated time > 1 hour), though the chances were pretty high that everything would go smoothly. As the database is relatively small (about 2 GB) and possible data loss was neglectable, I thought it would be possible to recover the data from the database with minimal loss in less than half of hour. I knew this was possible, as I was forced a few times in the past to recover data from damaged databases in SQL Server 2005, 2008 and 2012 environments, though being in a new environment I wasn’t sure how smooth will go and how long it would take.

Plan A - Create the database with ATTACH_REBUILD_LOG option:

As it seems the option is available in SQL Server 2017, so I attempted to create the database via the following script:

CREATE DATABASE  ON 
(FILENAME='I:\Data\.mdf') 
FOR ATTACH_REBUILD_LOG

And as expected I run into the first error:

Msg 5120, Level 16, State 101, Line 1 Unable to open the physical file "I:\Data\.mdf". Operating system error 5: "5(Access is denied.)".

Msg 1802, Level 16, State 7, Line 1 CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

It looked like a permissions problem, though I wasn’t entirely sure which account is causing the problem. In the past I had problems with the Administrator account, so it was the first thing to try. Once I removed the permissions for Administrator account to the folder containing the database and gave it full control permissions again, I tried to create the database anew using the above script, running into the next error:

File activation failure. The physical file name "D:\Logs\_log.ldf" may be incorrect. The log cannot be rebuilt because there were open transactions/users when the database was shutdown, no checkpoint occurred to the database, or the database was read-only. This error could occur if the transaction log file was manually deleted or lost due to a hardware or environment failure.

Msg 1813, Level 16, State 2, Line 1 Could not open new database ''. CREATE DATABASE is aborted.

This approach seemed to lead nowhere, so it was time for Plan B.

Plan B - Recover the database into an empty database with the same name:

Step 1: Create a new database with the same name, stop the SQL Server, then copy the old file over the new file, and delete the new log file manually. Then restarted the server. After the restart the database will appear in Management Studio with the SUSPECT state.

Step 2: Set the database in EMERGENCY mode:

ALTER DATABASE  SET EMERGENCY, SINGLE_USER

Step 3: Rebuild the log file:

ALTER DATABASE <database_name>

REBUILD LOG ON (Name=’_Log',

FileName='D:\Logs\.ldf')

The rebuild worked without problems.

Step 4: Set the database in MULTI_USER mode:

ALTER DATABASE  SET MULTI_USER

Step 5: Perform a consistency check:

DBCC CHECKDB () WITH ALL_ERRORMSGS, NO_INFOMSG

After 15 minutes of work the database was back online.

Warnings:
Always attempt to recover the data for production databases from the backup files! Use the above steps only if there is no other alternative!
The consistency check might return errors. In this case one might need to run CHECKDB with REPAIR_ALLOW_DATA_LOSS several times [2], until the database was repaired.
After recovery there can be problems with the user access. It might be needed to delete the users from the recovered database and reassign their permissions!

Resources:
[1] In Recovery (2008) Creating, detaching, re-attaching, and fixing a SUSPECT database, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/creating-detaching-re-attaching-and-fixing-a-suspect-database/
[2] In Recovery (2009) Misconceptions around database repair, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/misconceptions-around-database-repair/
[3] Microsoft Blogs (2013) Recovering from Log File Corruption, by Glen Small [Online] Available from: https://blogs.msdn.microsoft.com/glsmall/2013/11/14/recovering-from-log-file-corruption/

24 May 2017

⛏️Data Management: Data Contracts (Definitions)

"Data contracts specifically define the data that is being exchanged between a client and service. The data contract is an agreement, meaning that the client and the service must agree on the data contract in order for the exchange of data to take place. Note that they don't have to agree on the data types, just the contract." (Pablo Cibraro & Scott Klein, "Professional WCF Programming: .NET Development with the Windows Communication Foundation", 2007)

"A data contract is an agreement between a client and a service that conceptually depicts the data to be exchanged. Data contracts define the data types that are used in the service." (Nagaraju B et al, ".Net Interview Questions", 2010)

"The format of the data to be communicated and the logic under which it is created form the data contract. This contract is followed by both the producer and the consumer of the event data. It gives the event meaning and form beyond the context in which it is produced and extends the usability of the data to consumer applications." (Adam Bellemare, "Building Event-Driven Microservices", 2020)

"A data contract is a document that accompanies data movement and captures relevant information (like upstream contacts, service-level agreement, scenarios enabled, etc.)." (Vlad Riscutia, "Data Engineering on Azure", 2021)

"A data contract is a formal agreement between a service and a client that abstractly describes the data to be exchanged. That is, to communicate, the client and the service do not have to share the same types, only the same data contracts. A data contract precisely defines, for each parameter or return type, what data is serialized (turned into XML) to be exchanged." (Microsoft, "Using Data Contracts", 2021) [source]

"A data contract is a written agreement between the owner of a source system and the team ingesting data from that system for use in a data pipeline. The contract should state what data is being extracted, via what method (full, incremental), how often, as well as who (person, team) are the contacts for both the source system and the ingestion." (James Densmore, "Data Pipelines Pocket Reference", 2021)

"It's a formal agreement between the data producer and the data consumers. There is not yet a clear definition of the form and scope of a data contract. Usually, they cover the structure of the exchanged data (i.e. the schema) and its meaning (i.e. the semantics)." (Open Data Mesh, "Data Contract", 2022) [source]

"A data contract is an agreed interface between the generators of data and its consumers. It sets the expectations around that data, defines how it should be governed, and facilitates the explicit generation of quality data that meets the business requirements." (Andrew Jones, "Driving Data Quality with Data Contracts", 2023)

"A data contract is an agreement between the producer and the consumers of a data product. Just as business contracts hold up obligations between suppliers and consumers of a business product, data contracts define and enforce the functionality, manageability, and reliability of data products." (Atlan, "Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos", 2023) [source]

"Data contracts are formal agreements outlining the structure and type of data exchanged between systems, ensuring all parties understand the data's format. Used in various contexts such as APIs, SOA, data pipelines, they provide crucial interoperability, making data contracts essential in managing and controlling data flow effectively." (Jatin Solanki, "What is Data Contracts, is it a hype?", 2023) [source]

"A formal agreement between a data consumer or user and a data provider or owner that defines the conditions under which the data is exchanged between both parties." (Circ Thread)

20 May 2017

⛏️Data Management: Data Scrubbing (Definitions)

"The process of making data consistent, either manually, or automatically using programs." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

Processing data to remove or repair inconsistencies." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"The process of building a data warehouse out of data coming from multiple online transaction processing (OLTP) systems." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal, from data records, of identifying information (i.e., information linking the record to an individual) plus any other information that is considered unwanted. This may include any personal, sensitive, or private information contained in a record, any incriminating or otherwise objectionable language contained in a record, and any information irrelevant to the purpose served by the record." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The process of removing corrupt, redundant, and inaccurate data in the data governance process. (Robert F Smallwood, Information Governance: Concepts, Strategies, and Best Practices, 2014)

"Data Cleansing (or Data Scrubbing) is the action of identifying and then removing or amending any data within a database that is: incorrect, incomplete, duplicated." (experian) [source]

"Data cleansing, or data scrubbing, is the process of detecting and correcting or removing inaccurate data or records from a database. It may also involve correcting or removing improperly formatted or duplicate data or records. Such data removed in this process is often referred to as 'dirty data'. Data cleansing is an essential task for preserving data quality." (Teradata) [source]

"Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated." (Techtarget) [source]

"Part of the process of building a data warehouse out of data coming from multiple online transaction processing (OLTP) systems." (Microsoft Technet)

"The process of filtering, merging, decoding, and translating source data to create validated data for the data warehouse." (Information Management)

05 May 2017

⛏️Data Management: Data Steward (Definitions)

"A person with responsibility to improve the accuracy, reliability, and security of an organization’s data; also works with various groups to clearly define and standardize data." (Margaret Y Chu, "Blissful Data ", 2004)

"Critical players in data governance councils. Comfortable with technology and business problems, data stewards seek to speak up for their business units when an organization-wide decision will not work for that business unit. Yet they are not turf protectors, instead seeking solutions that will work across an organization. Data stewards are responsible for communication between the business users and the IT community." (Tony Fisher, "The Data Asset", 2009)

"A business leader and/or subject matter expert designated as accountable for: a) the identification of operational and Business Intelligence data requirements within an assigned subject area, b) the quality of data names, business definitions, data integrity rules, and domain values within an assigned subject area, c) compliance with regulatory requirements and conformance to internal data policies and data standards, d) application of appropriate security controls, e) analyzing and improving data quality, and f) identifying and resolving data related issues. Data stewards are often categorized as executive data stewards, business data stewards, or coordinating data stewards." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[business data steward:] "A knowledge worker, business leader, and recognized subject matter expert assigned accountability for the data specifications and data quality of specifically assigned business entities, subject areas or databases, but with less responsibility for data governance than a coordinating data steward or an executive data steward." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The person responsible for maintaining a data element in a metadata registry." (Microsoft, "SQL Server 2012 Glossary, 2012)

"The term stewardship is “the management or care of another person’s property” (NOAD). Data stewards are individuals who are responsible for the care and management of data. This function is carried out in different ways based on the needs of particular organizations." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The person responsible for maintaining a data element in a metadata registry." (Microsoft, SQL Server 2012 Glossary, 2012)

"An individual comfortable with both technology and business problems. Stewards are responsible for communicating between the business users and the IT community." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"A role in the data governance organization that is responsible for the development of a uniform data model for business objects used across boundaries. The data steward is also often responsible for the development of master data management and ensures compliance with the governance rules." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"A natural person assigned the responsibility to catalog, define, and monitor changes to critical data. Example: The data steward for finance critical data is Dan." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A person responsible for managing data content, quality, standards, and controls within an organization or function." (Jonathan Ferrar et al, "The Power of People", 2017)

"A data steward is a job role that involves planning, implementing and managing the sourcing, use and maintenance of data assets in an organization. Data stewards enable an organization to take control and govern all the types and forms of data and their associated libraries or repositories." (Richard T Herschel, "Business Intelligence", 2019)

03 May 2017

⛏️Data Management: Hashing (Definitions)

"A technique for providing fast access to data based on a key value by determining the physical storage location of that data." (Jan L Harrington, "Relational Database Dessign: Clearly Explained" 2nd Ed., 2002)

"A mathematical technique for assigning a unique number to each record in a file." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A technique that transforms a key value via an algorithm to a physical storage location to enable quick direct access to data. The algorithm is typically referred to as a randomizer, because the goal of the hashing routine is to spread the key values evenly throughout the physical storage." (Craig S Mullins, "Database Administration", 2012)

"A mathematical technique in which an infinite set of input values is mapped to a finite set of output values, called hash values. Hashing is useful for rapid lookups of data in a hash table." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"An algorithm converts data values into an address" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The technique used for ordering and accessing elements in a collection in a relatively constant amount of time by manipulating the element’s key to identify the element’s location in the collection" (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

"The application of an algorithm to a search key to derive a physical storage location." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"Hashing is the process of mapping data values to fixed-size hash values (hashes). Common hashing algorithms are Message Digest 5 (MD5) and Secure Hashing Algorithm (SHA). It’s impossible to turn a hash value back into the original data value." (Piethein Strengholt, "Data Management at Scale", 2020)

"A process used to convert data into a string of numbers and letters." (AICPA)

"A technique for arranging a set of items, in which a hash function is applied to the key of each item to determine its hash value. The hash value identifies each item's primary position in a hash table, and if this position is already occupied, the item is inserted either in an overflow table or in another available position in the table." (IEEE 610.5-1990)

01 May 2017

⛏️Data Management: Hash (Definitions)

"A number (often a 32-bit integer) that is derived from column values using a lossy compression algorithm. DBMSs occasionally use hashing to speed up access, but indexes are a more common mechanism." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"A set of characters generated by running text data through certain algorithms. Often used to create digital signatures and compare changes in content." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"Hash, a mathematical method for creating a numeric signature based on content; these days, often unique and based on public key encryption technology." (Bo Leuf, "The Semantic Web: Crafting infrastructure for agency", 2006)

[hash code:] "An integer calculated from an object. Identical objects have the same hash code. Generated by a hash method." (Michael Fitzgerald, "Learning Ruby", 2007)

"An unordered collection of data where keys and values are mapped. Compare with array." (Michael Fitzgerald, "Learning Ruby", 2007)

"A cryptographic hash is a fixed-size bit string that is generated by applying a hash function to a block of data. Secure cryptographic hash functions are collision-free, meaning there is a very small possibility of generating the same hash for two different blocks of data. A secure cryptographic hash function should also be one-way, meaning it is infeasible to retrieve the original text from the hash." (Michael Coles & Rodney Landrum, "Expert SQL Server 2008 Encryption", 2008)

"A hash is the result of applying a mathematical function or transformation on data to generate a smaller 'fingerprint' of the data. Generally, the most useful hash functions are one-way collision-free hashes that guarantee a high level of uniqueness in their results." (Michael Coles, "Pro T-SQL 2008 Programmer's Guide", 2008)

"The output of a hash function." (Mark S Merkow & Lakshmikanth Raghavan, "Secure and Resilient Software Development", 2010)

"A number based on the hash value of a string." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"1.Data allocated in an algorithmically randomized fashion in an attempt to evenly distribute data and smooth access patterns. 2.Verb. To calculate a hash key for data." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A hash is the result of applying a mathematical function or transformation on data to generate a smaller 'fingerprint' of the data. Generally, the most useful hash functions are one-way collision-free hashes that guarantee a high level of uniqueness in their results." (Jay Natarajan et al, "Pro T-SQL 2012 Programmer's Guide" 3rd Ed., 2012)

"An unordered association of key/value pairs, stored such that you can easily use a string key to look up its associated data value. This glossary is like a hash, where the word to be defined is the key and the definition is the value. A hash is also sometimes septisyllabically called an “associative array”, which is a pretty good reason for simply calling it a 'hash' instead." (Jon Orwant et al, "Programming Perl" 4th Ed., 2012)

"In a hash cluster, a unique numeric ID that identifies a bucket. Oracle Database uses a hash function that accepts an infinite number of hash key values as input and sorts them into a finite number of buckets. Each hash value maps to the database block address for the block that stores the rows corresponding to the hash key value (department 10, 20, 30, and so on)." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"The result of applying a mathematical function or transformation to data to generate a smaller 'fingerprint' of the data. Generally, the most useful hash functions are one-way, collision-free hashes that guarantee a high level of uniqueness in their results." (Miguel Cebollero et al, "Pro T-SQL Programmer’s Guide" 4th Ed., 2015)

[hash code:] "The output of the hash function that is associated with the input object" (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

"A numerical value produced by a mathematical function, which generates a fixed-length value typically much smaller than the input to the function. The function is many to one, but generally, for all practical purposes, each file or other data block input to a hash function yields a unique hash value." (William Stallings, "Effective Cybersecurity: A Guide to Using Best Practices and Standards", 2018)

"The number generated by a hash function to indicate the position of a given item in a hash table." (IEEE 610.5-1990)

28 April 2017

⛏️Data Management: Completeness (Definitions)

"A characteristic of information quality that measures the degree to which there is a value in a field; synonymous with fill rate. Assessed in the data quality dimension of Data Integrity Fundamentals." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Containing by a composite data all components necessary to full description of the states of a considered object or process." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the extent to which an attribute has values for all instances of an entity class." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A dataset is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree that attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the dataset must be defined so that it includes all the attributes desired (width); the dataset must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness can be measured differently." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Completeness is defined as a measure of the presence of core source data elements that, exclusive of derived fields, must be present in order to complete a given business process." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"Complete existence of all values or attributes of a record that are necessary." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The degree to which all data has been delivered or stored and no values are missing. Examples are empty or missing records." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which elements that should be contained in the model are indeed there." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"Completeness refers to the extent to which required data attributes or records are present in a dataset." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"The degree of data representing all properties and instances of the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data is considered 'complete' when it fulfills expectations of comprehensiveness." (Precisely) [source]

"The degree to which all required measures are known. Values may be designated as “missing” in order not to have empty cells, or missing values may be replaced with default or interpolated values. In the case of default or interpolated values, these must be flagged as such to distinguish them from actual measurements or observations. Missing, default, or interpolated values do not imply that the dataset has been made complete." (CODATA)

27 April 2017

⛏️Data Management: Availability (Definitions)

"Corresponds to the information that should be available when necessary and in the appropriate format." (José M Gaivéo, "Security of ICTs Supporting Healthcare Activities", 2013)

"A property by which the data is available all the time during the business hours. In cloud computing domain, the data availability by the cloud service provider holds a crucial importance." (Sumit Jaiswal et al, "Security Challenges in Cloud Computing", 2015)

"Availability: the ability of the data user to access the data at the desired point in time." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"It is one of the main aspects of the information security. It means data should be available to its legitimate user all the time whenever it is requested by them. To guarantee availability data is replicated at various nodes in the network. Data must be reliably available." (Omkar Badve et al, "Reviewing the Security Features in Contemporary Security Policies and Models for Multiple Platforms", 2016)

"Timely, reliable access to data and information services for authorized users." (Maurice Dawson et al, "Battlefield Cyberspace: Exploitation of Hyperconnectivity and Internet of Things", 2017)

"A set of principles and metrics that assures the reliability and constant access to data for the authorized individuals or groups." (Gordana Gardašević et al, "Cybersecurity of Industrial Internet of Things", 2020)

"Ensuring the conditions necessary for easy retrieval and use of information and system resources, whenever necessary, with strict conditions of confidentiality and integrity." (Alina Stanciu et al, "Cyberaccounting for the Leaders of the Future", 2020)

"The state when data are in the place needed by the user, at the time the user needs them, and in the form needed by the user." (CODATA)

"The state that exists when data can be accessed or a requested service provided within an acceptable period of time." (NISTIR 4734)

"Timely, reliable access to information by authorized entities." (NIST SP 800-57 Part 1)