22 June 2018

Data Migrations (DM): Approaches to Planning a Data Migration for an ERP Implementation

Data Migration
Data Migrations Series

Introduction

ERP implementations are one of the most complex projects to plan as they often imply changes/transformations at different levels (e.g. strategic, processes, data, cultural, technological), span over one or more years, involve many resources that need to be efficiently managed, and often come with important costs for the organization.

One way of handling complexity is to ignore the nonessential in planning by focusing on the important activities/phases, following to go deeper as the project progresses. Another way to handle complexity is to split it at manageable parts – identifying and grouping together components. For example, Data Migration (DM) and Data Quality (DQ) are managed as subprojects, with their own planning. The two strategies can be combined to increase the effect.

Planning a DM cannot be done without looking at the timelines of the ERP implementation and considering the various interfaces to the DQ, however in this post I will focus only on the first two.

The Context

In the context of an ERP implementation there are three main approaches to the planning of a DM – pushing the activities toward the end of the implementation, pushing most activities toward the beginning of the implementation, or splitting the various activities over the whole timeline of the ERP implementation. Borrowing a term from statistics, we can talk about a left-skewed plan, a right-skewed plan, respectively a uniform-distributed plan.

For exemplification I will use a set of Lego pieces grouped together in 3 rows and representing the main phases of an ERP implementation, DM, respectively DQ:

clip_image002

The Lego pieces are a good tool for representing the phases, even they can be mischievous because there’s often no clear delimitation between certain phases as they overlap or repeat over several iterations, and bricks’ length doesn’t necessarily represent the actual duration of the phases. In addition, the phases are oversimplified in order not to clutter the diagrams. The detailed phases will be considered in further posts. The color changes gradually as the activities get closer toward the end.

Left-Skewed Planning

One way to plan a DM is backwards from the Go-Live, the DM activities flowing continuously backwards (the DM bricks are arranged from the end over the implementation bricks) and thus accumulating toward the end of the ERP implementation. This approach is natural considering that the requirements stabilize in the second half of the implementation, the code freeze occurring toward the end. Stabilizing means that the most important code changes were performed, and only minor changes need to be performed, typically bug fixing, refactoring or last-minute changes.

The DM starts with a set of requirements in what concerns the business processes, the data and configuration. Each change to these requirements equate with overwork that need to be performed. Typically, this happens only at entity level, however there can be changes that have impact for the whole or important parts of the migration. Additionally, after each set of changes another dry-run needs to be performed in order test the changes. Therefore, to minimize the volume of rework a DM needs a stable environment – in other words a stabile data model, configuration and requirements.

Thus, the conceptualizing of the migration including the prototyping can start in the first half of the implementation or, for long-running projects, even in the second half of the implementation. Performing the DM without interruptions assures an optimal use and planning of the resources – the resources are continuously working on the project, they are focused toward the end.

On the other side, the accumulation of the activities toward the end can easily lead to problems in what concerns resources’ availability. This type of approach needs a good planning, otherwise the project runs into the risk of having the Go-Live delayed until the stability of the DM is assured, or of going Live with data that don’t have the expected quality. These risks can be alleviated by adding a puffer to DM’s timeline, or by considering one of the other two approaches.

This approach minimizes the various types of waste associated with software projects, and thus the costs associated with waste.

clip_image004


Right-Skewed Planning

To release the pressure existing toward the end of the project, some of the DM activities can be performed toward the beginning of the project – the conceptualization and prototyping, as well the data mapping. One can in theory build also an important part of the DM for the standard functionality, following to address the changes in data model, processes and configuration in subsequent iterations. This could involve a higher volume of rework and more dry-runs, however this depends on the complexity and number of the changes. If a small number of customizations are expected, then this approach may be the best approach. Even in the case of many customization this approach might be something to consider, however the DM costs increase with the number of customization made, and in certain contexts the increase can be exponential.

This approach pushes some of the costs toward the beginning of the implementation, and this can have positive as well negative aspects. For example, it is well-known that ERP implementations involve cost overruns. With this approach the DM costs are assured toward the beginning and one can better get a hold of the budget, at least in theory. As negative aspect could be considered the cases in which an ERP implementation is stopped toward the middle of the project, the incurred costs being thus higher. In the end the main cost-driver are the volume of customizations.

Breaking a DM in two can have several other negative aspects. The data cleaning needs to be broken eventually as well, most probably in the second phase more data enrichment activities need to be considered.

The resources that worked on the first phase might not be available for the second phase. An adequate knowledge transfer might be hard to make, so the second team might need good documentation or time to understand the solution. This can lead to other type of behavior, e.g. rewriting unnecessarily the code, the push for a redesign, and so on.

As the environment stabilizes much later, there is the risk that an important part of the migration need to be reworked/redesigned. In extreme cases might be needed to start from zero. The chances for this to happen are small, though such a case can occur. Probably some of the code, transformation can be reused, though this depends from case to case. Without knowing implementation’s details it’s difficult to estimate the chances for something like this to happen. Sometimes is enough to invalidate a premise considered in design phase. Usually the interplay between several new requirements lead to redesign.

clip_image006


Uniform-distributed Plan

To alleviate the risks from the first two approaches, some of the activities could be uniformly distributed over the whole duration of the ERP implementation. This approach works well when same resources from the vendor side are involved in activities from all the three layers, the nature of the tasks allowing them to work continuously in the project. For example, the consultants working on the DM concept are helping on the mapping of the attributes, as well on data cleansing. When the work on multiple activities isn’t possible then the vendor(s) more likely will have problems in assigning resources to the project. Either the same resources will be assigned for big parts from the projects, incurring thus higher costs, or the resources will be replaced by others, additional learning being involved. In either case the costs are higher.

One of the main dangers of this approach is that certain activities will expand taking the time available, incurring thus higher costs. When the Implementation time is much higher than DM’s duration, the distance between DM’s phases can increase dramatically, being almost impossible to manage resources adequately. Keeping the metaphor of the Lego pieces, it will be thus also more difficult to build a structure on which an edifice can be built. With proper planning and adequate use of resources and knowledge the empty spaces can be incorporated in the structure for project’s advantage.

Even if this approach attempts to even the DM effort over the whole duration of the ERP implementation, performing the activities too early, before requirements stabilized can have an adverse effect.

clip_image008


Personal Approach

Looking back at the projects I worked on, I think I used a hybrid between the 3 approaches. The DM was planed backwards from the Go Live, however the first draft of the DM concept and prototyping was performed at the beginning of the implementation. This assured that the technical solution was working. Being involved in the creation of the data mappings as well in data cleansing, the jumps between activities allowed me to smoothly switch between the various activities, however toward the end of the project this became a bottleneck, the activities being harder to synchronize, and the volume of work could be addressed at that time only with overtime.

With a few exceptions I worked mainly alone on the technical activities, being responsible for the data mappings, design, prototyping, implementation, testing, protocolling, and execution of the DM. I think that more resources would have removed some of the burden but made the planning more complicated and the synchronization even harder. Probably a team of 2-3 people that could cover these activities would provide the optimum balance between costs, effort and quality.

Conclusion

I suppose there is no best solution that will work for all. The three approaches are more an attempt to highlight some of the extreme usages of planning. In an ERP implementation there are so many factors, so many chances for a decision to be an opportunity or a threat. My advice – ponder the various aspects/constraints, choose an approach, and adjust it as the project advances.

Previous Post <<||>> Next Post

20 June 2018

ERP Systems: Dynamics AX 2009 – Deleting Obsolete Companies

Introduction   

    During implementations, migrations and other projects are created in Dynamics AX temporary companies (aka legal entities, data areas) that aren’t needed anymore once they fulfilled their purpose. Excepting the fact that obsolete companies occupy space in the data center, under certain circumstances they can lead to performance problems. The logical thing to do would be to delete the obsolete companies as long there’s no further demand from the business.    

   In what follows we will look at several methods for deleting obsolete companies. The scripts were tested in Dynamics AX 2009, and more likely they’ll work in coming versions as long the data model behind was kept.

Warning:
    Please note that the scripts are provided “AS IS” only to exemplify a technique and they come without any warranty! Before attempting any of the methods described here, review the comments from “Further Considerations” section!


Method 1: Using DynamcsAX Built-In Functionality   

   Dynamics AX 2009 provides built-in functionality for deleting a company, however when the volume of data in the system goes above a certain limit the functionality starts to perform poorly, even when run directly on an AOS. (It is recommended to run long-running administration jobs directly on the AOS rather than clients.)    For example, it was attempted to use this method to delete several companies in Dynamics AX Test environment. By the first company the deletion job needed a few hours, while by the second company the job hasn’t finished after two days, being thus forced to stop it. After two further failed attempts it came the time to look for another solution.

Warning:
     It seems that this solution can lead to orphaned data (see [1]). So, even if you are using this method, you might need to consider one of the following methods as well.


Method 2: Using sp_MSforEachTable   

  In almost all tables in AX the company is stored in a DataAreaId attribute. Over this attribute the records belonging to a company are logically partitioned. This allows writing a script via the undocumented sp_MSforEachTable stored procedure:

--delete the data for one data area
sp_MSforEachTable @command1 = 'DELETE FROM ? WHERE DataAreaId = ''m01'''


An error with be thrown for the tables that don’t contain the DataAreaId attribute:
Msg 207, Level 16, State 1, Line 1


Invalid column name 'DataAreaId'.The script can be extended to delete in the same step two or more companies:

--delete the data for multiple data areas
 sp_MSforEachTable @command1 = 'DELETE FROM ? WHERE DataAreaId IN (''m01'', ''m02'')'


     During the first test the script needed half of hour to run, however a few tables  in which the company is stored in other attributes remained untouched. One can either search for such tables manually, via a script, or run the built-in AX functionality. We opted for running the built-in functionality, which managed to delete the remaining data relatively fast.

Warning:
Microsoft doesn’t support this method and can be used when the volume of obsolete data is relatively small!    What does it mean relatively small? The most important limitation of this method is the transaction log, considering that the deleted data are logged. One can either change log’s size to accommodate the volume of data to be deleted or run the deletion only for a subset of the tables. (Changing the recovery model to “simple” or “bulk-logged” won’t make a difference.)

   The second important limitation is the available memory, once the available memory is reached SQL Server having to paginate the data, fact that could lead to further disk space consumed.    Other limitations have more with the performance to do, e.g. each deletion is reflected also in the indexes. One might consider for example dropping the indexes before deletion and recreating them afterwards.


Method 3: Using a Cursor    

  Instead of using the undocumented sp_MSforEachTable stored procedure, the loop can be performed via a cursor (see [1]). This method is advantageous when the deletion needs to be performed only for a subset of tables one could use a cursor. The deletion can be grouped together with other activities and run together.


Method 4: Using „Shadow“ Tables    

   When the volume of data available is huge, and the volume of data that remain in the table is small compared with the overall data, it might be useful to consider using “shadow” tables. One can take advantage of the fact that a truncate command performs incomparable better than a delete command.  To use a truncate on a table, the records that need to be kept could be saved temporarily to a copy (aka “shadow”) of the table, the truncate then applied, and the copied records could be moved back. The following scripts exemplify the logic needed to delete the records from InventDim (inventory dimensions) table:

-- (optional) prove the number of records
SELECT count(*) 
FROM dbo.InventDim 
WHERE DataAreaId = 'm01'

-- create the “shadow” table
CREATE TABLE [dbo].[INVENTDIM_Dump](
[INVENTDIMID] [nvarchar](30) NOT NULL,
[INVENTBATCHID] [nvarchar](21) NOT NULL,
[WMSLOCATIONID] [nvarchar](12) NOT NULL,
[INVENTSERIALID] [nvarchar](21) NOT NULL,
[INVENTLOCATIONID] [nvarchar](10) NOT NULL,
[CONFIGID] [nvarchar](10) NOT NULL,
[INVENTSIZEID] [nvarchar](10) NOT NULL,
[INVENTCOLORID] [nvarchar](10) NOT NULL,
[INVENTSITEID] [nvarchar](10) NOT NULL,
[DATAAREAID] [nvarchar](4) NOT NULL,
[RECVERSION] [int] NOT NULL,
[RECID] [bigint] NOT NULL,
[WMSPALLETID] [nvarchar](18) NOT NULL,
[INVENTSTYLEID] [nvarchar](10) NOT NULL
) ON [PRIMARY]

-- copy the data into the “shadow” table
INSERT INTO [dbo].[InventDim_Dump] WITH (TABLOCK)
SELECT *
FROM [dbo].[InventDim] 
WHERE DataAreaId = 'm01'

-- truncate the data frome the main table 
--TRUNCATE TABLE [dbo].[InventDim]

-- copy the data back
INSERT INTO [dbo].[InventDim] WITH (TABLOCK)
SELECT *
FROM [dbo].[InventDim_Dump]

-- (optional) prove whether the IDs were correctly copied 
SELECT count(*)
FROM [dbo].[InventDim] A
JOIN [dbo].[InventDim_Dump] B
ON A.recid = B.RECID 
AND A. DATAAREAID = B.DATAAREAID 
WHERE A.DataAreaId = 'm01'

-- drop the „shadow“ table 
--DROP TABLE[dbo].[InventDim_Dump]

  

   As can be seen the “shadow” tables are simplified versions of the original tables, without constraints or indexes. They can be eventually created in another schema or even other database.   

   Except the script for table’s creation in the other scripts table’s name can be easily replaced in the editor via the search and replace functionality, trick that reduces considerably the time needed for development. I needed on average 5 minutes for each table, plus 3-4 hours for further tests.    

   The optional steps are more for exemplification and can be eventually removed.  

   The Tablock hint used in inserts provides better performance and minimizes the volume of data logged.    

   I used this method only for the tables having more than 3 million records, around 50 tables in total. Between them there were a few tables having 20-200 GB worth of records. I started with these big tables and figured out that also smaller tables could benefit from this method. A few minutes gained for each small table resulted in the end in a gain of a couple of hours.

   The remained records were 0-25% of the initial tables.   

   In theory, these steps could be performed within a cursor in which the creation of the “shadow” tables could be automated via table metadata as well. This approach will pay-off especially when the schema is not fixed, or the procedure needs to be repeated on different schemas.


Method 5: Delete Records in Batches    

   There will be a point beyond which the performance provided by the fourth method will deprecate considerably. This point is based on the volume of records available in the table, and the records needed to be inserted back and forth. Without further tests, I suppose that this point lies in the 50-75% interval. Beyond this point for big tables in range of 10x or 100x GB it might be useful to delete the data in batches. A push in this direction might be constrained by the need to shrink the transaction log in between the deletes. The query could be written as follow:

-- deleting top x records 
DELETE top 10000
FROM dbo.InventDim WITH (TABLOCK)
WHERE DataAreaId = m01

   The query can be included in a loop or run manually until no records are returned. It can be tested with different batch sizes to determine the best solution. In between is recommended to check also the growth of the log file and truncate it accordingly when needed.


Method 6: Using X++ Code  

    For those having some basic knowledge of X++ and Dynamics AX classes, a solution based on deleting data via AX code could prove to be a better solution as standard functionality can be leveraged, functionality that eventually considers also the business logic implemented. The downside is the code that need to be written for this purpose, however there are already some examples available on the web (see [4]).


Hint:
In AX 2012 built-in support for batch deletes was added via the delete_from statement (see [3]).


Further Considerations    

   Before attempting a deletion, it might be useful to analyze how many records will be deleted from each table, and eventually devise different scenarios for specific table categories. To get the number of records one can use either the built-in functionality from AX or use the sp_MSforEachTable stored procedure and export the results to text, following to overwork the data further in Excel:

-- listing the number of records per company 
sp_MSforEachTable @command1 = 'SELECT dataareaid, ''?'' table_name, count(*) no_records FROM ? WHERE DataAreaId IN (''m01'', ''m02'') GROUP BY dataareaid'

The results can be used also to approximate the space occupied by the data.   

   Independently of the method used it is recommended to restrict users‘ access to the system and to deactivate the scheduled AX or SQL Server jobs. This will ensure that no blockings will occur in the system during the respective time.    

   As data are synchronized between the AOS’s and the database, it is recommended to shut down the not needed AOS services before the deletions are performed, and restart them once all activities were performed.   

   To minimize the risks associated with the loss of data it’s recommended to perform a backup of the database(s) before performing any changes.    

   By deleting the data directly on the database, the business logic from AX (including customizations) is skipped. In theory this can lead to logical inconsistencies, however considering that all the data for a company are deleted, the risks are very small, unless intercompanies are involved.   

   After the data are deleted it is recommended to recreate the indexes and update the statistics on the tables.  

   Check whether the transaction log can accommodate the volume of records to be deleted! In extreme cases your SQL Server might crash! From this consideration it might be advantageous to delete only a company at a time.    

   Based on the volume of data available in the transaction log it might be needed to truncate the log(s) between the steps, as well at the end.  

   After the principle “better safe than sorrow”, it might be a good idea to check the physical and logical consistency of the data before letting the users in.   

  To minimize the impact on the business, it is recommended to perform the deletion outside the working hours, otherwise the action can lead to blocking and even deadlocks in the system.     Always attempt to use standard functionality and resort to other methods only when there’s no way around it.

  It is recommended to always test the scripts thoroughly in the test environment before attempting their productive usage!

References:
[1] Microsoft Dynamics AX Technical Support Blog (2010) How to delete orphaned data remained from deleted company?, by Martin Falta [Online] Available from: https://blogs.msdn.microsoft.com/emeadaxsupport/2010/12/09/how-to-delete-orphaned-data-remained-from-deleted-company/
[2] Art of Creation (2010) Delete an AX company on SQL [Online] Available from: http://www.artofcreation.be/2010/02/03/delete-an-ax-company-on-sql/
[3] MSDN (2012) delete_from Statement [Online] Available from: https://msdn.microsoft.com/en-us/library/aa624886.aspx[
4] Kevin’s blog (2017) Dynamics Ax 2012 History cleanup, by Kevin Roos [Online] Available from: https://www.kevinroos.be/2017/07/dynamics-ax-2012-history-cleanup/

09 June 2018

Data Migrations (DM): Guiding Principles

Data Migration
Data Migrations Series

Introduction

“An army of principles can penetrate where an army of soldiers cannot."
Thomas Paine

In life as well in IT principles serve as patterns of advice in form of general or fundamental ideas, truths or values stated in a context-independent manner. They can be used as guidelines in understanding and modeling the reality, the world we live in. With the invasion of technologies in our lives principles serve as a solid ground on which we can build castles – solutions for our problems. Each technology comes with its own set of principles that defines in general terms its usage. That's why most of the IT books attempt to catch these sets of principles. Unfortunately, few of the technical writers manage to define some meaningful principles and showcase their usages.

Many of the ideas considered as principles in papers on Data Migration (DM) are at best just practices, and some can be considered as best/good practices. Just because something worked good in a previous migration doesn’t mean automatically that the idea behind the respective decision turns automatically in a principle. Some of the advices advanced are just lessons learned in disguise. Principles through their generality apply to a broad range of cases, while practices are more activity specific.

A DM through its nature finds its characteristics at the intersection of several area - database-based architecture design, ETL workflows, data management, project management (PM) and services. From these areas one can pull a set of principles that can be used in building DM architectures.

Architecture Principles

“Architecture starts when you carefully put two bricks together.”
Ludwig Mies van der Rohe

There are several general principles that apply to the architecture of applications, independently of the technologies used or the industry, e.g. research first, keep it simple/small, start with the end in mind, model first, design to handle failure, secure by design (aka safety first), prototype, progress iteratively, focus on value, reuse (aka don't reinvent the wheel), test early, early feedback, refactor, govern, validate, document, right tool – right people, make it to last, make it sustainable, partition around limits, scale out, defensive coding, minimal intervention, use common sense, process orientation, follow the data, abstract, anticipate obsolescence, benchmark, single-responsibility, single dispatch, separation of concerns, right perspective.

To them add a range of application design characteristics that can be considered as principles as well: extensibility, modularity, adaptability, reusability, repeatability, modularity, performance, revocability, auditability, subject-orientation, traceability, robustness, locality, heterogeneity, consistency, atomicity, increased cohesion, reduced coupling, monitoring, usability, etc. There are several principles that can be transported from problem solving into design - divide and conquer, prioritize, system’s approach, take inventory, and so on.

A DM’s architecture has more to do with a data warehouse as it relies heavily on ETL tasks and data need to be stored for various purposes. Besides the principles of good database design, a few other principles apply: model (the domain) first, denormalize, design for performance, maintainability and security, validate continuously. From ETL area following principles can be considered: single point of processing, each step must have a purpose, minimize touch points, rest data for checkpoints, leverage existing knowledge, automate the steps, batch processing.

 In addition, considering their data-specific character, a DM can be regarded as one or several data products, though in contrast with typical data products DM have typically a limited purpose. From this area following principles could be considered: build trust with transparency, blend in, visualize the complex.

Data Management Principles

Considering that a DM’s focus is an organization's data, some principles need to focus on the management and governance of Data. Data Governance together with Data Quality, Data Architecture, Metadata Management, Master Data Management are functions of Data Management. The focus is on data, metadata and their lifecycle, on processes, ownership and roles and their responsibilities. With this in mind there can be defined several principles supposed to facilitate the functions of Data Management: manage data as asset, manage data lifecycle, the business owns the data, integration across the organization, make data/metadata accessible, transparent and auditable processes, one source of truth.

As part of DM there are customer, employee and vendor information which fall under the General Data Protection Regulation (GDPR) EU 2016/679 regulation which defines the legal framework for data protection and privacy for all individuals within the European Union (EU) and the European Economic Area (EEA) as well the export of personal data outside the EU and EEA. The regulation defines a set of principles that make its backbone: fairness, lawfulness and transparency, purpose limitation, data minimization, accuracy, storage limitation, integrity and confidentiality, accountability [6].

Overseas, the US Federal Trade Commission (FTC) issued in 2012, a report recommending organizations design and implement their own privacy programs based on a set of best practices. The report reaffirms the FTC’s focus on Fair Information Processing Principles, which include notice/awareness, choice/consent, access/participation, integrity/security, enforcement/redress [6].


Project Management (PM) Principles

"Management is doing things right […]"
Peter Drucker

A DM though its characteristics is a project and, to increase the chances of success, it needs to be managed as a project. Managing DM as a project is one of the most important principles to consider. The usage of a PM framework will further increase the chances of success, as long the framework is adequate for the purpose and the organization team is able to use the framework. PMI, Prince2, Agile/Scrum/Kanban are probably the most used PM methodologies and they come with their own sets of principles.

In general, all or some of the PM principles apply independently on whether is used alone or in combination with other PM methodologies: a single project manager, an informed and supportive management, a dedicated team of qualified people to do the work of the project, clearly defined goals addressing stakeholders’ priorities, an integrated plan and schedule, as well a budget of costs and/or resources required [1].

On the other side, an agile approach could prove to be a better match for a DM given that requirements change a lot, frequent and continuous deliveries are needed, collaboration is necessary, agile processes as well self-organizing teams can facilitate the migration. These are just a few of the catchwords that make the backbone of the Agile Manifesto (see [3]).

An agile form of Prince2 could be something to consider as well, especially when Prince2 is used as methodology for other projects. For Prince2 are the following principles to consider: continued business justification, learn from experience, defined roles and responsibilities, manage by stages, management by exception, focus on products, tailor to suit the project environment [2].

All these PM principles reveal important aspects to ponder upon, and maybe with a few exceptions, all can be incorporated in the way the DM project is managed.


Service Principles

Considering the dependencies existing between the DM and Data Quality as well to the broader project, a DM can have the characteristics of a service. It’s not an IT Service per se, as IT only supports technically and eventually from a PM perspective the project. Even if a DM is not a ITSM service, some of the ITIL principles can still apply: focus on value, design for experience, start where you are, work holistically, progress iteratively, observe directly, be transparent, collaborate and keep it simple [4].


Conclusion

“Obey the principles without being bound by them.”
Bruce Lee

Within a DM all the above principles can be considered, though the network of implication they create can easily shift the focus from the solution to the philosophical aspects, and that’s a marshy road to follow. Even if all principles are noble, not all can be considered. It would be utopic to consider each possible principle. The trick is to identify the most “important” principles (principles that make sense) and prioritize them according to existing requirements. In theory, this is a one-time process that involves establishing a “framework” of best/good practices for the DM, in next migrations needing only to consider the new facts and aspects.

Previous Post <<||>> Next Post

References:
[1] “Principles of project management”, by J. A. Bing, PM Network, 1994 (link)
[2] Axelos (2018) What is PRINCE2? (link)
[3] Agile Manifesto (2001) Principles behind the Agile Manifesto (link)
[4] Axelos (2018) ITIL® Practitioner 9 Guiding Principles (link)
[5] The Data Governance Institute (2018) Goals and Principles for Data Governance (link) 
[6] Navigating the Labyrinth: An Executive Guide to Data Management, by Laura Sebastian-Coleman for DAMA International, Technics Publications, 2018 (link)  

01 June 2018

Data Science: Data Model (Definitions)

"simplified and approximative description of a system or process, based on a finite set of essential variables and their analytically definable behavior." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"(i) An abstract, self-contained logical definition of the data structures and associated operators that make up the abstract machine with which users interact (such as the relational model of data). (ii) A model of the persistent data of some enterprise" (Keith Gordon, "Principles of Data Management", 2007)

"A means of encapsulating the data elements that were decided on by the business experts, in conjunction with the data stewards and IT professionals. The data models reflect an organization’s business as represented in the data." (Tony Fisher, "The Data Asset", 2009)

"An abstraction of how individual data elements relate to each other. It visually depicts how the data is to be organized and stored in a database. A data model provides the mechanism to document and understand how data is organized." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"An organization of data that describes the relationships among primitive and composite data elements." (Toby J. Teorey, "Database Modeling and Design" 4th Ed., 2010)

"A model of the structure (and to some extent the content) of a database and at least some of the rules governing the data therein." (Graham Witt, "Writing Effective Business Rules", 2012)

"A data model is a visual representation of data content and the relationships, created for purposes of understanding how data is or might be organized, and for ensuring the comprehensibility and usability of that way of organizing data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013)

"Represents data objects and their relationships with each other. Data models form the basis for data integration at the conceptual level as well as the improvement of data quality, such as with regard to the reduction of data redundancy. Data models are one component of the data architecture." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"A template formalizing the relationship between an input and an output. Its structure is fixed but it also has parameters that are modifiable; the parameters are adjusted so that the same model with different parameters can be trained on different data to implement different relationships in different tasks." (Ethem Alpaydın, "Machine learning : the new AI", 2016)

"A visual means of depicting data and its relationship to other data." (Gregory Lampshire et al, "The Data and Analytics Playbook", 2016)

"An abstract representation of a subject that looks and/or behaves like all or part of the original." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"In the context of machine learning, a model is a representation of a pattern extracted using machine learning from a data set. Consequently, models are trained, fitted to a data set, or created by running a machine learning algorithm on a data set. Popular model representations include decision trees and neural networks. A prediction model defines a mapping (or function) from a set of input attributes to a value for a target attribute. Once a model has been created, it can then be applied to new instances from the domain. For example, in order to train a spam filter model, we would apply a machine learning algorithm to a data set of historic emails that have been labeled as spam or not spam. Once the model has been trained it can be used to label (or filter) new emails that were not in the original data set." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"An abstract model that describes how data is presented and used." (Piethein Strengholt, "Data Management at Scale", 2020)

"A logical map that represents the inherent properties of the data independent of software, hardware or machine performance considerations. The model shows data elements grouped into records, as well as the association around those records." (Information Management)

"Defines how data is structured, related, and standardized for the purpose of extracting meaningful insight." (Insight Software)

"A data model is an abstract model that organizes elements of data and standardizes how they relate to one another. Their properties are generally governed by properties of the real world entities." (kloudless)
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.