20 July 2025

🗃️Data Management: Versioning (Just the Quotes)

"There are two different methods to detect and collect changes: data versioning, which evaluates columns that identify rows that have changed (e.g., last-update-timestamp columns, version-number columns, status-indicator columns), or by reading logs that document the changes and enable them to be replicated in secondary systems."  (DAMA International, "DAMA-DMBOK: Data Management Body of Knowledge" 2nd Ed., 2017)

"Moving your code to modules, checking it into version control, and versioning your data will help to create reproducible models. If you are building an ML model for an enterprise, or you are building a model for your start-up, knowing which model and which version is deployed and used in your service is essential. This is relevant for auditing, debugging, or resolving customer inquiries regarding service predictions." (Christoph Körner and Kaijisse Waaijer, "Mastering Azure Machine Learning". 2020)

"Versioning is a critical feature, because understanding the history of a master data record is vital to maintaining its quality and accuracy over time." (Cédrine MADERA, "Master Data and Reference Data in Data Lake Ecosystems" [in "Data Lake" ed. by Anne Laurent et al, 2020])

"Versioning of data is essential for ML systems as it helps us to keep track of which data was used for a particular version of code to generate a model. Versioning data can enable reproducing models and compliance with business needs and law. We can always backtrack and see the reason for certain actions taken by the ML system. Similarly, versioning of models (artifacts) is important for tracking which version of a model has generated certain results or actions for the ML system. We can also track or log parameters used for training a certain version of the model. This way, we can enable end-to-end traceability for model artifacts, data, and code. Version control for code, data, and models can enhance an ML system with great transparency and efficiency for the people developing and maintaining it." (Emmanuel Raj, "Engineering MLOps Rapidly build, test, and manage production-ready machine learning life cycles at scale", 2021)

"DevOps and Continuous Integration/Continuous Deployment (CI/CD) are vital to any software project that is developed by more than one developer and needs to uphold quality standards. A central code repository that offers versioning, branching, and merging for collaborative development and approval workflows and documentation features is the minimum requirement here." (Patrik Borosch, "Cloud Scale Analytics with Azure Data Services: Build modern data warehouses on Microsoft Azure", 2021)

"Automated data orchestration is a key DataOps principle. An example of orchestration can take ETL jobs and a Python script to ingest and transform data based on a specific sequence from different source systems. It can handle the versioning of data to avoid breaking existing data consumption pipelines already in place." (Sonia Mezzetta, "Principles of Data Fabric: Become a data-driven organization by implementing Data Fabric solutions efficiently", 2023)

"Data products should remain stable and be decoupled from the operational/transactional applications. This requires a mechanism for detecting schema drift, and avoiding disruptive changes. It also requires versioning and, in some cases, independent pipelines to run in parallel, giving your data consumers time to migrate from one version to another." (Piethein Strengholt, "Data Management at Scale: Modern Data Architecture with Data Mesh and Data Fabric" 2nd Ed., 2023)

"When performing experiments, the first step is to determine what compute infrastructure and environment you need.16 A general best practice is to start fresh, using a clean development environment. Keep track of everything you do in each experiment, versioning and capturing all your inputs and outputs to ensure reproducibility. Pay close attention to all data engineering activities. Some of these may be generic steps and will also apply for other use cases. Finally, you’ll need to determine the implementation integration pattern to use for your project in the production environment." (Piethein Strengholt, "Data Management at Scale: Modern Data Architecture with Data Mesh and Data Fabric" 2nd Ed., 2023)
