- Team Data Science Process (TDSP)
- an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently [1]
- {goal} help customers fully realize the benefits of their analytics program [1]
- {component} data science lifecycle definition
- {description} a framework to structure the development of data science projects [1]
- {goal} designed for data science projects that ship as part of intelligent applications that deploy ML & AI models for predictive analytics [1]
- {benefit} can be used in the context of other DM methodologies as they have common ground [1]
- e.g. CRISP-DM, KDD
- {benefit} exploratory data science projects or improvised analytics projects can also benefit from using this process [1]
- {component} standardized project structure
- {description} a directory structure that includes templates for project documents
- ⇒makes it easy for team members to find information [1]
- ⇐templates for the folder structure and required documents are provided in standard locations [1]
- all code and documents are stored in an agile VCS tracking repository [1]
- {recommendation} create a separate repository for each project on the VCS for versioning, information security, and collaboration [1]
- {benefit} organizes the code for the various activities [1]
- {benefit} allows tracking the progress [1]
- {benefit} provides a checklist with key questions for each project to guarantee the quality of the process and the deliverables [1]
- {benefit} enables team collaboration [1]
- {benefit} allows closer tracking of the code for individual features [1]
- {benefit} enables teams to obtain better cost estimates [1]
- {benefit} helps build institutional knowledge across the organization [1]
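- {example} a minimal sketch of how such a standardized folder structure could be scaffolded for a new project; the folder names below only approximate a TDSP-like layout and are illustrative assumptions

```python
# scaffold_project.py - illustrative sketch, not TDSP's official tooling;
# the folder layout approximates a TDSP-like structure and is an assumption
from pathlib import Path

FOLDERS = [
    "code/data_acquisition",
    "code/modeling",
    "code/deployment",
    "docs/project",      # project charter, exit report, ...
    "docs/data_report",  # data quality report, data summaries
    "docs/model",        # model reports, one per experiment
    "sample_data",
]

def scaffold(root: str) -> None:
    """Create the standardized directory structure under `root`."""
    for folder in FOLDERS:
        path = Path(root) / folder
        path.mkdir(parents=True, exist_ok=True)
        (path / ".gitkeep").touch()  # keep empty folders under version control

if __name__ == "__main__":
    scaffold("my-analytics-project")
```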
- {component} recommended infrastructure
- {description} a set of recommendations for the infrastructure and resources needed for analytics and storage [1]
- {benefit} addresses cloud and/or on-premises requirements [1]
- {benefit} enables reproducible analysis [1]
- {benefit} avoids infrastructure duplication [1]
- ⇒minimizes inconsistencies and unnecessary infrastructure costs [1]
- {tools} tools are provided to provision the shared resources, track them, and allow each team member to connect to those resources securely [1]
- {good practice} create a consistent compute environment [1]
- ⇐allows team members to replicate and validate experiments [1]
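- {example} a minimal sketch, under the assumption that pinning exact package versions is how the team keeps compute environments consistent; the file name is illustrative

```python
# snapshot_env.py - illustrative sketch: record the installed package versions
# so other team members can replicate and validate experiments
from importlib.metadata import distributions

def write_requirements(path: str = "requirements.lock") -> None:
    """Write a pinned list of installed packages (name==version)."""
    lines = sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()
    )
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_requirements()
    # teammates can then recreate the environment with:
    #   pip install -r requirements.lock
```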
- {component} recommended tools and utilities
- {description} a set of recommendations for the tools and utilities needed for the project’s execution [1]
- {benefit} helps lower the barriers to adopting the tools and increases the consistency of their adoption [1]
- {benefit} provides an initial set of tools and scripts to jump-start the methodology’s adoption [1]
- {benefit} helps automate some of the common tasks in the data science lifecycle [1]
- e.g. data exploration and baseline modeling [1]
- {benefit} a well-defined structure is provided for individuals to contribute shared tools and utilities to their team's shared code repository [1]
- ⇐ resources can then be leveraged by other projects [1]
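- {example} an illustration of the kind of shared utility meant above (not one of the TDSP-provided scripts): a reusable baseline-modeling helper that any project in the team's shared repository could call

```python
# team_utils/baseline.py - hypothetical shared utility for baseline modeling
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def baseline_scores(df: pd.DataFrame, target: str, cv: int = 5) -> dict:
    """Compare a trivial baseline against a simple model to anchor expectations."""
    X = df.drop(columns=[target]).select_dtypes("number").fillna(0)
    y = df[target]
    models = {
        "majority_class": DummyClassifier(strategy="most_frequent"),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }
```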
- {phase} 1: business understanding
- {goal} define and document the business problem, its objectives, the attributes needed, and the metric(s) used to determine the project’s success
- {goal} identify and document the relevant data sources
- {step} 1.1: define project’s objectives
- elicit the requirements together with the stakeholders; define and document the problem, its objectives, and the metric(s) used to determine the project’s success
- requires a good understanding of the business processes, the data, and their characteristics
- {step} 1.2: identify data sources
- identify the attributes and the data sources relevant to the problem under study
- {step} 1.3: define project plan and team*
- develop a high-level milestone plan and identify the resources needed for executing it
- {tool} project charter
- standard template that documents the business problem, the scope of the project, the business objectives, and the metric(s) used to determine the project’s success
- {phase} 2: data acquisition & understanding
- {goal} prepare the base dataset(s) needed by the modeling phase and load them into the target repository
- {goal} build the data ETL/ELT architecture and processes needed for provisioning the base data
- {step} 2.1: ingest data
- make the required data available for the team in the repository where the analytics operations take place
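- {example} a minimal sketch of the ingestion step, assuming the raw extract is a CSV file and the analytics environment stores Parquet; paths and formats are assumptions

```python
# ingest.py - illustrative sketch: copy raw data into the repository
# where the analytics operations take place
import pandas as pd

def ingest(source_csv: str, target_parquet: str) -> pd.DataFrame:
    """Load the raw extract and store it in the analytics repository."""
    df = pd.read_csv(source_csv)
    df.to_parquet(target_parquet, index=False)  # requires pyarrow or fastparquet
    return df

if __name__ == "__main__":
    ingest("raw/customers.csv", "analytics/customers.parquet")
```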
- {step} 2.2: explore data
- understand the data’s characteristics by leveraging visualization and analysis tools
- prepare the data as needed for further processing
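- {example} a sketch of a typical first exploration pass (summary statistics, missing values, distributions); the output file name is an assumption

```python
# explore.py - illustrative sketch of a first exploration pass
import pandas as pd
import matplotlib.pyplot as plt

def explore(df: pd.DataFrame) -> None:
    """Print basic characteristics of the dataset and plot numeric distributions."""
    print(df.describe(include="all"))                     # summary statistics
    print(df.isna().mean().sort_values(ascending=False))  # share of missing values per column
    df.select_dtypes("number").hist(bins=30, figsize=(10, 6))
    plt.tight_layout()
    plt.savefig("distributions.png")
```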
- {step} 2.3: set up pipelines
- build the pipelines needed for data refresh and quality assessment [3]
- set up a process to score new data or refresh the data regularly [3]
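- {example} a minimal sketch of a recurring refresh-and-score job; a real setup would be triggered by a scheduler or orchestrator, and the function names and paths are assumptions

```python
# pipeline.py - illustrative sketch of a job that refreshes the data and
# scores it with the current model (to be run on a schedule)
import pandas as pd
from joblib import load

def refresh_and_score(source_csv: str, model_path: str, target_parquet: str) -> None:
    """Pull the latest extract, apply the trained model, persist the scores."""
    df = pd.read_csv(source_csv)  # hypothetical fresh extract
    model = load(model_path)      # previously trained model artifact
    features = df.select_dtypes("number").fillna(0)  # assumes the model was trained on these columns
    df["score"] = model.predict_proba(features)[:, 1]
    df.to_parquet(target_parquet, index=False)
```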
- {step} 2.4: feasibility analysis*
- reevaluate the project to determine whether the value expected is sufficient to continue pursuing it
- {tool} data quality report
- report that includes data summaries, data mappings, variable ranking, data quality assessment(s), and further information [3]
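- {example} a sketch of how some of these inputs (per-variable summaries, a simple variable ranking) could be computed; assumes a numeric target column

```python
# data_quality.py - illustrative inputs for a data quality report:
# missing-value shares, cardinality, and a naive variable ranking
import pandas as pd

def quality_summary(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Per-variable missing share, cardinality, and |correlation| with the target."""
    numeric = df.select_dtypes("number")
    summary = pd.DataFrame({
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(),
        "abs_corr_with_target": numeric.corrwith(numeric[target]).abs().drop(target),
    })
    return summary.sort_values("abs_corr_with_target", ascending=False)
```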
- {tool} solution architecture
- diagram and/or textual-based description of the data pipeline(s), technical assumptions and further aspects
- {tool} data reports
- document the structure and statistics of the raw data
- {tool} checkpoint decision
- decision template document that
- summarizes the findings of the feasibility analysis step
- includes a set of choices and recommendations for the next steps
- serves as the basis for deciding whether to continue the project and what the next steps are
- {phase} 3: modeling
- {goal} create a machine-learning model that addresses the prediction requirements and that's suitable for production
- {step} 3.1: feature engineering
- the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis [4]
- ⇐requires a good understanding of how the features relate to each other and how the ML algorithms use those features [4]
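- {example} a sketch of the kind of aggregation and transformation meant here, assuming a hypothetical transactions table with customer_id, amount and timestamp columns

```python
# features.py - illustrative feature engineering on a hypothetical transactions table
import numpy as np
import pandas as pd

def build_features(transactions: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw transaction rows into per-customer features."""
    transactions["timestamp"] = pd.to_datetime(transactions["timestamp"])
    features = transactions.groupby("customer_id").agg(
        total_spent=("amount", "sum"),
        avg_amount=("amount", "mean"),
        n_transactions=("amount", "size"),
        days_since_last=("timestamp", lambda s: (pd.Timestamp.now() - s.max()).days),
    )
    # transformation: damp the skew of monetary totals
    features["log_total_spent"] = np.log1p(features["total_spent"].clip(lower=0))
    return features.reset_index()
```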
- {step} 3.2: model selection*
- choose one or more modeling algorithms that best address the problem’s characteristics
- {step} 3.3: model training
- involves the following steps:
- split the input data into training and test datasets
- build the models by using the training dataset
- evaluate the models on the training and the test datasets
- determine the optimal setup and methods
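- {example} a sketch of the steps listed above: split the data, build candidate models on the training set, and compare them on both datasets (the candidate models are illustrative)

```python
# train.py - illustrative sketch of the model training steps
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_and_compare(df: pd.DataFrame, target: str) -> dict:
    """Assumes a classification problem with numeric features."""
    X = df.drop(columns=[target]).select_dtypes("number").fillna(0)
    y = df[target]
    # 1. split the input data into training and test datasets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    # 2. build the candidate models using the training dataset
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    }
    results = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        # 3. evaluate on the training and the test datasets to spot overfitting
        results[name] = {
            "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
            "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
        }
    # 4. the setup with the best test performance informs the final choice
    return results
```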
- {step} 3.4: model evaluation
- evaluate the performance of the model(s)
- {step} 3.5: feasibility analysis*
- evaluate whether the models are ready for use in production and whether they fulfill the project’s objectives
- {tool} feature sets
- describe the features developed for the modeling and how they were generated
- contain pointers to the code used to generate the features
- {tool} model report
- a standard, template-based report that provides details on each experiment’s outcomes
- created for each model tried
- {tool} checkpoint decision
- {tool} model performance metrics
- e.g. ROC curves or MSE
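- {example} for reference, computing the two metrics named above with scikit-learn (the values are made-up toy numbers)

```python
# metrics.py - illustrative computation of common model performance metrics
from sklearn.metrics import mean_squared_error, roc_auc_score

# classification: ROC AUC from predicted probabilities (toy values)
y_true_cls = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print("ROC AUC:", roc_auc_score(y_true_cls, y_score))

# regression: mean squared error from point predictions (toy values)
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred))
```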
- {phase} 4: deployment
- {goal} deploy the models and the data pipelines to the environment used for final user acceptance
- {step} 4.1: operationalize architecture
- prepare the models and data pipelines for use in production
- {best practice} expose the models over an open API interface
- enables the models to be consumed by various applications
- {best practice} build telemetry and monitoring into the models and the data pipelines [5]
- helps in monitoring and troubleshooting [5]
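- {example} a minimal sketch combining both practices: the model exposed over an HTTP API with basic telemetry (request logging, latency) built into the endpoint; Flask 2+, paths, and payload fields are assumptions

```python
# serve.py - illustrative sketch: expose a trained model over an open HTTP API
# with basic telemetry; framework, paths and payload fields are assumptions
import logging
import time

from flask import Flask, jsonify, request
from joblib import load

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-api")

app = Flask(__name__)
model = load("models/model.joblib")  # hypothetical trained model artifact

@app.post("/predict")
def predict():
    start = time.perf_counter()
    payload = request.get_json()
    features = [payload["features"]]  # expects {"features": [ ...numeric values... ]}
    score = float(model.predict_proba(features)[0, 1])
    # telemetry: log score and latency so the service can be monitored and troubleshot
    logger.info("prediction=%.4f latency_ms=%.1f", score, (time.perf_counter() - start) * 1000)
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```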
- {step} 4.2: deploy solution*
- deploy the architecture into production
- {tool} status dashboard
- displays data on the system’s health and key metrics
- {tool} model report
- the report in its final form with deployment information
- {tool} solution architecture
- the document in its final form
- {phase} 5: customer acceptance
- {goal} confirm that the project’s objectives were fulfilled and obtain the customer’s acceptance
- {step} 5.1: system validation
- validate the system’s performance and outcomes and confirm that it fulfills the customer’s needs
- {step} 5.2: project signoff*
- finalize and review documentation
- hand over the solution and the related documentation to the customer
- evaluate the project against the defined objectives and get the customer’s sign-off
- {tool} exit report
- {tool} technical report
- contains all the details of the project that are useful for learning about how to operate the system [6]
Acronyms:
Application Programming Interface (API)
Artificial Intelligence (AI)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Data Mining (DM)
Extract, Load, Transform (ELT)
Extract, Transform, Load (ETL)
Knowledge Discovery in Databases (KDD)
Machine Learning (ML)
Mean Squared Error (MSE)
Receiver Operating Characteristic (ROC)
Team Data Science Process (TDSP)
Version Control System (VCS)
Visual Studio Team Services (VSTS)
Resources:
[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]
[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]
[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]
[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source]
[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]
[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]