
Applying CRISP-DM to Data Science and a Re-Usable Template in Jupyter

What is CRISP-DM?

CRISP-DM is a process methodology that provides structure for data-mining and analysis projects. It stands for Cross-Industry Standard Process for Data Mining. According to polls on the popular data science website KDnuggets, it is the most widely used process for data mining.


The process revolves around six major steps:

1. Business Understanding

Start by focusing on understanding the objectives and requirements from a business perspective, and then use this knowledge to define the data problem and the project plan.

2. Data Understanding

As with every data project, there is an initial hurdle: collecting the data and familiarising yourself with it, identifying data quality issues, discovering initial insights, and detecting interesting nuggets of information that might form a hypothesis for analysis.
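
In a notebook, a first pass at this step might look something like the sketch below (pandas assumed; the file and column names are placeholders rather than anything from my template):

```python
import pandas as pd

# Load the raw data; "customers.csv" is a placeholder for your own source
df = pd.read_csv("customers.csv")

# Shape, types and summary statistics give a first feel for the data
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))

# Missing-value counts surface data quality issues early
print(df.isna().sum())

# A quick look at a few rows often sparks the first hypotheses
print(df.head())
```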

3. Data Preparation

The nitty-gritty work of preparing the data: cleaning it, merging it, moulding it and so on to form a final dataset that can be used in modeling.
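
A rough sketch of what that preparation might look like, again with placeholder file, key and column names:

```python
import pandas as pd

# Placeholder sources; swap in your own data
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Clean: drop duplicate rows and fill gaps in a numeric column
customers = customers.drop_duplicates()
customers["age"] = customers["age"].fillna(customers["age"].median())

# Merge: bring the two sources together on a shared key
df = customers.merge(orders, on="customer_id", how="left")

# Mould: derive a feature the model can use
df["order_value_per_item"] = df["order_value"] / df["items"]
```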

4. Modeling

At this point we decide which modeling techniques to use and build the models.
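
As an illustration only, here is a minimal scikit-learn example; synthetic data stands in for the prepared dataset, and the random forest is simply a stand-in for whatever technique suits your problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data in place of the prepared dataset from the previous step
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold back unseen data now so the evaluation step has something to test against
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a first candidate model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```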

5. Evaluation

Once the model results look good enough, they need to be tested against unseen data to confirm they generalise and that all key business issues have been answered.
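
Continuing the modeling sketch above, evaluation against the held-out data might look like this:

```python
from sklearn.metrics import accuracy_score, classification_report

# Score the model on data it has never seen
y_pred = model.predict(X_test)
print("Accuracy on unseen data:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```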

6. Deployment

At this stage we are ready to deploy our code representation of the model into a production environment and solve our original business problem.
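
One common, minimal route (not necessarily the one you would use in your own environment) is to persist the fitted model and reload it wherever it is served, for example with joblib:

```python
import joblib

# Persist the fitted model to disk; the file name is just an example
joblib.dump(model, "crisp_dm_model.joblib")

# In the production environment, reload it and serve predictions
loaded_model = joblib.load("crisp_dm_model.joblib")
predictions = loaded_model.predict(X_test)
```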

Why Use It?

Puts the Business Problem First

One of the greatest advantages is that it puts business understanding at the centre of the project. This means we are concentrating on solving the business’s needs first and foremost and trying to deliver value to our stakeholders.

Commonality of Structured Solutions

As a manager of data scientists and analysts, I also find it keeps the team on a common methodology, ensuring we follow best practice and tackle common issues consistently, which helps maintain the quality of results.

Flexible

It’s also not a rigid structure. It’s malleable, the steps are repeatable, and you will often naturally loop back through them to optimise your final dataset for modeling.

The project does not necessarily need to be mining- or model-related. Since so many business problems today require extensive data analysis and preparation, the methodology can flex to suit other categories of solutions such as recommender systems, sentiment analysis and NLP, among others.

A Template in Jupyter

Since we have a structured process to follow, it is likely we have re-usable steps and components, which are ideal candidates for re-usable code. What’s more, a Jupyter notebook can contain the documentation necessary for our business understanding and a description of each step in the process.

To that end, I have a re-usable notebook in my DataSci_Resources GitHub repository here.
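
As a rough illustration of the idea (not the actual notebook in the repository), a CRISP-DM skeleton can even be generated programmatically with nbformat:

```python
import nbformat
from nbformat.v4 import new_code_cell, new_markdown_cell, new_notebook

# The six CRISP-DM phases become the sections of the template
sections = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

nb = new_notebook()
for section in sections:
    # A markdown cell to document the step, followed by an empty code cell to work in
    nb.cells.append(new_markdown_cell(f"## {section}"))
    nb.cells.append(new_code_cell(""))

nbformat.write(nb, "crisp_dm_template.ipynb")
```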
