What is CRISP-DM?
CRISP-DM is a process methodology that provides structure for data-mining and analysis projects. It stands for Cross-Industry Standard Process for Data Mining. According to polls on the popular data science website KDnuggets, it is the most widely used process for data mining.
The process revolves around six major steps:
1. Business Understanding
Start by focusing on understanding the objectives and requirements from a business perspective, and then using this knowledge to define the data problem and project plan.
2. Data Understanding
As with every data project, there is an initial hurdle: collecting the data and familiarising yourself with it, identifying data quality issues, discovering initial insights, or detecting interesting nuggets of information that might form a hypothesis for analysis.
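The first pass at data understanding can be sketched in a few lines of pandas. This is a minimal illustration with a hypothetical customer dataset (the column names and values are invented for the example), showing how summary statistics and missing-value counts surface quality issues early:

```python
import pandas as pd

# Hypothetical dataset: a few customer records, with a deliberate
# quality issue (one missing age) to illustrate the checks below.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, None, 45, 29],
    "spend": [120.0, 80.5, 210.0, 95.0],
})

# Summary statistics give a first feel for ranges and distributions.
summary = df.describe()

# Count missing values per column to surface data quality issues.
missing = df.isnull().sum()
print(missing)
```

From here, any column with unexpected missing counts or implausible ranges becomes a candidate for closer inspection in the preparation step.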
3. Data Preparation
The nitty-gritty dirty work of preparing the data: cleaning it, merging it, moulding it and so on, to form a final dataset that can be used in modeling.
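A typical preparation step might look like the sketch below, again with hypothetical tables: fill a missing value, aggregate a transactions table, and merge the two into one modelling dataset.

```python
import pandas as pd

# Hypothetical source tables to clean and merge into one dataset.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, None, 45]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20.0, 35.0, 50.0]})

# Clean: fill the missing age with the column median.
customers["age"] = customers["age"].fillna(customers["age"].median())

# Reshape: aggregate total spend per customer.
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Merge into the final dataset; customers with no orders get zero spend.
dataset = customers.merge(spend, on="customer_id", how="left").fillna({"amount": 0.0})
print(dataset)
```

The specific cleaning choices (median imputation, zero-filling spend) are placeholders; in a real project they would be driven by what the data understanding step revealed.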
4. Modeling
At this point we decide which modeling techniques to actually use, and build the models.
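Building a first candidate model can be as short as the following sketch, using scikit-learn and synthetic data in place of a real prepared dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared dataset: 100 rows, 3 features,
# with a label that depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit a simple candidate model as a baseline.
model = LogisticRegression().fit(X, y)
train_score = model.score(X, y)
print(train_score)
```

Logistic regression is just one possible choice here; the modeling step is exactly where you would compare several techniques against each other.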
5. Evaluation
Once we appear to have good enough model results, these need to be tested to ensure they generalise well against unseen data and that all key business issues have been answered.
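Testing against unseen data usually means holding out part of the dataset before fitting. A minimal self-contained sketch, again with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Hold out 25% of the data so the model is scored on rows it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

A held-out accuracy figure is only half the evaluation; the other half is checking the result actually answers the business question from step 1.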
6. Deployment
At this stage we are ready to deploy our code representation of the model into a production environment and solve our original business problem.
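Deployment details vary widely, but a common first step is serialising the fitted model so a production service can load it. A minimal sketch using pickle (joblib is a common alternative for scikit-learn models):

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a small model on synthetic data to stand in for the final model.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Serialise the fitted model; a production service would load this blob
# (e.g. from disk or object storage) and call predict on incoming data.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print(restored.predict(X[:1]))
```

The restored model produces the same predictions as the original, which is the basic contract a deployment pipeline relies on.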
Why Use It?
Puts the Business Problem First
One of the greatest advantages is that it puts business understanding at the centre of the project. This means we are concentrating on solving the business’s needs first and foremost and trying to deliver value to our stakeholders.
Commonality of Structured Solutions
As a manager of data scientists and analysts, it also ensures the team sticks to a common methodology, so that best practice is followed and common issues are tackled consistently.
It’s also not a rigid structure: it’s malleable, the steps are repeatable, and often you will naturally go back through earlier steps to optimise your final dataset for modeling.
It does not necessarily need to be mining- or model-related. Since so many business problems today require extensive data analysis and preparation, the methodology can flex to suit other categories of solutions, such as recommender systems, sentiment analysis and NLP, among others.
A Template in Jupyter
Since we have a structured process to follow, it is likely we have re-usable steps and components, which is ideal for re-usable code. What’s more, a Jupyter notebook can contain the documentation necessary for our business understanding and a description of each step in the process.
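One way to bootstrap such a template is to generate a notebook skeleton with one markdown cell per CRISP-DM step. The sketch below hand-builds the JSON following the nbformat 4 schema using only the standard library (in practice you might use the nbformat package instead):

```python
import json

# One markdown heading cell per CRISP-DM step.
steps = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": [f"## {i}. {s}"]}
        for i, s in enumerate(steps, start=1)
    ],
}

# Writing this dict to a .ipynb file yields an openable notebook skeleton.
print(json.dumps(notebook, indent=1)[:40])
```

Each generated section can then be filled with the documentation and code for that step, keeping every project in the same recognisable shape.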
To that end, I have a re-usable notebook on my DataSci_Resources GitHub Repository here