Data mining is now a mature science. After a great deal of trial and error, there is general agreement on how a data mining project normally progresses. A successful project moves through a recognisable set of stages, and these stages can be put together into a methodology.
A generally agreed upon methodology for data science projects is the Cross Industry Standard Process for Data Mining (CRISP-DM). There is a Wikipedia article about it.
CRISP-DM tells you what to expect at each stage of the project. Some stages are prescriptive and require the application of information technology, whereas others require creativity and business understanding. It gives you a systematic structure, broken down into tasks, which is a neat way of organising your IT or business requirements.
The stages are cyclical, and the methodology allows you to go back to previous stages and refine your understanding. This is normal in data analysis projects, which follow an iterative approach: you always find out new things about your datasets.
Deploying a solution can spawn new data mining projects, and a better understanding of the problem can mean revising the output of a prior stage.
The success of a project depends on understanding the problem. This may seem obvious, but business projects seldom come packaged with clear, unambiguous data mining problems.
The goal of this phase is a definition of the problem to be solved and of how solving it will help the business. If you are unable to describe this, anything that follows will have no direction.
The more the analyst knows about the problem and the environment in which it exists, the better. A good understanding lets the analyst be creative in formulating the solution.
Data is the raw material. It will have to be transformed so that it can answer the question it’s supposed to answer. At this point it is often only a shadow of the data set it is to become.
This is the stage where the strengths and limitations of the data are documented. Any compromises forced by what data was available are also documented here.
Depending on project scope, there could be a requirement for multiple data sets. At this stage you can also consider what data it is possible to obtain and how that will affect the overall goal of the project.
The output of this stage is a well-documented understanding of the data available and how it supports the business understanding of the project’s overall goals.
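One way to start documenting the strengths and limitations of a data set is a small profile of each column. The following is only a minimal sketch: the `profile` function and the customer records are hypothetical, invented for illustration, and a real project would record far more than missing and distinct counts.

```python
def profile(rows):
    """Summarise each column: row count, missing values and distinct values."""
    # Collect every column name that appears in any record.
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None, empty strings and absent keys as missing.
        missing = sum(1 for v in values if v is None or v == "")
        distinct = len({v for v in values if v is not None and v != ""})
        report[col] = {"rows": len(values), "missing": missing, "distinct": distinct}
    return report


# Hypothetical customer records, invented for illustration.
customers = [
    {"id": 1, "age": 34, "region": "north"},
    {"id": 2, "age": None, "region": "south"},
    {"id": 3, "age": 51, "region": ""},
    {"id": 4, "age": 34},
]

report = profile(customers)
for column, stats in report.items():
    print(column, stats)
```

A summary like this makes gaps visible early, for example that `region` is missing in half of the records, so the compromise can be written down before any modelling starts.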
Turn raw, unprocessed and unstructured data into a format that is usable.
Data preparation is an important technical part of the process. Any investment made in improving the quality of the data is sure to pay dividends later.
Proper documentation at this stage is crucial. As the project becomes more complex, it becomes harder to keep track of why certain transformations happened. Any further documentation on this will certainly save time in the future.
There are rules for a well-structured data set.
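The advice on documenting transformations can be built into the preparation code itself. The sketch below assumes a simple design in which every step is a named function paired with the reason it exists, and the pipeline logs what each step did. The step names, the reasons and the sample records are all hypothetical.

```python
def strip_whitespace(rows):
    """Trim leading and trailing spaces from every text field."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def drop_missing_age(rows):
    """Remove records that have no usable age."""
    return [row for row in rows if row.get("age") not in (None, "")]


def age_to_int(rows):
    """Convert the age field from text to an integer."""
    return [{**row, "age": int(row["age"])} for row in rows]


# Each step is stored next to the reason it exists, so the "why" is never lost.
STEPS = [
    (strip_whitespace, "Trim stray spaces from text fields"),
    (drop_missing_age, "Drop rows with no age; the model needs it"),
    (age_to_int, "Store age as an integer"),
]


def prepare(rows):
    """Run every step in order and keep a log of what each one changed."""
    log = []
    for step, reason in STEPS:
        before = len(rows)
        rows = step(rows)
        log.append(f"{step.__name__}: {reason} ({before} -> {len(rows)} rows)")
    return rows, log


# Hypothetical raw records, invented for illustration.
raw = [
    {"name": " Ann ", "age": "34"},
    {"name": "Bob", "age": ""},
    {"name": "Cy", "age": " 51 "},
]

clean, log = prepare(raw)
for entry in log:
    print(entry)
```

Keeping the reason beside the function means the log doubles as documentation: anyone reading it later can see both what happened to the data and why.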
The prepared data set and the data mining algorithms make up the data mining model.
At this stage, multiple data mining algorithms are run on the data. The best algorithms are chosen based on how well they satisfy the original requirement.
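Running several algorithms and choosing between them can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended toolkit: the two candidate models (a majority-class baseline and a one-feature nearest-neighbour rule), the accuracy measure and the churn data are all chosen or invented here for the example.

```python
from collections import Counter


def fit_majority(train):
    """Baseline: always predict the most common label in the training data."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbour(train):
    """Predict the label of the closest training example (one numeric feature)."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical (monthly spend, churned?) pairs, invented for illustration.
train = [(10, "yes"), (12, "yes"), (15, "yes"),
         (40, "no"), (45, "no"), (50, "no"), (55, "no")]
holdout = [(11, "yes"), (14, "yes"), (42, "no"), (52, "no")]

candidates = {"majority": fit_majority, "nearest_neighbour": fit_nearest_neighbour}

# Train every candidate on the same data and score it on data it has not seen.
scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Including a trivial baseline is a deliberate choice: a more elaborate algorithm only earns its place if it clearly beats the simplest possible answer on held-out data.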
Assess the results of the model to gain confidence in how well it performs with the data.
There are several ways to gauge the success and reliability of a model; which ones apply is determined by the problem domain.
Success criteria are subjective and depend on the problem that is being solved.
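To see why the right measure depends on the problem, compare plain accuracy with precision and recall on an imbalanced example. The fraud labels below are hypothetical and the `precision_recall` helper is written for this sketch only.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall for one class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # correctly flagged
    fp = sum(a != positive and p == positive for a, p in pairs)  # false alarms
    fn = sum(a == positive and p != positive for a, p in pairs)  # missed cases
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for ten transactions, invented for illustration.
actual = ["fraud", "ok", "ok", "ok", "fraud", "ok", "ok", "ok", "ok", "ok"]
predicted = ["fraud", "ok", "ok", "ok", "ok", "ok", "fraud", "ok", "ok", "ok"]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision, recall = precision_recall(actual, predicted, positive="fraud")
print(accuracy, precision, recall)
```

Here the model is right on eight of ten transactions, yet it catches only one of the two fraud cases and one of its two alarms is false. Whether that is acceptable is a business decision, which is why the success criteria have to come from the problem rather than from the model.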
Deployment means integrating the findings of the model into the business, which in turn means implementing processes within the enterprise to support the model. This is the highest-stakes part of the cycle because until this point everything was theoretical.
The deployment phase could then lead to another cycle of the project or, with additional business understanding, to refinement of the process to get a better outcome. Large data mining projects may have specific teams dealing with different stages of the process. An appreciation of how each stage fits into the larger picture will lead to more relevant output.
- The Wikipedia article on the methodology.