Getting data ready for analysis is usually the longest and most involved part of a data analysis project. Starting from scratch, you can expect this part of the project to take up to 80% of the total time.
Because of this, any time spent on speeding up the process can have a significant impact on the overall completion time of the project. It will also allow you to spend more time on the actual analysis of the data.
Although there is a sequence of steps that are normally followed when carrying out this task, it is normal to revisit previous steps. Sometimes more insight is gained from the data that can be applied earlier in the analysis chain. This means the project is more iterative than sequential.
While building a data analysis project, it is normal to expect an iterative approach like this. The project should be organised in a structure that allows for going back and modifying previous steps.
Quantitative and Qualitative Data
In a forecasting project it is important to use all the information you have at hand when building the forecast. This means that the data side of things is only part of the overall information available. Business and expert knowledge needs to be included in the forecast.
The final forecast needs to clearly document how it was arrived at and all the information that went into building it.
Getting as much information as possible from your data is time well spent. Even a small new insight can yield significant opportunities to the business.
Getting at the Data
Start off by documenting the major data sources and what you know about the contents of each data source. This could include the opinions of subject matter experts in the business. It could also include information about future events that will affect the forecast, such as new product launches or marketing initiatives.
Data sources are normally spreadsheets or database applications, but nowadays there are multiple other sources that can be used to get additional insight into the business.
Files and documents can be an interesting challenge. The data you are looking for could exist in multiple files scattered all over the organisation, or within a file in a specific location. Extracting data from files can require a lot of custom-built extraction routines.
SQL is a universal query language that can be used to get data sets out of databases. All relational databases support it for extracting data.
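As a minimal sketch of this, the snippet below builds a small in-memory SQLite database standing in for a real data source and pulls a summary out with a SQL query. The table and column names are invented for illustration, not from any real system.

```python
import sqlite3

# Build a small in-memory database to stand in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 95.5), ("North", 80.0)])

# A SQL query pulls out exactly the slice of data needed for analysis.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 200.0), ('South', 95.5)]
```

The same query would work largely unchanged against most relational databases, which is what makes SQL such a dependable extraction tool.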
Web scraping is becoming an increasingly popular way of getting data. It involves writing a script to read a web page and extract its contents.
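A rough sketch of the idea, using only the standard library: a static HTML string stands in for a fetched page (in practice the page would come from an HTTP request), and a small parser pulls the table cells out of it. The page content is invented for illustration.

```python
from html.parser import HTMLParser

# A static page stands in for one fetched over HTTP; its contents are invented.
PAGE = ("<html><body><table>"
        "<tr><td>Widget</td><td>19.99</td></tr>"
        "<tr><td>Gadget</td><td>4.50</td></tr>"
        "</table></body></html>")

class TableScraper(HTMLParser):
    """Collect the text of every table cell on the page."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.cells)  # ['Widget', '19.99', 'Gadget', '4.50']
```

Real scraping scripts tend to be more involved, since page layouts vary and change over time, but the core loop of parse, locate, extract stays the same.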
Application Programming Interface (API)
This is any type of query interface you can use to extract data from another application.
Publicly available data
E.g. weather data, to check whether it has an impact on your sales.
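APIs and public data sources typically return structured payloads such as JSON. As a small sketch, the snippet below parses a canned response in place of an HTTP call (which would normally be made with something like `urllib.request`); the field names are invented for illustration.

```python
import json

# A canned payload stands in for a real HTTP response from a weather API;
# the field names below are invented for illustration.
response_body = '{"date": "2024-03-01", "temp_c": 14.5, "rain_mm": 2.3}'

record = json.loads(response_body)

# Keep only the fields relevant to the sales analysis.
weather = {"date": record["date"], "rain_mm": record["rain_mm"]}
print(weather)  # {'date': '2024-03-01', 'rain_mm': 2.3}
```

In a real project this extraction step would run on a schedule so that the external data stays in step with the internal sales data it is joined against.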
Getting the data could be a one-off process or a regular one. Depending on how often extraction is needed and how complicated it is, automated data extraction can be built in order to streamline this task.
At this point documentation of the data sources as well as their data will be at hand. There could be further work to be done in improving the quality of the data for analysis. These are decisions to be made based on the type of data and what to do to get the most accurate representation of reality.
Missing Values
These are holes in your data, and there are a few reasons why they tend to arise.
- Errors in data collection.
- Expected missing values.
- Unexpected missing values.
Either way, any statistical software has methods to deal with missing values.
- Delete records with missing values, if the data is evenly distributed and deleting the records won't have much of an impact on the analysis.
- Zero them out, if the rest of the properties of the record can benefit the analysis.
- Use averages of records with similar properties. This limits the impact on summary values and still lets you use the other properties of the records for analysis.
- Forward fill or back fill. Useful for time series data. Take the previous value or a rolling average of past values.
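The strategies above can be sketched with pandas on a toy data set; the regions and sales figures below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy sales records with gaps; values are invented for illustration.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "sales":  [100.0, np.nan, 80.0, np.nan],
})

# 1. Delete records with missing values.
dropped = df.dropna()

# 2. Zero them out, keeping the other properties of each record.
zeroed = df.fillna({"sales": 0})

# 3. Impute with the average of similar records (here, the same region).
imputed = df.copy()
imputed["sales"] = df.groupby("region")["sales"].transform(
    lambda s: s.fillna(s.mean())
)

# 4. Forward fill, useful for time series data.
filled = df.ffill()
```

Which option is right depends on the data; the point is that each is one line of code once the cause of the missing values is understood.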
Outliers and what to do with them
Real-world data is often messy and doesn't behave the way you would like it to. This sometimes reveals itself as outliers: values that are significantly different from what is expected.
Outliers can have an impact on statistical analysis as they change the summaries of the data. If outliers are part of the normal behaviour of the data, then they should be included in the analysis.
Outliers, like missing values, need to be considered on a case-by-case basis and treated accordingly.
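One common way to flag candidate outliers is the interquartile range rule of thumb, sketched below on invented monthly sales figures. This only flags values for review; whether they are dropped or kept is still the case-by-case decision described above.

```python
import pandas as pd

# Monthly sales with one suspicious spike; numbers are invented.
sales = pd.Series([100, 110, 95, 105, 980, 102, 98])

# Flag values outside 1.5 * IQR of the quartiles, a common rule of thumb.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [980]
```

The flagged value might be a data-entry error, or a genuine event like a one-off bulk order; only business knowledge can tell the two apart.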
Once the data is in a form that aids analysis, it is a good opportunity to get more insight into the data and even confirm the organisation's understanding of its data.
Categorical data is normally non-numeric data that can be divided into a fixed number of categories. It can give you insight into your population and demographics.
Statistical analysis can be run on numeric data to look for summary values and outliers.
Combination of the two
Categorical and summary data are useful on their own, but combined they are even more useful.
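A quick sketch of why the combination matters: grouping numeric summaries by a categorical column can reveal differences that a single overall summary hides. The segments and amounts below are invented for illustration.

```python
import pandas as pd

# Illustrative transactions: a categorical column plus a numeric one.
df = pd.DataFrame({
    "segment": ["Retail", "Retail", "Wholesale", "Wholesale", "Retail"],
    "amount":  [20.0, 35.0, 200.0, 180.0, 25.0],
})

# A single overall mean blends two very different populations;
# grouping by the categorical column separates them.
summary = df.groupby("segment")["amount"].agg(["count", "mean", "max"])
print(summary)
```

Here the overall average order value says little, while the per-segment summary immediately shows that retail and wholesale behave very differently.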
Data wrangling is transforming data into a structure that is more useful for analysis.
Most data resides in spreadsheets and relational databases, represented as records and properties in a two-dimensional sheet: one row per record and one column per property or value. Data wrangling changes the structure the data is in to get higher-level or summary information from the data set.
As an example, we can look at sales data and break it down into categories.
And further smooth the data to reveal trends hiding behind sales.
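As a sketch of both steps, the snippet below takes long-format sales records (invented for illustration), pivots them into one column per category, and smooths each column with a rolling mean to bring out the trend.

```python
import pandas as pd

# Long-format daily sales records; the numbers are invented for illustration.
df = pd.DataFrame({
    "day":      [1, 1, 2, 2, 3, 3, 4, 4],
    "category": ["food", "tech"] * 4,
    "sales":    [10, 50, 12, 55, 11, 90, 13, 52],
})

# Reshape: one row per day, one column per category.
wide = df.pivot(index="day", columns="category", values="sales")

# Smooth with a 2-day rolling mean to reveal the underlying trend.
smooth = wide.rolling(window=2).mean()
print(smooth)
```

The pivot gives the per-category breakdown, and the rolling mean damps day-to-day noise so that any trend behind the raw sales is easier to see.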
Building Forecasting Models
At this point you are in a strong position to build forecasting models for your data.
Forecasting models can be deterministic or stochastic, each method having its own advantages over the other.
These are covered in other blog posts.