
Introduction
A data consumer's focus is always on finding the right information to build a sound analysis from data gathered over time. Data exploration is the term used for finding insightful information in data. Finding those insights involves several steps, such as data munging, data analysis, data modeling, and model evaluation.
A real data exploration project commonly involves six steps. They are as follows:
- Asking the right questions: Asking the right questions helps in understanding the objective and the target information sought from the data. Typical questions are "What are my expected findings after the exploration is finished?" or "What kind of information can I extract through the exploration?"
- Data collection: Once the right questions have been asked, the target of the exploration is clear. Data may come from various sources such as files, databases, the internet, and so on, and it arrives in unorganized and diverse formats. Data collected in this way is raw data and needs to be processed to extract meaningful information. Most analysis and visualization tools or applications expect data in a certain format to generate results, so the raw data is of no use to them as collected.
- Data munging: The raw data collected needs to be converted into the format required by the tools to be used. In this phase, raw data is passed through various processes such as parsing, sorting, merging, filtering, dealing with missing values, and so on. The main aim is to transform the raw data into a format that the analysis and visualization tools understand. Once the data is compatible with the tools, they are used to generate the different results.
- Basic exploratory data analysis: Once the data munging is done and the data is formatted for the tools, it can be used to perform data exploration and analysis. Tools provide various methods and techniques for this. Most analysis tools allow statistical functions to be performed on the data, and visualization tools help in viewing the data in different ways. By applying basic statistical operations and visualizing the results, the data can be understood in a better way.
- Advanced exploratory data analysis: Once the basic analysis is done, it's time to move to an advanced stage of analysis. In this stage, various prediction models are built on the basis of the requirements. Machine learning algorithms are used to train the models and generate inferences. The models are also tuned to ensure their correctness and effectiveness.
- Model assessment: When the models are made, they are evaluated to find the best one among them. The major factor in deciding the best model is how closely it can predict the values. Models are also tuned at this stage to increase their accuracy and effectiveness. Various plots and graphs are used to inspect the models' predictions.
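The munging and basic analysis steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library. The `RAW_CSV` readings, the column names, and the helper functions `munge` and `summarize` are hypothetical and are not taken from the chapter's dataset or tools. It also assumes the simplest way of dealing with missing values, which is to drop the affected rows.

```python
import csv
import io
import statistics

# Hypothetical raw input: made-up readings, not the chapter's dataset.
RAW_CSV = """month,ozone,temp
5,41,67
5,NA,72
6,29,74
6,71,80
7,NA,81
7,97,85
"""


def munge(raw_text):
    """Parse CSV text, drop rows with missing ozone, convert types, sort by temp."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        # Dealing with missing values: here we simply skip the row.
        if record["ozone"] == "NA":
            continue
        rows.append({
            "month": int(record["month"]),
            "ozone": float(record["ozone"]),
            "temp": float(record["temp"]),
        })
    return sorted(rows, key=lambda row: row["temp"])


def summarize(rows, column):
    """Return basic descriptive statistics for one numeric column."""
    values = [row[column] for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


clean = munge(RAW_CSV)
ozone_summary = summarize(clean, "ozone")
print(ozone_summary)
```

Dropping incomplete rows is the simplest choice but discards information. Filling in an estimated value is a common alternative when the dataset is small.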
In this chapter, we will perform a sample data exploration on a dataset of air quality for different months. We will study the data, clean it through data munging, perform basic exploratory data analysis to discover important attributes that might help predict the temperature, and generate a prediction model.
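To make the idea of a prediction model and its assessment concrete, here is a minimal sketch in standard-library Python. It assumes a single predictor and an ordinary least-squares line, which is not necessarily the model built later in the chapter. The `ozone` and `temp` values are invented for illustration, and `fit_line` and `rmse` are hypothetical helper names.

```python
import math

# Hypothetical (ozone, temperature) readings, invented for illustration only.
ozone = [10.0, 20.0, 30.0, 40.0, 50.0]
temp = [62.0, 66.0, 71.0, 74.0, 80.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Root mean squared error: lower means predictions are closer to the data."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Model 1: a straight line relating ozone to temperature.
slope, intercept = fit_line(ozone, temp)
line_predictions = [slope * x + intercept for x in ozone]

# Model 2: a baseline that always predicts the mean temperature.
mean_temp = sum(temp) / len(temp)
baseline_predictions = [mean_temp] * len(temp)

# Model assessment: compare how closely each model predicts the values.
line_error = rmse(temp, line_predictions)
baseline_error = rmse(temp, baseline_predictions)
print(f"line RMSE: {line_error:.3f}, baseline RMSE: {baseline_error:.3f}")
```

Both models are scored here on the same data they were fitted to, which keeps the sketch short. A real assessment would normally hold out part of the data for evaluation.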