Python: Advanced Predictive Analytics

Use cases of the read_csv method

The read_csv method can be put to a variety of uses. Let us look at some such use cases.

Passing the directory address and filename as variables

It is often convenient to pass the directory address and the filename as variables rather than hard-coding the full address of the file. This matters most when the same full address is used many times in a script. Let us see how to do this while importing a dataset:

import pandas as pd
path = 'E:/Personal/Learning/Datasets/Book'
filename = 'titanic3.csv'
fullpath = path+'/'+filename
data = pd.read_csv(fullpath)

Alternatively, one can build the full path with the os.path.join function from the os module:

import pandas as pd
import os
path = 'E:/Personal/Learning/Datasets/Book'
filename = 'titanic3.csv'
fullpath = os.path.join(path,filename)
data = pd.read_csv(fullpath)

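One advantage of the latter method is that it inserts the correct path separator for the operating system and does not double the separator when the directory already ends with one. Note that it does not trim leading or trailing white space; strip the strings yourself if that is a concern.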
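
If the directory or filename may carry stray white space (for example, when they are read from a configuration file), stripping them explicitly before joining is a safe habit. The following is a minimal sketch; build_path is a hypothetical helper name and the directory and filename are placeholders, not files from the book:

```python
import os


def build_path(directory, filename):
    """Join a directory and a filename after stripping stray white space.

    build_path is a hypothetical helper, not part of os or pandas.
    """
    # os.path.join only takes care of separators, so strip white space explicitly
    return os.path.join(directory.strip(), filename.strip())


# Placeholder directory and filename, used for illustration only
fullpath = build_path(' Datasets/Book ', ' titanic3.csv ')
print(fullpath)
```

The resulting string can then be passed to pd.read_csv in the same way as fullpath in the snippets above.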

Reading a .txt dataset with a comma delimiter

Download the Customer Churn Model.txt dataset from the Google Drive folder and save it on your local drive. To read this dataset, the following code snippet will do:

import pandas as pd
data = pd.read_csv('E:/Personal/Learning/Datasets/Book/Customer Churn Model.txt')

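As you can see, although it is a text file, it can be read easily using the read_csv method without specifying any other argument. This works because read_csv uses a comma as its default delimiter and parses the contents of the file regardless of its extension.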
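
To see this behavior without downloading anything, here is a small self-contained sketch. It writes a few made-up comma-separated rows to a temporary .txt file and reads them back; the column names and values are invented for illustration and are not taken from the Customer Churn Model dataset:

```python
import os
import tempfile

import pandas as pd

# Made-up rows for illustration only
sample_text = "customer_id,plan,churned\n1,basic,False\n2,premium,True\n"

with tempfile.TemporaryDirectory() as tmpdir:
    txt_path = os.path.join(tmpdir, 'sample.txt')
    with open(txt_path, 'w') as handle:
        handle.write(sample_text)
    # No extra arguments are needed: a comma is the default delimiter
    sample = pd.read_csv(txt_path)

print(sample.columns.tolist())
print(sample.shape)
```

For a text file that uses some other delimiter, the sep argument of read_csv can be set accordingly.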

Specifying the column names of a dataset from a list

In the last segment, we read the Customer Churn Model.txt file with its default column names. But what if we want to rename some or all of the columns? Or what if the dataset has no column names at all and we want to assign them from a list (available, let's say, in a CSV file)?

Look for a CSV file called Customer Churn Columns.csv in the Google Drive folder and download it. I have used letters of the English alphabet as placeholders for the column names in this file. We shall use this file to create a list of column names to be passed on to the dataset. You can change the names in the CSV file, if you like, and see how they are incorporated as column names.

The following code snippet returns the column names of the dataset we just read:

import pandas as pd
data = pd.read_csv('E:/Personal/Learning/Datasets/Book/Customer Churn Model.txt')
data.columns.values

If you run it in one of the IDEs, you should get output like the following screenshot:

Fig. 2.2: The column names in the Customer Churn Model.txt dataset

This basically lists all the column names of the dataset. Let us now go ahead and change the column names to the names we have in the Customer Churn Columns.csv file.

data_columns = pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Columns.csv')
data_column_list = data_columns['Column_Names'].tolist()
data = pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt', header=None, names=data_column_list)
data.columns.values

The output after running this snippet should look like the following screenshot (if you haven't made any changes to the values in the Customer Churn Columns.csv file):

Fig. 2.3: The column names from the Customer Churn Columns.csv file, which have been passed to the data frame data

The key steps in this process are:

  • Sub-setting the particular column (the one containing the column names) and converting it to a list, which is done in the second line
  • Passing header=None and names=data_column_list (the list containing the column names) to the read_csv method
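
One detail is worth checking on your own copy of the data. With header=None, read_csv treats no line of the file as a header, so if the file already begins with a line of column names, that line is read in as the first data row. Passing header=0 together with names replaces the existing header instead. The sketch below illustrates the difference on made-up in-memory data (not the book's dataset); the names raw and new_names are assumptions used only for this example:

```python
import io

import pandas as pd

# Made-up data that starts with a header line
raw = "old_a,old_b\n1,2\n3,4\n"
new_names = ['A', 'B']

# header=None: no line is treated as a header,
# so 'old_a,old_b' ends up as the first data row
no_header = pd.read_csv(io.StringIO(raw), header=None, names=new_names)

# header=0: the first line is treated as the header and is replaced by new_names
replaced = pd.read_csv(io.StringIO(raw), header=0, names=new_names)

print(no_header.shape)   # one extra row: the old header line
print(replaced.shape)
```

Which of the two calls is appropriate depends on whether the file you are reading already contains a header line.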

If some of the terms, such as sub-setting, don't make sense now, just remember that sub-setting is the act of selecting particular rows or columns of a dataset. We will discuss this in detail in the next chapter.
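
As a brief preview, the following sketch shows a few common ways of sub-setting a data frame. The data frame frame and its values are made up for illustration:

```python
import pandas as pd

# Made-up data frame for illustration
frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# A single column (returned as a Series)
one_column = frame['A']

# Several columns (returned as a DataFrame)
two_columns = frame[['A', 'C']]

# The first two rows, selected by position
first_rows = frame.iloc[:2]

# Rows that meet a condition, restricted to selected columns
rows_and_cols = frame.loc[frame['B'] > 4, ['A', 'B']]

print(rows_and_cols)
```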