Facets Overview
Facets Overview provides a wide range of statistics for each feature of a dataset. Facets Overview will help you detect missing data, zero values, non-uniformity in data distributions, and more, as we will see in this section.
We will begin by creating feature statistics for the training and testing datasets.
Creating feature statistics for the datasets
Without Facets Overview or a similar tool, the only way to obtain these statistics would be to write our own programs or use spreadsheets. Writing our own functions can be time-consuming and costly. Facets provides the statistics with a few lines of code, which we will write now.
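To give a sense of the effort involved, here is a minimal sketch of hand-written statistics for a single numeric feature, using only the Python standard library. The function name summarize_feature and the sample values are hypothetical, and the sketch covers only a small fraction of what Facets Overview reports.

```python
import math
import statistics


def summarize_feature(values):
    """Return basic statistics for one numeric feature (hypothetical helper).

    Missing values are represented here as None or NaN.
    """
    present = [v for v in values
               if v is not None and not (isinstance(v, float) and math.isnan(v))]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "mean": statistics.mean(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Made-up sample values, for illustration only
sample = [0, 1, None, 3, 0, float("nan"), 4]
stats = summarize_feature(sample)
print(stats)
```

Even this small helper has to decide how missing values are represented, and it would still need to be extended to categorical features, histograms, and comparisons between datasets.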
Implementing the feature statistics code
In this section, we will build the statistics generator, serialize its output, and encode it as a string. As when working with JSON, we stringify the data before sending it to JavaScript functions.
First, we will import base64:
import base64
The base64 module encodes binary data as text using a Base64 alphabet. A Base64 alphabet uses 64 ASCII characters to represent the data.
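As a small illustration of what the module does, the sketch below encodes a few arbitrary bytes and decodes them back. The payload is made up for the example and is unrelated to the dataset.

```python
import base64

# Arbitrary binary payload, including bytes that are not printable text
payload = bytes([0, 255, 16, 128]) + b"train"

# b64encode returns bytes; decoding them yields an ordinary ASCII string
encoded = base64.b64encode(payload).decode("utf-8")
print(encoded)

# Decoding the string restores the original bytes exactly
restored = base64.b64decode(encoded)
print(restored == payload)
```

The encoded form is longer than the input, but it contains only characters that can be safely embedded in text such as HTML or JavaScript.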
We now import Facets' statistics generator and retrieve the data from the train and test DataFrames:
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([
    {'name': 'train', 'table': train_data},
    {'name': 'test', 'table': test_data},
])
The program serializes the statistics to bytes, encodes them in Base64, and decodes the result into a UTF-8 string that will be plugged into the HTML interface in the next section:
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
You can see that the output is an encoded string:
CqQ0CgV0cmFpbhC4ARqiBwoOY29sb3JlZF9zcHV0dW0QARqNBwqzAgi4ARgB...
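Because Base64 is reversible, the string can be decoded to check what it contains. The sketch below decodes only the first 44 characters of the output shown above (a multiple of four, so the chunk decodes cleanly) and looks for readable text among the raw bytes. Reading those fragments as the dataset name and a feature name is an assumption based on what is legible, not a description of the full serialization format.

```python
import base64

# First 44 characters of the encoded output shown above
prefix = "CqQ0CgV0cmFpbhC4ARqiBwoOY29sb3JlZF9zcHV0dW0Q"

raw = base64.b64decode(prefix)
print(raw)

# Readable fragments embedded in the serialized bytes
print(b"train" in raw)
print(b"colored_sputum" in raw)
```

The remaining bytes are binary fields of the serialized statistics and are not meant to be read directly.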
We will now plug protostr into an HTML template.
Implementing the HTML code to display feature statistics
The program first imports the display function and the HTML class:
# Display the Facets Overview visualization for this data
from IPython.display import display, HTML
Then the HTML template is defined:
HTML_TEMPLATE = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
<facets-overview id="elem"></facets-overview>
<script>
document.querySelector("#elem").protoInput = "{protostr}";
</script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
The protostr variable containing our stringified encoded data is now plugged into the template.
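The substitution relies on Python's str.format, which replaces the {protostr} placeholder with the value passed as a keyword argument. A minimal sketch with a shortened, hypothetical template and a dummy value:

```python
# Shortened stand-in for the real template (hypothetical, for illustration)
MINI_TEMPLATE = """
<facets-overview id="elem"></facets-overview>
<script>
document.querySelector("#elem").protoInput = "{protostr}";
</script>"""

dummy_protostr = "CqQ0CgV0"  # placeholder value, not real statistics

mini_html = MINI_TEMPLATE.format(protostr=dummy_protostr)
print(mini_html)
```

Since str.format treats every curly brace as part of a placeholder, any literal brace added to the template's JavaScript would need to be doubled ({{ and }}).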
Then, the resulting HTML string, named html, is sent to IPython's display function:
display(HTML(html))
We can now visualize and explore the data:
Figure 3.1: Tabular visualization of the numeric features
Once we obtain the output, we can analyze the features of the datasets from various perspectives.