Data review
When you have successfully loaded your data into Watson Analytics, you should review it and assess its quality.
The IBM Watson Analytics documentation describes data quality as:
Data quality assesses the degree to which a data set is suitable for analysis. A shorthand representation of this assessment is the data quality score. The score is measured on a scale of 0-100, with 100 representing the highest possible data quality.
Further:
The data quality score for a data set is computed by averaging the data quality score for every column in the data set. Several factors affect the data quality score for an individual field or column.
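The averaging step described above can be sketched in a few lines. This is a minimal illustration, not Watson Analytics code: the function name `dataset_quality_score` and the per-column scores are hypothetical, and the way Watson Analytics derives each column's score is not shown here.

```python
from statistics import mean


def dataset_quality_score(column_scores):
    """Average per-column scores (each 0-100) into one data set score."""
    if not column_scores:
        raise ValueError("need at least one column score")
    for name, score in column_scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"score for {name!r} is outside 0-100")
    return mean(column_scores.values())


# Hypothetical per-column scores, for illustration only.
scores = {"age": 90, "region": 60, "income": 75}
print(dataset_quality_score(scores))
```

Because the data set score is a plain average, one very poor column lowers the overall score only in proportion to the number of columns, so it is worth checking the individual column scores as well.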
The factors that can affect the data quality score include:
- Missing values: Records for which no data are entered.
- Constant values: Some fields have the same value recorded for every record.
- Imbalance: Occurs in a categorical field when records are not equally distributed across categories.
- Influential categories: Those categories that are significantly different from other categories.
- Outliers: Extreme values.
- Skewness: Measures how symmetrically the values of a continuous field are distributed. Skewed fields have lower data quality scores.
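To get a feel for these factors before uploading a file, you can run rough checks of your own. The sketch below uses only the Python standard library and common textbook definitions (Tukey's 1.5 × IQR rule for outliers, the moment coefficient for skewness). These are assumptions made for illustration. They are not the formulas Watson Analytics uses, and all function names are hypothetical. Influential categories are left out because measuring them requires a target field to compare against.

```python
from collections import Counter
from statistics import mean, quantiles


def present(values):
    """Drop missing entries (None or empty string)."""
    return [v for v in values if v is not None and v != ""]


def missing_fraction(values):
    """Share of records with no value entered."""
    return 1 - len(present(values)) / len(values)


def is_constant(values):
    """True when every non-missing record holds the same value."""
    return len(set(present(values))) <= 1


def category_shares(values):
    """Share of non-missing records in each category (for spotting imbalance)."""
    counts = Counter(present(values))
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below the first or above the third quartile."""
    data = present(values)
    q1, _, q3 = quantiles(data, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in data if v < low or v > high]


def skewness(values):
    """Moment coefficient of skewness; 0 for a perfectly symmetric field."""
    data = present(values)
    m = mean(data)
    m2 = mean([(v - m) ** 2 for v in data])
    m3 = mean([(v - m) ** 3 for v in data])
    if m2 == 0:
        return 0.0  # a constant field has no spread, so no skew
    return m3 / m2 ** 1.5


# Hypothetical columns, for illustration only.
income = [1, 2, 3, 4, 5, 6, 7, 8, 100]
region = ["north", "north", "north", "south", None]
print(missing_fraction(region), category_shares(region))
print(iqr_outliers(income), skewness(income) > 0)
```

A field that is mostly missing, constant, heavily imbalanced, or strongly skewed under checks like these is a likely candidate for a low column score, although the exact score can only be read from Watson Analytics itself.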