
Theory on Classification

In order to use the Naive Bayes algorithm to classify a dataset, the data must be linearly separable; that is, the classes within the data must be separable by linear class boundaries. The following figure illustrates this visually with three datasets and two class boundaries shown as dotted lines:

Naive Bayes assumes that the features (or dimensions) within a dataset are independent of one another; that is, they have no effect on each other. The following example considers the classification of e-mails as spam. Suppose that you have 100 e-mails that break down as follows:

60% of e-mails are spam
80% of spam e-mails contain the word buy
20% of spam e-mails don't contain the word buy
40% of e-mails are not spam
10% of non-spam e-mails contain the word buy
90% of non-spam e-mails don't contain the word buy

Let's convert this example into conditional probabilities so that a Naive Bayes classifier can use them:

P(Spam) = the probability that an e-mail is spam = 0.6
P(Not Spam) = the probability that an e-mail is not spam = 0.4
P(Buy|Spam) = the probability that an e-mail that is spam contains the word buy = 0.8
P(Buy|Not Spam) = the probability that an e-mail that is not spam contains the word buy = 0.1

What is the probability that an e-mail that contains the word buy is spam? This would be written as P(Spam|Buy). Bayes' theorem says that it is given by the following equation:

P(Spam|Buy) = ( P(Buy|Spam) * P(Spam) ) / ( P(Buy|Spam) * P(Spam) + P(Buy|Not Spam) * P(Not Spam) )

So, using the previous percentage figures, we get the following:

P(Spam|Buy) = ( 0.8 * 0.6 ) / ( ( 0.8 * 0.6 ) + ( 0.1 * 0.4 ) )
            = 0.48 / ( 0.48 + 0.04 )
            = 0.48 / 0.52
            = 0.923
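
To make the arithmetic concrete, the following is a minimal Scala sketch (plain Scala, no Spark required) that encodes the four probabilities from the example and applies Bayes' theorem; the object and variable names are illustrative choices, not part of any library:

object SpamPosterior {
  def main(args: Array[String]): Unit = {
    // Probabilities taken from the worked example above
    val pSpam            = 0.6 // P(Spam)
    val pNotSpam         = 0.4 // P(Not Spam)
    val pBuyGivenSpam    = 0.8 // P(Buy|Spam)
    val pBuyGivenNotSpam = 0.1 // P(Buy|Not Spam)

    // Bayes' theorem, exactly as in the equation above
    val pSpamGivenBuy =
      (pBuyGivenSpam * pSpam) /
        (pBuyGivenSpam * pSpam + pBuyGivenNotSpam * pNotSpam)

    println(f"P(Spam|Buy) = $pSpamGivenBuy%.3f") // prints P(Spam|Buy) = 0.923
  }
}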

This means that there is a 92 percent probability that an e-mail that contains the word buy is spam. That was a look at the theory; now it's time to try a real-world example using the Apache Spark MLlib Naive Bayes algorithm.
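
Before moving on to that example, here is a brief sketch of what training a Naive Bayes model can look like with the DataFrame-based API in Spark 2; the toy word-count vectors and the column names here are illustrative assumptions, not the book's dataset:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors

object NaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("NaiveBayesSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy feature vectors (for example, word counts per e-mail);
    // label 1.0 = spam, 0.0 = not spam
    val training = Seq(
      (1.0, Vectors.dense(2.0, 1.0, 0.0)),
      (1.0, Vectors.dense(3.0, 0.0, 1.0)),
      (0.0, Vectors.dense(0.0, 2.0, 3.0)),
      (0.0, Vectors.dense(0.0, 1.0, 2.0))
    ).toDF("label", "features")

    // Multinomial Naive Bayes models feature counts, which suits word counts
    val model = new NaiveBayes()
      .setModelType("multinomial")
      .fit(training)

    // Show the posterior class probabilities for each e-mail
    model.transform(training)
      .select("label", "probability", "prediction")
      .show(false)

    spark.stop()
  }
}

Note that multinomial Naive Bayes requires non-negative feature values, such as word counts or TF-IDF weights.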