What Makes Naive Bayes Classification So Naive? | How Does the Naive Bayes Classifier Work?


If you logically split "Naive Bayes Classification" into its parts, you end up with two terms that make more sense independently:
- Naive
- Bayes Theorem

So, if we go through these two terms separately and then combine the concepts into one, we can understand Naive Bayes Classification quite easily, can't we?

"Naive"?

- Inexperienced
- Just like a naive (inexperienced) little child who makes some assumptions that are not completely true.

So, what is this assumption all about?
- "The independence assumption of predictors"
- i.e. the presence of a particular feature in a class is unrelated to the presence of any other feature.

How does the Naive Bayes Classifier differ from other classifiers?
- A big difference!
- Unlike many other classifiers, which assume or look for some correlation among features, the Naive Bayes classifier completely abandons the concept of correlation.
- Seems illogical, doesn't it? That is exactly why we call it "Naive".


We are in a probabilistic world!


Let's give this "independence assumption" a mathematical shape: if two events A and B are independent, then P(A AND B) = P(A) x P(B)
e.g.

Assumption: If the guy's name is Vito Corleone AND he is from Sicily, then he is The Godfather.

Independence assumption: P(The guy is The Godfather) = P(The guy's name is Vito Corleone) x P(The guy is from Sicily)

Let's say you see a guy and you are 80% sure that his name is Vito Corleone AND 90% sure that he is from Sicily. So, with our independence assumption, what are the chances that he is The Godfather?

P(The guy's name is Vito Corleone) = 0.8,     P(The guy is from Sicily) = 0.9
so
P(The guy is The Godfather) = P(The guy's name is Vito Corleone) x P(The guy is from Sicily)

P(The guy is The Godfather) = 0.8 x 0.9 = 0.72

There is a 72% chance that he is The Godfather! Respect!
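To make the arithmetic concrete, here is the same calculation as a minimal Python sketch (the variable names are my own, chosen for illustration):

```python
# The naive independence assumption: the joint probability of independent
# events is simply the product of their individual probabilities.
p_name_is_vito = 0.8   # P(The guy's name is Vito Corleone)
p_from_sicily = 0.9    # P(The guy is from Sicily)

p_godfather = p_name_is_vito * p_from_sicily
print(f"P(The guy is The Godfather) = {p_godfather:.2f}")  # 0.72
```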


Bayes' Theorem?

- The probability of an event given that another event has already occurred.
- Example: the probability of Event-2 occurring given that Event-1 has already occurred, or in mathematical language: find P(Event-2 | Event-1).

Given information:
- P(Event-1) = 30%,     P(Event-2) = 20%
- P(Event-1 | Event-2), the probability of Event-1 given that Event-2 has already occurred, = 45%
so
P(Event-2 | Event-1) = P(Event-1 | Event-2) x P(Event-2) / P(Event-1)
P(Event-2 | Event-1) = 0.45 x 0.2 / 0.3 = 0.30
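The same arithmetic, wrapped in a small helper function (a sketch; the function and argument names are my own):

```python
def bayes(p_e1_given_e2, p_e2, p_e1):
    """Bayes' Theorem: P(E2 | E1) = P(E1 | E2) * P(E2) / P(E1)."""
    return p_e1_given_e2 * p_e2 / p_e1

# Plugging in the numbers above:
print(bayes(p_e1_given_e2=0.45, p_e2=0.2, p_e1=0.3))  # 0.30
```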


Time to combine the concepts of the naive independence assumption and Bayes' Theorem into one

- Let's forecast rain for tomorrow.
- Predictors / independent variables: Humidity (H), Atmospheric Pressure (AP)
- We want the probability that it will rain ( Rain = Yes ) if
      - Humidity (H) = High
      - Atmospheric Pressure (AP) = Low

According to Bayes' Theorem:

P(Rain = Yes | H = High, AP = Low) =

P(H = High, AP = Low | Rain = Yes) x P(Rain = Yes) / P(H = High AND AP = Low)

According to the independence assumption of predictors:

P(H = High AND AP = Low) can also be written as P(H = High) x P(AP = Low)
and
P(H = High, AP = Low | Rain = Yes) can also be written as P(H = High | Rain = Yes) x P(AP = Low | Rain = Yes)

After combining both concepts:

P(Rain = Yes | H = High, AP = Low) =

P(H = High | Rain = Yes) x P(AP = Low | Rain = Yes) x P(Rain = Yes) / ( P(H = High) x P(AP = Low) )

Similarly, we can calculate the probabilities for the other options, e.g. P(Rain = No | H = High, AP = Low), P(Rain = Yes | H = Low, AP = Low), P(Rain = Yes | H = Low, AP = High), and so on.

- The class with the maximum probability is the predicted class (a hand-rolled sketch of this procedure follows below).
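Here is a minimal hand-rolled sketch of that procedure in Python. The tiny training set is invented purely for illustration, and since the denominator is the same for every class, the code simply picks the class with the largest numerator:

```python
from collections import Counter

# Toy training data, invented for illustration:
# (Humidity, Atmospheric Pressure, Rain)
data = [
    ("High", "Low",  "Yes"),
    ("High", "Low",  "Yes"),
    ("High", "High", "No"),
    ("Low",  "High", "No"),
    ("Low",  "Low",  "Yes"),
    ("Low",  "High", "No"),
]

def predict(h, ap):
    class_counts = Counter(row[2] for row in data)
    n = len(data)
    scores = {}
    for c, count in class_counts.items():
        prior = count / n  # P(Rain = c)
        # P(H = h | Rain = c) and P(AP = ap | Rain = c), estimated by counting
        p_h = sum(1 for r in data if r[0] == h and r[2] == c) / count
        p_ap = sum(1 for r in data if r[1] == ap and r[2] == c) / count
        scores[c] = p_h * p_ap * prior  # numerator of Bayes' Theorem
    # The denominator P(H = h) x P(AP = ap) is identical for every class,
    # so the class with the largest numerator wins.
    return max(scores, key=scores.get)

print(predict("High", "Low"))  # -> "Yes"
```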



How does the Naive Bayes Classifier perform so well despite a wrong assumption?

- The independence assumption of predictors constrains the model significantly, and as a result it is far less prone to getting stuck in local minima.

- Since the predictors are independent of each other, the interactions among them are not modeled, so the classifier needs relatively little training data. As a result, it performs well even with small data sets and with missing data.

- Because it employs a very simple hypothesis function, it exhibits high bias but relatively low variance, which prevents it from overfitting its training data.

- It is relatively insensitive to irrelevant features.


When to use the Naive Bayes Classifier?

- It performs magnificently in multi-class prediction.

- It also works very well in text classification (see the spam-filtering sketch after this list), for example:

- Spam Filtering

- Sentiment Analysis

- Recommendation Systems: the Naive Bayes Classifier, combined with Collaborative Filtering, makes a great recommendation system.

- You may prefer it when you need short model training times. It's fast. So fast.

- Unlike neural networks and many other classifiers, it is not a black box. It is easy to understand and interpret.
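As a quick illustration of the text-classification point above, here is a minimal spam-filtering sketch using scikit-learn's MultinomialNB (the toy messages and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training messages, invented for illustration.
messages = [
    "win a free prize now",
    "limited offer, claim your free reward",
    "meeting rescheduled to Monday",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts pair naturally with Multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["claim your free prize"]))  # likely ['spam']
```

Training here is essentially a single counting pass over the data, which is why Naive Bayes is so fast compared to iteratively trained classifiers.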


Since Naive Bayes Classification is based on a probabilistic approach, the concept and its applications are never going to fade away.

Do you agree with me? I'd much appreciate it if you let me know your thoughts in the comments below. Please share the post with your friends as well!


