Text Classification using Naive Bayes Classifier

The main objective of this assignment is to classify a review as fresh or rotten by applying a naive Bayes classifier to the given data set. We first discuss the naive Bayes classifier.

Naive Bayes Classifier: It is based on Bayes' theorem, which is as follows.

P(A|B) = P(B|A) * P(A) / P(B)

In the above formula, P(A|B) is the conditional probability of A given B. We will see how to use the classifier with an example from the data set.

Data Set: We have the Rotten Tomatoes review data set. There are 480000 rows and 2 columns, "Freshness" and "Review". The "Review" column holds the review text, and the "Freshness" column holds one of only two values, fresh or rotten, depending on the review. The data set is as follows.

[Figure: sample rows of the data set]

Now let us see how to use the naive Bayes classifier on a review. Consider the review "Long, unfocused film". Our task is to classify it as rotten or fresh using Bayes' theorem.

We have to find P(fresh | "Long, unfocused film") and P(rotten | "Long, unfocused film"). Whichever probability is higher determines the class of the sentence.

The mathematical approach is as follows.

[Figure: worked calculation for the example review]
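The calculation can be written out as a sketch of the standard naive Bayes decomposition. It assumes the review is tokenized into the three words long, unfocused, and film, and that words are conditionally independent given the class. The symbols c and w_i are introduced here for illustration, and each P(w | c) is taken to be the fraction of class-c training reviews that contain w, which is how the code below estimates it.

```latex
\begin{align*}
P(c \mid w_1, w_2, w_3)
  &= \frac{P(c)\,\prod_{i=1}^{3} P(w_i \mid c)}{P(w_1, w_2, w_3)},
  \qquad c \in \{\text{fresh}, \text{rotten}\} \\[4pt]
P(\text{fresh} \mid \text{review})
  &\propto P(\text{fresh})\,P(\text{long} \mid \text{fresh})\,
     P(\text{unfocused} \mid \text{fresh})\,P(\text{film} \mid \text{fresh}) \\
P(\text{rotten} \mid \text{review})
  &\propto P(\text{rotten})\,P(\text{long} \mid \text{rotten})\,
     P(\text{unfocused} \mid \text{rotten})\,P(\text{film} \mid \text{rotten})
\end{align*}
```

Because the denominator is the same for both classes, comparing the two right-hand sides is enough to pick the class.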

CODE:

1. The first step is to divide the data set into train, test, and development sets. I decided to split it into 60% train, 20% test, and 20% development.

from sklearn.model_selection import train_test_split

# 60% of the rows go to train; the remaining 40% is kept in rem
train_size = 0.6
train, rem = train_test_split(df, train_size=train_size)

# rem is split in half, giving 20% development and 20% test
test_size = 0.5
dev, test = train_test_split(rem, test_size=test_size)
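To make the arithmetic of the two-stage split concrete (60% first, then half of the remaining 40%), here is a library-free sketch. The helper `three_way_split` and its `seed` parameter are hypothetical and are not part of the original code, which uses `train_test_split`.

```python
import random

def three_way_split(rows, train_frac=0.6, seed=0):
    """Shuffle rows and cut them into train/dev/test (hypothetical helper)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)      # fixed seed makes the split repeatable
    n_train = round(len(rows) * train_frac)
    rest = rows[n_train:]                  # the remaining 40%
    n_dev = len(rest) // 2                 # half of the remainder
    return rows[:n_train], rest[:n_dev], rest[n_dev:]

train_rows, dev_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(dev_rows), len(test_rows))  # 60 20 20
```

Fixing the seed is a design choice for reproducibility; the original split is random on every run.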

2. Next, I stored all the reviews of the training data in a list.

col_list = train['Review'].tolist()

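3. Now, I stored every word in the above list, together with its frequency (the number of reviews containing the word), in a dictionary.

import re

word_count = {}
for row in col_list:
    # set() keeps each word once per review, so the counts are document frequencies.
    # Note: inside the character class, ':-=' is a range covering ':', ';', '<' and '=';
    # a hyphen is therefore not treated as a delimiter.
    words = list(set(re.split("[!. _,@?:-= ]", row.lower())))
    words = [i for i in words if i != '']
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1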
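The behavior of the delimiter pattern can be checked on a small string. The sketch below compares the pattern used above with a hypothetical variant, `ESCAPED`, in which the hyphen is escaped so that it is a literal delimiter. The variant is an assumption about what may have been intended and is not used anywhere in this post.

```python
import re

ORIGINAL = "[!. _,@?:-= ]"    # ':-=' is a range: ':', ';', '<', '='
ESCAPED = r"[!. _,@?:\-=]"    # hypothetical variant with a literal hyphen

def tokenize(text, pattern):
    """Lowercase, split on the delimiter class, and drop empty strings."""
    return [w for w in re.split(pattern, text.lower()) if w != ""]

sample = "Long, un-focused film; too slow!"
tokens_original = tokenize(sample, ORIGINAL)
tokens_escaped = tokenize(sample, ESCAPED)
print(tokens_original)  # ['long', 'un-focused', 'film', 'too', 'slow']
print(tokens_escaped)   # ['long', 'un', 'focused', 'film;', 'too', 'slow']
```

The results reported in this post were obtained with the original pattern, so switching patterns would change the word counts.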

4. Now, I stored all the fresh reviews of the training data in one list and all the rotten reviews in another list.

freshData = []
rottenData = []
for i, row in train.iterrows():
    if row['Freshness'] == "fresh":
        freshData.append(row['Review'])
    if row['Freshness'] == "rotten":
        rottenData.append(row['Review'])

# Lowercase every review so that counting is case-insensitive
freshData = [x.lower() for x in freshData]
rottenData = [x.lower() for x in rottenData]

5. Next, I stored the words of the fresh list and their frequencies in one dictionary, and the words of the rotten list and their frequencies in another dictionary.

# fresh_count[word] = number of fresh reviews containing the word
fresh_count = {}
for row in freshData:
    words = list(set(re.split("[!. _,@?:-=]", row.lower())))
    words = [i for i in words if i != '']
    for word in words:
        if word in fresh_count:
            fresh_count[word] += 1
        else:
            fresh_count[word] = 1

# rotten_count[word] = number of rotten reviews containing the word
rotten_count = {}
for row in rottenData:
    words = list(set(re.split("[!. _,@?:-=]", row.lower())))
    words = [i for i in words if i != '']
    for word in words:
        if word in rotten_count:
            rotten_count[word] += 1
        else:
            rotten_count[word] = 1
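The two counting loops follow the same pattern, so they could be folded into one function. The following is a minimal sketch using `collections.Counter`; the helper `document_frequency` and the toy reviews are illustrative and not taken from the original code.

```python
import re
from collections import Counter

def document_frequency(reviews):
    """Count, for each word, how many reviews contain it at least once."""
    counts = Counter()
    for review in reviews:
        words = set(re.split("[!. _,@?:-= ]", review.lower()))
        words.discard("")          # drop the empty strings left by the split
        counts.update(words)       # each word adds 1 per review
    return counts

toy_fresh = ["A great film", "Great story, great cast"]
toy_counts = document_frequency(toy_fresh)
print(toy_counts["great"])   # 2: counted once per review
print(toy_counts["boring"])  # 0: a Counter returns 0 for missing words
```

With this helper, the two dictionaries would be built as `document_frequency(freshData)` and `document_frequency(rottenData)`.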

6. Now, I counted the total number of rows in the training data, the number of fresh rows, and the number of rotten rows.

total_count = train['Review'].count()
count_rotten = len(rottenData)
count_fresh = len(freshData)

7. Now, I computed the prior probability of fresh and the prior probability of rotten.

prob_rotten = count_rotten / total_count
prob_fresh = count_fresh / total_count
print(prob_rotten, prob_fresh)

8. Now, I defined a function that predicts the class of a review passed as a parameter, using the Bayes theorem formula mentioned above.

def predict_class(review):
    a = re.split("[!. _,@?:-= ]", review.lower())
    a = [i for i in a if i != '']
    ans_rotten = 1
    ans_fresh = 1
    for ele in a:
        # P(word | class) is the fraction of reviews in the class that contain the word
        ans_rotten = ans_rotten * rotten_count.get(ele, 0) / count_rotten
        ans_fresh = ans_fresh * fresh_count.get(ele, 0) / count_fresh
    # Normalizing constant; it is 0 whenever both products are 0,
    # for example when a word is unseen in both classes
    evidence = ans_fresh * prob_fresh + ans_rotten * prob_rotten
    fresh_prob = (ans_fresh * prob_fresh) / evidence
    rotten_prob = (ans_rotten * prob_rotten) / evidence
    if fresh_prob > rotten_prob:
        return "fresh"
    else:
        return "rotten"
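A small numeric example shows how this estimate behaves when a word never occurs in one class. The counts below are invented for illustration and are not taken from the data set; `likelihood` is a hypothetical helper that computes the same product as the loop in `predict_class`.

```python
# Toy document frequencies, invented for illustration only.
fresh_df = {"long": 3, "film": 40}
rotten_df = {"long": 8, "unfocused": 2, "film": 35}
n_fresh, n_rotten = 50, 50

def likelihood(words, df, n_docs):
    """Product of the per-word estimates P(word | class)."""
    result = 1.0
    for w in words:
        result *= df.get(w, 0) / n_docs
    return result

words = ["long", "unfocused", "film"]
p_fresh = likelihood(words, fresh_df, n_fresh)
p_rotten = likelihood(words, rotten_df, n_rotten)
print(p_fresh)   # 0.0, because "unfocused" never appears in a fresh review
print(p_rotten)  # about 0.00448 = (8/50) * (2/50) * (35/50)
```

A single unseen word drives the whole product for that class to zero, regardless of the other words in the review.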

9. Now, I iterated through the entire development data, ran the above function on each review, and stored the results in a new list.

pred = []
for i, row in dev.iterrows():
    pred.append(predict_class(row['Review']))

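10. Now, I calculated the accuracy by comparing the predictions with the true labels of the development data.

# True labels of the development data, in the same order as pred
# (d is assumed to be defined this way)
d = dev['Freshness'].tolist()

total = len(pred)
count = 0
for i in range(total):
    if pred[i] == d[i]:
        count = count + 1
accuracy = count / total
print(f"Accuracy: {accuracy:.2f}")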
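The same accuracy calculation can be written as a small reusable function. This is a sketch; the name `accuracy_score_simple` and the toy labels are illustrative.

```python
def accuracy_score_simple(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(predicted)

toy_pred = ["fresh", "rotten", "fresh", "fresh"]
toy_true = ["fresh", "rotten", "rotten", "fresh"]
toy_accuracy = accuracy_score_simple(toy_pred, toy_true)
print(f"Accuracy: {toy_accuracy:.2f}")  # Accuracy: 0.75
```

The length check guards against comparing predictions with labels from a different split.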

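11. I got an accuracy of 0.76. Next, I applied smoothing. The formula is:

P(word | class) = (count(word, class) + alpha) / (N_class + alpha * K)

Here count(word, class) is the number of reviews in the class that contain the word, N_class is the number of reviews in the class, alpha is the smoothing parameter, and K is set to 2 in the code below.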
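To see what smoothing does numerically, the sketch below evaluates the smoothed estimate for a word with zero occurrences. It assumes the formula has the form used in the code of step 12, (count + alpha) / (N + alpha * K) with K = 2; the helper `smoothed_prob` and the numbers are hypothetical.

```python
def smoothed_prob(count, n_class, alpha, k=2):
    """Smoothed estimate (count + alpha) / (n_class + alpha * k)."""
    return (count + alpha) / (n_class + alpha * k)

# Hypothetical case: a word seen in 0 of 50 reviews of a class.
for alpha in (1, 100, 0.01, 0.1):
    print(alpha, smoothed_prob(0, 50, alpha))
```

Every value is strictly positive, so an unseen word no longer zeroes the product. Larger alpha values pull all estimates toward 1/K, which reduces the influence of the observed counts.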

12. I tried alpha values of 1, 100, 0.01, and 0.1 on the development data and got accuracies of 0.81, 0.77, 0.80, and 0.81, respectively. The best accuracy came from alpha values 1 and 0.1. The updated code after smoothing is:

def predict_class_smooth_newer(review, alpha=0.1):
    a = re.split("[!. _,@?:-= ]", review.lower())
    a = [i for i in a if i != '']
    ans_rotten = 1
    ans_fresh = 1
    for ele in a:
        # A word unseen in a class has count 0, so its probability becomes
        # alpha / (N + 2 * alpha) instead of 0
        ans_rotten *= (rotten_count.get(ele, 0) + alpha) / (count_rotten + alpha * 2)
        ans_fresh *= (fresh_count.get(ele, 0) + alpha) / (count_fresh + alpha * 2)
    evidence = ans_fresh * prob_fresh + ans_rotten * prob_rotten
    fresh_prob = (ans_fresh * prob_fresh) / evidence
    rotten_prob = (ans_rotten * prob_rotten) / evidence
    if fresh_prob > rotten_prob:
        return "fresh"
    else:
        return "rotten"
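Multiplying many small probabilities can underflow to zero for long reviews. A common workaround, which is not what the code above does, is to add log-probabilities instead. The sketch below shows that variant with alpha as a parameter, so several values can be tried in a loop. The function `predict_log` and the toy statistics are hypothetical, and ties are broken by dictionary order rather than in favor of "rotten".

```python
import math
import re

def tokenize(review):
    """Same splitting rule as the code above."""
    return [w for w in re.split("[!. _,@?:-= ]", review.lower()) if w != ""]

def predict_log(review, counts, n_docs, priors, alpha=0.1):
    """Return the class with the highest smoothed log-posterior score."""
    best_label, best_score = None, -math.inf
    for label in counts:
        score = math.log(priors[label])
        for w in tokenize(review):
            num = counts[label].get(w, 0) + alpha
            score += math.log(num / (n_docs[label] + alpha * 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy statistics, invented for illustration only.
counts = {"fresh": {"great": 30, "film": 40, "long": 3},
          "rotten": {"long": 8, "unfocused": 2, "film": 35}}
n_docs = {"fresh": 50, "rotten": 50}
priors = {"fresh": 0.5, "rotten": 0.5}

for alpha in (1, 100, 0.01, 0.1):
    print(alpha, predict_log("Long, unfocused film", counts, n_docs, priors, alpha))
```

Since the logarithm is monotonic, the class with the larger log score is also the class with the larger product, as long as the product itself has not underflowed.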

13. Next, I found the top ten words in each class, fresh and rotten.

import operator

# Sort the fresh dictionary by count, highest first
sorted_d = dict(sorted(fresh_count.items(), key=operator.itemgetter(1), reverse=True))
print('Dictionary in descending order by value : ', sorted_d)
first_ten = list(sorted_d.items())[:10]
print(first_ten)

# Sort the rotten dictionary by count, highest first
sorted_r = dict(sorted(rotten_count.items(), key=operator.itemgetter(1), reverse=True))
print('Dictionary in descending order by value : ', sorted_r)
first_ten = list(sorted_r.items())[:10]
print(first_ten)

14. The top ten words in fresh class are:

[('the', 91524), ('a', 81869), ('and', 78512), ('of', 70965), ('is', 50864), ('to', 49819), ('in', 40605), ('that', 33157), ('it', 30177), ('with', 26787)]

15. The top ten words in rotten class are:

[('the', 91578), ('a', 74986), ('of', 64886), ('and', 64094), ('to', 55938), ('is', 45947), ('in', 37449), ('that', 33169), ('it', 31956), ('but', 28438)]

16. The top ten words in each class are stop words, so I removed them.

stop_words=['ourselves',"it's",..]

for key in stop_words:
    # pop() with a default avoids a KeyError when a stop word is not in the dictionary
    sorted_d.pop(key, None)
first_ten = list(sorted_d.items())[:10]
print(first_ten)

for key in stop_words:
    sorted_r.pop(key, None)
first_ten = list(sorted_r.items())[:10]
print(first_ten)
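Sorting and stop-word removal can also be combined into one step that leaves the dictionaries unchanged. The following is a sketch using `heapq.nlargest`; the helper `top_words` and the toy data are illustrative, and the short stop-word list is a placeholder rather than the list used above.

```python
import heapq

def top_words(counts, stop_words, n=10):
    """Return the n most frequent (word, count) pairs that are not stop words."""
    blocked = set(stop_words)   # set lookup is fast for long stop-word lists
    kept = ((w, c) for w, c in counts.items() if w not in blocked)
    return heapq.nlargest(n, kept, key=lambda item: item[1])

toy_counts = {"the": 90, "film": 22, "a": 80, "movie": 14, "best": 5}
toy_stop_words = ["the", "a", "ourselves"]  # placeholder list
toy_top = top_words(toy_counts, toy_stop_words, n=2)
print(toy_top)  # [('film', 22), ('movie', 14)]
```

Stop words that are missing from the dictionary, such as "ourselves" here, are simply ignored.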

17. Now the top ten words in the fresh class are:

[('film', 22423), ('movie', 14005), ('one', 11568), ('like', 8305), ('story', 8014), ('best', 5630), ('even', 5602), ('good', 5476), ('much', 4869), ('time', 4665)]

18. Now the top ten words in the rotten class are:

[('film', 18477), ('movie', 18246), ('like', 11857), ('one', 9650), ('much', 7923), ('story', 7102), ('even', 6455), ("doesn't", 5308), ('good', 5063), ('make', 4978)]

19. Finally, I evaluated the two best alpha values, 1 and 0.1, on the test data.

When alpha is 1, the accuracy is 0.81.

When alpha is 0.1, the accuracy is 0.81.

Contribution:

1. I separated the reviews by freshness, calculated the frequency of each word, and stored the results in two separate dictionaries.

2. I wrote a function that classifies a given review by applying the Bayes theorem formula.

3. I calculated the accuracy on the development and test data sets.

4. I applied smoothing with various alpha values.

5. I removed stop words from the dictionaries to find the top ten words in each class.

Technical Challenges and Solutions:

1. Initially, I found it very difficult to divide a sentence into words using multiple delimiters. I then used regular expressions, which made the job easy.

2. When I first ran the function to classify reviews, I got a divide-by-zero error. Despite the error, the execution completed, and I was able to calculate the accuracy. I then used smoothing to remove the zero probabilities.

3. When I tried to find the top ten words in each class, the results were all common words. I then removed those words from the dictionaries and calculated the top ten words again.

References:

[1] How to split data into three sets (train, validation, and test) And why? | by Samarth Agrawal | Towards Data Science. In step 1, I used the code from this link to divide the data set into three parts.

[2] Python Regular Expression Tutorial with RE Library Examples | DataCamp. In step 3, I used the code from this link to write the regular expression for splitting a string.

[3] Laplace smoothing in Naive Bayes algorithm | by Vaibhav Jayaswal | Towards Data Science. In step 11, I used the concept of Laplace smoothing from this link.

[4] Python: Sort (ascending and descending) a dictionary by value - w3resource. In step 13, I used the code from this link to sort the dictionary.

[5] Removing stop words with NLTK in Python - GeeksforGeeks. In step 16, I used the stop words from this link.