Humor Detection in Sentences using Binary Classification

 

Introduction

In this blog post, we explore the implementation of a binary classification model for humor detection using a neural network. In today's world, where humor plays a significant role in entertainment, communication, and even marketing, the ability to detect and understand humor in text has become increasingly important. Humor detection in sentences is an interesting problem that can be tackled with machine learning techniques, which is why we chose this topic. By accurately detecting humor in sentences, we can automate processes that require humor recognition, such as content moderation, sentiment analysis, and personalization. This automation can save time and resources while providing more engaging and entertaining experiences for users. We used a dataset of 200,000 samples obtained from Kaggle. The model is trained on sentences labeled with humor tags, and its performance is evaluated on a held-out test set.

The objective of this project is to develop a binary classification model that can effectively detect humor in sentences. By training this model on a large dataset of labeled humorous and non-humorous sentences, we aim to create a robust and reliable model for identifying humor in text. The resulting model can be integrated into various applications to enhance their capabilities and provide users with a more enjoyable and engaging experience. In the sections that follow, we discuss the problem to be addressed and its solution, data preparation, data pre-processing, model design, hyperparameter selection, the training process, hyperparameter optimization, the final optimized model, test results, and a discussion of the findings.

Problem to be addressed and Solution

The problem we aim to address is the lack of an automated and accurate method for detecting humor in sentences. While humans can effortlessly recognize humor through contextual cues, wordplay, and linguistic patterns, developing a computational model that can mimic this ability poses several challenges. Existing approaches often fall short in accurately identifying humor, leading to inaccurate classifications and missed opportunities for utilizing humor in various applications. Our goal is to overcome these limitations and build a robust classification model that can reliably detect humor in a wide range of sentence structures and contexts.

To address the challenge of detecting humor in sentences, we propose developing a binary classification model. By training the model on a diverse dataset of labeled humorous and non-humorous sentences, we can enable it to learn the patterns and features indicative of humor. Leveraging techniques such as feature extraction, text preprocessing, and model optimization, we will fine-tune the model's ability to accurately classify sentences into humorous or non-humorous categories. The resulting model can be utilized in various applications, including social media content moderation, chatbots, sentiment analysis, and recommendation systems, to enhance user experiences and engagement.

Data Preparation

The first step in our implementation is to load the dataset, which we obtained from Kaggle. We use the pandas library to read the data from a CSV file. The dataset contains two columns: 'text', which holds the sentences, and 'humor', which holds the humor tags. We split the dataset into sentences and tags and store them in separate variables. The dataset file is stored in our Google Drive as 'Humor1.csv'.

Here is the code we used for importing the libraries and loading the data:

import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalMaxPooling1D

from google.colab import drive
drive.mount('/gdrive')

df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Humor1.csv')

Here are the necessary libraries we used:

  • pandas: data manipulation and analysis
  • NumPy: numerical computations
  • Tokenizer from tensorflow.keras.preprocessing.text and pad_sequences from tensorflow.keras.preprocessing.sequence: tokenizing and padding the input text data
  • train_test_split from sklearn.model_selection: splitting the dataset into training and testing sets
  • Sequential from tensorflow.keras.models and Embedding, Dense, and GlobalMaxPooling1D from tensorflow.keras.layers: constructing the neural network model
  • drive and drive.mount from google.colab: mounting Google Drive

The “drive.mount(‘/gdrive’)” command mounts Google Drive so that the dataset stored there can be accessed. When loading the dataset, the line “df = pd.read_csv(‘/gdrive/My Drive/Colab Notebooks/Humor1.csv’)” reads the CSV file into a pandas DataFrame named “df”. The path is specified as “/gdrive/My Drive/Colab Notebooks/Humor1.csv”, assuming the dataset is stored at that location on the mounted Google Drive.

Data Pre-processing

To prepare the text data for training, we perform text cleaning and tokenization. We use the Tokenizer class from the Keras library to convert the sentences into sequences of integers. The tokenizer is fitted on the sentences, and then the sentences are converted into sequences using the tokenizer. We also apply padding to make all sequences of equal length, using the pad_sequences function from Keras. This ensures that all input sequences have the same shape, which is required by the neural network model.

We used the following code for data pre-processing [7]:

# Preprocess the data
from sklearn.preprocessing import LabelEncoder

The code above begins by importing the necessary libraries; in this case, the “LabelEncoder” class from the “sklearn.preprocessing” module. It is used to encode the target variable into numerical values.

#create instance of label encoder
lab = LabelEncoder()

This code creates an instance of the “LabelEncoder” called “lab”. The “LabelEncoder” is used to convert categorical labels into numerical values.

#perform label encoding on 'humor' column
df['humor'] = lab.fit_transform(df['humor'])

sentences = df['text'].values
tags = df['humor'].values

Here, the “fit_transform()” method is applied to the “humor” column in the DataFrame “df”. It encodes the labels in the “humor” column into numerical values and assigns them back to the “humor” column in “df”. The code extracts the sentences from the “text” column of the DataFrame “df” and assigns them to the variable “sentences”. It also extracts the encoded “humor” labels from the DataFrame “df” and assigns them to the variable “tags”.

# Text cleaning and tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
tokenized_sentences = tokenizer.texts_to_sequences(sentences)

This code initializes a “Tokenizer” object called “tokenizer”. The “Tokenizer” class is used to tokenize text data, converting sentences into sequences of integers. The “fit_on_texts()” method is applied to the “sentences” variable, which fits the tokenizer on the text data. The “texts_to_sequences()” method is then used to convert the tokenized sentences into sequences of integers. These tokenized sequences are assigned to the variable “tokenized_sentences”.
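As a minimal plain-Python sketch of what the tokenizer does (illustrative only; Keras's Tokenizer additionally lowercases, strips punctuation, and orders indices by word frequency), each unique word is assigned an integer index, and each sentence becomes a list of those indices:

```python
# Plain-Python sketch of word-level tokenization (illustrative only,
# not the Keras implementation).
def build_word_index(sentences):
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1  # indices start at 1
    return word_index

def texts_to_sequences(sentences, word_index):
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

sentences = ["why did the chicken cross the road", "the weather is nice today"]
word_index = build_word_index(sentences)
sequences = texts_to_sequences(sentences, word_index)
print(sequences)  # [[1, 2, 3, 4, 5, 3, 6], [3, 7, 8, 9, 10]]
```

Note how repeated words ("the") map to the same index, so the model can learn per-word representations.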

# Padding
max_length = 100  # Adjust this value based on your data
padded_sentences = pad_sequences(tokenized_sentences, maxlen=max_length)

Here, the code specifies the maximum length of the padded sequences to be 100, which can be adjusted based on the specific data. The “pad_sequences()” function is applied to the “tokenized_sentences” variable, padding or truncating the sequences to the specified maximum length. The padded sequences are stored in the variable “padded_sentences”.
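The effect of padding can be sketched in plain Python; by default Keras pre-pads with zeros and truncates from the front, and this toy version (not the Keras implementation) does the same:

```python
# Plain-Python sketch of Keras-style pre-padding / pre-truncation.
def pad(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        if len(seq) >= maxlen:
            padded.append(seq[-maxlen:])  # too long: keep only the last maxlen tokens
        else:
            padded.append([value] * (maxlen - len(seq)) + seq)  # too short: left-pad with zeros
    return padded

print(pad([[5, 8, 2], [7, 1, 9, 4, 6, 3]], maxlen=4))
# [[0, 5, 8, 2], [9, 4, 6, 3]]
```

After this step every row has the same length, which is what the embedding layer's fixed input shape requires.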

# Prepare the labels
labels = np.array(tags)

This code converts the “tags” array into a NumPy array using “np.array()”, which prepares the target labels for the classification model and assigns them to the variable “labels”.  

Model Design

The neural network architecture is designed to understand the patterns and connections between input features and the target variable to effectively predict humor in sentences. The selection of the number of layers, neurons, and activation functions depends on the problem's complexity and the dataset size.

Here is the code we used to build the model:

# Build the model
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 100

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
model.add(GlobalMaxPooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='SGD', metrics=['accuracy'])

# Train the model (partial_x_train, partial_y_train, x_val, and y_val
# are created by the split shown in the Training section below)
history = model.fit(partial_x_train, partial_y_train, batch_size=1500, epochs=10, validation_data=(x_val, y_val))

For our humor detection model, we use a Sequential model from the Keras library. The model consists of an embedding layer, a global max pooling layer, and two dense layers: a hidden layer with a ReLU activation function (32 nodes) and an output layer with a sigmoid activation function (1 node). The embedding layer learns dense vector representations of the words in the input sequences, and the global max pooling layer extracts the most important features from the embeddings. Dense layers are fully connected layers that perform non-linear transformations, and the final dense layer outputs the probability that the input sentence is humorous. We use the ReLU activation function to mitigate the vanishing gradient problem, and in the last layer we use the sigmoid activation function because it is well suited to estimating probabilities in binary classification: it squashes the output between 0 and 1.
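The squashing behaviour of the sigmoid can be checked with a few lines of plain Python:

```python
import math

def sigmoid(x):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))   # 0.5 -- the decision boundary
print(sigmoid(4))   # ~0.982 -- a confident "humor" prediction
print(sigmoid(-4))  # ~0.018 -- a confident "not humor" prediction
```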

The code compiles the model with binary cross-entropy loss and the SGD optimizer. We chose the binary_crossentropy loss function because it is specifically designed for binary classification tasks, where the goal is to predict one of two classes (in our case, humor or non-humor). We used a dataset of 200,000 samples for training and testing; for a large-scale dataset like this, SGD performs well and is computationally efficient. The model is trained for 10 epochs, indicating the number of times the entire dataset is passed through the network, with a batch size of 1500, meaning the network processes 1500 samples at a time before updating its weights. The validation data is used to evaluate the model's performance during training.
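Under these settings, the number of weight updates per epoch follows directly from the batch size; the figure of 125,000 training samples below assumes the split described in the Training section (150,000 training samples minus a 25,000-sample validation set):

```python
import math

train_samples = 125_000  # training subset left after the validation split (assumption)
batch_size = 1_500
epochs = 10

steps_per_epoch = math.ceil(train_samples / batch_size)
print(steps_per_epoch)           # 84 weight updates per epoch
print(steps_per_epoch * epochs)  # 840 updates over the whole training run
```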

The following figure shows an overview of the structure of the model we built:


Hyperparameter Selection

Here are the hyperparameters used in our code:

  • max_length: The maximum length of the padded sentences, used for padding and truncating sentences to a fixed length. We set the maximum length to 100.
  • embedding_dim: The dimensionality of the word embedding vectors. We set it to 100.
  • batch_size: The number of samples per gradient update during training. Larger batch sizes can lead to faster training. We used a batch size of 1500.
  • epochs: The number of times the model iterates over the entire training dataset; one epoch is a complete pass through the dataset. We used 10 epochs.
  • optimizer: Determines the optimization algorithm used to update the model's weights during training. We used the SGD (stochastic gradient descent) optimizer.
  • loss: We used the binary_crossentropy loss function to measure the difference between the predicted and actual labels.
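Binary cross-entropy itself is simple enough to compute by hand: for a true label y and predicted probability p, the per-sample loss is -(y*log(p) + (1-y)*log(1-p)). A small plain-Python check:

```python
import math

def binary_crossentropy(y_true, p_pred):
    # Per-sample loss: small when the predicted probability agrees
    # with the true label, large when it contradicts it.
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

print(binary_crossentropy(1, 0.9))  # ~0.105 -- confident and correct
print(binary_crossentropy(1, 0.1))  # ~2.303 -- confident and wrong
```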

These hyperparameters were adjusted for our problem and dataset to potentially improve the model's performance.

Overview of the Implementation Platform

The implementation is done in the Python programming language. We use libraries such as pandas, NumPy, TensorFlow, and scikit-learn for data handling, model creation, and evaluation. The code is executed in an environment that supports these libraries, such as Jupyter Notebook or Google Colab. Overall, the implementation platform leverages tensorflow.keras and the other libraries to preprocess the data, build a binary classification model for humor detection, train the model, and evaluate its performance.

Training

To train the model, we split the pre-processed data into training and testing sets using the train_test_split function from scikit-learn, setting aside 25% of the data for testing. We further split the training set into training and validation sets to monitor the model's performance during training. The model is compiled with the binary cross-entropy loss function and the SGD optimizer. We then fit the model to the training data, specifying the batch size, the number of epochs, and the validation data. The training process updates the model's parameters iteratively to minimize the loss function and improve accuracy.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sentences, labels, test_size=0.25, random_state=42)


x_val = X_train[0:25000]
partial_x_train = X_train[25000:]
y_val = y_train[0:25000]
partial_y_train = y_train[25000:]

The code splits the data into training and testing sets using train_test_split from scikit-learn: 75% of the data is used for training (X_train, y_train) and 25% for testing (X_test, y_test). Additionally, a validation set (x_val, y_val) of 25,000 samples is carved off the start of the training set, leaving the remainder as partial_x_train and partial_y_train for fitting.
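For our 200,000-sample dataset, the resulting subset sizes work out as follows:

```python
total = 200_000
test_size = int(total * 0.25)               # 50,000 samples held out for testing
train_size = total - test_size              # 150,000 samples for training
val_size = 25_000                           # first 25,000 rows of the training set
partial_train_size = train_size - val_size  # 125,000 samples actually fitted on

print(test_size, train_size, val_size, partial_train_size)
# 50000 150000 25000 125000
```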

Optimizing the Hyperparameters

After training the initial model, we optimized the hyperparameters to improve its performance through a process called hyperparameter tuning. Hyperparameters such as the learning rate, batch size, and number of epochs were adjusted to find the values that maximize the model's accuracy. The following tables show how we optimized the hyperparameters:




Final Optimized Model

Hyperparameter optimization was carried out in several stages. Initially, the model overfit and the training curves were irregular. After changing the hyperparameters, most of the curves became regular, although some configurations still overfit. One attempt did avoid the persistent overfitting, but its accuracy was low, so we did not choose it as the optimized model. Based on the evaluation of all our attempts, we selected the model that performed best overall: its overfitting is kept to a small margin, its loss curve decreases continuously, and its accuracy curve increases continuously compared with the other attempts. The following figure shows our final optimized model:


Test Results

After training the final model, we evaluated its performance on the test set. Then we calculated the loss and accuracy of the model on the test data using the evaluate function.

Here is the output showing the train and test loss and accuracy:

Train loss: 0.6783716082572937
Train accuracy: 0.7980800271034241
********************
1563/1563 [==============================] - 3s 2ms/step - loss: 0.6786 - accuracy: 0.7935
Test score: 0.6785876750946045
Test accuracy: 0.7934600114822388

The "1563/1563" in the output means that evaluation on the test set was performed over 1563 steps, or batches. Each step involves processing one batch of data, making predictions, and accumulating the loss and accuracy; together the steps cover the entire test set (with 50,000 test samples and Keras's default evaluation batch size of 32, this works out to 1563 batches). Our model achieved a test accuracy of almost 0.79. This accuracy is the proportion of correctly predicted labels out of all the samples in the test set: a value of 0.793 means that our model correctly classified approximately 79.3% of the test samples. This accuracy is slightly lower than the train accuracy, which indicates a small amount of overfitting.
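At prediction time, the sigmoid outputs would be converted into class labels with a 0.5 threshold; here is a plain-Python sketch of that final step (the probabilities shown are made up for illustration):

```python
# Convert predicted probabilities from the sigmoid output layer
# into binary humor (1) / non-humor (0) labels with a 0.5 threshold.
def to_labels(probabilities, threshold=0.5):
    return [1 if p >= threshold else 0 for p in probabilities]

predicted_probs = [0.91, 0.12, 0.55, 0.49]  # hypothetical model outputs
print(to_labels(predicted_probs))  # [1, 0, 1, 0]
```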

Discussion

In this blog post, we have implemented a binary classification model for humor detection in sentences using a neural network. We discussed the problem to be addressed and its solution, data preparation, data pre-processing, model design, hyperparameter selection, the training process, hyperparameter optimization, the final optimized model, and test results. We selected the final optimized model based on the evaluation of different attempts. Overall, the output indicates that our model achieved reasonably good performance on both the training and test data. However, there is still room for improvement: the loss values could be lower, and the test accuracy could be closer to the train accuracy. It would be worth experimenting with different model architectures, hyperparameters, or regularization techniques to further enhance the model's performance.

References

  1. Pandas Documentation https://pandas.pydata.org/docs/
  2. NumPy Documentation https://numpy.org/doc/
  3. TensorFlow Documentation https://www.tensorflow.org/api_docs
  4. scikit-learn Documentation https://scikit-learn.org/stable/documentation.html
  5. Brownlee, J. (2019). Deep Learning for Natural Language Processing. Machine Learning Mastery.
  6. Chollet, F. (2017). Deep Learning with Python. Manning Publications.
  7. Label Encoding in Python - GeeksforGeeks
