#BlackLivesMatter: Fine Grained Emotion Dataset
I started my master’s project in early 2020 and decided to do something about social movements. At that time the overall direction and research objectives weren’t clear yet, but one thing was certain: I wanted to do something related to social science. After a few discussions with my lecturers Dr Sabrina Tiun and Prof Nazlia Omar and a few weeks of preparation, I decided to do my research on the Black Lives Matter movement in the United States of America. I was also lucky that Dr Ridwzan took me in and became my supervisor in the UKM FTSM Sentiment Lab.
I spent almost 2 months just collecting the data for my research. This is because I am doing fine-grained emotion detection following Robert Plutchik’s Wheel of Emotions (Fear, Anger, Sadness, Joy, Disgust, Trust, Surprise, Anticipation) and not just sentiment polarity. Obtaining a usable dataset therefore became one of my bigger challenges. After a few months of trial and error I decided to build my own dataset.
Prelabeled datasets are datasets that are already labelled or annotated. Some of them are research data from other researchers, and some are from the Semantic Evaluation (SemEval) workshops in 2018 and 2019.
a) Sentiment Analysis in Text is a dataset published by CrowdFlower (or Appen) in Nov 2016 on data.world. It has a total of 13 emotions. link
b) Emotion Stimulus Data (also known as Emotion Cause) was published by Diman Ghazi (Ghazi et al. 2015) for the 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015) in Cairo, Egypt. It has 7 emotions. link
c) International Survey on Emotion Antecedents and Reactions (ISEAR) has a total of 7,666 instances with 7 emotions. This dataset was created using questionnaires about experiences of and reactions to those 7 emotions, answered by a total of 1,096 participants from different cultural backgrounds. It was produced by R. Scherer (Scherer & Wallbott 1994). link
d) SemEval 2018 Task 1 - Affect in Tweets. This dataset was provided by Saif M. Mohammad from the National Research Council Canada (Mohammad et al. 2018) for SemEval 2018 Task 1, hosted in New Orleans, LA, and was built from English, Arabic and Spanish tweets. It has only 4 emotions, with a total of 8,640 instances. link
e) SemEval 2019 Task 3 - Contextual Emotion Detection in Text is the dataset for a shared task at the International Workshop on Semantic Evaluation 2019 (SemEval-2019). It has a total of 15,000 instances with 3 emotions. link
f) SMILE Twitter dataset contains 3,085 tweets and 5 emotion labels. It was published in April 2016, and the tweets were collected between May 2013 and June 2015. The tweets are about emotions expressed on Twitter towards art and cultural experiences in British museums. This project was funded by the AHRC (Arts and Humanities Research Council), UK. link
All 6 datasets were then cleaned with a text pre-processing pipeline of 6 sub-tasks to remove noise. Because the original emotion labels of each prelabeled dataset do not align exactly with Robert Plutchik’s, the labels were remapped using Python Pandas, as shown in Figure 2 below. For example, tweets labelled “Worry” were remapped to “Fear” and “Happy” to “Joy”.
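To make the remapping step concrete, here is a minimal sketch in plain Python. Only the “Worry” to “Fear” and “Happy” to “Joy” pairs come from this article; the other entries in LABEL_MAP, the function names, and the choice to drop rows without a mapping are assumptions for illustration. With a pandas DataFrame, the same dictionary could be applied to the label column with Series.map or Series.replace.

```python
# Plutchik's eight basic emotions, used as the target label set.
PLUTCHIK = {"fear", "anger", "sadness", "joy",
            "disgust", "trust", "surprise", "anticipation"}

# Source label -> Plutchik label.
# Only "worry" and "happy" are taken from the article; the rest are assumptions.
LABEL_MAP = {
    "worry": "fear",
    "happy": "joy",
    "happiness": "joy",  # assumption
    "sad": "sadness",    # assumption
}

def remap_label(label):
    """Return the Plutchik label for a source label, or None if unmapped."""
    key = label.strip().lower()
    if key in PLUTCHIK:
        return key
    return LABEL_MAP.get(key)

def remap_rows(rows):
    """Remap (text, label) pairs; rows with no mapping are dropped (assumption)."""
    remapped = []
    for text, label in rows:
        new_label = remap_label(label)
        if new_label is not None:
            remapped.append((text, new_label))
    return remapped

rows = [
    ("I am so nervous about tomorrow", "Worry"),
    ("What a great day", "Happy"),
    ("This makes me furious", "anger"),
    ("Nothing much going on", "neutral"),
]
print(remap_rows(rows))
```

Keeping the mapping in one dictionary makes it easy to review which source labels end up under which emotion before merging the datasets.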
Simple assessments were then run on each dataset using simple ML models (Naïve Bayes, SVM, linear) and text representations (BoW, TF-IDF), and the performance was compared. Figure 3 below shows a simple and quick test on each dataset.
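As an illustration of what such a quick check can look like, below is a small multinomial Naïve Bayes classifier on bag-of-words counts, written in plain Python. It is a stand-in sketch and not the code used for Figure 3: the class name, the whitespace tokenizer, the add-one smoothing, and the toy sentences are all assumptions. In practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log prior + sum of log likelihoods of the known tokens
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)

# Toy data, made up for this example.
train_texts = ["i am scared of the dark", "so scared and afraid",
               "i am happy today", "happy and glad"]
train_labels = ["fear", "fear", "joy", "joy"]
test_texts = ["afraid of the dark", "glad today"]
test_labels = ["fear", "joy"]

model = NaiveBayes().fit(train_texts, train_labels)
print(accuracy(model, test_texts, test_labels))
```

Running the same small script on each candidate dataset gives a comparable number per dataset, which is all a quick screening step needs.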
From the results, datasets (a) and (f) performed poorly compared to the other datasets (b, c, d and e). Hence only datasets b, c, d and e were kept, and the details of each are shown in Figure 4 below.
At this stage, the base training data is built. However, I could not stop here because it is still not sufficient. One reason is that this dataset is class-imbalanced: only 2 emotions (Anger and Sadness) have around 8,000 instances. Another reason is that this dataset is not directly related to Black Lives Matter or social movements. Hence additional data related to the research topic needs to be added to enrich the base dataset.
Raw Data or EmoLex Labeled Dataset
4 raw (unlabeled) datasets were downloaded from various sources and also collected with the GetOldTweet Twitter API using the keyword #BlackLivesMatter. These 4 datasets went through the same data cleaning process, and then the EmoLex lexicon was used to label the data. EmoLex is a corpus-based lexicon with a list of keywords. I wrote Python code to match the lexicon words against the tweets, count the number of occurrences and assign the emotion intensity. Figure 5 shows the dataset information.
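The matching idea can be sketched roughly as follows. This is not my original script: TOY_LEXICON is a tiny made-up word list and not real EmoLex entries, the function names are hypothetical, and treating the number of matched words as the intensity and breaking ties alphabetically are assumptions.

```python
from collections import Counter

# Tiny made-up lexicon in the spirit of EmoLex: word -> associated emotions.
# These entries are illustrative assumptions, not copied from EmoLex.
TOY_LEXICON = {
    "afraid": {"fear"},
    "threat": {"fear", "anger"},
    "justice": {"trust"},
    "hope": {"anticipation", "joy"},
}

def score_emotions(text, lexicon):
    """Count, per emotion, how many lexicon words occur in the text."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,!?#@\"'")  # drop surrounding punctuation and # / @
        for emotion in lexicon.get(word, ()):
            counts[emotion] += 1
    return counts

def label_tweet(text, lexicon):
    """Return (emotion, intensity) for the strongest emotion, or (None, 0).

    Intensity here is simply the match count (assumption).
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = score_emotions(text, lexicon)
    if not counts:
        return None, 0
    emotion, intensity = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return emotion, intensity

print(label_tweet("We are afraid, this threat is real", TOY_LEXICON))
print(label_tweet("hello world", TOY_LEXICON))
```

Tweets that match no lexicon word get no label, so they can simply be left out of the labeled set.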
Create Final Training Dataset
Now both types of datasets are ready to be combined into one final training dataset. First, the prelabeled dataset is used in its entirety as the base, and the EmoLex-labeled data is used as a supplement to fill the gaps in the prelabeled dataset. Figure 6 below shows the combination of both datasets, grouped by emotion. To keep the dataset balanced, only approximately 8,000 instances are kept for each emotion. The EmoLex data is sorted by intensity and copied over to the prelabeled dataset from high intensity to low intensity; once an emotion reaches approximately 8,000 instances, copying stops for that emotion.
Figure 7 above shows the details of the final training dataset with Robert Plutchik’s Wheel of Emotions.
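The top-up logic described here can be sketched as below. It assumes the data is held as simple tuples and uses a hypothetical function name and a tiny target of 3 instead of 8,000 so the example is easy to follow. Prelabeled rows are always kept, even for an emotion that already exceeds the target.

```python
from collections import Counter

def build_balanced(prelabeled, emolex_rows, target):
    """Combine both sources into one training set.

    prelabeled:  list of (text, emotion), used in full as the base.
    emolex_rows: list of (text, emotion, intensity), used only to top up.
    target:      approximate number of instances wanted per emotion.
    """
    final = list(prelabeled)
    counts = Counter(emotion for _, emotion in prelabeled)
    # Walk the EmoLex rows from highest to lowest intensity.
    by_intensity = sorted(emolex_rows, key=lambda row: row[2], reverse=True)
    for text, emotion, _intensity in by_intensity:
        if counts[emotion] < target:  # stop copying once the emotion is full
            final.append((text, emotion))
            counts[emotion] += 1
    return final, counts

# Toy data, made up for this example.
prelabeled = [("a", "anger"), ("b", "anger"), ("c", "anger"), ("d", "joy")]
emolex_rows = [("e", "joy", 1), ("f", "joy", 3), ("g", "joy", 2),
               ("h", "anger", 5), ("i", "trust", 1)]

final, counts = build_balanced(prelabeled, emolex_rows, target=3)
print(counts)
```

Sorting first means the rows that are left out are always the ones with the weakest lexicon evidence.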
Final Training Dataset Detail
Figure 8 shows the word count distribution for each emotion in the training dataset. This figure shows how the length of people’s tweets varies with the emotion. For example, people tend to write short tweets when expressing Fear and longer ones when expressing Trust.
The following 3 charts in Figure 9 show the word cloud and word frequency of this dataset by noun, adjective and verb.
The training dataset in this article was used in my master’s project to do emotion detection with a BERT model, which achieved an accuracy of 0.9 and above. Note that although a few actions were taken to balance the dataset, 3 emotions are still not balanced.
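A word count summary like this can be computed with a few lines of Python. The sketch below is only an illustration: the function name and toy rows are made up, and it counts words by splitting on whitespace, which may differ from how Figure 8 was produced.

```python
from collections import defaultdict
from statistics import mean, median

def word_count_stats(rows):
    """Summarise tweet length in words per emotion for (text, emotion) rows."""
    lengths = defaultdict(list)
    for text, emotion in rows:
        lengths[emotion].append(len(text.split()))
    return {
        emotion: {"n": len(v), "mean": mean(v), "median": median(v)}
        for emotion, v in lengths.items()
    }

# Toy rows, made up for this example.
rows = [
    ("run now", "fear"),
    ("help", "fear"),
    ("i believe in our community leaders", "trust"),
]
stats = word_count_stats(rows)
print(stats)
```

Comparing the mean and median per emotion is a quick way to see whether a few very long tweets are skewing the picture.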
The dataset can be downloaded from https://www.kaggle.com/carlsonhoo/baselinedataset