Sentiment analysis is one of the most common tasks machine learning enthusiasts perform to understand the tone, opinions, and other sentiments expressed in text. For years, sentiment analysis datasets were mostly created by extracting information from social media platforms. But as the volume of unstructured data within organizations has grown, companies have been actively leveraging natural language processing techniques to gain insights and make quick decisions. Today, sentiment analysis lets organizations monitor brand and product sentiment among their customers. Consequently, working with sentiment analysis datasets allows job seekers to gain expertise in handling unstructured data and to help companies make effective decisions.
Sentiment analysis datasets are not limited to organizations; researchers have used rule-based models, automated models, and a combination of both to gauge the sentiments behind texts and to advance techniques in artificial intelligence. Neural network models are prevalent in the field because of their performance. But all these models need data to be trained on, especially clean and well-annotated data. This is where benchmark sentiment analysis datasets come in.
Among all the available datasets for sentiment analysis, here are some of the most highly cited:
1. General Language Understanding Evaluation (GLUE) Benchmark
Based on the paper ‘GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding’, the GLUE benchmark includes a binary sentiment classification task, SST-2, along with eight other tasks for an NLU model. State-of-the-art models have been trained and tested on it because of the variety of tasks it covers. Besides, a wide range of models can be evaluated on the linguistic phenomena found in natural language.
python download_glue_data.py --data_dir glue_data --tasks all
Source: Wang et al. GLUE
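Once the data has been downloaded with the command above, the SST-2 split can be read as a tab-separated file. The sketch below assumes the usual SST-2 layout: a header row with `sentence` and `label` columns, where 1 means positive and 0 means negative. The function name `read_sst2` and the path mentioned in the comment are illustrative assumptions, and a tiny in-memory sample stands in for the real file.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_sst2(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse SST-2 style TSV rows into (sentence, label) pairs."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["sentence"].strip(), int(row["label"])) for row in reader]


# In-memory sample standing in for glue_data/SST-2/train.tsv (assumed path).
sample = io.StringIO(
    "sentence\tlabel\n"
    "a charming and often affecting journey\t1\n"
    "the plot is paper-thin and the jokes fall flat\t0\n"
)

examples = read_sst2(sample)
positives = sum(label for _, label in examples)
print(f"{len(examples)} examples, {positives} positive")
```

To read the real file, pass an open file handle instead of the `StringIO` object, for example `open(path, encoding="utf-8", newline="")`.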
2. IMDb Movie Reviews
Hosted by Stanford, this beginner-friendly binary sentiment analysis dataset consists of 50,000 reviews from the Internet Movie Database (IMDb). A review with a score of 7 or higher out of 10 is labeled positive, and a review with a score of 4 or lower is labeled negative. The dataset contains the same number of positive and negative reviews, with no more than 30 reviews per movie. Only highly polarized reviews are included.
Source: Andrew L. Maas et al.
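The labeling rule for the IMDb reviews can be sketched as a small function. It follows the thresholds reported by Maas et al.: ratings of 7 or higher out of 10 are positive, ratings of 4 or lower are negative, and the middle of the scale is left out. The function name `label_from_rating` is a hypothetical helper and is not part of the dataset's tooling.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 review score to a binary sentiment label.

    Returns None for mid-scale scores (5 and 6), which are excluded
    because only highly polarized reviews are kept.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating >= 7:
        return "positive"
    if rating <= 4:
        return "negative"
    return None


# Show how each score on the scale would be treated.
for score in range(1, 11):
    print(score, label_from_rating(score))
```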
3. DynaSent
DynaSent is an English-language dataset for sentiment analysis with positive, negative, and neutral labels. It combines naturally occurring sentences with sentences created using the open-source Dynabench Platform, which facilitates human-and-model-in-the-loop dataset creation. DynaSent has a total of 121,634 sentences, each validated by five crowd workers. The dataset also contains the Stanford Sentiment Treebank dev set with labels.
Source: Potts et al.
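Since each DynaSent sentence is validated by five crowd workers, a common first step is to collapse the five responses into a single label. The sketch below assumes a simple majority rule, where a label is kept only if at least three of the five validators agree. The function name, the `min_agree` parameter, and the label strings are illustrative assumptions rather than the dataset's own field names.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_agree: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agree` validators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


# Three of five validators agree, so a majority label exists.
clear = majority_label(["positive", "positive", "neutral", "positive", "negative"])

# No label reaches three votes, so the sentence has no majority label.
split = majority_label(["positive", "positive", "neutral", "neutral", "negative"])

print(clear, split)
```

Sentences without a majority label can be dropped or kept with their full vote distribution, depending on the experiment.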
4. MPQA Opinion Corpus (Multi-Perspective Question Answering)
The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources, manually annotated for opinions, beliefs, emotions, sentiments, speculations, and more. The data should be used strictly for research and academic purposes.
Source: Janyce Wiebe et al.
5. ReDial
ReDial (Recommendation Dialogues) is an annotated dataset of dialogues for sentiment analysis in which users recommend movies to each other. It consists of over 10,000 conversations centered on providing movie recommendations. Several example conversations from the validation set are available and can be useful to look through before getting started.
Source: Li et al.
6. AG’s Corpus
Antonio Gulli’s corpus is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead over more than one year. The dataset can be used for non-commercial purposes only, and it cannot be redistributed under a different name.
Source: Gulli in AG’s corpus of news articles
7. Amazon Fine Foods
The paper ‘From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews’, which uses the Amazon Fine Foods dataset, has been cited over 400 times. The dataset consists of ~500,000 reviews up to October 2012 by 256,059 users. A total of 74,258 products have been reviewed, with a median of 56 words per review.
Source: McAuley et al.
8. SPOT (Sentiment Polarity Annotations Dataset)
Collected from Yelp’13 and IMDB, the SPOT sentiment analysis dataset contains 197 reviews annotated with segment-level polarity labels (positive/neutral/negative). Annotations have been gathered at two levels of granularity: sentences and Elementary Discourse Units (EDUs). The dataset is ideal for evaluating methods that predict sentiment on a fine-grained, segment-level basis.
Source: Angelidis et al.
9. Youtbean
Youtbean is a dataset created from the closed captions of YouTube product review videos. It can be used for a range of sentiment analysis tasks such as aspect extraction and sentiment classification. The dataset was used in the paper ‘Mining fine-grained opinions on closed captions of YouTube videos with an attention-RNN.’
Source: Marrese-Taylor et al.
10. ReviewQA
ReviewQA is a question-answering dataset proposed for sentiment analysis tasks based on hotel reviews. The dataset consists of questions that are linked to a set of relational understanding competencies of models. Each question comes with an associated type that characterizes the required competency.
Source: Grail et al.
11. iSarcasm
Twitter datasets are among the go-to resources for sentiment analysis. iSarcasm is a dataset of tweets annotated for intended sarcasm. Each tweet is labeled as either sarcastic or non_sarcastic. Sarcastic tweets are further labeled with the type of ironic speech: sarcasm, irony, satire, understatement, overstatement, or rhetorical question.
Source: Oprea et al.
12. PHINC
PHINC is a parallel corpus of 13,738 code-mixed English-Hindi sentences and their English translations. According to the researchers, the translations were manually annotated. This is one of the best datasets for sentiment analysis on mixed-language text, which is highly common in India.
Source: Srivastava et al.
13. XED
XED is a multilingual fine-grained emotion dataset for sentiment analysis. The dataset consists of human-annotated Finnish (25k) and English (30k) sentences, as well as projected annotations for 30 additional languages, providing new resources for many low-resource languages.
Source: Öhman et al.
14. MultiSenti
MultiSenti offers code-switched informal short text on which a deep learning-based model can be trained for sentiment classification. The developers have also provided a pre-trained model using word2vec.
Source: Shakeel et al.
15. PerSenT
The PerSenT dataset contains crowd-sourced annotations of the sentiment expressed by authors towards the main entities in news articles. The dataset also includes paragraph-level sentiment annotations to provide more fine-grained supervision for the task.
Source: Bastan et al.