Monday, June 17, 2024

Top 15 Datasets For Sentiment Analysis With Significant Citations

Sentiment analysis is one of the most common tasks machine learning practitioners use to understand the tone, opinions, and emotions expressed in text. For years, sentiment analysis datasets were mostly built by extracting text from social media platforms. As the volume of unstructured data within organizations has grown, companies have increasingly turned to natural language processing techniques to draw insights from it and make decisions faster. Today, sentiment analysis lets organizations monitor how customers feel about their brands and products. Working with sentiment analysis datasets therefore helps job seekers build expertise in handling unstructured data, which in turn helps companies make effective decisions.

Sentiment analysis datasets are not useful only to organizations; researchers have used rule-based models, automated models, and combinations of both to gauge the sentiment behind text and to advance techniques in artificial intelligence. Neural network models are prevalent in the field because of their strong performance. But all these models need training data, ideally clean and well annotated. This is where benchmark sentiment analysis datasets come in.

Among the available datasets for sentiment analysis, here are some of the most highly cited:


1. General Language Understanding Evaluation (GLUE) Benchmark

Introduced in a paper on multi-task benchmarking and analysis for Natural Language Understanding (NLU), the GLUE benchmark includes a binary sentiment classification task, SST-2, along with eight other tasks for NLU models. State-of-the-art models are commonly trained and tested on it because of the variety of its tasks. In addition, models can be evaluated on a wide range of linguistic phenomena found in natural language.

Download: python --data_dir glue_data --tasks all

Source: Wang et al. GLUE 

2. IMDb Movie Reviews

Hosted by Stanford, this beginner-friendly binary sentiment analysis dataset consists of 50,000 reviews from the Internet Movie Database (IMDb). A review with a score of 7 or higher out of 10 is labeled positive, and a review with a score of 4 or lower is labeled negative. The dataset contains the same number of positive and negative reviews, with no more than 30 reviews per movie. Only highly polarized reviews are included.
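
The labeling rule for IMDb reviews can be sketched in a few lines of Python. This is only an illustration, not the authors' preprocessing code: the function name `imdb_label` is hypothetical, and the sketch assumes inclusive thresholds on a 1-10 scale (7 and above is positive, 4 and below is negative), with the middle ratings left out.

```python
from typing import Optional


def imdb_label(rating: int) -> Optional[str]:
    """Map a 1-10 star rating to a binary label, or None if the review is excluded.

    Assumption: thresholds are inclusive (>= 7 positive, <= 4 negative).
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating >= 7:
        return "positive"
    if rating <= 4:
        return "negative"
    # Ratings of 5 and 6 are not polarized enough and are left out.
    return None


if __name__ == "__main__":
    for stars in (2, 5, 9):
        print(stars, imdb_label(stars))
```

Returning `None` for the middle ratings keeps the excluded reviews explicit, so a caller can filter them out before balancing the positive and negative classes.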


Download: Link

Source: Andrew L. Maas et al.

3. DynaSent

DynaSent is an English-language benchmark for three-way (positive, negative, and neutral) sentiment analysis. It combines naturally occurring sentences with sentences created on the open-source Dynabench platform, which facilitates human-and-model-in-the-loop dataset creation. DynaSent has a total of 121,634 sentences, each validated by five crowd workers. The dataset also contains the Stanford Sentiment Treebank dev set with labels.
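
When each sentence carries five crowd-worker judgments, one common way to derive a single label is a majority vote. The sketch below assumes that approach for illustration; the function `majority_label` and the `min_agreement` parameter are hypothetical and are not taken from the DynaSent release.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agreement` workers, else None.

    Assumption: a sentence with no sufficiently agreed label is left unlabeled.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


if __name__ == "__main__":
    votes = ["positive", "positive", "neutral", "positive", "negative"]
    print(majority_label(votes))
```

With five votes and a threshold of three, at most one label can qualify, so the result is unambiguous whenever it is not `None`.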

Download: Link

Source: Potts et al.


4. MPQA Opinion Corpus (Multi-Perspective Question Answering)

The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources, manually annotated for opinions, beliefs, emotions, sentiments, speculations, and more. The data may be used for research and academic purposes only.

Download: Link

Source: Janyce Wiebe et al. 

5. ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues in which users recommend movies to each other, and it can be used for sentiment analysis. The dataset consists of over 10,000 conversations centered on movie recommendations. Several example conversations from the validation set are available and are worth browsing before getting started.

Download: Link

Source: Li et al.

6. AG’s Corpus 

Antonio Gulli’s corpus is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead over more than one year. The dataset can be used for non-commercial purposes only, and it cannot be redistributed under a different name.

Download: Link

Source: Gulli in AG’s corpus of news articles

7. Amazon Fine Foods

The paper From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews, which uses the Amazon Fine Foods data, has been cited over 400 times. The Amazon Fine Foods dataset consists of roughly 500,000 reviews (568,454 in the SNAP release) written up to October 2012 by 256,059 users. A total of 74,258 products are reviewed, and the median review is 56 words long.
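
A statistic such as the median number of words per review is easy to recompute once the review texts are loaded. The sketch below assumes plain whitespace tokenization, which may not match how the dataset's published figure was computed; the function name `median_review_length` and the sample reviews are made up for illustration.

```python
from statistics import median
from typing import Iterable


def median_review_length(reviews: Iterable[str]) -> float:
    """Median number of whitespace-separated words across the given reviews."""
    return median(len(text.split()) for text in reviews)


if __name__ == "__main__":
    sample = [
        "great coffee",
        "too sweet for me",
        "ok",
    ]
    print(median_review_length(sample))
```

Running the same function over the full set of review texts gives a quick sanity check that the data was loaded and parsed as expected.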

Download: Link  

Source: McAuley et al. 

8. SPOT (Sentiment Polarity Annotations Dataset)

Built from Yelp’13 and IMDb reviews, the SPOT sentiment analysis dataset contains 197 reviews annotated with segment-level polarity labels (positive/neutral/negative). Annotations were gathered at two levels of granularity: sentences and Elementary Discourse Units (EDUs). The dataset is well suited to evaluating methods that predict sentiment at a fine-grained, segment level.

Download: Link

Source: Angelidis et al.

9. Youtubean

Youtubean is a dataset created from the closed captions of YouTube product review videos. It can be used for a range of sentiment analysis tasks, such as aspect extraction and sentiment classification. The dataset was used in the paper ‘Mining fine-grained opinions on closed captions of YouTube videos with an attention-RNN.’

Download: GitHub

Source: Marrese-Taylor et al.

10. ReviewQA

ReviewQA is a question-answering dataset proposed for sentiment analysis tasks based on hotel reviews. The dataset consists of questions that are linked to a set of relational understanding competencies of models. Each question comes with an associated type that characterizes the required competency.

Download: GitHub

Source: Grail et al. 

11. iSarcasm

Twitter data is one of the go-to sources for sentiment analysis. iSarcasm is a dataset of tweets annotated for intended sarcasm. Each tweet is labeled as either sarcastic or non_sarcastic. Sarcastic tweets are further labeled with the type of ironic speech: sarcasm, irony, satire, understatement, overstatement, or rhetorical question.
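
The two-level labeling scheme can be modeled as a small validated record. This is a hypothetical representation, not the dataset's actual file format: the class `TweetLabel`, the field names, and the string spellings of the categories are assumptions, as is the simplification that each sarcastic tweet carries exactly one category.

```python
from dataclasses import dataclass
from typing import Optional

# Category names follow the list in the text; the exact spellings are assumed.
IRONY_TYPES = {
    "sarcasm",
    "irony",
    "satire",
    "understatement",
    "overstatement",
    "rhetorical_question",
}


@dataclass(frozen=True)
class TweetLabel:
    """Binary sarcasm label plus an ironic-speech type for sarcastic tweets."""

    sarcastic: bool
    irony_type: Optional[str] = None

    def __post_init__(self) -> None:
        if self.sarcastic and self.irony_type not in IRONY_TYPES:
            raise ValueError(f"sarcastic tweets need a known type, got {self.irony_type!r}")
        if not self.sarcastic and self.irony_type is not None:
            raise ValueError("non_sarcastic tweets must not carry an irony type")


if __name__ == "__main__":
    print(TweetLabel(sarcastic=True, irony_type="satire"))
    print(TweetLabel(sarcastic=False))
```

Validating the pair at construction time catches inconsistent rows early, for example a non_sarcastic tweet that still carries a category.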

Download: GitHub

Source: Oprea et al.


12. PHINC

PHINC is a parallel corpus of 13,738 code-mixed English-Hindi sentences and their English translations. According to the researchers, the translations were annotated manually. It is one of the best resources for sentiment analysis on mixed-language text, which is highly common in India.

Download: Link

Source: Srivastava et al.

13. XED

XED is a multilingual fine-grained emotion dataset for sentiment analysis. The dataset consists of human-annotated Finnish (25k) and English (30k) sentences, as well as projected annotations for 30 additional languages, providing new resources for many low-resource languages.

Download: GitHub

Source: Öhman et al.

14. MultiSenti

MultiSenti offers code-switched informal short text on which a deep learning-based model can be trained for sentiment classification. The developers have also provided a pre-trained model that uses word2vec.

Download: Link

Source: Shakeel et al.

15. PerSenT

The PerSenT dataset contains crowd-sourced annotations of the sentiment expressed by the authors towards the main entities in news articles. The dataset also includes paragraph-level sentiment annotations to provide more fine-grained supervision for the task.

Download: Link

Source: Bastan et al.


Ratan Kumar
Ratan is a tech content writer who amasses inspiration from science fiction, cartoons, and psychology. Apart from writing, you can find him playing mobile games and depicting humans.

