Special Topics in Natural Language Processing (CS698O) : Winter 2020
Natural language (NL) refers to the language spoken/written by humans. NL is the primary mode of communication for humans. With the growth of the world wide web, data in the form of natural language text has grown exponentially. This calls for the development of algorithms and techniques for processing natural language, both for automation and for building intelligent machines. This course will primarily focus on understanding and developing techniques/learning algorithms/models for processing text. We will take a statistical approach to Natural Language Processing (NLP), wherein we will learn how one can develop natural language understanding models from regularities in large corpora of natural language texts.
Pre-requisites:
Must: Introduction to Machine Learning (CS771) or an equivalent course; proficiency in Linear Algebra, Probability and Statistics; proficiency in Python programming.
Desirable: Probabilistic Machine Learning (CS772), Topics in Probabilistic Modeling and Inference (CS775), Deep Learning for Computer Vision (CS776)
In case you want to communicate with the instructor, please do not send direct emails to the instructor (these will most likely end up in spam); use the course email instead: nlp.course.iitk@gmail.com
The course will be managed via Piazza. Please sign up on Piazza for course announcements, study material, and resources. For the access code, please contact the instructor or one of the TAs.
Grading:
This is a research-project-oriented course, and the project carries the maximum weightage. The tentative weightages for the different components are as follows.
Quizzes: 40%
Project: 60%
Mid-Sem Exam: Project paper and presentation
End-Sem Exam: Project paper and presentation
There are no specific references; this course gleans information from a variety of sources like books, research papers, other courses, etc. Relevant references will be suggested in the lectures. Some of the frequently used references are as follows:
Speech and Language Processing, Daniel Jurafsky, James H. Martin
Foundations of Statistical Natural Language Processing, Christopher D. Manning, Hinrich Schütze
Introduction to Natural Language Processing, Jacob Eisenstein
This is a project-oriented course, and participants worked on different NLP research projects. The following is the list of final projects completed by the participants. Research done in some of these projects was published at workshops/conferences.
Commonsense Validation and Explanation
Sandeep Routray, Soumya Ranjan Dash, Prateek Varshney PAPER
In this project, we develop a system for addressing the research problem posed in Task 4 of SemEval 2020, which involves differentiating between natural language statements that conform to common sense and those that do not. The organizers propose three subtasks: first, selecting, of two sentences, the one which is against common sense; second, identifying the most crucial reason why a statement does not make sense; and third, generating novel reasons for explaining the against-common-sense statement. Out of the three subtasks, this paper reports the system description for subtask A and subtask B. The paper proposes a model based on the transformer neural network architecture for addressing the subtasks. The novelty of the work lies in the architecture design, which handles the logical implication of contradicting statements and extracts information from both sentences simultaneously. We use parallel instances of transformers, which are responsible for a boost in performance. We achieved an accuracy of 94.8% on subtask A and 89% on subtask B on the test set.
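A minimal sketch of the kind of transformer classifier such a system builds on: score each statement with a fine-tuned BERT and pick the one judged less sensical. The checkpoint name and the two-label setup are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch (not the authors' exact model): score each of the two
# statements with a BERT classifier and pick the one judged less sensical.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()  # in practice the model would first be fine-tuned on the task data

def nonsense_score(sentence: str) -> float:
    """Probability that a statement is against common sense (label 1)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def pick_nonsensical(sent_a: str, sent_b: str) -> str:
    return sent_a if nonsense_score(sent_a) > nonsense_score(sent_b) else sent_b

print(pick_nonsensical("He put a turkey into the fridge.",
                       "He put an elephant into the fridge."))
```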
Sentiment Analysis of Code Mixed Text
Ayush Kumar, Harsh Agarwal, Keshav Bansal PAPER
Sentiment analysis of code-mixed text has diversified applications in opinion mining, ranging from tagging user reviews to identifying social or political sentiments of a sub-population. In this paper, we present an ensemble architecture of a convolutional neural network (CNN) and a self-attention-based LSTM for sentiment analysis of code-mixed tweets. While the CNN component helps in the classification of positive and negative tweets, the self-attention-based LSTM helps in the classification of neutral tweets because of its ability to identify the correct sentiment among multiple sentiment-bearing units. We achieved F1 scores of 0.707 (ranked 5th) and 0.725 (ranked 13th) on the Hindi-English (Hinglish) and Spanish-English (Spanglish) datasets, respectively. The submissions for the Hinglish and Spanglish tasks were made under the usernames ayushk and harsh_6, respectively.
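For the self-attention-based LSTM component described above, a minimal sketch of one common formulation follows: attention weights are computed over the LSTM hidden states and used to pool a sentence representation. All dimensions and the vocabulary size are placeholders, not the authors' settings.

```python
# A minimal sketch of a self-attention-over-LSTM-states classifier.
import torch
import torch.nn as nn

class SelfAttentionLSTM(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=128, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # scores each time step
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))      # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attend over time steps
        context = (weights * h).sum(dim=1)           # weighted sum of states
        return self.fc(context)

logits = SelfAttentionLSTM()(torch.randint(0, 30000, (4, 20)))
print(logits.shape)  # torch.Size([4, 3])
```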
Modelling Causal Reasoning in Language, Detecting Counterfactuals Application: Financial Document Causality Detection
Rohin Garg, Shashank Gupta, Anirudh Anil Ojha PAPER
This project aims at developing computational models for detecting a class of textual expressions known as counterfactuals and separating them into their constituent elements. Counterfactual statements describe events that did not or could not have occurred and the possible implications of such events. While counterfactual reasoning is natural for humans, understanding these expressions is difficult for artificial agents due to a variety of linguistic subtleties. For this project, we participated in Task 5 of SemEval-2020. Our final submitted approaches were an ensemble of various fine-tuned transformer-based and CNN-based models for the first subtask and a transformer model with dependency-tree information for the second subtask. We ranked 4th and 9th on the overall leaderboard. We also explored various other approaches involving classical methods, other neural architectures, and the incorporation of different linguistic features.
Detection of Propaganda Techniques in News Articles
Paramansh Singh, Siraj Singh Sandhu, Subham Kumar PAPER
In this project, we develop a system for addressing the research problem posed in SemEval 2020 Task 11: Detection of Propaganda Techniques in News Articles, for each of the two subtasks of Span Identification and Technique Classification. We make use of a pre-trained BERT language model, enhanced with tagging techniques developed for the task of Named Entity Recognition (NER), to build a system for identifying propaganda spans in the text. For the second subtask, we incorporate contextual features into a pre-trained RoBERTa model for the classification of propaganda techniques. We were ranked 5th in the propaganda technique classification subtask.
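A minimal sketch of the NER-style tagging setup such a span-identification system is built on: BertForTokenClassification predicting BIO labels over tokens. The label set shown is an illustrative assumption.

```python
# A minimal sketch of BIO span tagging with BertForTokenClassification.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

labels = ["O", "B-PROP", "I-PROP"]          # illustrative BIO scheme for spans
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased",
                                                   num_labels=len(labels))
model.eval()  # would be fine-tuned on span-annotated articles in practice

text = "They want to destroy everything we stand for."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    predictions = model(**inputs).logits.argmax(dim=-1)[0]

for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                       predictions):
    print(f"{token:15s} {labels[int(pred)]}")
```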
Emphasis Selection for Written Text in Visual Media
Rishabh Agarwal, Vipul Singhal, Sahil Dhull PAPER
In this project, we develop a system for addressing the research problem posed in Task 10 of SemEval-2020: Emphasis Selection for Written Text in Visual Media. We propose an end-to-end model that takes the text as input and, for each word, outputs the probability that the word should be emphasized. Our results show that transformer-based models are particularly effective in this task. We achieved the best Match_m score (described in Section 2.2 of the paper) of 0.810 and were ranked third on the leaderboard.
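A minimal sketch of a per-word emphasis scorer in the spirit described: a transformer encoder with a sigmoid head that yields, for each token, a probability of emphasis. The checkpoint and head design are assumptions, not the authors' exact model.

```python
# A minimal sketch of a token-level emphasis-probability model.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class EmphasisScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return torch.sigmoid(self.head(hidden)).squeeze(-1)  # (batch, seq_len)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer("Never give up on your dreams", return_tensors="pt")
probs = EmphasisScorer()(batch["input_ids"], batch["attention_mask"])
print(probs)  # one emphasis probability per subword token
```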
Memotion Analysis
Vishal Keswani, Sakshi Singh, Suryansh Agarwal PAPER
Social media is abundant in visual and textual information, presented together or in isolation. Memes, which combine both, are the most popular form. In this project, we develop approaches for the Memotion Analysis problem as posed in SemEval-2020 Task 8. The goal of this task is to classify memes based on their emotional content and sentiment. We leverage techniques from Natural Language Processing (NLP) and Computer Vision (CV) for the sentiment classification of internet memes (Subtask A). We consider bimodal (text and image) as well as unimodal (text-only) techniques, ranging from the Naïve Bayes classifier to transformer-based approaches. Our results show that a text-only approach, a simple Feed-Forward Neural Network (FFNN) with Word2vec embeddings as input, outperforms all the others. We stand first in the sentiment analysis subtask with a relative improvement of 63% over the baseline macro-F1 score. Our work is relevant to any task concerned with combining different modalities.
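A minimal sketch of the winning text-only pipeline as described: average Word2vec vectors over the meme text, then classify with a small FFNN. The embedding table below is a random stand-in for real Word2vec vectors.

```python
# A minimal sketch: averaged word vectors into a feed-forward classifier.
import numpy as np
import torch
import torch.nn as nn

EMBED_DIM = 300
word2vec = {"funny": np.random.randn(EMBED_DIM),   # stand-in for trained vectors
            "cat": np.random.randn(EMBED_DIM)}

def sentence_vector(text: str) -> torch.Tensor:
    vecs = [word2vec[w] for w in text.lower().split() if w in word2vec]
    mean = np.mean(vecs, axis=0) if vecs else np.zeros(EMBED_DIM)
    return torch.tensor(mean, dtype=torch.float32)

ffnn = nn.Sequential(nn.Linear(EMBED_DIM, 128), nn.ReLU(),
                     nn.Linear(128, 3))            # 3 sentiment classes

logits = ffnn(sentence_vector("funny cat"))
print(logits.softmax(dim=-1))
```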
Multilingual Offensive Language Identification in Social Media
Karishma Laud, Jagriti Singh, Randeep Kumar Sahu PAPER
In this project, we develop systems for addressing the research problem posed in SemEval-2020 Shared Task 12: Multilingual Offensive Language Identification in Social Media. We participated in all three sub-tasks of OffensEval-2020, and our final submissions during the evaluation phase included transformer-based approaches and a soft-label-based approach. We submitted BERT-based fine-tuned models for each language of sub-task A (offensive tweet identification) and a RoBERTa-based fine-tuned model for sub-task B (automatic categorization of offence types). For sub-task C (offence target identification), we submitted two models, one using soft labels and the other a BERT-based fine-tuned model. Our ranks for sub-task A were Greek: 19 out of 37, Turkish: 22 out of 46, Danish: 26 out of 39, Arabic: 39 out of 53, and English: 20 out of 85. We achieved a rank of 28 out of 43 for sub-task B. Our best rank for sub-task C was 20 out of 39, using the BERT-based fine-tuned model.
End-to-End Emotion-Cause Pair Extraction
The task of emotion cause extraction (ECE) is aimed at inferring the cause of an emotion expressed in a piece of text. The task assumes that the emotion associated with the text is provided to us. Such an emotion→cause extraction pipeline disregards the inherent dependence between emotions and causes while also limiting the applicability of the model. Recent work on emotion-cause pair extraction (ECPE) (Xia and Ding, 2019) has tried to improve upon this by extracting emotion-cause clause pairs from the document in a two-step approach: first extracting emotion and cause clauses, and then pairing them. However, this overlooks the effect that a cause clause has on the perception of the emotion, since clause extraction happens in isolation from the pairing task. Further, it runs the risk of failing to extract potential emotion clauses in the first step of the pipeline, since certain clauses do not appear to convey an emotion when seen in isolation from the cause clause. To overcome these drawbacks, we propose an end-to-end emotion-cause pair extraction architecture that infers emotion-cause pairs from documents and takes into account the effect of the cause clause on the perceived emotion of the emotion clause. We evaluate our approach on the benchmark emotion cause dataset introduced in (Gui et al., 2016) and show significant performance improvements in the emotion-cause pairing task.
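A minimal sketch of the joint pair-scoring idea: encode every clause, then score all (emotion, cause) clause pairs with a bilinear layer so the candidate cause can influence the emotion decision. The encoder, dimensions, and bilinear scorer are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch of scoring all clause pairs jointly.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, clause_dim=128):
        super().__init__()
        self.bilinear = nn.Bilinear(clause_dim, clause_dim, 1)

    def forward(self, clauses):                       # (num_clauses, clause_dim)
        n = clauses.size(0)
        emo = clauses.unsqueeze(1).expand(n, n, -1)    # candidate emotion clause
        cause = clauses.unsqueeze(0).expand(n, n, -1)  # candidate cause clause
        return self.bilinear(emo.reshape(n * n, -1),
                             cause.reshape(n * n, -1)).view(n, n)

scores = PairScorer()(torch.randn(5, 128))   # 5 clauses -> 5x5 pair scores
print(scores.shape)
```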
Affective Language Modelling and Text Generation
Ahsan Barkati, Ishika Singh, Tushar Goswamy
Messages in human conversation are best conveyed by flavouring sentences with emotionally coloured words. In this project, we aim to integrate affective sentence generation into state-of-the-art language generation models. We intend to develop a model capable of generating affect-driven sentences without losing grammatical correctness. We propose to incorporate emotion as a prior for probabilistic state-of-the-art sentence generation models such as GPT-2 and BERT. The model will give the user the flexibility to control the category and intensity of the emotion as well as the subject of the generated text. We then demonstrate applications of the language model to story generation, advertisements, and conversational agents for therapy chatbots.
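A minimal sketch of one simple way to condition GPT-2 generation on an emotion: prepend an emotion tag as a textual prior. The tag format is an illustrative assumption, not the project's actual conditioning mechanism.

```python
# A minimal sketch of affect-conditioned generation via a textual prefix.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def generate_with_affect(emotion: str, prompt: str, max_new_tokens: int = 40) -> str:
    conditioned = f"<{emotion}> {prompt}"            # emotion tag acts as the prior
    input_ids = tokenizer(conditioned, return_tensors="pt").input_ids
    output = model.generate(input_ids,
                            max_length=input_ids.shape[1] + max_new_tokens,
                            do_sample=True, top_p=0.9,
                            pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_with_affect("joy", "The morning began with"))
```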
Topic-Based Sarcasm Generation
In this project, we generate sarcastic remarks on a topic given by the user as input. Sarcasm generation is a relatively new problem in NLP, and topic-based sarcasm generation is still an unexplored field. We propose a novel four-step sarcasm generation approach. Given an input topic, in the first step we find the general consensus about the topic by retrieving relevant tweets and Reddit posts. In the second step, we extract the key adjectives from the retrieved corpus using adjective clustering. In the third step, we generate a simple sentence containing the topic and the key adjective using a predefined template. Finally, in the fourth step, we use the simple sentence as input to a customized plug-and-play model to generate a sarcastic comment.
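A minimal sketch of steps two and three of this pipeline: extract the most frequent adjective from retrieved texts with spaCy, then slot the topic and adjective into a template. The retrieval and plug-and-play generation steps are omitted, and the example corpus is illustrative.

```python
# A minimal sketch: adjective extraction plus a templated seed sentence.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def key_adjective(corpus):
    counts = Counter(tok.lemma_.lower()
                     for doc in nlp.pipe(corpus)
                     for tok in doc if tok.pos_ == "ADJ")
    return counts.most_common(1)[0][0]

tweets = ["Mondays are exhausting and pointless.",
          "Another exhausting Monday at work."]
adjective = key_adjective(tweets)
print(f"Mondays are so {adjective}.")       # simple templated seed sentence
```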
Machine Comprehension Using Commonsense and Script Knowledge
Apoorva Singh, Gargi Singh, Yogesh Kumar
The commonsense knowledge problem has been researched for years, but the state of the art still lags far behind human-level performance. This project addresses a subdomain of this problem in the sphere of machine comprehension. We intend to experiment with datasets for requisite and compatible data based on commonsense question answering (QA) and sequential knowledge of events, and use them to improve the performance of ALBERT, XLNet, and RoBERTa on MCScript2.0. Additionally, we will explore the application of commonsense MRC to cross-lingual QA and the story cloze test.
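A minimal sketch of the multiple-choice MRC setup that datasets like MCScript2.0 use: each (passage plus question, answer option) pair is encoded and the model scores the options. The checkpoint and example texts are illustrative assumptions.

```python
# A minimal sketch of multiple-choice machine reading comprehension.
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")
model.eval()  # would be fine-tuned on the MRC training set in practice

context = ("Tom boiled water, added the pasta, and waited ten minutes. "
           "What did Tom do first?")
choices = ["He boiled water.", "He added the pasta."]

encoding = tokenizer([context] * len(choices), choices,
                     return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}  # (1, num_choices, seq_len)
with torch.no_grad():
    print(model(**inputs).logits.argmax(dim=-1).item())    # index of chosen answer
```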
Defense against Adversarial Attacks in Text
Hunar Preet, Smarth Gupta, Suryateja BV
Textual adversarial attacks are a recent area of NLP that tests the robustness of models. State-of-the-art NLP models fail under simple additions and deletions of characters in input sentences, calling for defences against such attacks. We generate realistic character-level attacks and find ways to overcome them. We designed a suite of experiments to analyze the robustness of BERT-based models and present an analysis of where the models fail. Finally, we build on these foundations to design a stronger adversarial attack with higher lexical overlap but subtle changes in meaning.
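A minimal sketch of the kind of character-level perturbations studied here: random deletions, insertions, and substitutions that keep the text readable to humans while changing the model's input.

```python
# A minimal sketch of a random character-level attack.
import random

def char_attack(sentence: str, rate: float = 0.15) -> str:
    out = []
    for c in sentence:
        if c.isalpha() and random.random() < rate:
            op = random.choice(["delete", "insert", "substitute"])
            if op == "delete":
                continue                      # drop the character
            if op == "insert":
                out.append(c)                 # keep it, then add a stray letter
            out.append(random.choice("abcdefghijklmnopqrstuvwxyz"))
        else:
            out.append(c)
    return "".join(out)

random.seed(0)
print(char_attack("the movie was absolutely wonderful"))
```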
Information Retrieval and Sentence Extraction on Mental Health using Research Domain Criteria
Ankit Kumar Singh, Aditya Jain, Nikunj Jha
Research Domain Criteria (RDoC) is a framework that integrates multi-dimensional information for a better understanding of mental disorders. The absence of biomedical literature annotated with RDoC categories (called "constructs") limits the full potential of RDoC. It is infeasible to manually analyze every biomedical article for critical insights, which explains the importance of annotating biomedical literature with RDoC constructs. We aim to explore different classical-ML-based and neural-network-based approaches to rank abstracts and to extract the most relevant sentence for a given construct.
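A minimal sketch of a classical ranking baseline of the sort mentioned: TF-IDF vectors and cosine similarity between a construct description and candidate abstracts. The example texts are illustrative only.

```python
# A minimal sketch: TF-IDF ranking of abstracts against a construct description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

construct = "loss of motivation and inability to experience pleasure"
abstracts = [
    "We study anhedonia and reduced reward sensitivity in depression.",
    "This paper describes a new protein folding simulation method.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([construct] + abstracts)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {abstracts[idx][:60]}")
```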
Learning from Descriptions: An Approach for Zero-Shot Text Classification
Karthikeyan
In this work, we propose a novel learning technique called Learning from Descriptions (LDES) and analyze our approach for the case of zero-shot text classification (ZS-TC). It is worth noting that our approach is a step closer to how humans typically learn, namely using descriptions in natural language. We convert the text classification (TC) problem into Textual Entailment (TE), Question Answering (QA), and masked-word prediction (similar to the masked language model of Devlin et al., 2018) problems, and then use the available TE and QA datasets. Our approach can be easily extended to zero-shot image classification using Visual Entailment (VE) (Xie et al., 2018). Further, our approach is orthogonal to existing meta-learning-based techniques (Vilalta and Drissi, 2002); therefore, one can use our method in conjunction with them.
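A minimal sketch of the entailment route to zero-shot text classification, using an off-the-shelf NLI model via the Hugging Face zero-shot pipeline; this shows the general recipe, not the authors' full LDES method.

```python
# A minimal sketch: zero-shot classification by treating labels as hypotheses.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The team scored twice in the final minutes to win the championship.",
    candidate_labels=["sports", "politics", "technology"],
    hypothesis_template="This text is about {}.",   # label description as hypothesis
)
print(result["labels"][0], result["scores"][0])
```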
Construction of Knowledge Graph for IIT Kanpur website
Vishal Singh, Yash Kumar, Nitesh Trivedi
Unstructured text contains valuable information, but retrieving elements of interest from it requires dedicated NLP techniques. One such important element is the Knowledge Graph (KG). Many modern applications build their knowledge base as a KG and derive hidden insights from it. We aim to develop a KG for the IIT Kanpur website using classical NLP techniques.
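A minimal sketch of classical triple extraction for KG construction: subject-verb-object patterns read off spaCy's dependency parse. A real pipeline would add coreference resolution and entity normalization; the example sentences are illustrative.

```python
# A minimal sketch: subject-verb-object triples from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text: str):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children
                            if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children
                           if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_triples("IIT Kanpur offers undergraduate programs. "
                      "The institute hosts conferences."))
```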
Abstractive Text Summarization
Ayush Nagal, Prakhyat Shankes, Navanya Sharma
Text summarization aims at compressing long documents into a shorter form that conveys the essential parts of the original document. In this work, we apply state-of-the-art abstractive news summarization techniques to Indian news datasets. First, we use a hybrid pointer-generator network. Second, we apply a transformer-based model, BERTSUMABS, to the Inshorts news articles.
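A minimal sketch of running an off-the-shelf abstractive summarizer; the checkpoint is an illustrative stand-in, not the pointer-generator or BERTSUMABS models the project trains.

```python
# A minimal sketch: abstractive summarization with a pre-trained model.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = ("The city council approved a new metro line on Monday after months "
           "of debate. Officials say construction will begin next year and is "
           "expected to ease congestion on the busiest corridors.")
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```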
Generalized Adversarial Attacks on NLP Models
Rahul B S, Manish Kumar Bera, Bhavy Khatri
We consider the problem of devising an adversarial attack scheme that can be applied to any general NLP model. The techniques used in vision do not transfer easily to NLP because of the discrete nature of text and its syntactic/semantic constraints. In our work, we generate an adversarial example by perturbing the latent representation of the input and then mapping the perturbed representation back into the input space. To solve the problem, we develop a decoder for the latent representation and a perturbation model.
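A minimal sketch of the latent-space attack interface: encode a sentence, perturb the latent vector, and decode back to tokens. The autoencoder below is an untrained stand-in that shows only the interface such a project builds around.

```python
# A minimal sketch of encode -> perturb -> decode for latent-space attacks.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=64, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, latent_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, vocab_size)

    def encode(self, ids):
        _, h = self.encoder(self.embed(ids))
        return h                                      # (1, batch, latent_dim)

    def decode(self, h, ids):
        dec, _ = self.decoder(self.embed(ids), h)
        return self.out(dec).argmax(dim=-1)           # greedy token choice

model = SeqAutoencoder()
ids = torch.randint(0, 10000, (1, 8))
latent = model.encode(ids)
perturbed = latent + 0.1 * torch.randn_like(latent)   # the adversarial step
print(model.decode(perturbed, ids))
```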
Hindi Dependency Parser
Abhishek Jaiswal, Tushar Shandhilya, A.V.D.S.Mahesh
We have built a dependency parser for Hindi with a web interface and further demonstrated its cross-lingual usage by applying it to Marathi, a language that is low-resourced yet not very different from Hindi. For our parser, we applied and tested techniques ranging from transition-based parsers that use an SVM as the classifier to deep-neural-network-based parsers that incorporate recent contextual word embeddings like BERT and FastText. We demonstrated the immediate zero-shot application of the parser to Marathi and also applied an existing word-embedding alignment method, MUSE, to improve cross-lingual performance. In the future, we aim to apply these techniques to Bhojpuri, Telugu, and Tamil.
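A minimal sketch of obtaining Hindi dependency parses with an off-the-shelf toolkit (Stanza), useful as a baseline or for comparison; the project's own parser replaces this with its transition-based and BERT/FastText-based models.

```python
# A minimal sketch: Hindi dependency parsing with Stanza.
import stanza

stanza.download("hi")                      # one-time model download
nlp = stanza.Pipeline("hi", processors="tokenize,pos,lemma,depparse")

doc = nlp("राम ने सेब खाया।")
for sent in doc.sentences:
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text:10s} --{word.deprel}--> {head}")
```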
Assessing the Funniness of Edited News Headlines
Understanding and predicting humor is a semantically challenging task. In the quest to make AI agents more human-like, it becomes essential that they understand the complex social and psychological trait of humor that comes very naturally to us. Although some work has been done on proposing methods and datasets for the task, very little work has been done on understanding what makes something funny. A recent work aimed at this question proposed the Humicroedit dataset (Hossain et al., 2019), which contains edited news headlines graded for funniness, as a step towards identifying the causes of humor. In our work, we address the tasks of regressing funniness and predicting the funnier edited headline by leveraging recently proposed powerful language models and humor-heuristics-based features.
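A minimal sketch of the funniness-regression setup: a transformer with a single-output regression head scores an edited headline, and comparing two scores answers the funnier-headline subtask. The checkpoint and example are illustrative assumptions.

```python
# A minimal sketch: transformer regression head for funniness scoring.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=1)  # regression
model.eval()  # would be fine-tuned on graded headlines in practice

def funniness(headline: str) -> float:
    inputs = tokenizer(headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.item()   # single regression output

edited = "Scientists discover new species of uncle in the Amazon"
print(funniness(edited))
```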
Multimodal Emotion Recognition in Conversations
Multi-modal approaches in Natural Language Processing are gaining popularity these days. After the release of the CMU-MOSEI dataset, a lot of work has been done on multi-modal emotion and sentiment recognition. Recently, the MELD dataset was released, which accelerated research on conversational systems involving emotion recognition. Datasets involving multiparty conversation pose very challenging problems, including context and speaker-state modelling, emotion shift, and speaker recognition when there are multiple people in a frame. Significant work has been done to solve these problems, and models like ConGCN and dRNN model context and speaker state well. But no work has been done to include the visual modality along with context and speaker-state modelling. In this project, we try to solve some of these problems by introducing the visual modality and speaker identification in a frame.
Semantic Extraction from Cybersecurity Reports
In this project, we develop a system for addressing the research problem posed in SemEval-2018 Task 8: Semantic Extraction from Cybersecurity Reports using NLP. The goal is to exploit the power of Natural Language Processing to provide critical and relevant information about the malware behind various cyber attacks. For Subtask 1, our method consists of learning embeddings using BERT and then using a binary classifier. For Subtask 2, we used BertForTokenClassification on top of the BERT embeddings. For Subtask 3, word embeddings from BERT were passed to the classifier alongside the already existing relations. For Subtask 4, separate BERT embeddings were learnt for each attribute class and then passed into a sigmoid classifier to learn the multi-labels. Our technique achieved F1 scores of 85.43 and 35.2 for Subtask 1 and Subtask 2, respectively.