Ashutosh Modi

Statistical Natural Language Processing (CS779) : Spring 2021

Natural language (NL) refers to the language spoken or written by humans, and it is their primary mode of communication. With the growth of the World Wide Web, the amount of text data has grown exponentially. This calls for algorithms and techniques that process natural language, both for automation and for building intelligent machines. This course will primarily focus on understanding and developing linguistic techniques, statistical learning algorithms, and models for processing language. We will take a statistical approach to natural language processing: we will learn how to develop natural language understanding models from statistical regularities in large corpora of natural language text while leveraging linguistic theories.

Pre-requisites:

Must: Introduction to Machine Learning (CS771) or an equivalent course; proficiency in linear algebra, probability, and statistics; proficiency in Python programming
Desirable: Probabilistic Machine Learning (CS772), Topics in Probabilistic Modeling and Inference (CS775), Deep Learning for Computer Vision (CS776)

Course Instructor:

Dr. Ashutosh Modi

Course TAs:

Samik Some (Email: samik@cse.iitk.ac.in )
Shubham Kumar Nigam (Email: sknigam@cse.iitk.ac.in )
Tushar Shandhilya (Email: stushar@cse.iitk.ac.in )
Karishma Satchidanand Laud (Email: kslaud@iitk.ac.in )
Gargi Singh (Email: sgargi@cse.iitk.ac.in )
Chayan Dhaddha (Email: cdhaddha@cse.iitk.ac.in )
Ashwani Bhat (Email: bashwani@cse.iitk.ac.in )
Aishwarya Gupta (Email: aishwaryag20@iitk.ac.in )
Ajitha Shree (Email: ajitashree20@iitk.ac.in )
Prakhyat Shankesi (Email: prakhyat@iitk.ac.in )

Course Email:

To communicate with the instructor, please do not email the instructor directly (such emails will most likely end up in spam). Use the course email instead: nlp.course.iitk@gmail.com

Weekly Meeting Session:

Tuesday and Friday, 2:00 PM to 3:15 PM

Virtual Classroom:

Lectures, assignments and quizzes will be uploaded/conducted on HelloIITK.

Virtual classes will be held on MS Teams. A separate team/channel (CS779-Spring2021: Statistical Natural Language Processing) has been set up for the course. All course announcements will be made on the Teams channel, HelloIITK and Telegram.

If you are not yet on MS Teams, please create an account via the IITK subscription. To get the IITK subscription, please fill in this form: https://web.iitk.ac.in/ccnew/Office365/Office_365_subscription_at_IITK.htm. If you are outside IITK, you will need to log into the IITK network via VPN to fill in the form. Once you have the account, please contact the TAs to be added to the Teams channel.

Resources:

Please check the Resources tab.

Grading:

The course will have quizzes, assignments, lecture summaries, a research project, an open competition (NLP Challenge), and exams. Given that the course will be online, all exams will be conducted online. The tentative weightage of the different components is as follows. Please note that this is tentative (due to COVID uncertainties and factors beyond the instructor's control) and the weightage might change.

Quizzes: 20%
Lecture summaries: 10%
Research Project: 20%
NLP Challenge: 10%
Mid-Sem Exam: 20%
End-Sem Exam: 20%

Lectures

Date	Topic	References
19/01/2021	Introduction and Logistics	-

Course Contents:

Tentative list of topics we will be covering in this course:

Introduction to Natural Language (NL): why is it hard to process NL, linguistics fundamentals, etc.
Language Models: n-grams, smoothing, class-based, Brown clustering
Sequence Labeling: HMM, MaxEnt, CRFs, related applications of these models e.g. Part of Speech tagging, etc.
Parsing: CFG, Lexicalized CFG, PCFGs, Dependency parsing
Applications: Named Entity Recognition, Coreference Resolution, text classification, toolkits e.g., spaCy, etc.
Distributional Semantics: distributional hypothesis, vector space models, etc.
Distributed Representations: Neural Networks (NN), Backpropagation, Softmax, Hierarchical Softmax
Word Vectors: Feedforward NN, Word2Vec, GloVe, Contextualization (ELMo etc.), Subword information (fastText, etc.)
Deep Models: RNNs, LSTMs, Attention, CNNs, applications in language, etc.
Sequence to Sequence models: machine translation and other applications
Transformers: BERT, transfer learning and applications

References:

There are no specific references; this course gleans information from a variety of sources such as books, research papers, and other courses. Relevant references will be suggested in the lectures. Some of the frequently used references are as follows:

Speech and Language Processing, Daniel Jurafsky, James H. Martin
Foundations of Statistical Natural Language Processing, Christopher D. Manning, Hinrich Schütze
Introduction to Natural Language Processing, Jacob Eisenstein
Neural Network Methods for Natural Language Processing, Yoav Goldberg, Morgan & Claypool (if you are on the IITK network, you can download it via: Morgan & Claypool Subscription for IITK)
Linguistic Fundamentals for Natural Language Processing, Emily Bender, Morgan & Claypool (if you are on the IITK network, you can download it via: Morgan & Claypool Subscription for IITK)
Natural Language Understanding, James Allen

This is a project-oriented course. Participants will work on different NLP research projects. Once the participants have finalized their projects, this page will be populated with the list of projects.

Course Logistics Related Resources:

Course lectures, assignments and quizzes will be on HelloIITK. Log in using your IITK username and password.
Course MS Teams Channel: CS779-Spring2021: Statistical Natural Language Processing
Telegram Channel: CS779_Spring_2021

Study Resources:

Linear Algebra Refresher
Probability Refresher
PyTorch Tutorials
Deep Learning with PyTorch Book
spaCy Toolkit
Writing Code for NLP Research
Repository of NLP research papers: ACL Anthology
Human Language Technology Series by Morgan & Claypool. This can be accessed only from the IITK network.
Guide to ML Research
Using Google Colab for research
Using Kaggle Notebooks for research