Ashutosh Modi

Statistical Natural Language Processing (CS779) : Spring 2023

Natural language (NL) refers to the language spoken or written by humans, and it is their primary mode of communication. With the growth of the World Wide Web, the amount of data available as text has grown exponentially. This calls for algorithms and techniques that process natural language, both for automation and for the development of intelligent machines. This course will primarily focus on understanding and developing linguistic techniques, statistical learning algorithms, and models for processing language. We will take a statistical approach to natural language processing: we will learn how to develop natural language understanding models from statistical regularities in large corpora of natural language text while leveraging linguistic theories.

Pre-requisites:

Must: Proficiency in Linear Algebra, Probability and Statistics, and Python Programming
Desirable: Introduction to Machine Learning (CS771) or Probabilistic Machine Learning (CS772) or Topics in Probabilistic Modeling and Inference (CS775) or equivalent course.

Course Instructor:

Dr. Ashutosh Modi

Course TAs:

Alok Kumar Trivedi (Email: alokt@cse.iitk.ac.in)
Amar Raja Dibbu (Email: amard@cse.iitk.ac.in)
Chabil Kansal (Email: chabilk@cse.iitk.ac.in)
Rahul Kumar (Email: rahulkumar@cse.iitk.ac.in)
Tanikella Sai Kiran (Email: tskiran@cse.iitk.ac.in)

Course Email:

To communicate with the instructor, please do not email the instructor directly (such emails will most likely end up in spam). Use the course email instead: nlp.course.iitk@gmail.com

Weekly Sessions:

Tuesday 1200-1315 Hrs
Wednesday 1200-1315 Hrs

Lecture Venue:

CSE Dept., KD101

Course Announcements:

The course will be managed via Slack. Please sign up on Slack for course announcements, study material, and resources. To join the workspace, please contact the instructor or one of the TAs.

Tentative Grading:

This is a research-project-oriented course, and the project carries the maximum weightage. The tentative weightages for the different components are as follows.

Class Participation: 3%
Quizzes/Exams: 7%
NLP Challenge: 20%
Project: 70%

Lectures

Date	Topic
10/01/2023	Introduction
	Logistics
11/01/2023	Why NLP is Hard?
17/01/2023	Levels in Language Processing
	NLP Pipeline
18/01/2023	Sub-Tokenization
24/01/2023	Text Prediction: Introduction
	Prediction Framework
	Feature Function
25/01/2023	Prediction Model
	Loss Function-1
	Loss Function-2
	Loss Function-3
31/01/2023	No Class
01/02/2023	No Class
07/02/2023	Data Log Likelihood
	Softmax Function
08/02/2023	Optimization
	Naive Bayes
11/02/2023	EM Algorithm
	Introduction to Neural Networks
	Computational Graphs
14/02/2023	NN Practicalities
	CNN
15/02/2023	Neural Sequence Models
20/02/2023 - 20/02/2023	Mid-Semester Exam
26/02/2023 - 27/02/2023	Project Presentations
04/03/2023 - 12/03/2023	Spring Break
14/03/2023	Transformers
15/03/2023	Transformers
21/03/2023 - 22/03/2023	Project Discussions
28/03/2023 - 29/03/2023	Competition and Project Discussions

References:

There are no specific references; this course gleans information from a variety of sources such as books, research papers, and other courses. Relevant references will be suggested in the lectures. Some of the frequent references are as follows:

Speech and Language Processing, Daniel Jurafsky, James H. Martin
Foundations of Statistical Natural Language Processing, Christopher D. Manning, Hinrich Schütze
Introduction to Natural Language Processing, Jacob Eisenstein
Natural Language Understanding, James Allen
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola
Neural Network Methods for Natural Language Processing, Yoav Goldberg

Useful Resources:

List of projects coming soon