
These tasks were created as part of the Natural Language Processing (NLP) course at Ben-Gurion University of the Negev.

Natural language processing is the research field in which we develop, test, and analyze machine learning algorithms used to automatically process large amounts of text, both to understand given texts and to generate new ones. The course makes heavy use of machine learning but also introduces concepts from linguistics and cognitive psychology. Typical examples of active research topics and applications are spam detection, error correction, machine translation, topic modeling, document classification, and demographic attribution.


Text Preprocessing, Language Modeling and Generation
Implement a Markovian language model and a language generator. We use the noisy channel algorithm for spell checking. Combining the noisy channel with a language model is a simple yet powerful approach that demonstrates key elements of language processing and the way statistical machine learning implicitly accounts for cognitive and technological biases.
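To illustrate the idea, here is a minimal sketch of a Markovian (bigram) language model and generator of the kind described above; the class name, add-one smoothing, and sampling scheme are illustrative assumptions, not the actual implementation in this repository.

```python
import random
from collections import defaultdict, Counter

class BigramLM:
    """Minimal bigram language model with add-one smoothing (illustrative sketch)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)  # counts of (previous word -> next word)
        self.unigrams = Counter()            # counts of previous words

    def fit(self, sentences):
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[prev][cur] += 1
                self.unigrams[prev] += 1

    def prob(self, prev, cur):
        # Add-one smoothing over the observed vocabulary.
        vocab = len(self.unigrams) + 1
        return (self.bigrams[prev][cur] + 1) / (self.unigrams[prev] + vocab)

    def generate(self, max_len=20):
        # Sample words proportionally to their bigram counts.
        tokens, prev = [], "<s>"
        for _ in range(max_len):
            candidates = list(self.bigrams[prev].elements())
            if not candidates:
                break
            cur = random.choice(candidates)
            if cur == "</s>":
                break
            tokens.append(cur)
            prev = cur
        return tokens
```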
Contextual Spell Checking
In this assignment we build a spell checker that handles both non-word and real-word errors given in a sentential context. To do that, we learn a language model and reconstruct the error distribution tables (by error type) from lists of common errors, and finally combine it all into a context-sensitive noisy channel model.
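The decision rule at the heart of the context-sensitive noisy channel can be sketched as below. The `channel` lookup and the `candidates` list are assumed inputs (built from the error distribution tables and a candidate generator, respectively), and `lm` stands for a bigram language model such as the one sketched earlier; none of these names come from the actual code.

```python
def correct_word(observed, candidates, channel, lm, left_context):
    """Pick the candidate maximizing P(observed | candidate) * P(candidate | context).

    channel[(candidate, observed)] is an assumed error-probability lookup and
    lm.prob(prev, word) an assumed bigram probability from the language model.
    """
    def score(cand):
        return channel.get((cand, observed), 1e-9) * lm.prob(left_context, cand)

    return max(candidates, key=score)
```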
Part of Speech Tagging
Implement a Hidden Markov Model for part-of-speech tagging, along with discriminative models for POS tagging (an MEMM and a bi-LSTM).
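For the HMM side, decoding is typically done with the Viterbi algorithm; the sketch below is a log-space version under the assumption that `trans`, `emit`, and `start` are smoothed probability lookups estimated from a tagged corpus (all names are illustrative).

```python
import math

def viterbi(tokens, tags, trans, emit, start):
    """Log-space Viterbi decoding for an HMM POS tagger (illustrative sketch).

    trans[(t_prev, t)], emit[(t, word)] and start[t] are assumed smoothed
    probability lookups; tokens is assumed to be non-empty.
    """
    # Initialization: score of each tag for the first token.
    V = [{t: (math.log(start.get(t, 1e-9)) + math.log(emit.get((t, tokens[0]), 1e-9)), None)
          for t in tags}]
    # Recursion: for each position, keep the best previous tag per current tag.
    for word in tokens[1:]:
        row = {}
        for t in tags:
            best_prev = max(V[-1], key=lambda p: V[-1][p][0] + math.log(trans.get((p, t), 1e-9)))
            score = (V[-1][best_prev][0]
                     + math.log(trans.get((best_prev, t), 1e-9))
                     + math.log(emit.get((t, word), 1e-9)))
            row[t] = (score, best_prev)
        V.append(row)
    # Backtrace the best tag sequence.
    best = max(V[-1], key=lambda t: V[-1][t][0])
    path = [best]
    for row in reversed(V[1:]):
        best = row[best][1]
        path.append(best)
    return list(reversed(path))
```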
Distributional Semantics and Text Classification
Use various algorithms, including LSTM networks, for text classification, performing an authorship attribution task on Donald Trump's tweets. A comprehensive report, the accompanying code, and the classification output obtained on a test set are included in the repository.
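A minimal PyTorch sketch of an LSTM text classifier of the kind used for this task is shown below; the layer sizes, class name, and two-class setup are assumptions for illustration, not the configuration used in the assignment.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Minimal LSTM text classifier (illustrative sketch; sizes are assumptions)."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor of word indices.
        embedded = self.embed(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        return self.out(hidden[-1])  # logits per class
```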


We tested the performance of "classic" machine learning models using SKLearn and of neural networks using PyTorch. To facilitate the use of the SKLearn models, a base class is implemented that acts as a uniform wrapper for them. Grid search with 5-fold cross-validation is used to search over the hyperparameter ranges and select the best-performing model.
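As a rough illustration of that setup, the following sketch wires a hypothetical TF-IDF + logistic regression pipeline into scikit-learn's grid search with 5-fold cross-validation; the pipeline steps, parameter grid, and placeholder data names are assumptions, not the repository's actual wrapper class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical pipeline: TF-IDF features feeding a scikit-learn classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid of hyperparameters to search over with 5-fold cross-validation.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
# search.fit(train_texts, train_labels)  # train_texts / train_labels are placeholders
```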

