Computer Science Question

Important note: Please work strictly based on the instructions. I do not want any plagiarized work. Follow all the instructions carefully. Although it states to use Jupyter Notebook, please use Google Colab instead. I want your best work with detailed answers. PLEASE WORK IN GOOGLE COLAB. READ THE INSTRUCTIONS CAREFULLY.

Assignment due Friday, March 6, 2026 by 11:59pm

Question 1: Text Preprocessing

Objective: Perform text preprocessing using regular expressions and text normalization techniques.

Create a Jupyter Notebook that contains code, results, interpretation of the results, and answers to questions for the tasks defined below.

Tasks:

  1. Use nltk.corpus.gutenberg to load the “Alice in Wonderland” text (‘carroll-alice.txt’). Import the necessary NLTK libraries and download the Gutenberg corpus if it is not already available.
  2. Preprocess text: Remove all non-alphanumeric characters (e.g., punctuation, special characters) using regular expressions. Convert all characters to lowercase. Split the cleaned text into individual words (tokens). Remove common English stopwords (e.g., “the,” “and,” “is”) using NLTK’s stopwords corpus.
  3. Define Type-Token Ratio (TTR) and its use. Compute and print the TTR for this text. Interpret the TTR value you have computed. Describe the factors affecting TTR.
  4. Find two other books from the Gutenberg Project with a lower TTR and two with a higher TTR, without computing TTR for all books (i.e., no brute-force search). Compute and print the TTR for these four books. Explain your strategy for finding a book with a higher TTR.
  5. Write code to find and print the longest words in each of these five books.
  6. Write code to find, for each book, 10 examples where the stemmed word and the lemmatized word are farthest apart (e.g., “better” becomes “good” with lemmatization but remains “better” with stemming), using NLTK’s PorterStemmer, NLTK’s WordNetLemmatizer, and minimum edit distance. Display these differences and explain why they occur (discuss how the outputs of stemming and lemmatization differ on these examples).
  7. Calculate the frequency of each lemmatized word for each book. Visualize the top 10 most frequent lemmatized words in each book using bar charts.
  8. Which method (stemming or lemmatization) do you think is more appropriate for tasks like text classification, and why?
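The preprocessing and TTR steps in Tasks 2–3 can be sketched in pure Python. The toy text and small stopword set below are placeholder assumptions standing in for the Gutenberg text and NLTK's stopword list named in the tasks:

```python
import re

# Toy stand-in text; in the notebook this would come from
# nltk.corpus.gutenberg.raw('carroll-alice.txt') after nltk.download('gutenberg').
raw_text = "Alice was beginning to get very tired of sitting by her sister."

# Tiny illustrative stopword list; the task uses nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "and", "is", "of", "to", "was", "by", "her"}

def preprocess(text):
    # Strip non-alphanumeric characters, lowercase, tokenize, drop stopwords.
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", text).lower()
    return [t for t in cleaned.split() if t not in STOPWORDS]

def type_token_ratio(tokens):
    # TTR = distinct word types / total tokens; higher means richer vocabulary.
    return len(set(tokens)) / len(tokens)

tokens = preprocess(raw_text)
print(tokens, type_token_ratio(tokens))
```

Note that TTR shrinks as a text grows (common words repeat), which is exactly why Task 4's search strategy matters.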
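Task 6's minimum edit distance can be computed with a standard Levenshtein dynamic program. The (word, stem, lemma) triples below are illustrative stand-ins for what NLTK's PorterStemmer and WordNetLemmatizer would produce, not precomputed results:

```python
def edit_distance(a, b):
    # Levenshtein (minimum edit) distance via a rolling one-row DP table.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution / match
    return dp[-1]

# Illustrative (word, stemmed, lemmatized) triples; in the notebook these come
# from PorterStemmer().stem(w) and WordNetLemmatizer().lemmatize(w, pos).
triples = [("better", "better", "good"),
           ("studies", "studi", "study"),
           ("was", "wa", "be")]

# Rank examples by how far the stem is from the lemma, largest gap first.
ranked = sorted(triples, key=lambda t: edit_distance(t[1], t[2]), reverse=True)
```

Large gaps typically mark irregular forms: the lemmatizer consults WordNet and maps “better” to its dictionary headword, while the stemmer only applies suffix-stripping rules.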

Question 2: Applying N-gram Models

Objective: Implement bigram and trigram models, calculate perplexity, and analyze how N-gram models handle real-world text data.

Question 2.1:

Explain the key differences between bigram and trigram models in terms of:

  • Contextual Understanding: How does each model capture context, and what are the trade-offs between the two?
  • Data Sparsity: Discuss why higher-order N-grams might suffer from data sparsity and how this affects model performance.
  • Perplexity: Define perplexity and explain how it evaluates language models. [Optional] Derive the formula showing steps; hand-written equations are welcome.
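For the perplexity bullet, one standard derivation (chain rule followed by a bigram approximation) for a test sequence $W = w_1 w_2 \dots w_N$ is:

```latex
\mathrm{PP}(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
             = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}}
             \approx \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \quad \text{(bigram)}
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why it serves as an intrinsic evaluation metric.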

Question 2.2:

Download the Amazon Fine Food Reviews Dataset from Kaggle. You will use a sample of the reviews to build bigram and trigram models. Briefly explain the preprocessing steps you've applied and why each step is essential for building N-gram models.

Data Preprocessing: Load and preprocess the dataset by:

  • Tokenizing each review
  • Lowercasing all words
  • Removing punctuation and stopwords

Train a Bigram and Trigram Model:

  • Using the processed dataset, train both a bigram and a trigram language model
  • Use Laplace smoothing to handle unseen N-grams
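A minimal sketch of count-based training with add-one (Laplace) smoothing, using a toy two-review corpus in place of the Amazon sample (corpus contents and names here are assumptions for illustration):

```python
from collections import Counter

def train_ngrams(sentences, n):
    # Count n-grams and their (n-1)-gram contexts, padding sentence boundaries.
    ngram_counts, context_counts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts

def laplace_prob(gram, ngram_counts, context_counts, vocab_size):
    # Add-one smoothing: (count(gram) + 1) / (count(context) + V),
    # so unseen n-grams still receive a small nonzero probability.
    return (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + vocab_size)

corpus = [["the", "food", "was", "great"], ["the", "service", "was", "slow"]]
V = len({w for s in corpus for w in s} | {"</s>"})          # vocabulary size
bigrams, bigram_ctx = train_ngrams(corpus, 2)
trigrams, trigram_ctx = train_ngrams(corpus, 3)
p = laplace_prob(("the", "food"), bigrams, bigram_ctx, V)   # seen bigram
q = laplace_prob(("the", "pizza"), bigrams, bigram_ctx, V)  # unseen bigram
```

The same `train_ngrams` call handles both orders (`n=2` and `n=3`); only the padding and context length change.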

Calculate Perplexity:

  • Given the following test sentence: “I enjoyed the meal, but the service was slow,” calculate the perplexity of both models.
  • Compare the perplexity scores of the bigram and trigram models. What does the comparison reveal about the ability of each model to handle unseen data?
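A self-contained sketch of perplexity for an add-one-smoothed bigram model; the toy training corpus is an assumption, and unknown test words simply fall back to the smoothed floor of 1/V:

```python
import math
from collections import Counter

# Toy training data standing in for the preprocessed review sample.
train = [["the", "food", "was", "great"], ["the", "service", "was", "slow"]]
bigrams, contexts = Counter(), Counter()
for sent in train:
    padded = ["<s>"] + sent + ["</s>"]
    for a, b in zip(padded, padded[1:]):
        bigrams[(a, b)] += 1
        contexts[(a,)] += 1
V = len({w for s in train for w in s} | {"</s>"})  # vocabulary size for smoothing

def perplexity(tokens):
    # PP = exp(-(1/N) * sum_i log P(w_i | w_{i-1})), with add-one smoothing.
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for a, b in zip(padded, padded[1:]):
        log_prob += math.log((bigrams[(a, b)] + 1) / (contexts[(a,)] + V))
    return math.exp(-log_prob / (len(padded) - 1))
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on longer sentences; an in-vocabulary test sentence should score a lower perplexity than one made of unseen words.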

Question 2.3:

Create a word cloud visualization of the most common bigrams and trigrams from the training dataset.
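One way to feed a word cloud is to count contiguous n-grams into a frequency dict; the toy token lists below are placeholder assumptions for the preprocessed reviews, and the third-party `wordcloud` call is shown as a comment since the package may not be installed:

```python
from collections import Counter

# Toy token lists standing in for preprocessed reviews (assumption).
reviews = [["great", "food", "great", "service"], ["food", "was", "great"]]

def top_ngrams(token_lists, n, k=10):
    # Count contiguous n-grams across all reviews; keep the k most common.
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return dict(counts.most_common(k))

bigram_freqs = top_ngrams(reviews, 2)
trigram_freqs = top_ngrams(reviews, 3)
# WordCloud can render a frequency dict directly:
#   from wordcloud import WordCloud
#   WordCloud().generate_from_frequencies(bigram_freqs).to_image()
```

Joining each n-gram with a space keeps multi-word phrases intact in the cloud instead of splitting them back into single words.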

[Optional] Add an insight you gained from the classroom discussion, e.g., explain the relevance of Laplace smoothing with a vocabulary-size parameter with an example sentence.

Submission Guidelines:

  • Submit a self-contained PDF report including all theoretical answers, code snippets, output visualizations, and reasoning. This should be the primary resource and will be graded.
  • Submit your Python code as a Jupyter Notebook (.ipynb) as the secondary resource, as part of a zip file. Include other resources, such as screenshots or hand-written notes, as well.
  • Include detailed explanations of your code and the rationale for your choices.
  • The assignment should be submitted on Forum by March 6, 2026, at 11 pm GST.

Assignment Information

Weight:

15%

Learning Outcomes Added

  • Demonstrate knowledge of the fundamental principles of natural language processing.
  • Explain the methods and algorithms used to process different types of textual data as well as the challenges involved.

