I MAY ASK FOR CHANGES BASED ON WHAT I SEE FIT
Summary: In this assignment, students will implement a machine learning experiment from scratch, starting from a problem statement and a dataset. This is an individual assignment, and each student will have a different problem statement and dataset. You will choose a dataset in your area of interest from the UAE official platform that hosts thousands of datasets across many domains (education, economy, health, environment, and more). The dataset should be in CSV format, contain clearly defined labels, include at least five features, and have one target variable. You can discuss and confirm the dataset with your instructor.
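As a rough illustration, the dataset requirements above can be checked programmatically before the choice is confirmed with the instructor. This is a minimal sketch, not part of the official brief: the helper name `check_dataset`, the column names, and the tiny in-memory table (standing in for `pd.read_csv(...)` on your own file) are all hypothetical.

```python
import pandas as pd


def check_dataset(df: pd.DataFrame, target: str, min_features: int = 5) -> dict:
    """Return a pass/fail flag for each dataset requirement in the brief."""
    feature_cols = [c for c in df.columns if c != target]
    return {
        "target_present": target in df.columns,
        "enough_features": len(feature_cols) >= min_features,
        # pandas names blank CSV header cells "Unnamed: N", so such columns
        # are treated here as not clearly labelled.
        "labelled_columns": not any(str(c).startswith("Unnamed") for c in df.columns),
    }


# Hypothetical stand-in for pd.read_csv("your_dataset.csv")
demo = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [3.2, 5.1, 4.0],
    "region": ["A", "B", "A"],
    "household": [2, 4, 3],
    "employed": [1, 1, 0],
    "owns_home": [0, 1, 0],  # assumed target variable
})

report = check_dataset(demo, target="owns_home")
print(report)
```

A failed flag is only a prompt to look more closely; the instructor's confirmation remains the deciding step.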
The student will submit:
- **Primary Source:** A PDF report containing the non-technical details and discussion of the project. The report must follow the section structure provided below and exclude technical code, which belongs in the accompanying Jupyter Notebook (.ipynb) file. The report should present the analysis and interpretation of results, together with visualizations of the best-performing models' predictions and associated errors.
- **Secondary Source:** A single Jupyter notebook (along with the dataset CSV) that includes all code and its rationale, experiment outputs, and the content described below. Use scikit-learn tools and the classifiers considered in the course. Add these files to a zip file and upload it as the secondary source. The work should be reproducible, i.e., anyone should be able to reproduce all results by running the notebook.
Your PDF and Jupyter notebook should contain the following sections. The PDF will contain only the descriptive part, while the notebook will include all code and experiments with comments.
Section 1: Introduction and Data Exploration
Objective: Provide a short overview of the project and perform initial data exploration.
Questions:
- What is the significance of predicting the target variable in the context of the dataset?
- Perform univariate and bivariate analysis using appropriate visualizations. What patterns and insights do you gain from exploring the chosen dataset?
- Discuss your findings and relate them, in the form of a table, to the concepts covered in the course LPs. Clearly state the LP and the angle from which the concept was covered.
Suggestion: For this part of the assignment, first review a few published Exploratory Data Analysis reviews. You do not need to follow every step in these reviews; use only the ideas that make sense for your project and data.
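The univariate and bivariate steps in this section could start from a few pandas summaries like the ones below. This is only a sketch under stated assumptions: the data is synthetic (generated with `make_classification` as a stand-in for your CSV), and the column names `f0`..`f4` and `target` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Synthetic stand-in for the chosen dataset (assumption, not real data)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["target"] = y

features = df.drop(columns="target")

# Univariate: distribution of each feature and balance of the target classes
summary = features.describe().T
class_balance = df["target"].value_counts(normalize=True)

# Bivariate: feature-to-feature correlation and per-class feature means
corr = features.corr()
means_by_class = df.groupby("target").mean()

print(summary[["mean", "std", "min", "max"]].round(2))
print(class_balance.round(2))
print(means_by_class.round(2))
```

In a notebook, `features.hist()` and `pd.plotting.scatter_matrix(features)` turn the same information into the plots this section asks for.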
Section 2: Data Cleaning, Pre-processing, and Feature Engineering
Objective: Address any data issues, perform necessary pre-processing, and engineer features.
Questions:
- Discuss any missing values or outliers in the dataset and your approach to handling them. Show the missing values and outliers (if any) via graphs.
- Provide a code block demonstrating the cleaning, pre-processing, and feature engineering steps.
- How did you decide which features to include or engineer for predicting the target variable?
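One possible shape for the cleaning and pre-processing code is sketched below: count missing values, flag outliers with the 1.5 x IQR rule, then impute, scale, and encode inside a scikit-learn `ColumnTransformer`. The small table, its column names, and the planted outlier are hypothetical, and the choice of median imputation and the IQR rule is an assumption rather than a requirement of the brief.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with gaps and one planted outlier (income = 40.0)
df = pd.DataFrame({
    "age": [25, 40, np.nan, 31, 58, 36],
    "income": [3.2, 5.1, 4.0, np.nan, 40.0, 4.4],
    "region": ["A", "B", "A", np.nan, "B", "A"],
})

# Missing values per column
missing_counts = df.isna().sum()

# Outliers by the 1.5 * IQR rule (NaN rows compare as False)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

numeric_cols = ["age", "income"]
categorical_cols = ["region"]

preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_clean = preprocess.fit_transform(df)
print(missing_counts)
print("outlier rows:", list(df.index[outlier_mask]))
print(X_clean.shape)
```

Keeping the steps inside one transformer means the same cleaning is re-fitted on each training fold later, which avoids leaking test information into the imputation and scaling statistics.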
Section 3: Data Modelling and Data Splitting
Objective: Prepare data for modeling and split the data into training and test sets.
Questions:
- Explain the task of predicting the target variable as a supervised automatic classification problem.
- How did you split the data? Why is your method of splitting the data the correct method?
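A stratified hold-out split is one common answer to the splitting question, since it keeps the class proportions similar in the training and test sets. The sketch below assumes an 80/20 split and an imbalanced synthetic dataset; both are illustrative choices, not values prescribed by the brief.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the chosen dataset (assumption)
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=42
)

# stratify=y preserves the class ratio; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_train.mean()  # share of the positive class in the training set
test_rate = y_test.mean()    # share of the positive class in the test set
print(f"train size={len(X_train)}, test size={len(X_test)}")
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Printing the two positive rates side by side is a simple piece of evidence that the split is representative.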
Section 4: Model Selection
Objective: Discuss the selection of classification models.
Questions:
- Why did you choose specific classification models for predicting the target variable?
- Provide a very short description of each model (about one paragraph long, along the lines discussed in the LPs/sessions) in which you compare the selected models and discuss their strengths and weaknesses.
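To ground the comparison of candidate models, a like-for-like baseline can be produced with 5-fold cross-validation. The four classifiers below are an assumed selection for illustration only (the brief requires the classifiers covered in the course, which are not listed here), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the chosen dataset (assumption)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Hypothetical shortlist; scale-sensitive models get a scaler in their pipeline
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```

The ranking on your own data may differ; the point is that every model is scored on the same folds with the same metric.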
Section 5: Model Training, Hyperparameter Tuning and Model Building
Objective: Train the selected models, perform cross-validation, and fine-tune hyperparameters.
Questions:
- Explain the process of training the classification models, including the loss function and any cross-validation techniques used.
- How did you approach hyperparameter tuning, and what impact did it have on model performance? Show impact with evidence.
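One way to show the impact of tuning with evidence is to score an untuned model and a grid-searched model on the same cross-validation folds. The sketch below uses a decision tree and a small, hypothetical parameter grid on synthetic data; the grid values are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the chosen dataset (assumption)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Same folds for the baseline and the search, so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: default hyperparameters
baseline = DecisionTreeClassifier(random_state=0)
baseline_cv = cross_val_score(baseline, X_train, y_train, cv=cv, scoring="accuracy").mean()

# Hypothetical grid; it includes the defaults so the search cannot do worse in CV
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X_train, y_train)

print(f"baseline CV accuracy: {baseline_cv:.3f}")
print(f"tuned CV accuracy:    {search.best_score_:.3f}")
print(f"best parameters:      {search.best_params_}")
print(f"held-out accuracy:    {search.score(X_test, y_test):.3f}")
```

The test set is touched only once, after tuning, so the held-out accuracy stays an honest estimate.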
Section 6: Model Performance Metrics
Objective: Evaluate model performance and improve it.
Questions:
- Which performance metrics did you use for model performance, and why are they appropriate for predicting the target variable?
- Can model performance be improved? If yes, improve it using techniques appropriate to each ML algorithm and comment on model performance after the improvement. Show a comparison of performance before and after the improvement in terms of accuracy, training time, and testing time. Present this comparison in graphs or tables.
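The before/after comparison requested above could be collected into a single table with a small helper like the one below. This is a sketch under assumptions: the helper `evaluate`, the two k-nearest-neighbours variants, and their "before"/"after" labels are hypothetical, and whether the second variant actually scores higher depends on the data.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate(model, X_train, X_test, y_train, y_test) -> dict:
    """Fit the model, then report quality metrics plus training and testing time."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "fit_seconds": fit_seconds,
        "predict_seconds": predict_seconds,
    }


# Synthetic stand-in for the chosen dataset (assumption)
X, y = make_classification(n_samples=600, n_features=8, random_state=1)
split = train_test_split(X, y, test_size=0.25, stratify=y, random_state=1)

# Hypothetical "before" and "after" configurations of the same algorithm
variants = {
    "before (k=1)": KNeighborsClassifier(n_neighbors=1),
    "after (k=15)": KNeighborsClassifier(n_neighbors=15),
}

table = pd.DataFrame({name: evaluate(m, *split) for name, m in variants.items()}).T
print(table.round(3))
```

Reporting precision, recall, and F1 alongside accuracy matters most when the classes are imbalanced, where accuracy alone can look good for a model that ignores the minority class.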
Section 7: Results Visualization and Discussion
Objective: Visualize model results and provide insightful discussions.
Questions:
- Include code for visualizing the results, such as confusion matrices or ROC curves.
- What insights can be drawn from the visualizations, and how do they contribute to the understanding of model performance? Show all kinds of cumulative visualizations in this part for a holistic analysis of your results.
Section 8: Summary
Objective: Summarize key steps and discuss insights or shortcomings.
Questions:
- What are the 3 key things you learned from this assignment?
- Draw a complete ML or data pipeline diagram that shows the detailed steps you followed for this ML problem.
- What are the 2 strengths and 2 weaknesses of the entire ML approach you followed for this assignment?
Additional Guideline for this Assignment
- Use as many visualizations (e.g., graphs, charts, and diagrams) as possible so that you have evidence for the various decisions made.
- Reflect on the visualizations (e.g., graphs), i.e., state the key learnings from those graphs. Include these in your assignment as bullet points.
- Include your rationale/motivation, with evidence, for various decisions such as selecting a particular machine learning algorithm or a feature selection algorithm.
- Generally, your reflections should demonstrate your understanding of the tasks given in the assignment.
- Use cross-validation and discuss briefly why it is important to use it.
- Use at least 4 ML models and briefly mention their strengths and weaknesses. Also explain why you selected these 4 algorithms.
- Make an insightful comparison among the results for the 4 ML algorithms used.
- Ensure that the PDF report and final notebook are well organized into sections as mentioned in the assignment description.
- Make sure that the code is clean, commented, and well-documented.
- Adhere to the maximum word limit and page limit mentioned in the assignment.
Your notebook will also be graded in the following dimensions:
- Structure and flow
- Readability/accessibility of the code (use of comments and meaningful variable names)
Assignment Information
2000
Learning Outcomes Added
- Apply a range of common model performance metrics (e.g. classification accuracy, recall, precision).
- Implement maximum likelihood methods and the Expectation Maximization algorithm.
- Select appropriate classification methods in both supervised and unsupervised tasks.