Dataset
Use a public dataset from the approved list (Chicago Crime dataset selected)
Assignment Requirements
Q1: Problem Identification & Data Collection
In the notebook:
- Clearly define a real-world problem suitable for EDA and ML
- Describe the dataset and source (link included)
- Identify:
  - Target variable
  - Feature variables
- Show dataset shape and preview
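A minimal sketch of the Q1 steps (naming the target and features, then showing shape and preview) might look like the following. The small frame is a stand-in so the snippet runs on its own. Its column names are assumed to mirror the Chicago Crime CSV, and the choice of "Arrest" as target is purely illustrative.

```python
import pandas as pd

# Stand-in rows so the example is self-contained; in the notebook this would
# be replaced by pd.read_csv(...) on the downloaded Chicago Crime file.
# Column names are an assumption about that file, not confirmed by the brief.
df = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "NARCOTICS"],
    "District": [1, 7, 1, 11],
    "Latitude": [41.88, 41.78, None, 41.87],
    "Arrest": [False, True, False, True],
})

TARGET = "Arrest"  # hypothetical target variable, chosen only for illustration
features = [col for col in df.columns if col != TARGET]

print("Shape (rows, columns):", df.shape)
print("Target variable:", TARGET)
print("Feature variables:", features)
print(df.head())  # preview of the first rows
```

Keeping the target name in one constant makes it easy to reuse the same split in Q2 and Q3.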
Q2: Exploratory Data Analysis (Manual EDA)
Using the SAME dataset:
- Comment on data quality (missing values, duplicates, data types)
- Descriptive statistics and interpretation
- Identify and remove outliers
- Check feature distributions
- Correlation analysis
- Clearly list:
  - Dependent variable
  - Independent variables
- Drop unnecessary independent features
- Check skewness and test whether it is statistically significant using a p-value
- Apply:
  - Standardization
  - Normalization
- Save:
  - Cleaned dataset
  - Standardized dataset
  - Normalized dataset
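The manual EDA steps listed for Q2 could be sketched roughly as below. This is one possible approach, not a prescribed one. It runs on synthetic data rather than the Chicago Crime dataset, removes outliers with the 1.5 × IQR rule, uses `scipy.stats.skewtest` for the skewness p-value, and takes "normalization" to mean min-max scaling. The helper `remove_outliers_iqr` and the output file names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the real dataset): a right-skewed column "x",
# a roughly normal column "y", one extreme value, and one duplicated row.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "x": np.append(rng.exponential(scale=2.0, size=200), 500.0),
    "y": np.append(rng.normal(loc=10.0, scale=2.0, size=200), 10.0),
})
raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)

# Data quality: missing values, duplicates, data types, descriptive statistics.
print(raw.isna().sum())
print("Duplicate rows:", raw.duplicated().sum())
print(raw.dtypes)
print(raw.describe())

clean = raw.drop_duplicates().dropna()


def remove_outliers_iqr(df, cols, k=1.5):
    """Keep rows whose values in every listed column lie within k * IQR of the quartiles."""
    keep = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


clean = remove_outliers_iqr(clean, ["x", "y"])

# Correlation between the numeric features.
print(clean.corr())

# Skewness with a p-value: a small p-value suggests the skew is not due to chance.
for col in clean.columns:
    statistic, p_value = stats.skewtest(clean[col])
    print(f"{col}: skew={clean[col].skew():.3f}, p-value={p_value:.4f}")

# Standardization (zero mean, unit variance) and min-max normalization ([0, 1]).
standardized = (clean - clean.mean()) / clean.std(ddof=0)
normalized = (clean - clean.min()) / (clean.max() - clean.min())

# Save the three versions of the dataset.
clean.to_csv("cleaned.csv", index=False)
standardized.to_csv("standardized.csv", index=False)
normalized.to_csv("normalized.csv", index=False)
```

On the real dataset the scaling steps would apply only to the numeric columns (for example via `df.select_dtypes("number")`), since categorical columns cannot be standardized this way.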
Q3: Automated EDA (Sweetviz)
Using the SAME dataset in the SAME notebook:
- Install and use Sweetviz (must work in Google Colab)
- Generate:
  - analyze() report for raw dataset
  - analyze() report for cleaned dataset
  - compare() report (raw vs cleaned)
  - compare_intra() report (e.g., class-based comparison)
- Display and save Sweetviz HTML reports
- Provide written explanations comparing:
  - Raw vs cleaned dataset
  - Insights from analyze, compare, compare_intra
- Discuss how dataset quality affects Linear Regression performance (conceptual explanation)
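The Sweetviz reports requested in Q3 might be generated along these lines. This is a hedged sketch. The function `build_reports`, its parameters, and the report file names are hypothetical. The NumPy shim assumes the commonly reported incompatibility in which Sweetviz refers to `np.VisibleDeprecationWarning`, which newer NumPy releases no longer expose at the top level. Pinning an older NumPy in Colab is an alternative, and which fix is needed depends on the installed versions.

```python
import numpy as np

# Assumed workaround: restore the attribute some Sweetviz releases expect.
# This is a no-op on NumPy versions that still provide it.
if not hasattr(np, "VisibleDeprecationWarning"):
    np.VisibleDeprecationWarning = DeprecationWarning


def build_reports(raw_df, clean_df, class_col, class_value):
    """Write the four Sweetviz HTML reports and return the last one for display."""
    # Imported here so the shim above is applied first.
    # In Colab, run `!pip install sweetviz` in an earlier cell.
    import sweetviz as sv

    # analyze(): one report per dataset.
    sv.analyze(raw_df).show_html(filepath="raw_report.html", open_browser=False)
    sv.analyze(clean_df).show_html(filepath="clean_report.html", open_browser=False)

    # compare(): raw vs cleaned, side by side.
    comparison = sv.compare([raw_df, "Raw"], [clean_df, "Cleaned"])
    comparison.show_html(filepath="compare_report.html", open_browser=False)

    # compare_intra(): split one dataset into two groups by a boolean condition.
    condition = clean_df[class_col] == class_value
    intra = sv.compare_intra(clean_df, condition, [str(class_value), "Other"])
    intra.show_html(filepath="intra_report.html", open_browser=False)
    return intra
```

In a notebook, calling `show_notebook()` on any of the report objects displays it inline, which is usually the easiest way to view reports in Colab.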
Submission Expectations
- ONE clean, well-structured .ipynb notebook
- Clear markdown explanations (student level)
- Code must run successfully in Google Colab
- Proper handling of Sweetviz + NumPy compatibility
- No plagiarism
What I Expect From You
- Complete end-to-end solution
- Same dataset across Q1, Q2, Q3
- Ready to submit with no errors
Requirements: as long | Python
