Dataset

Use a public dataset from the approved list (Chicago Crime dataset selected)


Assignment Requirements

Q1: Problem Identification & Data Collection

In the notebook:

  1. Clearly define a real-world problem suitable for EDA and ML
  2. Describe the dataset and source (link included)
  3. Identify:
    • Target variable
    • Feature variables
  4. Show dataset shape and preview
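
A minimal sketch of steps 3 and 4, assuming pandas is available. The rows below are toy stand-ins and the column names (Primary Type, District, Arrest) are assumptions modeled on the Chicago Crime export; in the notebook the frame would come from the downloaded CSV. The choice of Arrest as the target is hypothetical.

```python
import pandas as pd

# Toy stand-in rows (not real records). In the notebook, load the real file
# with pd.read_csv(...) instead. Column names here are assumptions.
df = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "NARCOTICS"],
    "District": [1, 7, 1, 11],
    "Arrest": [False, True, False, True],
})

# Hypothetical choice of target; every other column is treated as a feature.
target = "Arrest"
features = [col for col in df.columns if col != target]

print("Shape (rows, columns):", df.shape)
print("Target variable:", target)
print("Feature variables:", features)
print(df.head())  # preview of the first rows
```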

Q2: Exploratory Data Analysis (Manual EDA)

Using the SAME dataset:

  1. Comment on data quality (missing values, duplicates, data types)
  2. Descriptive statistics and interpretation
  3. Identify and remove outliers
  4. Check feature distributions
  5. Correlation analysis
  6. Clearly list:
    • Dependent variable
    • Independent variables
  7. Drop unnecessary independent features
  8. Check skewness and test its significance using a p-value (e.g., from a skewness test)
  9. Apply:
    • Standardization
    • Normalization
  10. Save:
    • Cleaned dataset
    • Standardized dataset
    • Normalized dataset
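
The numeric steps above (1, 3, 8, 9 and 10) could be sketched as follows. This is one possible approach under stated assumptions, not the required method: the data is synthetic, the column names are placeholders, outliers are removed with the 1.5 × IQR rule, skewness is tested with scipy.stats.skewtest, and the output file names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic numeric stand-in; column names are assumptions, not the real schema.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Latitude": rng.normal(41.85, 0.08, 200),
    "Longitude": rng.normal(-87.67, 0.06, 200),
})
df.loc[0, "Latitude"] = 36.6                             # inject one obvious outlier
df = pd.concat([df, df.iloc[[5]]], ignore_index=True)    # inject one duplicate row

# Step 1: data quality (missing values, duplicates, data types).
print(df.isna().sum(), df.duplicated().sum(), df.dtypes, sep="\n")
cleaned = df.drop_duplicates().dropna()

# Step 3: keep only rows where every column lies within 1.5 * IQR of the quartiles.
q1, q3 = cleaned.quantile(0.25), cleaned.quantile(0.75)
iqr = q3 - q1
inside = ((cleaned >= q1 - 1.5 * iqr) & (cleaned <= q3 + 1.5 * iqr)).all(axis=1)
cleaned = cleaned[inside].reset_index(drop=True)

# Step 8: skewness with a p-value (null hypothesis: skewness matches a normal distribution).
for col in cleaned.columns:
    statistic, p_value = stats.skewtest(cleaned[col])
    print(f"{col}: skew={cleaned[col].skew():.3f}, p-value={p_value:.3f}")

# Step 9: standardization (zero mean, unit variance) and min-max normalization to [0, 1].
standardized = (cleaned - cleaned.mean()) / cleaned.std(ddof=0)
normalized = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())

# Step 10: save the three datasets (file names are hypothetical).
cleaned.to_csv("cleaned.csv", index=False)
standardized.to_csv("standardized.csv", index=False)
normalized.to_csv("normalized.csv", index=False)
```

A small p-value from the skewness test suggests the column is significantly skewed, in which case a transform (for example a log transform) may be worth considering before scaling.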

Q3: Automated EDA (Sweetviz)

Using the SAME dataset in the SAME notebook:

  1. Install and use Sweetviz (must work in Google Colab)
  2. Generate:
    • analyze() report for raw dataset
    • analyze() report for cleaned dataset
    • compare() report (raw vs cleaned)
    • compare_intra() report (e.g., class-based comparison)
  3. Display and save Sweetviz HTML reports
  4. Provide written explanations comparing:
    • Raw vs cleaned dataset
    • Insights from analyze, compare, compare_intra
  5. Discuss how dataset quality affects Linear Regression performance (conceptual explanation)
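
The four Sweetviz reports in step 2 could be produced along these lines. This is a hedged sketch: build_reports, flag_col and the HTML file names are hypothetical, and the NumPy alias is an assumed workaround for Sweetviz releases that still reference np.VisibleDeprecationWarning, which NumPy 2.x no longer exposes at the top level. Pinning an older NumPy in Colab is an alternative; other version conflicts may need different handling.

```python
# In Colab, install first in its own cell:  !pip install sweetviz
import numpy as np
import pandas as pd

# Assumed workaround: restore the alias that some Sweetviz releases expect.
# This must run before Sweetviz is imported.
if not hasattr(np, "VisibleDeprecationWarning"):
    np.VisibleDeprecationWarning = np.exceptions.VisibleDeprecationWarning


def build_reports(raw: pd.DataFrame, cleaned: pd.DataFrame, flag_col: str) -> None:
    """Generate and save the four reports; flag_col is a boolean column in cleaned."""
    import sweetviz as sv  # imported here so the alias above is already in place

    # analyze() on the raw and on the cleaned dataset.
    sv.analyze(raw).show_html("sweetviz_raw.html", open_browser=False)
    sv.analyze(cleaned).show_html("sweetviz_cleaned.html", open_browser=False)

    # compare(): raw vs cleaned, side by side.
    comparison = sv.compare([raw, "Raw"], [cleaned, "Cleaned"])
    comparison.show_html("sweetviz_compare.html", open_browser=False)

    # compare_intra(): split the cleaned data into two groups by a boolean condition.
    condition = cleaned[flag_col].astype(bool)
    intra = sv.compare_intra(cleaned, condition, [f"{flag_col}=True", f"{flag_col}=False"])
    intra.show_html("sweetviz_intra.html", open_browser=False)
    intra.show_notebook()  # display inline in the notebook


# Example call in the notebook (assumes raw_df, cleaned_df and an "Arrest" column exist):
# build_reports(raw_df, cleaned_df, flag_col="Arrest")
```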

Submission Expectations

  • ONE clean, well-structured .ipynb notebook
  • Clear markdown explanations (student level)
  • Code must run successfully in Google Colab
  • Proper handling of Sweetviz + NumPy compatibility
  • No plagiarism

What I Expect From You

  • Complete end-to-end solution
  • Same dataset across Q1, Q2, Q3
  • Ready to submit with no errors

Requirements: Python
