Week 3 Assignment: Customer Data Pipeline - From Chaos to Clarity

Assignment Overview

Welcome to Week 3! This week, you’ll apply your data preparation skills to solve a real-world business problem. You’ll work with messy, incomplete customer data and transform it into a clean, analysis-ready dataset that can drive business decisions.

Learning Objectives

By completing this assignment, you will:

  1. Assess data quality and identify critical issues impacting analysis
  2. Handle missing data using appropriate strategies (drop vs. impute)
  3. Apply filtering techniques to remove outliers and focus on relevant data
  4. Use grouping operations to extract business insights
  5. Build a data preparation pipeline that can be reused
  6. Document data quality decisions and their business impact
  7. Create a data quality report for stakeholders

Business Context

You are a Data Analyst at CloudTech Solutions, a B2B SaaS company providing cloud infrastructure services. The company has been experiencing customer churn issues and wants to identify at-risk customers. However, the customer database has significant quality issues due to:

  • Multiple data entry systems
  • Manual data collection processes
  • System migrations over the years
  • Incomplete onboarding procedures

Your manager has asked you to:

  1. Clean and prepare the customer data for analysis
  2. Create customer segments based on usage patterns
  3. Identify data quality issues that need systematic fixes
  4. Prepare the data for a machine learning churn prediction model

Assignment Structure

The assignment is divided into 5 main parts:

  1. Data Quality Assessment (15 points)
  2. Missing Data Strategy (20 points)
  3. Data Filtering & Outlier Management (20 points)
  4. Grouping & Feature Engineering (20 points)
  5. Data Quality Report (15 points)

Total: 90 points (+ 10 bonus points available)

Estimated Time: 3-4 hours


Important Instructions

File Naming Convention

Your data file MUST be named using your student ID:

[YourStudentID]-week03-clean.csv

Example: If your student ID is U12345678, your file should be named:

U12345678-week03-clean.csv
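Once your cleaning is done, a minimal sketch for saving the file with the required name (assuming your cleaned DataFrame is called df_clean and STUDENT_ID is set as in the starter notebook):

df_clean.to_csv(f'{STUDENT_ID}-week03-clean.csv', index=False)  # index=False avoids writing an extra index column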

Required Deliverables

You must submit THREE files to Canvas:

  1. Jupyter Notebook (.ipynb file) with all code and outputs
  2. HTML Export of your notebook (File → Download as → HTML)
  3. Cleaned CSV Data File with your student ID in the filename

Submission Checklist

Before submitting, ensure:

  • All cells run top to bottom without errors (Kernel → Restart & Run All)
  • STUDENT_ID is set to your actual student ID
  • The cleaned CSV is named [YourStudentID]-week03-clean.csv
  • All three files (.ipynb, .html, .csv) are uploaded to Canvas

Getting Started

Step 1: Download the Starter Notebook

Download the starter notebook from Canvas or the course website: [week03_assignment_data_preparation.ipynb]

Step 2: Set Your Student ID

In the first code cell, replace 'UXXX' with your actual student ID:

STUDENT_ID = 'U12345678'  # Replace with your ID

Step 3: Work Through Each Section

Follow the instructions in each section. Look for TODO comments that indicate where you need to add code.

Step 4: Test Your Code

Ensure all cells run without errors by selecting Kernel → Restart & Run All.

Step 5: Export and Submit

  1. Save your notebook with all outputs
  2. Export as HTML (File → Download as → HTML)
  3. Submit all three files to Canvas

Grading Rubric

Core Requirements (90 points)

Data Quality Assessment (15 points)
  • Load and explore data
  • Identify all quality issues
  • Create missing value analysis
  • Document data problems

Missing Data Strategy (20 points)
  • Analyze missing patterns
  • Make drop vs. impute decisions
  • Implement appropriate imputation
  • Validate results

Data Filtering (20 points)
  • Identify and handle outliers
  • Apply business logic filters
  • Remove invalid records
  • Document filtering impact

Grouping & Features (20 points)
  • Create customer segments
  • Calculate aggregate metrics
  • Engineer useful features
  • Answer business questions

Data Quality Report (15 points)
  • Summarize quality issues
  • Document decisions made
  • Provide recommendations
  • Professional presentation

Bonus Opportunities (10 points)

  • Advanced Imputation (3 points): Use KNN or iterative imputation methods
  • Visualization (3 points): Create data quality visualizations
  • Pipeline Function (4 points): Build a reusable data cleaning pipeline
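If you attempt the advanced imputation bonus, here is a minimal sketch using scikit-learn's KNNImputer. The column names are illustrative; apply the imputer only to numeric columns:

from sklearn.impute import KNNImputer

numeric_cols = ['employee_count', 'annual_revenue', 'monthly_spend']  # example numeric columns
imputer = KNNImputer(n_neighbors=5)  # fill each missing value from its 5 most similar rows
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])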

Grading Notes

Full Credit Guidelines:

  • Code runs without errors
  • Demonstrates understanding of data preparation concepts
  • Makes reasonable data quality decisions
  • Provides clear documentation and explanations
  • Shows effort and thoughtful analysis

Partial Credit:

  • Minor errors that don’t affect the overall analysis
  • Missing some documentation
  • Incomplete but reasonable attempts
  • Some filtering or imputation decisions not optimal

Focus Areas:

  1. Understanding over perfection: credit for demonstrating concept knowledge
  2. Decision making: justifying drop vs. impute choices
  3. Business context: connecting technical decisions to business impact
  4. Documentation: clear explanation of what and why


Data Description

Original Dataset Structure

The customer dataset contains the following fields:

Column              Description                    Expected Type  Issues to Watch
customer_id         Unique customer identifier     String         Duplicates possible
company_name        Company name                   String         Inconsistent formats
industry            Industry sector                Category       Missing, typos
employee_count      Number of employees            Numeric        Outliers, negatives
annual_revenue      Annual revenue (USD)           Numeric        Missing, outliers
signup_date         Account creation date          Date           Format issues
last_login_date     Most recent login              Date           Missing means inactive
monthly_spend       Average monthly spend          Numeric        Negative values
support_tickets     Number of support tickets      Numeric        Missing data
features_used       Number of features used        Numeric        Out of range
satisfaction_score  Customer satisfaction (1-10)   Numeric        Invalid scores
contract_type       Type of contract               Category       Inconsistent values
payment_method      Payment method                 Category       Missing data
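Expect to coerce types before validating values. A minimal pandas sketch using columns from the table above (the checks shown are examples, not a required solution):

import pandas as pd

df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')       # unparseable dates become NaT
df['employee_count'] = pd.to_numeric(df['employee_count'], errors='coerce')  # non-numeric entries become NaN

# Flag impossible values rather than silently dropping them
negative_spend = df['monthly_spend'] < 0
bad_scores = df['satisfaction_score'].notna() & ~df['satisfaction_score'].between(1, 10)
print(f'Negative spend: {negative_spend.sum()}, invalid scores: {bad_scores.sum()}')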

Data Quality Challenges

You will encounter various real-world data quality issues:

  1. Missing Values: Different patterns (MCAR, MAR, MNAR)
  2. Outliers: Both errors and legitimate extreme values
  3. Inconsistencies: Format variations, typos, duplicates
  4. Invalid Data: Impossible values, data entry errors
  5. Business Logic Violations: Conflicting information
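A minimal sketch of checks that surface several of these issues at once (assuming df is the loaded dataset; the 1.5×IQR rule is one common outlier heuristic, not the only acceptable one):

# Duplicates, overall missingness, and a simple IQR outlier count
dup_ids = df.duplicated(subset=['customer_id']).sum()
total_missing = df.isnull().sum().sum()

q1, q3 = df['annual_revenue'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df['annual_revenue'] < q1 - 1.5 * iqr) | (df['annual_revenue'] > q3 + 1.5 * iqr)
print(f'Duplicate IDs: {dup_ids}, missing cells: {total_missing}, revenue outliers: {is_outlier.sum()}')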

Academic Integrity

This is an individual assignment. While you may discuss concepts with classmates, all code and written responses must be your own work.

Permitted:

  • Using course materials and notebooks
  • Consulting Python documentation
  • Asking clarifying questions on Canvas

Not Permitted:

  • Copying code from other students
  • Using AI to generate complete solutions
  • Submitting work from previous semesters

Violations will result in a zero for the assignment and may be reported to the Office of Student Conduct.


Tips for Success

Data Preparation Best Practices

  1. Explore First: Always understand your data before cleaning
  2. Document Everything: Record what you did and why
  3. Test Incrementally: Check results after each transformation
  4. Think Business: Consider business impact of data decisions
  5. Be Systematic: Follow a consistent cleaning workflow
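One way to stay systematic (and to attempt the pipeline bonus) is to wrap your steps in a single function. A minimal sketch; the step order and thresholds are illustrative:

import pandas as pd

def clean_customer_data(df):
    """Apply the cleaning steps in a fixed, repeatable order."""
    df = df.copy()                                    # never mutate the raw data
    df = df.drop_duplicates(subset=['customer_id'])   # step 1: deduplicate
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')  # step 2: fix types
    df = df[df['monthly_spend'] >= 0]                 # step 3: business-logic filter
    df['annual_revenue'] = df['annual_revenue'].fillna(df['annual_revenue'].median())  # step 4: impute
    return df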

Common Mistakes to Avoid

  • Dropping too much data without justification
  • Using mean imputation for skewed distributions (see the sketch after this list)
  • Not checking for data leakage after filtering
  • Forgetting to save the cleaned dataset
  • Over-engineering features without business context
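For the mean-vs-median point above, a minimal sketch (assuming annual_revenue is right-skewed; the skewness cutoff of 1.0 is a rule of thumb, not a requirement):

skew = df['annual_revenue'].skew()
if abs(skew) > 1.0:                              # heavily skewed: the mean is pulled toward the tail
    fill_value = df['annual_revenue'].median()
else:
    fill_value = df['annual_revenue'].mean()
df['annual_revenue'] = df['annual_revenue'].fillna(fill_value)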

Debugging Tips

If you encounter errors:

  1. Read the error message carefully
  2. Check data types with df.dtypes
  3. Verify column names with df.columns
  4. Look for NaN values with df.isnull().sum()
  5. Print intermediate results to debug
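Run together, those checks look like this (df is your working DataFrame):

print(df.dtypes)            # confirm each column has the expected type
print(df.columns.tolist())  # catch misspelled or unexpected column names
print(df.isnull().sum())    # count NaN values per column
print(df.head())            # inspect intermediate results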

Getting Help

If you need assistance:

  1. Review Week 3 materials including slides and notebooks
  2. Check the discussion forum for similar questions
  3. Attend office hours for personalized help
  4. Post on Canvas with specific error messages

Due Date and Late Policy

Due Date: Check Canvas for specific date and time

Late Policy:

  • 10% deduction per day late
  • Maximum 3 days late accepted
  • After 3 days, the assignment receives 0 points

Extensions:

  • Must be requested BEFORE the due date
  • Provide documentation for emergencies


Sample Output Examples

Expected Data Quality Summary

Data Quality Assessment:
- Total records: 5,000
- Complete records: 2,450 (49%)
- Records with missing values: 2,550 (51%)
- Duplicate customer IDs: 45
- Invalid values detected: 312

Expected Missing Value Analysis

Missing Values by Column:
satisfaction_score: 1,250 (25.0%)
annual_revenue: 875 (17.5%)
last_login_date: 623 (12.5%)
support_tickets: 498 (10.0%)
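A minimal sketch that produces a summary in this shape (exact counts will depend on your data):

import pandas as pd

missing = df.isnull().sum()
pct = (missing / len(df) * 100).round(1)
report = (pd.DataFrame({'missing': missing, 'percent': pct})
            .query('missing > 0')                       # keep only columns with gaps
            .sort_values('missing', ascending=False))
print(report)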

Expected Customer Segments

Customer Segments:
High-Value Active: 523 customers
High-Value At-Risk: 187 customers
Standard Active: 1,892 customers
Standard At-Risk: 743 customers
Low-Value: 655 customers
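One hedged way to build segments like these is numpy.select. The spend cutoffs and the 30-day inactivity window below are made-up values for illustration; choose and justify your own:

import numpy as np
import pandas as pd

reference = df['last_login_date'].max()                  # treat the latest login in the data as "today"
days_inactive = (reference - df['last_login_date']).dt.days
at_risk = days_inactive.gt(30) | df['last_login_date'].isna()  # missing login means inactive (see data dictionary)
high_value = df['monthly_spend'] >= 1000                 # hypothetical high-value cutoff
low_value = df['monthly_spend'] < 200                    # hypothetical low-value cutoff

conditions = [
    low_value,
    high_value & ~at_risk,
    high_value & at_risk,
    ~high_value & ~at_risk,
    ~high_value & at_risk,
]
labels = ['Low-Value', 'High-Value Active', 'High-Value At-Risk', 'Standard Active', 'Standard At-Risk']
df['segment'] = np.select(conditions, labels, default='Unsegmented')  # default catches unmatched rows
print(df['segment'].value_counts())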

Questions?

If you have questions about this assignment:

  1. First, check this document and the starter notebook
  2. Review the Week 3 materials
  3. Post on Canvas Discussions
  4. Attend office hours

Good luck with your data preparation journey!


Last updated: August 11, 2025