Week 3 Assignment: Customer Data Pipeline - From Chaos to Clarity

Assignment Overview

Welcome to Week 3! This week, you’ll apply your data preparation skills to solve a real-world business problem. You’ll work with messy, incomplete customer data and transform it into a clean, analysis-ready dataset that can drive business decisions.

Learning Objectives

By completing this assignment, you will:

  1. Assess data quality and identify critical issues impacting analysis
  2. Handle missing data using appropriate strategies (drop vs. impute)
  3. Apply filtering techniques to remove outliers and focus on relevant data
  4. Use grouping operations to extract business insights
  5. Build a data preparation pipeline that can be reused
  6. Document data quality decisions and their business impact
  7. Create a data quality report for stakeholders

Business Context

You are a Data Analyst at CloudTech Solutions, a B2B SaaS company providing cloud infrastructure services. The company has been experiencing customer churn issues and wants to identify at-risk customers. However, the customer database has significant quality issues due to:

  • Multiple data entry systems
  • Manual data collection processes
  • System migrations over the years
  • Incomplete onboarding procedures

Your manager has asked you to:

  1. Clean and prepare the customer data for analysis
  2. Create customer segments based on usage patterns
  3. Identify data quality issues that need systematic fixes
  4. Prepare the data for a machine learning churn prediction model

Assignment Structure

The assignment is divided into 5 main parts:

  1. Data Quality Assessment (15 points)
  2. Missing Data Strategy (20 points)
  3. Data Filtering & Outlier Management (20 points)
  4. Grouping & Feature Engineering (20 points)
  5. Data Quality Report (15 points)

Total: 90 points (+ 10 bonus points available)

Estimated Time: 3-4 hours


Important Instructions

File Naming Convention

Your data file MUST be named using your student ID:

[YourStudentID]-week03-clean.csv

Example: If your student ID is U12345678, your file should be named:

U12345678-week03-clean.csv
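Once your cleaning is done, a minimal sketch for saving the file with the required name (assuming your cleaned DataFrame is called df_clean and STUDENT_ID is set as in the starter notebook):

df_clean.to_csv(f'{STUDENT_ID}-week03-clean.csv', index=False)  # index=False avoids writing an extra index column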

Required Deliverables

You must submit THREE files to Canvas:

  1. Jupyter Notebook (.ipynb file) with all code and outputs
  2. HTML Export of your notebook (File → Download as → HTML)
  3. Cleaned CSV Data File with your student ID in the filename

Submission Checklist

Before submitting, ensure:

  • All cells run top to bottom without errors (Kernel → Restart & Run All)
  • STUDENT_ID is set to your actual student ID
  • The cleaned CSV is named [YourStudentID]-week03-clean.csv
  • All three files (.ipynb, .html, .csv) are uploaded to Canvas

Getting Started

Step 1: Download the Starter Notebook

Download the starter notebook from Canvas or the course website: [week03_assignment_data_preparation.ipynb]

Step 2: Set Your Student ID

In the first code cell, replace 'UXXX' with your actual student ID:

STUDENT_ID = 'U12345678'  # Replace with your ID

Step 3: Work Through Each Section

Follow the instructions in each section. Look for TODO comments that indicate where you need to add code.

Step 4: Test Your Code

Ensure all cells run without errors by selecting Kernel → Restart & Run All.

Step 5: Export and Submit

  1. Save your notebook with all outputs
  2. Export as HTML (File → Download as → HTML)
  3. Submit all three files to Canvas

Grading Rubric

Core Requirements (90 points)

Data Quality Assessment (15 points)
  • Load and explore data
  • Identify all quality issues
  • Create missing value analysis
  • Document data problems

Missing Data Strategy (20 points)
  • Analyze missing patterns
  • Make drop vs. impute decisions
  • Implement appropriate imputation
  • Validate results

Data Filtering (20 points)
  • Identify and handle outliers
  • Apply business logic filters
  • Remove invalid records
  • Document filtering impact

Grouping & Features (20 points)
  • Create customer segments
  • Calculate aggregate metrics
  • Engineer useful features
  • Answer business questions

Data Quality Report (15 points)
  • Summarize quality issues
  • Document decisions made
  • Provide recommendations
  • Professional presentation

Bonus Opportunities (10 points)

  • Advanced Imputation (3 points): Use KNN or iterative imputation methods
  • Visualization (3 points): Create data quality visualizations
  • Pipeline Function (4 points): Build a reusable data cleaning pipeline
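If you attempt the advanced imputation bonus, here is a minimal sketch using scikit-learn's KNNImputer. The column names are illustrative; apply the imputer only to numeric columns:

from sklearn.impute import KNNImputer

numeric_cols = ['employee_count', 'annual_revenue', 'monthly_spend']  # example numeric columns
imputer = KNNImputer(n_neighbors=5)  # fill each missing value from its 5 most similar rows
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])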

Grading Notes

Full Credit Guidelines:

  • Code runs without errors
  • Demonstrates understanding of data preparation concepts
  • Makes reasonable data quality decisions
  • Provides clear documentation and explanations
  • Shows effort and thoughtful analysis

Partial Credit:

  • Minor errors that don’t affect the overall analysis
  • Missing some documentation
  • Incomplete but reasonable attempts
  • Some filtering or imputation decisions not optimal

Focus Areas:

  1. Understanding over perfection: credit for demonstrating concept knowledge
  2. Decision making: justifying drop vs. impute choices
  3. Business context: connecting technical decisions to business impact
  4. Documentation: clear explanation of what and why


Data Description

Original Dataset Structure

The customer dataset contains the following fields:

Column              Description                    Expected Type  Issues to Watch
customer_id         Unique customer identifier     String         Duplicates possible
company_name        Company name                   String         Inconsistent formats
industry            Industry sector                Category       Missing, typos
employee_count      Number of employees            Numeric        Outliers, negatives
annual_revenue      Annual revenue (USD)           Numeric        Missing, outliers
signup_date         Account creation date          Date           Format issues
last_login_date     Most recent login              Date           Missing means inactive
monthly_spend       Average monthly spend          Numeric        Negative values
support_tickets     Number of support tickets      Numeric        Missing data
features_used       Number of features used        Numeric        Out of range
satisfaction_score  Customer satisfaction (1-10)   Numeric        Invalid scores
contract_type       Type of contract               Category       Inconsistent values
payment_method      Payment method                 Category       Missing data
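Expect to coerce types before validating values. A minimal pandas sketch using columns from the table above (the checks shown are examples, not a required solution):

import pandas as pd

df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')       # unparseable dates become NaT
df['employee_count'] = pd.to_numeric(df['employee_count'], errors='coerce')  # non-numeric entries become NaN

# Flag impossible values rather than silently dropping them
negative_spend = df['monthly_spend'] < 0
bad_scores = df['satisfaction_score'].notna() & ~df['satisfaction_score'].between(1, 10)
print(f'Negative spend: {negative_spend.sum()}, invalid scores: {bad_scores.sum()}')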

Data Quality Challenges

You will encounter various real-world data quality issues:

  1. Missing Values: Different patterns (MCAR, MAR, MNAR)
  2. Outliers: Both errors and legitimate extreme values
  3. Inconsistencies: Format variations, typos, duplicates
  4. Invalid Data: Impossible values, data entry errors
  5. Business Logic Violations: Conflicting information
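A minimal sketch of checks that surface several of these issues at once (assuming df is the loaded dataset; the 1.5×IQR rule is one common outlier heuristic, not the only acceptable one):

# Duplicates, overall missingness, and a simple IQR outlier count
dup_ids = df.duplicated(subset=['customer_id']).sum()
total_missing = df.isnull().sum().sum()

q1, q3 = df['annual_revenue'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df['annual_revenue'] < q1 - 1.5 * iqr) | (df['annual_revenue'] > q3 + 1.5 * iqr)
print(f'Duplicate IDs: {dup_ids}, missing cells: {total_missing}, revenue outliers: {is_outlier.sum()}')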

Academic Integrity

This is an individual assignment. While you may discuss concepts with classmates, all code and written responses must be your own work.

Permitted:

  • Using course materials and notebooks
  • Consulting Python documentation
  • Asking clarifying questions on Canvas

Not Permitted:

  • Copying code from other students
  • Using AI to generate complete solutions
  • Submitting work from previous semesters

Violations will result in a zero for the assignment and may be reported to the Office of Student Conduct.


Tips for Success

Data Preparation Best Practices

  1. Explore First: Always understand your data before cleaning
  2. Document Everything: Record what you did and why
  3. Test Incrementally: Check results after each transformation
  4. Think Business: Consider business impact of data decisions
  5. Be Systematic: Follow a consistent cleaning workflow
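One way to stay systematic (and to attempt the pipeline bonus) is to wrap your steps in a single function. A minimal sketch; the step order and thresholds are illustrative:

import pandas as pd

def clean_customer_data(df):
    """Apply the cleaning steps in a fixed, repeatable order."""
    df = df.copy()                                    # never mutate the raw data
    df = df.drop_duplicates(subset=['customer_id'])   # step 1: deduplicate
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')  # step 2: fix types
    df = df[df['monthly_spend'] >= 0]                 # step 3: business-logic filter
    df['annual_revenue'] = df['annual_revenue'].fillna(df['annual_revenue'].median())  # step 4: impute
    return df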

Common Mistakes to Avoid

  • Dropping too much data without justification
  • Using mean imputation for skewed distributions (see the sketch after this list)
  • Not checking for data leakage after filtering
  • Forgetting to save the cleaned dataset
  • Over-engineering features without business context
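For the mean-vs-median point above, a minimal sketch (assuming annual_revenue is right-skewed; the skewness cutoff of 1.0 is a rule of thumb, not a requirement):

skew = df['annual_revenue'].skew()
if abs(skew) > 1.0:                              # heavily skewed: the mean is pulled toward the tail
    fill_value = df['annual_revenue'].median()
else:
    fill_value = df['annual_revenue'].mean()
df['annual_revenue'] = df['annual_revenue'].fillna(fill_value)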

Debugging Tips

If you encounter errors:

  1. Read the error message carefully
  2. Check data types with df.dtypes
  3. Verify column names with df.columns
  4. Look for NaN values with df.isnull().sum()
  5. Print intermediate results to debug
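Run together, those checks look like this (df is your working DataFrame):

print(df.dtypes)            # confirm each column has the expected type
print(df.columns.tolist())  # catch misspelled or unexpected column names
print(df.isnull().sum())    # count NaN values per column
print(df.head())            # inspect intermediate results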

Getting Help

If you need assistance:

  1. Review Week 3 materials including slides and notebooks
  2. Check the discussion forum for similar questions
  3. Attend office hours for personalized help
  4. Post on Canvas with specific error messages

Due Date and Late Policy

Due Date: Check Canvas for specific date and time

Late Policy:

  • 10% deduction per day late
  • Maximum 3 days late accepted
  • After 3 days, the assignment receives 0 points

Extensions:

  • Must be requested BEFORE the due date
  • Provide documentation for emergencies


Sample Output Examples

Expected Data Quality Summary

Data Quality Assessment:
- Total records: 5,000
- Complete records: 2,450 (49%)
- Records with missing values: 2,550 (51%)
- Duplicate customer IDs: 45
- Invalid values detected: 312

Expected Missing Value Analysis

Missing Values by Column:
satisfaction_score: 1,250 (25.0%)
annual_revenue: 875 (17.5%)
last_login_date: 623 (12.5%)
support_tickets: 498 (10.0%)
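A minimal sketch that produces a summary in this shape (exact counts will depend on your data):

import pandas as pd

missing = df.isnull().sum()
pct = (missing / len(df) * 100).round(1)
report = (pd.DataFrame({'missing': missing, 'percent': pct})
            .query('missing > 0')                       # keep only columns with gaps
            .sort_values('missing', ascending=False))
print(report)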

Expected Customer Segments

Customer Segments:
High-Value Active: 523 customers
High-Value At-Risk: 187 customers
Standard Active: 1,892 customers
Standard At-Risk: 743 customers
Low-Value: 655 customers
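One hedged way to build segments like these is numpy.select. The spend cutoffs and the 30-day inactivity window below are made-up values for illustration; choose and justify your own:

import numpy as np
import pandas as pd

reference = df['last_login_date'].max()                  # treat the latest login in the data as "today"
days_inactive = (reference - df['last_login_date']).dt.days
at_risk = days_inactive.gt(30) | df['last_login_date'].isna()  # missing login means inactive (see data dictionary)
high_value = df['monthly_spend'] >= 1000                 # hypothetical high-value cutoff
low_value = df['monthly_spend'] < 200                    # hypothetical low-value cutoff

conditions = [
    low_value,
    high_value & ~at_risk,
    high_value & at_risk,
    ~high_value & ~at_risk,
    ~high_value & at_risk,
]
labels = ['Low-Value', 'High-Value Active', 'High-Value At-Risk', 'Standard Active', 'Standard At-Risk']
df['segment'] = np.select(conditions, labels, default='Unsegmented')  # default catches unmatched rows
print(df['segment'].value_counts())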

Questions?

If you have questions about this assignment:

  1. First, check this document and the starter notebook
  2. Review the Week 3 materials
  3. Post on Canvas Discussions
  4. Attend office hours

Good luck with your data preparation journey!


Last updated: August 11, 2025