Week 3 Assignment: Customer Data Pipeline - From Chaos to Clarity
Assignment Overview
Welcome to Week 3! This week, you’ll apply your data preparation skills to solve a real-world business problem. You’ll work with messy, incomplete customer data and transform it into a clean, analysis-ready dataset that can drive business decisions.
Learning Objectives
By completing this assignment, you will:
- Assess data quality and identify critical issues impacting analysis
- Handle missing data using appropriate strategies (drop vs. impute)
- Apply filtering techniques to remove outliers and focus on relevant data
- Use grouping operations to extract business insights
- Build a data preparation pipeline that can be reused
- Document data quality decisions and their business impact
- Create a data quality report for stakeholders
Business Context
You are a Data Analyst at CloudTech Solutions, a B2B SaaS company providing cloud infrastructure services. The company has been experiencing customer churn issues and wants to identify at-risk customers. However, the customer database has significant quality issues due to:
- Multiple data entry systems
- Manual data collection processes
- System migrations over the years
- Incomplete onboarding procedures
Your manager has asked you to:
1. Clean and prepare the customer data for analysis
2. Create customer segments based on usage patterns
3. Identify data quality issues that need systematic fixes
4. Prepare the data for a machine learning churn prediction model
Assignment Structure
The assignment is divided into 5 main parts:
- Data Quality Assessment (15 points)
- Missing Data Strategy (20 points)
- Data Filtering & Outlier Management (20 points)
- Grouping & Feature Engineering (20 points)
- Data Quality Report (15 points)
Total: 90 points (+ 10 bonus points available)
Estimated Time: 3-4 hours
Important Instructions
File Naming Convention
Your data file MUST be named using your student ID:
[YourStudentID]-week03-clean.csv
Example: If your student ID is U12345678, your file should be named:
U12345678-week03-clean.csv
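For example, once your cleaned DataFrame is ready, the save step is a single call. This is a minimal sketch; the DataFrame name `df_clean` and its contents are illustrative stand-ins for your actual result:

```python
import pandas as pd

STUDENT_ID = 'U12345678'  # use your actual student ID

# Stand-in for your final cleaned DataFrame.
df_clean = pd.DataFrame({'customer_id': ['C001'], 'monthly_spend': [99.0]})

# index=False keeps pandas' row index out of the submitted file.
df_clean.to_csv(f'{STUDENT_ID}-week03-clean.csv', index=False)
```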
Required Deliverables
You must submit THREE files to Canvas:
- Jupyter Notebook (.ipynb file) with all code and outputs
- HTML Export of your notebook (File → Download as → HTML)
- Cleaned CSV Data File with your student ID in the filename
Submission Checklist
Before submitting, ensure:
- All cells run top to bottom without errors (Kernel → Restart & Run All)
- STUDENT_ID is set to your actual student ID
- Your cleaned CSV is named [YourStudentID]-week03-clean.csv
- All three files (.ipynb, HTML export, and CSV) are uploaded to Canvas
Getting Started
Step 1: Download the Starter Notebook
Download the starter notebook from Canvas or the course website: [week03_assignment_data_preparation.ipynb]
Step 2: Set Your Student ID
In the first code cell, replace 'UXXX' with your actual student ID:

```python
STUDENT_ID = 'U12345678'  # Replace with your ID
```
Step 3: Work Through Each Section
Follow the instructions in each section. Look for `TODO` comments that indicate where you need to add code.
Step 4: Test Your Code
Ensure all cells run without errors by using Kernel → Restart & Run All.
Step 5: Export and Submit
- Save your notebook with all outputs
- Export as HTML (File → Download as → HTML)
- Submit all three files to Canvas
Grading Rubric
Core Requirements (90 points)
Section | Points | Key Requirements |
---|---|---|
Data Quality Assessment | 15 | • Load and explore data • Identify all quality issues • Create missing value analysis • Document data problems |
Missing Data Strategy | 20 | • Analyze missing patterns • Make drop vs. impute decisions • Implement appropriate imputation • Validate results |
Data Filtering | 20 | • Identify and handle outliers • Apply business logic filters • Remove invalid records • Document filtering impact |
Grouping & Features | 20 | • Create customer segments • Calculate aggregate metrics • Engineer useful features • Answer business questions |
Data Quality Report | 15 | • Summarize quality issues • Document decisions made • Provide recommendations • Professional presentation |
Bonus Opportunities (10 points)
Bonus Item | Points | Description |
---|---|---|
Advanced Imputation | 3 | Use KNN or iterative imputation methods (see the sketch below) |
Visualization | 3 | Create data quality visualizations |
Pipeline Function | 4 | Build a reusable data cleaning pipeline |
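For the KNN option, here is a minimal scikit-learn sketch on a toy numeric frame. The column names and values are illustrative; in practice, scale the features first so large-magnitude columns like revenue do not dominate the distance calculation:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame standing in for the customer data's numeric columns.
df = pd.DataFrame({
    'employee_count': [10, 250, None, 40],
    'annual_revenue': [1.2e6, None, 8.5e5, 2.0e6],
    'monthly_spend': [500.0, 4200.0, 900.0, None],
})

# Each missing value is filled from the k most similar complete rows.
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```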
Grading Notes
Full Credit Guidelines:
- Code runs without errors
- Demonstrates understanding of data preparation concepts
- Makes reasonable data quality decisions
- Provides clear documentation and explanations
- Shows effort and thoughtful analysis

Partial Credit:
- Minor errors that don't affect the overall analysis
- Missing some documentation
- Incomplete but reasonable attempts
- Some filtering or imputation decisions not optimal

Focus Areas:
1. Understanding over perfection: credit for demonstrating concept knowledge
2. Decision making: justifying drop vs. impute choices
3. Business context: connecting technical decisions to business impact
4. Documentation: clear explanation of what and why
Data Description
Original Dataset Structure
The customer dataset contains the following fields:
Column | Description | Expected Type | Issues to Watch |
---|---|---|---|
customer_id | Unique customer identifier | String | Duplicates possible |
company_name | Company name | String | Inconsistent formats |
industry | Industry sector | Category | Missing values, typos |
employee_count | Number of employees | Numeric | Outliers, negative values |
annual_revenue | Annual revenue (USD) | Numeric | Missing values, outliers |
signup_date | Account creation date | Date | Format issues |
last_login_date | Most recent login | Date | Missing means inactive |
monthly_spend | Average monthly spend | Numeric | Negative values |
support_tickets | Number of support tickets | Numeric | Missing data |
features_used | Number of features used | Numeric | Out-of-range values |
satisfaction_score | Customer satisfaction (1-10) | Numeric | Invalid scores |
contract_type | Type of contract | Category | Inconsistent values |
payment_method | Payment method | Category | Missing data |
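A possible first load, assuming the raw file is named `week03_customer_data.csv` (check the starter notebook for the actual filename). Because the dictionary warns of date format issues, dates are coerced rather than parsed strictly:

```python
import pandas as pd

# Filename is an assumption; use whatever raw file the starter notebook provides.
df = pd.read_csv('week03_customer_data.csv', dtype={'customer_id': 'string'})

# The data dictionary warns of date format issues, so coerce unparseable
# entries to NaT instead of letting them break the load.
for col in ['signup_date', 'last_login_date']:
    df[col] = pd.to_datetime(df[col], errors='coerce')

df.info()
```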
Data Quality Challenges
You will encounter various real-world data quality issues (a starter set of checks is sketched after this list):
- Missing Values: Different patterns (MCAR, MAR, MNAR)
- Outliers: Both errors and legitimate extreme values
- Inconsistencies: Format variations, typos, duplicates
- Invalid Data: Impossible values, data entry errors
- Business Logic Violations: Conflicting information
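As a starting point, here is a minimal sketch of checks that surface several of these issues at once. The filename is assumed as above, and the value ranges follow the data dictionary:

```python
import pandas as pd

df = pd.read_csv('week03_customer_data.csv')  # assumed filename

print('Total records:', len(df))
print('Complete records:', len(df.dropna()))
print('Records with missing values:', df.isnull().any(axis=1).sum())
print('Duplicate customer IDs:', df['customer_id'].duplicated().sum())

# Impossible values flagged by the data dictionary. NaN is excluded here,
# since it is counted under missing values rather than invalid values.
invalid = (
    (df['employee_count'] < 0)
    | (df['monthly_spend'] < 0)
    | (~df['satisfaction_score'].between(1, 10) & df['satisfaction_score'].notna())
)
print('Invalid values detected:', invalid.sum())
```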
Academic Integrity
This is an individual assignment. While you may discuss concepts with classmates, all code and written responses must be your own work.
Permitted:
- Using course materials and notebooks
- Consulting Python documentation
- Asking clarifying questions on Canvas

Not Permitted:
- Copying code from other students
- Using AI to generate complete solutions
- Submitting work from previous semesters
Violations will result in a zero for the assignment and may be reported to the Office of Student Conduct.
Tips for Success
Data Preparation Best Practices
- Explore First: Always understand your data before cleaning
- Document Everything: Record what you did and why
- Test Incrementally: Check results after each transformation
- Think Business: Consider business impact of data decisions
- Be Systematic: Follow a consistent cleaning workflow (a pipeline sketch follows this list)
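One possible shape for such a workflow, and for the bonus pipeline function, is a single documented function. The specific steps, columns, and thresholds here are illustrative, not the required solution:

```python
import pandas as pd

def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    """Run the cleaning steps in one documented, repeatable pass."""
    out = df.copy()
    # 1. Drop duplicate customer records, keeping the first occurrence
    out = out.drop_duplicates(subset='customer_id')
    # 2. Enforce hard business rules (note: rows with missing spend
    #    are dropped here too, since NaN fails the comparison)
    out = out[out['monthly_spend'] >= 0]
    # 3. Fill numeric gaps with medians, which are robust to skew
    for col in ['annual_revenue', 'support_tickets']:
        out[col] = out[col].fillna(out[col].median())
    return out

# Usage: cleaned = clean_customer_data(raw_df)
```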
Common Mistakes to Avoid
- Dropping too much data without justification
- Using mean imputation for skewed distributions (see the sketch after this list)
- Not checking for data leakage after filtering
- Forgetting to save the cleaned dataset
- Over-engineering features without business context
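To see the skewed-distribution pitfall concretely, compare mean and median on a toy revenue-like series (values are invented for illustration):

```python
import pandas as pd

# Revenue-like data is right-skewed: one large value drags the mean upward.
s = pd.Series([40_000, 52_000, 61_000, 75_000, None, 2_500_000])

print('mean:  ', s.mean())    # 545,600.0, distorted by the outlier
print('median:', s.median())  # 61,000.0, robust to the outlier

# Median imputation keeps the filled value typical of most customers.
filled = s.fillna(s.median())
```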
Debugging Tips
If you encounter errors:
- Read the error message carefully
- Check data types with `df.dtypes`
- Verify column names with `df.columns`
- Look for NaN values with `df.isnull().sum()`
- Print intermediate results to debug (a quick inspection cell is sketched below)
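Assuming `df` is your working DataFrame, a single cell that runs all of these checks at once:

```python
# Paste into any cell to re-orient yourself (assumes df is your DataFrame).
print(df.dtypes)            # are columns the types you expect?
print(df.columns.tolist())  # exact names (watch for stray spaces)
print(df.isnull().sum())    # where the NaNs are
print(df.head())            # eyeball a few rows
```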
Getting Help
If you need assistance:
- Review Week 3 materials including slides and notebooks
- Check the discussion forum for similar questions
- Attend office hours for personalized help
- Post on Canvas with specific error messages
Due Date and Late Policy
Due Date: Check Canvas for specific date and time
Late Policy:
- 10% deduction per day late
- Maximum 3 days late accepted
- After 3 days, the assignment receives 0 points

Extensions:
- Must be requested BEFORE the due date
- Provide documentation for emergencies
Sample Output Examples
Expected Data Quality Summary
```
Data Quality Assessment:
- Total records: 5,000
- Complete records: 2,450 (49%)
- Records with missing values: 2,550 (51%)
- Duplicate customer IDs: 45
- Invalid values detected: 312
```
Expected Missing Value Analysis
```
Missing Values by Column:
satisfaction_score: 1,250 (25.0%)
annual_revenue: 875 (17.5%)
last_login_date: 623 (12.5%)
support_tickets: 498 (10.0%)
```
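A report in this shape can be produced with a few pandas calls (filename assumed, as above):

```python
import pandas as pd

df = pd.read_csv('week03_customer_data.csv')  # assumed filename

# Missing-value report: count and percent per column, largest first
report = (
    df.isnull().sum()
      .to_frame('missing')
      .assign(percent=lambda t: (t['missing'] / len(df) * 100).round(1))
      .query('missing > 0')
      .sort_values('missing', ascending=False)
)
print(report)
```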
Expected Customer Segments
```
Customer Segments:
High-Value Active: 523 customers
High-Value At-Risk: 187 customers
Standard Active: 1,892 customers
Standard At-Risk: 743 customers
Low-Value: 655 customers
```
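One illustrative way to derive segments like these is to combine a spend threshold with login recency. The 75th-percentile cutoff and 30-day window are assumptions to tune, and a Low-Value tier would follow the same pattern:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('week03_customer_data.csv')  # assumed filename
df['last_login_date'] = pd.to_datetime(df['last_login_date'], errors='coerce')

# Illustrative rules: tune the thresholds to your own analysis.
high_value = df['monthly_spend'] >= df['monthly_spend'].quantile(0.75)
cutoff = df['last_login_date'].max() - pd.Timedelta(days=30)
at_risk = df['last_login_date'].isna() | (df['last_login_date'] < cutoff)

df['segment'] = np.select(
    [high_value & ~at_risk, high_value & at_risk, ~high_value & ~at_risk],
    ['High-Value Active', 'High-Value At-Risk', 'Standard Active'],
    default='Standard At-Risk',
)
print(df['segment'].value_counts())
```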
Questions?
If you have questions about this assignment:
- First, check this document and the starter notebook
- Review the Week 3 materials
- Post on Canvas Discussions
- Attend office hours
Good luck with your data preparation journey!
Last updated: August 11, 2025