# Week 2C: Pandas Fundamentals - Data Manipulation and Analysis

## Welcome to Pandas!

Welcome to your journey into data manipulation with Pandas! Pandas is the cornerstone of data analysis in Python, providing powerful and flexible tools for working with structured data. Named after "Panel Data" from econometrics, Pandas has become the de facto standard for data manipulation in Python's data science ecosystem.

### Why Pandas?

Pandas adds the labeled, tabular data tools that core Python lacks, which makes it practical to go from raw files to finished analysis in one environment. It provides:

- **Intuitive Data Structures**: DataFrames and Series that make data manipulation natural
- **Powerful Tools**: Built-in functions for filtering, grouping, and transforming data
- **Integration**: Works seamlessly with NumPy, Matplotlib, and machine learning libraries
- **Performance**: Vectorized operations backed by optimized C implementations
- **Flexibility**: Handle missing data, time series, and various file formats

### What We'll Cover

In this comprehensive notebook, we'll explore:

1. **DataFrames and Series** - The fundamental data structures
2. **Data Selection and Indexing** - Accessing your data efficiently
3. **Data Manipulation** - Transforming and modifying data
4. **Aggregation and Statistics** - Summarizing and analyzing data
5. **Handling Missing Data** - Dealing with real-world imperfect data
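6. **Merging and Joining** - Combining data from multiple sources
7. **Dates and Times** - Working with time series data
8. **Reading and Writing Data** - Moving data between Pandas and files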

Let's begin our exploration of Pandas!

## Part 1: Introduction to Pandas Data Structures

### Series: The Building Block

A Series is a one-dimensional labeled array capable of holding any data type. Think of it as a sophisticated version of a Python list or a single column in a spreadsheet.

In [None]:
import pandas as pd
import numpy as np

# Creating a Series from a list
temperatures = pd.Series([72, 75, 69, 80, 83, 79, 77])
print("Temperature Series:")
print(temperatures)
print(f"\nType: {type(temperatures)}")
print(f"Shape: {temperatures.shape}")

In [None]:
# Series with custom index
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temp_series = pd.Series([72, 75, 69, 80, 83, 79, 77], index=days)

print("Temperature by Day:")
print(temp_series)
print(f"\nWednesday's temperature: {temp_series['Wednesday']}°F")
print(f"Weekend average: {temp_series[['Saturday', 'Sunday']].mean():.1f}°F")

In [None]:
# Series operations
print("Temperature Statistics:")
print(f"Mean: {temp_series.mean():.1f}°F")
print(f"Max: {temp_series.max()}°F on {temp_series.idxmax()}")
print(f"Min: {temp_series.min()}°F on {temp_series.idxmin()}")
print(f"\nDays above 75°F:")
print(temp_series[temp_series > 75])
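
Because a Series is labeled, arithmetic between two Series matches values by index label rather than by position. The sketch below uses made-up readings (`week_one` and `week_two` are hypothetical names, not part of the notebook's data) to show what happens when the labels only partly overlap.

```python
import pandas as pd

# Hypothetical readings: week_two has no Monday entry, week_one has no Thursday entry.
week_one = pd.Series([72, 75, 69], index=['Monday', 'Tuesday', 'Wednesday'])
week_two = pd.Series([70, 74, 71], index=['Tuesday', 'Wednesday', 'Thursday'])

# Subtraction pairs values that share a label, regardless of their position.
difference = week_two - week_one
print(difference)

# Labels present in only one Series produce NaN in the result.
print(f"Labels without a match: {difference[difference.isna()].index.tolist()}")
```

If you want position-based arithmetic instead, one option is to work on the underlying arrays, for example `week_two.to_numpy() - week_one.to_numpy()`.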

### DataFrames: The Workhorse of Pandas

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.

In [None]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'Salary': [50000, 75000, 60000, 55000, 80000],
    'Years_Experience': [2, 5, 8, 3, 7]
}

df = pd.DataFrame(data)
print("Employee DataFrame:")
print(df)
print(f"\nShape: {df.shape} (rows, columns)")
print(f"Columns: {df.columns.tolist()}")

In [None]:
# DataFrame information and statistics
print("DataFrame Info:")
df.info()

print("\nDataFrame Statistics:")
print(df.describe())

print("\nData Types:")
print(df.dtypes)

In [None]:
# Creating DataFrames from various sources
# From a list of lists
data_list = [
    ['Product A', 100, 25.50],
    ['Product B', 150, 30.00],
    ['Product C', 75, 22.75]
]
products_df = pd.DataFrame(data_list, columns=['Product', 'Quantity', 'Price'])
print("Products DataFrame:")
print(products_df)

# From NumPy array
np_data = np.random.randn(5, 3)
random_df = pd.DataFrame(np_data, columns=['A', 'B', 'C'])
print("\nRandom DataFrame:")
print(random_df.round(2))

### 🎯 Practice Exercise 1: Creating DataFrames

Create a DataFrame for student grades:

In [None]:
# Exercise: Create a student grades DataFrame
# TODO: Create a DataFrame with the following columns:
# - Student names: John, Sarah, Mike, Emily, David
# - Math scores: 85, 92, 78, 95, 88
# - Science scores: 90, 88, 85, 92, 79
# - English scores: 78, 94, 82, 89, 91

# Your code here:


# Test your solution (uncomment when ready):
# print("Student Grades:")
# print(students_df)
# print(f"\nAverage Math Score: {students_df['Math'].mean():.1f}")

## Part 2: Data Selection and Indexing

### Selecting Columns

There are multiple ways to select data from a DataFrame. Let's explore the most common methods.

In [None]:
# Single column selection (returns Series)
print("Names (as Series):")
print(df['Name'])
print(f"Type: {type(df['Name'])}")

# Alternative syntax using dot notation (only for valid Python identifiers)
print("\nAges (using dot notation):")
print(df.Age)

# Multiple columns (returns DataFrame)
print("\nName and Salary:")
print(df[['Name', 'Salary']])
print(f"Type: {type(df[['Name', 'Salary']])}")
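
The note that dot notation only works for valid Python identifiers is worth seeing in action. This is a small sketch with a hypothetical `demo` DataFrame: one column name contains a space, and another shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: 'Unit Price' has a space, 'count' collides with DataFrame.count().
demo = pd.DataFrame({
    'Unit Price': [2.5, 4.0],
    'count': [10, 3],
})

# Bracket notation works for any column name.
unit_prices = demo['Unit Price']
counts = demo['count']
print(unit_prices.tolist())
print(counts.tolist())

# demo.count refers to the method, not the column, so dot notation is misleading here.
print(f"demo.count is callable: {callable(demo.count)}")
```

For this reason, bracket notation is usually the safer default in scripts, with dot notation kept for quick interactive exploration.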

### Row Selection with iloc and loc

In [None]:
# iloc - Integer location based indexing
print("First row (iloc[0]):")
print(df.iloc[0])

print("\nFirst three rows (iloc[0:3]):")
print(df.iloc[0:3])

print("\nSpecific rows and columns (iloc[1:4, 1:3]):")
print(df.iloc[1:4, 1:3])  # Rows 1-3, columns 1-2

In [None]:
# Setting custom index for loc demonstration
df_indexed = df.set_index('Name')
print("DataFrame with Name as index:")
print(df_indexed)

# loc - Label based indexing
print("\nAlice's data (loc['Alice']):")
print(df_indexed.loc['Alice'])

print("\nMultiple employees:")
print(df_indexed.loc[['Bob', 'Diana']])

print("\nSpecific cells:")
print(f"Bob's Salary: ${df_indexed.loc['Bob', 'Salary']:,}")
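
One difference between `iloc` and `loc` that often surprises newcomers is how slices treat the end point. The sketch below uses a small hypothetical `staff` DataFrame indexed by name to compare the two.

```python
import pandas as pd

# Hypothetical frame indexed by name, mirroring df_indexed above.
staff = pd.DataFrame(
    {'Age': [25, 30, 35, 28]},
    index=['Alice', 'Bob', 'Charlie', 'Diana'],
)

# iloc slices by position and excludes the stop position, like Python lists.
by_position = staff.iloc[0:2]
print(by_position)

# loc slices by label and includes the stop label.
by_label = staff.loc['Alice':'Charlie']
print(by_label)
```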

### Boolean Indexing (Filtering)

In [None]:
# Simple filtering
print("Employees with salary > $60,000:")
high_earners = df[df['Salary'] > 60000]
print(high_earners)

# Multiple conditions
print("\nIT employees with 5+ years experience:")
experienced_it = df[(df['Department'] == 'IT') & (df['Years_Experience'] >= 5)]
print(experienced_it)

# Using isin() for multiple values
print("\nEmployees in Sales or HR:")
sales_hr = df[df['Department'].isin(['Sales', 'HR'])]
print(sales_hr)

In [None]:
# Query method - more readable for complex conditions
print("Using query method:")
result = df.query('Salary > 60000 and Years_Experience < 8')
print(result)

# Query with variables
min_salary = 55000
result2 = df.query('Salary >= @min_salary')
print(f"\nEmployees earning >= ${min_salary:,}:")
print(result2)
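
The filters above combine conditions with `&`. The same pattern works with `|` (or) and `~` (not). Here is a short sketch on a hypothetical `staff` DataFrame that repeats the employee columns it needs. Each condition is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than comparison operators such as `==` and `>`.

```python
import pandas as pd

# Hypothetical copy of the employee columns used in this example.
staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward'],
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'Salary': [50000, 75000, 60000, 55000, 80000],
})

# OR: employees in HR, or earning more than 70,000.
hr_or_high = staff[(staff['Department'] == 'HR') | (staff['Salary'] > 70000)]
print(hr_or_high)

# NOT: everyone outside the IT department.
not_it = staff[~(staff['Department'] == 'IT')]
print(not_it)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.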

### 🎯 Practice Exercise 2: Data Selection

Practice selecting and filtering data:

In [None]:
# Exercise: Data Selection and Filtering
# Given the employee DataFrame (df), find:
# 1. All employees aged 30 or older
# 2. Names and salaries of IT department employees
# 3. The employee with the highest salary
# 4. Average salary by department

# Your code here:
# TODO: Filter employees aged 30+
# TODO: Select IT employees' names and salaries
# TODO: Find highest paid employee
# TODO: Calculate average salary by department


# Test your solution (uncomment when ready):
# print("Employees aged 30+:")
# print(older_employees)
# print("\nIT Department (Name & Salary):")
# print(it_salaries)
# print(f"\nHighest paid: {highest_paid}")
# print("\nAverage salary by department:")
# print(dept_avg_salary)

## Part 3: Data Manipulation

### Adding and Modifying Columns

In [None]:
# Create a copy to work with
df_copy = df.copy()

# Adding a new column
df_copy['Bonus'] = df_copy['Salary'] * 0.1
print("DataFrame with Bonus column:")
print(df_copy)

# Conditional column creation
df_copy['Level'] = df_copy['Years_Experience'].apply(
    lambda x: 'Senior' if x >= 5 else 'Junior'
)
print("\nWith experience level:")
print(df_copy[['Name', 'Years_Experience', 'Level']])

In [None]:
# Using np.where for conditional values
df_copy['Performance'] = np.where(
    df_copy['Salary'] > 65000, 
    'Excellent', 
    np.where(df_copy['Salary'] > 55000, 'Good', 'Average')
)
print("Performance ratings:")
print(df_copy[['Name', 'Salary', 'Performance']])

# Using pd.cut for binning continuous values
df_copy['Age_Group'] = pd.cut(
    df_copy['Age'], 
    bins=[0, 30, 35, 100], 
    labels=['Young', 'Middle', 'Senior']
)
print("\nAge groups:")
print(df_copy[['Name', 'Age', 'Age_Group']])
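
When binning with `pd.cut`, it helps to know which side of each bin is closed. By default the right edge is included, so an age of exactly 30 falls in the first bin. The sketch below reuses the same ages and bins in standalone variables and compares the default with `right=False`.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 28, 32])
bins = [0, 30, 35, 100]
labels = ['Young', 'Middle', 'Senior']

# Default right=True: the intervals are (0, 30], (30, 35], (35, 100].
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: the intervals are [0, 30), [30, 35), [35, 100).
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    'Age': ages,
    'right=True': right_closed,
    'right=False': left_closed,
})
print(comparison)
```

Only the ages that sit exactly on a bin edge (30 and 35) change groups between the two settings.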

### Sorting Data

In [None]:
# Sort by single column
sorted_by_salary = df.sort_values('Salary', ascending=False)
print("Sorted by Salary (highest first):")
print(sorted_by_salary)

# Sort by multiple columns
sorted_multi = df.sort_values(['Department', 'Salary'], ascending=[True, False])
print("\nSorted by Department (A-Z), then Salary (high to low):")
print(sorted_multi)

# Sort by index
df_copy.sort_index(inplace=True)
print("\nSorted by index:")
print(df_copy[['Name', 'Department']].head())

### Dropping and Renaming

In [None]:
# Dropping columns
df_reduced = df_copy.drop(columns=['Bonus', 'Performance'])
print("After dropping columns:")
print(df_reduced.columns.tolist())

# Dropping rows
df_filtered = df_copy.drop(index=[0, 2])  # Drop rows at index 0 and 2
print("\nAfter dropping rows 0 and 2:")
print(df_filtered[['Name', 'Department']])

# Renaming columns
df_renamed = df.rename(columns={
    'Years_Experience': 'Experience',
    'Department': 'Dept'
})
print("\nRenamed columns:")
print(df_renamed.columns.tolist())

### String Operations

In [None]:
# String methods on Series
df_str = df.copy()

# Convert to uppercase
df_str['Name_Upper'] = df_str['Name'].str.upper()
print("Uppercase names:")
print(df_str[['Name', 'Name_Upper']])

# Extract information
df_str['Name_Length'] = df_str['Name'].str.len()
df_str['First_Letter'] = df_str['Name'].str[0]
print("\nString analysis:")
print(df_str[['Name', 'Name_Length', 'First_Letter']])

# Contains method
print("\nNames containing 'a' (case insensitive):")
contains_a = df_str[df_str['Name'].str.contains('a', case=False)]
print(contains_a['Name'].tolist())

### 🎯 Practice Exercise 3: Data Manipulation

Transform and enhance the employee dataset:

In [None]:
# Exercise: Data Manipulation
# Using the employee DataFrame:
# 1. Add a 'Total_Compensation' column (Salary + 15% bonus)
# 2. Create a 'Seniority' column: 'Entry' (<3 years), 'Mid' (3-6), 'Senior' (>6)
# 3. Add email addresses: name@company.com (lowercase, built from the Name column)
# 4. Sort by Total_Compensation (highest first)

df_exercise = df.copy()

# Your code here:
# TODO: Add Total_Compensation
# TODO: Create Seniority levels
# TODO: Generate email addresses
# TODO: Sort by compensation


# Test your solution (uncomment when ready):
# print("Enhanced Employee Data:")
# print(df_exercise[['Name', 'Email', 'Seniority', 'Total_Compensation']])

## Part 4: Aggregation and Statistics

### GroupBy Operations

GroupBy is one of the most powerful features in Pandas, implementing the split-apply-combine strategy.

In [None]:
# Basic groupby
dept_groups = df.groupby('Department')

# Aggregate with single function
print("Average salary by department:")
print(dept_groups['Salary'].mean())

# Multiple aggregations
print("\nDepartment statistics:")
dept_stats = dept_groups['Salary'].agg(['mean', 'min', 'max', 'count'])
print(dept_stats)

# Different aggregations for different columns
print("\nComprehensive department summary:")
summary = dept_groups.agg({
    'Salary': ['mean', 'max'],
    'Age': 'mean',
    'Years_Experience': 'sum'
}).round(2)
print(summary)

In [None]:
# Custom aggregation functions
def salary_range(x):
    return x.max() - x.min()

print("Salary range by department:")
print(dept_groups['Salary'].agg(salary_range))

# Multiple groupby keys
df_copy = df.copy()
df_copy['Level'] = df_copy['Years_Experience'].apply(
    lambda x: 'Senior' if x >= 5 else 'Junior'
)

multi_group = df_copy.groupby(['Department', 'Level'])['Salary'].mean()
print("\nAverage salary by department and level:")
print(multi_group)
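
To make the split-apply-combine strategy concrete, the sketch below performs the three steps by hand on a hypothetical `staff` DataFrame and compares the result with the built-in one-liner. It is meant as an illustration of the idea, not a description of how Pandas implements `groupby` internally.

```python
import pandas as pd

# Hypothetical copy of the two employee columns needed here.
staff = pd.DataFrame({
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'Salary': [50000, 75000, 60000, 55000, 80000],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: compute the mean salary of each piece.
# Combine: collect the results into a single Series keyed by department.
manual = pd.Series({
    dept: piece['Salary'].mean()
    for dept, piece in staff.groupby('Department')
})
print(manual)

# The built-in version does all three steps in one expression.
builtin = staff.groupby('Department')['Salary'].mean()
print(builtin)
```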

### Pivot Tables

Pivot tables provide a flexible way to create spreadsheet-style summary tables.

In [None]:
# Create sample sales data
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=20),
    'Product': np.random.choice(['A', 'B', 'C'], 20),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 20),
    'Sales': np.random.randint(100, 1000, 20),
    'Units': np.random.randint(10, 50, 20)
})

print("Sales Data Sample:")
print(sales_data.head(10))

# Create pivot table
pivot = pd.pivot_table(
    sales_data,
    values='Sales',
    index='Product',
    columns='Region',
    aggfunc='sum',
    fill_value=0
)

print("\nPivot Table - Sales by Product and Region:")
print(pivot)

# Multiple aggregations in pivot table
pivot_multi = pd.pivot_table(
    sales_data,
    values=['Sales', 'Units'],
    index='Product',
    aggfunc={'Sales': 'sum', 'Units': 'mean'}
)

print("\nPivot with multiple metrics:")
print(pivot_multi.round(1))
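
A pivot table can be read as a `groupby` on two keys whose result is reshaped so that one key becomes the columns. The sketch below shows that equivalence on a small, fixed `sales` DataFrame (hypothetical values, chosen so the output is reproducible). It uses `unstack`, which has not been introduced in this notebook: it moves the inner index level into the columns.

```python
import pandas as pd

# Small fixed dataset so the result is the same on every run.
sales = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B', 'A'],
    'Region': ['North', 'South', 'North', 'North', 'North'],
    'Sales': [100, 200, 150, 50, 300],
})

pivot = pd.pivot_table(
    sales, values='Sales', index='Product',
    columns='Region', aggfunc='sum', fill_value=0,
)
print(pivot)

# Same numbers via groupby: aggregate on both keys, then move Region into columns.
grouped = (
    sales.groupby(['Product', 'Region'])['Sales']
    .sum()
    .unstack(fill_value=0)
)
print(grouped)
```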

### Statistical Analysis

In [None]:
# Descriptive statistics
print("Employee DataFrame Statistics:")
print(df.describe())

# Correlation analysis
print("\nCorrelation Matrix:")
numeric_cols = df.select_dtypes(include=[np.number])
correlation = numeric_cols.corr()
print(correlation.round(3))

# Specific statistics
print("\nDetailed Statistics:")
print(f"Salary variance: {df['Salary'].var():,.2f}")
print(f"Salary std dev: {df['Salary'].std():,.2f}")
print(f"Age median: {df['Age'].median()}")
print(f"25th percentile salary: ${df['Salary'].quantile(0.25):,.2f}")
print(f"75th percentile salary: ${df['Salary'].quantile(0.75):,.2f}")

### 🎯 Practice Exercise 4: Aggregation

Perform complex aggregations on the data:

In [None]:
# Exercise: Advanced Aggregation
# Create a summary report that shows:
# 1. Number of employees per department
# 2. Average years of experience per department
# 3. Salary statistics (min, max, mean, median) per department
# 4. Which department has the highest average salary per year of experience

# Your code here:
# TODO: Count employees per department
# TODO: Average experience per department
# TODO: Salary statistics per department
# TODO: Calculate salary per year of experience


# Test your solution (uncomment when ready):
# print("Department Summary Report:")
# print(dept_summary)
# print(f"\nBest salary per experience: {best_dept}")

## Part 5: Handling Missing Data

### Detecting Missing Data

Real-world data often contains missing values. Pandas provides tools to handle them effectively.

In [None]:
# Create sample data with missing values
missing_data = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5],
    'D': [np.nan, np.nan, np.nan, 4, 5]
})

print("Data with missing values:")
print(missing_data)

# Check for missing values
print("\nMissing values per column:")
print(missing_data.isnull().sum())

print("\nPercentage of missing values:")
print((missing_data.isnull().sum() / len(missing_data) * 100).round(1))

# Check which rows have any missing values
print("\nRows with any missing values:")
print(missing_data[missing_data.isnull().any(axis=1)])
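
Missing values also affect summary statistics. By default, reductions such as `mean()` skip NaN, which can hide gaps if you are not checking for them. The sketch below uses a standalone Series to compare the default with `skipna=False`.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, np.nan, 4, 5])

# Default: NaN is ignored, so this is (1 + 2 + 4 + 5) / 4.
mean_skipping = values.mean()

# skipna=False: any NaN makes the result NaN.
mean_strict = values.mean(skipna=False)

# count() reports only the non-missing entries.
non_missing = values.count()

print(f"Mean (NaN skipped): {mean_skipping}")
print(f"Mean (skipna=False): {mean_strict}")
print(f"Non-missing values: {non_missing} of {len(values)}")
```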

### Handling Missing Data

In [None]:
# Drop missing values
print("Drop rows with any NaN:")
print(missing_data.dropna())

print("\nDrop columns with any NaN:")
print(missing_data.dropna(axis=1))

print("\nDrop rows where all values are NaN:")
print(missing_data.dropna(how='all'))

# Fill missing values
print("\nFill with constant value (0):")
print(missing_data.fillna(0))

print("\nFill with column mean:")
filled_mean = missing_data.fillna(missing_data.mean())
print(filled_mean.round(2))

# fillna(method=...) is deprecated in recent Pandas versions; use ffill()/bfill()
print("\nForward fill (use previous value):")
print(missing_data.ffill())

print("\nBackward fill (use next value):")
print(missing_data.bfill())

In [None]:
# Interpolation for missing values
time_series = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'value': [1, np.nan, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10]
})

print("Time series with gaps:")
print(time_series)

print("\nLinear interpolation:")
time_series['interpolated'] = time_series['value'].interpolate(method='linear')
print(time_series)

## Part 6: Merging and Joining DataFrames

### Concatenation

Combining DataFrames vertically or horizontally.

In [None]:
# Create sample DataFrames
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df2 = pd.DataFrame({
    'A': [7, 8, 9],
    'B': [10, 11, 12]
})

df3 = pd.DataFrame({
    'C': [13, 14, 15],
    'D': [16, 17, 18]
})

# Vertical concatenation (stack rows)
vertical_concat = pd.concat([df1, df2], ignore_index=True)
print("Vertical concatenation:")
print(vertical_concat)

# Horizontal concatenation (add columns)
horizontal_concat = pd.concat([df1, df3], axis=1)
print("\nHorizontal concatenation:")
print(horizontal_concat)
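
Horizontal concatenation lines rows up by index label, not by position. In the example above both DataFrames share the default index 0 to 2, so the distinction is invisible. The sketch below uses two hypothetical frames, `left` and `right`, whose indexes only partly overlap.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'C': [10, 20, 30]}, index=[1, 2, 3])

# Rows are matched on index labels; a label missing from one side yields NaN.
aligned = pd.concat([left, right], axis=1)
print(aligned)

# Resetting both indexes first pairs the rows purely by position.
by_position = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(by_position)
```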

### Merging DataFrames

Similar to SQL joins, merge combines DataFrames based on common columns or indices.

In [None]:
# Create related DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward'],
    'dept_id': [101, 102, 101, 103, 102]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 103, 104],
    'dept_name': ['Sales', 'IT', 'HR', 'Finance'],
    'location': ['NYC', 'SF', 'LA', 'Chicago']
})

salaries = pd.DataFrame({
    'emp_id': [1, 2, 3, 5, 6],  # Note: emp_id 4 missing, 6 is extra
    'salary': [50000, 75000, 60000, 80000, 45000]
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
print("\nSalaries:")
print(salaries)

In [None]:
# Inner join (only matching records)
inner_merge = pd.merge(employees, departments, on='dept_id')
print("Inner join - Employees with Departments:")
print(inner_merge)

# Left join (all from left, matching from right)
left_merge = pd.merge(employees, salaries, on='emp_id', how='left')
print("\nLeft join - All employees with salaries (if available):")
print(left_merge)

# Outer join (all from both)
outer_merge = pd.merge(employees, salaries, on='emp_id', how='outer')
print("\nOuter join - All employees and all salaries:")
print(outer_merge)

# Multiple joins
full_data = pd.merge(employees, departments, on='dept_id')
full_data = pd.merge(full_data, salaries, on='emp_id', how='left')
print("\nComplete employee information:")
print(full_data)
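
After an outer join it is not always obvious which rows found a match. One way to check is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The sketch below repeats trimmed-down versions of the `employees` and `salaries` tables so it can run on its own.

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward'],
})
salaries = pd.DataFrame({
    'emp_id': [1, 2, 3, 5, 6],
    'salary': [50000, 75000, 60000, 80000, 45000],
})

# _merge is 'both', 'left_only', or 'right_only' for each row.
audit = pd.merge(employees, salaries, on='emp_id', how='outer', indicator=True)
print(audit)

# Rows that appear in only one of the two tables.
unmatched = audit[audit['_merge'] != 'both']
print(unmatched[['emp_id', '_merge']])
```

Here `emp_id` 4 has no salary record and `emp_id` 6 has no employee record, matching the note in the `salaries` definition above.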

### 🎯 Practice Exercise 5: Merging Data

Combine multiple datasets to create a comprehensive view:

In [None]:
# Exercise: Data Integration
# Given three DataFrames:
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': [101, 102, 101, 103, 102],
    'product_id': [1, 2, 1, 3, 2],
    'quantity': [2, 1, 3, 1, 2]
})

customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'customer_name': ['John', 'Sarah', 'Mike'],
    'city': ['NYC', 'LA', 'Chicago']
})

products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [1000, 50, 80]
})

# TODO: Merge all three DataFrames to create a complete order report
# TODO: Calculate total price for each order (quantity * price)
# TODO: Find total sales per customer
# TODO: Find most popular product

# Your code here:


# Test your solution (uncomment when ready):
# print("Complete Order Report:")
# print(complete_orders)
# print(f"\nTotal sales per customer:")
# print(customer_sales)
# print(f"\nMost popular product: {popular_product}")

## Part 7: Working with Dates and Times

### DateTime Operations

Pandas has extensive support for working with dates and times.

In [None]:
# Creating date ranges
dates = pd.date_range('2024-01-01', periods=10, freq='D')
print("Daily dates:")
print(dates)

# Different frequencies
business_days = pd.date_range('2024-01-01', periods=10, freq='B')  # Business days
print("\nBusiness days only:")
print(business_days)

# Create time series DataFrame
ts_df = pd.DataFrame({
    'date': dates,
    'value': np.random.randn(10).cumsum() + 100
})
ts_df.set_index('date', inplace=True)
print("\nTime series DataFrame:")
print(ts_df)

In [None]:
# Parsing dates from strings
date_strings = ['2024-01-15', '2024-02-20', '2024-03-10']
parsed_dates = pd.to_datetime(date_strings)
print("Parsed dates:")
print(parsed_dates)

# Extract date components
date_df = pd.DataFrame({'date': parsed_dates})
date_df['year'] = date_df['date'].dt.year
date_df['month'] = date_df['date'].dt.month
date_df['day'] = date_df['date'].dt.day
date_df['weekday'] = date_df['date'].dt.day_name()
print("\nDate components:")
print(date_df)

# Date arithmetic
date_df['plus_30_days'] = date_df['date'] + pd.Timedelta(days=30)
date_df['plus_1_month'] = date_df['date'] + pd.DateOffset(months=1)
print("\nDate arithmetic:")
print(date_df[['date', 'plus_30_days', 'plus_1_month']])
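
Setting a date column as the index, as done for `ts_df` earlier, pays off when selecting rows by date. The sketch below uses a small hypothetical time series `ts` with fixed values and shows label slicing and partial-string selection through `loc`.

```python
import pandas as pd

# Five consecutive days spanning the end of January and the start of February.
dates = pd.date_range('2024-01-29', periods=5, freq='D')
ts = pd.DataFrame({'value': [10, 20, 30, 40, 50]}, index=dates)

# Slice by date strings; both end points are included, as with other loc slices.
late_january = ts.loc['2024-01-30':'2024-01-31']
print(late_january)

# A partial date string selects every row in that period, here February 2024.
february = ts.loc['2024-02']
print(february)
```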

## Part 8: Reading and Writing Data

### File I/O Operations

Pandas can read and write data in various formats.

In [None]:
# Create sample data to save
sample_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85.5, 92.3, 78.9]
})

# Save to CSV
sample_data.to_csv('sample.csv', index=False)
print("Data saved to CSV")

# Read from CSV
read_data = pd.read_csv('sample.csv')
print("\nData read from CSV:")
print(read_data)

# Save to Excel (requires openpyxl)
try:
    sample_data.to_excel('sample.xlsx', index=False, sheet_name='Data')
    print("\nData saved to Excel")
except ImportError:
    print("\nExcel writing requires 'openpyxl' package")

# Save to JSON
sample_data.to_json('sample.json', orient='records', indent=2)
print("\nData saved to JSON")

# Clean up files
import os
for file in ['sample.csv', 'sample.xlsx', 'sample.json']:
    if os.path.exists(file):
        os.remove(file)
print("\nTemporary files cleaned up")
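
CSV files store everything as text, so date columns are read back as plain strings unless you ask otherwise. The sketch below reads a hypothetical two-row CSV from an in-memory string (via `io.StringIO`, so no file is created) with and without the `parse_dates` option of `read_csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of on disk.
csv_text = "date,score\n2024-01-15,85.5\n2024-02-20,92.3\n"

# Without parse_dates the 'date' column is read as strings.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# With parse_dates the column is converted to datetimes while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(parsed.dtypes)
```

With a datetime column in place, the `.dt` accessor from Part 7 (for example `parsed['date'].dt.month`) works directly.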

## Summary and Next Steps

### What We've Learned

Congratulations! You've mastered the fundamentals of Pandas:

1. **Data Structures**: Working with Series and DataFrames
2. **Selection and Indexing**: Accessing data using loc, iloc, and boolean indexing
3. **Data Manipulation**: Adding columns, sorting, and transforming data
4. **Aggregation**: GroupBy operations and pivot tables
5. **Missing Data**: Detecting and handling NaN values
6. **Merging**: Combining multiple DataFrames
7. **DateTime**: Working with time series data
8. **File I/O**: Reading and writing various file formats

### Key Takeaways

- **DataFrames** are the primary data structure for tabular data
- **Method chaining** allows for elegant data transformations
- **GroupBy** operations enable powerful aggregations
- **Merging** DataFrames is similar to SQL joins
- **Missing data** handling is crucial for real-world datasets
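
As an illustration of method chaining, the sketch below strings together several operations from this notebook on a hypothetical `staff` DataFrame. It also uses `assign`, which was not covered above: it returns a new DataFrame with an added column, which makes it convenient inside a chain.

```python
import pandas as pd

# Hypothetical copy of the employee columns used in this example.
staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward'],
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'Salary': [50000, 75000, 60000, 55000, 80000],
})

# Each step returns a new object, so the next method can be called on it directly.
bonus_by_department = (
    staff
    .assign(Bonus=lambda d: d['Salary'] * 0.1)   # add a derived column
    .query('Salary >= 55000')                    # keep rows meeting the condition
    .groupby('Department')['Bonus']              # group the remaining rows
    .sum()                                       # total bonus per department
    .sort_values(ascending=False)                # largest total first
)
print(bonus_by_department)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer pipelines readable.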

### Practice Projects

To solidify your understanding, try these projects:

1. **Sales Analysis**: Load sales data and create a dashboard with top products, regional performance, and trends
2. **Student Performance**: Analyze student grades, identify patterns, and create summary reports
3. **Stock Market Data**: Download stock prices, calculate returns, and perform technical analysis
4. **Customer Segmentation**: Group customers based on purchasing behavior

### What's Next?

In the next notebook (Week 2D - Matplotlib), we'll explore:

- Creating various plot types
- Customizing visualizations
- Statistical plots
- Combining Pandas with Matplotlib for data visualization

### Final Challenge

Create a complete data analysis pipeline:
1. Load a dataset (create your own or use sample data)
2. Clean and prepare the data
3. Perform exploratory data analysis
4. Create summary statistics and visualizations
5. Export results to multiple formats

In [None]:
# Final Challenge: Complete Data Analysis Pipeline
# Your implementation here:

def analyze_dataset(data):
    """
    Perform comprehensive analysis on a dataset.
    
    Steps:
    1. Display basic information
    2. Handle missing values
    3. Generate summary statistics
    4. Perform groupby analysis
    5. Create derived features
    6. Export results
    """
    # TODO: Implement your analysis pipeline
    pass

# Create sample dataset and analyze
# Uncomment to test (define your own create_sample_data() first):
# sample_dataset = create_sample_data()
# analyze_dataset(sample_dataset)

---

## Resources for Further Learning

- **Official Pandas Documentation**: https://pandas.pydata.org/docs/
- **Pandas User Guide**: https://pandas.pydata.org/docs/user_guide/index.html
- **10 Minutes to Pandas**: https://pandas.pydata.org/docs/user_guide/10min.html
- **Pandas Cookbook**: https://pandas.pydata.org/docs/user_guide/cookbook.html
- **Real Python Pandas Tutorials**: https://realpython.com/learning-paths/pandas-data-science/

Remember: Pandas is best learned through practice. Work with real datasets, experiment with different methods, and don't be afraid to explore the documentation!

Happy Data Wrangling! 🐼📊