Python Fundamentals

Essential Programming Skills for Machine Learning

Building Your Data Science Foundation

ISM6251 | Week 2
Python • NumPy • Pandas • Matplotlib

Learning Objectives

By the end of this week, you will be able to:

  • Write Python code using variables, operators, and control flow
  • Create and use functions with various argument types
  • Work with Python data structures (lists, tuples, dictionaries)
  • Use list and dictionary comprehensions effectively
  • Create and manipulate NumPy arrays
  • Perform data analysis with Pandas DataFrames
  • Create visualizations with Matplotlib

Part 1: Python Basics

Variables, Data Types, Control Flow, and Functions

Foundation of Programming:
"Python's simplicity lets you become productive quickly, yet it scales to power some of the world's largest applications."

Core Concepts

  • Dynamic typing system
  • Variable assignment & naming
  • Basic data types
  • Type conversion

Control Structures

  • Conditional statements (if/elif/else)
  • Loop constructs (for/while)
  • Function definition & calls
  • Scope and parameters
Key Insight: Python's readable syntax and powerful features make it ideal for data science, allowing you to focus on solving problems rather than fighting with the language.

Variables and Data Types

Key Concepts

  • Dynamic Typing: Variables get their type from the assigned value
  • No Declaration: No need to declare variable types
  • Case Sensitive: age and Age are different variables
Common Types:
int - whole numbers
float - decimal numbers
str - text/strings
bool - True/False
# Python is dynamically typed
age = 25              # Integer
price = 19.99         # Float
name = "Alice"        # String
is_student = True     # Boolean

# Type checking
print(type(age))      
# Output: <class 'int'>

print(type(price))    
# Output: <class 'float'>

print(type(name))
# Output: <class 'str'>

# Type conversion
str_num = "42"
int_num = int(str_num)
float_num = float(str_num)

# Multiple assignment
x, y, z = 1, 2, 3
a = b = c = 0
 

Operators

Arithmetic

a = 10
b = 3
print(a + b)  # 13
print(a - b)  # 7
print(a * b)  # 30
print(a / b)  # 3.333...
print(a // b) # 3 (floor)
print(a % b)  # 1 (remainder)
print(a ** b) # 1000 (power)

Comparison

x = 5
y = 10
print(x < y)   # True
print(x <= y)  # True
print(x > y)   # False
print(x >= y)  # False
print(x == y)  # False
print(x != y)  # True

Strings

String Operations

  • Concatenation: + operator
  • Repetition: * operator
  • Indexing: str[0], str[-1]
  • Slicing: str[start:end:step]
Common Methods:
upper(), lower() - case conversion
strip(), split() - whitespace & splitting
replace(), find() - search & replace
format(), f-strings - formatting
# String creation
greeting = "Hello"
name = 'World'
message = greeting + " " + name
print(message)  # Hello World

# String formatting with f-strings
age = 25
height = 5.9
info = f"{name} is {age} years old, {height} ft tall"
print(info)  # World is 25 years old, 5.9 ft tall

# String methods
text = "  Python Programming  "
print(text.strip())      # Remove whitespace
print(text.upper())      # PYTHON PROGRAMMING
print(text.lower())      # python programming
print(text.replace("Python", "Java"))

# String slicing
word = "Python"
print(word[0])     # P
print(word[-1])    # n
print(word[0:3])   # Pyt
print(word[::2])   # Pto
print(word[::-1])  # nohtyP (reverse)

Control Flow

Conditional Logic

  • if: Basic condition
  • elif: Additional conditions
  • else: Default case
  • for: Iterate sequences
  • while: Condition-based loop
Remember:
• Indentation matters! (defines code blocks)
• Use 4 spaces (not tabs)
break exits the entire loop
continue skips to next iteration
# If-elif-else
score = 85
if score >= 90:
    grade = 'A'
elif score >= 80:
    grade = 'B'
elif score >= 70:
    grade = 'C'
else:
    grade = 'F'
print(f"Grade: {grade}")  # Grade: B

# For loop
fruits = ['apple', 'banana', 'orange']
for fruit in fruits:
    print(f"I like {fruit}")

# For with range
for i in range(5):
    print(i, end=' ')  # 0 1 2 3 4

# While loop
count = 0
while count < 3:
    print(f"Count: {count}")
    count += 1

# Break and continue
for i in range(10):
    if i == 7:
        break
    if i % 2 == 0:
        continue
    print(i, end=' ')
# Output: 1 3 5

Functions - Basics

Reusable Code Blocks

  • def: Define a function
  • return: Send value back
  • Docstrings: Document purpose
  • Scope: Local vs global
  • Lambda: Anonymous functions
Best Practices:
• Descriptive function names
• Single responsibility principle
• Document with docstrings
• Type hints (optional but helpful; sketched below)
# Basic function
def greet(name):
    """Simple greeting function"""
    return f"Hello, {name}!"

print(greet("Alice"))  # Hello, Alice!

# Default parameters
def power(base, exponent=2):
    return base ** exponent

print(power(3))      # 9 (3^2)
print(power(3, 3))   # 27 (3^3)

# Multiple return values
def calculate_stats(numbers):
    return (
        min(numbers), 
        max(numbers), 
        sum(numbers)/len(numbers)
    )

min_val, max_val, avg = calculate_stats([1,2,3,4,5])
print(f"Min: {min_val}, Max: {max_val}, Avg: {avg}")

# Lambda functions
square = lambda x: x**2
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))
print(squared)  # [1, 4, 9, 16, 25]

# Filter with lambda
evens = list(filter(lambda x: x%2==0, numbers))
print(evens)  # [2, 4]
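
The best-practices list above mentions type hints but the code does not show them; a minimal sketch of the same functions with annotations (hints document intent and help tools, but are not enforced at runtime):

# Same functions with optional type hints
def greet(name: str) -> str:
    """Greet a person by name."""
    return f"Hello, {name}!"

def power(base: float, exponent: int = 2) -> float:
    """Raise base to exponent (defaults to squaring)."""
    return base ** exponent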

Functions - *args and **kwargs

Variable Arguments

  • *args: Variable positional args
  • **kwargs: Variable keyword args
  • Flexibility: Accept any number
Order Matters:
1. Regular positional args
2. *args (variable positional)
3. Keyword args
4. **kwargs (variable keyword)
# *args: Variable positional arguments
def print_scores(*args):
    """Accept any number of scores"""
    if not args:
        print("No scores provided")
        return
    
    for i, score in enumerate(args, 1):
        print(f"Score {i}: {score}")
    print(f"Average: {sum(args)/len(args):.1f}")

print_scores(85, 90, 78, 92, 88)
# Score 1: 85
# Score 2: 90...
# Average: 86.6

# **kwargs: Variable keyword arguments
def create_profile(**kwargs):
    """Create profile from keyword args"""
    profile = {}
    for key, value in kwargs.items():
        profile[key] = value
    return profile

user = create_profile(
    name="Alice", age=25, 
    city="NYC", role="Developer"
)

# Combining all types
def process(op, *values, **options):
    print(f"Operation: {op}")
    print(f"Values: {values}")
    print(f"Options: {options}")

process("sum", 1, 2, 3, verbose=True)

Creating and Importing Python Modules

File: ml_utils.py

# ml_utils.py
"""Utility functions for ML tasks"""

def normalize(data):
    """Min-max normalization"""
    min_val = min(data)
    max_val = max(data)
    return [(x - min_val) / (max_val - min_val) 
            for x in data]

def train_test_split(data, test_size=0.2):
    """Simple train-test split"""
    split_idx = int(len(data) * (1 - test_size))
    return data[:split_idx], data[split_idx:]

def accuracy(y_true, y_pred):
    """Calculate accuracy"""
    correct = sum(1 for t, p in zip(y_true, y_pred) 
                  if t == p)
    return correct / len(y_true)

Using in Notebook

# In your notebook or script
import ml_utils

# Use the functions
data = [10, 20, 30, 40, 50]
normalized = ml_utils.normalize(data)
print(normalized)

# Or import specific functions
from ml_utils import train_test_split

train, test = train_test_split(data)
print(f"Train: {train}")
print(f"Test: {test}")

# Import with alias
import ml_utils as utils

result = utils.accuracy([1,0,1,1], [1,0,0,1])
print(f"Accuracy: {result:.2%}")

Best Practice: Organize reusable functions in .py files for clean, maintainable code.
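
A common companion to a module like ml_utils.py (a convention, not shown above) is a main guard, so quick tests run only when the file is executed directly and are skipped on import:

# At the bottom of ml_utils.py
if __name__ == "__main__":
    # Runs with `python ml_utils.py`, skipped on import
    print(normalize([10, 20, 30, 40, 50]))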

Part 2: Data Structures

Lists, Tuples, Dictionaries, and Comprehensions

Organizing Your Data:
"The right data structure can be the difference between a solution that works and one that scales."

Sequential Structures

  • Lists: Mutable, ordered sequences
  • Tuples: Immutable, fixed collections
  • Strings: Immutable text sequences
  • Ranges: Efficient number sequences

Mapping & Advanced

  • Dictionaries: Key-value mappings
  • Sets: Unique element collections
  • Comprehensions: Concise creation syntax
  • Generators: Memory-efficient iteration (sets and generators sketched below)
Key Insight: Mastering Python's built-in data structures enables efficient data manipulation, a crucial skill for machine learning preprocessing and feature engineering.
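
Sets and generators appear in the list above but get no code later in this deck; a quick sketch of both:

# Sets: unordered collections of unique elements
a = {1, 2, 2, 3}          # Duplicates collapse
print(a)                  # {1, 2, 3}
print(2 in a)             # True - fast membership test
print(a & {3, 4, 5})      # {3} (intersection); | gives the union

# Generator expressions: lazy, memory-efficient iteration
squares = (x**2 for x in range(1_000_000))  # Nothing computed yet
print(next(squares))                 # 0 - values produced on demand
print(sum(x**2 for x in range(10)))  # 285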

Lists

Mutable Sequences

  • Ordered: Items have index
  • Mutable: Can be modified
  • Mixed types: Any data type
  • Dynamic: Can grow/shrink
Common Methods:
append(), extend() - add items
insert() - insert at an index
remove() - remove by value
pop(), clear() - remove by index / remove all
sort(), reverse() - reorder in place
# Creating lists
numbers = [1, 2, 3, 4, 5]
mixed = [1, "hello", 3.14, True]
nested = [[1, 2], [3, 4], [5, 6]]

# List operations
fruits = ['apple', 'banana']
fruits.append('orange')      # Add to end
fruits.insert(1, 'grape')     # Insert at index
print(fruits)  
# ['apple', 'grape', 'banana', 'orange']

# Remove items
fruits.remove('grape')        # Remove by value
last = fruits.pop()           # Remove & return last
print(fruits)  # ['apple', 'banana']

# List slicing
nums = [0, 1, 2, 3, 4, 5]
print(nums[2:5])    # [2, 3, 4]
print(nums[::2])    # [0, 2, 4] (every 2nd)
print(nums[::-1])   # [5, 4, 3, 2, 1, 0]

# List methods
scores = [85, 92, 78, 95, 88]
scores.sort()
print(scores)  # [78, 85, 88, 92, 95]
print(f"Max: {max(scores)}, Min: {min(scores)}")
print(f"Sum: {sum(scores)}, Avg: {sum(scores)/len(scores)}")

Tuples

Immutable Sequences

  • Immutable: Cannot be changed
  • Ordered: Indexed like lists
  • Hashable: Can be dict keys
  • Memory efficient: Less overhead
Use Cases:
• Fixed collections (coordinates, RGB values)
• Function returns (multiple values)
• Dictionary keys (immutable required)
• Named tuples for structured records
# Creating tuples
point = (3, 4)
rgb = (255, 128, 0)
single = (42,)  # Note comma for single item

# Tuple unpacking
x, y = point
print(f"x: {x}, y: {y}")  # x: 3, y: 4

# Multiple assignment
def get_min_max(numbers):
    return min(numbers), max(numbers)

minimum, maximum = get_min_max([1, 2, 3, 4, 5])
print(f"Min: {minimum}, Max: {maximum}")

# Tuples are immutable
# point[0] = 5  # This would raise TypeError

# Named tuples (more readable)
from collections import namedtuple
Person = namedtuple('Person', ['name', 'age', 'city'])
alice = Person('Alice', 30, 'NYC')
print(alice.name)  # Alice
print(alice.age)   # 30

# Tuple as dictionary key (lists can't do this)
locations = {
    (40.7128, -74.0060): "New York",
    (51.5074, -0.1278): "London"
}

Dictionaries

Key-Value Pairs

  • Insertion-ordered: Keeps insertion order (Python 3.7+); no positional index
  • Mutable: Can modify
  • Fast lookup: O(1) access
  • Unique keys: No duplicates
Common Methods:
keys(), values(), items() - access parts
get(), setdefault() - safe access
update(), pop() - modify dict
clear(), copy() - manage dict
# Creating dictionaries
person = {
    'name': 'Alice',
    'age': 30,
    'city': 'NYC'
}

# Alternative creation
scores = dict(math=90, english=85, science=92)

# Accessing values
print(person['name'])  # Alice
print(person.get('age'))  # 30
print(person.get('job', 'Unknown'))  # Unknown

# Adding/updating
person['job'] = 'Developer'
person['age'] = 31

# Dictionary methods
print(person.keys())    # dict_keys(['name', 'age', 'city', 'job'])
print(person.values())  # dict_values(['Alice', 31, 'NYC', 'Developer'])

# Iterating
for key, value in person.items():
    print(f"{key}: {value}")

# Nested dictionaries
data = {
    'user1': {'name': 'Alice', 'age': 30},
    'user2': {'name': 'Bob', 'age': 25}
}
print(data['user1']['name'])  # Alice

Mutable vs Immutable

Understanding Mutability

  • Immutable: int, float, str, tuple
  • Mutable: list, dict, set
  • Assignment: Creates new reference
  • Modification: Changes in place
Important Points:
• Immutable = safer for sharing data
• Mutable = memory efficient updates
• Watch for aliasing issues with references
• Use copy() when needed to avoid side effects (deep copies sketched below)
# Immutable example
x = 5
y = x
x = 10
print(f"x: {x}, y: {y}")  # x: 10, y: 5

# String (immutable)
s1 = "hello"
s2 = s1
s1 = s1.upper()
print(f"s1: {s1}, s2: {s2}")  
# s1: HELLO, s2: hello

# Mutable example - aliasing
list1 = [1, 2, 3]
list2 = list1  # Both point to same list
list1.append(4)
print(f"list1: {list1}")  # [1, 2, 3, 4]
print(f"list2: {list2}")  # [1, 2, 3, 4] - also changed!

# Avoid aliasing with copy
list3 = [1, 2, 3]
list4 = list3.copy()  # or list3[:]
list3.append(4)
print(f"list3: {list3}")  # [1, 2, 3, 4]
print(f"list4: {list4}")  # [1, 2, 3] - unchanged

# Function arguments
def modify_list(lst):
    lst.append(100)  # Modifies original!
    
my_list = [1, 2, 3]
modify_list(my_list)
print(my_list)  # [1, 2, 3, 100]
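
As the notes above say, use copy() to avoid side effects; keep in mind it is a shallow copy, so nested lists are still shared. For fully independent copies of nested structures, the standard library's copy.deepcopy is the usual tool. A quick sketch:

import copy

nested = [[1, 2], [3, 4]]
shallow = nested.copy()        # New outer list, same inner lists
deep = copy.deepcopy(nested)   # Recursively copies everything

nested[0].append(99)
print(shallow[0])  # [1, 2, 99] - inner list was shared
print(deep[0])     # [1, 2]     - fully independent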

Nested Data Structures

Complex Structures

  • List of dicts: Records/rows
  • Dict of lists: Grouped data
  • Dict of dicts: Hierarchical
  • Mixed nesting: Any combination
Real-world Uses:
• JSON data representation
• Database records mapping
• Configuration file storage
• API response handling
# List of dictionaries (like database records)
students = [
    {'name': 'Alice', 'age': 20, 'grade': 85},
    {'name': 'Bob', 'age': 21, 'grade': 92},
    {'name': 'Charlie', 'age': 19, 'grade': 78}
]

# Access and modify
print(students[0]['name'])  # Alice
students[1]['grade'] = 95

# Add new student
students.append({'name': 'Diana', 'age': 22, 'grade': 88})

# Find students with grade > 80
high_performers = [s for s in students if s['grade'] > 80]
print(f"High performers: {len(high_performers)}")

# Average grade
avg_grade = sum(s['grade'] for s in students) / len(students)
print(f"Average grade: {avg_grade:.1f}")

More Nested Structures

Dictionary of Lists

# Dictionary of lists (grouped data)
courses = {
    'Math101': ['Alice', 'Bob', 'Charlie'],
    'CS201': ['Bob', 'Diana'],
    'Eng301': ['Alice', 'Charlie', 'Eve']
}

# Add student to course
courses['Math101'].append('Frank')

# Find all courses for a student
student = 'Alice'
alice_courses = [course for course, students 
                 in courses.items() 
                 if student in students]
print(f"{student}'s courses: {alice_courses}")
# Alice's courses: ['Math101', 'Eng301']

Dictionary of Lists of Dictionaries

# Complex nested structure
company = {
    'Engineering': [
        {'name': 'Alice', 'salary': 120000},
        {'name': 'Bob', 'salary': 105000}
    ],
    'Sales': [
        {'name': 'Charlie', 'salary': 85000},
        {'name': 'Diana', 'salary': 92000}
    ]
}

# Calculate department averages
for dept, employees in company.items():
    avg_salary = sum(e['salary'] for e in employees) / len(employees)
    print(f"{dept}: ${avg_salary:,.0f}")
    
# Engineering: $112,500
# Sales: $88,500

Working with Data Structures Using Functions

Functional Approach

  • Regular functions: Process complex data
  • Lambda functions: Quick transformations
  • Higher-order: Functions as arguments
Common Patterns:
map() - Transform elements
filter() - Select elements
reduce() - Aggregate data (example below)
sorted() - Custom sorting
# Regular function for complex processing
def process_student_data(students):
    """Calculate statistics for student records"""
    if not students:
        return {}
    
    grades = [s['grade'] for s in students]
    return {
        'count': len(students),
        'average': sum(grades) / len(grades),
        'highest': max(grades),
        'lowest': min(grades),
        'passing': len([g for g in grades if g >= 70])
    }

students = [
    {'name': 'Alice', 'grade': 85},
    {'name': 'Bob', 'grade': 92},
    {'name': 'Charlie', 'grade': 78},
    {'name': 'Diana', 'grade': 65}
]

stats = process_student_data(students)
print(f"Class average: {stats['average']:.1f}")
print(f"Passing students: {stats['passing']}/{stats['count']}")

# Lambda function for sorting complex structures
students_sorted = sorted(students, 
                        key=lambda s: s['grade'], 
                        reverse=True)
print(f"Top student: {students_sorted[0]['name']}")

# Using map with lambda on nested data
student_names = list(map(lambda s: s['name'].upper(), 
                         students))
print(student_names)  # ['ALICE', 'BOB', 'CHARLIE', 'DIANA']
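
The patterns list above also mentions reduce(); since Python 3 it lives in functools. A minimal sketch folding the same records into a single value:

from functools import reduce

# Accumulate all grades into one total
total = reduce(lambda acc, s: acc + s['grade'], students, 0)
print(total / len(students))  # 80.0 - matches the average above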

List Comprehensions

Concise List Creation

  • Syntax: [expr for item in iterable]
  • Filter: Add if condition
  • Nested: Multiple for loops
  • Faster: Than equivalent loops
Benefits:
• More readable and concise
• More Pythonic style
• Better performance than loops
• Less code to maintain
# Basic list comprehension
squares = [x**2 for x in range(10)]
print(squares)  
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# With condition
evens = [x for x in range(20) if x % 2 == 0]
print(evens)  
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

# Multiple conditions
filtered = [
    x for x in range(100) 
    if x % 2 == 0 and x % 3 == 0
]

# String operations
words = ['hello', 'world', 'python']
upper = [w.upper() for w in words]
# ['HELLO', 'WORLD', 'PYTHON']

# Nested comprehension
matrix = [
    [i+j for j in range(3)] 
    for i in range(3)
]
# [[0,1,2], [1,2,3], [2,3,4]]

# Flatten nested list
nested = [[1,2], [3,4], [5,6]]
flat = [x for sublist in nested for x in sublist]
# [1, 2, 3, 4, 5, 6]

List Comprehensions with Functions

Functions in Comprehensions

  • Regular functions: Named, reusable
  • Lambda functions: Inline, one-time
  • Built-in functions: map(), filter()
When to Use:
Regular: Complex logic
Lambda: Simple transforms
Comprehension: Clean, readable
# Regular function in comprehension
def is_valid_score(score):
    """Check if score is valid (0-100)"""
    return 0 <= score <= 100

scores = [85, 92, -5, 78, 105, 88, 95]
valid_scores = [s for s in scores if is_valid_score(s)]
print(valid_scores)  # [85, 92, 78, 88, 95]

# Lambda in comprehension
numbers = [1, 2, 3, 4, 5]
transformed = [(lambda x: x**2 + 2*x + 1)(n) for n in numbers]
print(transformed)  # [4, 9, 16, 25, 36]

# Equivalent using map with lambda
squared_plus = list(map(lambda x: x**2 + 2*x + 1, numbers))
print(squared_plus)  # [4, 9, 16, 25, 36]

# Complex example: process student records
students = [
    {'name': 'Alice', 'scores': [85, 90, 88]},
    {'name': 'Bob', 'scores': [78, 82, 85]},
    {'name': 'Charlie', 'scores': [92, 95, 90]}
]

# Calculate averages using regular function
def calculate_avg(scores):
    return sum(scores) / len(scores)

averages = [
    {'name': s['name'], 'avg': calculate_avg(s['scores'])}
    for s in students
]

# Same with lambda (less readable for complex logic)
averages_lambda = [
    {'name': s['name'], 
     'avg': (lambda scores: sum(scores)/len(scores))(s['scores'])}
    for s in students
]

Dictionary Comprehensions

Concise Dict Creation

  • Syntax: {key_expr: value_expr for item in iterable}
  • From Lists: Use zip()
  • Transform: Modify existing dicts
Common Uses:
• Invert dictionaries (swap keys/values)
• Filter by value conditions
• Transform values systematically
• Merge dictionaries efficiently (idioms shown below)
# Basic dictionary comprehension
squares_dict = {x: x**2 for x in range(5)}
print(squares_dict)  
# {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

# From two lists
names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]
people = {
    name: age 
    for name, age in zip(names, ages)
}
# {'Alice': 25, 'Bob': 30, 'Charlie': 35}

# With condition
adults = {
    name: age 
    for name, age in people.items() 
    if age >= 30
}
# {'Bob': 30, 'Charlie': 35}

# Invert dictionary
inverted = {v: k for k, v in people.items()}
# {25: 'Alice', 30: 'Bob', 35: 'Charlie'}

# Transform values
ages_in_months = {
    name: age * 12 
    for name, age in people.items()
}
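
The common-uses list above mentions merging dictionaries; the usual idioms are dict unpacking and, on Python 3.9+, the | operator. A quick sketch:

# Merging dicts: later keys win on conflicts
defaults = {'color': 'blue', 'size': 'M'}
overrides = {'size': 'L'}

merged = {**defaults, **overrides}
print(merged)  # {'color': 'blue', 'size': 'L'}

merged = defaults | overrides  # Python 3.9+
print(merged)  # {'color': 'blue', 'size': 'L'}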

Dictionary Comprehensions with Functions

Functions in Dict Comprehensions

  • Process keys: Transform dictionary keys
  • Process values: Calculate new values
  • Filter entries: Conditional inclusion
Best Practices:
• Keep comprehensions readable
• Extract complex logic to functions
• Consider performance vs readability
# Regular function for value processing
def calculate_tax(salary):
    """Calculate tax based on salary"""
    if salary < 50000:
        return salary * 0.10
    elif salary < 100000:
        return salary * 0.20
    else:
        return salary * 0.30

employees = {
    'Alice': 45000,
    'Bob': 75000,
    'Charlie': 120000,
    'Diana': 55000
}

# Dictionary comprehension with regular function
taxes = {
    name: calculate_tax(salary)
    for name, salary in employees.items()
}
print(taxes)
# {'Alice': 4500.0, 'Bob': 15000.0, 
#  'Charlie': 36000.0, 'Diana': 11000.0}

# Lambda for simple transformations
# Convert to after-tax income
after_tax = {
    name: salary - (lambda s: s * 0.2)(salary)
    for name, salary in employees.items()
}

# Process with multiple functions
def get_level(salary):
    return 'Senior' if salary > 80000 else 'Junior'

employee_info = {
    name: {
        'salary': salary,
        'tax': calculate_tax(salary),
        'level': get_level(salary),
        'monthly': salary / 12
    }
    for name, salary in employees.items()
}

# Filter using lambda
high_earners = {
    name: salary
    for name, salary in employees.items()
    if (lambda s: s > 70000)(salary)
}

Part 3: NumPy Fundamentals

Numerical Computing with Python

The Foundation of Scientific Python:
"NumPy is the fundamental package for scientific computing in Python. It's the foundation on which nearly all higher-level tools are built."

Core Advantages

  • Performance: 10-100x faster than lists
  • Vectorization: No explicit loops needed
  • Broadcasting: Smart array operations
  • Foundation: Powers pandas, scikit-learn

Key Concepts

  • N-dimensional arrays (ndarray)
  • Element-wise operations
  • Array reshaping & indexing
  • Mathematical functions
Key Insight: NumPy's vectorized operations eliminate the need for explicit loops, making your code both faster and more readable - essential for machine learning computations.

Creating NumPy Arrays

Array Creation Methods

  • From lists: np.array()
  • Zeros/Ones: Initialize with values
  • Range: np.arange(), np.linspace()
  • Random: Various distributions
Array Properties:
shape - array dimensions
dtype - data type of elements
size - total number of elements
ndim - number of dimensions
import numpy as np

# From list
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)  # [1 2 3 4 5]

# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix.shape)  # (3, 3)

# Initialize arrays
zeros = np.zeros((3, 4))      # 3x4 matrix of zeros
ones = np.ones((2, 3))         # 2x3 matrix of ones
full = np.full((3, 3), 7)     # 3x3 filled with 7
identity = np.eye(4)           # 4x4 identity matrix

# Sequences
range_arr = np.arange(0, 10, 2)  # [0 2 4 6 8]
linear = np.linspace(0, 1, 5)    # [0. 0.25 0.5 0.75 1.]

# Random arrays
random_uniform = np.random.rand(3, 3)  # Uniform [0,1)
random_normal = np.random.randn(3, 3)  # Normal(0,1)
random_int = np.random.randint(0, 10, size=(3, 3))

# Array properties
print(f"Shape: {matrix.shape}")
print(f"Data type: {matrix.dtype}")
print(f"Size: {matrix.size}")
print(f"Dimensions: {matrix.ndim}")

Array Indexing and Slicing

Accessing Elements

  • 1D: Like Python lists
  • 2D: [row, column]
  • Slicing: Select subarrays
  • Boolean: Conditional selection
Important:
• Views vs copies (memory implications; demo below)
• Negative indexing (from end)
• Fancy indexing (array of indices)
• Broadcasting rules (shape matching)
# 1D array indexing
arr = np.array([10, 20, 30, 40, 50])
print(arr[0])     # 10
print(arr[-1])    # 50
print(arr[1:4])   # [20 30 40]

# 2D array indexing
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(matrix[0, 0])    # 1 (first element)
print(matrix[1, :])    # [4 5 6] (second row)
print(matrix[:, 2])    # [3 6 9] (third column)
print(matrix[0:2, 1:3])  # [[2 3], [5 6]]

# Boolean indexing
arr = np.array([1, 2, 3, 4, 5])
mask = arr > 3
print(mask)  # [False False False True True]
print(arr[mask])  # [4 5]

# Direct boolean indexing
matrix = np.random.randint(0, 10, (4, 4))
print(matrix[matrix > 5])  # All elements > 5

# Fancy indexing
arr = np.array([10, 20, 30, 40, 50])
indices = [0, 2, 4]
print(arr[indices])  # [10 30 50]

# Modify using indexing
matrix[0, 0] = 100
matrix[matrix < 5] = 0  # Set all values < 5 to 0
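
The notes above flag views vs copies: basic slices are views onto the same memory, so writing through a slice changes the original array. A quick sketch:

arr = np.array([1, 2, 3, 4, 5])
view = arr[1:4]          # Basic slice: a view, shares memory
view[0] = 99
print(arr)               # [ 1 99  3  4  5] - original changed!

independent = arr[1:4].copy()  # Explicit copy: separate memory
independent[0] = 0
print(arr)               # [ 1 99  3  4  5] - unchanged this time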

Array Operations and Broadcasting

Vectorized Operations

  • Element-wise: +, -, *, /
  • Broadcasting: Different shapes
  • Aggregations: sum, mean, std
  • Matrix ops: dot product, transpose
Broadcasting Rules:
1. Compare shapes right to left
2. Dimensions are compatible if equal or 1
3. Arrays are broadcast to match shapes
# Element-wise operations
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print(a + b)  # [11 22 33 44]
print(a * b)  # [10 40 90 160]
print(b / a)  # [10. 10. 10. 10.]
print(a ** 2)  # [1 4 9 16]

# Broadcasting with scalar
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
print(arr * 2)  # All elements multiplied by 2

# Broadcasting with different shapes
row = np.array([1, 2, 3])     # Shape: (3,)
col = np.array([[10], [20]])  # Shape: (2, 1)
result = row + col  # Broadcasting!
# [[11 12 13],
#  [21 22 23]]

# Aggregations
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(matrix.sum())       # 45 (all elements)
print(matrix.sum(axis=0))  # [12 15 18] (column sums)
print(matrix.sum(axis=1))  # [6 15 24] (row sums)
print(matrix.mean())      # 5.0
print(matrix.std())       # 2.58...

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B))  # Matrix multiplication
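print(A @ B)         # @ operator: same as np.dot for 2-D arrays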
print(A.T)           # Transpose
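
A sketch of the three broadcasting rules above applied to explicit shapes (size-1 dimensions are stretched to match):

a = np.ones((3, 1))   # Shape (3, 1)
b = np.ones((1, 4))   # Shape (1, 4)
print((a + b).shape)  # (3, 4): each size-1 dimension is broadcast

# Incompatible shapes raise an error:
# np.ones(3) + np.ones(2)  # ValueError: operands could not be broadcast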

Statistical Functions

NumPy Statistics

  • Basic: mean, median, std
  • Extremes: min, max, argmin, argmax
  • Percentiles: quantile, percentile
  • Correlation: corrcoef, cov
Axis Parameter:
None: Entire array (flattened)
0: Along columns (vertically)
1: Along rows (horizontally)
# Generate sample data
np.random.seed(42)
data = np.random.randn(1000) * 15 + 100

# Basic statistics
print(f"Mean: {data.mean():.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Std Dev: {data.std():.2f}")
print(f"Variance: {data.var():.2f}")

# Min/Max and their indices
print(f"Min: {data.min():.2f}")
print(f"Max: {data.max():.2f}")
print(f"Index of min: {data.argmin()}")
print(f"Index of max: {data.argmax()}")

# Percentiles
print(f"25th percentile: {np.percentile(data, 25):.2f}")
print(f"75th percentile: {np.percentile(data, 75):.2f}")

# 2D array statistics
matrix = np.random.randint(0, 100, (4, 5))
print(matrix)

# Statistics along axes
print(f"Column means: {matrix.mean(axis=0)}")
print(f"Row means: {matrix.mean(axis=1)}")
print(f"Column max: {matrix.max(axis=0)}")

# Correlation
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.3f}")

Why NumPy? (Performance)

NumPy vs Pure Python

  • Vectorization: C-optimized loops
  • Memory: Contiguous storage
  • Type consistency: One dtype per array, no per-element type checks
  • Parallelization: BLAS/LAPACK
Performance Tips:
• Prefer vectorized operations over explicit Python loops
• Preallocate arrays for known sizes
• Use appropriate dtypes (int32 vs int64)
import time

# Python list operation
def python_sum_squares(n):
    """Sum of squares using Python list"""
    numbers = list(range(n))
    result = []
    for num in numbers:
        result.append(num ** 2)
    return sum(result)

# NumPy operation
def numpy_sum_squares(n):
    """Sum of squares using NumPy"""
    numbers = np.arange(n)
    return (numbers ** 2).sum()

# Timing comparison
n = 1000000

start = time.time()
python_result = python_sum_squares(n)
python_time = time.time() - start

start = time.time()
numpy_result = numpy_sum_squares(n)
numpy_time = time.time() - start

print(f"Python time: {python_time:.4f} seconds")
print(f"NumPy time: {numpy_time:.4f} seconds")
print(f"NumPy is {python_time/numpy_time:.1f}x faster")

# Output (typical):
# Python time: 0.1234 seconds
# NumPy time: 0.0023 seconds
# NumPy is 53.7x faster

# Memory efficiency
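# (Rough comparison: getsizeof counts only the list object itself,
#  not the int objects it references)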
import sys
py_list = list(range(1000))
np_array = np.arange(1000)
print(f"Python list size: {sys.getsizeof(py_list)} bytes")
print(f"NumPy array size: {np_array.nbytes} bytes")

Part 4: Pandas Fundamentals

Powerful Data Structures for Analysis

The Swiss Army Knife of Data Analysis:
"Pandas provides high-performance, easy-to-use data structures and data analysis tools that make working with structured data fast and intuitive."

Core Capabilities

  • DataFrames: Tabular data structure
  • Missing data: Built-in handling
  • I/O tools: Read/write various formats
  • Time series: Date/time functionality

Key Features

  • Label-based indexing
  • Group by operations
  • Merge, join, and concatenate
  • Pivot tables and reshaping
Key Insight: Pandas bridges the gap between Python and data analysis, providing Excel-like functionality with the power of programming - essential for data preprocessing in ML.

Creating DataFrames

DataFrame Creation

  • From dict: Column-oriented
  • From list: Row-oriented
  • From CSV: pd.read_csv()
  • From Excel: pd.read_excel() (readers sketched below)
DataFrame Structure:
Rows: Index (0, 1, 2...)
Columns: Named fields
Values: Any data type
import pandas as pd

# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['NYC', 'Paris', 'London', 'Tokyo'],
    'Salary': [70000, 80000, 75000, 90000]
}
df = pd.DataFrame(data)

print(df)
# Output:
#       Name  Age    City  Salary
# 0    Alice   25     NYC   70000
# 1      Bob   30   Paris   80000
# 2  Charlie   35  London   75000
# 3    Diana   28   Tokyo   90000

# DataFrame info
print(df.shape)  
# Output: (4, 4)

print(df.columns.tolist())
# Output: ['Name', 'Age', 'City', 'Salary']

print(df.dtypes)
# Output:
# Name      object
# Age        int64
# City      object
# Salary     int64
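
The creation list above also names the file readers, which the code does not show; a minimal sketch (the file names here are placeholders, not course files):

# Reading and writing tabular files (hypothetical file names)
df_csv = pd.read_csv('employees.csv')
df_xls = pd.read_excel('employees.xlsx')  # requires openpyxl
df.to_csv('output.csv', index=False)      # write without the index column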

Data Selection

Accessing DataFrame Data

  • Columns: df['col'] or df[['col1', 'col2']]
  • Rows by position: df.iloc[index]
  • Rows by label: df.loc[label]
  • Boolean indexing: df[condition]
Selection Methods:
[ ] - Column or boolean indexing
iloc[ ] - Integer position based
loc[ ] - Label-based access
query() - String expression filtering
# Column selection
print(df['Name'])        
# Output: Series with names
# 0      Alice
# 1        Bob
# 2    Charlie
# 3      Diana

print(df[['Name', 'Salary']])  
# Output: DataFrame with 2 columns

# Row selection
print(df.iloc[0])        
# Output: First row as Series
# Name      Alice
# Age          25
# City        NYC
# Salary    70000

print(df.loc[0:2])       
# Output: DataFrame with rows 0-2

# Conditional selection
print(df[df['Age'] > 30])
# Output: Rows where Age > 30

print(df[(df['Age'] > 25) & 
         (df['Salary'] > 75000)])
# Output: Multiple conditions

# Using query
print(df.query('Age > 30 and Salary < 80000'))
# Output: String-based filtering

Data Manipulation

Modifying DataFrames

  • Add columns: Direct assignment
  • Modify: Update in place
  • Sort: By values or index
  • Group: Split-apply-combine
Missing Data:
dropna() - Remove NaN values
fillna() - Replace NaN with value
interpolate() - Fill gaps smartly
# Add new column
df['Bonus'] = df['Salary'] * 0.1
print(df['Bonus'].head(2))
# Output: 
# 0    7000.0
# 1    8000.0

# Modify existing column
df['Salary'] = df['Salary'] * 1.05

# Sort data
df_sorted = df.sort_values('Salary', 
                           ascending=False)
print(df_sorted[['Name', 'Salary']].head(2))
# Output:
#     Name    Salary
# 3  Diana  94500.00
# 1    Bob  84000.00

# Group by operations
df_grouped = df.groupby('City')['Salary'].mean()
print(df_grouped)
# Output:
# City
# London    78750.0
# NYC       73500.0
# Paris     84000.0
# Tokyo     94500.0

# Handling missing data
df_clean = df.dropna()        # Drop rows with NaN
df_filled = df.fillna(0)      # Fill NaN with 0
df_ffilled = df.ffill()       # Forward fill (fillna(method='ffill') is deprecated)

Aggregation and Statistics

Statistical Analysis

  • describe(): Summary statistics
  • Aggregations: mean, sum, count
  • Group by: Split-apply-combine
  • Pivot tables: Reshape data
Aggregate Functions:
sum, mean, median - Basic stats
min, max, std - Range & spread
count, nunique - Counting
apply() - Custom functions (example below)
# Basic statistics (original df, before the 5% raise)
print(df.describe())
# Output:
#             Age        Salary
# count   4.0000       4.000000
# mean   29.5000   78750.000000
# std     4.2032    8539.125638
# min    25.0000   70000.000000
# 25%    27.2500   73750.000000
# 50%    29.0000   77500.000000
# 75%    31.2500   82500.000000
# max    35.0000   90000.000000

# Single column stats
print(f"Mean: {df['Salary'].mean():.2f}")
# Output: Mean: 78750.00

# Group by aggregation
grouped = df.groupby('City').agg({
    'Salary': ['mean', 'max', 'min'],
    'Age': 'mean'
})
print(grouped)
# Output: Multi-level column DataFrame

# Value counts
print(df['City'].value_counts())
# Output:
# NYC      1
# Paris    1
# London   1
# Tokyo    1
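
The aggregate-functions list above mentions custom functions with apply(); a minimal sketch against the same df:

# apply(): run a custom function on each value in a column
def salary_band(salary):
    return 'High' if salary > 80000 else 'Standard'

df['Band'] = df['Salary'].apply(salary_band)
print(df[['Name', 'Band']])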

Part 5: Matplotlib Basics

Creating Meaningful Visual Insights

A Picture is Worth a Thousand Data Points:
"Visualization gives you answers to questions you didn't know you had. It's not just about making pretty pictures, but about understanding."

Why Visualization?

  • Explore: Understand your data
  • Communicate: Share insights
  • Validate: Check assumptions
  • Present: Publication-quality figures

Essential Plot Types

  • Scatter: Relationships & correlations
  • Histogram: Data distributions
  • Bar/Column: Category comparisons
  • Line: Trends over time
Key Insight: Effective visualization is crucial for exploratory data analysis and communicating machine learning results - it helps you spot patterns, outliers, and validate model assumptions.

Scatter Plots

Visualizing Relationships

  • Purpose: Show correlation between variables
  • X-axis: Independent variable
  • Y-axis: Dependent variable
  • Patterns: Linear, non-linear, clusters
Customization Options:
alpha: Transparency (0-1)
c: Color (can be array)
s: Size (can be array)
marker: Point style (o, ^, s, etc.)
import matplotlib.pyplot as plt
import numpy as np

# Generate data
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6, c=x, 
            cmap='viridis', s=50)

# Labels and title
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.title('Scatter Plot Example')

# Add grid
plt.grid(True, alpha=0.3)

# Add colorbar
plt.colorbar(label='X Value')

# Show plot
plt.show()

# Correlation coefficient
corr = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {corr:.3f}")
# Output: Correlation: 0.964

Histograms

Distribution Analysis

  • Purpose: Show data distribution
  • Bins: Group continuous data
  • Height: Frequency or count
  • Shape: Normal, skewed, bimodal
Key Statistics:
Mean: Central tendency
Median: Middle value
Std Dev: Data spread
Skewness: Distribution asymmetry
# Generate data
data = np.random.normal(100, 15, 1000)

# Create histogram
plt.figure(figsize=(10, 6))
counts, bins, patches = plt.hist(
    data, bins=30, 
    edgecolor='black', 
    alpha=0.7,
    color='steelblue'
)

# Add vertical line at mean
mean_val = data.mean()
plt.axvline(mean_val, color='red', 
            linestyle='--', linewidth=2,
            label=f'Mean: {mean_val:.1f}')

# Labels and title
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution of Values')

# Add legend and grid
plt.legend()
plt.grid(True, alpha=0.3, axis='y')

plt.show()

# Print statistics
print(f"Mean: {data.mean():.2f}")
print(f"Std Dev: {data.std():.2f}")
print(f"Min: {data.min():.2f}")
print(f"Max: {data.max():.2f}")
# Output (values will vary):
# Mean: 100.12
# Std Dev: 14.87
# Min: 52.84
# Max: 143.56

Bar Charts

Categorical Comparisons

  • Purpose: Compare discrete categories
  • Vertical: Standard comparison
  • Horizontal: Long category names
  • Grouped: Multiple series
Best Practices:
• Sort by value for clarity
• Add value labels on bars
• Use consistent colors
• Start y-axis at zero
# Create data
categories = ['Product A', 'Product B', 
              'Product C', 'Product D']
values = [23, 45, 56, 78]

# Vertical bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, 
               color='coral',
               edgecolor='black',
               alpha=0.7)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.,
             height, f'{height}',
             ha='center', va='bottom')

# Labels and title
plt.xlabel('Products')
plt.ylabel('Sales (in thousands)')
plt.title('Bar Chart Example')

# Add grid
plt.grid(True, alpha=0.3, axis='y')

plt.show()

# Calculate total and average
total = sum(values)
avg = total / len(values)
print(f"Total Sales: {total}k")
print(f"Average: {avg:.1f}k")
# Output:
# Total Sales: 202k
# Average: 50.5k

Multiple Subplots

Subplot Layouts

  • subplots(): Create grid of axes
  • Indexing: axes[row, col]
  • tight_layout(): Auto-spacing
  • suptitle(): Overall title
Common Patterns:
2x2: Four related plots
1x3: Side-by-side comparison
3x1: Vertical stack
Custom: GridSpec for complex layouts
fig, axes = plt.subplots(2, 2, 
                        figsize=(12, 10))

# Scatter plot (top-left)
axes[0, 0].scatter(x, y, alpha=0.6)
axes[0, 0].set_title('Scatter Plot')
axes[0, 0].set_xlabel('X')
axes[0, 0].set_ylabel('Y')

# Histogram (top-right)
axes[0, 1].hist(data, bins=20, 
                color='green', alpha=0.7)
axes[0, 1].set_title('Histogram')
axes[0, 1].set_xlabel('Value')
axes[0, 1].set_ylabel('Frequency')

# Bar chart (bottom-left)
axes[1, 0].bar(categories, values, 
               color='orange')
axes[1, 0].set_title('Bar Chart')
axes[1, 0].set_xlabel('Category')

# Line plot (bottom-right)
x_line = np.linspace(0, 10, 100)
y_line = np.sin(x_line)
axes[1, 1].plot(x_line, y_line, 'r-')
axes[1, 1].set_title('Line Plot')

# Overall title
plt.suptitle('Dashboard', fontsize=14)
plt.tight_layout()
plt.show()
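
One addition worth knowing for the notebooks: Matplotlib's savefig writes a figure to disk (the file name and options here are just one reasonable choice):

# Save before plt.show(); some backends clear the figure on close
fig.savefig('dashboard.png', dpi=150, bbox_inches='tight')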

Practice Exercises

Complete the Week 2 Jupyter Notebooks:

  1. week02a-python-fundamentals.ipynb
    • Variables, operators, control flow
    • Functions and comprehensions
  2. week02b-numpy-fundamentals.ipynb
    • Array creation and manipulation
    • Broadcasting and vectorization
  3. week02c-pandas-fundamentals.ipynb
    • DataFrames and Series
    • Data selection and aggregation
  4. week02d-matplotlib-fundamentals.ipynb
    • Creating various plot types
    • Customization and styling

Key Takeaways

Essential Skills for Machine Learning:

  • Python Basics: Foundation for all programming tasks
  • Functions: Reusable code blocks and modular programming
  • Data Structures: Lists, tuples, dictionaries, and comprehensions
  • NumPy: Efficient numerical computations
  • Pandas: Data manipulation and analysis
  • Matplotlib: Data visualization

Remember: These tools form the foundation for all machine learning work in Python!

Next Week: Data Preparation

Week 3 Preview:

  • Data cleaning techniques
  • Handling missing values
  • Feature scaling and normalization
  • Encoding categorical variables
  • Feature engineering basics

Keep practicing with the notebooks!