Data Manipulation

Learning Objectives

By the end of this lesson, you will be able to:
- Filter DataFrames using various conditions
- Select data using different methods
- Group data by columns
- Perform aggregations on grouped data
- Merge and join DataFrames
- Understand different join types
- Combine data from multiple sources
- Apply advanced filtering techniques
- Use pivot tables
- Transform data effectively
- Handle complex data manipulation tasks
- Debug data manipulation issues

Lesson 22.2: Data Manipulation

Learning Objectives

By the end of this lesson, you will be able to:

Filter DataFrames using various conditions
Select data using different methods
Group data by columns
Perform aggregations on grouped data
Merge and join DataFrames
Understand different join types
Combine data from multiple sources
Apply advanced filtering techniques
Use pivot tables
Transform data effectively
Handle complex data manipulation tasks
Debug data manipulation issues

Filtering and Selection

Boolean Indexing

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR']
})

# Single condition
filtered = df[df['age'] > 30]
print(filtered)

# Multiple conditions (AND)
filtered = df[(df['age'] > 30) & (df['department'] == 'IT')]
print(filtered)

# Multiple conditions (OR)
filtered = df[(df['age'] > 35) | (df['department'] == 'Sales')]
print(filtered)

# Using isin()
filtered = df[df['department'].isin(['IT', 'HR'])]
print(filtered)

# Using not
filtered = df[~df['department'].isin(['IT'])]
print(filtered)

Using query() Method

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# Query with single condition
filtered = df.query('age > 30')
print(filtered)

# Query with multiple conditions
filtered = df.query('age > 25 and salary > 55000')
print(filtered)

# Query with variables
min_age = 30
filtered = df.query('age > @min_age')
print(filtered)

Using loc and iloc

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, 70000, 80000]
})

# Using iloc (integer position)
print(df.iloc[0])              # First row
print(df.iloc[0:2])            # First 2 rows
print(df.iloc[0:2, 0:2])       # First 2 rows, first 2 columns
print(df.iloc[[0, 2]])         # Rows 0 and 2

# Using loc (label-based)
df_indexed = df.set_index('name')
print(df_indexed.loc['Alice'])  # Row with label 'Alice'
print(df_indexed.loc['Alice':'Charlie'])  # Rows from Alice to Charlie
print(df_indexed.loc[['Alice', 'Charlie']])  # Specific rows

# Using loc with conditions
filtered = df.loc[df['age'] > 30, ['name', 'salary']]
print(filtered)

String Methods

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee'],
    'email': ['alice@example.com', 'bob@test.com', 'charlie@example.com', 'david@test.com']
})

# Contains
filtered = df[df['name'].str.contains('Smith')]
print(filtered)

# Starts with
filtered = df[df['name'].str.startswith('A')]
print(filtered)

# Ends with
filtered = df[df['email'].str.endswith('example.com')]
print(filtered)

# Case insensitive
filtered = df[df['name'].str.contains('smith', case=False)]
print(filtered)

Grouping and Aggregation

Basic Grouping

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'Sales'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [50000, 60000, 55000, 70000, 65000, 60000]
})

# Group by single column
grouped = df.groupby('department')
print(grouped.groups)  # Shows groups

# Iterate over groups
for name, group in grouped:
    print(f"\n{name}:")
    print(group)

Aggregation Functions

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'Sales'],
    'salary': [50000, 60000, 55000, 70000, 65000, 60000],
    'age': [25, 30, 35, 40, 45, 28]
})

# Single aggregation
result = df.groupby('department')['salary'].mean()
print(result)

# Multiple aggregations
result = df.groupby('department')['salary'].agg(['mean', 'sum', 'count', 'min', 'max'])
print(result)

# Different aggregations for different columns
result = df.groupby('department').agg({
    'salary': ['mean', 'sum'],
    'age': 'mean'
})
print(result)

# Custom aggregation function
def salary_range(group):
    return group.max() - group.min()

result = df.groupby('department')['salary'].agg(salary_range)
print(result)

Common Aggregation Functions

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT'],
    'salary': [50000, 60000, 55000, 70000]
})

grouped = df.groupby('department')['salary']

# Common aggregations
print(grouped.sum())      # Sum
print(grouped.mean())     # Mean
print(grouped.median())   # Median
print(grouped.std())      # Standard deviation
print(grouped.var())      # Variance
print(grouped.min())     # Minimum
print(grouped.max())      # Maximum
print(grouped.count())   # Count
print(grouped.size())     # Size (includes NaN)
print(grouped.first())    # First value
print(grouped.last())     # Last value

Grouping by Multiple Columns

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT', 'Sales', 'IT'],
    'location': ['NY', 'NY', 'LA', 'LA', 'NY', 'LA'],
    'salary': [50000, 60000, 55000, 70000, 60000, 75000]
})

# Group by multiple columns
result = df.groupby(['department', 'location'])['salary'].mean()
print(result)

# Unstack to create pivot-like structure
result = df.groupby(['department', 'location'])['salary'].mean().unstack()
print(result)

Transform and Apply

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT'],
    'salary': [50000, 60000, 55000, 70000]
})

# Transform (returns same shape as original)
df['salary_normalized'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df)

# Apply custom function
def custom_func(group):
    return group['salary'].sum() / len(group)

result = df.groupby('department').apply(custom_func)
print(result)

Merging and Joining

Merge Basics

import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'department_id': [1, 2, 1, 3]
})

df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'department': ['Sales', 'IT', 'HR']
})

# Merge on common column
merged = pd.merge(df1, df2, left_on='department_id', right_on='id', suffixes=('_emp', '_dept'))
print(merged)

Join Types

import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value_left': [1, 2, 3, 4]
})

right = pd.DataFrame({
    'key': ['B', 'C', 'D', 'E'],
    'value_right': [5, 6, 7, 8]
})

# Inner join (default)
inner = pd.merge(left, right, on='key', how='inner')
print("Inner join:")
print(inner)

# Left join
left_join = pd.merge(left, right, on='key', how='left')
print("\nLeft join:")
print(left_join)

# Right join
right_join = pd.merge(left, right, on='key', how='right')
print("\nRight join:")
print(right_join)

# Outer join (full outer)
outer = pd.merge(left, right, on='key', how='outer')
print("\nOuter join:")
print(outer)

Merge on Multiple Columns

import pandas as pd

df1 = pd.DataFrame({
    'first_name': ['Alice', 'Bob', 'Charlie'],
    'last_name': ['Smith', 'Johnson', 'Brown'],
    'age': [25, 30, 35]
})

df2 = pd.DataFrame({
    'first_name': ['Alice', 'Bob', 'David'],
    'last_name': ['Smith', 'Johnson', 'Lee'],
    'salary': [50000, 60000, 70000]
})

# Merge on multiple columns
merged = pd.merge(df1, df2, on=['first_name', 'last_name'], how='inner')
print(merged)

Using join() Method

import pandas as pd

left = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
}, index=['K0', 'K1', 'K2'])

right = pd.DataFrame({
    'C': ['C0', 'C1', 'C2'],
    'D': ['D0', 'D1', 'D2']
}, index=['K0', 'K1', 'K3'])

# Join on index
result = left.join(right)
print(result)

# Join with how parameter
result = left.join(right, how='outer')
print(result)

Concatenation

import pandas as pd

df1 = pd.DataFrame({
    'A': ['A0', 'A1'],
    'B': ['B0', 'B1']
})

df2 = pd.DataFrame({
    'A': ['A2', 'A3'],
    'B': ['B2', 'B3']
})

# Concatenate vertically (stack rows)
result = pd.concat([df1, df2], ignore_index=True)
print(result)

# Concatenate horizontally (stack columns)
df3 = pd.DataFrame({
    'C': ['C0', 'C1'],
    'D': ['D0', 'D1']
})

result = pd.concat([df1, df3], axis=1)
print(result)

# Concatenate with keys
result = pd.concat([df1, df2], keys=['first', 'second'])
print(result)

Pivot Tables

Basic Pivot

import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product': ['A', 'B', 'A', 'B'],
    'sales': [100, 200, 150, 250]
})

# Create pivot table
pivot = df.pivot_table(values='sales', index='date', columns='product', aggfunc='sum')
print(pivot)

Pivot with Aggregation

import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product': ['A', 'A', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 120, 250],
    'region': ['North', 'South', 'North', 'South', 'North']
})

# Pivot with multiple aggregations
pivot = df.pivot_table(
    values='sales',
    index='date',
    columns='product',
    aggfunc=['sum', 'mean']
)
print(pivot)

# Pivot with multiple index columns
pivot = df.pivot_table(
    values='sales',
    index=['date', 'region'],
    columns='product',
    aggfunc='sum'
)
print(pivot)

Advanced Filtering

Using apply() for Complex Filtering

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# Filter using apply
def high_earner(row):
    return row['salary'] > 55000 and row['age'] < 35

filtered = df[df.apply(high_earner, axis=1)]
print(filtered)

Using query() with Complex Conditions

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000],
    'department': ['Sales', 'IT', 'Sales']
})

# Complex query
filtered = df.query('(age > 25 and salary > 55000) or department == "Sales"')
print(filtered)

Practical Examples

Example 1: Sales Analysis

import pandas as pd

# Sample sales data
sales_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'sales': [100, 200, 150, 300, 250, 120, 350, 280, 110, 320],
    'region': ['North', 'South', 'North', 'East', 'South', 'North', 'East', 'South', 'North', 'East']
})

# Group by product
product_sales = sales_data.groupby('product')['sales'].agg(['sum', 'mean', 'count'])
print("Sales by Product:")
print(product_sales)

# Group by region
region_sales = sales_data.groupby('region')['sales'].sum()
print("\nSales by Region:")
print(region_sales)

# Pivot table
pivot = sales_data.pivot_table(
    values='sales',
    index='date',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print("\nPivot Table:")
print(pivot)

Example 2: Employee Analysis

import pandas as pd

# Employee data
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'department_id': [1, 2, 1, 3, 2],
    'salary': [50000, 60000, 55000, 70000, 65000]
})

# Department data
departments = pd.DataFrame({
    'department_id': [1, 2, 3],
    'name': ['Sales', 'IT', 'HR']
})

# Merge
merged = pd.merge(employees, departments, on='department_id', suffixes=('_emp', '_dept'))
print("Merged Data:")
print(merged)

# Analysis
dept_stats = merged.groupby('name_dept').agg({
    'salary': ['mean', 'sum', 'count'],
    'employee_id': 'count'
})
print("\nDepartment Statistics:")
print(dept_stats)

Common Mistakes and Pitfalls

1. SettingWithCopyWarning

# WRONG: May cause warning
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
subset = df[df['a'] > 1]
subset['c'] = 10  # Warning!

# CORRECT: Use copy()
subset = df[df['a'] > 1].copy()
subset['c'] = 10  # No warning

2. Forgetting to Reset Index

# After filtering, index may not be sequential
df_filtered = df[df['age'] > 30]
print(df_filtered.index)  # May have gaps

# Reset index if needed
df_filtered = df[df['age'] > 30].reset_index(drop=True)
print(df_filtered.index)  # Sequential

3. Merge Key Mismatch

# WRONG: Keys don't match
df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'ID': [1, 2, 3]})  # Different case
# merged = pd.merge(df1, df2, on='id')  # Error!

# CORRECT: Specify correct keys
merged = pd.merge(df1, df2, left_on='id', right_on='ID')

Best Practices

1. Use Vectorized Operations

# WRONG: Use apply for simple operations
df['new_col'] = df.apply(lambda row: row['a'] * 2, axis=1)

# CORRECT: Use vectorized operations
df['new_col'] = df['a'] * 2

2. Chain Operations Efficiently

# Chain operations for readability
result = (df[df['age'] > 30]
          .groupby('department')
          .agg({'salary': 'mean'})
          .sort_values('salary', ascending=False))

3. Use Appropriate Join Types

# Choose the right join type for your use case
# Inner: Only matching records
# Left: All left records + matching right
# Right: All right records + matching left
# Outer: All records from both

Practice Exercise

Exercise: Data Manipulation

Objective: Create a program that demonstrates data manipulation operations.

Instructions:

Create or load sample data
Perform filtering and selection operations
Group data and perform aggregations
Merge/join multiple DataFrames
Create pivot tables
Apply advanced data manipulation techniques

Example Solution:

"""
Data Manipulation Exercise
This program demonstrates various data manipulation operations.
"""

import pandas as pd
import numpy as np

print("=" * 50)
print("Data Manipulation Exercise")
print("=" * 50)

# 1. Create sample data
print("\n1. Creating Sample Data:")
print("-" * 50)

employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'department_id': [1, 2, 1, 3, 2, 1, 3, 2],
    'salary': [50000, 60000, 55000, 70000, 65000, 58000, 72000, 62000],
    'age': [25, 30, 35, 40, 28, 32, 38, 29]
})

departments = pd.DataFrame({
    'department_id': [1, 2, 3],
    'name': ['Sales', 'IT', 'HR'],
    'location': ['NY', 'LA', 'NY']
})

print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

# 2. Filtering
print("\n2. Filtering Operations:")
print("-" * 50)

# Filter by age
young_employees = employees[employees['age'] < 30]
print("Employees under 30:")
print(young_employees[['name', 'age', 'salary']])

# Filter by multiple conditions
high_earners = employees[(employees['salary'] > 60000) & (employees['age'] < 35)]
print("\nHigh earners under 35:")
print(high_earners[['name', 'age', 'salary']])

# Filter using isin
sales_it = employees[employees['department_id'].isin([1, 2])]
print("\nSales and IT employees:")
print(sales_it[['name', 'department_id']])

# Filter using query
filtered = employees.query('salary > 60000 and age < 35')
print("\nUsing query method:")
print(filtered[['name', 'age', 'salary']])

# 3. Selection
print("\n3. Selection Operations:")
print("-" * 50)

# Select specific columns
print("Name and salary columns:")
print(employees[['name', 'salary']].head())

# Select using loc
print("\nFirst 3 rows, name and salary:")
print(employees.loc[0:2, ['name', 'salary']])

# Select using iloc
print("\nFirst 2 rows, first 3 columns:")
print(employees.iloc[0:2, 0:3])

# 4. Grouping and Aggregation
print("\n4. Grouping and Aggregation:")
print("-" * 50)

# Group by department
dept_stats = employees.groupby('department_id').agg({
    'salary': ['mean', 'sum', 'count', 'min', 'max'],
    'age': 'mean'
})
print("Department Statistics:")
print(dept_stats)

# Group by department with custom aggregation
dept_summary = employees.groupby('department_id')['salary'].agg([
    ('avg_salary', 'mean'),
    ('total_salary', 'sum'),
    ('employee_count', 'count')
])
print("\nDepartment Summary:")
print(dept_summary)

# Multiple column grouping
age_dept = employees.groupby(['department_id', 'age'])['salary'].mean()
print("\nAverage salary by department and age:")
print(age_dept.head(10))

# 5. Merging
print("\n5. Merging DataFrames:")
print("-" * 50)

# Merge employees with departments
merged = pd.merge(employees, departments, on='department_id', how='inner')
print("Merged DataFrame:")
print(merged[['name', 'salary', 'name_y', 'location']].head())

# Rename columns for clarity
merged = merged.rename(columns={'name_x': 'employee_name', 'name_y': 'department_name'})
print("\nRenamed columns:")
print(merged[['employee_name', 'salary', 'department_name', 'location']].head())

# 6. Analysis on Merged Data
print("\n6. Analysis on Merged Data:")
print("-" * 50)

# Group by department name
dept_analysis = merged.groupby('department_name').agg({
    'salary': ['mean', 'sum', 'count'],
    'age': 'mean'
})
print("Analysis by Department Name:")
print(dept_analysis)

# Group by location
location_analysis = merged.groupby('location').agg({
    'salary': 'mean',
    'employee_name': 'count'
})
print("\nAnalysis by Location:")
print(location_analysis)

# 7. Pivot Table
print("\n7. Pivot Tables:")
print("-" * 50)

# Create pivot table
pivot = merged.pivot_table(
    values='salary',
    index='department_name',
    columns='location',
    aggfunc='mean',
    fill_value=0
)
print("Pivot Table (Salary by Department and Location):")
print(pivot)

# 8. Advanced Operations
print("\n8. Advanced Operations:")
print("-" * 50)

# Transform: Normalize salary within department
merged['salary_normalized'] = merged.groupby('department_name')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print("Salary normalized within department:")
print(merged[['employee_name', 'department_name', 'salary', 'salary_normalized']].head())

# Apply: Create salary category
def salary_category(row):
    if row['salary'] < 55000:
        return 'Low'
    elif row['salary'] < 65000:
        return 'Medium'
    else:
        return 'High'

merged['salary_category'] = merged.apply(salary_category, axis=1)
print("\nSalary categories:")
print(merged[['employee_name', 'salary', 'salary_category']].head())

# 9. Cross-tabulation
print("\n9. Cross-tabulation:")
print("-" * 50)

crosstab = pd.crosstab(merged['department_name'], merged['salary_category'])
print("Department vs Salary Category:")
print(crosstab)

# 10. Concatenation
print("\n10. Concatenation:")
print("-" * 50)

# Create additional data
new_employees = pd.DataFrame({
    'employee_id': [9, 10],
    'name': ['Iris', 'Jack'],
    'department_id': [1, 2],
    'salary': [52000, 63000],
    'age': [27, 31]
})

# Concatenate
all_employees = pd.concat([employees, new_employees], ignore_index=True)
print("All employees (after concatenation):")
print(all_employees)

print("\n" + "=" * 50)
print("Exercise completed!")
print("=" * 50)

Expected Output: Demonstrates various data manipulation operations including filtering, grouping, merging, and advanced operations.

Challenge (Optional):

Work with time series data
Perform more complex aggregations
Create multi-level pivot tables
Handle duplicate data
Implement data validation

Key Takeaways

Filtering - Boolean indexing, query(), loc, iloc
Selection - Select columns, rows, subsets
Grouping - groupby() for aggregations
Aggregation - sum, mean, count, min, max, custom functions
Merging - Inner, left, right, outer joins
Joining - join() method for index-based joins
Concatenation - Combine DataFrames vertically or horizontally
Pivot tables - Reshape data for analysis
Transform - Apply functions that return same shape
Apply - Apply custom functions to rows/columns
String methods - Filter using string operations
Multiple grouping - Group by multiple columns
Cross-tabulation - Create frequency tables
Best practices - Vectorized operations, appropriate joins
Common mistakes - SettingWithCopyWarning, index issues

Quiz: Data Manipulation

Test your understanding with these questions:

What filters rows where age > 30?
- A) df[age > 30]
- B) df[df['age'] > 30]
- C) df.filter(age > 30)
- D) df.where(age > 30)
What groups data by column?
- A) df.group()
- B) df.groupby()
- C) df.aggregate()
- D) df.collect()
What is the default join type in merge()?
- A) left
- B) right
- C) inner
- D) outer
What concatenates DataFrames vertically?
- A) pd.concat([df1, df2], axis=0)
- B) pd.concat([df1, df2], axis=1)
- C) pd.merge(df1, df2)
- D) df1.join(df2)
What creates a pivot table?
- A) df.pivot()
- B) df.pivot_table()
- C) df.reshape()
- D) df.transpose()
What aggregates multiple functions?
- A) df.agg()
- B) df.aggregate()
- C) df.groupby().agg()
- D) All of the above
What filters using string contains?
- A) df[df['col'].contains('text')]
- B) df[df['col'].str.contains('text')]
- C) df.filter('text')
- D) df[df['col'] == 'text']
What resets index after filtering?
- A) df.reset_index()
- B) df.reindex()
- C) df.set_index()
- D) df.index_reset()
What applies function to each row?
- A) df.apply(func, axis=0)
- B) df.apply(func, axis=1)
- C) df.map(func)
- D) df.transform(func)
What joins on index?
- A) pd.merge()
- B) df.join()
- C) pd.concat()
- D) Both A and B

Answers:

B) df[df['age'] > 30] (boolean indexing)
B) df.groupby() (grouping function)
C) inner (default join type)
A) pd.concat([df1, df2], axis=0) (vertical concatenation)
B) df.pivot_table() (pivot table function)
D) All of the above (aggregation methods)
B) df[df['col'].str.contains('text')] (string method)
A) df.reset_index() (reset index)
B) df.apply(func, axis=1) (apply to rows)
B) df.join() (index-based join)

Next Steps

Excellent work! You've mastered data manipulation. You now understand:

Filtering and selection
Grouping and aggregation
Merging and joining
Advanced data operations

What's Next?

Lesson 22.3: Data Cleaning
Learn handling missing data
Data transformation techniques
More data preparation

Additional Resources

Pandas Documentation: pandas.pydata.org/
Pandas User Guide: pandas.pydata.org/docs/user_guide/index.html
Groupby Guide: pandas.pydata.org/docs/user_guide/groupby.html
Merge Guide: pandas.pydata.org/docs/user_guide/merging.html

Lesson completed! You're ready to move on to the next lesson.

Previous: Data Manipulation with Pandas Next: Data Cleaning

Course Navigation

NumPy

Pandas