Capstone Project 2: Data Analysis Project
Project Overview
Build a comprehensive data analysis project demonstrating mastery of data manipulation, analysis, visualization, and insight generation using Python's data science ecosystem.
Project Duration: 2-3 weeks
Difficulty: Advanced
Technologies: Pandas, NumPy, Matplotlib, Seaborn, Jupyter Notebooks, Scikit-learn
Learning Objectives
By completing this project, you will:
- Load and explore datasets
- Clean and preprocess data
- Perform exploratory data analysis (EDA)
- Create meaningful visualizations
- Apply statistical analysis
- Build predictive models
- Generate actionable insights
- Present findings effectively
- Document analysis process
Project Requirements
Core Requirements
- Dataset: Real-world dataset (minimum 1000 rows, multiple features)
- Data Cleaning: Handle missing values, outliers, duplicates
- Exploratory Data Analysis: Comprehensive EDA with visualizations
- Statistical Analysis: Descriptive statistics, correlations, hypothesis testing
- Visualizations: Multiple chart types (at least 10 different visualizations)
- Machine Learning: At least one predictive model
- Documentation: Jupyter notebook with markdown explanations
- Insights: Clear conclusions and recommendations
Functional Requirements
1. Data Loading and Inspection
- Load data from CSV, JSON, or API
- Inspect data structure
- Check data types
- Identify missing values
2. Data Cleaning
- Handle missing values
- Remove duplicates
- Handle outliers
- Data type conversions
- Feature engineering
3. Exploratory Data Analysis
- Univariate analysis
- Bivariate analysis
- Multivariate analysis
- Distribution analysis
- Correlation analysis
4. Visualizations
- Histograms and distributions
- Scatter plots
- Box plots
- Heatmaps
- Time series plots (if applicable)
- Categorical visualizations
5. Statistical Analysis
- Descriptive statistics
- Inferential statistics
- Hypothesis testing
- Confidence intervals
6. Machine Learning
- Data preprocessing for ML
- Feature selection
- Model training
- Model evaluation
- Model interpretation
Technical Requirements
1. Code Quality
- Well-organized notebook
- Clear markdown explanations
- Reusable functions
- Comments and docstrings
- Error handling
2. Visualization Quality
- Professional-looking plots
- Clear labels and titles
- Appropriate color schemes
- Consistent styling
3. Documentation
- Executive summary
- Methodology explanation
- Key findings
- Recommendations
- Limitations
Suggested Project: E-commerce Sales Analysis
Dataset Options
1. E-commerce Sales Data
- Customer purchases
- Product information
- Sales transactions
- Customer demographics
2. Retail Sales Data
- Store sales
- Product categories
- Time series data
- Regional sales
3. Customer Behavior Data
- Website analytics
- User interactions
- Conversion data
- Customer segments
Analysis Goals
1. Sales Performance
- Total sales trends
- Product performance
- Category analysis
- Seasonal patterns
2. Customer Analysis (an RFM sketch follows this list)
- Customer segmentation
- Purchase behavior
- Customer lifetime value
- Retention analysis
3. Predictive Modeling
- Sales forecasting
- Customer churn prediction
- Product recommendation
- Price optimization
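Many of these goals reduce to a few grouped aggregations. As a taste of the customer-analysis goal, here is a minimal RFM (recency, frequency, monetary) segmentation sketch; it assumes the cleaned dataset from Phase 3 exists and has customer_id, order_date, and sales columns, so adjust the names to your data.
import pandas as pd
# RFM sketch -- the column names (customer_id, order_date, sales) are assumptions
df = pd.read_csv('data/processed/cleaned_data.csv', parse_dates=['order_date'])
snapshot_date = df['order_date'].max() + pd.Timedelta(days=1)
rfm = df.groupby('customer_id').agg(
    recency=('order_date', lambda s: (snapshot_date - s.max()).days),
    frequency=('order_date', 'count'),
    monetary=('sales', 'sum')
)
# Score each dimension into quartiles (higher score = more valuable customer)
rfm['r_score'] = pd.qcut(rfm['recency'].rank(method='first'), 4, labels=[4, 3, 2, 1]).astype(int)
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4]).astype(int)
rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 4, labels=[1, 2, 3, 4]).astype(int)
rfm['rfm_score'] = rfm[['r_score', 'f_score', 'm_score']].sum(axis=1)
print(rfm.sort_values('rfm_score', ascending=False).head())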
Technology Stack
Core Libraries
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing
- Matplotlib: Basic plotting
- Seaborn: Statistical visualizations
- Plotly: Interactive visualizations (optional)
Machine Learning
- Scikit-learn: Machine learning algorithms
- XGBoost: Gradient boosting (optional)
- Statsmodels: Statistical modeling
Data Sources
- Kaggle: Datasets
- UCI ML Repository: Machine learning datasets
- Government Data: Open data portals
- APIs: Real-time data
Tools
- Jupyter Notebook: Interactive analysis
- VS Code / PyCharm: Code editing
- Git: Version control
Project Structure
data_analysis_project/
├── README.md
├── requirements.txt
├── .gitignore
├── data/
│   ├── raw/
│   │   └── original_dataset.csv
│   ├── processed/
│   │   └── cleaned_data.csv
│   └── external/
│       └── reference_data.csv
├── notebooks/
│   ├── 01_data_loading.ipynb
│   ├── 02_data_cleaning.ipynb
│   ├── 03_exploratory_analysis.ipynb
│   ├── 04_statistical_analysis.ipynb
│   ├── 05_visualization.ipynb
│   ├── 06_machine_learning.ipynb
│   └── 07_final_report.ipynb
├── src/
│   ├── __init__.py
│   ├── data_loader.py
│   ├── data_cleaner.py
│   ├── visualizer.py
│   ├── analyzer.py
│   └── models.py
├── reports/
│   ├── figures/
│   │   └── (saved visualizations)
│   └── final_report.pdf
└── tests/
    ├── __init__.py
    └── test_data_processing.py
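The notebooks should stay focused on narrative and results, while reusable logic lives in src/ and gets covered by tests/. As an illustration only (the function name is an assumption, not a required API), src/data_loader.py and tests/test_data_processing.py might start like this:
# src/data_loader.py (illustrative sketch)
import pandas as pd

def load_raw_data(path, parse_dates=None):
    """Load a raw CSV file and report its basic shape."""
    df = pd.read_csv(path, parse_dates=parse_dates)
    print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns from {path}")
    return df

# tests/test_data_processing.py (run with `python -m pytest` from the project root)
import pandas as pd
from src.data_loader import load_raw_data

def test_load_raw_data(tmp_path):
    sample = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
    csv_path = tmp_path / 'sample.csv'
    sample.to_csv(csv_path, index=False)
    loaded = load_raw_data(csv_path)
    assert loaded.shape == (2, 2)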
Implementation Guide
Phase 1: Project Setup (Day 1)
Step 1: Environment Setup
# Create project directory
mkdir data_analysis_project
cd data_analysis_project
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install pandas numpy matplotlib seaborn plotly
pip install scikit-learn xgboost statsmodels
pip install jupyter notebook ipykernel
pip install openpyxl # For Excel files
# Create requirements.txt
pip freeze > requirements.txt
Step 2: Project Structure
# Create directories
mkdir -p data/{raw,processed,external}
mkdir -p notebooks src reports/figures tests
touch src/__init__.py tests/__init__.py
Step 3: Dataset Selection
Recommended Datasets:
- Kaggle: E-commerce Sales, Retail Sales, Customer Data
- UCI ML: Online Retail, Sales Transactions
- Government: Economic data, Census data
Dataset Requirements:
- Minimum 1000 rows
- Multiple features (10+ columns)
- Mix of numerical and categorical
- Real-world relevance
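Before committing to a dataset, check it against these requirements programmatically. A quick sketch (the file name is a placeholder):
import pandas as pd
# Quick dataset check against the requirements above
df = pd.read_csv('data/raw/candidate_dataset.csv')
print(f"Rows: {len(df)} (need at least 1000)")
print(f"Columns: {df.shape[1]} (need 10+)")
print(f"Numerical columns: {len(df.select_dtypes(include='number').columns)}")
print(f"Categorical columns: {len(df.select_dtypes(include='object').columns)}")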
Phase 2: Data Loading and Inspection (Days 2-3)
Data Loading
# notebooks/01_data_loading.ipynb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Load data
df = pd.read_csv('data/raw/sales_data.csv')
# Basic inspection
print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
print("\nFirst Few Rows:")
df.head()
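The requirements also allow JSON files or an API as the data source. A hedged sketch of both paths (the file name and URL are placeholders, and the API variant needs the requests package):
# Alternative: load from a JSON file (placeholder path)
df_json = pd.read_json('data/raw/sales_data.json')

# Alternative: load from a REST API that returns JSON records (placeholder URL)
# requires: pip install requests
import requests

response = requests.get('https://example.com/api/sales', timeout=30)
response.raise_for_status()
df_api = pd.DataFrame(response.json())
print(df_api.shape)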
Data Inspection
# Basic information
df.info()
# Statistical summary
df.describe()
# Check for missing values
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
'Missing Count': missing_values,
'Percentage': missing_percent
})
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
# Check for duplicates
duplicate_count = df.duplicated().sum()
print(f"Duplicate rows: {duplicate_count}")
# Unique values in categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head())
Data Structure Analysis
# Memory usage
print("Memory Usage:")
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")
# Date columns (if any) -- note that dates read from CSV stay as strings
# until they are converted with pd.to_datetime
date_columns = df.select_dtypes(include=['datetime64']).columns
print(f"\nDate columns: {list(date_columns)}")
# Numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns
print(f"\nNumerical columns: {list(numerical_cols)}")
# Categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
print(f"\nCategorical columns: {list(categorical_cols)}")
Phase 3: Data Cleaning (Days 4-5)
Handle Missing Values
# notebooks/02_data_cleaning.ipynb
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('data/raw/sales_data.csv')
# Strategy 1: Drop columns with >50% missing values
threshold = 0.5
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_cleaned = df.drop(columns=cols_to_drop)
print(f"Dropped columns: {list(cols_to_drop)}")
# Strategy 2: Fill numerical columns with median
numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())
# Strategy 3: Fill categorical columns with mode
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode()[0])
# Verify no missing values
print(f"Missing values remaining: {df_cleaned.isnull().sum().sum()}")
Handle Duplicates
# Remove duplicates
initial_count = len(df_cleaned)
df_cleaned = df_cleaned.drop_duplicates()
final_count = len(df_cleaned)
print(f"Removed {initial_count - final_count} duplicate rows")
Handle Outliers
# Detect outliers using IQR method
def detect_outliers_iqr(df, column):
    """Detect outliers in a column using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Handle outliers for numerical columns
numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
    outliers, lower, upper = detect_outliers_iqr(df_cleaned, col)
    if len(outliers) > 0:
        print(f"\n{col}: {len(outliers)} outliers detected")
        # Option 1: Cap outliers
        df_cleaned[col] = df_cleaned[col].clip(lower=lower, upper=upper)
        # Option 2: Remove outliers (if appropriate)
        # df_cleaned = df_cleaned[(df_cleaned[col] >= lower) & (df_cleaned[col] <= upper)]
Data Type Conversions
# Convert date columns
date_columns = ['order_date', 'ship_date'] # Adjust based on your data
for col in date_columns:
    if col in df_cleaned.columns:
        df_cleaned[col] = pd.to_datetime(df_cleaned[col], errors='coerce')
# Convert categorical to category type (saves memory)
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if df_cleaned[col].nunique() < 50:  # Only if reasonable number of categories
        df_cleaned[col] = df_cleaned[col].astype('category')
# Save cleaned data
df_cleaned.to_csv('data/processed/cleaned_data.csv', index=False)
print("Cleaned data saved!")
Phase 4: Exploratory Data Analysis (Days 6-9)
Univariate Analysis
# notebooks/03_exploratory_analysis.ipynb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load cleaned data
df = pd.read_csv('data/processed/cleaned_data.csv')
# Numerical variables - Distribution
numerical_cols = df.select_dtypes(include=[np.number]).columns
fig, axes = plt.subplots(len(numerical_cols), 2, figsize=(15, 5*len(numerical_cols)))
fig.suptitle('Univariate Analysis - Numerical Variables', fontsize=16)
for idx, col in enumerate(numerical_cols):
    # Histogram
    axes[idx, 0].hist(df[col].dropna(), bins=30, edgecolor='black')
    axes[idx, 0].set_title(f'Distribution of {col}')
    axes[idx, 0].set_xlabel(col)
    axes[idx, 0].set_ylabel('Frequency')
    # Box plot
    axes[idx, 1].boxplot(df[col].dropna())
    axes[idx, 1].set_title(f'Box Plot of {col}')
    axes[idx, 1].set_ylabel(col)
plt.tight_layout()
plt.savefig('reports/figures/univariate_numerical.png', dpi=300, bbox_inches='tight')
plt.show()
# Categorical variables
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    plt.figure(figsize=(12, 6))
    value_counts = df[col].value_counts()
    plt.bar(range(len(value_counts)), value_counts.values)
    plt.xticks(range(len(value_counts)), value_counts.index, rotation=45)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.tight_layout()
    plt.savefig(f'reports/figures/categorical_{col}.png', dpi=300, bbox_inches='tight')
    plt.show()
Bivariate Analysis
# Correlation matrix
numerical_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numerical_cols].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix', fontsize=16)
plt.tight_layout()
plt.savefig('reports/figures/correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
# Scatter plots for key relationships
key_pairs = [('sales', 'quantity'), ('price', 'sales'), ('discount', 'sales')]
for pair in key_pairs:
    if all(col in df.columns for col in pair):
        plt.figure(figsize=(10, 6))
        plt.scatter(df[pair[0]], df[pair[1]], alpha=0.5)
        plt.xlabel(pair[0])
        plt.ylabel(pair[1])
        plt.title(f'{pair[0]} vs {pair[1]}')
        plt.tight_layout()
        plt.savefig(f'reports/figures/scatter_{pair[0]}_{pair[1]}.png', dpi=300, bbox_inches='tight')
        plt.show()
Multivariate Analysis
# Pair plot for key numerical variables
key_numerical = ['sales', 'quantity', 'price', 'profit'] # Adjust based on your data
key_numerical = [col for col in key_numerical if col in df.columns]
if len(key_numerical) >= 2:
    sns.pairplot(df[key_numerical], diag_kind='kde')
    plt.suptitle('Pair Plot of Key Numerical Variables', y=1.02)
    plt.savefig('reports/figures/pairplot.png', dpi=300, bbox_inches='tight')
    plt.show()
# Grouped analysis
if 'category' in df.columns and 'sales' in df.columns:
    category_sales = df.groupby('category')['sales'].agg(['sum', 'mean', 'count']).sort_values('sum', ascending=False)
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    # Total sales by category
    axes[0].barh(range(len(category_sales)), category_sales['sum'])
    axes[0].set_yticks(range(len(category_sales)))
    axes[0].set_yticklabels(category_sales.index)
    axes[0].set_xlabel('Total Sales')
    axes[0].set_title('Total Sales by Category')
    # Average sales by category
    axes[1].barh(range(len(category_sales)), category_sales['mean'])
    axes[1].set_yticks(range(len(category_sales)))
    axes[1].set_yticklabels(category_sales.index)
    axes[1].set_xlabel('Average Sales')
    axes[1].set_title('Average Sales by Category')
    plt.tight_layout()
    plt.savefig('reports/figures/category_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
Phase 5: Statistical Analysis (Days 10-11)
Descriptive Statistics
# notebooks/04_statistical_analysis.ipynb
import pandas as pd
import numpy as np
from scipy import stats
# Load cleaned data
df = pd.read_csv('data/processed/cleaned_data.csv')
# Comprehensive descriptive statistics
numerical_cols = df.select_dtypes(include=[np.number]).columns
descriptive_stats = df[numerical_cols].describe()
# Add additional statistics
additional_stats = pd.DataFrame({
'skewness': df[numerical_cols].skew(),
'kurtosis': df[numerical_cols].kurtosis(),
'variance': df[numerical_cols].var(),
'coefficient_of_variation': df[numerical_cols].std() / df[numerical_cols].mean()
})
# Combine statistics
full_stats = pd.concat([descriptive_stats.T, additional_stats], axis=1)
print("Comprehensive Descriptive Statistics:")
print(full_stats)
Hypothesis Testing
# Example: Test if sales differ significantly between two categories
if 'category' in df.columns and 'sales' in df.columns:
    categories = df['category'].unique()[:2]  # Compare first two categories
    cat1_sales = df[df['category'] == categories[0]]['sales']
    cat2_sales = df[df['category'] == categories[1]]['sales']
    # T-test
    t_stat, p_value = stats.ttest_ind(cat1_sales, cat2_sales)
    print(f"T-test between {categories[0]} and {categories[1]}:")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
    # Mann-Whitney U test (non-parametric alternative)
    u_stat, u_p_value = stats.mannwhitneyu(cat1_sales, cat2_sales, alternative='two-sided')
    print(f"\nMann-Whitney U test:")
    print(f"U-statistic: {u_stat:.4f}")
    print(f"P-value: {u_p_value:.4f}")
Correlation Analysis
# Pearson correlation
numerical_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numerical_cols].corr(method='pearson')
# Find strong correlations
strong_correlations = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.7:  # Strong correlation threshold
            strong_correlations.append({
                'Variable 1': correlation_matrix.columns[i],
                'Variable 2': correlation_matrix.columns[j],
                'Correlation': corr_value
            })
if strong_correlations:
    print("Strong Correlations (|r| > 0.7):")
    for corr in strong_correlations:
        print(f"{corr['Variable 1']} - {corr['Variable 2']}: {corr['Correlation']:.4f}")
Phase 6: Advanced Visualizations (Days 12-13)
Time Series Analysis
# notebooks/05_visualization.ipynb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load cleaned data
df = pd.read_csv('data/processed/cleaned_data.csv')
# Time series analysis (if a date column exists)
# Columns read back from CSV are plain strings, so parse any date columns first
# (adjust the column names to match your data)
for col in ['order_date', 'ship_date', 'date']:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')
date_columns = df.select_dtypes(include=['datetime64']).columns
if len(date_columns) > 0 and 'sales' in df.columns:
    date_col = date_columns[0]
    df_time = df.set_index(date_col)
    # Daily sales trend
    daily_sales = df_time['sales'].resample('D').sum()
    fig, axes = plt.subplots(2, 1, figsize=(15, 10))
    # Line plot
    axes[0].plot(daily_sales.index, daily_sales.values, linewidth=2)
    axes[0].set_title('Daily Sales Trend', fontsize=14)
    axes[0].set_xlabel('Date')
    axes[0].set_ylabel('Sales')
    axes[0].grid(True, alpha=0.3)
    # Monthly aggregation
    monthly_sales = df_time['sales'].resample('M').sum()
    axes[1].bar(range(len(monthly_sales)), monthly_sales.values)
    axes[1].set_xticks(range(len(monthly_sales)))
    axes[1].set_xticklabels([d.strftime('%Y-%m') for d in monthly_sales.index], rotation=45)
    axes[1].set_title('Monthly Sales', fontsize=14)
    axes[1].set_xlabel('Month')
    axes[1].set_ylabel('Sales')
    axes[1].grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.savefig('reports/figures/time_series_sales.png', dpi=300, bbox_inches='tight')
    plt.show()
Advanced Visualizations
# Violin plots for distribution comparison
if 'category' in df.columns and 'sales' in df.columns:
    plt.figure(figsize=(12, 6))
    sns.violinplot(data=df, x='category', y='sales')
    plt.title('Sales Distribution by Category', fontsize=14)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('reports/figures/violin_category_sales.png', dpi=300, bbox_inches='tight')
    plt.show()
# Heatmap with annotations
if 'category' in df.columns and 'region' in df.columns and 'sales' in df.columns:
    pivot_table = df.pivot_table(values='sales', index='category', columns='region', aggfunc='sum')
    plt.figure(figsize=(12, 8))
    sns.heatmap(pivot_table, annot=True, fmt='.0f', cmap='YlOrRd', linewidths=0.5)
    plt.title('Sales Heatmap: Category vs Region', fontsize=14)
    plt.tight_layout()
    plt.savefig('reports/figures/heatmap_category_region.png', dpi=300, bbox_inches='tight')
    plt.show()
# Stacked bar chart
if 'category' in df.columns and 'status' in df.columns:
    category_status = pd.crosstab(df['category'], df['status'])
    category_status.plot(kind='bar', stacked=True, figsize=(12, 6))
    plt.title('Status Distribution by Category', fontsize=14)
    plt.xlabel('Category')
    plt.ylabel('Count')
    plt.legend(title='Status')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('reports/figures/stacked_bar_category_status.png', dpi=300, bbox_inches='tight')
    plt.show()
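If you chose the optional Plotly library, the same category comparison can be made interactive and saved as an HTML file. A minimal sketch:
import plotly.express as px

# Interactive bar chart of total sales per category (optional, requires plotly)
if 'category' in df.columns and 'sales' in df.columns:
    category_totals = df.groupby('category', as_index=False)['sales'].sum()
    fig = px.bar(category_totals, x='category', y='sales',
                 title='Total Sales by Category (interactive)')
    fig.write_html('reports/figures/category_sales_interactive.html')
    fig.show()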
Phase 7: Machine Learning (Days 14-16)
Data Preprocessing for ML
# notebooks/06_machine_learning.ipynb
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report
import joblib
# Load cleaned data
df = pd.read_csv('data/processed/cleaned_data.csv')
# Example: Predict sales based on other features
# Select features
feature_columns = ['quantity', 'price', 'discount'] # Adjust based on your data
target_column = 'sales'
# Prepare data
X = df[feature_columns].copy()
y = df[target_column].copy()
# Handle missing values
X = X.fillna(X.mean())
y = y.fillna(y.mean())
# Encode categorical variables if any
le = LabelEncoder()
for col in X.select_dtypes(include=['object', 'category']).columns:
    X[col] = le.fit_transform(X[col])
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
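Feature selection is one of the functional requirements. With a small set of candidate features, a univariate filter such as SelectKBest gives a quick ranking; a sketch using the scaled training data from above:
from sklearn.feature_selection import SelectKBest, f_regression

# Rank features by their univariate relationship with the target
selector = SelectKBest(score_func=f_regression, k='all')
selector.fit(X_train_scaled, y_train)
feature_scores = pd.DataFrame({
    'feature': feature_columns,
    'f_score': selector.scores_
}).sort_values('f_score', ascending=False)
print(feature_scores)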
Model Training
# Train Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_scaled, y_train)
# Predictions
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)
# Evaluation
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
print("Model Performance:")
print(f"Train RMSE: {train_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")
print(f"Train R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")
# Feature importance
feature_importance = pd.DataFrame({
'feature': feature_columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.tight_layout()
plt.savefig('reports/figures/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()
# Save model (create the models/ directory first if it does not exist)
os.makedirs('models', exist_ok=True)
joblib.dump(model, 'models/sales_predictor.pkl')
joblib.dump(scaler, 'models/scaler.pkl')
Model Evaluation
# Prediction vs Actual scatter plot
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Training set
axes[0].scatter(y_train, y_train_pred, alpha=0.5)
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Training Set (R² = {train_r2:.4f})')
axes[0].grid(True, alpha=0.3)
# Test set
axes[1].scatter(y_test, y_test_pred, alpha=0.5)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual')
axes[1].set_ylabel('Predicted')
axes[1].set_title(f'Test Set (R² = {test_r2:.4f})')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('reports/figures/prediction_accuracy.png', dpi=300, bbox_inches='tight')
plt.show()
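A single train/test split can be misleading by chance, so cross-validation gives a more stable performance estimate. A short sketch (random forests do not require scaled features, so the raw feature matrix is used):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the full feature matrix
cv_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
cv_scores = cross_val_score(cv_model, X, y, cv=5, scoring='r2')
print(f"Cross-validated R² scores: {np.round(cv_scores, 4)}")
print(f"Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")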
Phase 8: Final Report and Presentation (Days 17-18)
Executive Summary
# notebooks/07_final_report.ipynb
import pandas as pd
import matplotlib.pyplot as plt
# Load cleaned data
df = pd.read_csv('data/processed/cleaned_data.csv')
# Key Metrics Summary
summary = {
'Total Records': len(df),
'Total Sales': df['sales'].sum() if 'sales' in df.columns else 'N/A',
'Average Sales': df['sales'].mean() if 'sales' in df.columns else 'N/A',
'Unique Customers': df['customer_id'].nunique() if 'customer_id' in df.columns else 'N/A',
'Unique Products': df['product_id'].nunique() if 'product_id' in df.columns else 'N/A',
'Date Range': f"{df['date'].min()} to {df['date'].max()}" if 'date' in df.columns else 'N/A'
}
print("=" * 50)
print("EXECUTIVE SUMMARY")
print("=" * 50)
for key, value in summary.items():
    print(f"{key}: {value}")
print("=" * 50)
Key Findings
# Top insights
print("\nKEY FINDINGS:")
print("1. Sales Performance:")
if 'sales' in df.columns and 'category' in df.columns:
    top_category = df.groupby('category')['sales'].sum().idxmax()
    print(f"   - Top performing category: {top_category}")
    print(f"   - Total sales: ${df['sales'].sum():,.2f}")
print("\n2. Customer Insights:")
if 'customer_id' in df.columns and 'sales' in df.columns:
    customer_lifetime_value = df.groupby('customer_id')['sales'].sum().mean()
    print(f"   - Average customer lifetime value: ${customer_lifetime_value:,.2f}")
print("\n3. Product Insights:")
if 'product_id' in df.columns and 'sales' in df.columns:
    top_product = df.groupby('product_id')['sales'].sum().idxmax()
    print(f"   - Best selling product ID: {top_product}")
Recommendations
print("\nRECOMMENDATIONS:")
print("1. Focus marketing efforts on top-performing categories")
print("2. Implement customer retention programs for high-value customers")
print("3. Optimize inventory for best-selling products")
print("4. Consider promotional campaigns for underperforming categories")
print("5. Analyze seasonal trends for better inventory planning")
Evaluation Criteria
Data Analysis (30%)
- Comprehensive EDA
- Appropriate visualizations
- Statistical analysis
- Data quality assessment
Code Quality (20%)
- Well-organized code
- Reusable functions
- Clear comments
- Proper error handling
Machine Learning (20%)
- Appropriate model selection
- Proper evaluation metrics
- Feature engineering
- Model interpretation
Documentation (15%)
- Clear markdown explanations
- Methodology documented
- Findings summarized
- Recommendations provided
Visualization Quality (10%)
- Professional appearance
- Clear labels
- Appropriate chart types
- Consistent styling
Insights and Recommendations (5%)
- Actionable insights
- Business recommendations
- Clear conclusions
Submission Requirements
1. Jupyter Notebooks
- Complete analysis notebooks
- All code executable
- Clear markdown explanations
2. Data Files
- Original dataset
- Cleaned dataset
- Processed datasets
3. Visualizations
- All figures saved
- High resolution (300 DPI)
- Properly labeled
4. Documentation
- README.md with project overview
- Final report (PDF or HTML)
- Methodology explanation
5. Code
- Reusable functions in src/
- requirements.txt
- .gitignore
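A reasonable starting point for the .gitignore (adapt it to your setup; whether raw data belongs in version control depends on its size and licensing):
# .gitignore sketch
venv/
__pycache__/
.ipynb_checkpoints/
*.pkl
.DS_Store
# Omit the next line if you prefer to version the raw data
data/raw/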
Additional Resources
- Kaggle: kaggle.com/datasets
- UCI ML Repository: archive.ics.uci.edu/ml
- Pandas Documentation: pandas.pydata.org
- Matplotlib Gallery: matplotlib.org/gallery
- Seaborn Examples: seaborn.pydata.org/examples
Next Steps
After completing this project:
- Explore more advanced ML techniques
- Build interactive dashboards (Plotly Dash, Streamlit)
- Deploy models as APIs
- Create automated reporting pipelines
- Explore deep learning for time series
Good luck with your data analysis project!