Predictive Customer Churn Analysis for a Telecom Company
This project analyzes a telecom company’s customer data to predict churn. By building and evaluating several machine learning models, the analysis identifies the key factors that contribute to customers leaving the service. The ultimate goal is to provide actionable business recommendations to improve customer retention.
📊 Project Overview
Customer churn is a critical metric for subscription-based businesses. This analysis walks through the entire data science workflow, from data cleaning and exploratory data analysis (EDA) to model training and feature importance evaluation. Three classification models are trained to predict whether a customer will churn, and the selected model is then used to derive key insights.
The project demonstrates proficiency in:
Data cleaning and preprocessing.
Exploratory data analysis and visualization.
Building robust machine learning pipelines with scikit-learn.
Training and comparing multiple classification models (Logistic Regression, Random Forest, XGBoost).
Evaluating model performance using metrics like Accuracy, AUC-ROC, Precision, and Recall.
Deriving actionable business insights from model results.
📈 Project Workflow
The analysis is structured into the following key steps:
Data Loading and Inspection: The Telco Customer Churn dataset is loaded from a public repository and its structure is inspected.
Data Cleaning and Preprocessing:
Handled missing values in the TotalCharges column by filling them with the median.
Converted the target variable Churn from categorical (‘Yes’/‘No’) to binary (1/0).
Dropped the non-predictive customerID column.
Exploratory Data Analysis (EDA): Visualized the data to uncover initial patterns and relationships related to customer churn.
Feature Engineering and Pipeline Creation:
Identified numerical and categorical features.
Created a preprocessing pipeline using scikit-learn’s ColumnTransformer to scale numerical features (StandardScaler) and one-hot encode categorical features (OneHotEncoder).
Model Training:
Split the data into training (80%) and testing (20%) sets, stratifying by the target variable to maintain churn distribution.
Trained three classification models: Logistic Regression, Random Forest, and XGBoost.
Model Evaluation:
Evaluated each model on the test set using Accuracy, AUC-ROC score, a detailed classification report, and a confusion matrix.
Feature Importance Analysis:
Used the selected model (XGBoost) to identify and visualize the top features that most influence churn predictions.
Conclusion and Recommendations: Summarized the findings and provided actionable business recommendations based on the analysis.
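The workflow steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual notebook: the data is synthetic (generated in the snippet, with blank `TotalCharges` strings standing in for missing values as an assumption), only a handful of Telco-style columns are used, and the hyperparameters and `random_state` values are arbitrary. XGBoost is left out to keep the sketch free of extra dependencies; an `xgboost.XGBClassifier` could be added to the `models` dictionary in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Telco data (all values are made up).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "customerID": [f"C{i:04d}" for i in range(n)],
    "tenure": rng.integers(1, 73, size=n),
    "MonthlyCharges": rng.uniform(20, 110, size=n).round(2),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
df["TotalCharges"] = (df["tenure"] * df["MonthlyCharges"]).round(2).astype(str)
df.loc[:4, "TotalCharges"] = " "  # assumption: blanks represent missing values
churn_prob = np.where(df["Contract"] == "Month-to-month", 0.45, 0.10)
df["Churn"] = np.where(rng.random(n) < churn_prob, "Yes", "No")

# Cleaning: numeric TotalCharges with median fill, binary target, drop the ID.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
df = df.drop(columns=["customerID"])

X = df.drop(columns=["Churn"])
y = df["Churn"]
numeric_features = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_features = ["Contract"]

# Preprocessing: scale numeric columns, one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# 80/20 split, stratified so both sets keep the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, estimator in models.items():
    pipe = Pipeline([("preprocess", clone(preprocess)), ("model", estimator)])
    pipe.fit(X_train, y_train)
    churn_scores = pipe.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_test, pipe.predict(X_test)),
        "auc_roc": roc_auc_score(y_test, churn_scores),
    }

print(pd.DataFrame(results).T.round(4))
```

Wrapping the preprocessing and the estimator in a single `Pipeline` means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.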
💡 Key Findings & Visualizations
The EDA revealed several important trends:
Contract Type: Customers with Month-to-month contracts have a significantly higher churn rate compared to those on one- or two-year contracts.
Internet Service: Customers with Fiber optic internet service are more likely to churn.
Tenure and Charges: Customers who churn tend to have lower tenure (are newer customers) and higher monthly charges.
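A comparison like the contract-type finding can be computed with a simple group-by once `Churn` is encoded as 1/0. The snippet below is a small sketch on made-up rows; the churn rates it prints are not the project's results.

```python
import pandas as pd

# Toy rows for illustration only; Churn is already encoded as 1/0.
toy = pd.DataFrame({
    "Contract": [
        "Month-to-month", "Month-to-month", "Month-to-month", "Month-to-month",
        "One year", "One year", "One year",
        "Two year", "Two year", "Two year",
    ],
    "Churn": [1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# The mean of a 1/0 column is the share of churners in each group.
churn_rate = (
    toy.groupby("Contract")["Churn"]
    .mean()
    .sort_values(ascending=False)
)
print(churn_rate)
```

The same pattern applies to any categorical column, for example grouping by internet service type instead of contract.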
⚙️ Model Performance
The three models were compared on the test set. Logistic Regression scored highest on both accuracy and AUC-ROC, with XGBoost second on both metrics.
| Model | Accuracy | AUC-ROC Score |
| --- | --- | --- |
| Logistic Regression | 0.8055 | 0.8419 |
| Random Forest | 0.7779 | 0.8164 |
| XGBoost | 0.7850 | 0.8254 |
While Logistic Regression had slightly higher accuracy and AUC-ROC, the XGBoost model was chosen for the feature importance analysis because of its strong performance, its robustness, and its ability to provide clear feature importances.
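To make the metrics in this comparison concrete, here is a minimal sketch using scikit-learn's metric functions on a tiny hand-made example. The labels, probabilities, and the 0.5 decision threshold are illustrative assumptions, not values from the project.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hand-made example: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.10, 0.20, 0.30, 0.60, 0.40, 0.35, 0.70, 0.90]

# Accuracy, precision, and recall need hard labels, so apply a threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# AUC-ROC uses the raw probabilities and does not depend on the threshold.
auc = roc_auc_score(y_true, y_prob)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} auc_roc={auc:.3f}")
```

Here 6 of 8 predictions are correct (accuracy 0.75), while the AUC-ROC is 13/15, about 0.867, because 13 of the 15 churner/non-churner pairs are ranked in the right order. Accuracy depends on the chosen threshold and AUC-ROC does not, which is why the two metrics are reported side by side.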
🎯 Top Predictors of Churn
Based on the XGBoost model’s feature importance analysis, the top 5 drivers of churn are:
Internet Service (Fiber optic): Having fiber optic is the most significant predictor.
Contract (Month-to-month): This is the second most powerful indicator of churn.
Streaming Movies (Yes)
Tech Support (No)
Online Security (No)
🚀 Conclusion & Business Recommendations
Model Performance Summary
The selected XGBoost model achieved an AUC-ROC score of approximately 0.83 on the test set, indicating a good ability to distinguish between churning and non-churning customers. Logistic Regression scored slightly higher, at approximately 0.84.
Key Drivers of Churn
Contract Type: Month-to-month contracts are one of the strongest predictors of churn, ranking second in the XGBoost feature importances.
Internet Service: Customers with Fiber optic service show a higher churn rate. This could be due to price sensitivity or service reliability issues that need investigation.
Tenure: Low tenure (newer customers) is a high-risk factor.
Actionable Business Recommendations
Proactive Retention Campaigns: Use the model to score customers based on their churn probability. Target the highest-risk customers (e.g., top 10%) with proactive offers, such as a discount for switching from a month-to-month to an annual contract.
Improve Onboarding Experience: Since new customers are at high risk, develop a robust 90-day onboarding plan to demonstrate the service’s value and build loyalty early on.
Conduct Service Review: Investigate why Fiber optic customers churn more frequently. This could involve analyzing pricing compared to competitors, reviewing service reliability tickets, or conducting customer surveys.
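The retention-campaign recommendation can be sketched as a small ranking step: given a churn probability per customer, keep the highest-scoring fraction. The function name `top_risk_customers`, the customer IDs, and the probabilities below are all hypothetical. In practice the scores would come from the fitted model's `predict_proba` output.

```python
import math


def top_risk_customers(churn_probabilities, fraction=0.10):
    """Return the customer IDs in the top `fraction` by churn probability."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    # Sort customers from highest to lowest predicted churn probability.
    ranked = sorted(
        churn_probabilities.items(), key=lambda item: item[1], reverse=True
    )
    # Round up so a small customer base still yields at least one target.
    n_target = max(1, math.ceil(len(ranked) * fraction))
    return [customer_id for customer_id, _ in ranked[:n_target]]


# Hypothetical scores for ten customers.
scores = {
    "C001": 0.12, "C002": 0.81, "C003": 0.05, "C004": 0.47, "C005": 0.33,
    "C006": 0.92, "C007": 0.26, "C008": 0.64, "C009": 0.09, "C010": 0.58,
}

print(top_risk_customers(scores))                 # top 10% of ten customers
print(top_risk_customers(scores, fraction=0.30))  # a wider campaign
```

Selecting a fixed fraction rather than a fixed probability cutoff keeps the campaign size predictable, which is often easier to budget for. A probability threshold is a reasonable alternative when the offer cost per customer is small.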