This project built a complete pipeline for pre-processing customer data and visualizing the results. The tasks included cleaning a flat CSV dataset, transforming it into nested JSON structures, computing new metrics, and visualizing key trends using Python. The project delivered the processed data files along with charts that revealed patterns in customer attributes such as age, salary, and commute distance.
Project Overview
The goal of this project was to transform raw customer data into structured formats suitable for business analysis. A set of tasks was performed, ranging from data manipulation to generating visual insights, using standard Python libraries for pre-processing and Pandas and Seaborn for data visualization. The final deliverables included multiple JSON files and various charts that provide actionable insights into the dataset.
Data Pre-processing Tasks
Data Loading and Structuring:
CSV to Nested JSON Conversion: The flat data structure from the CSV was parsed and converted into more meaningful nested structures for fields like Vehicle, Credit Card, and Address.
Data Cleaning: Issues such as missing values in key fields like dependents were handled effectively, ensuring accurate representation of the data.
Segmentation of Data: Separate JSON files were created for specific groups, such as retired.json and employed.json, based on customer occupation.
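The nesting and segmentation steps above can be sketched roughly as follows, using only the standard library. The column names, the sample rows, the prefix-based grouping rule, and the "Occupation equals Retired" test are all assumptions made for illustration; the write-up does not give the real CSV schema.

```python
import csv
import io
import json

# Hypothetical flat CSV; the real column names are not given in this write-up.
FLAT_CSV = """Name,Occupation,Vehicle Make,Vehicle Model,Credit Card Number,Credit Card Expiry,Address Street,Address City
Ann Lee,Retired,Ford,Focus,4000123412341234,08/27,1 High St,Leeds
Bob Ray,Engineer,Kia,Rio,4000567856785678,11/26,2 Low Rd,York
"""

# Column prefixes that are folded into a nested object of the same name.
NESTED_PREFIXES = ("Vehicle", "Credit Card", "Address")


def nest_row(row):
    """Group prefixed columns (e.g. 'Vehicle Make') under a nested key."""
    nested = {}
    for column, value in row.items():
        for prefix in NESTED_PREFIXES:
            if column.startswith(prefix + " "):
                sub_key = column[len(prefix) + 1:]
                nested.setdefault(prefix, {})[sub_key] = value
                break
        else:
            # Columns without a known prefix stay at the top level.
            nested[column] = value
    return nested


def segment_by_occupation(records):
    """Split records into retired and employed groups (assumed rule)."""
    retired = [r for r in records if r["Occupation"].strip().lower() == "retired"]
    employed = [r for r in records if r["Occupation"].strip().lower() != "retired"]
    return retired, employed


records = [nest_row(row) for row in csv.DictReader(io.StringIO(FLAT_CSV))]
retired, employed = segment_by_occupation(records)

# json.dumps stands in for writing retired.json and employed.json to disk.
print(json.dumps(retired, indent=2))
```

In a real run, the two lists would be written with json.dump to retired.json and employed.json instead of being printed.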
Credit Card Anomaly Detection:
A function was created to flag customers with credit card expiry dates spanning over 10 years, exporting them to remove_ccard.json for manual follow-up by the client.
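A minimal sketch of such a check is shown below. It assumes each record carries a card start date and an expiry date in MM/YY form under hypothetical "Start" and "Expiry" keys, and that "over 10 years" means strictly more than 120 months between the two; none of these details are confirmed by the write-up.

```python
from datetime import datetime


def months_between(start, expiry, fmt="%m/%y"):
    """Whole months from the start date to the expiry date (MM/YY assumed)."""
    s = datetime.strptime(start, fmt)
    e = datetime.strptime(expiry, fmt)
    return (e.year - s.year) * 12 + (e.month - s.month)


def flag_long_lived_cards(records, max_years=10):
    """Return records whose card validity exceeds max_years (strictly)."""
    flagged = []
    for record in records:
        card = record["Credit Card"]  # hypothetical nested key
        if months_between(card["Start"], card["Expiry"]) > max_years * 12:
            flagged.append(record)
    return flagged


# Illustrative records only, not taken from the real dataset.
customers = [
    {"Name": "Ann Lee", "Credit Card": {"Start": "01/15", "Expiry": "06/27"}},
    {"Name": "Bob Ray", "Credit Card": {"Start": "03/20", "Expiry": "03/30"}},
]

flagged = flag_long_lived_cards(customers)
print([r["Name"] for r in flagged])
```

The flagged list is what would be exported to remove_ccard.json. A card valid for exactly 10 years is not flagged under the strict comparison assumed here.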
New Metrics Calculation:
A new metric, "Salary-Commute", was calculated to measure the customer's earnings per mile commuted. The data was sorted and output to commute.json.
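The metric can be illustrated with a short sketch. The field names "Salary" and "Commute Distance", the ascending sort order, and the handling of a zero commute distance (the salary is kept unchanged to avoid dividing by zero) are assumptions; the write-up only states that the metric is salary per mile commuted and that the output was sorted.

```python
def salary_commute(salary, distance_miles):
    """Salary earned per mile commuted."""
    if distance_miles <= 0:
        # Assumption: with no commute, fall back to the salary itself.
        return float(salary)
    return salary / distance_miles


def add_salary_commute(records):
    """Attach the metric to each record and sort ascending by it."""
    enriched = [
        {**r, "Salary-Commute": round(salary_commute(r["Salary"], r["Commute Distance"]), 2)}
        for r in records
    ]
    return sorted(enriched, key=lambda r: r["Salary-Commute"])


# Illustrative records only, not taken from the real dataset.
sample = [
    {"Name": "Ann Lee", "Salary": 30000, "Commute Distance": 12.5},
    {"Name": "Bob Ray", "Salary": 45000, "Commute Distance": 0},
    {"Name": "Cy Dunn", "Salary": 52000, "Commute Distance": 26},
]

ranked = add_salary_commute(sample)
for row in ranked:
    print(row["Name"], row["Salary-Commute"])
```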
Data Visualization
Using Pandas and Seaborn, several visualizations were generated to provide the client with deeper insights into the customer data. The charts revealed trends in age, salary, and commute distances, highlighting key patterns that could guide business decisions.
Key Visualizations:
Age Distribution: A histogram showing the distribution of customer ages with a bin width of 5, allowing the client to see the age spread of their customer base.
Dependents Data: A visualization highlighting errors and anomalies in the dependents data, aiding in the understanding of potential data quality issues.
Age vs. Marital Status: A plot showing the relationship between age and marital status, segmented into different bins, offering insight into customer demographics.
Scatter Plot of Commute Distance vs. Salary: This chart visualized how yearly salary relates to the commute distance, helping the client identify possible correlations between salary and geographical spread.
Scatter Plot of Age vs. Salary: A plot showing the relationship between customer age and their salary, revealing key trends across different age groups.
Scatter Plot of Age vs. Salary by Dependents: This plot further refined the previous analysis by adding dependents as a condition, showing how family responsibilities might affect income.
Challenges and Solutions
Complex Data Transformation: The raw data included various flat structures that needed to be nested into meaningful categories. This challenge was tackled using custom Python functions to parse and transform the data into hierarchical JSON formats.
Data Quality: Missing or inconsistent entries, particularly in the dependents column, were managed by applying sensible default values and error handling to maintain the dataset’s integrity.
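As a rough illustration of the default-value approach to the dependents column, the sketch below coerces each raw value to a non-negative integer and falls back to a default when that fails. The default of 0, the "Dependents" key, and the rule that negative values are invalid are assumptions, not details taken from the project.

```python
def clean_dependents(raw, default=0):
    """Return a non-negative integer, or the default if raw is unusable."""
    try:
        value = int(str(raw).strip())
    except ValueError:
        # Covers empty strings, None, and non-numeric text.
        return default
    return value if value >= 0 else default


def clean_records(records, default=0):
    """Clean the 'Dependents' field and count how many values were replaced."""
    cleaned, fixed = [], 0
    for record in records:
        raw = record.get("Dependents")
        value = clean_dependents(raw, default)
        if str(raw).strip() != str(value):
            fixed += 1
        cleaned.append({**record, "Dependents": value})
    return cleaned, fixed


# Illustrative rows only, not taken from the real dataset.
rows = [
    {"Name": "Ann Lee", "Dependents": "2"},
    {"Name": "Bob Ray", "Dependents": ""},
    {"Name": "Cy Dunn", "Dependents": "three"},
    {"Name": "Di Moss", "Dependents": " 1 "},
]

cleaned, fixed = clean_records(rows)
print([r["Dependents"] for r in cleaned], "replaced:", fixed)
```

Counting the replaced values keeps a record of how many entries were problematic, which is the kind of information the dependents chart is described as highlighting.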
Final Deliverables
Processed JSON Files:
processed.json: Contains the fully transformed and cleaned customer data.
retired.json and employed.json: Segmented datasets based on occupation.
remove_ccard.json: Customers flagged for potential credit card issues.
commute.json: Sorted records based on the newly calculated "Salary-Commute" metric.
Data Visualizations:
A set of insightful charts saved as images, capturing trends in customer demographics, salary distribution, and commute patterns.
Conclusion
This project successfully transformed raw customer data into a structured, analysis-ready format while generating visual insights that could inform business decisions. The blend of data cleaning, transformation, and visualization provided a comprehensive view of the company's customer base, enabling strategic planning and operational improvements.
Code and Analysis File: [Customer_data.ipynb]
Visualization Charts: [all_chart.pdf]
Detailed Report: [ProgDSAI_ACW.pdf]