In this data analytics project, I perform an exploratory data analysis (EDA) in Python on an insurance charges dataset from Kaggle.
The analysis covers data loading, initial inspection, statistical summaries, visualization, correlation analysis, outlier detection, and statistical tests.
Here is the step-by-step EDA process for the insurance data:
- Initial Data Inspection
- Univariate Analysis
- Bivariate Analysis
- Correlation Analysis
- Outlier Detection
- Advanced Analysis
- Statistical Tests
- Key Insights from the EDA
You can also find this work as a notebook on my Kaggle account.
Let’s start…
Import Library and Load Dataset
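First, let’s import the pyforest library and load the dataset. Pyforest lazily imports several data analysis libraries such as pandas, seaborn, and matplotlib under their usual aliases (pd, sns, plt), so there is no need to import them individually.
# In a Jupyter session where pyforest's auto-import is installed, these names are already available;
# elsewhere, a star import is needed to bring pd, sns, plt and stats into scope.
from pyforest import *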
# Set style for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
# Load the dataset
df=pd.read_csv('/Users/iksolomon/Downloads/insurance.csv')
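The absolute path above only resolves on the machine where the notebook was written. Below is a minimal sketch of a more portable loader, assuming the CSV has the seven columns shown in this post. `load_insurance` and `EXPECTED_COLUMNS` are illustrative names, not part of pandas or of the original notebook, and the two demo rows are copied from the `df.head()` output in the next section.

```python
import io

import pandas as pd

# Column names as they appear in the insurance dataset used in this post.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def load_insurance(source):
    """Read the CSV from a path or file-like object and check its columns."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df


# Demo on two rows; in practice, pass a file path such as "insurance.csv"
# (a hypothetical location next to the notebook).
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
demo = load_insurance(sample)
print(demo.shape)
```

Failing early on a missing column makes a wrong or truncated file obvious before any plots are drawn.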
1. Data Inspection
df.head()
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Here, df.info() shows 1338 non-null entries in each of the seven columns, so there are no missing values. In that respect, the data is clean.
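Non-null counts are only one aspect of data quality, and repeated rows are another common thing to check. The sketch below shows how both checks might look, using a tiny made-up frame (not the insurance data) that deliberately contains one missing value and one repeated row.

```python
import pandas as pd

# Tiny made-up frame (not the real data) with one NaN and one repeated row.
toy = pd.DataFrame({
    "age": [19, 18, 18, 28],
    "bmi": [27.9, 33.77, 33.77, None],
    "charges": [16884.92, 1725.55, 1725.55, 4449.46],
})

missing_per_column = toy.isna().sum()          # number of NaN values per column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

On the real frame, the same two calls would be `df.isna().sum()` and `df.duplicated().sum()`.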
df.describe()
| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
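In this summary, the mean of charges (about 13,270) sits well above the median (about 9,382), which is a typical sign of a right-skewed distribution. Here is a small sketch of how that could be quantified with `Series.skew()`. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up charges with one very large value (not the real dataset).
charges = pd.Series([1200.0, 2500.0, 4700.0, 9400.0, 16600.0, 63800.0], name="charges")

mean, median = charges.mean(), charges.median()
skewness = charges.skew()  # sample skewness; a positive value means a longer right tail

print(f"mean={mean:.0f}, median={median:.0f}, skew={skewness:.2f}")
```

On the real data, the equivalent call would be `df['charges'].skew()`.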
2. Univariate Analysis
# Numerical variables
num_cols = ['age', 'bmi', 'children', 'charges']
print("\n=== Numerical Variables Analysis ===")
print(df[num_cols].describe())
=== Numerical Variables Analysis ===
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
# Plot distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for i, col in enumerate(num_cols):
    sns.histplot(df[col], kde=True, ax=axes[i//2, i%2])
    axes[i//2, i%2].set_title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
# Categorical variables
cat_cols = ['sex', 'smoker', 'region']
print("\n=== Categorical Variables Analysis ===")
for col in cat_cols:
    print(f"\n{col} distribution:")
    print(df[col].value_counts(normalize=True) * 100)
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col)
    plt.title(f'Distribution of {col}')
    plt.show()
=== Categorical Variables Analysis ===

sex distribution:
sex
male      50.523169
female    49.476831
Name: proportion, dtype: float64

smoker distribution:
smoker
no     79.521674
yes    20.478326
Name: proportion, dtype: float64

region distribution:
region
southeast    27.204783
southwest    24.289985
northwest    24.289985
northeast    24.215247
Name: proportion, dtype: float64
3. Bivariate Analysis
# Relationship between categorical variables and charges
print("\n=== Charges by Categorical Variables ===")
for col in cat_cols:
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=df, x=col, y='charges')
    plt.title(f'Charges by {col}')
    plt.show()
=== Charges by Categorical Variables ===
# Relationship between numerical variables and charges
print("\n=== Charges by Numerical Variables ===")
for col in ['age', 'bmi', 'children']:
    plt.figure(figsize=(8, 5))
    sns.scatterplot(data=df, x=col, y='charges', hue='smoker', alpha=0.7)
    plt.title(f'Charges vs {col}')
    plt.show()
=== Charges by Numerical Variables ===
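The plots in this section show group differences visually, and a grouped summary table can put numbers next to them. The sketch below uses `groupby` with `agg` on made-up rows (not the real dataset). The column names match the insurance data, but the values are purely illustrative.

```python
import pandas as pd

# Made-up rows for illustration only (not the real dataset).
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "southwest"],
    "charges": [30000.0, 2000.0, 4000.0, 34000.0, 6000.0, 8000.0],
})

# Count, mean and median of charges for each level of each categorical column.
for col in ["smoker", "region"]:
    summary = toy.groupby(col)["charges"].agg(["count", "mean", "median"])
    print(f"\nCharges by {col}:")
    print(summary)

smoker_means = toy.groupby("smoker")["charges"].mean()
```

Applied to `df` with `cat_cols`, the same loop would give the group means and medians behind each boxplot.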
4. Correlation Analysis
# Compute correlation matrix
corr = df.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()
Now, let’s expand the correlation matrix to include the categorical variables by encoding them as numbers. This shows how strongly (or weakly) each variable, including sex, smoker, and region, is related to insurance charges.
df_copy = df.copy()
df_copy['sex'] = df_copy['sex'].map({'male': 1, 'female': 0})
df_copy['smoker'] = df_copy['smoker'].map({'yes': 1, 'no': 0})
df_copy = pd.get_dummies(df_copy, columns=['region'], drop_first=True)
plt.figure(figsize=(12, 8))
sns.heatmap(df_copy.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
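A heatmap with many columns can be hard to scan for the one relationship of interest. As a sketch, the correlations with charges can be pulled out of the encoded matrix and ranked by absolute value. The frame below is made up, and its charges are constructed so that smoking dominates, purely to make the ranking easy to follow.

```python
import pandas as pd

# Made-up rows; charges are constructed so that smoking dominates (illustration only).
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 25],
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "region": ["northeast", "southeast", "southwest",
               "northeast", "southeast", "southwest"],
})
toy["charges"] = 1000 + 10 * toy["age"] + 20000 * (toy["smoker"] == "yes")

# Same encoding idea as above: binary map for smoker, dummy columns for region.
encoded = toy.copy()
encoded["smoker"] = encoded["smoker"].map({"yes": 1, "no": 0})
encoded = pd.get_dummies(encoded, columns=["region"], drop_first=True, dtype=int)

# Correlation of every feature with charges, strongest (by magnitude) first.
corr_with_charges = encoded.corr()["charges"].drop("charges")
order = corr_with_charges.abs().sort_values(ascending=False).index
ranked = corr_with_charges.reindex(order)
print(ranked)
```

With `drop_first=True`, each region dummy is a contrast against the dropped region, so its correlation is not that region's overall effect.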
# Pairplot to visualize relationships
sns.pairplot(df, hue='smoker', diag_kind='kde')
plt.suptitle('Pairplot of Numerical Variables by Smoking Status', y=1.02)
plt.show()
5. Outlier Detection
print("\n=== Outlier Detection ===")
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[num_cols], orient='h')
plt.title('Boxplot of Numerical Variables')
plt.show()
=== Outlier Detection ===
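The boxplot flags points visually, and the same idea can be turned into counts. The sketch below applies the common 1.5 × IQR rule, which is also seaborn's default whisker setting. `iqr_outlier_mask` is a hypothetical helper name, and the values are made up for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up values (not the real dataset); one charge is far above the rest.
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, 28.0, 30.0, 33.0, 36.0],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 100000.0],
})

outlier_counts = {col: int(iqr_outlier_mask(toy[col]).sum()) for col in toy.columns}
print(outlier_counts)
```

For a right-skewed variable such as charges, many flagged points may be legitimate high-cost cases rather than errors, so counting them is a starting point rather than a reason to drop rows.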
6. Advanced Analysis
# Age groups analysis
df['age_group'] = pd.cut(df['age'], bins=[17, 30, 40, 50, 60, 65],
                         labels=['18-30', '31-40', '41-50', '51-60', '61-65'])
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='age_group', y='charges', hue='smoker')
plt.title('Charges by Age Group and Smoking Status')
plt.show()
# BMI categories (left-closed bins, so a BMI of exactly 25 is Overweight and exactly 30 is Obese)
df['bmi_category'] = pd.cut(df['bmi'], bins=[0, 18.5, 25, 30, 100], right=False,
                            labels=['Underweight', 'Normal', 'Overweight', 'Obese'])
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='bmi_category', y='charges', hue='smoker')
plt.title('Charges by BMI Category and Smoking Status')
plt.show()
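The grouped boxplot can be complemented with a two-way table of mean charges. This sketch uses `pivot_table` on made-up rows (not the real dataset) and left-closed BMI bins, so a value of exactly 30 would count as Obese.

```python
import pandas as pd

# Made-up rows for illustration only (not the real dataset).
toy = pd.DataFrame({
    "bmi": [17.0, 22.0, 27.0, 32.0, 35.0, 31.0],
    "smoker": ["no", "no", "no", "no", "yes", "yes"],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0, 40000.0, 42000.0],
})

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, 100).
toy["bmi_category"] = pd.cut(
    toy["bmi"], bins=[0, 18.5, 25, 30, 100], right=False,
    labels=["Underweight", "Normal", "Overweight", "Obese"],
)

# Mean charges for every BMI category / smoking status combination present.
table = toy.pivot_table(
    index="bmi_category", columns="smoker", values="charges",
    aggfunc="mean", observed=True,
)
print(table.round(0))
```

Cells for combinations that do not occur in the data appear as NaN. Small cells are worth checking on the real frame before reading much into a mean.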
7. Statistical Tests
print("\n=== Statistical Tests ===")
# T-test for charges between smokers and non-smokers
smoker_charges = df[df['smoker'] == 'yes']['charges']
non_smoker_charges = df[df['smoker'] == 'no']['charges']
t_stat, p_val = stats.ttest_ind(smoker_charges, non_smoker_charges, equal_var=False)
print(f"T-test for charges (smokers vs non-smokers): t-stat={t_stat:.2f}, p-value={p_val:.2e}")
=== Statistical Tests ===
T-test for charges (smokers vs non-smokers): t-stat=32.75, p-value=5.89e-103
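A very small p-value says the difference is unlikely to be chance, but not how large the difference is. The sketch below reports the group means and their ratio alongside Welch's t-test, which is what `equal_var=False` requests. The numbers are made up for illustration and are not results from the dataset.

```python
import pandas as pd
from scipy import stats

# Made-up charges for two small groups (not the real dataset).
smokers = pd.Series([30000.0, 32000.0, 34000.0])
non_smokers = pd.Series([7000.0, 8000.0, 9000.0])

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_val = stats.ttest_ind(smokers, non_smokers, equal_var=False)

# Simple effect-size summaries to report next to the test.
mean_diff = smokers.mean() - non_smokers.mean()
mean_ratio = smokers.mean() / non_smokers.mean()

print(f"t={t_stat:.2f}, p={p_val:.4f}")
print(f"difference in means={mean_diff:.0f}, ratio of means={mean_ratio:.1f}")
```

On the real data, the two series would be `smoker_charges` and `non_smoker_charges` from the code above.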
# ANOVA for charges across regions
f_stat, p_val = stats.f_oneway(
    df[df['region'] == 'southwest']['charges'],
    df[df['region'] == 'southeast']['charges'],
    df[df['region'] == 'northwest']['charges'],
    df[df['region'] == 'northeast']['charges']
)
print(f"ANOVA for charges across regions: F-stat={f_stat:.2f}, p-value={p_val:.2f}")
ANOVA for charges across regions: F-stat=2.97, p-value=0.03
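As with the t-test, the ANOVA p-value does not say how much of the variation in charges the grouping accounts for. The sketch below builds the groups with `groupby` instead of listing each region by hand, then computes eta squared (between-group sum of squares divided by total sum of squares). The three-region frame is made up for illustration.

```python
import pandas as pd
from scipy import stats

# Made-up charges for three regions (not the real dataset).
toy = pd.DataFrame({
    "region": ["northeast"] * 3 + ["southeast"] * 3 + ["southwest"] * 3,
    "charges": [1000.0, 2000.0, 3000.0,
                2000.0, 3000.0, 4000.0,
                5000.0, 6000.0, 7000.0],
})

# One array of charges per region, whatever regions are present.
groups = [g.to_numpy() for _, g in toy.groupby("region")["charges"]]
f_stat, p_val = stats.f_oneway(*groups)

# Eta squared: share of total variation explained by the grouping.
grand_mean = toy["charges"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((toy["charges"] - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F={f_stat:.2f}, p={p_val:.4f}, eta squared={eta_squared:.4f}")
```

A significant F with a small eta squared would mean the regional differences are real but explain little of the spread in charges.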
Key Insights from the EDA:
- Dataset Overview:
  - 1338 records, 7 features
  - No missing values
  - Mixed data types (numerical and categorical)
- Target Variable (charges):
  - Highly right-skewed distribution
  - Mean: $13,270 (median: $9,382)
  - Range: $1,122 to $63,770
  - Bimodal shape, likely due to the smoker/non-smoker difference
- Demographics:
  - Age range: 18-64 years
  - Balanced sex distribution (50.5% male, 49.5% female)
  - BMI ranges from 15.96 to 53.13 (mean 30.66)
  - 20.5% smokers, 79.5% non-smokers
  - Region distribution fairly even (southeast slightly more represented at 27.2%)
- Key Relationships:
  - Smokers pay significantly higher charges (Welch's t-test: t = 32.75, p = 5.89e-103)
  - Positive correlation between age and charges
  - BMI shows a positive relationship with charges, especially for smokers
  - Number of children has a modest impact on charges
  - The southeast region has higher average charges
- Statistical Findings:
  - Smokers pay on average 3-4 times more than non-smokers (p-value < 0.001)
  - Statistically significant regional differences in charges (ANOVA: F = 2.97, p = 0.03)
- Visual Findings:
  - Clear separation in charges between smokers and non-smokers across all age groups
  - BMI categories show that obese smokers have the highest charges
  - The relationship between age and charges is more pronounced for smokers
This concludes the exploratory data analysis project.
Feel free to comment.