In this data analytics project, I perform an exploratory data analysis (EDA) in Python.

I use an insurance charges dataset found on Kaggle.

The analysis covers data loading, initial inspection, statistical summaries, visualization, and correlation analysis.

Here is the step-by-step EDA process for the insurance data:

Initial Data Inspection
Univariate Analysis
Bivariate Analysis
Correlation Analysis
Outlier Detection
Advanced Analysis
Statistical Tests
Key Insights from the EDA

You can also find this work as a notebook on my Kaggle account.

Let’s start…

Import Library and Load Dataset

First, let’s import the pyforest library and load the dataset. Pyforest lazily imports several data analysis libraries, such as pandas, seaborn, and matplotlib, so there is no need to import them individually.

In [1]:

import pyforest

# Set style for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Load the dataset
df=pd.read_csv('/Users/iksolomon/Downloads/insurance.csv')
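
The hard-coded path above only works on my machine. Here is a minimal sketch of a more portable loader. It imports pandas explicitly, so it also runs without pyforest. The helper name `load_insurance` and the column check are assumptions added for illustration; they are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Columns shown by df.head() in this notebook.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def load_insurance(csv_path):
    """Load the insurance CSV and fail early if an expected column is missing.

    `load_insurance` is a hypothetical helper, not a pandas function.
    """
    path = Path(csv_path).expanduser()  # allows paths that start with "~"
    df = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df
```

You would then call `df = load_insurance(...)` with wherever the file lives on your system.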

1. Data Inspection

In [2]:

df.head()

Out[2]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

In [3]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Here, df.info() shows that every column has 1338 non-null values, so there are no missing values. Overall, the data is clean.
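
Note that df.info() reports non-null counts but says nothing about duplicated rows. Below is a short sketch of two explicit checks. To keep it self-contained, it runs on the five rows shown by df.head() (the name `sample` is illustrative); on the full data, the same methods would be called on df.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output above.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that are exact copies of an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```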

In [4]:

df.describe()

Out[4]:

	age	bmi	children	charges
count	1338.000000	1338.000000	1338.000000	1338.000000
mean	39.207025	30.663397	1.094918	13270.422265
std	14.049960	6.098187	1.205493	12110.011237
min	18.000000	15.960000	0.000000	1121.873900
25%	27.000000	26.296250	0.000000	4740.287150
50%	39.000000	30.400000	1.000000	9382.033000
75%	51.000000	34.693750	2.000000	16639.912515
max	64.000000	53.130000	5.000000	63770.428010

2. Univariate Analysis

In [5]:

# Numerical variables
num_cols = ['age', 'bmi', 'children', 'charges']
print("\n=== Numerical Variables Analysis ===")
print(df[num_cols].describe())

=== Numerical Variables Analysis ===
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
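
The mean of charges (13270) sits well above the median (9382), which already hints at a right-skewed distribution. As a rough check, here is a sketch that computes Pearson's median skewness from the summary statistics above. This coefficient is my own addition, not something the notebook computes; on the full DataFrame, `df['charges'].skew()` would give the moment-based sample skewness instead.

```python
# Summary statistics for charges, copied from the describe() output above.
mean_charges = 13270.422265
median_charges = 9382.033000
std_charges = 12110.011237

# Pearson's second (median) skewness coefficient: 3 * (mean - median) / std.
# Positive values indicate a longer right tail.
pearson_skew = 3 * (mean_charges - median_charges) / std_charges
print(f"Pearson median skewness of charges: {pearson_skew:.2f}")  # prints 0.96
```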

In [6]:

# Plot distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for i, col in enumerate(num_cols):
    sns.histplot(df[col], kde=True, ax=axes[i//2, i%2])
    axes[i//2, i%2].set_title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

[Figure: histograms with KDE curves for age, bmi, children, and charges]

In [7]:

# Categorical variables
cat_cols = ['sex', 'smoker', 'region']
print("\n=== Categorical Variables Analysis ===")
for col in cat_cols:
    print(f"\n{col} distribution:")
    print(df[col].value_counts(normalize=True) * 100)
    
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col)
    plt.title(f'Distribution of {col}')
    plt.show()

=== Categorical Variables Analysis ===

sex distribution:
sex
male      50.523169
female    49.476831
Name: proportion, dtype: float64

smoker distribution:
smoker
no     79.521674
yes    20.478326
Name: proportion, dtype: float64

region distribution:
region
southeast    27.204783
southwest    24.289985
northwest    24.289985
northeast    24.215247
Name: proportion, dtype: float64

3. Bivariate Analysis

In [8]:

# Relationship between categorical variables and charges
print("\n=== Charges by Categorical Variables ===")
for col in cat_cols:
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=df, x=col, y='charges')
    plt.title(f'Charges by {col}')
    plt.show()

=== Charges by Categorical Variables ===
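
Boxplots show the shape of each group, but it also helps to see the numbers behind them. Here is a minimal sketch of a grouped summary for one categorical column. It uses only the five rows from df.head(), so the values it prints are not representative of the full dataset; on the real data, `sample` would be replaced with df.

```python
import pandas as pd

# smoker and charges columns from the df.head() output above (five rows only).
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Count, mean, and median of charges for each smoker category.
summary = sample.groupby("smoker")["charges"].agg(["count", "mean", "median"])
print(summary)

# Ratio of mean charges, smokers over non-smokers.
ratio = summary.loc["yes", "mean"] / summary.loc["no", "mean"]
print(f"Mean charges ratio (smokers / non-smokers): {ratio:.2f}")
```

The same pattern works for sex and region by changing the groupby column.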

In [9]:

# Relationship between numerical variables and charges
print("\n=== Charges by Numerical Variables ===")
for col in ['age', 'bmi', 'children']:
    plt.figure(figsize=(8, 5))
    sns.scatterplot(data=df, x=col, y='charges', hue='smoker', alpha=0.7)
    plt.title(f'Charges vs {col}')
    plt.show()

=== Charges by Numerical Variables ===

4. Correlation Analysis

In [10]:

# Compute correlation matrix
corr = df.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

Now, let’s expand the correlation matrix to include the categorical variables by encoding them as numbers. This shows whether any of them has a strong (or weak) relationship with insurance charges.

In [11]:

# Encode the binary categories as 0/1 and one-hot encode region
df_copy = df.copy()
df_copy['sex'] = df_copy['sex'].map({'male': 1, 'female': 0})
df_copy['smoker'] = df_copy['smoker'].map({'yes': 1, 'no': 0})
df_copy = pd.get_dummies(df_copy, columns=['region'], drop_first=True)

In [12]:

plt.figure(figsize=(12,8))
sns.heatmap(df_copy.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
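
A heatmap can get crowded, so it may help to pull out only the charges column and rank the other variables by the strength of their correlation. Here is a sketch of that idea, again on the five rows from df.head() so that it runs on its own. The names `encoded` and `charges_corr` are illustrative, and the numbers from five rows say nothing about the full dataset.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output above.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Same encoding as the cell above; dtype=int keeps the dummy columns numeric.
encoded = sample.copy()
encoded["sex"] = encoded["sex"].map({"male": 1, "female": 0})
encoded["smoker"] = encoded["smoker"].map({"yes": 1, "no": 0})
encoded = pd.get_dummies(encoded, columns=["region"], drop_first=True, dtype=int)

# Correlation of every variable with charges, strongest (by absolute value) first.
charges_corr = (
    encoded.corr()["charges"]
    .drop("charges")
    .sort_values(key=abs, ascending=False)
)
print(charges_corr)
```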

In [13]:

# Pairplot to visualize relationships
sns.pairplot(df, hue='smoker', diag_kind='kde')
plt.suptitle('Pairplot of Numerical Variables by Smoking Status', y=1.02)
plt.show()

5. Outlier Detection

In [14]:

print("\n=== Outlier Detection ===")
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[num_cols], orient='h')
plt.title('Boxplot of Numerical Variables')
plt.show()

=== Outlier Detection ===
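
By default, seaborn's boxplot draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. Here is a sketch of the same rule in code, so outliers can be counted rather than only seen. The helper `iqr_outlier_mask` is a hypothetical name, and the fence below is computed from the quartiles in the describe() output.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Quartiles of charges, copied from the describe() output above.
q1_charges, q3_charges = 4740.287150, 16639.912515
upper_fence = q3_charges + 1.5 * (q3_charges - q1_charges)
print(f"Upper fence for charges: {upper_fence:.2f}")  # prints 34489.35
```

Since the maximum charge is 63770.43, some charges lie above this fence. On the full data, `iqr_outlier_mask(df['charges']).sum()` would count them.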

6. Advanced Analysis

In [15]:

# Age groups analysis
df['age_group'] = pd.cut(df['age'], bins=[17, 30, 40, 50, 60, 65], 
                        labels=['18-30', '31-40', '41-50', '51-60', '61-65'])
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='age_group', y='charges', hue='smoker')
plt.title('Charges by Age Group and Smoking Status')
plt.show()
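
One thing to watch with pd.cut is that any value outside the bin edges silently becomes NaN. Here is a small sketch that checks how the age bins above treat boundary values. The test ages are made up for illustration: 18 and 64 are the observed minimum and maximum, while 17 and 66 fall outside the bins.

```python
import pandas as pd

# Illustrative ages around the bin edges; 17 and 66 are outside the bins.
ages = pd.Series([17, 18, 30, 31, 64, 65, 66])

groups = pd.cut(ages, bins=[17, 30, 40, 50, 60, 65],
                labels=['18-30', '31-40', '41-50', '51-60', '61-65'])

# Intervals are right-closed by default: (17, 30], (30, 40], ..., (60, 65].
print(pd.DataFrame({"age": ages, "age_group": groups}))

# Values that fell outside every bin.
unbinned = int(groups.isna().sum())
print(f"Unbinned ages: {unbinned}")  # prints 2 (ages 17 and 66)
```

Because the dataset's ages run from 18 to 64, every row gets a group with these bins.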

In [16]:

# BMI categories
# right=False makes each bin [a, b), so a BMI of exactly 30 is labelled 'Obese'
df['bmi_category'] = pd.cut(df['bmi'], bins=[0, 18.5, 25, 30, 100], right=False,
                           labels=['Underweight', 'Normal', 'Overweight', 'Obese'])
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='bmi_category', y='charges', hue='smoker')
plt.title('Charges by BMI Category and Smoking Status')
plt.show()
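
pd.cut closes each interval on the right unless right=False is passed. This matters for values that sit exactly on a boundary: under the usual BMI convention, 30.0 counts as obese and 25.0 as overweight. Here is a small sketch comparing both settings on three illustrative boundary values.

```python
import pandas as pd

bmi = pd.Series([18.5, 25.0, 30.0])  # illustrative boundary values
bins = [0, 18.5, 25, 30, 100]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']

# Default: right-closed intervals (a, b], so 30.0 lands in (25, 30].
right_closed = pd.cut(bmi, bins=bins, labels=labels)

# right=False: left-closed intervals [a, b), so 30.0 lands in [30, 100).
left_closed = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    "bmi": bmi,
    "right=True": right_closed,
    "right=False": left_closed,
}))
```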

7. Statistical Tests

In [17]:

print("\n=== Statistical Tests ===")
# T-test for charges between smokers and non-smokers
smoker_charges = df[df['smoker'] == 'yes']['charges']
non_smoker_charges = df[df['smoker'] == 'no']['charges']
t_stat, p_val = stats.ttest_ind(smoker_charges, non_smoker_charges, equal_var=False)
print(f"T-test for charges (smokers vs non-smokers): t-stat={t_stat:.2f}, p-value={p_val:.2e}")

=== Statistical Tests ===

T-test for charges (smokers vs non-smokers): t-stat=32.75, p-value=5.89e-103
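
Passing equal_var=False makes ttest_ind run Welch's t-test, which does not assume that the two groups have equal variances. To show what the statistic measures, here is a sketch that computes it by hand with NumPy. The helper `welch_t` and the toy inputs are illustrative and do not come from the dataset.

```python
import numpy as np


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each group mean, based on the sample variance (ddof=1).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    # Difference in means divided by its standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t_stat, dof


# Toy data, not taken from the insurance dataset.
t_toy, dof_toy = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t={t_toy:.3f}, dof={dof_toy:.2f}")
```

A large absolute t value, like the 32.75 reported above, means the gap between the group means is many standard errors wide.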

In [18]:

# ANOVA for charges across regions
f_stat, p_val = stats.f_oneway(
    df[df['region'] == 'southwest']['charges'],
    df[df['region'] == 'southeast']['charges'],
    df[df['region'] == 'northwest']['charges'],
    df[df['region'] == 'northeast']['charges']
)
print(f"ANOVA for charges across regions: F-stat={f_stat:.2f}, p-value={p_val:.2f}")

ANOVA for charges across regions: F-stat=2.97, p-value=0.03
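
The F statistic compares how much the group means differ from each other with how much values vary inside each group. Here is a sketch of that calculation for a one-way ANOVA, written with NumPy. The helper `one_way_f` and the toy groups are illustrative and do not come from the dataset.

```python
import numpy as np


def one_way_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size
    # Variation of the group means around the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    # Ratio of the mean squares: between-group over within-group.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy groups, not taken from the insurance dataset.
f_toy = one_way_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f"F={f_toy:.2f}")  # prints F=3.00
```

An F statistic near 1 means the group means differ about as much as chance alone would suggest; the 2.97 above is only moderately larger, which matches the borderline p-value of 0.03.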

Key Insights from the EDA:

Dataset Overview:

1338 records, 7 features
No missing values
Mixed data types (numerical and categorical)

Target Variable (charges):

Highly right-skewed distribution
Mean: $13,270
Range: $1,122 to $63,770
Clear bimodal distribution likely due to smoker/non-smoker difference

Demographics:

Age range: 18-64 years
Balanced gender distribution (50.5% male, 49.5% female)
BMI ranges from 15.96 to 53.13 (mean 30.66)
20.5% smokers, 79.5% non-smokers
Region distribution fairly even (southeast slightly more represented)

Key Relationships:

Smokers pay significantly higher charges (confirmed by t-test)
Positive correlation between age and charges
BMI shows a positive relationship with charges, especially for smokers
Number of children has a modest impact on charges
The southeast region has higher average charges

Statistical Findings:

Smokers pay on average 3-4 times as much as non-smokers (t-test p-value < 0.001)
Statistically significant regional differences in charges (ANOVA p-value = 0.03)

Visual Findings:

Clear separation in charges between smokers and non-smokers across all age groups
BMI categories show that obese smokers have the highest charges
The relationship between age and charges is more pronounced for smokers

This is the end of the Exploratory Data Analysis project.

Feel free to comment.

Iksolomon

Exploratory Data Analysis of Insurance Dataset Using Python