Academic Performance Analysis

In [85]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading Data

In [86]:
data = pd.read_csv("./dataset/en_lpor_explorer.csv")

print(data.shape)
data.head()
(649, 31)
Out[86]:
| | School | Gender | Age | Housing_Type | Family_Size | Parental_Status | Mother_Education | Father_Education | Mother_Work | Father_Work | ... | Is_Dating | Good_Family_Relationship | Free_Time_After_School | Time_with_Friends | Alcohol_Weekdays | Alcohol_Weekends | Health_Status | School_Absence | Grade_1st_Semester | Grade_2nd_Semester |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gabriel Pereira | Female | 18 | Urban | Above 3 | Separated | Higher Education | Higher Education | Homemaker | Teacher | ... | No | Good | Moderate | High | Very Low | Very Low | Fair | 4 | 0 | 11 |
| 1 | Gabriel Pereira | Female | 17 | Urban | Above 3 | Living Together | Primary School | Primary School | Homemaker | other | ... | No | Excellent | Moderate | Moderate | Very Low | Very Low | Fair | 2 | 9 | 11 |
| 2 | Gabriel Pereira | Female | 15 | Urban | Up to 3 | Living Together | Primary School | Primary School | Homemaker | other | ... | No | Good | Moderate | Low | Low | Moderate | Fair | 6 | 12 | 13 |
| 3 | Gabriel Pereira | Female | 15 | Urban | Above 3 | Living Together | Higher Education | Lower Secondary School | Health | Services | ... | Yes | Fair | Low | Low | Very Low | Very Low | Very Good | 0 | 14 | 14 |
| 4 | Gabriel Pereira | Female | 16 | Urban | Above 3 | Living Together | High School | High School | other | other | ... | No | Good | Moderate | Low | Very Low | Low | Very Good | 0 | 11 | 13 |

5 rows × 31 columns

Exploratory Data Analysis

In [87]:
data.columns
Out[87]:
Index(['School', 'Gender', 'Age', 'Housing_Type', 'Family_Size',
       'Parental_Status', 'Mother_Education', 'Father_Education',
       'Mother_Work', 'Father_Work', 'Reason_School_Choice',
       'Legal_Responsibility', 'Commute_Time', 'Weekly_Study_Time',
       'Extra_Educational_Support', 'Parental_Educational_Support',
       'Private_Tutoring', 'Extracurricular_Activities', 'Attended_Daycare',
       'Desire_Graduate_Education', 'Has_Internet', 'Is_Dating',
       'Good_Family_Relationship', 'Free_Time_After_School',
       'Time_with_Friends', 'Alcohol_Weekdays', 'Alcohol_Weekends',
       'Health_Status', 'School_Absence', 'Grade_1st_Semester',
       'Grade_2nd_Semester'],
      dtype='object')
In [88]:
data = data.drop(columns=[
            'Housing_Type',
            'Family_Size',
            'Father_Education',
            'Mother_Education',
            'Father_Work',
            'Mother_Work',
            'Reason_School_Choice',
            'Commute_Time',
            'Extracurricular_Activities',
            'Attended_Daycare',
            'Desire_Graduate_Education',
            'Free_Time_After_School',
            'Time_with_Friends'
                          ])
In [89]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   School                        649 non-null    object
 1   Gender                        649 non-null    object
 2   Age                           649 non-null    int64 
 3   Parental_Status               649 non-null    object
 4   Legal_Responsibility          649 non-null    object
 5   Weekly_Study_Time             649 non-null    object
 6   Extra_Educational_Support     649 non-null    object
 7   Parental_Educational_Support  649 non-null    object
 8   Private_Tutoring              649 non-null    object
 9   Has_Internet                  649 non-null    object
 10  Is_Dating                     649 non-null    object
 11  Good_Family_Relationship      649 non-null    object
 12  Alcohol_Weekdays              649 non-null    object
 13  Alcohol_Weekends              649 non-null    object
 14  Health_Status                 649 non-null    object
 15  School_Absence                649 non-null    int64 
 16  Grade_1st_Semester            649 non-null    int64 
 17  Grade_2nd_Semester            649 non-null    int64 
dtypes: int64(4), object(14)
memory usage: 91.4+ KB
In [90]:
data.isna().sum()
Out[90]:
School                          0
Gender                          0
Age                             0
Parental_Status                 0
Legal_Responsibility            0
Weekly_Study_Time               0
Extra_Educational_Support       0
Parental_Educational_Support    0
Private_Tutoring                0
Has_Internet                    0
Is_Dating                       0
Good_Family_Relationship        0
Alcohol_Weekdays                0
Alcohol_Weekends                0
Health_Status                   0
School_Absence                  0
Grade_1st_Semester              0
Grade_2nd_Semester              0
dtype: int64
In [91]:
data.describe(include = "all")
Out[91]:
| | School | Gender | Age | Parental_Status | Legal_Responsibility | Weekly_Study_Time | Extra_Educational_Support | Parental_Educational_Support | Private_Tutoring | Has_Internet | Is_Dating | Good_Family_Relationship | Alcohol_Weekdays | Alcohol_Weekends | Health_Status | School_Absence | Grade_1st_Semester | Grade_2nd_Semester |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 649 | 649 | 649.000000 | 649 | 649 | 649 | 649 | 649 | 649 | 649 | 649 | 649 | 649 | 649 | 649 | 649.000000 | 649.000000 | 649.000000 |
| unique | 2 | 2 | NaN | 2 | 3 | 4 | 2 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | NaN | NaN | NaN |
| top | Gabriel Pereira | Female | NaN | Living Together | Mother | 2 to 5h | No | Yes | No | Yes | No | Good | Very Low | Very Low | Very Good | NaN | NaN | NaN |
| freq | 423 | 383 | NaN | 569 | 455 | 305 | 581 | 398 | 610 | 498 | 410 | 317 | 451 | 247 | 249 | NaN | NaN | NaN |
| mean | NaN | NaN | 16.744222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.659476 | 11.399076 | 11.570108 |
| std | NaN | NaN | 1.218138 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.640759 | 2.745265 | 2.913639 |
| min | NaN | NaN | 15.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 |
| 25% | NaN | NaN | 16.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 10.000000 | 10.000000 |
| 50% | NaN | NaN | 17.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.000000 | 11.000000 | 11.000000 |
| 75% | NaN | NaN | 18.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.000000 | 13.000000 | 13.000000 |
| max | NaN | NaN | 22.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32.000000 | 19.000000 | 19.000000 |
In [92]:
data["Average_Grade"] = data[["Grade_1st_Semester", "Grade_2nd_Semester"]].mean(axis=1)
In [93]:
cat_cols = [
  'School', 'Gender', 'Age',
  'Parental_Status', 'Legal_Responsibility', "Weekly_Study_Time",
  "Alcohol_Weekdays", 'Alcohol_Weekends', 'Health_Status',
  'Extra_Educational_Support', 'Parental_Educational_Support', 'Private_Tutoring',
  'Is_Dating', 'Has_Internet', 'Good_Family_Relationship'
  ]

Visualization of Categorical Data

In [94]:
fig, axes = plt.subplots(5, 3, figsize=(18, 16))
fig.suptitle("Distribution of Categorical Variables", fontsize=16)

# One count plot per categorical column, bars ordered by frequency
for ax, col in zip(axes.flat, cat_cols):
    sns.countplot(
        ax=ax,
        x=col,
        data=data,
        order=data[col].value_counts().index
    )

    # Annotate each bar with its count
    for container in ax.containers:
        ax.bar_label(container, fmt='%d', label_type='edge', color='black')

plt.tight_layout()
plt.show()
[Figure: Distribution of Categorical Variables, a 5 × 3 grid of count plots]

The count plots show how many students fall into each category, giving an overview of the demographic and environmental characteristics present in the dataset.

Conclusion:

The categorical distributions highlight differences in student backgrounds and support systems, which may influence academic outcomes.

Calculating Passing Grade

A passing grade is at least 50%, that is, an average grade of 10 or more out of 20.
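The threshold rule can be checked on a handful of values before applying it to the full dataset. The sketch below is only an illustration: the `toy` frame and the `PASS_MARK` constant are assumptions introduced here, and `np.where` is used as a compact two-way alternative to `np.select`.

```python
import numpy as np
import pandas as pd

PASS_MARK = 10  # assumed name; 50% of the 20-point scale

# Small illustrative frame (not the real dataset)
toy = pd.DataFrame({
    "Grade_1st_Semester": [0, 9, 12, 10],
    "Grade_2nd_Semester": [11, 11, 13, 9],
})

# Average of the two semester grades, row by row
toy["Average_Grade"] = toy[["Grade_1st_Semester", "Grade_2nd_Semester"]].mean(axis=1)

# Two-way split: at or above the pass mark is "Passed", otherwise "Failed"
toy["Pass"] = np.where(toy["Average_Grade"] >= PASS_MARK, "Passed", "Failed")
print(toy)
```

An average of exactly 10.0 counts as a pass under this rule, while 9.5 does not.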

In [95]:
conditions = [
    (data["Average_Grade"] < 10),
    (data["Average_Grade"] >= 10)
]

values = ["Failed", "Passed"]
data["Pass"] = np.select(conditions, values, default="Unknown")

data.head()
Out[95]:
| | School | Gender | Age | Parental_Status | Legal_Responsibility | Weekly_Study_Time | Extra_Educational_Support | Parental_Educational_Support | Private_Tutoring | Has_Internet | Is_Dating | Good_Family_Relationship | Alcohol_Weekdays | Alcohol_Weekends | Health_Status | School_Absence | Grade_1st_Semester | Grade_2nd_Semester | Average_Grade | Pass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gabriel Pereira | Female | 18 | Separated | Mother | 2 to 5h | Yes | No | No | No | No | Good | Very Low | Very Low | Fair | 4 | 0 | 11 | 5.5 | Failed |
| 1 | Gabriel Pereira | Female | 17 | Living Together | Father | 2 to 5h | No | Yes | No | Yes | No | Excellent | Very Low | Very Low | Fair | 2 | 9 | 11 | 10.0 | Passed |
| 2 | Gabriel Pereira | Female | 15 | Living Together | Mother | 2 to 5h | Yes | No | No | Yes | No | Good | Low | Moderate | Fair | 6 | 12 | 13 | 12.5 | Passed |
| 3 | Gabriel Pereira | Female | 15 | Living Together | Mother | 5 to 10h | No | Yes | No | Yes | Yes | Fair | Very Low | Very Low | Very Good | 0 | 14 | 14 | 14.0 | Passed |
| 4 | Gabriel Pereira | Female | 16 | Living Together | Father | 2 to 5h | No | Yes | No | No | No | Good | Very Low | Low | Very Good | 0 | 11 | 13 | 12.0 | Passed |
In [96]:
data["Pass"].value_counts()
Out[96]:
Pass
Passed    478
Failed    171
Name: count, dtype: int64
In [97]:
sns.countplot(x = "Pass", data=data) 
Out[97]:
<Axes: xlabel='Pass', ylabel='count'>
[Figure: count plot of the Pass column (Passed vs. Failed)]

Conclusion:

The count plot of pass and fail outcomes shows how many students met the minimum academic requirement compared with those who did not: 478 passed and 171 failed.
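The raw counts can also be expressed as shares of the cohort, which is easier to compare across groups. The snippet below is a minimal sketch that re-enters the two counts from the `value_counts()` output by hand; on the real frame, `data["Pass"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output for the Pass column
counts = pd.Series({"Passed": 478, "Failed": 171}, name="count")

# Convert counts to shares of all 649 students
shares = counts / counts.sum()
print(shares.round(3))
```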

Pass Rate by Categorical Values

In [98]:
fig, axes = plt.subplots(5, 3, figsize=(15, 20))

for ax, col in zip(axes.flat, cat_cols):
    # Row-normalised crosstab: Failed/Passed shares within each category
    pd.crosstab(
        data[col],
        data['Pass'],
        normalize="index"
    ).plot(kind="bar", stacked=True, ax=ax)

    # Label each bar segment with its proportion
    for container in ax.containers:
        ax.bar_label(container, fmt='%.2f', label_type='center', color='black')

    ax.set_title(f'Pass Rate by {col}')
    ax.set_xlabel(col)
    ax.set_ylabel("Proportion")
    ax.tick_params(axis='x', labelrotation=0)

plt.tight_layout()
plt.show()
[Figure: stacked bar charts of pass/fail proportions for each categorical variable]

The School panel compares the proportion of students who passed or failed in each school. Differences between schools may reflect variations in educational quality, resources, or teaching methods.

Students reporting better health status tend to show higher pass rates than those reporting poor health. Good physical health may enhance concentration, attendance and learning ability.

Students who reported stronger family relationships were more likely to pass than those with weaker family support systems. A supportive family environment can provide emotional encouragement and motivation for academic success.
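Row-normalised proportions hide how many students sit behind each bar, so a pass rate from a very small category can look more decisive than it is. One way to keep both in view is to tabulate the pass rate next to the group size. The helper below is a hypothetical sketch (the name `pass_rate_table` and the toy rows are assumptions, not part of the notebook).

```python
import pandas as pd

def pass_rate_table(df, col, outcome="Pass", positive="Passed"):
    """Return group size and pass rate for each level of `col` (hypothetical helper)."""
    grouped = df.groupby(col)[outcome]
    return pd.DataFrame({
        "n": grouped.size(),
        "pass_rate": grouped.apply(lambda s: (s == positive).mean()),
    })

# Toy rows for illustration only
toy = pd.DataFrame({
    "Health_Status": ["Good", "Good", "Good", "Poor", "Poor"],
    "Pass": ["Passed", "Passed", "Failed", "Passed", "Failed"],
})

table = pass_rate_table(toy, "Health_Status")
print(table)
```

On the real data the same call could be made as `pass_rate_table(data, "Health_Status")`, or looped over `cat_cols`.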

Multivariate Plots

Alcohol Consumption × Study Time → Grade

In [99]:
if all(c in data.columns for c in ['Alcohol_Weekdays', 'Weekly_Study_Time', 'Average_Grade']):
    fig, ax = plt.subplots(figsize=(8, 5))
    scatter = ax.scatter(
        data['Alcohol_Weekdays'], data['Weekly_Study_Time'],
        c=data['Average_Grade'], cmap='RdYlGn',
        alpha=0.7, edgecolors='white', linewidths=0.3, s=60
    )
    plt.colorbar(scatter, ax=ax, label='Average_Grade')
    ax.set_xlabel('Alcohol Consumption (Weekdays)')
    ax.set_ylabel('Weekly Study Time')
    ax.set_title('Alcohol × Study Time → Academic Grade')
    plt.tight_layout()
    plt.show()
[Figure: Alcohol × Study Time → Academic Grade scatter plot, colored by Average_Grade]

Conclusion:

Higher alcohol consumption is generally associated with lower academic performance, while increased study time tends to correlate with better grades.
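Because both axes of the scatter plot are categorical, many students land on the same point and the colors overlap. A table of mean grades per combination, with the levels placed in their natural order, can make the same comparison easier to read. The sketch below uses toy rows and only the labels visible in the outputs above; the full ordered level lists for the real columns are an assumption that would need to be filled in from `data[col].unique()`.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Alcohol_Weekdays": ["Low", "Very Low", "Moderate", "Very Low", "Low", "Moderate"],
    "Weekly_Study_Time": ["2 to 5h", "5 to 10h", "2 to 5h", "2 to 5h", "5 to 10h", "5 to 10h"],
    "Average_Grade": [11.0, 14.0, 9.0, 12.0, 13.0, 10.0],
})

# Partial orderings built from labels seen in the notebook output (assumed order)
alcohol_order = ["Very Low", "Low", "Moderate"]
study_order = ["2 to 5h", "5 to 10h"]

toy["Alcohol_Weekdays"] = pd.Categorical(
    toy["Alcohol_Weekdays"], categories=alcohol_order, ordered=True
)
toy["Weekly_Study_Time"] = pd.Categorical(
    toy["Weekly_Study_Time"], categories=study_order, ordered=True
)

# Mean grade per (study time, alcohol) cell, rows and columns in level order
mean_grade = toy.pivot_table(
    values="Average_Grade",
    index="Weekly_Study_Time",
    columns="Alcohol_Weekdays",
    aggfunc="mean",
    observed=False,
)
print(mean_grade)
```

The resulting table can be passed to `sns.heatmap` in the same way as the pivot tables in the following sections.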

Average Grade: Internet Access × Private Tutoring

In [100]:
if all(c in data.columns for c in ['Has_Internet', 'Private_Tutoring', 'Average_Grade']):
    pivot = data.pivot_table(
        values='Average_Grade',
        index='Has_Internet',
        columns='Private_Tutoring',
        aggfunc='mean'
    )
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.heatmap(pivot, annot=True, fmt='.2f', cmap='YlGnBu', linewidths=0.5, ax=ax)
    ax.set_title('Average Grade\nInternet Access × Private Tutoring')
    plt.tight_layout()
    plt.show()
[Figure: heatmap of Average Grade by Internet Access × Private Tutoring]

Conclusion:

Students with both internet access and private tutoring tend to achieve higher grades than those without such academic support systems.
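The `describe` output reports "No" as the most frequent `Private_Tutoring` value with 610 of 649 rows, so only 39 students have tutoring and the tutoring cells of the heatmap rest on few observations. Showing the cell counts next to the means helps judge how much weight each average can bear. The following is a sketch on toy rows, not the notebook's own computation.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Has_Internet": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
    "Private_Tutoring": ["No", "No", "Yes", "No", "No", "No"],
    "Average_Grade": [12.0, 10.0, 14.0, 9.0, 11.0, 11.0],
})

# Mean grade and number of students for every observed combination
summary = (
    toy.groupby(["Has_Internet", "Private_Tutoring"])["Average_Grade"]
    .agg(["mean", "count"])
)
print(summary)
```

Combinations that never occur (here, tutoring without internet access) simply do not appear in the result.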

Family Relationship × Parental Support → Average Grade

In [101]:
if all(c in data.columns for c in ['Good_Family_Relationship', 'Parental_Educational_Support', 'Average_Grade']):
    pivot = data.pivot_table(
        values='Average_Grade',
        index='Good_Family_Relationship',
        columns= 'Parental_Educational_Support',
        aggfunc='mean'
    )
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.heatmap(pivot, annot=True, fmt='.2f', cmap='YlGnBu', linewidths=0.5, ax=ax)
    ax.set_title('Average Grade\nFamily Relationship × Parental Support')
    plt.tight_layout()
    plt.show()
[Figure: heatmap of average grade by Family Relationship × Parental Support]

Conclusion:

Students with strong family relationships and active parental educational support generally achieve higher average grades. Parental involvement may encourage better study habits and academic motivation.

Note:

The EDA suggests that academic performance is associated with a combination of educational, behavioural and social factors. Variables such as study habits, health status, family relationships, access to educational resources and lifestyle choices all show a noticeable relationship with student outcomes. Strengthening family support, improving access to educational resources and promoting healthy study habits may therefore help improve student academic performance.