I computed descriptive statistics (mean, median, and skewness) for the diabetes, obesity, and inactivity data using Python. The main challenge was that the dataset contains non-numeric values and many “NaN” (Not-a-Number) entries. To keep the statistics accurate, I added a preprocessing step that converts non-numeric entries to NaN and then excludes all NaN values from the calculations.
My observations and results are presented below, along with the code used for the analysis.
My next step is to build visualizations such as histograms and box plots, which show how each variable is distributed. I also plan to segment the data geographically, grouping it by county or state, so that each subgroup’s characteristics can be examined in more detail.
After that, I intend to run hypothesis tests to check for significant differences in obesity, inactivity, and diabetes rates among counties in different states, using statistical tests such as the t-test or analysis of variance (ANOVA).
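As a rough sketch of what the grouping and distribution steps could look like: the rows below are made up for illustration, and "STATE" is an assumed column name that I have not confirmed against the spreadsheet. The histogram counts and quartiles are the numbers a histogram and a box plot would draw.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; "STATE" is an assumed column name.
df = pd.DataFrame({
    "STATE": ["Alabama", "Alabama", "Alabama", "Texas", "Texas", "Texas"],
    "% DIABETIC": [9.1, 10.4, 8.7, 7.9, 8.2, 9.6],
})

# Per-state descriptive statistics.
by_state = df.groupby("STATE")["% DIABETIC"].agg(["count", "mean", "median"])
print(by_state)

# Numbers behind a histogram: how many values fall into each bin.
counts, bin_edges = np.histogram(df["% DIABETIC"], bins=3)
print("Bin counts:", counts)
print("Bin edges:", bin_edges)

# Numbers behind a box plot: the three quartiles.
q1, q2, q3 = df["% DIABETIC"].quantile([0.25, 0.5, 0.75])
print("Quartiles:", q1, q2, q3)
```

With matplotlib installed, pandas can draw the plots themselves through DataFrame.hist and DataFrame.boxplot.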
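A minimal sketch of how these tests might be run with SciPy, again on made-up values with an assumed "STATE" column. I use Welch's version of the t-test here (equal_var=False), which is my own choice for the sketch and does not assume equal variances between the two groups.

```python
import pandas as pd
from scipy import stats

# Made-up county-level values; "STATE" is an assumed column name.
df = pd.DataFrame({
    "STATE": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "% OBESE": [18.1, 19.0, 17.5, 18.6,
                21.2, 20.8, 22.0, 21.5,
                18.4, 18.9, 17.9, 18.2],
})

# One array of county values per state (groupby sorts the keys: A, B, C).
groups = [g["% OBESE"].to_numpy() for _, g in df.groupby("STATE")]

# One-way ANOVA: do the state means differ across all three groups?
f_stat, p_anova = stats.f_oneway(*groups)
print("ANOVA F:", f_stat, "p-value:", p_anova)

# Welch's t-test: compare two states without assuming equal variances.
t_stat, p_ttest = stats.ttest_ind(groups[0], groups[1], equal_var=False)
print("t:", t_stat, "p-value:", p_ttest)
```

A small p-value from the ANOVA only says that at least one state differs; pairwise tests like the t-test are then needed to see which ones.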
Code:
# Run in a notebook cell; reading .xlsx files with pandas also needs openpyxl
!pip install pandas openpyxl scipy
import numpy as np
import pandas as pd
from scipy import stats

# Load the three sheets from the workbook
file_path = 'D:\\Juice Wrld\\University\\Subjects\\Fall 2023\\MTH 522 – Mathematical Statistics\\a1.xlsx'
df = pd.read_excel(file_path, sheet_name='Diabetes')
df1 = pd.read_excel(file_path, sheet_name='Obesity')
df2 = pd.read_excel(file_path, sheet_name='Inactivity')

# Convert each column to numeric; non-numeric entries become NaN
selected_column_diabetes = pd.to_numeric(df['% DIABETIC'], errors='coerce')
selected_column_obesity = pd.to_numeric(df1['% OBESE'], errors='coerce')
selected_column_inactivity = pd.to_numeric(df2['% INACTIVE'], errors='coerce')

# Mean, ignoring NaN values
mean_diabetes = np.nanmean(selected_column_diabetes)
mean_obesity = np.nanmean(selected_column_obesity)
mean_inactivity = np.nanmean(selected_column_inactivity)

# Median, ignoring NaN values
median_diabetes = np.nanmedian(selected_column_diabetes)
median_obesity = np.nanmedian(selected_column_obesity)
median_inactivity = np.nanmedian(selected_column_inactivity)

# Skewness, omitting NaN values
skewness_diabetes = stats.skew(selected_column_diabetes, nan_policy='omit')
skewness_obesity = stats.skew(selected_column_obesity, nan_policy='omit')
skewness_inactivity = stats.skew(selected_column_inactivity, nan_policy='omit')

# Print the results
print("Mean of % Diabetes:", mean_diabetes)
print("Median of % Diabetes:", median_diabetes)
print("Skewness of % Diabetes:", skewness_diabetes)
print("Mean of % Obesity:", mean_obesity)
print("Median of % Obesity:", median_obesity)
print("Skewness of % Obesity:", skewness_obesity)
print("Mean of % Inactivity:", mean_inactivity)
print("Median of % Inactivity:", median_inactivity)
print("Skewness of % Inactivity:", skewness_inactivity)
Output:
Mean of % Diabetes: 8.714891599294615
Median of % Diabetes: 8.4
Skewness of % Diabetes: 0.919133846559114
Mean of % Obesity: 18.1678463101346
Median of % Obesity: 18.3
Skewness of % Obesity: -7.104751234514205
Mean of % Inactivity: 16.52061469099719
Median of % Inactivity: 16.7
Skewness of % Inactivity: -0.9638674028443353
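The obesity skewness (about -7.1) is much larger in magnitude than the other two, even though its mean and median are close together. One possible explanation, which I have not verified on the real data, is a small number of extreme low values. The sketch below uses synthetic numbers only (not the dataset) to show how a few such values can drive skewness strongly negative while barely moving the median.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only, not the real obesity column.
rng = np.random.default_rng(0)
values = rng.normal(loc=18.0, scale=1.0, size=500)

# Hypothetical scenario: a few entries recorded as 0.
with_outliers = np.append(values, [0.0, 0.0, 0.0])

skew_clean = stats.skew(values)
skew_outliers = stats.skew(with_outliers)

print("Skewness without outliers:", skew_clean)
print("Skewness with outliers:", skew_outliers)
print("Median without outliers:", np.median(values))
print("Median with outliers:", np.median(with_outliers))
```

Checking the minimum of the real column, for example with selected_column_obesity.min(), would be a quick way to see whether something similar is happening here.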