Monday – September 11, 2023.

I computed descriptive statistics (mean, median, and skewness) for the dataset in Python. Along the way I ran into a problem: the data contained non-numeric entries and many “NaN” (Not-a-Number) values, which would otherwise break or distort the calculations. To keep the statistics accurate, I added preprocessing steps that exclude these non-numeric and “NaN” entries from the computations.
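As a small illustration of that preprocessing idea, here is a minimal sketch on a made-up four-value column (the values are hypothetical and not taken from my spreadsheet). It shows how `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed into NaN, so that NaN-aware functions can then skip those entries.

```python
import numpy as np
import pandas as pd

# Made-up sample column: two usable values, one text entry, one missing cell.
raw = pd.Series(["8.4", "No Data", None, 9.1])

# errors="coerce" converts anything that is not a number into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

print("Entries that became NaN:", numeric.isna().sum())
print("Values kept:", numeric.dropna().tolist())

# NaN-aware functions ignore the NaN entries instead of returning NaN.
print("Mean ignoring NaN:", np.nanmean(numeric))
```

Coercing first and then using NaN-aware functions keeps the original rows in place, which may matter later if the three sheets need to be matched county by county.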

My observations and results are presented below, along with the code used for the analysis.

My next step is to build visualizations such as histograms and box plots, which show the shape of each distribution at a glance. I also plan to segment the data by geography, grouping it by county or state, so that each subgroup’s characteristics can be examined in more detail.
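A rough sketch of what that next step might look like is below. It runs on synthetic data, and the `STATE` column name, the three state names, and the helper `plot_distribution` are assumptions for illustration; the real sheets may be laid out differently.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the 'Diabetes' sheet. The values and the 'STATE'
# column name are assumptions, not taken from a1.xlsx.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "STATE": rng.choice(["Alabama", "Ohio", "Texas"], size=300),
    "% DIABETIC": rng.normal(loc=8.7, scale=1.5, size=300).round(1),
})

def plot_distribution(values, label):
    """Draw a histogram and a box plot of one numeric column side by side."""
    clean = pd.to_numeric(values, errors="coerce").dropna().to_numpy()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(clean, bins=20, edgecolor="black")
    ax_hist.set_xlabel(label)
    ax_hist.set_ylabel("Number of counties")
    ax_box.boxplot(clean)
    ax_box.set_ylabel(label)
    ax_box.set_xticks([])
    fig.tight_layout()
    return fig

fig = plot_distribution(demo["% DIABETIC"], "% DIABETIC")

# One row per state. Note that pandas' "skew" is bias-corrected, so it will
# differ slightly from scipy.stats.skew with its default settings.
state_summary = demo.groupby("STATE")["% DIABETIC"].agg(
    ["count", "mean", "median", "skew"]
)
print(state_summary)
```

In a notebook the figure is displayed automatically; in a script, calling `plt.show()` after `plot_distribution` would display it.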

I also intend to run hypothesis tests to check for significant differences in obesity, inactivity, and diabetes rates among counties in different states, using tests such as the t-test or analysis of variance (ANOVA).
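The planned tests could be sketched roughly as follows. The data here is synthetic, and the `STATE` column name and the state labels are assumptions; the example uses Welch's t-test (`equal_var=False`) for two states and a one-way ANOVA for all three, which is one reasonable choice rather than a settled plan.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic county-level rates for three hypothetical states. The group
# means are deliberately different so the tests have something to detect.
rng = np.random.default_rng(1)
state_means = {"State A": 8.0, "State B": 10.0, "State C": 12.0}
demo = pd.DataFrame({
    "STATE": np.repeat(list(state_means), 100),
    "% DIABETIC": np.concatenate(
        [rng.normal(mean, 1.5, 100) for mean in state_means.values()]
    ),
})

# Collect each state's counties into its own array, dropping missing values.
groups = {
    state: frame["% DIABETIC"].dropna().to_numpy()
    for state, frame in demo.groupby("STATE")
}

# Two states: Welch's t-test, which does not assume equal variances.
t_stat, t_p = stats.ttest_ind(groups["State A"], groups["State B"], equal_var=False)

# Three or more states: one-way ANOVA across all groups.
f_stat, anova_p = stats.f_oneway(*groups.values())

print("t-test p-value (State A vs State B):", t_p)
print("ANOVA p-value (all states):", anova_p)
```

A small p-value suggests that the state means differ, but ANOVA alone does not say which states differ from which; a follow-up pairwise comparison would be needed for that.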

Code:

# Notebook shell command; openpyxl is required by pd.read_excel for .xlsx files
!pip install pandas openpyxl scipy

import pandas as pd

file_path = 'D:\\Juice Wrld\\University\\Subjects\\Fall 2023\\MTH 522 – Mathematical Statistics\\a1.xlsx'

# Each health measure lives on its own sheet of the workbook
df = pd.read_excel(file_path, sheet_name='Diabetes')
df1 = pd.read_excel(file_path, sheet_name='Obesity')
df2 = pd.read_excel(file_path, sheet_name='Inactivity')

import numpy as np

# Select each percentage column and coerce non-numeric entries to NaN
selected_column_diabetes = df['% DIABETIC']
selected_column_diabetes = pd.to_numeric(selected_column_diabetes, errors='coerce')

selected_column_obesity = df1['% OBESE']
selected_column_obesity = pd.to_numeric(selected_column_obesity, errors='coerce')

selected_column_inactivity = df2['% INACTIVE']
selected_column_inactivity = pd.to_numeric(selected_column_inactivity, errors='coerce')

from scipy import stats

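# Mean, ignoring NaN values
mean_diabetes = np.nanmean(selected_column_diabetes)
mean_obesity = np.nanmean(selected_column_obesity)
mean_inactivity = np.nanmean(selected_column_inactivity)

# Median, ignoring NaN values
median_diabetes = np.nanmedian(selected_column_diabetes)
median_obesity = np.nanmedian(selected_column_obesity)
median_inactivity = np.nanmedian(selected_column_inactivity)

# Skewness, omitting NaN values
skewness_diabetes = stats.skew(selected_column_diabetes, nan_policy='omit')
skewness_obesity = stats.skew(selected_column_obesity, nan_policy='omit')
skewness_inactivity = stats.skew(selected_column_inactivity, nan_policy='omit')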

# Print the results
print("Mean of % Diabetes:", mean_diabetes)
print("Median of % Diabetes:", median_diabetes)
print("Skewness of % Diabetes:", skewness_diabetes)

print("Mean of % Obesity:", mean_obesity)
print("Median of % Obesity:", median_obesity)
print("Skewness of % Obesity:", skewness_obesity)

print("Mean of % Inactivity:", mean_inactivity)
print("Median of % Inactivity:", median_inactivity)
print("Skewness of % Inactivity:", skewness_inactivity)

Output:

Mean of % Diabetes: 8.714891599294615
Median of % Diabetes: 8.4
Skewness of % Diabetes: 0.919133846559114
Mean of % Obesity: 18.1678463101346
Median of % Obesity: 18.3
Skewness of % Obesity: -7.104751234514205
Mean of % Inactivity: 16.52061469099719
Median of % Inactivity: 16.7
Skewness of % Inactivity: -0.9638674028443353

