Friday – September 29, 2023.

Based on my preliminary analysis, I have concluded that it is not feasible to perform time series modeling on a dataset covering only one year. Time series models require sufficient historical data to learn the underlying trends and patterns; with a single year of observations, a model would be unable to generate accurate predictions.

I also attempted geospatial analysis, since the dataset contains county and state information. However, my code failed to execute because the dataset does not include a geometry column, which geospatial analysis requires to locate each data point in space.
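
One way to work around this would be to attach geometries from a public county boundary file and join on the FIPS code. Here is a minimal sketch, assuming geopandas, the Census cartographic boundary shapefile, and a 'FIPS' column in the merged data (the file names, merge column, and plotted variable are illustrative, not my exact ones):

```python
import geopandas as gpd
import pandas as pd

# Census county boundaries supply the geometry column the dataset lacks
counties = gpd.read_file("cb_2018_us_county_500k.shp")

# Project data keyed by county FIPS code (file/column names are assumptions)
df = pd.read_excel("merged_data.xlsx")
df["GEOID"] = df["FIPS"].astype(int).astype(str).str.zfill(5)  # match 5-digit Census GEOIDs

# Merging on GEOID attaches a geometry to each county, enabling a choropleth
gdf = counties.merge(df, on="GEOID")
gdf.plot(column="% DIABETIC", legend=True)
```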

Finally, I tried to use ensemble methods, such as random forests, to gain insight into feature importance and the relationships between the predictor variables and the outcome. However, ensemble methods appear unsuitable for a dataset this small, as they are prone to overfitting when observations are limited.
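
For reference, a minimal sketch of the attempt, assuming the merged data with '% OBESE' and '% INACTIVE' predicting '% DIABETIC' (the file name is illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

merged = pd.read_excel("merged_data.xlsx")  # merged county-level data (name assumed)

X = merged[["% OBESE", "% INACTIVE"]]
y = merged["% DIABETIC"]

# With only a few hundred rows, a forest this flexible can overfit,
# which is the concern noted above
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

# Impurity-based importances rank each predictor's contribution
for name, importance in zip(X.columns, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```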

Overall, I have made significant progress in exploring different modeling techniques for the dataset. However, I need to address the challenges described above before I can finalize the modeling techniques and start writing the first report draft.


Wednesday – September 27, 2023.

Today, I met with all my teammates for our project, and I’m happy to say we’re almost done with it. I have completed the linear regression model analysis and checked how well it works using a wide array of technical measures: p-values, R-squared, confidence intervals, cross-validation, and collinearity checks.

Now, I’m figuring out how to use time series analysis to predict what might happen in the future, but I’m still working on getting the code right for that part. Meanwhile, my team members are working on their own models. Eventually, we’ll put everything together and start writing our first draft on Friday.


Monday – September 25, 2023.

In today’s analysis, I refined my code and effectively addressed all previously encountered code issues. I had earlier developed a linear regression model; in the present analysis, I carried out a comprehensive evaluation of it across three distinct scenarios (Models A, B, and C), computing the following key statistical metrics (a condensed sketch of these computations follows the list):

  1. Calculation of p-values to assess the significance of individual predictors.
  2. Estimation of confidence intervals to gauge the precision of regression coefficient estimates.
  3. Computation of the coefficient of determination (R-squared) to measure the explained variance in the dependent variable.
  4. Execution of cross-validation procedures to assess model performance and generalizability.
  5. Investigation of collinearity among predictor variables to identify potential multicollinearity issues.
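
Here is a condensed sketch of the five computations, assuming the merged data with '% Diabetes' and '% Obesity' predicting '% Inactivity' (file and column names are illustrative, not my exact ones):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_excel("merged_data.xlsx")  # merged data (file name assumed)
df = df.dropna(subset=["% Diabetes", "% Obesity", "% Inactivity"])

X = sm.add_constant(df[["% Diabetes", "% Obesity"]])
y = df["% Inactivity"]

model = sm.OLS(y, X).fit()
print(model.pvalues)               # 1. p-values for each coefficient
print(model.conf_int(alpha=0.05))  # 2. 95% confidence intervals
print(model.rsquared)              # 3. coefficient of determination

# 4. 5-fold cross-validated R-squared
scores = cross_val_score(LinearRegression(), X.drop(columns="const"), y,
                         cv=5, scoring="r2")
print(scores.mean())

# 5. variance inflation factors to flag multicollinearity
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)
```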

# Final Analysis of Linear Regression Model

Let’s compare the three models (A, B, and C) based on various statistics and provide a detailed analysis:

| Statistic | Model A | Model B | Model C |
| --- | --- | --- | --- |
| Predictors | % Diabetes, % Obesity | % Inactivity, % Obesity | % Inactivity, % Diabetes |
| VIF (const) | 325.88 | 318.05 | 120.67 |
| VIF (each predictor) | 1.18, 1.18 | 1.29, 1.29 | 1.47, 1.47 |
| Mean R-squared | 0.125 | 0.155 | 0.093 |
| Intercept | -0.158 | 1.654 | 12.794 |
| Coefficients | 0.957, 0.445 | 0.232, 0.111 | 0.247, 0.254 |
| 95% CI (first predictor) | [0.769, 1.145] | [0.187, 0.278] | [0.173, 0.321] |
| 95% CI (second predictor) | [0.312, 0.578] | [0.043, 0.180] | [0.097, 0.410] |
| F-statistic | 115.2 | 90.71 | 57.04 |
| Prob (F-statistic) | 3.51e-39 | 1.76e-32 | 3.54e-22 |

Analysis and Comparison:

VIF Values:

Model A has a very high VIF value for the constant (const), indicating potential multicollinearity with other variables in the model.
Model B and Model C also have high VIF values for the constant but lower than in Model A. These models include different sets of independent variables.
R-squared:

Model B has the highest mean R-squared (0.155), indicating that it explains the most variation in its dependent variable (% Diabetes).
Model C has the lowest mean R-squared (0.093).
Model A falls in between with a mean R-squared of 0.125.
Intercept and Coefficients:

The intercept values differ significantly between models: Model A’s is close to zero (-0.158), Model B’s is 1.654, and Model C’s is considerably higher (12.794).
The coefficients also vary between models, and their interpretations depend on the specific variables used in each model.
Confidence Intervals:

Confidence intervals indicate whether coefficients are statistically significant. In all three models, the reported coefficient intervals exclude zero, making each predictor a statistically significant contributor.
F-statistic:

Model A has the highest F-statistic (115.2), indicating strong overall model significance.
Model B has a lower F-statistic (90.71), but it is still highly significant.
Model C has the lowest F-statistic (57.04), which is also statistically significant but relatively lower than the other models.
Multicollinearity:

All three models exhibit multicollinearity to some extent, with high VIF values for the constant term in each case.
Model B and Model C include % Inactivity as an independent variable, which may contribute to multicollinearity in these models.
Relationships:

Model B appears to perform the best in terms of R-squared and overall model significance.
Model A has a particularly high VIF value for the constant, which indicates a potential issue with multicollinearity.
Model C has a moderate R-squared and F-statistic but also includes % Inactivity as an independent variable.

Friday – September 22, 2023.

In today’s work, I successfully updated my code and constructed linear regression models for all three categories:

  1. Inactivity and Obesity predicting Diabetes.
  2. Inactivity and Diabetes predicting Obesity.
  3. Obesity and Diabetes predicting Inactivity.

However, I am currently facing an issue with calculating confidence intervals and p-values for these linear regression models. I have been troubleshooting this problem but have not yet found a solution. My goal is to refine my code and proceed with the analysis.

For the analysis of my linear regression models, I plan to follow these steps:

  1. Calculating the p-values: Resolve the issue with calculating p-values to determine the significance of each coefficient in the models.
  2. Calculating confidence intervals: Once the p-values are successfully calculated, estimate confidence intervals for the coefficients to understand the range of potential values.
  3. Using metrics like R-squared: Evaluate the goodness-of-fit of the models using metrics like R-squared to measure how well the models explain the variation in the dependent variable.
  4. Performing cross-validation: Implement cross-validation techniques to assess the models’ generalization performance and identify potential overfitting.
  5. Finding collinearity: Detect and handle multicollinearity among independent variables to ensure the models’ stability and interpretability.

I’m actively working on resolving the issue with p-values and confidence intervals and progressing with the analysis of these linear regression models.

Wednesday – September 20, 2023.

In today’s analysis, we concentrated on building a linear regression model to examine the ‘% Obesity’ data by utilizing ‘% Diabetes’ and ‘% Inactivity’ data. Furthermore, we also built a linear regression model to examine the ‘% Diabetes’ data, considering the influence of ‘% Obesity’ and ‘% Inactivity’ data.

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. Linear regression aims to find the best-fitting line through the data points. This line can then be used to make predictions about the dependent variable given the values of the independent variables.

The equation for a linear regression model is as follows:

y = mx + b

Where:

  • y is the dependent variable
  • x is the independent variable
  • m is the slope of the line
  • b is the y-intercept of the line

The slope of the line tells us how much the dependent variable changes for every one-unit change in the independent variable. The y-intercept of the line tells us the value of the dependent variable when the independent variable is equal to zero.

The code performs the following steps (a condensed sketch of the full pipeline appears after the list). First, it imports the necessary libraries:

    • pandas: A library for data manipulation and analysis.
    • numpy: A library for scientific computing.
    • sklearn: A library for machine learning.
    • matplotlib: A library for data visualization.
  1. Loads the data from an Excel file. The file path is specified by the variable file_path.
  2. Removes rows with missing values in the dependent variable, the variable we want to predict: ‘% Obesity’ in the first model and ‘% Diabetes’ in the second.
  3. Defines the independent variables and the dependent variable. The independent variables are those used to predict the dependent variable: ‘% Diabetes’ and ‘% Inactivity’ in the first model, and ‘% Obesity’ and ‘% Inactivity’ in the second.
  4. Creates a linear regression model. This is done using the LinearRegression() class from the sklearn library.
  5. Fits the model to the data. This is done using the fit() method of the LinearRegression class.
  6. Prints the intercept and coefficients. The intercept is the value of the predicted dependent variable when all independent variables equal zero. The coefficients are the values that multiply the independent variables in the linear regression equation.
  7. Makes predictions using the model. This is done using the predict() method of the LinearRegression class.
  8. Plots the actual vs. predicted values. This is done using the matplotlib library.
  9. Calculates the regression line. This is done using the predict() method of the LinearRegression class.
  10. Adds the regression line to the plot. This is done using the plot() method of the matplotlib library.
  11. Displays the legend. This is done using the legend() method of the matplotlib library.
  12. Shows the plot. This is done using the show() method of the matplotlib library.
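
Putting the steps together, here is a condensed sketch of the first model, assuming the column names used elsewhere in this report (the file name is illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

file_path = "a1.xlsx"                                 # 1. load the data
df = pd.read_excel(file_path)
df = df.dropna(subset=["% Obesity"])                  # 2. drop rows missing the response

X = df[["% Diabetes", "% Inactivity"]]                # 3. independent variables
y = df["% Obesity"]                                   #    dependent variable

model = LinearRegression()                            # 4. create the model
model.fit(X, y)                                       # 5. fit it to the data

print(model.intercept_, model.coef_)                  # 6. intercept and coefficients
y_pred = model.predict(X)                             # 7. predictions

plt.scatter(y, y_pred, label="Actual vs. predicted")  # 8. actual vs. predicted
plt.plot([y.min(), y.max()], [y.min(), y.max()],
         "r--", label="Reference line")               # 9-10. line added to the plot
plt.legend()                                          # 11. legend
plt.show()                                            # 12. display the plot
```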

Monday – September 18, 2023.

Import Necessary Libraries: The code imports essential libraries for data handling, analysis, and plotting, including pandas, numpy, scikit-learn, and matplotlib.

Load Data: It retrieves data from an Excel file located at a specified file path on my laptop.

Data Cleaning: The code ensures data cleanliness by removing rows with missing values (NaN) in the “Inactivity” column.

Data Setup: After cleaning, the data is split into two parts:

Independent variables (X): These are features that might affect “Inactivity,” like “% Diabetes” and “% Obesity.”

Dependent variable (y): This is the variable we want to predict, which is “Inactivity.”

Linear Regression Model: The code constructs a linear regression model, which is a mathematical formula that finds a link between independent variables (diabetes and obesity percentages) and the dependent variable (inactivity percentage).

Model Training: The model is trained on the data to learn how changes in independent variables influence the dependent variable. It identifies the best-fit line that minimizes the difference between predicted and actual “Inactivity” percentages.
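
Concretely, the “best-fit line” here is the ordinary least-squares solution: the intercept and slopes are chosen to minimize the sum of squared differences between actual and predicted values (a standard formulation, using the slope/intercept notation from the September 20 entry):

$$\min_{b,\,m_1,\,m_2} \; \sum_{i=1}^{n} \left( y_i - (b + m_1 x_{i1} + m_2 x_{i2}) \right)^2$$

where $y_i$ is the observed “% Inactivity” for county $i$, and $x_{i1}$ and $x_{i2}$ are its “% Diabetes” and “% Obesity” values.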

Print Results: The code displays the outcomes of the linear regression analysis, including the intercept (where the line crosses the Y-axis) and coefficients (slopes for each independent variable). These values help interpret the relationship between the variables.

Make Predictions: Using the trained model, the code predicts “Inactivity” percentages based on new values of independent variables (diabetes and obesity percentages).

Plot Results: To visualize the model’s performance, a scatter plot is created. It compares actual “Inactivity” percentages (X-axis) with predicted percentages (Y-axis). A well-fitted model will have points closely aligned with a diagonal line.

In summary, this code loads, cleans, and prepares data, trains a linear regression model to understand relationships, and visualizes the model’s predictions, all aimed at explaining “Inactivity” percentages based on diabetes and obesity percentages.


Friday – September 15, 2023.

I performed a correlation analysis on three datasets: diabetes, obesity, and inactivity. The study revealed strong correlations among all three, with the FIPS code serving as the common key linking the datasets. Recognizing the need for a more comprehensive analysis, I merged the three datasets into a single Excel spreadsheet for a more holistic examination.

I wrote code to combine the three datasets and found 356 data points in common. I then cleaned the Excel sheet, which involved addressing redundant columns containing information on county, state, and year. To enhance data clarity, I removed these columns. Additionally, I improved the dataset’s readability by renaming specific columns and adjusting column widths to facilitate data visualization.

Next, I focused on a geographical analysis, explicitly counting the number of counties within each state. I found that Texas has 138 counties in the dataset, while several states have only one county entry, making them statistically less reliable for meaningful analysis. This skews the data heavily toward certain states. For example, Wyoming has only one county entry, so any analysis of Wyoming would be skewed toward that one county and would not give the general view of the entire state, which is our objective.
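
For reference, a minimal sketch of the merge and the per-state county count, assuming the sheet names from the code at the end of this report plus shared 'FIPS' and 'STATE' columns (column names are assumptions):

```python
import pandas as pd

file_path = "a1.xlsx"
diabetes = pd.read_excel(file_path, sheet_name="Diabetes")
obesity = pd.read_excel(file_path, sheet_name="Obesity")
inactivity = pd.read_excel(file_path, sheet_name="Inactivity")

# Inner joins keep only the FIPS codes present in all three sheets
merged = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
print(len(merged))  # 356 counties in common on this data

# Counties per state, to spot thinly represented states like Wyoming
print(merged["STATE"].value_counts())
```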


Wednesday – September 13, 2023.

During my work today, I conducted an in-depth analysis of the 2018 health data from the CDC. Using Python, I calculated statistical measures, specifically the standard deviation and kurtosis, for the variables representing the percentages of diabetic cases, obesity rates, and inactivity levels. A notable challenge was the presence of missing (“NaN”) values in the dataset, so I dedicated a significant portion of my effort to data cleaning and preparation to ensure the accuracy of subsequent statistical analyses.

Regrettably, I observed a disparity between the kurtosis values I obtained and those outlined in the reference material provided by our professor. I intend to bring it to the attention of our instructor and teaching assistants during our upcoming class session.
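
One possible source of the mismatch is the definition of kurtosis itself: scipy reports Fisher (excess) kurtosis by default, which is exactly 3 lower than the Pearson definition some references use. A minimal sketch of both, assuming the sheet and column names from the code at the end of this report:

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_excel("a1.xlsx", sheet_name="Diabetes")
col = pd.to_numeric(df["% DIABETIC"], errors="coerce")  # coerce bad entries to NaN

std_dev = np.nanstd(col, ddof=1)                      # sample standard deviation
fisher_kurt = stats.kurtosis(col, nan_policy="omit")  # excess (Fisher) kurtosis, scipy's default
pearson_kurt = stats.kurtosis(col, fisher=False, nan_policy="omit")  # Pearson = Fisher + 3
print(std_dev, fisher_kurt, pearson_kurt)
```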

I have attached a PDF of the code I wrote today for you to look over.

Looking forward, my immediate objective involves the calculation of p-values. I plan to implement a t-test, a statistical method that will facilitate hypothesis testing and aid in making informed conclusions about the dataset.


Monday – September 11, 2023.

I conducted a comprehensive descriptive statistics analysis, focusing on key metrics such as the mean, median, and skewness, using the Python programming language. During the process, I encountered a significant challenge posed by non-numeric values and numerous occurrences of “NaN” (Not-a-Number) values within the dataset. To ensure the accuracy of my statistical computations, I implemented data preprocessing steps to eliminate these non-numeric and “NaN” entries.

Subsequently, I have presented my observations and results below, along with the corresponding code utilized in this analysis.

My next step is constructing visual representations such as histograms and box plots, which are instrumental in visually capturing data distributions. Furthermore, I aim to segment the data based on geographical attributes, potentially grouping it by county or state. This analysis will enable a more granular examination of each subgroup’s characteristics.

Furthermore, I intend to conduct hypothesis testing to find significant differences in obesity, inactivity, and diabetes rates among counties within different states. To achieve this, I plan to apply statistical tests such as the t-test or analysis of variance (ANOVA).
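
A minimal sketch of both tests, assuming a combined table with 'STATE' and '% DIABETIC' columns (the file name, column names, and state pair are illustrative):

```python
import pandas as pd
from scipy import stats

merged = pd.read_excel("merged_data.xlsx")  # combined data (file name assumed)

# One-way ANOVA: do mean diabetes rates differ across states?
groups = [g["% DIABETIC"].dropna().values
          for _, g in merged.groupby("STATE")
          if len(g) > 1]  # single-county states are skipped
f_stat, p_anova = stats.f_oneway(*groups)
print("ANOVA p-value:", p_anova)

# Welch two-sample t-test comparing two specific states (hypothetical pair)
a = merged.loc[merged["STATE"] == "TX", "% DIABETIC"].dropna()
b = merged.loc[merged["STATE"] == "FL", "% DIABETIC"].dropna()
t_stat, p_ttest = stats.ttest_ind(a, b, equal_var=False)
print("t-test p-value:", p_ttest)
```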

Code:

!pip install pandas

import pandas as pd
import numpy as np
from scipy import stats

# Load the three sheets from the workbook
file_path = 'D:\\Juice Wrld\\University\\Subjects\\Fall 2023\\MTH 522 – Mathematical Statistics\\a1.xlsx'
df = pd.read_excel(file_path, sheet_name='Diabetes')
df1 = pd.read_excel(file_path, sheet_name='Obesity')
df2 = pd.read_excel(file_path, sheet_name='Inactivity')

# Coerce each column of interest to numeric; non-numeric entries become NaN
selected_column_diabetes = pd.to_numeric(df['% DIABETIC'], errors='coerce')
selected_column_obesity = pd.to_numeric(df1['% OBESE'], errors='coerce')
selected_column_inactivity = pd.to_numeric(df2['% INACTIVE'], errors='coerce')

# Means, ignoring NaNs
mean_diabetes = np.nanmean(selected_column_diabetes)
mean_obesity = np.nanmean(selected_column_obesity)
mean_inactivity = np.nanmean(selected_column_inactivity)

# Medians, ignoring NaNs
median_diabetes = np.nanmedian(selected_column_diabetes)
median_obesity = np.nanmedian(selected_column_obesity)
median_inactivity = np.nanmedian(selected_column_inactivity)

# Skewness, omitting NaNs
skewness_diabetes = stats.skew(selected_column_diabetes, nan_policy='omit')
skewness_obesity = stats.skew(selected_column_obesity, nan_policy='omit')
skewness_inactivity = stats.skew(selected_column_inactivity, nan_policy='omit')

# Print the results
print("Mean of % Diabetes:", mean_diabetes)
print("Median of % Diabetes:", median_diabetes)
print("Skewness of % Diabetes:", skewness_diabetes)

print("Mean of % Obesity:", mean_obesity)
print("Median of % Obesity:", median_obesity)
print("Skewness of % Obesity:", skewness_obesity)

print("Mean of % Inactivity:", mean_inactivity)
print("Median of % Inactivity:", median_inactivity)
print("Skewness of % Inactivity:", skewness_inactivity)

OUTPUT:

Mean of % Diabetes: 8.714891599294615
Median of % Diabetes: 8.4
Skewness of % Diabetes: 0.919133846559114
Mean of % Obesity: 18.1678463101346
Median of % Obesity: 18.3
Skewness of % Obesity: -7.104751234514205
Mean of % Inactivity: 16.52061469099719
Median of % Inactivity: 16.7
Skewness of % Inactivity: -0.9638674028443353