Sunday – December 10, 2023.

This is Project 3 for MTH 522 at the University of Massachusetts Dartmouth.

Project Title:

Analysis of Boston Crime Incident Data: Exploring Crime Patterns and Trends 

The provided dataset has been thoroughly examined and comprehensively reported in the project document.

The contribution report has been added to the final page of the report.

Report-PRJ3

Friday – December 8, 2023.

In my analysis of a crime dataset, I initially identified the top three streets with the highest number of shootings, including “WASHINGTON ST,” “BOYLSTON ST,” and “BLUE HILL AVE,” along with the most prevalent offenses in these areas. I then determined the most common time for shootings, finding that incidents were most frequent in June, on Saturdays, and at midnight. Further investigation into UCR categories revealed that “Part Three” crimes were predominant, with variations in top offenses across UCR parts. Examining streets associated with UCR parts, “WASHINGTON ST” consistently appeared prominently. Additionally, I explored district-level data, highlighting the districts with the highest occurrences for different UCR parts. Finally, I identified the top five streets with the most diverse range of crimes, such as “CENTRE ST” and “WASHINGTON ST,” and visualized the findings through insightful bar graphs. Overall, the analysis provided a comprehensive understanding of the dataset’s crime patterns, street occurrences, and UCR categories.

Wednesday – December 6, 2023.

In this Python code, I use the pandas library to analyze an Excel dataset containing information about offenses. I read the data into a DataFrame and then clean it by excluding rows where either the ‘OFFENSE_CODE_GROUP’ or ‘STREET’ column contains integers, as well as dropping any missing values in these columns. Next, I group the cleaned data by street, counting the unique types of crimes for each location and sorting the results in descending order. I print the output, which displays street names and the corresponding counts of unique offenses, from the highest count to the lowest. Additionally, I identify and print the top 5 offense categories based on their frequency in the dataset.

Code:

import pandas as pd

# Read the data from an Excel file
df = pd.read_excel(r'D:\General\UMass Dartmouth\Subjects\Fall 2023 – MTH 522 – Mathematical Statistics\Project 3\customdataset.xlsx')

# Remove rows where either ‘OFFENSE_CODE_GROUP’ or ‘STREET’ contains integers
# Also, drop rows with missing values in ‘OFFENSE_CODE_GROUP’ or ‘STREET’ columns
df_cleaned = df[
    ~df['OFFENSE_CODE_GROUP'].apply(lambda x: isinstance(x, (int, float))) &
    ~df['STREET'].apply(lambda x: isinstance(x, (int, float)))
].dropna(subset=['OFFENSE_CODE_GROUP', 'STREET'])

# Group by street and count unique types of crimes
result = df_cleaned.groupby('STREET')['OFFENSE_CODE_GROUP'].nunique().sort_values(ascending=False)

# Optionally, reset the index if desired
# result = result.reset_index()

# Print the result, including the highest to the lowest offenses
print(result.to_frame().reset_index().to_string(index=False))

# Get the top 5 offense categories
top5_offenses = df_cleaned['OFFENSE_CODE_GROUP'].value_counts().nlargest(5)

# Print the top 5 offense categories
print("\nTop 5 Offense Categories:")
print(top5_offenses)

Monday – December 4, 2023.

So, for the final project, I have decided to work on this dataset: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system

These are the steps I will be following for the analysis (a minimal sketch of a couple of them appears after the list):

  1. Variety of Crimes in Different Areas:
    1. Group the data by street and analyze the count of unique types of crimes on each street.
    2. Visualize the results using bar charts or other appropriate plots.
  2. Most Common Crime Types, Time, and Day on Specific Streets:
    1. Filter the data for each street and analyze the most common crime types, days, and hours.
    2. Use bar charts, pie charts, or heatmaps for visualization.
  3. Rise in Certain Crimes in Specific Areas:
    1. Perform a temporal analysis to identify trends in specific types of crimes over time.
    2. Use line charts or other time series visualizations.
  4. Common Crimes Rising Over Time:
    1. Analyze the overall trend of common crimes over the entire dataset.
    2. Consider creating a time series plot to visualize the changes.
  5. Common Neighborhoods with Crime:
    1. Group the data by neighborhood to identify areas with higher crime rates.
    2. Visualize the results using maps or bar charts.
  6. Time Analysis:
    1. Analyze the data based on time factors such as month, day of the week, and hour.
    2. Identify patterns and trends over time using appropriate visualizations.
  7. Map Chart Visualization:
    1. Utilize the latitude and longitude information to create a map chart.
    2. Color-code or size-code data points based on the frequency of crimes in each location.
  8. Correlation Analysis:
    1. Use statistical methods to identify correlations between different variables (e.g., time, day, month) and types of crimes.
    2. Visualize correlations using correlation matrices or scatter plots.
  9. Shooting Data Analysis:
    1. Analyze shooting data separately, identifying patterns, and correlations with other variables.
    2. Visualize shooting incidents on a map and explore temporal patterns.
  10. Predictive Models:
    1. Depending on the nature of your dataset, you can build predictive models to forecast future crime incidents or classify incidents into different categories.
    2. Common algorithms include decision trees, random forests, or neural networks.
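To make the plan concrete, here is a minimal pandas sketch of steps 1 and 6. It is not the final analysis code: the file name and the column names (STREET, OFFENSE_CODE_GROUP, OCCURRED_ON_DATE) are assumptions based on the dataset’s documentation and may need adjusting.

import pandas as pd
import matplotlib.pyplot as plt

# Sketch only: file name and column names are assumptions.
df = pd.read_csv('crime_incident_reports.csv')

# Step 1: variety of crimes on each street
variety_by_street = (
    df.dropna(subset=['STREET', 'OFFENSE_CODE_GROUP'])
      .groupby('STREET')['OFFENSE_CODE_GROUP']
      .nunique()
      .sort_values(ascending=False)
)
variety_by_street.head(10).plot(kind='bar', title='Unique offense types per street (top 10)')
plt.show()

# Step 6: time analysis by month, day of week, and hour
df['OCCURRED_ON_DATE'] = pd.to_datetime(df['OCCURRED_ON_DATE'], errors='coerce')
print(df['OCCURRED_ON_DATE'].dt.month.value_counts().sort_index())
print(df['OCCURRED_ON_DATE'].dt.day_name().value_counts())
print(df['OCCURRED_ON_DATE'].dt.hour.value_counts().sort_index())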


Friday – December 1, 2023.

Geospatial Analysis of Violations

A geospatial analysis of the dataset can offer valuable insights into the distribution of health violations across different locations. By leveraging the latitude and longitude information provided for each establishment, a map can be created to visualize the concentration of violations in specific geographical areas. This analysis could help identify clusters of non-compliant establishments or areas with consistently high or low compliance rates. Furthermore, overlaying demographic or economic data onto the map may reveal correlations between the socio-economic context of an area and the adherence to health and safety standards by food establishments. Geospatial tools and visualizations, such as heatmaps or choropleth maps, can be employed for a comprehensive representation of the spatial distribution of violations.
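As a rough illustration, a heatmap of violation locations could be built along these lines; the folium library, file name, and latitude/longitude column names are assumptions on my part.

import pandas as pd
import folium
from folium.plugins import HeatMap

# Sketch only: file name and column names are assumptions.
df = pd.read_csv('food_establishment_inspections.csv').dropna(subset=['latitude', 'longitude'])

m = folium.Map(location=[42.3601, -71.0589], zoom_start=12)  # centered on Boston
HeatMap(df[['latitude', 'longitude']].values.tolist()).add_to(m)
m.save('violation_heatmap.html')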

Wednesday – November 29, 2023.

My second approach to analyzing this data is:

Temporal Analysis of Violations

Another insightful approach to analyzing the dataset is to conduct a temporal analysis of the recorded violations. This involves exploring how the frequency and nature of violations change over time. By grouping the data based on inspection dates, trends in compliance and non-compliance can be identified. For example, one could investigate whether there are specific months or seasons when certain types of violations are more prevalent. Additionally, examining the time lapse between consecutive inspections for each establishment can provide insights into the effectiveness of corrective actions taken by businesses. Utilizing line charts or heatmaps can be effective in visualizing temporal patterns in violation occurrences.
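A rough pandas sketch of this kind of monthly grouping might look like the following; the file name and the inspection-date column name are assumptions.

import pandas as pd
import matplotlib.pyplot as plt

# Sketch only: file name and the inspection-date column name are assumptions.
df = pd.read_csv('food_establishment_inspections.csv')
df['RESULTDTTM'] = pd.to_datetime(df['RESULTDTTM'], errors='coerce')

monthly_violations = df.dropna(subset=['RESULTDTTM']).set_index('RESULTDTTM').resample('M').size()
monthly_violations.plot(kind='line', title='Recorded violations per month')
plt.ylabel('Number of recorded violations')
plt.show()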

Monday – November 27, 2023.

This week I am looking to do analysis on this dataset:

https://data.boston.gov/dataset/active-food-establishment-licenses

Data Analysis Approach 1: Overview of Inspection Results

In the provided dataset containing information about various food establishments, particularly focusing on restaurants, a comprehensive analysis can be conducted to gain insights into their compliance with health and safety standards. The dataset includes details such as business name, license information, inspection outcomes, and specific violations noted during inspections. One approach to analyzing this data is to generate an overall overview of the inspection results for each establishment. This could involve calculating the percentage of inspections that resulted in a pass, fail, or other status. Additionally, identifying patterns in the types of violations recorded and their frequency across different establishments can provide valuable information. Visualizations such as pie charts or bar graphs can be employed to effectively communicate the distribution of inspection outcomes and the most common violations.


Friday – November 24, 2023.

My final analysis for this data is:

Business Growth and Collaboration Analysis

To support business growth, understanding key factors such as business size, service offerings, and collaborative opportunities is crucial. Analyzing businesses like “IMMAD, LLC” in Forensic Science or “Sparkle Clean Boston LLC” in Clean-tech/Green-tech reveals specific niches that may have growth potential. Implementing targeted marketing and innovation in these niches can be strategic for expansion.

Moreover, identifying businesses open to collaboration can foster a mutually beneficial environment. For instance, “Boston Property Buyers” and “Presidential Properties” both operate in Real Estate. Recognizing such connections can lead to collaborative ventures, shared resources, and a stronger market presence.

Finally, businesses with no digital presence or incomplete information, like “Not yet” and “N/A,” present opportunities for improvement. Implementing digital strategies, such as creating a website or optimizing contact information, can enhance visibility and accessibility, contributing to overall business success.

Wednesday – November 22, 2023.

Continuing with the same data, I extended my analysis.

Digital Presence and Communication Analysis

The dataset includes businesses’ online presence through websites, email addresses, and phone numbers. Analyzing the online landscape is crucial for understanding the modern business environment. For instance, businesses like “Boston Chinatown Tours” and “Interactive Construction Inc.” have websites, providing opportunities for digital marketing, customer engagement, and e-commerce. Evaluating the effectiveness of these online platforms and optimizing them for user experience can enhance business visibility and customer interaction.

Furthermore, analyzing contact information such as email addresses and phone numbers is vital for communication strategies. “Eye Adore Threading” and “Alexis Frobin Acupuncture” have multiple contact points, ensuring accessibility for potential clients. Utilizing data-driven communication strategies, such as email marketing or SMS campaigns, can enhance customer engagement and retention.

The “Other Information” field, specifying if a business is “Minority-owned” or “Immigrant-owned,” can influence marketing narratives. Highlighting these aspects in digital communication can resonate positively with diverse audiences, fostering a sense of community and inclusivity.


Monday – November 20, 2023.

Today, I started looking at a new set of data, which can be found here: https://data.boston.gov/dataset/women-owned-businesses

Business Type and Location Analysis

In this dataset, businesses’ key attributes include Business Name, Business Type, Physical Location/Address, Business Zipcode, Business Website, Business Phone Number, Business Email, and Other Information. The initial step in data analysis involves categorizing businesses based on their types. This classification facilitates a comprehensive understanding of the diverse industries present. For instance, businesses like “Advocacy for Special Kids, LLC” and “HAI Analytics” fall under the Education category, while “Alexis Frobin Acupuncture” and “Eye Adore Threading” belong to the Healthcare sector. “CravenRaven Boutique” and “All Fit Alteration” represent the Retail industry, showcasing a variety of business types.

Next, examining the geographical distribution of businesses is essential. The physical locations and zip codes reveal clusters of businesses within specific regions, offering insights into the economic landscape of different areas. Businesses such as “Boston Sports Leagues” and “All Things Visual” in the 2116 zip code highlight concentrations of services in that region. Understanding the spatial distribution enables targeted marketing and resource allocation for business growth.

Additionally, analyzing the “Other Information” field, which includes details like “Minority-owned” and “Immigrant-owned,” provides valuable socio-economic insights. This information aids in identifying businesses contributing to diversity and inclusivity within the entrepreneurial landscape. Focusing on supporting minority and immigrant-owned businesses could be a strategic approach for community development and economic empowerment.

Friday – November 17, 2023.

Today I looked at the data for “Hyde Park”. In order to analyze the provided data for Hyde Park across different decades, several data analysis techniques can be employed. Firstly, a temporal trend analysis can be conducted to observe population changes over time, identifying peaks and troughs in each demographic category. Age distribution patterns can be explored through bar charts, highlighting shifts in the population structure. Additionally, educational attainment trends can be visualized using pie charts or bar graphs to understand changes in the level of education within the community. The nativity and race/ethnicity data can be further examined using percentage distribution analysis to track variations in the composition of the population. Labor force participation rates, divided by gender, can be visualized to discern patterns in workforce dynamics. Housing tenure analysis, using pie charts or bar graphs, can reveal shifts in the proportion of owner-occupied and renter-occupied units, providing insights into housing trends. Overall, a combination of graphical representation and statistical measures would facilitate a comprehensive understanding of the demographic, educational, labor, and housing dynamics in Hyde Park over the specified decades.

Wednesday – November 15, 2023.

Today I looked at the second sheet, “Back Bay,” of the Excel file from https://data.boston.gov/dataset/neighborhood-demographics

The dataset on Back Bay offers insights into the neighborhood’s evolution across different decades, allowing for a comprehensive analysis of various demographic aspects. Notable patterns include population fluctuations, with a decline until 1990 followed by relative stability. Age distribution highlights shifts in the percentage of residents across different age groups, particularly a substantial increase in the 20-34 age bracket from 32% in 1950 to 54% in 1980. Educational attainment displays changing proportions of individuals with varying levels of education, notably showcasing a significant rise in those with a Bachelor’s Degree or Higher from 20% in 1950 to 81% in 2010. Nativity data reveals fluctuations in the percentage of foreign-born residents, while the race/ethnicity distribution indicates a decrease in the white population and a rise in the Asian/PI category. Labor force participation demonstrates gender-based variations, and housing tenure data underscores changes in the ratio of owner-occupied to renter-occupied units. Collectively, this dataset provides a nuanced understanding of the socio-demographic landscape in Back Bay over the decades.

Monday – November 13, 2023

I am currently examining the dataset on Analyze Boston, specifically focusing on the “Allston” sheet within the “neighborhoodsummaryclean_1950-2010” Excel file, which is available at https://data.boston.gov/dataset/neighborhood-demographics. The dataset provides a comprehensive overview of demographic and socioeconomic trends in Allston spanning several decades. Notably, there is evident population growth from 1950 to 2010. The age distribution data reveals intriguing patterns, including shifts in the percentage of residents across various age groups over the years. Educational attainment data reflects changes in the population’s education levels, notably showcasing a significant increase in the percentage of individuals holding a Bachelor’s degree or higher. The nativity data sheds light on the proportion of foreign-born residents, indicating shifts in immigration patterns. Changes in the racial and ethnic composition are apparent, with a declining percentage of White residents and an increase in Asian/PI residents. The labor force participation data by gender is noteworthy, illustrating fluctuations in male and female employment rates. Housing tenure data suggests a rise in the number of renter-occupied units over the years. Potential data analysis avenues may involve exploring correlations between demographic shifts, educational attainment, and housing tenure to gain deeper insights into the socio-economic dynamics of Allston.

Sunday – November 12, 2023.

This is Project 2 for MTH 522 at the University of Massachusetts Dartmouth.

Project Title:

Analysis of Fatal Police Shootings in the United States Using Washington Post Data 

The provided dataset has been thoroughly examined and comprehensively reported in the project document.

The contribution report has been added to the final page of the report.

Project 2


Friday – November 10, 2023.

In today’s analysis, I loaded police shooting data from an Excel file into a Pandas DataFrame and aimed to investigate the distribution of justified and unjustified use of force by police across different racial groups, focusing on both male and female incidents. To achieve this, I defined a function to determine whether force was justified based on threat types and weapons involved. I then applied this function to the dataset, creating a new column indicating the justification of force. Subsequently, I filtered the data to include only incidents involving Black, White, Hispanic, and Asian individuals. After separating the data by gender, I calculated the occurrences and percentages of ‘False’ justified force cases for each race. Using Seaborn and Matplotlib, I created bar plots to visually represent these percentages for both male and female incidents. The analysis provides insights into potential disparities in the perceived justification of police force across different racial groups and genders, as visualized in the generated bar plots.
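A simplified sketch of this workflow is shown below. The rule used here to label force as justified is only a placeholder assumption (the actual rule lives in my project code), and the file path and category values are assumptions as well.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sketch only: the justification rule, file path, and category values are assumptions.
df = pd.read_excel(r'path\to\fatal-police-shootings-data.xlsx')

def force_justified(row):
    # Placeholder rule: threatening behavior plus a gun or knife
    threatening = str(row.get('threat_type', '')).lower() in {'shoot', 'point', 'attack'}
    armed = str(row.get('armed_with', '')).lower() in {'gun', 'knife'}
    return threatening and armed

df['force_justified'] = df.apply(force_justified, axis=1).astype(bool)
subset = df[df['race'].isin(['B', 'W', 'H', 'A'])]

# Percentage of incidents labeled as not justified, by gender and race
pct_not_justified = (
    subset.groupby(['gender', 'race'])['force_justified']
          .apply(lambda s: 100 * (~s).mean())
          .reset_index(name='pct_not_justified')
)

sns.barplot(data=pct_not_justified, x='race', y='pct_not_justified', hue='gender')
plt.ylabel('% of incidents labeled not justified')
plt.title('Incidents not labeled as justified, by race and gender')
plt.show()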

Wednesday – November 8, 2023.

In today’s analysis, I wrote code to perform text analysis on specific columns of an Excel dataset and count the frequencies of words in those columns. Here’s a step-by-step explanation of the code:

  1. Import the necessary libraries:
    • import pandas as pd: Imports the Pandas library and assigns it the alias ‘pd’ for working with data.
    • from collections import Counter: Imports the Counter class from the collections module, which is used to count the frequency of words.
  2. Define the column names you want to analyze:
    • columns_to_analyze: A list containing the names of the columns you want to analyze for word frequencies. In this code, the columns specified are ‘threat_type’, ‘flee_status’, ‘armed_with’, and ‘body_camera.’
  3. Specify the file path to your Excel document:
    • directory_path: Specifies the file path to the Excel file you want to analyze. Make sure to update this path to your Excel file’s location.
  4. Load your data into a DataFrame:
    • df = pd.read_excel(directory_path): Reads the data from the Excel file specified by ‘directory_path’ into a Pandas DataFrame named ‘df.’
  5. Initialize a dictionary to store word counts for each column:
    • word_counts = {}: Creates an empty dictionary named ‘word_counts’ to store the word counts for each specified column.
  6. Iterate through the specified columns:
    • The code uses a for loop to go through each column specified in the columns_to_analyze list.
  7. Retrieve and preprocess the data from the column:
    • column_data = df[column_name].astype(str): Retrieves the data from the current column, converts it to strings to ensure consistent data type, and stores it in the ‘column_data’ variable.
  8. Tokenize the text and count the frequency of each word:
    • The code tokenizes the text within each column using the following steps:
      • words = ' '.join(column_data).split(): Joins all the text in the column into a single string, then splits it into individual words. This step prepares the data for word frequency counting.
      • word_counts[column_name] = Counter(words): Uses the Counter class to count the frequency of each word in the ‘words’ list and stores the results in the ‘word_counts’ dictionary under the column name as the key.
  9. Print the words and their frequencies for each column:
    • The code iterates through the ‘word_counts’ dictionary and prints the word frequencies for each column. It displays the column name, followed by the individual words and their counts for that column.

The code provides a word frequency analysis for the specified columns in your dataset, making it easier to understand the distribution of words in those columns. This can be useful for identifying common terms or patterns in the data.
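For reference, a sketch reconstructed from the steps above looks like this; the file path is a placeholder that should point to the actual Excel file.

import pandas as pd
from collections import Counter

# Columns to analyze for word frequencies
columns_to_analyze = ['threat_type', 'flee_status', 'armed_with', 'body_camera']

# Placeholder path: update to the Excel file's actual location
directory_path = r'path\to\fatal-police-shootings-data.xlsx'

df = pd.read_excel(directory_path)

word_counts = {}
for column_name in columns_to_analyze:
    # Convert the column to strings, join into one text blob, and split into words
    column_data = df[column_name].astype(str)
    words = ' '.join(column_data).split()
    word_counts[column_name] = Counter(words)

# Print the words and their frequencies for each column
for column_name, counts in word_counts.items():
    print(f"\nWord frequencies in '{column_name}':")
    for word, count in counts.items():
        print(f"{word}: {count}")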

Monday – November 6, 2023.

  1. Import the necessary libraries:
    • import pandas as pd: Imports the Pandas library and assigns it the alias ‘pd.’
    • import matplotlib.pyplot as plt: Imports the Matplotlib library and assigns it the alias ‘plt,’ which will be used to create plots and visualizations.
  2. Load the Excel file into a DataFrame:
    • directory_path: Specifies the file path to the Excel file you want to load. You should update this path to your Excel file’s location.
    • sheet_name: Specifies the name of the sheet within the Excel file from which data should be read.
    • df = pd.read_excel(directory_path, sheet_name=sheet_name): Reads the data from the Excel file into a Pandas DataFrame named ‘df.’
  3. Drop rows with missing ‘race,’ ‘age,’ or ‘gender’ values:
    • df = df.dropna(subset=['race', 'age', 'gender']): Removes rows from the DataFrame where any of these three columns (race, age, gender) have missing values.
  4. Create age groups:
    • age_bins: Defines the boundaries for age groups, similar to the previous code snippet.
    • age_labels: Provides labels for each age group, corresponding to ‘age_bins.’
  5. Cut the age data into age groups for each race category:
    • df['Age Group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels): Creates a new column ‘Age Group’ in the DataFrame by categorizing individuals’ ages into the age groups defined in ‘age_bins’ and labeling them with ‘age_labels.’
  6. Count the number of individuals in each age group by race and gender:
    • age_group_counts_by_race_gender = df.groupby(['race', 'gender', 'Age Group'])['name'].count().unstack().fillna(0): Groups the data by race, gender, and age group, and then counts the number of individuals in each combination. The ‘unstack()’ function reshapes the data to make it more suitable for visualization, and ‘fillna(0)’ fills missing values with 0.
  7. Calculate the median age for each race and gender combination:
    • median_age_by_race_gender = df.groupby(['race', 'gender'])['age'].median(): Groups the data by race and gender and calculates the median age for each combination.
  8. Print the median age for each race and gender combination:
    • print("Median Age by Race and Gender:"): Prints a header.
    • print(median_age_by_race_gender): Prints the calculated median age for each race and gender combination.
  9. Create grouped bar charts for different genders:
    • The code iterates over unique gender values in the DataFrame and creates separate bar charts for each gender.
    • For each gender:
      • Subset the DataFrame to include only data for that gender.
      • Create a grouped bar chart, displaying the number of individuals in different age groups for each race-gender combination.
      • Set various plot properties such as the title, labels, legend, and rotation of x-axis labels.
      • Display the plot using plt.show().

This code generates grouped bar charts that visualize the distribution of individuals in different age groups for each race-gender combination, helping to analyze the age distribution within these subgroups.

The output is:

Median Age by Race and Gender:
race  gender
A     female    47.0
      male      34.0
B     female    31.0
      male      31.0
B;H   male      27.0
H     female    31.0
      male      33.0
N     female    32.0
      male      31.5
O     female    24.5
      male      36.0
W     female    39.0
      male      38.0
Name: age, dtype: float64
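For reference, a sketch reconstructed from the steps above might look like the following; the file path, sheet selection, and age-bin boundaries are assumptions.

import pandas as pd
import matplotlib.pyplot as plt

# Sketch only: path, sheet selection, and age-bin boundaries are assumptions.
directory_path = r'path\to\fatal-police-shootings-data.xlsx'
df = pd.read_excel(directory_path, sheet_name=0)

# Drop rows with missing race, age, or gender
df = df.dropna(subset=['race', 'age', 'gender'])

age_bins = [0, 18, 30, 45, 60, 100]                      # assumed boundaries
age_labels = ['0-17', '18-29', '30-44', '45-59', '60+']  # assumed labels
df['Age Group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)

# Count individuals in each age group by race and gender
age_group_counts_by_race_gender = (
    df.groupby(['race', 'gender', 'Age Group'])['name'].count().unstack().fillna(0)
)

# Median age for each race and gender combination
median_age_by_race_gender = df.groupby(['race', 'gender'])['age'].median()
print("Median Age by Race and Gender:")
print(median_age_by_race_gender)

# One grouped bar chart per gender
for gender in df['gender'].unique():
    counts_for_gender = age_group_counts_by_race_gender.xs(gender, level='gender')
    counts_for_gender.plot(kind='bar', figsize=(10, 6))
    plt.title(f'Age group distribution by race ({gender})')
    plt.xlabel('Race')
    plt.ylabel('Number of individuals')
    plt.xticks(rotation=45)
    plt.show()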

Friday – November 3, 2023.

Today I worked on a Python script that uses the Pandas library to load data from an Excel file, perform some data analysis on the age distribution of individuals, and then create a bar graph to visualize the distribution of individuals in different age groups. Here’s a step-by-step explanation of the code:

  1. Import the necessary libraries:
    • import pandas as pd: Imports the Pandas library and assigns it the alias ‘pd.’
    • import matplotlib.pyplot as plt: Imports the Matplotlib library, specifically the ‘pyplot’ module, and assigns it the alias ‘plt.’ Matplotlib is used for creating plots and visualizations.
  2. Load the Excel file into a DataFrame:
    • directory_path: Specifies the file path to the Excel file you want to load. Make sure to update this path to the location of your Excel file.
    • sheet_name: Specifies the name of the sheet within the Excel file from which data should be read.
    • df = pd.read_excel(directory_path, sheet_name=sheet_name): Uses the pd.read_excel function to read the data from the Excel file into a Pandas DataFrame named ‘df.’
  3. Calculate the median age of all individuals:
    • median_age = df['age'].median(): Calculates the median age of all individuals in the ‘age’ column of the DataFrame and stores it in the ‘median_age’ variable.
    • print("Median Age of All Individuals:", median_age): Prints the calculated median age to the console.
  4. Create age groups:
    • age_bins: Defines the boundaries for age groups. In this case, individuals will be grouped into the specified age ranges.
    • age_labels: Provides labels for each age group, corresponding to the ‘age_bins.’
  5. Cut the age data into age groups:
    • df['Age Group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels): Creates a new column ‘Age Group’ in the DataFrame by categorizing individuals’ ages into the age groups defined in ‘age_bins’ and labeling them with ‘age_labels.’
  6. Count the number of individuals in each age group:
    • age_group_counts = df['Age Group'].value_counts().sort_index(): Counts the number of individuals in each age group and sorts them by the age group labels. The result is stored in the ‘age_group_counts’ variable.
  7. Create a bar graph to analyze age groups:
    • plt.figure(figsize=(10, 6)): Sets the size of the figure for the upcoming plot.
    • age_group_counts.plot(kind='bar', color='skyblue'): Plots a bar graph using the ‘age_group_counts’ data, where each bar represents an age group. ‘skyblue’ is the color of the bars.
    • plt.title('Age Group Analysis'): Sets the title of the plot.
    • plt.xlabel('Age Group'): Sets the label for the x-axis.
    • plt.ylabel('Number of Individuals'): Sets the label for the y-axis.
    • plt.xticks(rotation=45): Rotates the x-axis labels by 45 degrees for better readability.
    • plt.show(): Displays the bar graph on the screen.

After running this code, you will get a bar graph showing the distribution of individuals in different age groups based on the data from the Excel file.
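A reconstructed sketch of these steps follows; again, the file path, sheet selection, and age-bin boundaries are assumptions.

import pandas as pd
import matplotlib.pyplot as plt

# Sketch only: path, sheet selection, and age-bin boundaries are assumptions.
directory_path = r'path\to\fatal-police-shootings-data.xlsx'
df = pd.read_excel(directory_path, sheet_name=0)

# Median age of all individuals
median_age = df['age'].median()
print("Median Age of All Individuals:", median_age)

# Cut ages into groups (boundaries and labels assumed)
age_bins = [0, 18, 30, 45, 60, 100]
age_labels = ['0-17', '18-29', '30-44', '45-59', '60+']
df['Age Group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)

# Count individuals per age group and plot
age_group_counts = df['Age Group'].value_counts().sort_index()

plt.figure(figsize=(10, 6))
age_group_counts.plot(kind='bar', color='skyblue')
plt.title('Age Group Analysis')
plt.xlabel('Age Group')
plt.ylabel('Number of Individuals')
plt.xticks(rotation=45)
plt.show()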

Wednesday – November 1, 2023.

Today, I wrote a code in Python and used the pandas and collections libraries to analyze data from an Excel file. Here’s a simple explanation of what it does:

  1. It starts by importing two libraries: “pandas” (commonly used for data analysis) and “Counter” from “collections” (used for counting elements in a list).
  2. The code specifies the names of the columns you want to analyze from an Excel file. These columns include information like “threat_type,” “flee_status,” “armed_with,” and others.
  3. It sets the file path to the location of your Excel document. You need to replace this path with the actual path to your Excel file.
  4. The code uses “pd.read_excel” to load the data from the Excel file into a DataFrame (a table-like structure for data).
  5. It initializes a dictionary called “word_counts” to store word frequencies for each of the specified columns.
  6. The code then goes through each of the specified columns one by one. For each column:
    • It retrieves the data from that column and converts it to strings to ensure uniform data type.
    • It breaks the text into individual words (tokenizes it) and counts how many times each word appears in that column.
    • These word counts are stored in the “word_counts” dictionary under the column’s name.
    • Finally, the code prints the words and their frequencies for each of the specified columns. It goes through the “word_counts” dictionary and displays the words and how many times they appear in each column.

In summary, this code reads data from an Excel file, tokenizes the text in specific columns, and counts the frequency of each word in those columns. It then prints out the word frequencies for each column, which can be useful for understanding the data in those columns.

Monday – October 30, 2023.

My group and I have divided the following tasks among ourselves.

Akshit:

  1. Data Collection: Coordinate with Gary to obtain location data from police stations. This initial step involves working with Gary to gather geographical information on police stations, including their latitude and longitude coordinates. Accurate location data is crucial for subsequent analysis.
  2. Distance Calculation: Once we have the police station coordinates, the next step is to calculate the distances between these police stations (see the haversine sketch after this list). This step is essential for understanding law enforcement’s spatial distribution and coverage in the area under consideration.
  3. Demographic Analysis: To gain a deeper understanding of the dataset, we will analyze data related to race, age, and shooting incidents. Our goal is to determine which areas experience the highest frequency of shootings. This analysis will help identify any potential hotspots.
  4. Proximity Analysis: Investigate how far shooting incidents occur from the police stations. This analysis will shed light on response times and potential areas where increased law enforcement presence may be required.
  5. Data Segmentation: To develop and validate our analysis, we will segment the data into training and testing datasets. Considering population distribution in this process is crucial to ensure our models are representative and can make accurate predictions or classifications.
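For the distance-calculation step, a small helper like the following could be used once the station coordinates are available. This is a generic haversine sketch rather than project code, and the example coordinates are arbitrary.

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points, in miles
    earth_radius_miles = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# Example with arbitrary coordinates (downtown Boston to Worcester), roughly 40 miles
print(haversine_miles(42.3601, -71.0589, 42.2626, -71.8023))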

Parag:

  1. Combination Analysis: In parallel with Akshit’s work, I will conduct a combination analysis. This involves considering variables such as “armed_with” and “flee_status” alongside other relevant factors from the dataset. The goal is to identify potential patterns or correlations among these variables and their impact on shooting incidents.
  2. Summary Statistics: I will generate basic summary statistics to gain initial insights into the dataset. These statistics will provide an overview of the data, including measures like means, medians, and standard deviations for critical variables. This step will help us identify trends and outliers.
  3. ANOVA Test: To assess the impact of different variables on the data, I will perform an analysis of variance (ANOVA) test. This statistical test will help us understand if significant differences exist between groups or categories within the dataset, particularly when considering factors like age, race, or other relevant variables.
  4. Grouping and Trend Analysis: I will group the data by age and race to identify trends and patterns in the analysis. This step aims to uncover any disparities or patterns related to age and race concerning shooting incidents. It can help inform potential policy recommendations or interventions.

Combining Akshit’s geographic and demographic analysis with Parag’s statistical and variable-focused analysis, we aim to comprehensively understand the factors contributing to shooting incidents, their locations, and potential strategies for improving public safety in the areas under investigation.


Friday – October 27, 2023.

Today’s work involved the development of a Python script for the analysis of an Excel dataset. The primary objective was to count distinct words within specified columns of the dataset. The process commenced with the importation of essential libraries, such as Pandas for data manipulation and the Counter class for word frequency calculations. To make the analysis adaptable, a list was used to specify the columns to be analyzed, and the file path to the Excel document was provided. Subsequently, the data from the Excel file was loaded into a Pandas DataFrame for further processing. To keep track of word counts, an empty dictionary was initialized. The code then iterated through the specified columns, extracting and converting data into strings. The textual content within each column was tokenized into words, and the frequency of each word was meticulously counted and stored within the dictionary. The final step involved printing the word counts for each column, presenting the column name along with the unique words and their corresponding frequencies. This code serves as a versatile tool for text analysis within targeted columns of an Excel dataset, delivering a well-structured and comprehensive output for further analytical insights.

Monday – October 23, 2023.

Presently, my focus is centered on the comprehensive analysis of crime and statistical data. I am actively engaged in an endeavor to discern the potential impact of an individual’s environment on their propensity to engage in criminal activities. This multifaceted examination involves delving into various aspects of the environment, including socio-economic factors, living conditions, and community dynamics, all in a bid to gain insights into the root causes of criminal behavior.

Simultaneously, I am conducting a meticulous study of race-related data to unveil crucial patterns and trends in policing and criminal interactions. My aim is to shed light on which racial groups are disproportionately affected by incidents of being shot by law enforcement, as well as to understand the factors contributing to such occurrences. Furthermore, I am exploring instances where individuals from various racial backgrounds might be more likely to respond with force when encountering the police, which could potentially offer insights into the reasons behind the disproportionate number of shootings involving certain racial groups. This holistic analysis is pivotal in unraveling the complex dynamics of law enforcement interactions and aims to provide a deeper understanding of why certain racial groups face a higher likelihood of being shot by the police, thus contributing to the broader discourse on social justice and equity.

Friday – October 20, 2023.

Upon careful scrutiny of the dataset, it becomes evident that a prevalent issue pertains to missing data, presenting a significant challenge to our analytical endeavors. These gaps manifest diversely, with some cells conspicuously devoid of information while others bear the “not_available” label, complicating our analysis. To confront this issue, we’ve discerned a range of approaches and methodologies for managing these data gaps. These include the deletion of rows or columns when the missing values are minimal and randomly distributed, the practice of imputation, encompassing techniques such as mean, median, or mode imputation, linear regression imputation, interpolation, K-Nearest Neighbors (KNN), and the advanced Multiple Imputation by Chained Equations (MICE). For categorical data, we consider the creation of a distinct category for missing values labeled as “Unknown” or “N/A.” In select cases, the omission of imputation and the treatment of missing data as a unique category within our analysis may prove insightful. Furthermore, for intricate analyses, the employment of advanced statistical techniques like Expectation-Maximization (EM) algorithms or structural equation modeling may become indispensable for effectively handling missing data. To prevent the recurrence of missing or erroneous data in future entries, the establishment of data validation rules in tools such as Excel serves as a proactive measure to maintain data quality and integrity. By integrating these strategies, we can not only address the immediate issues related to missing data but also enhance the overall reliability and robustness of our data analysis efforts.

Wednesday – October 18, 2023.

In today’s analysis, we aimed to address the following question: “Population-Based Analysis – Calculating the number of police shootings per 100,000 people in different areas and exploring whether population size influences police shootings.”

To approach this, we first gather population data on a county-by-county basis. Then, we will determine the total number of people shot by the police in each county, allowing us to identify the counties with the highest incidence of police shootings in the United States.

Additionally, we have identified several other key questions to investigate:

  1. Impact of Crime Rates: Assess the correlation between crime rates and the occurrence of police shootings, delving into how crime rates influence these incidents.
  2. Types of Crimes: Identify the crimes that most frequently lead to police shootings.
  3. Mental Illness Prediction: Explore whether incidents of police shootings are associated with cases involving individuals with mental illness.
  4. Race Bias Investigation: Examine the racial backgrounds of the victims to investigate whether there is any racial bias in police shootings.
  5. State-Level Analysis: Determine which state has the highest number of police shootings and separately identify states with the highest rates of homicides and petty crimes.
  6. Racial Bias in Shootings: Analyze whether there is evidence of racial bias in police shootings, focusing on the victims’ race.
  7. Police Training Duration: Investigate whether the duration of police training impacts the frequency of police shootings.
  8. Gender Impact Analysis: Determine the gender most frequently involved in police shootings and explore the factors contributing to this trend.


Monday – October 16, 2023.

After reviewing the dataset, I’ve observed that missing data is one of the major issues. Some cells are empty, while others are labeled as “not_available.”

I’ve identified common methods for handling missing data:

  1. Delete Rows or Columns: This approach is suitable when missing values are few and randomly distributed, with minimal impact on the analysis.
  2. Impute Missing Values: Imputation involves replacing missing values with estimated or predicted values. Common methods include:
    • Mean, Median, or Mode imputation: Replacing missing values with the respective column’s mean, median, or mode.
    • Linear Regression imputation: Using other variables to predict and fill in missing values.
    • Interpolation: Estimating missing values based on neighboring data points, especially for time-series data.
    • K-Nearest Neighbors (KNN): Replacing missing values with values from similar rows based on other variables.
    • MICE (Multiple Imputation by Chained Equations): An advanced method that considers relationships between variables.
  3. Categorize Missing Values: Creating a new category for missing values, such as “Unknown” or “N/A,” can be meaningful for categorical data.
  4. Don’t Impute and Treat as a Separate Category: In some cases, missing data may represent a meaningful category, and it’s better not to impute but treat it as a distinct category in the analysis.
  5. Use Advanced Statistical Techniques: For complex analyses, advanced methods like Expectation-Maximization (EM) algorithms or structural equation modeling may be necessary to handle missing data.
  6. Data Validation Rules: Setting up data validation rules in Excel can help prevent the entry of missing or invalid data in future entries.

I will consult with the professor and teaching assistants to determine the most appropriate approach for this dataset.
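For illustration, a minimal pandas sketch of a few of these options is shown below; the column names are assumptions, and the “not_available” label is treated as missing.

import pandas as pd

# Sketch only: column names are assumptions.
df = pd.read_csv('fatal-police-shootings-data.csv')
df = df.replace('not_available', pd.NA)                 # treat the sentinel label as missing

df['age'] = pd.to_numeric(df['age'], errors='coerce')   # coerce non-numeric entries to NaN
df['age'] = df['age'].fillna(df['age'].median())        # numeric column: median imputation
df['race'] = df['race'].fillna('Unknown')               # categorical column: separate "Unknown" category
df = df.dropna(subset=['latitude', 'longitude'])        # drop rows that cannot be mapped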

For dataset 1 (fatal-police-shootings-data), I started by calculating basic summary statistics:

  • For latitude: The median latitude is 36.08117181, the lowest latitude is 19.4975033, the highest latitude is 71.3012553, and the standard deviation is 5.346323915.
  • For longitude: The median longitude is 36.08117181, the lowest longitude is -9.00718E+15, the highest longitude is -67.8671657, and the standard deviation is 1.02104E+14.
  • For age: The median age is 35, the lowest age is 2, the highest age is 92, and the standard deviation is 12.99.

These statistics suggest potential outliers in the data, such as individuals as young as 2 or as old as 92 involved in police shootings. This could indicate possible instances of misfiring or accidental mistakes by the police.

The most common agency associated with the highest number of police shootings is agency 38, which corresponds to the “Los Angeles Police Department.” LAPD had the highest number of police shootings, totaling 129 incidents.

Wednesday – October 11, 2023.

Project 2: Initial Post

In Project 2, I’m working with two datasets. The first dataset, “Death Record Data,” is stored in the file /v2/fatal-police-shootings-data.csv. The second dataset, “Police Agencies Data,” can be found in /v2/fatal-police-shootings-agencies.csv. This dataset contains information about police agencies involved in at least one fatal police shooting since 2015.

Dataset 1 (“fatal-police-shootings-data”):

  • Description: This dataset comprises 19 columns and 8770 rows, covering the period from January 2, 2015, to October 7, 2023. Several columns have missing values, including “threat_type,” “flee_status,” “armed_with,” “city,” “county,” “latitude,” “longitude,” “location_precision,” “name,” “age,” “gender,” “race,” and “race_source.”
  • Columns:
    • “threat_type” indicates different threat levels during encounters, such as “point,” “move,” “attack,” and “shoot.”
    • “flee_status” indicates whether the individual attempted to flee.
    • “armed_with” specifies the type of weapon or item the individual had.
    • Location data includes city, county, state, latitude, and longitude, facilitating geographical analysis.
    • Demographic details like name, age, gender, and race are provided.
    • The “mental_illness_related” column indicates if mental illness was a factor in the incident.
    • “body_camera” signifies whether law enforcement officers had active body cameras during the encounter.
    • “agency_ids” may represent the law enforcement agencies involved in these incidents.

Dataset 2 (“fatal-police-shootings-agencies”):

  • Description: This dataset includes six columns and 3322 rows. Some entries have missing values in the “oricodes” column.
  • Columns:
    • “id” serves as a unique identifier for each law enforcement agency.
    • “name” designates the name of the law enforcement agency, which can include sheriff’s offices, local police departments, state police agencies, and others.
    • The “type” column categorizes the law enforcement agency by type, such as “sheriff,” “local_police,” “state_police,” and more.
    • “state” identifies the state where the law enforcement agency is located.
    • “oricodes” contains a code or identifier associated with the law enforcement agency.
    • “total_shootings” records the total number of shootings or incidents involving the respective law enforcement agency.

Summary:

The datasets in Project 2 provide valuable information about law enforcement encounters and police agencies involved in fatal incidents. Dataset 1 focuses on individual cases, their characteristics, and the circumstances, while Dataset 2 offers insights into the law enforcement agencies, their types, locations, and their involvement in such incidents. Further analysis or specific questions about the data would require additional context and particular queries.

Sunday – October 8, 2023.

This is Project 1 for MTH 522 at the University of Massachusetts Dartmouth.

Project Title:

Unlocking Public Health: An Analysis of CDC Data on Diabetes, Obesity, and Inactivity in US Counties (2018).

The provided dataset has been thoroughly examined and comprehensively reported in the project document.

The contribution report has been added to the final page of the report.

Report for Project 1 - MTH 522

Friday – October 6, 2023.

Firstly, I conducted a geographical analysis. This analysis helped identify geographic patterns, disparities, or clusters within the dataset, shedding light on potential regional variations.

In addition to geographical analysis, I dived into predictive modeling. Specifically, I employed ridge and linear regression techniques to develop models to understand and predict key relationships within the data. Ridge regression was used to address multicollinearity and prevent overfitting, enhancing the robustness of the predictive models. Linear regression, on the other hand, provided insights into the linear relationships between variables.

Beyond model development, I thoroughly evaluated the performance of these models. This evaluation involved assessing their predictive accuracy, goodness-of-fit, and statistical significance. Through these analyses, I aimed to not only understand the dataset better but also derive actionable insights that could inform decision-making or further research in the field.
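As a simplified illustration of the ridge versus linear comparison, something along these lines could be used; the file name is a placeholder, and the column names follow the ‘% Diabetes’ / ‘% Obesity’ / ‘% Inactivity’ naming used elsewhere in this analysis.

import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Sketch only: file name is a placeholder; column names are assumed.
df = pd.read_excel('cdc_2018_county_data.xlsx').dropna(subset=['% Diabetes', '% Obesity', '% Inactivity'])
X = df[['% Obesity', '% Inactivity']]
y = df['% Diabetes']

for name, model in [('Linear', LinearRegression()), ('Ridge', Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f'{name} regression: mean cross-validated R^2 = {scores.mean():.3f}')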

Wednesday – October 4, 2023.

In today’s update, I’d like to inform you that we’re nearing the completion of our analysis. Currently, we’re consolidating everyone’s work and putting together a report. Our goal is to finish the initial draft of the report by this Friday and have it reviewed on the same day.

In our analysis, we’ve consistently calculated summary statistics. We’ve also employed various data modeling techniques, such as linear regression and logistic regression. To assess these models, we’ve used methods like cross-validation, p-values, and confidence intervals.

Monday – October 2, 2023.

I have written this draft report.

Data Preparation:

  1. Data Gathering: Collect data from various sources.
  2. Data Cleaning: Remove duplicates, handle missing values, and correct inconsistencies.
  3. Data Integration: Combine three Excel sheets into a dataset containing 354 data points.
  4. Column Naming: Rename columns for clarity and understanding.

Exploratory Data Analysis (EDA):

  1. Summary Statistics: Compute mean, median, skewness, kurtosis, standard deviation, and percentiles.
  2. Data Visualization: Generate plots and charts to visualize data and explore relationships between variables.
  3. Outlier Detection: Identify and handle outliers.
  4. Geographical Analysis: It was discovered that 138 counties in the dataset belong to a single state. Tattnall County in Georgia has the highest combined percentage of inactivity, obesity, and diabetes, totaling 47.3%.

Data Modeling:

  1. Algorithm Selection: Choose appropriate machine learning or statistical algorithms based on the problem type (classification, regression, clustering, etc.).
  2. Model Evaluation: Assess model performance using evaluation metrics such as accuracy, F1-score, and RMSE on the testing data.
  3. Hyperparameter Tuning: Optimize model hyperparameters to enhance performance.

Interpretation of Model:

  1. Feature Interpretation: Determine which features have the most significant impact on the model’s predictions.
  2. Model Explanation: Understand the rationale behind the model’s predictions.

Reporting and Visualization:

  1. Report Creation: Summarize findings, insights, and model performance in clear and concise reports.
  2. Result Visualization: Use charts, graphs, and dashboards to communicate results effectively.

Deployment & Real-world Monitoring:

  1. Model Deployment: To obtain answers, implement the model in a real-world environment.
  2. Continuous Monitoring: Monitor the model’s performance in the real world and make necessary adjustments.

Documentation:

  1. Process Documentation: Document all the steps taken during the analysis for future reference.

Feedback:

  1. Feedback Collection: Gather input from professors and teaching assistants to improve the analysis and presentation.


Friday – September 29, 2023.

Based on my preliminary analysis, I have concluded that it is not possible to perform time series analysis modeling on a dataset with only one year of data. This is because time series models require a sufficient amount of historical data to learn the underlying trends and patterns in the data. Otherwise, the model will be unable to generate accurate predictions.

I also attempted to perform geospatial analysis on the dataset, as it contains county and state information. However, my code failed to execute because the dataset does not include a geometry column. This column is required for geospatial analysis, as it specifies the spatial location of each data point.

Finally, I tried to use ensemble methods, such as random forests, to gain insights into feature importance and relationships between predictor variables and the outcome. However, ensemble methods are not suitable for small datasets, as they are prone to overfitting.

Overall, I have made significant progress in exploring different modeling techniques for the dataset. However, I need to address the challenges described above before I can finalize the modeling techniques and start writing the first report draft.

Project 1 - Progress report - Jupyter Notebook

Wednesday – September 27, 2023.

Today, I met with all my teammates for our project, and I’m happy to say we’re almost done with it. I have completed the linear regression model analysis and checked how well it works using a wide array of technical parameters, like p-values, R-squared, confidence intervals, cross-validation, and collinearity.

Now, I’m figuring out how to use time series analysis to predict what might happen in the future, but I’m still working on getting the code right for that part. Meanwhile, my team members are working on their own models. Eventually, we’ll put everything together and start writing our first draft on Friday.

Project 1 - Progress report - Jupyter Notebook

Monday – September 25, 2023.

In today’s analysis, I refined my code and effectively addressed all previously encountered code issues. I had previously developed a linear regression model; in the present analysis, I carried out a comprehensive evaluation of this model across three distinct scenarios, computing the following key statistical metrics (a short sketch of these checks appears after the list):

  1. Calculation of p-values to assess the significance of individual predictors.
  2. Estimation of confidence intervals to gauge the precision of regression coefficient estimates.
  3. Computation of the coefficient of determination (R-squared) to measure the explained variance in the dependent variable.
  4. Execution of cross-validation procedures to assess model performance and generalizability.
  5. Investigation of collinearity among predictor variables to identify potential multicollinearity issues.
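A compact sketch of these five checks for one of the models (predictors ‘% Diabetes’ and ‘% Obesity’, as in Model A below) might look like this; the file name is a placeholder.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Sketch only: file name is a placeholder; column names are assumed.
df = pd.read_excel('cdc_2018_county_data.xlsx').dropna(subset=['% Diabetes', '% Obesity', '% Inactivity'])
X = sm.add_constant(df[['% Diabetes', '% Obesity']])
y = df['% Inactivity']

ols = sm.OLS(y, X).fit()
print(ols.pvalues)        # 1. p-values for each predictor
print(ols.conf_int())     # 2. 95% confidence intervals for the coefficients
print(ols.rsquared)       # 3. coefficient of determination

# 4. cross-validated R-squared
print(cross_val_score(LinearRegression(), X.drop(columns='const'), y, cv=5, scoring='r2').mean())

# 5. variance inflation factors to check for collinearity
for i, column in enumerate(X.columns):
    print(column, variance_inflation_factor(X.values, i))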

Final Analysis of the Linear Regression Models

Let’s compare the three models (A, B, and C) based on various statistics and provide a detailed analysis:

Model A:

  • VIF values: const: 325.88, % Diabetes: 1.18, % Obesity: 1.18
  • Mean R-squared: 0.125
  • Intercept: -0.158
  • Coefficients for % Diabetes and % Obesity: 0.957 and 0.445, respectively
  • Confidence intervals for the coefficients (95%): % Diabetes: [0.769, 1.145]; % Obesity: [0.312, 0.578]
  • F-statistic: 115.2
  • Prob (F-statistic): 3.51e-39

Model B:

  • VIF values: const: 318.05, % Inactivity: 1.29, % Obesity: 1.29
  • Mean R-squared: 0.155
  • Intercept: 1.654
  • Coefficients for % Inactivity and % Obesity: 0.232 and 0.111, respectively
  • Confidence intervals for the coefficients (95%): % Inactivity: [0.187, 0.278]; % Obesity: [0.043, 0.180]
  • F-statistic: 90.71
  • Prob (F-statistic): 1.76e-32

Model C:

  • VIF values: const: 120.67, % Inactivity: 1.47, % Diabetes: 1.47
  • Mean R-squared: 0.093
  • Intercept: 12.794
  • Coefficients for % Inactivity and % Diabetes: 0.247 and 0.254, respectively
  • Confidence intervals for the coefficients (95%): % Inactivity: [0.173, 0.321]; % Diabetes: [0.097, 0.410]
  • F-statistic: 57.04
  • Prob (F-statistic): 3.54e-22

Analysis and Comparison:

VIF Values:

  • Model A has a very high VIF value for the constant (const), indicating potential multicollinearity with other variables in the model.
  • Model B and Model C also have high VIF values for the constant, though lower than in Model A. These models include different sets of independent variables.

R-squared:

  • Model B has the highest mean R-squared (0.155), indicating that it explains the most variation in its dependent variable.
  • Model C has the lowest mean R-squared (0.093).
  • Model A falls in between with a mean R-squared of 0.125.

Intercept and Coefficients:

  • The intercept values differ significantly between models. For Model A, it’s close to zero, while for Models B and C, it’s considerably higher.
  • The coefficients also vary between models, and their interpretations depend on the specific variables used in each model.

Confidence Intervals:

  • Confidence intervals for coefficients indicate whether they are statistically significant. In all models, some coefficients have confidence intervals that exclude zero, making them statistically significant predictors.

F-statistic:

  • Model A has the highest F-statistic (115.2), indicating strong overall model significance.
  • Model B has a lower F-statistic (90.71), but it is still highly significant.
  • Model C has the lowest F-statistic (57.04), which is also statistically significant but relatively lower than in the other models.

Multicollinearity:

  • All three models exhibit multicollinearity to some extent, with high VIF values for the constant term in each case.
  • Model B and Model C include % Inactivity as an independent variable, which may contribute to multicollinearity in these models.

Relationships:

  • Model B appears to perform the best in terms of R-squared and overall model significance.
  • Model A has a particularly high VIF value for the constant, which indicates a potential issue with multicollinearity.
  • Model C has the lowest R-squared and F-statistic but also includes % Inactivity as an independent variable.

Project 1 - Progress report - Jupyter Notebook

Friday – September 22, 2023.

In today’s work, I successfully updated my code and constructed linear regression models for all three categories:

  1. Inactivity vs. Obesity predicting Diabetes.
  2. Inactivity vs. Diabetes predicting Obesity.
  3. Obesity vs. Diabetes predicting Inactivity.

However, I am currently facing an issue with calculating confidence intervals and p-values for these linear regression models. I’ve been troubleshooting this problem but have not yet found a solution. My goal is to refine my code and proceed with the analysis.

For the analysis of my linear regression models, I plan to follow these steps:

  1. Calculating the p-values: Resolve the issue with calculating p-values to determine the significance of each coefficient in the models.
  2. Calculating confidence intervals: Once the p-values are successfully calculated, estimate confidence intervals for the coefficients to understand the range of potential values.
  3. Using metrics like R-squared: Evaluate the goodness-of-fit of the models using metrics like R-squared to measure how well the models explain the variation in the dependent variable.
  4. Performing cross-validation: Implement cross-validation techniques to assess the models’ generalization performance and identify potential overfitting.
  5. Finding collinearity: Detect and handle multicollinearity among independent variables to ensure the models’ stability and interpretability.

I’m actively working on resolving the issue with p-values and confidence intervals and progressing with the analysis of these linear regression models.

Project 1 - Progress report - Jupyter Notebook


Wednesday – September 20, 2023

In today’s analysis, we concentrated on building a linear regression model to examine the ‘% Obesity’ data by utilizing the ‘% Diabetes’ and ‘% Inactivity’ data. We also built a linear regression model to examine the ‘% Diabetes’ data, considering the influence of the ‘% Obesity’ and ‘% Inactivity’ data.

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. It aims to find the best-fitting line through the data points; this line can then be used to make predictions about the dependent variable given the values of the independent variables.

The equation for a linear regression model is as follows:

y = mx + b

Where:

  • y is the dependent variable
  • x is the independent variable
  • m is the slope of the line
  • b is the y-intercept of the line

The slope of the line tells us how much the dependent variable changes for every one-unit change in the independent variable. The y-intercept of the line tells us the value of the dependent variable when the independent variable is equal to zero.

Imports the necessary libraries. This includes the following:

    • pandas: A library for data manipulation and analysis.
    • numpy: A library for scientific computing.
    • sklearn: A library for machine learning.
    • matplotlib: A library for data visualization.
  1. Loads the data from an Excel file. The file path is specified by the variable file_path.
  2. Removes rows with missing values in the dependent variable. The dependent variable is the variable that we want to predict: ‘% Obesity’ in the first model and ‘% Diabetes’ in the second.
  3. Defines the independent variables and the dependent variable. The independent variables are the variables we use to predict the dependent variable: ‘% Diabetes’ and ‘% Inactivity’ when predicting ‘% Obesity’, and ‘% Obesity’ and ‘% Inactivity’ when predicting ‘% Diabetes’.
  4. Creates a linear regression model. This is done using the LinearRegression() class from the sklearn library.
  5. Fits the model to the data. This is done using the fit() method of the LinearRegression class.
  6. Prints the intercept and coefficients. The intercept is the value of the predicted dependent variable when all independent variables equal zero. The coefficients are the values that multiply the independent variables in the linear regression equation.
  7. Makes predictions using the model. This is done using the predict() method of the LinearRegression class.
  8. Plots the actual vs. predicted values. This is done using the matplotlib library.
  9. Calculates the regression line. This is done using the predict() method of the LinearRegression class.
  10. Adds the regression line to the plot. This is done using the plot() method of the matplotlib library.
  11. Displays the legend. This is done using the legend() method of the matplotlib library.
  12. Shows the plot. This is done using the show() method of the matplotlib library.
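Putting those steps together for the first model gives a sketch like the one below (the file name and column labels are assumptions; the second model is obtained by swapping the roles of ‘% Obesity’ and ‘% Diabetes’):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Step 1: load the data (hypothetical file name).
df = pd.read_excel('merged_data.xlsx')

# Step 2: drop rows with missing values in the columns used below.
df = df.dropna(subset=['% OBESE', '% DIABETIC', '% INACTIVE'])

# Step 3: independent variables and dependent variable for the first model.
X = df[['% DIABETIC', '% INACTIVE']]
y = df['% OBESE']

# Steps 4-5: create and fit the linear regression model.
model = LinearRegression()
model.fit(X, y)

# Step 6: intercept and coefficients of the fitted model.
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)

# Step 7: predictions from the fitted model.
y_pred = model.predict(X)

# Steps 8-12: actual vs. predicted plot with a reference line and legend.
plt.scatter(y, y_pred, label='Actual vs. predicted')
lims = [y.min(), y.max()]
plt.plot(lims, lims, color='red', label='Reference line (perfect fit)')
plt.xlabel('Actual % Obesity')
plt.ylabel('Predicted % Obesity')
plt.legend()
plt.show()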

Monday – September 18, 2023.

Import Necessary Libraries: The code imports essential libraries for data handling, analysis, and plotting, including pandas, numpy, scikit-learn, and matplotlib.

Load Data: It retrieves data from an Excel file located at a specified file path on my laptop.

Data Cleaning: The code ensures data cleanliness by removing rows with missing values (NaN) in the “Inactivity” column.

Data Setup: After cleaning, the data is split into two parts:

Independent variables (X): These are features that might affect “Inactivity,” like “% Diabetes” and “% Obesity.”

Dependent variable (y): This is the variable we want to predict, which is “Inactivity.”

Linear Regression Model: The code constructs a linear regression model, which is a mathematical formula that finds a link between independent variables (diabetes and obesity percentages) and the dependent variable (inactivity percentage).

Model Training: The model is trained on the data to learn how changes in independent variables influence the dependent variable. It identifies the best-fit line that minimizes the difference between predicted and actual “Inactivity” percentages.

Print Results: The code displays the outcomes of the linear regression analysis, including the intercept (where the line crosses the Y-axis) and coefficients (slopes for each independent variable). These values help interpret the relationship between the variables.

Make Predictions: Using the trained model, the code predicts “Inactivity” percentages based on new values of independent variables (diabetes and obesity percentages).

Plot Results: To visualize the model’s performance, a scatter plot is created. It compares actual “Inactivity” percentages (X-axis) with predicted percentages (Y-axis). A well-fitted model will have points closely aligned with a diagonal line.

In summary, this code loads, cleans, and prepares data, trains a linear regression model to understand relationships, and visualizes the model’s predictions, all aimed at explaining “Inactivity” percentages based on diabetes and obesity percentages.
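A minimal sketch of this workflow, mirroring the September 20 sketch above but with ‘% INACTIVE’ as the target (the file name and column labels are assumptions, and rows with missing values in the predictor columns are dropped as well so the fit does not fail):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names for the merged data.
df = pd.read_excel('merged_data.xlsx')
df = df.dropna(subset=['% INACTIVE', '% DIABETIC', '% OBESE'])

X = df[['% DIABETIC', '% OBESE']]   # independent variables
y = df['% INACTIVE']                # dependent variable

model = LinearRegression().fit(X, y)
print('Intercept:', model.intercept_)   # predicted % Inactivity when both predictors are zero
print('Coefficients:', model.coef_)     # change in % Inactivity per one-unit change in each predictor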


 

Friday – September 15, 2023.

I performed a correlation analysis on three datasets: diabetes, obesity, and inactivity. The analysis revealed a strong correlation among all three datasets, with the FIPS code serving as the common key linking them. Recognizing the need for a more comprehensive analysis, I merged these three datasets into a single Excel spreadsheet for a more holistic examination.

I wrote code to combine the three datasets (a sketch is included below) and found 356 data points in common. I then cleaned the Excel sheet, which involved removing redundant columns containing county, state, and year information. Additionally, I improved the dataset’s readability by renaming specific columns and adjusting column widths to facilitate data visualization.

Next, I focused on a geographical analysis, counting the number of counties within each state. I found that Texas contributes 138 counties, while several states in the dataset have only one county entry, making them statistically less reliable for meaningful analysis. This skews the data heavily towards certain states. For example, Wyoming has only one county entry, so any analysis of Wyoming would reflect that single county rather than the general picture of the entire state, which is our objective.
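A minimal sketch of the merge and the per-state county count (the sheet names follow the workbook used in the September 11 entry below, while the ‘FIPS’ and ‘STATE’ column names are assumptions):

import pandas as pd

# Sheet names follow the workbook used in the September 11 entry below;
# the 'FIPS' and 'STATE' column names are assumptions.
file_path = 'a1.xlsx'
diabetes = pd.read_excel(file_path, sheet_name='Diabetes')
obesity = pd.read_excel(file_path, sheet_name='Obesity')
inactivity = pd.read_excel(file_path, sheet_name='Inactivity')

# Inner joins on the FIPS code keep only counties present in all three sheets
# (356 rows in the data described above). Overlapping county/state/year columns
# from the right-hand sheets get a suffix so they can be dropped afterwards.
merged = diabetes.merge(obesity, on='FIPS', suffixes=('', '_dup'))
merged = merged.merge(inactivity, on='FIPS', suffixes=('', '_dup2'))
merged = merged.drop(columns=[c for c in merged.columns if c.endswith(('_dup', '_dup2'))])

# Count how many counties each state contributes.
print(merged['STATE'].value_counts())

# Save the combined table to a single spreadsheet for the later regression work.
merged.to_excel('merged_data.xlsx', index=False)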

docs

Wednesday – September 13, 2023.

During my work today, I conducted an in-depth analysis of the 2018 health data from the CDC. Using Python, I calculated the standard deviation and kurtosis for the variables representing the percentages of diabetic cases, obesity rates, and inactivity levels. In the process, I encountered a notable challenge in the form of missing (“NaN”) values within the dataset, so I dedicated a significant portion of my effort to data cleaning and preparation to ensure the accuracy of the subsequent statistical analyses.

Regrettably, I observed a disparity between the kurtosis values I obtained and those outlined in the reference material provided by our professor. I intend to bring it to the attention of our instructor and teaching assistants during our upcoming class session.

I have attached a PDF of the code I wrote today for you to look over.
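For reference, a brief sketch of the same computation for one column (sheet and column names follow the September 11 code below). Note that scipy’s kurtosis defaults to Fisher’s excess kurtosis, and the choice between Fisher’s and Pearson’s definitions is one possible source of the discrepancy mentioned above:

import pandas as pd
import numpy as np
from scipy import stats

# Sheet and column names follow the workbook used in the September 11 entry.
df = pd.read_excel('a1.xlsx', sheet_name='Diabetes')
diabetic = pd.to_numeric(df['% DIABETIC'], errors='coerce')

# NaN-aware sample standard deviation.
print('Std dev:', np.nanstd(diabetic, ddof=1))

# Fisher (excess) kurtosis: a normal distribution scores 0.
print('Excess kurtosis:', stats.kurtosis(diabetic, nan_policy='omit'))

# Pearson kurtosis: a normal distribution scores 3.
print('Pearson kurtosis:', stats.kurtosis(diabetic, fisher=False, nan_policy='omit'))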

Looking forward, my immediate objective involves the calculation of p-values. I plan to implement a t-test, a statistical method that will facilitate hypothesis testing and aid in making informed conclusions about the dataset.

 

Monday – September 11,2023.

I conducted a comprehensive descriptive statistics analysis, specifically focusing on key metrics such as the Mean, Median, and Skewness, using the Python programming language. During the process, I encountered a significant challenge from non-numeric values and numerous occurrences of “NaN” (Not-a-Number) values within the dataset. To ensure the accuracy of my statistical computations, I implemented data preprocessing steps to eliminate these non-numeric and “NaN” entries.

Subsequently, I have presented my observations and results below, along with the corresponding code utilized in this analysis.

My next step is constructing visual representations such as histograms and box plots, which are instrumental in visually capturing data distributions. Furthermore, I aim to segment the data based on geographical attributes, potentially grouping it by county or state. This analysis will enable a more granular examination of each subgroup’s characteristics.

Furthermore, I intend to conduct hypothesis testing to identify significant differences in obesity, inactivity, and diabetes rates among counties within different states. To achieve this, I plan to use statistical tests such as the t-test or analysis of variance (ANOVA).

Code:

!pip install pandas

import pandas as pd
import numpy as np
from scipy import stats

file_path = 'D:\\Juice Wrld\\University\\Subjects\\Fall 2023\\MTH 522 – Mathematical Statistics\\a1.xlsx'

# Load the three sheets of the 2018 CDC data.
df = pd.read_excel(file_path, sheet_name='Diabetes')
df1 = pd.read_excel(file_path, sheet_name='Obesity')
df2 = pd.read_excel(file_path, sheet_name='Inactivity')

# Coerce the columns of interest to numeric; non-numeric entries become NaN.
selected_column_diabetes = pd.to_numeric(df['% DIABETIC'], errors='coerce')
selected_column_obesity = pd.to_numeric(df1['% OBESE'], errors='coerce')
selected_column_inactivity = pd.to_numeric(df2['% INACTIVE'], errors='coerce')

# NaN-aware means and medians.
mean_diabetes = np.nanmean(selected_column_diabetes)
mean_obesity = np.nanmean(selected_column_obesity)
mean_inactivity = np.nanmean(selected_column_inactivity)

median_diabetes = np.nanmedian(selected_column_diabetes)
median_obesity = np.nanmedian(selected_column_obesity)
median_inactivity = np.nanmedian(selected_column_inactivity)

# Skewness, ignoring NaN values.
skewness_diabetes = stats.skew(selected_column_diabetes, nan_policy='omit')
skewness_obesity = stats.skew(selected_column_obesity, nan_policy='omit')
skewness_inactivity = stats.skew(selected_column_inactivity, nan_policy='omit')

# Print the results
print("Mean of % Diabetes:", mean_diabetes)
print("Median of % Diabetes:", median_diabetes)
print("Skewness of % Diabetes:", skewness_diabetes)

print("Mean of % Obesity:", mean_obesity)
print("Median of % Obesity:", median_obesity)
print("Skewness of % Obesity:", skewness_obesity)

print("Mean of % Inactivity:", mean_inactivity)
print("Median of % Inactivity:", median_inactivity)
print("Skewness of % Inactivity:", skewness_inactivity)

OUTPUT:-

Mean of % Diabetes: 8.714891599294615
Median of % Diabetes: 8.4
Skewness of % Diabetes: 0.919133846559114
Mean of % Obesity: 18.1678463101346
Median of % Obesity: 18.3
Skewness of % Obesity: -7.104751234514205
Mean of % Inactivity: 16.52061469099719
Median of % Inactivity: 16.7
Skewness of % Inactivity: -0.9638674028443353