Monday – November 1, 2023.

My group and I have decided on the following tasks and divided them among ourselves.

Akshit:

  1. Data Collection: Coordinate with Gary to obtain location data from police stations. This initial step involves working with Gary to gather geographical information on police stations, including their latitude and longitude coordinates. Accurate location data is crucial for subsequent analysis.
  2. Distance Calculation: Once we have the police station coordinates, the next step is to calculate the distances between these police stations. This step is essential for understanding law enforcement’s spatial distribution and coverage in the area under consideration. (A minimal sketch of this calculation appears after this list.)
  3. Demographic Analysis: To gain a deeper understanding of the dataset, we will analyze data related to race, age, and shooting incidents. Our goal is to determine which areas experience the highest frequency of shootings. This analysis will help identify any potential hotspots.
  4. Proximity Analysis: Investigate how far shooting incidents occur from the police stations. This analysis will shed light on response times and potential areas where increased law enforcement presence may be required.
  5. Data Segmentation: To develop and validate our analysis, we will segment the data into training and testing datasets. Considering population distribution in this process is crucial to ensure our models are representative and can make accurate predictions or classifications.
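
For the distance calculation in step 2, a minimal sketch is shown below, assuming the station coordinates end up in a pandas DataFrame with hypothetical “station,” “latitude,” and “longitude” columns; it uses the haversine formula directly rather than any particular library.

    import numpy as np
    import pandas as pd

    def haversine_miles(lat1, lon1, lat2, lon2):
        # Great-circle distance in miles between points given in decimal degrees.
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        a = (np.sin((lat2 - lat1) / 2) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        return 3958.8 * 2 * np.arcsin(np.sqrt(a))  # 3958.8 = Earth radius in miles

    # Hypothetical station table; the real coordinates will come from Gary's data.
    stations = pd.DataFrame({
        "station": ["A", "B", "C"],
        "latitude": [42.63, 41.70, 42.10],
        "longitude": [-71.32, -71.15, -70.95],
    })

    # Pairwise distance matrix between every pair of police stations.
    lat = stations["latitude"].values
    lon = stations["longitude"].values
    dist = pd.DataFrame(
        haversine_miles(lat[:, None], lon[:, None], lat[None, :], lon[None, :]),
        index=stations["station"], columns=stations["station"],
    )
    print(dist.round(1))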

Parag:

  1. Combination Analysis: In parallel with Akshit’s work, I will conduct a combination analysis. This involves considering variables such as “armed_with” and “flee_status” alongside other relevant factors from the dataset. The goal is to identify potential patterns or correlations among these variables and their impact on shooting incidents.
  2. Summary Statistics: I will generate basic summary statistics to gain initial insights into the dataset. These statistics will provide an overview of the data, including measures like means, medians, and standard deviations for critical variables. This step will help us identify trends and outliers.
  3. ANOVA Test: To assess the impact of different variables on the data, I will perform an analysis of variance (ANOVA) test. This statistical test will help us understand whether significant differences exist between groups or categories within the dataset, particularly when considering factors like age, race, or other relevant variables. (See the sketch after this list.)
  4. Grouping and Trend Analysis: I will group the data by age and race to identify trends and patterns in the analysis. This step aims to uncover any disparities or patterns related to age and race concerning shooting incidents. It can help inform potential policy recommendations or interventions.
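
For the ANOVA step, here is a minimal sketch, assuming the shootings file described later in this log has been loaded into a DataFrame and that we compare mean age across racial groups; it relies on scipy’s one-way ANOVA.

    import pandas as pd
    from scipy import stats

    # File name taken from the Project 2 description; adjust the path as needed.
    df = pd.read_csv("fatal-police-shootings-data.csv")

    # One-way ANOVA: does mean age differ significantly across racial groups?
    groups = [g["age"].dropna().values
              for _, g in df.dropna(subset=["race"]).groupby("race")]
    f_stat, p_value = stats.f_oneway(*groups)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")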

By combining Akshit’s geographic and demographic analysis with Parag’s statistical and variable-focused analysis, we aim to develop a comprehensive understanding of the factors contributing to shooting incidents, where they occur, and potential strategies for improving public safety in the areas under investigation.

 

Friday – October 27, 2023

Today’s work involved the development of a Python script for the analysis of an Excel dataset. The primary objective was to count distinct words within specified columns of the dataset. The process commenced with the importation of essential libraries, such as Pandas for data manipulation and the Counter class for word frequency calculations. To make the analysis adaptable, a list was used to specify the columns to be analyzed, and the file path to the Excel document was provided. Subsequently, the data from the Excel file was loaded into a Pandas DataFrame for further processing.

To keep track of word counts, an empty dictionary was initialized. The code then iterated through the specified columns, extracting and converting data into strings. The textual content within each column was tokenized into words, and the frequency of each word was counted and stored within the dictionary. The final step involved printing the word counts for each column, presenting the column name along with the unique words and their corresponding frequencies. This code serves as a versatile tool for text analysis within targeted columns of an Excel dataset, delivering a well-structured and comprehensive output for further analytical insights.
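
The script itself is not reproduced here, but a minimal sketch along the lines described above might look like the following; the file path and column names are placeholders.

    from collections import Counter
    import pandas as pd

    file_path = "data.xlsx"                            # placeholder Excel file path
    columns_to_analyze = ["threat_type", "armed_with"] # placeholder column list

    df = pd.read_excel(file_path)

    word_counts = {}
    for col in columns_to_analyze:
        # Convert every cell to a string, split it into words, and tally frequencies.
        words = " ".join(df[col].dropna().astype(str)).split()
        word_counts[col] = Counter(words)

    for col, counts in word_counts.items():
        print(f"\nColumn: {col}")
        for word, freq in counts.most_common():
            print(f"  {word}: {freq}")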

Monday – October 23, 2023.

Presently, my focus is centered on the comprehensive analysis of crime and statistical data. I am actively engaged in an endeavor to discern the potential impact of an individual’s environment on their propensity to engage in criminal activities. This multifaceted examination involves delving into various aspects of the environment, including socio-economic factors, living conditions, and community dynamics, all in a bid to gain insights into the root causes of criminal behavior.

Simultaneously, I am conducting a meticulous study of race-related data to unveil crucial patterns and trends in policing and criminal interactions. My aim is to shed light on which racial groups are disproportionately affected by incidents of being shot by law enforcement, as well as to understand the factors contributing to such occurrences. Furthermore, I am exploring instances where individuals from various racial backgrounds might be more likely to respond with force when encountering the police, which could potentially offer insights into the reasons behind the disproportionate number of shootings involving certain racial groups. This holistic analysis is pivotal in unraveling the complex dynamics of law enforcement interactions and aims to provide a deeper understanding of why certain racial groups face a higher likelihood of being shot by the police, thus contributing to the broader discourse on social justice and equity.

Friday – October 20, 2023

Upon careful scrutiny of the dataset, it becomes evident that a prevalent issue pertains to missing data, presenting a significant challenge to our analytical endeavors. These gaps manifest in different ways: some cells are conspicuously devoid of information, while others bear the “not_available” label, complicating our analysis.

To confront this issue, we have identified a range of approaches for managing these data gaps. These include deleting rows or columns when the missing values are minimal and randomly distributed; imputation, encompassing techniques such as mean, median, or mode imputation, linear regression imputation, interpolation, K-Nearest Neighbors (KNN), and the more advanced Multiple Imputation by Chained Equations (MICE); and, for categorical data, the creation of a distinct category for missing values labeled “Unknown” or “N/A.” In select cases, omitting imputation and treating missing data as a unique category within our analysis may itself prove insightful. Furthermore, for intricate analyses, advanced statistical techniques like Expectation-Maximization (EM) algorithms or structural equation modeling may become indispensable for effectively handling missing data.

To prevent the recurrence of missing or erroneous data in future entries, establishing data validation rules in tools such as Excel serves as a proactive measure to maintain data quality and integrity. By integrating these strategies, we can not only address the immediate issues related to missing data but also enhance the overall reliability and robustness of our data analysis efforts.

Wednesday – October 18, 2023.

In today’s analysis, we aimed to address the following question: “Population-Based Analysis – Calculating the number of police shootings per 100,000 people in different areas and exploring whether population size influences police shootings.”

To approach this, we will first gather population data on a county-by-county basis. We will then determine the total number of people shot by the police in each county, which will allow us to identify the counties with the highest incidence of police shootings in the United States.
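
A minimal sketch of that rate calculation is shown below; it assumes a hypothetical county_population.csv file with “county,” “state,” and “population” columns alongside the shootings data.

    import pandas as pd

    shootings = pd.read_csv("fatal-police-shootings-data.csv")
    population = pd.read_csv("county_population.csv")  # hypothetical population table

    # Shootings per county, then shootings per 100,000 residents.
    counts = (shootings.groupby(["state", "county"]).size()
              .reset_index(name="shootings"))
    rates = counts.merge(population, on=["state", "county"], how="left")
    rates["per_100k"] = rates["shootings"] / rates["population"] * 100_000

    print(rates.sort_values("per_100k", ascending=False).head(10))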

Additionally, we have identified several other key questions to investigate:

  1. Impact of Crime Rates: Assess the correlation between crime rates and the occurrence of police shootings, delving into how crime rates influence these incidents.
  2. Types of Crimes: Identify the crimes that most frequently lead to police shootings.
  3. Mental Illness Prediction: Explore whether incidents of police shootings are associated with cases involving individuals with mental illness.
  4. Race Bias Investigation: Examine the racial backgrounds of the victims to investigate whether there is any racial bias in police shootings.
  5. State-Level Analysis: Determine which state has the highest number of police shootings and separately identify states with the highest rates of homicides and petty crimes.
  6. Racial Bias in Shootings: Analyze whether there is evidence of racial bias in police shootings, focusing on the victims’ race.
  7. Police Training Duration: Investigate whether the duration of police training impacts the frequency of police shootings.
  8. Gender Impact Analysis: Determine the gender most frequently involved in police shootings and explore the factors contributing to this trend.

 

Monday – October 16, 2023.

After reviewing the dataset, I’ve observed that missing data is one of the major issues. Some cells are empty, while others are labeled as “not_available.”

I’ve identified common methods for handling missing data:

  1. Delete Rows or Columns: This approach is suitable when missing values are few and randomly distributed, with minimal impact on the analysis.
  2. Impute Missing Values: Imputation involves replacing missing values with estimated or predicted values (a short imputation sketch appears after this list). Common methods include:
    • Mean, Median, or Mode imputation: Replacing missing values with the respective column’s mean, median, or mode.
    • Linear Regression imputation: Using other variables to predict and fill in missing values.
    • Interpolation: Estimating missing values based on neighboring data points, especially for time-series data.
    • K-Nearest Neighbors (KNN): Replacing missing values with values from similar rows based on other variables.
    • MICE (Multiple Imputation by Chained Equations): An advanced method that considers relationships between variables.
  3. Categorize Missing Values: Creating a new category for missing values, such as “Unknown” or “N/A,” can be meaningful for categorical data.
  4. Don’t Impute and Treat as a Separate Category: In some cases, missing data may represent a meaningful category, and it’s better not to impute but treat it as a distinct category in the analysis.
  5. Use Advanced Statistical Techniques: For complex analyses, advanced methods like Expectation-Maximization (EM) algorithms or structural equation modeling may be necessary to handle missing data.
  6. Data Validation Rules: Setting up data validation rules in Excel can help prevent the entry of missing or invalid data in future entries.
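
As a starting point, here is a minimal sketch of options 2 and 3, assuming the shootings file is loaded as a DataFrame; the column names are those described in the October 11 entry.

    import pandas as pd

    df = pd.read_csv("fatal-police-shootings-data.csv")

    # Option 2: median imputation for a numeric column such as age.
    df["age"] = df["age"].fillna(df["age"].median())

    # Option 3: treat missing values in a categorical column as their own category.
    df["flee_status"] = df["flee_status"].replace("not_available", pd.NA)
    df["flee_status"] = df["flee_status"].fillna("Unknown")

    print(df[["age", "flee_status"]].isna().sum())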

I will consult with the professor and teaching assistants to determine the most appropriate approach for this dataset.

For dataset 1 (fatal-police-shootings-data), I started by calculating basic summary statistics:

  • For latitude: The median latitude is 36.08117181, the lowest latitude is 19.4975033, the highest latitude is 71.3012553, and the standard deviation is 5.346323915.
  • For longitude: The median longitude is 36.08117181, the lowest longitude is -9.00718E+15, the highest longitude is -67.8671657, and the standard deviation is 1.02104E+14.
  • For age: The median age is 35, the lowest age is 2, the highest age is 92, and the standard deviation is 12.99.

These statistics suggest potential outliers in the data, such as individuals as young as 2 or as old as 92 involved in police shootings, which could point to accidental shootings or other unusual circumstances. In addition, the longitude minimum of -9.00718E+15 is not a valid coordinate, and the correspondingly enormous standard deviation indicates erroneous location entries that will need to be cleaned before any geographic analysis.

The agency associated with the highest number of police shootings is agency 38, which corresponds to the Los Angeles Police Department (LAPD), with 129 incidents.
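
These figures came from straightforward pandas summaries; a minimal sketch of the same calculations is below (the agency lookup assumes the two files described in the October 11 entry, and note that “agency_ids” can list more than one agency per incident).

    import pandas as pd

    df = pd.read_csv("fatal-police-shootings-data.csv")
    agencies = pd.read_csv("fatal-police-shootings-agencies.csv")

    # Median, minimum, maximum, and standard deviation for the numeric columns above.
    print(df[["latitude", "longitude", "age"]].agg(["median", "min", "max", "std"]))

    # Count incidents per agency id; this simple count treats each listed
    # combination of ids as a single key.
    counts = df["agency_ids"].value_counts()
    top_id, top_count = counts.index[0], counts.iloc[0]
    top_name = agencies.loc[agencies["id"].astype(str) == str(top_id), "name"].squeeze()
    print(top_id, top_name, top_count)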

Wednesday – October 11, 2023.

Project 2: Initial Post

In Project 2, I’m working with two datasets. The first dataset, “Death Record Data,” is stored in the file /v2/fatal-police-shootings-data.csv. The second dataset, “Police Agencies Data,” can be found in /v2/fatal-police-shootings-agencies.csv. This dataset contains information about police agencies involved in at least one fatal police shooting since 2015.

Dataset 1 (“fatal-police-shootings-data”):

  • Description: This dataset comprises 19 columns and 8770 rows, covering the period from January 2, 2015, to October 7, 2023. Several columns have missing values, including “threat_type,” “flee_status,” “armed_with,” “city,” “county,” “latitude,” “longitude,” “location_precision,” “name,” “age,” “gender,” “race,” and “race_source.” (A short sketch for checking these missing-value counts appears after the column descriptions below.)
  • Columns:
    • “threat_type” indicates different threat levels during encounters, such as “point,” “move,” “attack,” and “shoot.”
    • “flee_status” indicates whether the individual attempted to flee.
    • “armed_with” specifies the type of weapon or item the individual had.
    • Location data includes city, county, state, latitude, and longitude, facilitating geographical analysis.
    • Demographic details like name, age, gender, and race are provided.
    • The “mental_illness_related” column indicates if mental illness was a factor in the incident.
    • “body_camera” signifies whether law enforcement officers had active body cameras during the encounter.
    • “agency_ids” may represent the law enforcement agencies involved in these incidents.
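
A minimal sketch for checking the dimensions and the column-level missing-value counts described above, using the file path given at the start of this entry:

    import pandas as pd

    # Path as described in this entry; adjust to the local copy of the data.
    df = pd.read_csv("v2/fatal-police-shootings-data.csv")

    print(df.shape)                                       # expected: (8770, 19)
    print(df.isna().sum().sort_values(ascending=False))   # missing values per column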

Dataset 2 (“fatal-police-shootings-agencies”):

  • Description: This dataset includes six columns and 3322 rows. Some entries have missing values in the “oricodes” column.
  • Columns:
    • “id” serves as a unique identifier for each law enforcement agency.
    • “name” designates the name of the law enforcement agency, which can include sheriff’s offices, local police departments, state police agencies, and others.
    • The “type” column categorizes the law enforcement agency by type, such as “sheriff,” “local_police,” “state_police,” and more.
    • “state” identifies the state where the law enforcement agency is located.
    • “oricodes” contains a code or identifier associated with the law enforcement agency.
    • “total_shootings” records the total number of shootings or incidents involving the respective law enforcement agency.

Summary:

The datasets in Project 2 provide valuable information about law enforcement encounters and police agencies involved in fatal incidents. Dataset 1 focuses on individual cases, their characteristics, and the circumstances, while Dataset 2 offers insights into the law enforcement agencies, their types, locations, and their involvement in such incidents. Further analysis or specific questions about the data would require additional context and particular queries.

Sunday – October 8, 2023.

This is Project 1 for MTH 522 at the University of Massachusetts Dartmouth.

Project Title:

Unlocking Public Health: An Analysis of CDC Data on Diabetes, Obesity, and Inactivity in US Counties (2018).

The provided dataset has been thoroughly examined and comprehensively reported in the project document.

The contribution report has been added to the final page of the report.

Report for Project 1 - MTH 522

Friday – October 6, 2023.

Firstly, I conducted a geographical analysis. This analysis helped identify geographic patterns, disparities, or clusters within the dataset, shedding light on potential regional variations.

In addition to the geographical analysis, I delved into predictive modeling. Specifically, I employed ridge and linear regression techniques to develop models that capture and predict key relationships within the data. Ridge regression was used to address multicollinearity and prevent overfitting, enhancing the robustness of the predictive models, while linear regression provided insights into the linear relationships between variables.

Beyond model development, I thoroughly evaluated the performance of these models. This evaluation involved assessing their predictive accuracy, goodness-of-fit, and statistical significance. Through these analyses, I aimed to not only understand the dataset better but also derive actionable insights that could inform decision-making or further research in the field.
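
A minimal sketch of that modeling and evaluation step with scikit-learn is shown below; the file path and the “%inactive,” “%obese,” and “%diabetic” column names are placeholders for whatever the merged CDC dataset actually uses.

    import pandas as pd
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    data = pd.read_excel("cdc_data.xlsx")  # placeholder path to the merged dataset
    X = data[["%inactive", "%obese"]]      # hypothetical predictor columns
    y = data["%diabetic"]                  # hypothetical target column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    for name, model in [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(name,
              "RMSE:", round(mean_squared_error(y_test, pred) ** 0.5, 3),
              "R^2:", round(r2_score(y_test, pred), 3))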

Wednesday – October 4, 2023.

In today’s update, I’d like to inform you that we’re nearing the completion of our analysis. Currently, we’re consolidating everyone’s work and putting together a report. Our goal is to finish the initial draft of the report by this Friday and have it reviewed on the same day.

In our analysis, we’ve consistently calculated summary statistics. We’ve also employed various data modeling techniques, such as linear regression and logistic regression. To assess these models, we’ve used methods like cross-validation, p-values, and confidence intervals.
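
A minimal sketch of how those checks might look, again using the placeholder column names from the October 6 sketch: statsmodels for p-values and confidence intervals, scikit-learn for cross-validation.

    import pandas as pd
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    data = pd.read_excel("cdc_data.xlsx")  # placeholder path
    X = data[["%inactive", "%obese"]]      # hypothetical predictors
    y = data["%diabetic"]                  # hypothetical target

    # p-values and 95% confidence intervals for the coefficients of an OLS fit.
    ols = sm.OLS(y, sm.add_constant(X)).fit()
    print(ols.pvalues)
    print(ols.conf_int())

    # 5-fold cross-validated R^2 for the equivalent scikit-learn model.
    print(cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2"))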

Monday – October 2, 2023.

I have written the following draft report.

Data Preparation:

  1. Data Gathering: Collect data from various sources.
  2. Data Cleaning: Remove duplicates, handle missing values, and correct inconsistencies.
  3. Data Integration: Combine three Excel sheets into a dataset containing 354 data points.
  4. Column Naming: Rename columns for clarity and understanding.

Exploratory Data Analysis (EDA):

  1. Summary Statistics: Compute mean, median, skewness, kurtosis, standard deviation, and percentiles (see the sketch after this list).
  2. Data Visualization: Generate plots and charts to visualize data and explore relationships between variables.
  3. Outlier Detection: Identify and handle outliers.
  4. Geographical Analysis: It was discovered that 138 counties in the dataset belong to a single state. Tattnall County in Georgia has the highest combined percentage of inactivity, obesity, and diabetes, totaling 47.3%.
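
A minimal sketch of the summary-statistics step (item 1 above), assuming the merged 354-row dataset is loaded as a DataFrame and using a hypothetical “%diabetic” column name:

    import pandas as pd

    data = pd.read_excel("cdc_data.xlsx")  # placeholder path to the merged sheets
    col = data["%diabetic"]                # hypothetical column name

    print(col.agg(["mean", "median", "std", "skew", "kurt"]))
    print(col.quantile([0.25, 0.5, 0.75, 0.9]))  # selected percentiles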

Data Modeling:

  1. Algorithm Selection: Choose appropriate machine learning or statistical algorithms based on the problem type (classification, regression, clustering, etc.).
  2. Model Evaluation: Assess model performance using evaluation metrics such as accuracy, F1-score, and RMSE on the testing data.
  3. Hyperparameter Tuning: Optimize model hyperparameters to enhance performance.

Interpretation of Model:

  1. Feature Interpretation: Determine which features have the most significant impact on the model’s predictions.
  2. Model Explanation: Understand the rationale behind the model’s predictions.

Reporting and Visualization:

  1. Report Creation: Summarize findings, insights, and model performance in clear and concise reports.
  2. Result Visualization: Use charts, graphs, and dashboards to communicate results effectively.

Deployment & Real-world Monitoring:

  1. Model Deployment: Deploy the model in a real-world environment to generate answers.
  2. Continuous Monitoring: Monitor the model’s performance in the real world and make necessary adjustments.

Documentation:

  1. Process Documentation: Document all the steps taken during the analysis for future reference.

Feedback:

  1. Feedback Collection: Gather input from professors and teaching assistants to improve the analysis and presentation.