After reviewing the dataset, I’ve observed that missing data is one of the major issues. Some cells are empty, while others contain the placeholder “not_available”.
I’ve identified common methods for handling missing data:
- Delete Rows or Columns: This approach is suitable when missing values are few and randomly distributed, with minimal impact on the analysis.
- Impute Missing Values: Imputation involves replacing missing values with estimated or predicted values. Common methods include:
  - Mean, Median, or Mode imputation: Replacing missing values with the respective column’s mean, median, or mode.
  - Linear Regression imputation: Using other variables to predict and fill in missing values.
  - Interpolation: Estimating missing values based on neighboring data points, especially for time-series data.
  - K-Nearest Neighbors (KNN): Replacing missing values with values from similar rows based on other variables.
  - MICE (Multiple Imputation by Chained Equations): An advanced method that considers relationships between variables.
- Treat Missing Values as a Separate Category: For categorical data, creating a new category such as “Unknown” or “N/A” instead of imputing can be meaningful, particularly when the fact that a value is missing carries information in itself.
- Use Advanced Statistical Techniques: For complex analyses, advanced methods like Expectation-Maximization (EM) algorithms or structural equation modeling may be necessary to handle missing data.
- Data Validation Rules: Setting up data validation rules in Excel can help prevent the entry of missing or invalid data in future entries.
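As a rough illustration of the simpler options above, here is a minimal sketch using only the Python standard library. The sample rows, the column names (`age`, `race`), and the helper `normalize` are hypothetical stand-ins rather than the actual dataset, and the sketch assumes missing cells appear either as empty strings or as the literal `not_available`.

```python
from statistics import median

# Hypothetical sample rows; the real dataset has many more columns.
rows = [
    {"age": "35", "race": "W"},
    {"age": "", "race": "B"},
    {"age": "not_available", "race": "not_available"},
    {"age": "22", "race": ""},
    {"age": "41", "race": "W"},
]

MISSING_MARKERS = {"", "not_available"}  # assumed encodings of "missing"


def normalize(row):
    """Map every missing marker to None so later steps test a single value."""
    return {k: (None if v.strip() in MISSING_MARKERS else v) for k, v in row.items()}


clean = [normalize(r) for r in rows]

# Count missing values per column before choosing a strategy.
missing_counts = {col: sum(r[col] is None for r in clean) for col in clean[0]}

# Option 1: delete rows, keeping only those with no missing values.
complete_rows = [r for r in clean if all(v is not None for v in r.values())]

# Option 2: median imputation for a numeric column.
ages = [float(r["age"]) for r in clean if r["age"] is not None]
age_median = median(ages)
imputed_ages = [float(r["age"]) if r["age"] is not None else age_median for r in clean]

# Option 3: treat missing categorical values as their own category.
races = [r["race"] if r["race"] is not None else "Unknown" for r in clean]
```

Counting the missing values first helps show whether deletion is affordable: if most rows would be dropped, imputation or a separate category is probably the safer choice.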
I will consult with the professor and teaching assistants to determine the most appropriate approach for this dataset.
For dataset 1 (fatal-police-shootings-data), I started by calculating basic summary statistics:
- For latitude: The median latitude is 36.08117181, the lowest latitude is 19.4975033, the highest latitude is 71.3012553, and the standard deviation is 5.346323915.
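- For longitude: The median longitude is recorded as 36.08117181, which is identical to the latitude median and cannot be correct because every longitude in the data is negative, so it needs to be recomputed. The lowest longitude is -9.00718E+15, which lies far outside the valid range of -180 to 180 and must be a data error. The highest longitude is -67.8671657, and the standard deviation of 1.02104E+14 is inflated by that invalid value.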
- For age: The median age is 35, the lowest age is 2, the highest age is 92, and the standard deviation is 12.99.
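The same four statistics can be reproduced outside a spreadsheet. The following is a minimal sketch using Python's `statistics` module; the helper `summarize` and the sample values are hypothetical, and it assumes missing entries have already been converted to `None`. Note that `stdev` is the sample standard deviation, while `pstdev` would give the population version.

```python
from statistics import median, stdev


def summarize(values):
    """Return median, min, max and sample standard deviation, skipping None."""
    observed = [v for v in values if v is not None]
    return {
        "median": median(observed),
        "min": min(observed),
        "max": max(observed),
        "std": stdev(observed),
    }


# Hypothetical ages, including one missing entry.
ages = [35, None, 2, 92, 41, 28]
stats = summarize(ages)
print(stats)
```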
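These statistics suggest potential outliers in the data, such as individuals as young as 2 or as old as 92 involved in police shootings. Such records could reflect data-entry errors or genuine but rare cases, for example misfires or accidental shootings by the police, so they should be checked against the source records before drawing conclusions.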
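One way to separate impossible values from merely unusual ones is a simple range check. The sketch below is illustrative only: the helper `out_of_range` and the sample lists are hypothetical, the longitude bounds follow from the definition of longitude (-180 to 180), and the age bounds of 0 to 120 are an assumption chosen for the example.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the closed range [low, high]."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]


# Hypothetical sample that includes the reported minimum and maximum longitude.
longitudes = [-118.24, -9.00718e15, -67.8671657, None]
bad_longitudes = out_of_range(longitudes, -180.0, 180.0)

# The reported extreme ages pass an assumed plausibility range of 0 to 120.
ages = [2, 35, 92]
bad_ages = out_of_range(ages, 0, 120)

print(bad_longitudes)
print(bad_ages)
```

Under these assumptions, the extreme longitude is flagged as invalid, while ages 2 and 92 are unusual but not impossible and would need a manual look at the underlying records.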
The agency associated with the highest number of police shootings is agency 38, the “Los Angeles Police Department” (LAPD), with 129 incidents.
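A frequency count like this can be sketched with `collections.Counter`. The list of agency ids below is made up for illustration, and the sketch assumes one id per incident; if the real column stores several ids in one cell, they would need to be split first.

```python
from collections import Counter

# Hypothetical agency ids, one entry per incident.
agency_ids = [38, 12, 38, 7, 38, 12]

counts = Counter(agency_ids)

# most_common(1) returns a list holding the single (id, count) pair with the top count.
top_agency, top_count = counts.most_common(1)[0]
print(top_agency, top_count)
```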