Category Archives: Uncategorized

Healthcare Provider Fraud: Exploring the Outpatient Data

I explored the inpatient data in my previous blog. Let’s take a look at the outpatient data. Its shape comes back as (517737, 27), so this is a much bigger data set than the inpatient one: outpatient has 517,737 rows of data, whereas inpatient only had 40,474 rows. This is not surprising, as there are many advantages to having outpatient… Continue reading “Healthcare Provider Fraud: Exploring the Outpatient Data”
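Here’s a minimal sketch of that first shape check; the file names below are my placeholders for the Kaggle csvs, which may be named differently:

```python
import pandas as pd

# Placeholder file names for the Kaggle csvs.
outpatient = pd.read_csv("Train_Outpatientdata.csv")
inpatient = pd.read_csv("Train_Inpatientdata.csv")

print(outpatient.shape)  # (517737, 27)
print(inpatient.shape)   # (40474, 30)
```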
Healthcare Provider Fraud: Exploring the Inpatient Data
As we continue our project on Healthcare Provider Fraud Detection, we explore the remainder of our data. We have two csv files, one for Inpatient data and the other for Outpatient data. Let’s load in our data and take a look at its shape and the dataframe itself. We have 40,474 rows of data with 30 columns… Continue reading “Healthcare Provider Fraud: Exploring the Inpatient Data”
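A sketch of that first look at the inpatient file, again assuming a placeholder file name:

```python
import pandas as pd

inpatient = pd.read_csv("Train_Inpatientdata.csv")  # assumed file name

print(inpatient.shape)  # (40474, 30)
inpatient.info()        # column names, dtypes, and non-null counts
inpatient.head()        # the first few rows of the dataframe itself
```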
HC Provider Fraud: A Look At Our Data So Far
Our beneficiary data provides information on several chronic conditions, such as Alzheimer’s, cancer, depression, diabetes, heart failure, ischemic heart disease, kidney disease, obstructive pulmonary disease, osteoporosis, rheumatoid arthritis, and stroke. The blue bars, or 1, show when a patient is positive for a chronic condition. The orange bars, or 2, show when a patient is… Continue reading “HC Provider Fraud: A Look At Our Data So Far”
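A sketch of how those 1/2 flags could be tallied per condition; the file name and the ChronicCond_ column prefix are guesses at the beneficiary file’s schema:

```python
import pandas as pd

beneficiary = pd.read_csv("Train_Beneficiarydata.csv")  # assumed file name

# Assumed column-name prefix for the chronic condition flags.
chronic_cols = [c for c in beneficiary.columns if c.startswith("ChronicCond_")]

# In this coding, 1 means positive for the condition and 2 means negative,
# so counting the 1s gives the number of positive patients per condition.
positives = (beneficiary[chronic_cols] == 1).sum()
print(positives.sort_values(ascending=False))
```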
Healthcare Provider Fraud Detection Analysis
I came across a data set on Kaggle that attempts to determine whether a healthcare provider is committing fraud. Let’s take a deep dive into the data and perform our exploratory data analysis (EDA). I use pandas to read the first csv file and name it train. I then use train.shape to see how many… Continue reading “Healthcare Provider Fraud Detection Analysis”
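A sketch of those first steps; the file name and the label column here are assumptions about the data set:

```python
import pandas as pd

train = pd.read_csv("Train.csv")  # assumed file name
print(train.shape)                # number of rows and columns

# Assumed label column flagging potentially fraudulent providers.
print(train["PotentialFraud"].value_counts())
```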
What is Gradient Boosting?
Gradient Boosting takes a predictive model that performs only slightly better than random chance. This model is called a weak learner. Boosting is the process that takes this weak learner and figures out what it got wrong. It then builds another model based on the weak learner’s errors in an attempt to improve its… Continue reading “What is Gradient Boosting?”
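As a minimal illustration of the idea, here is scikit-learn’s GradientBoostingClassifier on synthetic data (not the project’s data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each stage fits a shallow tree (a weak learner) to the current errors
# of the ensemble, and learning_rate shrinks each stage's correction.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```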
Predictive Modeling Continued: Random Forests
Random Forests is an ensemble method built upon Decision Trees. You can read about Decision Trees in my previous article here. An ensemble method uses multiple predictive models to achieve better predictive performance than any single model. Ensemble methods work off the idea of the “Wisdom of the Crowd”. This phrase refers to the phenomenon that the… Continue reading “Predictive Modeling Continued: Random Forests”
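A minimal sketch with scikit-learn’s RandomForestClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree trains on a bootstrap sample and considers a random subset
# of features at each split; the forest's majority vote is the "crowd".
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```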
More Visualizations Utilizing Datetime
Engineering a days_elapsed feature by taking the difference between when a 311 service request was created (aug['created_date']) and when it was closed (aug['closed_date']) gives us another interesting feature to look at. (aug['closed_date'] - aug['created_date']).dt.days returns an integer we can use to make visualizations. A scatter plot of the raw data of complaint type and… Continue reading “More Visualizations Utilizing Datetime”
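A sketch of that feature, assuming the 311 data loads from a csv (placeholder name) with parseable date columns:

```python
import pandas as pd

# parse_dates converts both columns to datetimes on read.
aug = pd.read_csv("311_requests_august.csv",  # assumed file name
                  parse_dates=["created_date", "closed_date"])

# Subtracting two datetime columns yields Timedeltas; .dt.days converts
# them to integer days elapsed between creation and closure.
aug["days_elapsed"] = (aug["closed_date"] - aug["created_date"]).dt.days
```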
More Predictive Models: K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression. KNN predicts based on the distance between points and assumes that the smaller the distance between two points, the more similar they are. To make a prediction, KNN takes the point to be predicted and finds the K closest points to… Continue reading “More Predictive Models: K-Nearest Neighbors”
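A minimal sketch with scikit-learn’s KNeighborsClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With n_neighbors=5, each prediction is the majority class among the
# 5 training points closest (by Euclidean distance) to the query point.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```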
Optimizing Our Predictive Model: SMOTE
SMOTE stands for Synthetic Minority Over-sampling Technique. The main difference between SMOTE and simple random oversampling is that SMOTE will not only increase the size of the training data set, but will also increase the variety of training examples. Random oversampling increases the size of the training data by repeating the original examples; SMOTE creates new training examples… Continue reading “Optimizing Our Predictive Model: SMOTE”
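A sketch using the imbalanced-learn package’s SMOTE on synthetic imbalanced data:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE interpolates between a minority point and one of its nearest
# minority-class neighbors, so the new examples are not exact copies.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(y_res.sum(), len(y_res))  # minority count after balancing
```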
Optimizing Our Predictive Model With Resampling
We made a predictive model using Logistic Regression. The model yielded an F1 score of 0.5935. One of the things we can take a look at to optimize our model is…our data. Do we have a class imbalance problem? Are there more negative outcomes than positive outcomes, or vice versa? We can use sampling techniques… Continue reading “Optimizing Our Predictive Model With Resampling”
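A sketch of simple upsampling with scikit-learn’s resample, on a synthetic frame standing in for the training data (the "target" column name is a placeholder):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic imbalanced frame standing in for the training data.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
df = pd.DataFrame(X)
df["target"] = y  # placeholder label column

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Sample the minority class with replacement until the classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["target"].value_counts())
```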