MLB - Hall of Fame
Notes:
- This document describes my implementation without much code. It is written for a technical audience; for more code and logic details, please refer to
mlb-data_challenge
- The whole analysis was completed within 24 hours, so some of the analysis and assumptions are rough. There may be errors; please focus on the structure rather than every detail.
- Detailed code can be found in my GitHub repository.
Business Background
Background
Baseball has a huge fan base around the world, and the most accomplished players are honored by induction into the Hall of Fame.
This is one of the most prestigious awards in baseball.
So if we can identify which players are more likely to be inducted into the Hall of Fame, we can make better investment decisions.
- Executives can see which players have higher potential
- Advertising companies can see which players might bring in higher revenue
Problem & Task
For this exercise, I analyze Major League Baseball statistics and implement a classifier that indicates whether a player is in the baseball Hall of Fame, based on the player's historical performance.
Data Source
Environment
- Platform : Google Colaboratory
- Language : Python
- Version: 3.7.4
- Packages :
- pandas
- numpy
- matplotlib
- Scikit-learn
- seaborn
- Initial Models :
- Decision Tree
- K Nearest Neighbors
- Random Forest
- XGBoost
Source Table
The data come from Sean Lahman. He summarized the data on his website (link).
The file contains:
- MASTER - Player names, DOB, and biographical info
- Batting - batting statistics
- Pitching - pitching statistics
- Fielding - fielding statistics
- AllStarFull - All-Star appearances
- Hall of Fame - Hall of Fame voting data
- Managers - managerial statistics
- Teams - yearly stats and standings
- BattingPost - post-season batting statistics
- PitchingPost - post-season pitching statistics
- TeamFranchises - franchise information
- FieldingOF - outfield position data
- FieldingPost - post-season fielding data
- ManagersHalf - split season data for managers
- TeamsHalf - split season data for teams
- Salaries - player salary data
- SeriesPost - post-season series information
- AwardsManagers - awards won by managers
- AwardsPlayers - awards won by players
- AwardsShareManagers - award voting for manager awards
- AwardsSharePlayers - award voting for player awards
- Appearances - details on the positions a player appeared at
- Schools - list of colleges that players attended
- CollegePlaying - list of players and the colleges they attended
For this project, I mainly use the HallofFame, Master, and Batting tables for the analysis.
More data will be brought into the analysis in the future.
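As a quick illustration, the three tables could be loaded and keyed on playerID roughly as follows; the file names are assumptions based on the standard Lahman CSV release and may differ from the repository.
import pandas as pd
# assumed file names, adjust to match the downloaded Lahman release
Master = pd.read_csv('Master.csv')
HallOfFame = pd.read_csv('HallOfFame.csv')
Batting = pd.read_csv('Batting.csv')
# all three tables share the playerID key, so they can later be joined at the player level
for name, df in [('Master', Master), ('HallOfFame', HallOfFame), ('Batting', Batting)]:
    print(name, df.shape, df.playerID.nunique())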
Master Table
Master table contains the following information:
Hall of Fame Table
Hall of Fame table contains the following information:
Batting Table
Batting table contains the following information:
Analysis Flow
I followed the approach below for the analysis. Due to limited time, I was unable to analyze very deeply or go through every detail, but the logic is the same as what is shown below.
Picture Source: hackernoon
Feature Engineering
Objective: Predict if a player is in the baseball Hall of Fame
Approach: Based on the objective, I aggregated all the data to the individual player level. The unit of observation in the final table is one row per player, containing demographic features and baseball performance statistics.
Due to the length of this document and the limited time, I only walk through the feature engineering process for one table. For the other tables, the approach and logic are exactly the same; for more details, please refer to the code.
Master Table
There are 18846 rows in this table and 18846 unique players, so there are no duplicate records. Each row represents one player's information.
# check how many players are in this table
Master.playerID.nunique()
Keep necessary columns
Not all columns are useful for prediction. For example, I only need to know when the player was born so that I can derive his age; I don't need his birth month or birth city, etc.
To avoid redundant information, I decided to drop some columns based on my understanding. With more knowledge of the data and the domain, I could make better decisions; for now, I drop them.
# drop columns
droplist = ['birthMonth','birthDay', 'birthCountry','birthState', 'birthCity','deathDay','deathCountry',
'deathState', 'deathCity', 'nameFirst', 'nameLast','nameGiven','retroID', 'bbrefID']
master_x = Master.drop(droplist,axis = 1).copy()
Missing Values
I checked the data for columns with missing values and decided how to handle them. The data is pretty clean.
Death column
There are lots of missing values in the death-related fields, which makes sense because many players are still alive. For players who have died, this information is still useful, so the missing values here are expected. With more knowledge of the data, I could make better decisions, such as creating dummy variables.
For now, these missing values make sense and I choose to keep them.
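As a hedged illustration of the dummy-variable idea (not part of the current pipeline), a simple deceased flag could be derived from the death columns:
# hypothetical alternative: flag whether a player is deceased instead of using the sparse death columns directly
master_x['is_deceased'] = master_x.deathYear.notnull().astype(int)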
Other columns
Columns like throws and bats indicate a player's characteristics, and I do think they are important. Since less than 1% of the values are missing, I decided to drop those rows for now. With more knowledge of the data, I could do better imputation, such as predictive modeling, or find third-party data to complete the information.
For columns like height and weight, I impute the median for now (see the code below).
# Change setting to display all columns and rows
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
# check the missing percentage
1 - master_x.count()/len(master_x.index)
# output
playerID 0.000000
birthYear 0.007588
deathYear 0.504616
deathMonth 0.504669
weight 0.046217
height 0.042715
bats 0.063196
throws 0.051894
debut 0.010241
finalGame 0.010241
dtype: float64
# drop empty rows
master_x = master_x[(master_x.debut.notnull() & master_x.finalGame.notnull()) & master_x.birthYear.notnull() &
master_x.throws.notnull() & master_x.bats.notnull()]
# impute missing values using median number (height, weight)
master_x.weight.fillna((master_x.weight.median()), inplace=True)
master_x.height.fillna((master_x.height.median()), inplace=True)
# the current missing values percentage after processing
1 - master_x.count()/len(master_x.index)
Now there are only 17446 records remaining.
# there are 17446 records remaining
master_x.shape
Transformation
I care more about a player's age than about when he was born, so I am going to replace the birth-year column with age.
As context on the Hall of Fame rules: if a player dies within the five-year waiting period, he is eligible six months after his death, provided he meets the other criteria. If an active player dies, he is eligible six months after his death.
The age calculation follows these rules:
- If the player is still alive (deathYear is null), then I use current year (2015) - birth year
- If the player is not alive (deathYear is not null), then I use death year - birth year
# calculate the age
master_x['age'] = np.where(master_x.deathYear.isnull(),master_x.birthYear.apply(lambda x: 2015 - x),master_x.deathYear - master_x.birthYear)
Restriction for Hall of Fame - retired at least 5 years
To enter the Hall of Fame, one condition is that the player must have been retired for at least five seasons.
If a player comes back and plays in the major leagues, the clock restarts. The easiest way to apply the rule is to add six to the last season the player was active; therefore, players eligible in 2007 played their last game in 2001.
So I want to identify these inactive players: only players who have been retired for more than five years qualify for the Hall of Fame (see the filter sketch after the code below).
# create a new column indicating the final game year for each player
master_x['finalgame_year'] = master_x.finalGame.apply(lambda x: float(str(x)[:4]))
# create label indicating the retire years so far
master_x['retire'] = master_x.finalgame_year.apply(lambda x: 2015 - int(x))
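As a rough sketch of how the eligibility filter might then be applied; the exact cutoff depends on how the six-month rule is counted, so this is an assumption rather than necessarily the exact filter used in the repository.
# keep only players who have been retired for at least five full seasons as of 2015
eligible_players = master_x[master_x.retire >= 5].copy()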
Now that I am done with this dataset, I select only the columns needed for prediction.
master_final = master_x[['playerID','age','weight', 'height', 'bats', 'throws', 'debut','finalgame_year','retire']]
master_final.head()
Preview the data
playerID | age | weight | height | bats | throws | debut | finalgame_year | retire |
---|---|---|---|---|---|---|---|---|
aardsda01 | 34.0 | 220.0 | 75.0 | R | R | 2004-04-06 | 2015.0 | 0 |
aaronha01 | 81.0 | 180.0 | 72.0 | R | R | 1954-04-13 | 1976.0 | 39 |
aaronto01 | 45.0 | 190.0 | 75.0 | R | R | 1962-04-10 | 1971.0 | 44 |
aasedo01 | 61.0 | 190.0 | 75.0 | R | R | 1977-07-26 | 1990.0 | 25 |
abadan01 | 43.0 | 184.0 | 73.0 | L | L | 2001-09-10 | 2006.0 | 9 |
Exploratory Data Analysis
Here I mainly use basic exploratory data analysis to gain insights into the data.
Outlier Detection
Weight and Height
The players' weight and height contain some extreme values. It is possible for some baseball players to be much heavier or taller than others, and I lack the domain knowledge to judge; given more time, I could make better decisions.
For now, I simply cap those "outlier" values using the 1.5 × IQR rule based on the 25th and 75th percentiles.
# weight distribution
plt.figure(figsize = (20,5))
plt.subplot(1,4,1)
plt.title('weight Distribution')
sns.distplot(full_data.weight)
plt.subplot(1,4,2)
plt.title('weight BoxPLot')
sns.boxplot(full_data.weight)
plt.subplot(1,4,3)
plt.title('height Distribution')
sns.distplot(full_data.height)
plt.subplot(1,4,4)
plt.title('height BoxPLot')
sns.boxplot(full_data.height)
plt.show()
# create a function to cap outliers based on the IQR of the given percentile range
def remove_outlier(column_name, low, high):
    q1, q3 = np.percentile(full_data[column_name], [low, high])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    full_data.loc[full_data[column_name] > upper_bound, column_name] = upper_bound
    full_data.loc[full_data[column_name] < lower_bound, column_name] = lower_bound
# cap outliers in the weight column
remove_outlier('weight', 25, 75)
# cap outliers in the height column
remove_outlier('height', 25, 75)
Bats and throws
The distributions of bats and throws are also not very balanced. This makes sense since most people are right-handed. I am not sure whether this will affect the analysis at this point, but I will do feature selection later to see if these features are important (an encoding sketch follows the plot below).
# check the distribution
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
sns.countplot(full_data.bats)
plt.subplot(1,2,2)
sns.countplot(full_data.throws)
plt.show()
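Since bats and throws are categorical, they need a numeric encoding before modeling. Below is a minimal one-hot encoding sketch; this is an assumption, and the repository may encode them differently.
# one-hot encode the handedness columns (drop_first avoids a redundant level)
full_data = pd.get_dummies(full_data, columns=['bats', 'throws'], drop_first=True)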
Age
It seems that most players' ages fall in a reasonable range.
However, the box plot shows some outliers that are not plausible.
# age distribution
plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.title('Age Distribution')
sns.distplot(full_data.age)
plt.subplot(1,2,2)
plt.title('Age BoxPLot')
sns.boxplot(full_data.age)
plt.show()
There are some players aged more than 120 years, which doesn't make sense, so I decided to remove them for data-quality reasons.
The lowest age here is -19 years, which should also be removed.
print('Max Age')
print('---------')
print(full_data.age.sort_values(ascending = False)[:10])
print('Min Age')
print('---------')
print(full_data.age.sort_values(ascending = True)[:10])
# output
Max Age
---------
3736 168.0
13102 164.0
12768 161.0
7981 159.0
8358 156.0
16803 155.0
3954 151.0
7367 150.0
11875 150.0
8556 147.0
Name: age, dtype: float64
Min Age
---------
7882 -19.0
14651 20.0
11901 20.0
3729 20.0
2740 21.0
14225 21.0
14307 21.0
11703 21.0
7046 21.0
16717 21.0
Name: age, dtype: float64
# only keep reasonable ages
full_data = full_data[(full_data.age <= 120) & (full_data.age > 15)]
Predictive Modeling
Normalization
Normalizing the features allows gradient descent to converge faster and reduces the bias toward features with large ranges.
from sklearn import preprocessing
# x is the feature matrix (all predictors except the target)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_scaled_df = pd.DataFrame(x_scaled, columns = x.columns)
Feature Selection
Taking a univariate selection approach, I use the chi2 score since the target is categorical. I also tested f_classif, which also worked, and mutual_info_classif (long runtime).
There are actually multiple ways to do feature selection:
- Univariate selection
- Feature importance
- Correlation matrix with heatmap
- And so on
I chose the first approach somewhat arbitrarily; with more time, I would also try the other approaches. In addition, the correlation-matrix approach showed that many features are highly correlated, which left me with very little to work with; I suspect this is due to the small sample size, and I would need more information to make a decision. For now, univariate selection is a reasonable approach.
from sklearn.feature_selection import SelectPercentile, chi2
selector = SelectPercentile(chi2, percentile=20)
fit = selector.fit(x_scaled_df, y)
# Show scores from greatest to least
x_scores = pd.DataFrame(fit.scores_)
x_col = pd.DataFrame(x_scaled_df.columns)
featureScores = pd.concat([x_col, x_scores], axis=1)
featureScores.columns = ['Feature', 'Score']
fs_sorted = featureScores.sort_values('Score', ascending=False)
fs_sorted
Features | Score |
---|---|
R | 50.027255 |
RBI | 49.348040 |
H | 44.371156 |
And so on | And so on |
# Filter dataframe on features with highest scores
x_chi2 = x_scaled_df[fs_sorted.Feature[0:14].tolist()]
x_chi2.head()
Model Construction
Here I skip the basic training and train/test split process and briefly talk about the models.
I used models including
- K Nearest Neighbor
- Decision Tree
- Random Forest
- XGBoost
Other Models
In this document, I only show how I worked with XGBoost; for the other models, please refer to the detailed code (a brief baseline sketch follows).
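For reference, here is a minimal sketch of how the other baselines could be fit with scikit-learn; the hyper-parameters shown are illustrative defaults, not necessarily the settings used in the repository.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# illustrative baseline models with default-ish settings
baselines = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))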
XGBoost
Documentation: https://xgboost.readthedocs.io/en/latest/
import xgboost as xgb
from sklearn.metrics import classification_report
# Fit an XGBoost model (XGBRegressor with a binary hinge objective) to the training set
xg_reg = xgb.XGBRegressor(objective='binary:hinge',
                          learning_rate=0.1,
                          max_depth=5)
xg_reg.fit(X_train, y_train)
# Test the model
pred_xgb = xg_reg.predict(X_test)
# Get the model performance
print(classification_report(y_test, pred_xgb))
# Get all measurements (get_all_measurements is a custom helper defined in the repo)
xgb_summary = get_all_measurements(y_test, pred_xgb, "XGBoost")
xgb_summary
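get_all_measurements is a custom helper defined in the repository; a plausible sketch of what such a helper might compute is shown below, though the real implementation may differ.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def get_all_measurements(y_true, y_pred, model_name):
    # collect the main classification metrics in one row for later model comparison
    return pd.DataFrame([{
        'model': model_name,
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }])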
The XGBoost model gives roughly 92% accuracy, 0.58 precision, and 0.36 recall, which is not very good. I will discuss the reasons for this low performance later.
from sklearn.metrics import roc_curve
# plot roc curve
plt.figure(figsize = (6,4))
fpr_xgb, tpr_xgb, thresholds = roc_curve(y_test, pred_xgb)
plt.plot(fpr_xgb, tpr_xgb, linewidth=1, label='XGBoost')
plt.title("XGBoost ROC Curve")
plt.show()
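Since fpr_xgb and tpr_xgb are already computed, the corresponding AUC can be reported as well; this is a small addition not in the original snippet.
from sklearn.metrics import auc
print('XGBoost AUC: {:.3f}'.format(auc(fpr_xgb, tpr_xgb)))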
Visualize Results: Feature Importance and sample tree
Visualizing the model gives a better idea of what is going on inside the black box.
# Visualizing: Shows the sixth boosted tree
plt.rcParams['figure.figsize'] = [70, 35]
xgb.plot_tree(xg_reg, num_trees=5)
This is the visualization of the tree; a clearer picture can be found here: Link
Hyper-parameter Tuning
I can improve the performance (AUC, accuracy, etc) by tuning the parameters, but this requires more time. I have included code below that can be used for this purpose.
The most popular method for tuning hyperparameters is using GridSearch.
Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Code for using grid search to tune hyperparameters
# list some candidate parameters for tuning; will add more in the future
parameters = {'objective' : ['binary:hinge','binary:logistic']}
#               'learning_rate' : [0.05, 0.1],
#               'max_depth' : [5, 4, 3],
#               'gamma' : [0.5, 1]}
clf = GridSearchCV(xgb.XGBRegressor(), parameters, n_jobs=5,
                   cv=StratifiedKFold(n_splits=5, shuffle=True),
                   scoring='roc_auc',
                   verbose=2, refit=True)
clf.fit(X_train, y_train)
print(clf.best_score_)
print(clf.best_params_)
# output
0.9061710653839832
{'objective': 'binary:logistic'}
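Because refit=True, the best model found by the grid search is already retrained on the full training set and can be used directly; a small usage sketch:
# pull out the refit best model and score it on the held-out test set
best_model = clf.best_estimator_
pred_tuned = best_model.predict(X_test)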
Model Comparison and Conclusion
Here, I plot all the ROC curves together.
plt.figure(figsize = (12,8))
plt.plot(fpr_xgb, tpr_xgb, linewidth=1, label='XGBoost')
plt.plot(fpr_knn, tpr_knn, linewidth=1, label='KNN')
plt.plot(fpr_dt, tpr_dt, linewidth=1, label='Decision Tree')
plt.plot(fpr_rf, tpr_rf, linewidth=1, label='Random Forest')
plt.title("Model ROC Curves")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc = 'lower right')
plt.show()
And here is the result comparison across the models.
With limited information, I am not sure about the penalty/cost of wrong predictions (false positives and false negatives), so personally I prefer a model with high precision and recall on imbalanced data.
Since Random Forest has the highest F1-score, it is my best model this time (see the comparison sketch below).
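A sketch of how the per-model summaries could be stacked for this comparison; the variable and column names are assumptions tied to the get_all_measurements sketch above.
# stack the per-model summary rows (assumed names) and rank by F1-score
model_comparison = pd.concat([knn_summary, dt_summary, rf_summary, xgb_summary],
                             ignore_index=True)
model_comparison.sort_values('f1', ascending=False)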
Reflection & What’s Next
Imbalanced Issue
I think this is the most severe issue in this analysis.
The data is very imbalanced with respect to the target variable, which causes problems for the prediction. With this in mind, I finished the prediction first and now discuss the consequences of using imbalanced data.
# check the distribution
plt.figure(figsize = (6,5))
sns.countplot(full_data.inducted)
Most of the time we refer to accuracy; however, for imbalanced data, accuracy may not be very meaningful.
Conventional model evaluation methods do not accurately measure model performance when faced with imbalanced datasets.
“Standard classifier algorithms like Decision Tree and Logistic Regression have a bias towards classes which have number of instances. They tend to only predict the majority class data. The features of the minority class are treated as noise and are often ignored. Thus, there is a high probability of misclassification of the minority class as compared to the majority class” (from Link )
So we need to address this issue. Possible approaches include:
- Collect more data
- Resampling techniques (a minimal sketch follows this list)
- Random undersampling
- Random oversampling
- Cluster-based oversampling
- Try more ensemble models
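Below is a minimal sketch of the random-oversampling idea, assuming the target column inducted has already been encoded as 0/1 (in the raw Hall of Fame table it is Y/N); imblearn's RandomOverSampler or SMOTE would be a more standard choice.
from sklearn.utils import resample
# split the majority (not inducted) and minority (inducted) classes
majority = full_data[full_data.inducted == 0]
minority = full_data[full_data.inducted == 1]
# randomly oversample the minority class up to the majority class size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced_data = pd.concat([majority, minority_upsampled])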
Model Aspect
Constructing Model
I tried some baseline models to get an initial feel for the prediction task. However, I was unable to do thorough hyper-parameter tuning, which is very important for modeling.
I should also try more ensemble modeling techniques (a stacking sketch follows this list).
So, next:
- Try hyper-parameter tuning
- Grid search
- Random search
- Try ensemble models
- Bagging
- Boosting
- Stacking, etc.
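As one illustration of the stacking idea; the base learners and meta-learner below are assumptions, and StackingClassifier requires scikit-learn 0.22 or later.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
# stack two base learners with a logistic-regression meta-learner
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))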
Evaluating Models
Apart from constructing models, we should also consider more evaluation metrics.
A lift chart could be a good example.
In addition, we should collect information about the cost of wrong predictions (false positives and false negatives) so that we can better decide which metric works best in our context.
With the penalty/cost information, we could also build a cost matrix; combined with the confusion matrix, this gives the expected cost of making wrong predictions, which is helpful when choosing models (see the sketch below).
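A rough sketch of the expected-cost calculation; the cost values below are placeholders, not real business numbers.
import numpy as np
from sklearn.metrics import confusion_matrix
# placeholder costs: rows = actual class (0, 1), columns = predicted class (0, 1)
cost_matrix = np.array([[0, 5],    # actual 0: [TN cost, FP cost]
                        [10, 0]])  # actual 1: [FN cost, TP cost]
cm = confusion_matrix(y_test, pred_xgb)
# average cost per prediction, weighting each confusion-matrix cell by its cost
expected_cost = (cm * cost_matrix).sum() / cm.sum()
print('Expected cost per prediction: {:.2f}'.format(expected_cost))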
Time Series Model
Our current objective is to predict whether a player is in the Hall of Fame. In the future, we can extend the analysis/model to predict whether a player will enter the Hall of Fame next year.
This becomes a time-series prediction problem.
We could take advantage of historical data to predict the future. From a business application perspective, this model might be more widely useful.
Data Source Aspect
This time, I only used a few tables. In future analysis, we should include more player information.
For example:
- Number of Big Awards Received
- Number of wins
- Number of losses
- Team
- League
- Salary Information
- And so on
With more comprehensive and reliable data, we can make better predictions.
Data Engineering Aspect
We cannot always guarantee data quality; missing values and outliers will always exist.
This time, I dropped some outliers directly because they make up a very small proportion of the data. With more time, research, and information, we could try some imputation approaches (see the sketch below), such as:
- Aggregation (min/max/median)
- Prediction Models
Or we might find some third-party data to supplement our source tables.
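As a rough illustration of the prediction-model imputation idea above (the columns are illustrative, and IterativeImputer requires scikit-learn 0.21 or later):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# model-based imputation for a few numeric columns instead of dropping rows
num_cols = ['weight', 'height', 'age']
imputer = IterativeImputer(random_state=0)
full_data[num_cols] = imputer.fit_transform(full_data[num_cols])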
In addition, more exploratory data analysis should be performed.
Application Aspect
We should think more about how this prediction can impact real-world business so that we can better tailor the product to the market.
Maybe we could design an interactive dashboard for baseball executives, or build a phone application, etc.
Done!
Thanks for Reading
2020/03/08
Warm Regards
Xiangke Chen