User’s Task:
Certainly! Here is the full Python implementation based on the given requirements:
Ensure you have the required libraries installed:
pip install xgboost pandas scikit-learn
Here is the complete Python code:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score
# Step 1: Load and prepare the data
def load_and_prepare_data():
    # Load datasets from CSV files
    pak_ind_squad_stats = pd.read_csv('pak_ind_squad_stats.csv')
    nassau_games_stats = pd.read_csv('nassau_games_stats.csv')
    betting_odds = pd.read_csv('betting_odds.csv')
    # Combine datasets based on 'match_id'
    data = pd.merge(pak_ind_squad_stats, nassau_games_stats, on='match_id')
    data = pd.merge(data, betting_odds, on='match_id')
    return data
data = load_and_prepare_data()
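# Extract features and target variable
# match_id is an identifier, not a predictor, so it is dropped along with the odds columns
X = data.drop(columns=['match_id', 'actual_outcome', 'betting_odds_team1', 'betting_odds_team2'])
# Recent XGBoost versions expect class labels 0..n-1; actual_outcome is assumed to be 1 or 2 (see Step 6)
y = data['actual_outcome'] - 1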
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: Train the XGBoost model with K-Fold Cross Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = XGBClassifier()
# Evaluate the model using cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
print(f'Cross-validation scores: {cv_scores}')
print(f'Average cross-validation score: {cv_scores.mean()}')
# Train the model on the full training set
model.fit(X_train, y_train)
# Step 3: Make predictions on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test set accuracy: {accuracy}')
# Step 4: Calculate implied odds and compare with actual betting odds
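# Reuse the training feature columns so predict_proba sees the same inputs as fit.
# Note: this includes rows the model was trained on, so the resulting profit is optimistic.
X_full = X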
implied_odds = model.predict_proba(X_full)
implied_odds_df = pd.DataFrame(implied_odds, columns=['prob_team1_win', 'prob_team2_win'])
# Add implied odds to the data
data['prob_team1_win'] = implied_odds_df['prob_team1_win']
data['prob_team2_win'] = implied_odds_df['prob_team2_win']
data['implied_odds_team1_win'] = 1 / data['prob_team1_win']
data['implied_odds_team2_win'] = 1 / data['prob_team2_win']
# Calculate expected value (EV)
data['ev_team1_win'] = (data['betting_odds_team1'] - 1) * data['prob_team1_win'] - (1 - data['prob_team1_win'])
data['ev_team2_win'] = (data['betting_odds_team2'] - 1) * data['prob_team2_win'] - (1 - data['prob_team2_win'])
# Step 5: Make bets based on positive EV
initial_investment = 10000 # example initial bet
data['bet_team1'] = np.where(data['ev_team1_win'] > 0, initial_investment, 0)
data['bet_team2'] = np.where(data['ev_team2_win'] > 0, initial_investment, 0)
# Step 6: Calculate profit/loss
data['profit_team1'] = np.where(data['actual_outcome'] == 1, data['bet_team1'] * (data['betting_odds_team1'] - 1), -data['bet_team1'])
data['profit_team2'] = np.where(data['actual_outcome'] == 2, data['bet_team2'] * (data['betting_odds_team2'] - 1), -data['bet_team2'])
total_profit = data['profit_team1'].sum() + data['profit_team2'].sum()
print(f'Total profit from betting: {total_profit}')
Make sure to replace the file paths and adjust the column names, including the match_id key used to merge the datasets, according to your dataset structure. This script assumes a simplified structure and may require adjustments based on your specific dataset and requirements.
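Before running the script on your own files, it can help to confirm that the merged table actually contains the columns the code refers to. The following is a minimal sketch, not part of the original script: the helper name missing_columns and the REQUIRED_COLUMNS list are hypothetical, and the list only covers the column names that appear in the code above.

```python
# Hypothetical helper: lists the columns the script above relies on and
# reports any that are absent from a table's header.
REQUIRED_COLUMNS = [
    "match_id",
    "actual_outcome",
    "betting_odds_team1",
    "betting_odds_team2",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are not in `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Made-up header for illustration; with pandas, pass `data.columns` instead.
example_header = ["match_id", "actual_outcome", "betting_odds_team1"]
print(missing_columns(example_header))
```

An empty list means every expected column is present. Anything else names the columns to rename or add before the feature extraction step.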
The provided Python code addresses the two requirements mentioned. Here is a detailed look at how each requirement is addressed in the code:
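The code trains an XGBoost model and evaluates it in two ways: 5-fold cross-validation on the training split, and accuracy on a held-out 20% test split. The printed scores show how far the model is from perfect accuracy. Whether it performs well enough to be useful depends on the values you observe on your own data.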
# Step 2: Train the XGBoost model with K-Fold Cross Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = XGBClassifier()
# Evaluate the model using cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
print(f'Cross-validation scores: {cv_scores}')
print(f'Average cross-validation score: {cv_scores.mean()}')
# Train the model on the full training set
model.fit(X_train, y_train)
# Step 3: Make predictions on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test set accuracy: {accuracy}')
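One way to judge whether the printed accuracy is useful is to compare it with a trivial baseline that always predicts the most frequent outcome. The sketch below is an illustration only: the function name majority_baseline_accuracy and the example outcomes are made up, and in the script above you would pass y_test instead.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up outcomes for illustration: team 1 wins 3 of 4 matches.
example_outcomes = [1, 1, 2, 1]
baseline = majority_baseline_accuracy(example_outcomes)
print(f"Majority-class baseline accuracy: {baseline}")
```

If the test set accuracy is not clearly above this baseline, the model has learned little beyond the class frequencies.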
The code converts the model’s predicted probabilities into implied decimal odds (1 / probability) and then calculates the expected value (EV) of a one-unit bet at the bookmaker’s odds. It places a bet only when the EV is positive.
# Step 4: Calculate implied odds and compare with actual betting odds
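# Reuse the training feature columns so predict_proba sees the same inputs as fit.
# Note: this includes rows the model was trained on, so the resulting profit is optimistic.
X_full = X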
implied_odds = model.predict_proba(X_full)
implied_odds_df = pd.DataFrame(implied_odds, columns=['prob_team1_win', 'prob_team2_win'])
# Add implied odds to the data
data['prob_team1_win'] = implied_odds_df['prob_team1_win']
data['prob_team2_win'] = implied_odds_df['prob_team2_win']
data['implied_odds_team1_win'] = 1 / data['prob_team1_win']
data['implied_odds_team2_win'] = 1 / data['prob_team2_win']
# Calculate expected value (EV)
data['ev_team1_win'] = (data['betting_odds_team1'] - 1) * data['prob_team1_win'] - (1 - data['prob_team1_win'])
data['ev_team2_win'] = (data['betting_odds_team2'] - 1) * data['prob_team2_win'] - (1 - data['prob_team2_win'])
# Step 5: Make bets based on positive EV
initial_investment = 10000 # example initial bet
data['bet_team1'] = np.where(data['ev_team1_win'] > 0, initial_investment, 0)
data['bet_team2'] = np.where(data['ev_team2_win'] > 0, initial_investment, 0)
# Step 6: Calculate profit/loss
data['profit_team1'] = np.where(data['actual_outcome'] == 1, data['bet_team1'] * (data['betting_odds_team1'] - 1), -data['bet_team1'])
data['profit_team2'] = np.where(data['actual_outcome'] == 2, data['bet_team2'] * (data['betting_odds_team2'] - 1), -data['bet_team2'])
total_profit = data['profit_team1'].sum() + data['profit_team2'].sum()
print(f'Total profit from betting: {total_profit}')
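The implied-odds, EV, and profit formulas from Steps 4 to 6 can be checked by hand on a single bet. This is a minimal sketch that assumes decimal odds, as the (odds - 1) term in the code suggests. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
def implied_decimal_odds(probability):
    """Decimal odds implied by a win probability (Step 4: 1 / probability)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return 1 / probability


def expected_value_per_unit(decimal_odds, probability):
    """EV of a one-unit stake: win (odds - 1) with p, lose 1 with (1 - p)."""
    return (decimal_odds - 1) * probability - (1 - probability)


def flat_stake_profit(decimal_odds, stake, won):
    """Profit of a single bet (Step 6): net winnings if won, else minus the stake."""
    return stake * (decimal_odds - 1) if won else -stake


# Made-up example: the model gives team 1 a 50% chance, the bookmaker offers 2.5.
model_probability = 0.5
bookmaker_odds = 2.5

fair_odds = implied_decimal_odds(model_probability)
ev = expected_value_per_unit(bookmaker_odds, model_probability)
print(f"Implied odds: {fair_odds}, EV per unit staked: {ev}")

# With the script's flat stake of 10000, the two possible results of this bet:
print(flat_stake_profit(bookmaker_odds, 10000, won=True))
print(flat_stake_profit(bookmaker_odds, 10000, won=False))
```

Because (odds - 1) * p - (1 - p) simplifies to odds * p - 1, the EV is positive exactly when the bookmaker’s odds exceed the implied odds 1 / p. That is the comparison Step 5 uses to decide whether to bet.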
This approach ensures that: