We’ll assume the dataset is a CSV file named data.csv.
Question: Calculate the count of each category in the target variable.
Python Code:
import pandas as pd
# Load the dataset
data = pd.read_csv('data.csv')
# Count of each category in the target variable
target_counts = data['target'].value_counts()
print(target_counts)
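If you also want the class balance as proportions rather than raw counts, value_counts accepts a normalize flag (an optional extra, not part of the original question):
# Relative frequency of each category
target_proportions = data['target'].value_counts(normalize=True)
print(target_proportions)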
Question: Count the total number of null values in the dataset.
Python Code:
# Count the total number of null values
total_nulls = data.isnull().sum().sum()
print(total_nulls)
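A per-column breakdown is often more useful when deciding how to handle missing values; dropping the final .sum() gives exactly that:
# Null counts broken down by column
nulls_per_column = data.isnull().sum()
print(nulls_per_column)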
Question: Create a scatter plot of feature1 vs feature2.
Python Code:
import matplotlib.pyplot as plt
# Scatter plot
plt.scatter(data['feature1'], data['feature2'])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Feature 1 vs Feature 2')
plt.show()
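To see how the classes separate in this plane, you can optionally color the points by the target variable; this sketch assumes the categorical target column is named target, as elsewhere in this walkthrough:
# Optional: scatter plot colored by class
for label, group in data.groupby('target'):
    plt.scatter(group['feature1'], group['feature2'], label=label)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(title='target')
plt.show()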
Question: Create a new column feature1_category from feature1 with ‘small’ for values < 12, ‘medium’ for 12 ≤ values ≤ 15, and ‘large’ for the remaining values.
Python Code:
# Data discretization
def categorize(value):
    if value < 12:
        return 'small'
    elif 12 <= value <= 15:
        return 'medium'
    else:
        return 'large'
data['feature1_category'] = data['feature1'].apply(categorize)
print(data[['feature1', 'feature1_category']].head())
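The same rule can be applied without calling a Python function per row. np.select evaluates its conditions in order, so the second condition only sees values ≥ 12 and the closed interval 12 ≤ value ≤ 15 is reproduced exactly:
# Equivalent vectorized binning
import numpy as np
conditions = [data['feature1'] < 12, data['feature1'] <= 15]
choices = ['small', 'medium']
data['feature1_category'] = np.select(conditions, choices, default='large')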
Question: Remove null values, perform label encoding, scale the data, and store the results in X and y.
Python Code:
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Remove null values
data.dropna(inplace=True)
# Label encoding
label_encoder = LabelEncoder()
data['target'] = label_encoder.fit_transform(data['target'])
# Separating features and target; the string column feature1_category created
# earlier is also dropped (if present), since StandardScaler needs numeric input
X = data.drop(columns=['target', 'feature1_category'], errors='ignore')
y = data['target']
# Scaling data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
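If you need to map the encoded integers back to the original class labels, the fitted encoder keeps them in its classes_ attribute (index = encoded value):
# Mapping from encoded integers to original labels
for code, label in enumerate(label_encoder.classes_):
    print(code, '->', label)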
Question: Split the dataset into train and test sets.
Python Code:
from sklearn.model_selection import train_test_split
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
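Two optional refinements worth knowing: stratify=y keeps the class proportions equal across the two splits, and fitting the scaler on the training rows only avoids leaking test statistics into preprocessing. A minimal sketch of both:
# Optional: stratified split with scaling fit on the training portion only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # statistics computed from training rows only
X_test = scaler.transform(X_test_raw)        # the same statistics applied to the test rows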
Question: Fit a Logistic Regression model, and create a classification report and confusion matrix.
Python Code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Logistic Regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Predictions and evaluation
y_pred = log_reg.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Question: Fit a Random Forest Classifier, and create a classification report and confusion matrix.
Python Code:
from sklearn.ensemble import RandomForestClassifier
# Random Forest Classifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
# Predictions and evaluation
y_pred = rf_clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Question: Fit a Gradient Boosting Classifier, and create a classification report and confusion matrix.
Python Code:
from sklearn.ensemble import GradientBoostingClassifier
# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)
# Predictions and evaluation
y_pred = gb_clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Question: Which model performed the best based on the classification reports? (MCQ)
Answer: Choose the model with the highest weighted F1 score from the classification reports generated in the previous questions.
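Rather than reading the three reports by eye, you can compute the weighted F1 for each fitted model directly; this sketch reuses the models and the test split from the previous answers:
from sklearn.metrics import f1_score
# Compare the three models on weighted F1
models = {'Logistic Regression': log_reg, 'Random Forest': rf_clf, 'Gradient Boosting': gb_clf}
for name, model in models.items():
    score = f1_score(y_test, model.predict(X_test), average='weighted')
    print(f"{name}: weighted F1 = {score:.3f}")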
Question: Retrain the Gradient Boosting Classifier model.
Python Code:
# Retraining the Gradient Boosting Classifier on the full (scaled) dataset,
# using the same preprocessing as the earlier training runs
gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_scaled, y)
# Save the model if needed
import joblib
joblib.dump(gb_clf, 'gradient_boosting_model.pkl')
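To use the persisted model later, load it back with joblib and call predict as usual; the first five rows of the scaled feature matrix serve as a quick smoke test here:
# Reload the saved model and predict on a few rows
loaded_model = joblib.load('gradient_boosting_model.pkl')
print(loaded_model.predict(X_scaled[:5]))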
Question: Given a scenario where false positives are more critical, identify which measure of the confusion matrix should be focused on.
Answer: The measure to focus on is Precision, TP / (TP + FP), since it directly penalizes false positives.
Question: Calculate the precision score for the given predictions.
Python Code:
from sklearn.metrics import precision_score
# Precision score
precision = precision_score(y_test, y_pred, average='weighted')
print("Precision Score:", precision)
Question: Calculate the recall score for the given predictions.
Python Code:
from sklearn.metrics import recall_score
# Recall score
recall = recall_score(y_test, y_pred, average='weighted')
print("Recall Score:", recall)
Question: Calculate the F1 score for the given predictions.
Python Code:
from sklearn.metrics import f1_score
# F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
print("F1 Score:", f1)
Question: Calculate the False Negative Rate for the given predictions.
Python Code:
# False Negative Rate: FN / (FN + TP)
# Note: this indexing assumes a binary confusion matrix; see the multiclass sketch below
cm = confusion_matrix(y_test, y_pred)
fnr = cm[1, 0] / (cm[1, 0] + cm[1, 1])
print("False Negative Rate:", fnr)
Question: Calculate the False Positive Rate for the given predictions.
Python Code:
# False Positive Rate: FP / (FP + TN), again assuming a binary confusion matrix
fpr = cm[0, 1] / (cm[0, 1] + cm[0, 0])
print("False Positive Rate:", fpr)
Question: Build a regression model including data preprocessing, train-test split, and evaluate using weighted F1 score.
Python Code:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
# Load and preprocess the dataset
data = pd.read_csv('data.csv')
data.dropna(inplace=True)
# 'target' is categorical in this dataset, so encode it to integers first;
# LinearRegression cannot fit string labels
from sklearn.preprocessing import LabelEncoder
data['target'] = LabelEncoder().fit_transform(data['target'])
X = data.drop('target', axis=1)
y = data['target']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the regression model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
# Predict and evaluate
y_pred = reg_model.predict(X_test)
# F1 is a classification metric, so to score the regression output we binarize
# both predictions and true values around the median of y
threshold = y.median()
y_pred_binary = (y_pred > threshold).astype(int)
y_test_binary = (y_test > threshold).astype(int)
f1 = f1_score(y_test_binary, y_pred_binary, average='weighted')
print("Weighted F1 Score:", f1)
The sample dataset uses the following columns:
feature1: Numerical feature (e.g., some continuous measurement)
feature2: Numerical feature (e.g., some continuous measurement)
feature3: Numerical feature (e.g., some continuous measurement)
feature4: Numerical feature (e.g., some continuous measurement)
target: Categorical variable (e.g., ‘A’, ‘B’, ‘C’)
Here’s how you might generate a sample data.csv file:
feature1,feature2,feature3,feature4,target
10.1,20.2,30.3,40.4,A
15.6,25.1,35.2,45.3,B
11.3,21.7,31.5,41.6,C
13.5,23.9,33.1,43.2,A
14.2,24.8,34.5,44.1,B
16.4,26.3,36.7,46.5,C
12.8,22.4,32.9,42.3,A
17.2,27.5,37.3,47.4,B
18.1,28.1,38.2,48.3,C
19.5,29.7,39.8,49.9,A
In a real-world scenario, you would replace this sample data with your actual dataset.
Here’s how you can create such a dataset in Python and save it as a CSV file:
import pandas as pd
# Sample data creation
data = {
'feature1': [10.1, 15.6, 11.3, 13.5, 14.2, 16.4, 12.8, 17.2, 18.1, 19.5],
'feature2': [20.2, 25.1, 21.7, 23.9, 24.8, 26.3, 22.4, 27.5, 28.1, 29.7],
'feature3': [30.3, 35.2, 31.5, 33.1, 34.5, 36.7, 32.9, 37.3, 38.2, 39.8],
'feature4': [40.4, 45.3, 41.6, 43.2, 44.1, 46.5, 42.3, 47.4, 48.3, 49.9],
'target': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A']
}
# Create DataFrame
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('data.csv', index=False)
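To confirm the file round-trips correctly, read it back and inspect the first rows:
# Verify the generated file
check = pd.read_csv('data.csv')
print(check.head())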