We’ll assume the dataset is a CSV file named data.csv.
Question: Calculate the count of each category in the target variable.
Python Code:
import pandas as pd
# Load the dataset
data = pd.read_csv('data.csv')
# Count of each category in the target variable
target_counts = data['target'].value_counts()
print(target_counts)
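If you also want the class balance as proportions rather than raw counts, value_counts accepts a normalize flag (an optional extra, not part of the original question):
# Relative frequency of each category
target_proportions = data['target'].value_counts(normalize=True)
print(target_proportions)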
Question: Count the total number of null values in the dataset.
Python Code:
# Count the total number of null values
total_nulls = data.isnull().sum().sum()
print(total_nulls)
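A per-column breakdown is often more useful when deciding how to handle missing values; dropping the final .sum() gives exactly that:
# Null counts broken down by column
nulls_per_column = data.isnull().sum()
print(nulls_per_column)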
Question: Create a scatter plot of feature1 vs feature2.
Python Code:
import matplotlib.pyplot as plt
# Scatter plot
plt.scatter(data['feature1'], data['feature2'])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Feature 1 vs Feature 2')
plt.show()
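To see how the classes separate in this plane, you can optionally color the points by the target variable; this sketch assumes the categorical target column is named target, as elsewhere in this walkthrough:
# Optional: scatter plot colored by class
for label, group in data.groupby('target'):
    plt.scatter(group['feature1'], group['feature2'], label=label)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(title='target')
plt.show()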
Question: Create a new column feature1_category from feature1 with ‘small’ for values < 12, ‘medium’ for 12 ≤ values ≤ 15, and ‘large’ for the remaining values.
Python Code:
# Data discretization
def categorize(value):
    if value < 12:
        return 'small'
    elif 12 <= value <= 15:
        return 'medium'
    else:
        return 'large'
data['feature1_category'] = data['feature1'].apply(categorize)
print(data[['feature1', 'feature1_category']].head())
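The same rule can be applied without calling a Python function per row. np.select evaluates its conditions in order, so the second condition only sees values ≥ 12 and the closed interval 12 ≤ value ≤ 15 is reproduced exactly:
# Equivalent vectorized binning
import numpy as np
conditions = [data['feature1'] < 12, data['feature1'] <= 15]
choices = ['small', 'medium']
data['feature1_category'] = np.select(conditions, choices, default='large')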
Question: Remove null values, perform label encoding, scale the data, and store the results in X and y.
Python Code:
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Remove null values
data.dropna(inplace=True)
# Label encoding
label_encoder = LabelEncoder()
data['target'] = label_encoder.fit_transform(data['target'])
# Separating features and target; the string column feature1_category created
# earlier is also dropped (if present), since StandardScaler needs numeric input
X = data.drop(columns=['target', 'feature1_category'], errors='ignore')
y = data['target']
# Scaling data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
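If you need to map the encoded integers back to the original class labels, the fitted encoder keeps them in its classes_ attribute (index = encoded value):
# Mapping from encoded integers to original labels
for code, label in enumerate(label_encoder.classes_):
    print(code, '->', label)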
Question: Split the dataset into train and test sets.
Python Code:
from sklearn.model_selection import train_test_split
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
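Two optional refinements worth knowing: stratify=y keeps the class proportions equal across the two splits, and fitting the scaler on the training rows only avoids leaking test statistics into preprocessing. A minimal sketch of both:
# Optional: stratified split with scaling fit on the training portion only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # statistics computed from training rows only
X_test = scaler.transform(X_test_raw)        # the same statistics applied to the test rows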
Question: Fit a Logistic Regression model, and create a classification report and confusion matrix.
Python Code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Logistic Regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Predictions and evaluation
y_pred = log_reg.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Question: Fit a Random Forest Classifier, and create a classification report and confusion matrix.
Python Code:
from sklearn.ensemble import RandomForestClassifier
# Random Forest Classifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
# Predictions and evaluation
y_pred = rf_clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Question: Fit a Gradient Boosting Classifier, and create a classification report and confusion matrix.
Python Code:
from sklearn.ensemble import GradientBoostingClassifier
# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)
# Predictions and evaluation
y_pred = gb_clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Question: Which model performed the best based on the classification reports? (MCQ)
Answer: Choose the model with the highest weighted F1 score from the classification reports generated in the previous questions.
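Rather than reading the three reports by eye, you can compute the weighted F1 for each fitted model directly; this sketch reuses the models and the test split from the previous answers:
from sklearn.metrics import f1_score
# Compare the three models on weighted F1
models = {'Logistic Regression': log_reg, 'Random Forest': rf_clf, 'Gradient Boosting': gb_clf}
for name, model in models.items():
    score = f1_score(y_test, model.predict(X_test), average='weighted')
    print(f"{name}: weighted F1 = {score:.3f}")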
Question: Retrain the Gradient Boosting Classifier model.
Python Code:
# Retraining the Gradient Boosting Classifier on the full (scaled) dataset,
# using the same preprocessing as the earlier training runs
gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_scaled, y)
# Save the model if needed
import joblib
joblib.dump(gb_clf, 'gradient_boosting_model.pkl')
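To use the persisted model later, load it back with joblib and call predict as usual; the first five rows of the scaled feature matrix serve as a quick smoke test here:
# Reload the saved model and predict on a few rows
loaded_model = joblib.load('gradient_boosting_model.pkl')
print(loaded_model.predict(X_scaled[:5]))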
Question: Given a scenario where false positives are more critical, identify which measure of the confusion matrix should be focused on.
Answer: The measure to focus on is Precision, TP / (TP + FP), since it directly penalizes false positives.
Question: Calculate the precision score for the given predictions.
Python Code:
from sklearn.metrics import precision_score
# Precision score
precision = precision_score(y_test, y_pred, average='weighted')
print("Precision Score:", precision)
Question: Calculate the recall score for the given predictions.
Python Code:
from sklearn.metrics import recall_score
# Recall score
recall = recall_score(y_test, y_pred, average='weighted')
print("Recall Score:", recall)
Question: Calculate the F1 score for the given predictions.
Python Code:
from sklearn.metrics import f1_score
# F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
print("F1 Score:", f1)
Question: Calculate the False Negative Rate for the given predictions.
Python Code:
# False Negative Rate: FN / (FN + TP)
# Note: this indexing assumes a binary confusion matrix; see the multiclass sketch below
cm = confusion_matrix(y_test, y_pred)
fnr = cm[1, 0] / (cm[1, 0] + cm[1, 1])
print("False Negative Rate:", fnr)
Question: Calculate the False Positive Rate for the given predictions.
Python Code:
# False Positive Rate: FP / (FP + TN), again assuming a binary confusion matrix
fpr = cm[0, 1] / (cm[0, 1] + cm[0, 0])
print("False Positive Rate:", fpr)
Question: Build a regression model including data preprocessing, train-test split, and evaluate using weighted F1 score.
Python Code:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
# Load and preprocess the dataset
data = pd.read_csv('data.csv')
data.dropna(inplace=True)
# 'target' is categorical in this dataset, so encode it to integers first;
# LinearRegression cannot fit string labels
from sklearn.preprocessing import LabelEncoder
data['target'] = LabelEncoder().fit_transform(data['target'])
X = data.drop('target', axis=1)
y = data['target']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the regression model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
# Predict and evaluate
y_pred = reg_model.predict(X_test)
# F1 is a classification metric, so to score the regression output we binarize
# both predictions and true values around the median of y
threshold = y.median()
y_pred_binary = (y_pred > threshold).astype(int)
y_test_binary = (y_test > threshold).astype(int)
f1 = f1_score(y_test_binary, y_pred_binary, average='weighted')
print("Weighted F1 Score:", f1)
The sample dataset uses the following columns:
feature1: Numerical feature (e.g., some continuous measurement)
feature2: Numerical feature (e.g., some continuous measurement)
feature3: Numerical feature (e.g., some continuous measurement)
feature4: Numerical feature (e.g., some continuous measurement)
target: Categorical variable (e.g., ‘A’, ‘B’, ‘C’)
Here’s how you might generate a sample data.csv file:
feature1,feature2,feature3,feature4,target
10.1,20.2,30.3,40.4,A
15.6,25.1,35.2,45.3,B
11.3,21.7,31.5,41.6,C
13.5,23.9,33.1,43.2,A
14.2,24.8,34.5,44.1,B
16.4,26.3,36.7,46.5,C
12.8,22.4,32.9,42.3,A
17.2,27.5,37.3,47.4,B
18.1,28.1,38.2,48.3,C
19.5,29.7,39.8,49.9,A
In a real-world scenario, you would replace this sample data with your actual dataset.
Here’s how you can create such a dataset in Python and save it as a CSV file:
import pandas as pd
# Sample data creation
data = {
'feature1': [10.1, 15.6, 11.3, 13.5, 14.2, 16.4, 12.8, 17.2, 18.1, 19.5],
'feature2': [20.2, 25.1, 21.7, 23.9, 24.8, 26.3, 22.4, 27.5, 28.1, 29.7],
'feature3': [30.3, 35.2, 31.5, 33.1, 34.5, 36.7, 32.9, 37.3, 38.2, 39.8],
'feature4': [40.4, 45.3, 41.6, 43.2, 44.1, 46.5, 42.3, 47.4, 48.3, 49.9],
'target': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A']
}
# Create DataFrame
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('data.csv', index=False)
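To confirm the file round-trips correctly, read it back and inspect the first rows:
# Verify the generated file
check = pd.read_csv('data.csv')
print(check.head())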