Exploring Reinforcement Learning and Classification Algorithms
This document outlines an exploration of Q-learning for a grid world environment, a Naive Bayes classifier for text categorization, and Gaussian Mixture Models for clustering.
Dependencies and Setup
First, necessary libraries are imported. gym is used for the reinforcement learning environment, numpy and pandas for data manipulation, matplotlib for plotting, and sklearn for machine learning utilities.
# Assuming pip install was run in a separate environment or cell if needed
# %pip install gym numpy pandas matplotlib scikit-learn
import gym
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.metrics import f1_score, accuracy_score, silhouette_score
from sklearn.mixture import GaussianMixture
Part 1: Grid World with Q-Learning
This section details the setup of a custom grid world environment and the implementation of a Q-learning agent to navigate it.
1.1: Grid World Environment Setup
Grid Definition
The environment is a 10x10 grid with open spaces ('O'), obstacles ('X'), a start ('S'), and a goal ('G').
grid = [
['O', 'O', 'X', 'O', 'O', 'O', 'O', 'X', 'O', 'G'],
['O', 'X', 'X', 'X', 'O', 'X', 'O', 'X', 'O', 'O'],
['O', 'O', 'O', 'O', 'O', 'X', 'O', 'X', 'O', 'X'],
['O', 'X', 'X', 'X', 'O', 'X', 'O', 'X', 'O', 'O'],
['O', 'O', 'X', 'O', 'O', 'X', 'O', 'X', 'X', 'O'],
['O', 'O', 'X', 'X', 'O', 'X', 'O', 'O', 'O', 'O'],
['O', 'X', 'O', 'X', 'O', 'O', 'O', 'X', 'X', 'O'],
['O', 'X', 'O', 'X', 'O', 'X', 'O', 'X', 'O', 'O'],
['O', 'X', 'O', 'O', 'O', 'X', 'O', 'X', 'O', 'O'],
['O', 'O', 'O', 'X', 'O', 'O', 'S', 'O', 'O', 'O']
]
Grid World Class
A custom GridWorld class, extending gym.Env, was defined to manage states, actions, and rewards.
Note: The original notebook mentioned issues with Gym UI rendering in a WSL2/Windows 10 environment. The class implementation focuses on the logic, with a matplotlib-based render method.
class GridWorld(gym.Env):
def __init__(self, grid_layout): # Renamed grid to grid_layout to avoid conflict with global
self.grid_layout = grid_layout # Store the initial layout
self.rows = len(self.grid_layout)
self.cols = len(self.grid_layout[0])
self.goal_pos = self._find_pos('G') # Use _ for internal helper
self.agent_pos = self._find_pos('S')
self.agent_path = [self.agent_pos]
self.steps = 0
self.actions_map = {0: 'UP', 1: 'DOWN', 2: 'LEFT', 3: 'RIGHT'} # Renamed for clarity
self.action_space = gym.spaces.Discrete(len(self.actions_map))
self.observation_space = gym.spaces.Tuple((
gym.spaces.Discrete(self.rows),
gym.spaces.Discrete(self.cols)
))
self.open_space_reward = -1
self.obstacle_reward = -10
self.goal_reward = 10
def _find_pos(self, char): # Internal helper
for i in range(self.rows):
for j in range(self.cols):
if self.grid_layout[i][j] == char:
return (i, j)
return None # Should not happen if S and G are in grid
def _is_valid_pos(self, i, j): # Internal helper
return (0 <= i < self.rows and
0 <= j < self.cols and
self.grid_layout[i][j] != 'X')
def _get_reward(self, i, j): # Internal helper
if not self._is_valid_pos(i,j): # Check attempted move, not current agent pos
return self.obstacle_reward # If new position would be invalid
elif (i, j) == self.goal_pos:
return self.goal_reward
else: # Valid, not goal
return self.open_space_reward
def _move_agent(self, action_idx): # Renamed for clarity, takes action_idx
action_str = self.actions_map[action_idx]
next_i, next_j = self.agent_pos
if action_str == 'UP': next_i -= 1
elif action_str == 'DOWN': next_i += 1
elif action_str == 'LEFT': next_j -= 1
elif action_str == 'RIGHT': next_j += 1
reward = self._get_reward(next_i, next_j)
if self._is_valid_pos(next_i, next_j): # If move is valid, update position
self.agent_pos = (next_i, next_j)
self.agent_path.append(self.agent_pos)
# If move is not valid, agent_pos doesn't change, but reward reflects penalty
return reward
def step(self, action_idx): # Takes action_idx
reward = self._move_agent(action_idx)
self.steps += 1
done = (self.agent_pos == self.goal_pos)
# info can be an empty dict
return self.agent_pos, reward, done, {}
def reset(self):
self.agent_pos = self._find_pos('S')
self.agent_path = [self.agent_pos]
self.steps = 0
# Must return initial observation and info
return self.agent_pos, {}
def render(self, mode='human'): # Added mode argument for gym compatibility
plt.figure(figsize=(self.cols, self.rows))
# Create a grid indicating visited squares
traversed_display_grid = np.zeros((self.rows, self.cols))
for r, c in self.agent_path:
traversed_display_grid[r, c] = 1
plt.imshow(traversed_display_grid, cmap='Blues', origin='upper', alpha=0.3) # Alpha for overlay
# Add text for grid elements
for r in range(self.rows):
for c in range(self.cols):
char = self.grid_layout[r][c]
color = 'black'
if char == 'X': color = 'red'
elif char == 'S': color = 'green'
elif char == 'G': color = 'orange'
plt.text(c, r, char, ha='center', va='center', color=color, fontweight='bold')
# Draw path
if len(self.agent_path) > 1:
path_y, path_x = zip(*self.agent_path) # Matplotlib uses (x,y) not (row,col)
plt.plot(path_x, path_y, color='black', linewidth=2, marker='o', markersize=3)
plt.xticks([])
plt.yticks([])
plt.title(f"Agent Path - Steps: {self.steps}")
plt.show()
Refinement: Renamed some internal methods and variables for clarity (e.g., _find_pos, _is_valid_pos). The step and reset methods were updated to return info dictionaries, common in Gym environments. The render method was slightly adjusted for visual clarity.
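Before training, a quick smoke test helps confirm the environment behaves as expected. This is a minimal sketch using the grid and class defined above; the single RIGHT action is purely illustrative:
# Minimal smoke test (illustrative; not part of the original notebook)
env = GridWorld(grid)
state, info = env.reset()
print(f'Start state: {state}')  # position of 'S', i.e. (9, 6)
state, reward, done, info = env.step(3)  # action 3 = RIGHT
print(f'State: {state}, Reward: {reward}, Done: {done}')  # expect (9, 7), -1, False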
1.2: Q-Learning Agent Implementation
A Q-learning agent was implemented to learn an optimal policy for navigating the grid.
Q-Learning Agent Class
The agent maintains a Q-table and uses an epsilon-greedy policy for action selection.
class QLearningAgent: # Renamed class
def __init__(self, num_actions, epsilon, gamma, alpha): # num_actions instead of actions dict
self.num_actions = num_actions
self.epsilon = epsilon
self.alpha = alpha # Learning rate
self.gamma = gamma # Discount factor
self.q_table = {} # {(row, col): [q_up, q_down, q_left, q_right]}
def get_q_value(self, state, action_idx):
if state not in self.q_table:
self.q_table[state] = np.zeros(self.num_actions)
return self.q_table[state][action_idx]
def update_q_value(self, state, action_idx, reward, next_state):
current_q = self.get_q_value(state, action_idx) # Q(s,a)
# Ensure next_state Q-values exist for max Q(s',a') calculation
if next_state not in self.q_table:
self.q_table[next_state] = np.zeros(self.num_actions)
max_next_q = np.max(self.q_table[next_state]) # max_a' Q(s',a')
# Q-learning update rule
new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
self.q_table[state][action_idx] = new_q
def choose_action(self, state):
if np.random.rand() < self.epsilon:
# Explore: choose a random action
return np.random.choice(self.num_actions)
else:
# Exploit: choose the best known action
if state not in self.q_table: # If state unseen, initialize Q-values
self.q_table[state] = np.zeros(self.num_actions)
# If multiple actions have the same max Q-value, argmax returns the first one.
# To break ties randomly:
# best_actions = np.where(self.q_table[state] == np.max(self.q_table[state]))[0]
# return np.random.choice(best_actions)
return np.argmax(self.q_table[state])
Refinement: The QLearningAgent class was slightly restructured. The Q-value update now directly implements the standard formula Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a)). A note on tie-breaking in choose_action was added.
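As a quick worked example with the hyperparameters used later (alpha = 0.1, gamma = 0.5): starting from Q(s,a) = 0, after receiving reward R = -1 and observing max_a' Q(s',a') = 10, the update gives Q(s,a) <- 0 + 0.1 * (-1 + 0.5 * 10 - 0) = 0.4, nudging the estimate a small step toward the discounted one-step target.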
1.3: Evaluation
Evaluation Function
This function trains and evaluates the Q-learning agent over a specified number of episodes.
def evaluate_q_learning_agent(grid_layout, episodes, epsilon, gamma, learning_rate_alpha): # Renamed params
environment = GridWorld(grid_layout)
num_actions = environment.action_space.n # Get from env
agent = QLearningAgent(num_actions, epsilon, gamma, learning_rate_alpha) # Use agent class
# Store rewards per episode for plotting later, if desired
# episode_rewards = []
for episode in range(episodes):
state, _ = environment.reset() # Get initial state
total_reward_episode = 0
done = False
while not done:
action_idx = agent.choose_action(state)
next_state, reward, done, _ = environment.step(action_idx) # Use action_idx
total_reward_episode += reward
agent.update_q_value(state, action_idx, reward, next_state)
state = next_state
# episode_rewards.append(total_reward_episode) # For plotting learning curve
# Optional: Print progress periodically
# if (episode + 1) % (episodes // 10) == 0:
# print(f'Episode: {episode + 1}/{episodes}, Steps: {environment.steps}, Reward: {total_reward_episode}')
# After all episodes, print final run and render
# This will show the path taken using the learned Q-values (with some epsilon exploration)
final_state, _ = environment.reset()
final_total_reward = 0
final_done = False
# Set epsilon to 0 for purely greedy policy in final evaluation run, if desired
# original_epsilon = agent.epsilon
# agent.epsilon = 0.0
while not final_done:
final_action_idx = agent.choose_action(final_state)
final_next_state, final_reward, final_done, _ = environment.step(final_action_idx)
final_total_reward += final_reward
final_state = final_next_state
# agent.epsilon = original_epsilon # Restore epsilon if changed
print(f'Evaluation after {episodes} episodes: Steps: {environment.steps}, Total Reward: {final_total_reward}')
environment.render()
Refinement: The evaluation function was updated to run a final evaluation rollout after training (optionally made fully greedy by temporarily setting epsilon to 0, as shown in the commented lines) to better demonstrate the learned policy. The use of agent class methods is now consistent.
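If the commented-out episode_rewards list above is populated (and returned from the function), a learning curve makes convergence easy to inspect. A sketch, assuming episode_rewards holds one total reward per training episode:
# Sketch: plot the learning curve, assuming `episode_rewards` was collected
# inside the training loop above (one total reward per episode).
plt.figure(figsize=(8, 4))
plt.plot(episode_rewards, alpha=0.3, label='Per-episode reward')
window = 50  # moving average for readability
smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode='valid')
plt.plot(range(window - 1, len(episode_rewards)), smoothed, label=f'{window}-episode average')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.legend()
plt.show()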
Evaluate Agent
The agent was trained for 1000 episodes with epsilon=0.1, gamma=0.5, and learning_rate=0.1.
episodes_count = 1000 # Renamed
epsilon_val = 0.1
gamma_val = 0.5 # Discount factor for Q-learning
alpha_val = 0.1 # Learning rate for Q-learning
evaluate_q_learning_agent(grid, episodes_count, epsilon_val, gamma_val, alpha_val)
Evaluation after 1000 episodes: Steps: 14, Total Reward: -3
(Figure: rendered grid showing the agent's 14-step path from S to G.)
1.4: Analysis of Q-Learning Performance
1.4.1: Effect of Epsilon (Exploration Rate)
A higher epsilon value encourages more exploration. While initially beneficial for discovering the environment, if epsilon remains high, the agent may not consistently exploit its learned knowledge, leading to suboptimal paths or longer convergence times. A lower epsilon (e.g., 0.1) led to more consistent and efficient pathfinding once the Q-values started to converge. A common strategy, not implemented here, is to decay epsilon over time, starting higher and gradually reducing it.
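A sketch of such a decay schedule, with illustrative values not taken from the original notebook:
# Hypothetical exponential epsilon-decay schedule (values are illustrative).
initial_epsilon, min_epsilon, decay_rate = 0.9, 0.05, 0.995
epsilon_schedule = [max(min_epsilon, initial_epsilon * decay_rate ** ep) for ep in range(1000)]
# Inside the training loop this would become: agent.epsilon = epsilon_schedule[episode]
plt.plot(epsilon_schedule)
plt.xlabel('Episode')
plt.ylabel('Epsilon')
plt.title('Epsilon Decay Schedule')
plt.show()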
1.4.2: Performance Over Training Episodes
With the chosen hyperparameters (epsilon=0.1, gamma=0.5, alpha=0.1), the agent's performance improved noticeably. Early episodes involved more random exploration. By around 50-100 episodes, the agent began finding efficient paths more regularly, though still making occasional exploratory mistakes. After 1000 episodes, the agent consistently found paths to the goal in approximately 14-16 steps. Some minor suboptimal moves (e.g., hitting a wall occasionally due to epsilon-greedy exploration) still occurred, which might be reduced with more episodes or epsilon decay.
1.4.3: Challenges and Adjustments
Initially, a higher epsilon (e.g., 0.5) resulted in excessive exploration, preventing consistent convergence to an optimal path. Reducing epsilon to 0.1 significantly improved the agent's ability to exploit learned Q-values. Increasing the number of training episodes to 1000 also contributed to more stable and optimal pathfinding.
1.5: Hyperparameter Exploration
The impact of varying epsilon and gamma was explored, keeping the learning rate (alpha) at 0.1 and episodes at 1000.
epsilons_to_test = [0.1, 0.2, 0.3, 0.4, 0.5]
gammas_to_test = [0.6, 0.7, 0.8, 0.9, 1.0] # Note: gamma=1.0 can be problematic if goal is not always reachable
fixed_learning_rate = 0.1
fixed_episodes = 1000
# This loop can be time-consuming.
# for test_epsilon in epsilons_to_test:
# for test_gamma in gammas_to_test:
# print(f'Testing with Epsilon: {test_epsilon}, Gamma: {test_gamma}')
# evaluate_q_learning_agent(grid, fixed_episodes, test_epsilon, test_gamma, fixed_learning_rate)
# print("-" * 30)
(The following images represent selected outputs from the hyperparameter exploration loop shown in the original notebook. Only a few key examples are included here for brevity.)
Example Output for Epsilon: 0.1, Gamma: 0.6
Evaluation after 1000 episodes: Steps: 14, Total Reward: -3
(Figure: rendered agent path for epsilon=0.1, gamma=0.6.)
Example Output for Epsilon: 0.5, Gamma: 1.0
Evaluation after 1000 episodes: Steps: 39, Total Reward: -55
(Figure: rendered agent path for epsilon=0.5, gamma=1.0.)
Hyperparameter Observations:
Consistent with earlier findings, lower epsilon values (e.g., 0.1) generally resulted in better final path efficiency (fewer steps, higher total reward) after 1000 episodes. As epsilon increased, the agent's path tended to be longer and more erratic due to persistent exploration. The discount factor gamma had a less dramatic impact on final path length in this grid and training duration, though values closer to 1 give more weight to future rewards. In a simple grid with a short optimal path, gamma's effect is naturally less pronounced than it would be in a more complex environment, and plotting learning curves would make any differences easier to see.
Part 2: Naive Bayes Classifier for Text Categorization
This section covers the implementation of a Naive Bayes classifier to categorize text comments.
2.1: Building Binary Feature Vectors
Comments are represented as binary vectors, where each dimension corresponds to a word in a predefined vocabulary. A '1' indicates the presence of the word in the comment, and '0' its absence.
Function to Build Binary Vectors
def build_binary_vectors_from_file(file_path, num_total_attributes): # Renamed params
# Dictionary: {comment_id: binary_vector_np_array}
binary_vectors_dict = {}
with open(file_path, 'r') as file:
for line in file: # More pythonic iteration
comment_id, word_id = map(int, line.strip().split())
# word_id is 1-indexed in file, adjust to 0-indexed for array
# Assuming num_total_attributes is the size of the vocabulary
if word_id > num_total_attributes: # Safety check
# print(f"Warning: word_id {word_id} exceeds num_attributes {num_total_attributes}. Skipping.")
continue
if comment_id not in binary_vectors_dict:
binary_vectors_dict[comment_id] = np.zeros(num_total_attributes, dtype=np.int8) # Use int8 for memory
# word_id from file might be 1-indexed based on context
# If vocabulary is 0 to N-1, then word_id-1
binary_vectors_dict[comment_id][word_id - 1] = 1 # Assuming word_id in file is 1-indexed
return binary_vectors_dict
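For concreteness, a hypothetical toy file illustrates the expected format (two integers per line: a comment_id followed by a 1-indexed word_id):
# Hypothetical toy input (not from the original dataset): comment 1 contains
# words 2 and 5; comment 2 contains word 1.
with open('toy_data.txt', 'w') as f:
    f.write('1 2\n1 5\n2 1\n')
toy_vectors = build_binary_vectors_from_file('toy_data.txt', 5)
print(toy_vectors[1])  # expected: [0 1 0 0 1]
print(toy_vectors[2])  # expected: [1 0 0 0 0]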
Function to Read Labels
Labels are read from a file and associated with the corresponding comment IDs.
def read_labels_for_vectors(file_path, existing_comment_ids): # Renamed params
# List of labels, ordered by comment_id implicitly (after filtering)
labels_list = []
temp_labels_dict = {} # {comment_id: label}
with open(file_path, 'r') as file:
# Assuming labels file is 1 label per line, comment_id implied by line number
for idx, line in enumerate(file):
comment_id_from_line = idx + 1 # Line numbers are 1-indexed
if comment_id_from_line in existing_comment_ids:
temp_labels_dict[comment_id_from_line] = int(line.strip())
# Ensure labels are ordered by sorted existing_comment_ids
# This is crucial if binary_vectors_dict is later converted to a list/array
for cid in sorted(list(existing_comment_ids)):
if cid in temp_labels_dict:
labels_list.append(temp_labels_dict[cid])
# else:
# This case should not happen if existing_comment_ids is derived from data file
# print(f"Warning: Comment ID {cid} from data has no label.")
return labels_list
Refinement: The read_labels function was adjusted to ensure labels correctly correspond to the binary vectors, especially if the binary vectors are later converted to an ordered list/array for scikit-learn compatibility. Assumed word_id in trainData.txt is 1-indexed for array access.
Load Data
Training and testing data (word occurrences and labels) are loaded. The vocabulary size is 6968.
vocabulary_size = 6968 # Renamed
train_vectors_dict = build_binary_vectors_from_file('trainData.txt', vocabulary_size)
# Pass keys of train_vectors_dict to ensure labels match existing comments
train_labels_list = read_labels_for_vectors('trainLabel.txt', train_vectors_dict.keys())
print(f"Training Comments Processed: {len(train_vectors_dict)}")
print(f"Training Labels Loaded: {len(train_labels_list)}")
test_vectors_dict = build_binary_vectors_from_file('testData.txt', vocabulary_size)
test_labels_list = read_labels_for_vectors('testLabel.txt', test_vectors_dict.keys())
print(f"Testing Comments Processed: {len(test_vectors_dict)}")
print(f"Testing Labels Loaded: {len(test_labels_list)}")
# For Naive Bayes, convert dict of vectors to a list of vectors (data matrix)
# and ensure labels list order matches. The read_labels_for_vectors now handles this ordering.
train_data_matrix = np.array([train_vectors_dict[cid] for cid in sorted(train_vectors_dict.keys())])
test_data_matrix = np.array([test_vectors_dict[cid] for cid in sorted(test_vectors_dict.keys())])
# train_labels_list and test_labels_list are already ordered
Training Comments Processed: 1460
Training Labels Loaded: 1460
Testing Comments Processed: 1450
Testing Labels Loaded: 1450
Observation: The original notebook noted discrepancies in comment counts vs. label file lines. The refined read_labels_for_vectors ensures only labels for existing comments are loaded and correctly ordered. The data is also converted to NumPy arrays for efficient processing.
2.2: Naive Bayes Classifier Implementation
A Bernoulli Naive Bayes classifier was implemented. It assumes features (word presence) are binary and conditionally independent given the class.
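Under this independence assumption, prediction picks the class with the highest log-posterior. For a binary feature vector x = (x_1, ..., x_n):

$$\hat{c} = \arg\max_c \Big[ \log P(c) + \sum_{i=1}^{n} \big( x_i \log P(w_i{=}1 \mid c) + (1 - x_i) \log P(w_i{=}0 \mid c) \big) \Big]$$

This is exactly the quantity the prediction function below accumulates in log space.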
Probability Calculation (Training)
This function calculates class prior probabilities and conditional probabilities of word presence for each class, using Laplace smoothing.
def train_bernoulli_naive_bayes(X_data, y_labels, smoothing_alpha=1): # Renamed params
# X_data: numpy array of shape (num_samples, num_features)
# y_labels: numpy array of shape (num_samples,)
num_samples, num_features = X_data.shape
unique_classes = np.unique(y_labels)
num_classes = len(unique_classes)
class_priors = {} # P(c)
# P(w_i=1 | c) for each word i and class c
# Using log probabilities is often more stable for prediction
log_conditional_probs_present = {}
log_conditional_probs_absent = {} # P(w_i=0 | c) = 1 - P(w_i=1 | c)
for c_idx, c_val in enumerate(unique_classes):
# Get samples belonging to class c
X_class_c = X_data[y_labels == c_val]
num_samples_class_c = X_class_c.shape[0]
# Calculate class prior P(c)
class_priors[c_val] = (num_samples_class_c + smoothing_alpha) / (num_samples + num_classes * smoothing_alpha)
# Using log prior for later sum
# class_priors[c_val] = np.log((num_samples_class_c + smoothing_alpha) / (num_samples + num_classes * smoothing_alpha))
# Calculate P(word_i=1 | class_c) with Laplace smoothing
# Sum word occurrences for class c, add smoothing_alpha
word_counts_in_class_c = np.sum(X_class_c, axis=0) + smoothing_alpha
# Denominator: num_samples_in_class_c + num_features_categories (2 for binary) * smoothing_alpha
# For Bernoulli, denominator is (count(c) + k*alpha), where k is # categories for feature (2 for binary)
total_counts_denominator = num_samples_class_c + (2 * smoothing_alpha)
prob_word_present_given_c = word_counts_in_class_c / total_counts_denominator
# Store log probabilities to avoid underflow during prediction
log_conditional_probs_present[c_val] = np.log(prob_word_present_given_c)
log_conditional_probs_absent[c_val] = np.log(1.0 - prob_word_present_given_c)
# Convert class_priors to log_priors as well
log_class_priors = {c: np.log(p) for c,p in class_priors.items()}
return log_class_priors, log_conditional_probs_present, log_conditional_probs_absent
Refinement: The training function now calculates log probabilities directly to prevent numerical underflow during prediction, a common practice for Naive Bayes. The formula for Bernoulli Naive Bayes with Laplace smoothing was clarified for the denominator: count(c) + k*alpha, where k=2 for binary features.
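Concretely, the Laplace-smoothed conditional estimate used above is

$$P(w_i{=}1 \mid c) = \frac{\mathrm{count}(w_i{=}1,\, c) + \alpha}{N_c + 2\alpha}$$

where N_c is the number of training samples in class c and alpha = 1; the factor of 2 in the denominator reflects the two possible values of a binary feature.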
Naive Bayes Classifier (Prediction)
This function classifies new instances using the learned probabilities.
def predict_bernoulli_naive_bayes(X_test_instance, log_class_priors, log_cond_probs_present, log_cond_probs_absent):
# X_test_instance: a single binary vector (1D numpy array)
posteriors = {} # {class_val: log_posterior_probability}
for c_val in log_class_priors.keys():
log_posterior_c = log_class_priors[c_val] # Start with log P(c)
# Add sum of log P(feature_i | c)
# For features present (value = 1) in X_test_instance
log_posterior_c += np.sum(log_cond_probs_present[c_val][X_test_instance == 1])
# For features absent (value = 0) in X_test_instance
log_posterior_c += np.sum(log_cond_probs_absent[c_val][X_test_instance == 0])
posteriors[c_val] = log_posterior_c
# Return class with the highest log posterior probability
return max(posteriors, key=posteriors.get)
def predict_all_naive_bayes(X_test_matrix, log_class_priors, log_cond_probs_present, log_cond_probs_absent):
predictions = [predict_bernoulli_naive_bayes(instance, log_class_priors, log_cond_probs_present, log_cond_probs_absent) for instance in X_test_matrix]
return np.array(predictions)
Train and Test Classifier
# Convert labels list to numpy array for consistency
y_train_np = np.array(train_labels_list)
y_test_np = np.array(test_labels_list)
# Train the Naive Bayes classifier
log_priors_nb, log_cond_present_nb, log_cond_absent_nb = train_bernoulli_naive_bayes(train_data_matrix, y_train_np, smoothing_alpha=1)
# Make predictions on training and test data
train_predictions_nb = predict_all_naive_bayes(train_data_matrix, log_priors_nb, log_cond_present_nb, log_cond_absent_nb)
test_predictions_nb = predict_all_naive_bayes(test_data_matrix, log_priors_nb, log_cond_present_nb, log_cond_absent_nb)
# Calculate and print accuracy
train_acc_nb = accuracy_score(y_train_np, train_predictions_nb)
test_acc_nb = accuracy_score(y_test_np, test_predictions_nb)
print(f'Training Accuracy: {train_acc_nb:.4f}')
print(f'Test Accuracy: {test_acc_nb:.4f}\n')
# Calculate and print F1 score (weighted for potential imbalance, though this data is balanced)
train_f1_nb = f1_score(y_train_np, train_predictions_nb, average='weighted')
test_f1_nb = f1_score(y_test_np, test_predictions_nb, average='weighted')
print(f'Training F1 Score: {train_f1_nb:.4f}')
print(f'Test F1 Score: {test_f1_nb:.4f}')
Training Accuracy: 0.9158
Test Accuracy: 0.7434
Training F1 Score: 0.9157
Test F1 Score: 0.7434
Analysis and Conclusion
The Naive Bayes classifier achieved a training accuracy and F1-score of approximately 91.6%, and a test accuracy and F1-score of approximately 74.3%. Accuracy and F1-score being nearly identical is expected for this dataset, as the class distribution appears balanced (the original notebook reported a roughly 50/50 split, though this is not re-verified here). The higher performance on training data compared to test data is typical.
Initial implementation challenges included ensuring correct probability calculations (e.g., proper application of Laplace smoothing and handling of log probabilities to prevent underflow) and aligning data vectors with their corresponding labels, especially given that comment IDs in the raw data might not be contiguous or complete.
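As a sanity check, the custom implementation could be compared against scikit-learn's BernoulliNB, which implements the same model. A sketch (note sklearn smooths the conditional probabilities but not the class priors, so results may differ very slightly):
from sklearn.naive_bayes import BernoulliNB
# Cross-check: alpha=1.0 matches the Laplace smoothing used above.
sk_nb = BernoulliNB(alpha=1.0)
sk_nb.fit(train_data_matrix, y_train_np)
print(f'sklearn Training Accuracy: {sk_nb.score(train_data_matrix, y_train_np):.4f}')
print(f'sklearn Test Accuracy: {sk_nb.score(test_data_matrix, y_test_np):.4f}')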
Part 3: Data Visualization Using Gaussian Mixture Models (GMM)
Gaussian Mixture Models (GMMs) were used for clustering a 4-dimensional dataset.
3.1: Clustering Algorithm Selection
Gaussian Mixture Models were chosen. GMMs assume data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
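Formally, a GMM with K components models the density of a point x as

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1$$

where the mixture weights pi_k, means mu_k, and covariances Sigma_k are estimated from the data, typically via the EM algorithm (as scikit-learn's GaussianMixture does).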
3.2: Building The Clustering Algorithm
The number of clusters (components) for the GMM was determined using the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC). Lower BIC/AIC values generally indicate a better model fit, balancing goodness of fit with model complexity.
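Both criteria trade goodness of fit against model complexity. With k free parameters, n data points, and maximized likelihood L-hat:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}$$

Because ln(n) > 2 whenever n > 7, BIC imposes the stronger complexity penalty, which is why it tends to favor fewer components than AIC.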
Determine Number of Clusters
# Load "Clustering Data.csv"
df_clustering = pd.read_csv('Clustering Data.csv', header=None, names=['x', 'y', 'z', 'w'])
num_components_range = range(1, 11) # Renamed
bic_scores = []
aic_scores = []
for k_components in num_components_range: # Renamed
gmm_model = GaussianMixture(n_components=k_components, random_state=0, n_init=10) # n_init for stability
gmm_model.fit(df_clustering)
bic_scores.append(gmm_model.bic(df_clustering))
aic_scores.append(gmm_model.aic(df_clustering))
plt.figure(figsize=(10, 5))
plt.plot(num_components_range, bic_scores, label='BIC', marker='o')
plt.plot(num_components_range, aic_scores, label='AIC', marker='x')
plt.xlabel('Number of Components (Clusters)')
plt.ylabel('Information Criterion Score')
plt.title('BIC and AIC for GMM')
plt.legend()
plt.grid(True)
plt.show()
(Figure: BIC and AIC scores versus number of GMM components.)
The plot of BIC and AIC scores suggests that 2 components (clusters) is a reasonable choice. AIC continues to decrease for larger numbers of components (reaching its minimum around 8 in the original notebook's plot), but BIC penalizes complexity more heavily and is often the better guide for choosing the number of clusters. The goal is the "elbow", the point of diminishing returns, in these plots. Here, 2 components gives a low BIC and a relatively low AIC; pushing to a much higher component count would lower AIC somewhat further but raise BIC, a pattern that suggests overfitting.
3.3: Report The Results
A GMM with 2 components was fitted to the data.
gmm_final = GaussianMixture(n_components=2, random_state=0, n_init=10)
cluster_labels_gmm = gmm_final.fit_predict(df_clustering)
df_clustering['Cluster'] = cluster_labels_gmm
print(f'Number of Clusters Determined: {df_clustering["Cluster"].nunique()}')
cluster_counts = df_clustering['Cluster'].value_counts().sort_index()
print('\nNumber of Points In Each Cluster:')
for i, count in cluster_counts.items(): # Iterate through sorted cluster indices
print(f'\tCluster {i}: {count}') # Assuming cluster labels are 0, 1...
silhouette_avg = silhouette_score(df_clustering[['x', 'y', 'z', 'w']], df_clustering['Cluster'])
print(f'\nSilhouette Score: {silhouette_avg:.4f}')
Number of Clusters Determined: 2
Number of Points In Each Cluster:
Cluster 0: 100
Cluster 1: 50
Silhouette Score: 0.6986
Explanation of Results
The GMM identified 2 clusters. Cluster 0 contains 100 points, and Cluster 1 contains 50 points, indicating an imbalance in cluster sizes.
The Silhouette Score, which measures how similar a point is to its own cluster compared to other clusters, was approximately 0.6986. Scores range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. A score of ~0.7 suggests reasonably well-defined and separated clusters.
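For reference, the per-point silhouette is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster; silhouette_score reports the mean of s(i) over all points.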
The original notebook mentioned testing 8 clusters, which sometimes yielded a slightly higher silhouette score but with high variance between runs. The 2-component model was deemed more stable and robust based on BIC/AIC and consistent silhouette scores.
3.4: Visualize The Results
The clusters are visualized in a 3D scatter plot, using three of the four dimensions for axes and the fourth dimension (w) to scale point size.
fig = plt.figure(figsize=(10, 8)) # Adjusted size
ax = fig.add_subplot(111, projection='3d')
# Scatter plot: x, y, z for coordinates, 'Cluster' for color, 'w' for size
scatter_plot = ax.scatter(
df_clustering['x'],
df_clustering['y'],
df_clustering['z'],
c=df_clustering['Cluster'],
cmap='viridis', # Color map
s=df_clustering['w'] * 20, # Scale point size by 'w'
alpha=0.7, # Transparency
edgecolor='k', # Add edge color for better visibility of overlapping points
linewidth=0.5
)
ax.set_xlabel('X dimension')
ax.set_ylabel('Y dimension')
ax.set_zlabel('Z dimension')
ax.set_title('GMM Clustering (3D Visualization of x, y, z; w scales size)')
# Add a color bar for the clusters
legend_elements = [plt.Line2D([0], [0], marker='o', color='w', label=f'Cluster {i}',
markerfacecolor=plt.cm.viridis(i / (df_clustering["Cluster"].nunique()-1 if df_clustering["Cluster"].nunique()>1 else 1.0)), # Get color from cmap
markersize=8) for i in sorted(df_clustering['Cluster'].unique())]
ax.legend(handles=legend_elements, title="Clusters")
plt.show()
(Figure: 3D scatter plot of the two clusters over x, y, z, with point size scaled by w.)
The 3D visualization shows the spatial distribution of the two identified clusters, with point sizes (scaled by 'w') adding a fourth layer of information. Despite some apparent overlap when the 3D plot is projected onto the page, the silhouette score suggests good separation in the full 4-dimensional feature space.