Data Analysis (Link)

This is a compilation of data analysis projects, most of which use machine learning algorithms for prediction.

Breast Cancer Analysis (Link)

dataanalysis-bca-heatmap1.png dataanalysis-bca-heatmap2.png dataanalysis-bca-heatmap3.png
The project aimed to classify breast cancer samples (benign vs. malignant) by analyzing imaging attributes from a fine needle aspirate (FNA). It also explored mislabelled data detection to improve data quality.
This project was completed by me and three others for IT1244, a machine learning course at NUS. We chose this topic because we believed that discovering an effective solution could significantly benefit the medical field.
Dataset:
  • 569 samples labeled 'B' (benign) or 'M' (malignant)
  • 30 features capturing mean, standard error and "worst" measurements of cell nuclei (e.g., radius, texture, concavity)
  • Correlation analysis identified the most influential features (i.e., radius, concavity, concave points)
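As a rough illustration of the correlation step, the sketch below ranks features by the absolute Pearson correlation between each feature and the label, encoded as M = 1 and B = 0. The source does not state which correlation measure or library was used, so the function names `pearson` and `rank_features` and the toy values are assumptions made for this example only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_features(columns, labels):
    """Sort (feature, correlation) pairs by |correlation| with the label."""
    target = [1.0 if label == "M" else 0.0 for label in labels]
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Toy values invented for illustration; not taken from the project's dataset.
labels = ["B", "B", "B", "M", "M", "M"]
columns = {
    "radius_mean": [11.0, 12.0, 12.5, 17.0, 19.0, 20.5],
    "texture_mean": [18.0, 14.0, 21.0, 20.0, 17.0, 22.0],
}
ranking = rank_features(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On this toy data the radius column ranks above the texture column, which mirrors the kind of ordering a correlation heatmap makes visible.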
dataanalysis-bca-fe1.png dataanalysis-bca-fe2.png
The problem was split into two tasks. For the first task, we were told that some samples had been mislabelled and were required to detect them. We combined several methods into a multi-faceted detection approach.
The attribute 'radius' refers to the radius of the cell nuclei in the sample. Larger radii tend to indicate malignancy. We performed feature engineering on this attribute, taking the squared and cubed difference between radius_worst and radius_mean. This split the data almost perfectly into two categories, benign (blue) and malignant (red). Blue points above the curve and red points below it are considered mislabelled.
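The feature engineering described above can be sketched as follows. The source does not say which axes were plotted or how the separating curve was obtained, so `boundary`, the choice of plotting the squared gap against radius_mean, and the toy rows are all placeholders rather than the project's actual setup.

```python
def radius_gap_features(radius_mean, radius_worst):
    """Return the worst-minus-mean radius gap with its square and cube."""
    gap = radius_worst - radius_mean
    return {"gap": gap, "gap_squared": gap ** 2, "gap_cubed": gap ** 3}

def flag_suspects(points, boundary):
    """Return indices of points that sit on the unexpected side of the curve.

    points   -- list of (x, y, label) tuples, label is 'B' or 'M'
    boundary -- hypothetical function giving the curve height at x
    """
    suspects = []
    for index, (x, y, label) in enumerate(points):
        above = y > boundary(x)
        # Rule from the text: benign above the curve or malignant below it.
        if (label == "B" and above) or (label == "M" and not above):
            suspects.append(index)
    return suspects

# Toy rows (radius_mean, radius_worst, label), invented for illustration.
rows = [(12.0, 13.0, "B"), (12.5, 13.2, "B"), (18.0, 23.0, "M"), (17.0, 22.5, "B")]
points = []
for radius_mean, radius_worst, label in rows:
    features = radius_gap_features(radius_mean, radius_worst)
    points.append((radius_mean, features["gap_squared"], label))

# Placeholder boundary: a flat line at gap_squared = 9.0.
suspects = flag_suspects(points, boundary=lambda x: 9.0)
print(suspects)
```

Here the last row is labelled benign but has a large radius gap, so it lands above the placeholder boundary and is the only index flagged.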
Task 1 - Mislabelled Data Detection Methods:
  • Feature Engineering: Computed differences (worst - mean) and examined squared/cubed values to distinguish benign vs. malignant curves
  • K-Means Clustering: Ran multiple initializations to find consistent outliers
  • Logistic Regression: Tested whether flagged points consistently mismatched the known labels
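A minimal sketch of the "multiple initializations" idea from the K-Means bullet: cluster the data several times from different random starts, map each cluster to its majority label, and keep only the points that disagree with their cluster's majority in every run. It uses a single hypothetical feature and a hand-written 2-means loop to stay dependency-free; the source does not describe the project's actual implementation, and the logistic regression cross-check is omitted here.

```python
import random
from collections import Counter

def two_means_1d(values, seed, iterations=20):
    """Plain 2-means on a list of floats; returns a cluster id (0 or 1) per value."""
    rng = random.Random(seed)
    centroids = rng.sample(values, 2)  # random initialization from the data
    assignment = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        assignment = [
            0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            for v in values
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignment

def consistent_outliers(values, labels, seeds):
    """Indices whose label differs from their cluster's majority in every run."""
    disagreements = Counter()
    for seed in seeds:
        assignment = two_means_1d(values, seed)
        for c in set(assignment):
            members = [i for i, a in enumerate(assignment) if a == c]
            majority = Counter(labels[i] for i in members).most_common(1)[0][0]
            for i in members:
                if labels[i] != majority:
                    disagreements[i] += 1
    return sorted(i for i, count in disagreements.items() if count == len(seeds))

# Toy feature values and labels, invented for illustration.
values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
labels = ["B", "B", "B", "B", "M", "M", "M", "B"]  # last label looks suspicious
flagged = consistent_outliers(values, labels, seeds=range(10))
print(flagged)
```

Requiring agreement across all runs is a deliberately strict choice; a softer threshold (for example, flagged in most runs) would trade precision for recall.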
dataanalysis-bca-rocauc1.png dataanalysis-bca-rocauc2.png dataanalysis-bca-rocauc3.png dataanalysis-bca-rocauc4.png
For the second task, we were required to classify each sample as benign ('B') or malignant ('M') from its features. We trained several machine learning algorithms and assessed each model with 10-fold cross-validation and multiple metrics: Accuracy, F1 Score and ROC-AUC.
From left to right, the images show the ROC curves for the machine learning approaches listed below, in the same order. If an image is too small, zoom in or right-click and open it in a new tab.
Task 2 - Breast Cancer Classification Approaches:
  • Logistic Regression (LR)
  • k-Nearest Neighbours (kNN)
  • Random Forest (RF)
  • Support Vector Machines (SVM)
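To make the evaluation protocol concrete, here is a dependency-free sketch of 10-fold cross-validation that reports accuracy and F1 for a small kNN classifier on one feature. The source does not name the library or settings the project used, so the fold assignment, `k=3`, and the toy data are assumptions; ROC-AUC and the other three models are left out to keep the example short.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; fold f holds every sample with i % n_folds == f."""
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test

def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training values nearest to the query (use odd k)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)

def cross_validate(xs, ys, n_folds=10, positive="M"):
    """Return accuracy and F1 pooled over all held-out predictions."""
    tp = fp = fn = correct = 0
    for train, test in k_fold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for i in test:
            predicted = knn_predict(train_x, train_y, xs[i])
            correct += predicted == ys[i]
            tp += predicted == positive and ys[i] == positive
            fp += predicted == positive and ys[i] != positive
            fn += predicted != positive and ys[i] == positive
    accuracy = correct / len(xs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Toy single-feature data, invented for illustration.
benign = [11.0, 11.2, 11.4, 11.6, 11.8, 12.0, 12.2, 12.4, 12.6, 12.8]
malignant = [18.0, 18.3, 18.6, 18.9, 19.2, 19.5, 19.8, 20.1, 20.4, 12.1]
xs = benign + malignant
ys = ["B"] * 10 + ["M"] * 10
accuracy, f1 = cross_validate(xs, ys)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

Each sample is held out exactly once, so the pooled scores summarise performance on data the classifier never saw during that fold's training.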
dataanalysis-bca-keyfindings1.png
Key Results:
  • SVM performed best across all metrics, suggesting it handles high-dimensional, non-linear data effectively
  • LR came in a close second, indicating the decision boundary might not be overly complex
  • kNN struggled in higher dimensions without additional feature reduction
  • RF showed signs of overfitting and noise sensitivity with certain irrelevant features

Enhancing Guest Experience (Link)

dataanalysis-ege-waittime1.png
The project aimed to enhance overall guest experience at Universal Studios Singapore (USS). This project was completed by me and 7 others for DSA3101, a course at NUS on practical Data Science.
My objective was to optimally allocate staff across attractions at an amusement park (e.g., Universal Studios Singapore) to reduce wait times and improve guest satisfaction. Two models are compared: Linear Optimization and a Greedy Heuristic.
dataanalysis-ege-models1.png
Model 1 - Linear Optimization:
  • Objective: Minimize the total deviation from a target staff allocation
  • Constraints:
    1. Total staff assigned must not exceed the available staff
    2. Each attraction has a minimum staff requirement
    3. Staff allocation must stay close to the distribution of forecasted wait times
  • Insights:
    1. Tends to produce allocations proportional to ride wait times
    2. Occasionally fails to find a feasible solution if constraints are too tight
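One way to write Model 1 as a linear program is sketched below. The symbols are assumptions introduced for this illustration: x_i is the staff assigned to attraction i, w_i its forecasted wait time, m_i its minimum staff, S the total available staff, and t_i a target allocation taken to be proportional to w_i. The source does not give the exact target or how the "close to forecasted wait times" requirement was encoded, so here it is folded into the objective through auxiliary deviation variables d_i.

```latex
\begin{aligned}
\min_{x,\,d}\quad & \sum_{i=1}^{n} d_i \\
\text{s.t.}\quad & d_i \ge x_i - t_i, \quad d_i \ge t_i - x_i, && i = 1,\dots,n \\
& \sum_{i=1}^{n} x_i \le S \\
& x_i \ge m_i, && i = 1,\dots,n \\
\text{where}\quad & t_i = S \,\frac{w_i}{\sum_{j=1}^{n} w_j}
\end{aligned}
```

Under this formulation the program has no feasible solution whenever the minimums alone exceed the available staff, that is, when the sum of the m_i is greater than S, which is one concrete way the constraints can be "too tight".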
Model 2 - Greedy Heuristic:
  • Objective: Reduce overall wait times by continually allocating staff to the ride with the highest current wait time
  • Wait-Time Formula: Each additional staff member reduces wait time by a fixed factor (e.g., 0.95)
  • Insights:
    1. Always yields a feasible solution
    2. May produce a more uniform allocation than the linear model
    3. Assumes wait time decreases consistently with each staff addition, which might not always be true
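The greedy procedure can be sketched in a few lines. This version assumes that every attraction starts at its minimum staffing, that the forecasted wait times already reflect that minimum, that the minimums fit within the total, and that each extra staff member multiplies the ride's wait time by the reduction factor. The function name `greedy_allocation` and these details are illustrative rather than taken from the project's code.

```python
def greedy_allocation(wait_times, total_staff, min_staff, reduction=0.95):
    """Give each spare staff member to the ride with the highest current wait.

    Returns the allocation and the adjusted wait times.
    """
    allocation = list(min_staff)      # start from the per-attraction minimums
    current_wait = list(wait_times)   # working copy, updated as staff are added
    spare = total_staff - sum(allocation)
    # If the minimums already use up the total, the loop below does nothing.
    for _ in range(max(0, spare)):
        busiest = max(range(len(current_wait)), key=lambda i: current_wait[i])
        allocation[busiest] += 1
        current_wait[busiest] *= reduction  # assumed multiplicative effect
    return allocation, current_wait

# Toy inputs, invented for illustration.
allocation, waits = greedy_allocation([60, 30, 10], total_staff=6, min_staff=[1, 1, 1])
print(allocation)  # [4, 1, 1]
```

With a factor as mild as 0.95, the busiest ride stays busiest for many steps, so spare staff pile onto it; a stronger factor such as 0.5 spreads the same staff as [3, 2, 1] on these inputs.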
dataanalysis-ege-comparison1.png
Comparing Both Models:
  1. Allocations:
    • Linear Optimization often mirrors the distribution of forecasted wait times
    • Greedy Heuristic focuses on whichever ride currently has the longest wait, potentially distributing staff less proportionally
  2. Decision Rule:
    • By default, pick the model that allocates more staff to the highest-wait attraction
    • If both allocate equally, default to Linear Optimization (since it tracks wait-time distribution more closely)
  3. Trade-Offs:
    • Greedy is guaranteed feasible but might over-allocate staff to a single ride if it assumes continuous wait-time reductions
    • Linear Optimization can fail in certain scenarios but often reflects a more directly proportional distribution
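The decision rule above is simple enough to express directly. In this sketch, `linear_alloc` is `None` when the linear model finds no feasible solution, in which case the greedy result is used; that fallback is an assumption, since the source only states the comparison rule and the tie-break.

```python
def choose_allocation(wait_times, linear_alloc, greedy_alloc):
    """Pick the allocation that gives more staff to the highest-wait attraction.

    Ties go to the linear model. Returns (model_name, allocation).
    """
    if linear_alloc is None:
        # Assumed fallback: the linear model was infeasible.
        return "greedy", greedy_alloc
    busiest = max(range(len(wait_times)), key=lambda i: wait_times[i])
    if greedy_alloc[busiest] > linear_alloc[busiest]:
        return "greedy", greedy_alloc
    return "linear", linear_alloc

# Toy allocations, invented for illustration.
choice = choose_allocation([60, 30, 10], linear_alloc=[3, 2, 1], greedy_alloc=[4, 1, 1])
print(choice)  # ('greedy', [4, 1, 1])
```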
dataanalysis-ege-usage1.png
Usage and Implementation:
  1. User Inputs:
    • Forecasted wait times
    • Total available staff
    • Minimum staff per attraction
    • (Optional) A reduction factor for the Greedy model
  2. Outputs:
    • Recommended staff allocation from each model, plus a final “chosen” allocation based on the decision rule
    • Bar plots visualizing how each model distributes staff
  3. Observations:
    • Both models run quickly (milliseconds)
    • Stakeholders can override the recommended choice if they prefer a more uniform distribution or have other practical considerations
By comparing Linear Optimization and a Greedy Heuristic, this system provides two distinct staff allocations: one that aligns more proportionally with forecasted wait times, and another that is always feasible but potentially less proportional. Users can choose the final allocation that best fits their operational goals and constraints.