
Dimensionality Reduction and Classification: A Machine Learning Framework for High-Dimensional Data

Overview:
This project explored machine learning techniques for classifying and analyzing imbalanced, high-dimensional datasets. The workflow covered data preprocessing, feature extraction, dimensionality reduction, model training, and evaluation, with a focus on optimizing classification accuracy for real-world image data.


Key Contributions:

  1. Data Preprocessing and Imbalance Handling:

    • Addressed class imbalance with resampling strategies and weighted evaluation metrics to improve minority class prediction.

    • Implemented data cleaning and normalization using Python libraries such as Pandas and Scikit-learn; a minimal sketch of this step follows below.
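
A minimal sketch of this step, assuming a tabular feature set with a binary label column; the file name, column name, and oversampling-by-duplication strategy are illustrative stand-ins rather than the project's documented choices:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Hypothetical input: a DataFrame of extracted features plus a binary "label" column.
df = pd.read_csv("features.csv")  # illustrative path

# Basic cleaning: drop duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Simple random oversampling: duplicate minority-class rows
# until both classes are the same size.
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
minority_upsampled = resample(
    minority,
    replace=True,               # sample with replacement
    n_samples=len(majority),    # match the majority class size
    random_state=42,
)
balanced = pd.concat([majority, minority_upsampled])

# Normalize features to zero mean and unit variance.
X = balanced.drop(columns=["label"]).to_numpy()
y = balanced["label"].to_numpy()
X = StandardScaler().fit_transform(X)
```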

  2. Dimensionality Reduction:

    • Applied Principal Component Analysis (PCA) to reduce the feature space while retaining 95% of the variance, improving computational efficiency and interpretability.

    • Conducted exploratory analysis of the variance distribution and selected the optimal number of components for modeling (see the sketch below).
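
A sketch of the reduction step, continuing from the scaled matrix X above; passing a float in (0, 1) as n_components tells scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction:

```python
from sklearn.decomposition import PCA

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(f"Components kept: {pca.n_components_}")
print(f"Cumulative variance retained: {pca.explained_variance_ratio_.sum():.3f}")
```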

  3. Model Development and Optimization:

    • Trained and evaluated multiple classifiers, including Support Vector Machines (SVM), Random Forest, and K-Nearest Neighbors (KNN).

    • Performed hyperparameter tuning using grid search with cross-validation to identify the best model configuration (sketched below).
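
A sketch of the tuning step for the SVM; the parameter grid, 5-fold split, and F1 scoring are illustrative values rather than the project's reported configuration:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stratified hold-out split on the reduced features.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative grid; the actual search space is not documented here.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.001],
    "kernel": ["rbf"],
}
search = GridSearchCV(SVC(probability=True), param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```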

  4. Evaluation and Metrics:

    • Used metrics such as AUC-ROC, AUC-PR, F1-score, and Matthews Correlation Coefficient to assess model performance.

    • Designed pipelines integrating PCA and SVM with optimal parameters, achieving 93.3% accuracy and a robust precision-recall balance; an evaluation sketch follows below.
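
An evaluation sketch using scikit-learn's metric functions, assuming the fitted search object and hold-out split from the previous sketch:

```python
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    matthews_corrcoef,
    roc_auc_score,
)

best = search.best_estimator_
y_pred = best.predict(X_test)
y_score = best.predict_proba(X_test)[:, 1]  # positive-class probability

print(f"AUC-ROC: {roc_auc_score(y_test, y_score):.3f}")
print(f"AUC-PR:  {average_precision_score(y_test, y_score):.3f}")
print(f"F1:      {f1_score(y_test, y_pred):.3f}")
print(f"MCC:     {matthews_corrcoef(y_test, y_pred):.3f}")
```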

  5. Visualization and Insights:

    • Created scatter plots and scree plots to visualize PCA results, class separability, and cumulative explained variance.

    • Plotted ROC and Precision-Recall curves for multiple classifiers to compare performance and highlight the best model (plotting sketch below).
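
Sketches of the main plots, reusing objects from the earlier snippets; the styling choices are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

# Scree plot: cumulative explained variance per component.
cumvar = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
plt.axhline(0.95, linestyle="--", color="gray")  # 95% retention threshold
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("PCA scree plot")
plt.show()

# Scatter of the first two principal components, colored by class.
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="coolwarm", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Class separability in PCA space")
plt.show()

# ROC and Precision-Recall curves for the tuned classifier.
RocCurveDisplay.from_estimator(best, X_test, y_test)
PrecisionRecallDisplay.from_estimator(best, X_test, y_test)
plt.show()
```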

  6. End-to-End Deployment:

    • Built a scalable pipeline for preprocessing, feature extraction, and classification, with outputs saved in structured formats for downstream analysis; a sketch of how such a pipeline could be assembled follows below.
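
A sketch of how such a pipeline could be assembled and persisted; the output file names are illustrative, and the pipeline is refit on unscaled features so that it owns every transformation step itself:

```python
import joblib
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Re-split the balanced data on raw (unscaled) features.
X_raw = balanced.drop(columns=["label"]).to_numpy()
y_raw = balanced["label"].to_numpy()
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, stratify=y_raw, random_state=42
)

# One object covering scaling, PCA, and classification.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("svm", SVC(**search.best_params_, probability=True)),
])
pipeline.fit(X_tr, y_tr)

# Persist the fitted model and write predictions in a structured format.
joblib.dump(pipeline, "pca_svm_pipeline.joblib")
pd.DataFrame({"prediction": pipeline.predict(X_te)}).to_csv(
    "predictions.csv", index=False
)
```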


Skills and Technologies:

  • Programming & Tools: Python, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn

  • Machine Learning: SVM, Random Forest, KNN, PCA

  • Evaluation Metrics: Accuracy, AUC-ROC, AUC-PR, F1-score, Matthews Correlation Coefficient

  • Visualization: PCA scatter plots, scree plots, and ROC/PR curves for classifier comparison


Results:

  • Achieved strong classification performance, 93.3% accuracy with a balanced precision-recall trade-off, using the optimized PCA and SVM configuration.

  • Demonstrated effective handling of complex, high-dimensional datasets, mirroring real-world data science and machine learning workflows.


Relevance for Data Roles:
This project showcases expertise in data preprocessing, dimensionality reduction, and advanced classification techniques, making it directly relevant for roles in data analysis, data science, and machine learning.
