
Dimensionality Reduction and Classification for Biomedical Image Analysis: Insights into Protein Expression and Cellular Imaging

Overview:
Developed a pipeline that analyzes protein expression data alongside cellular images to extract biological insights. The project combined exploratory data analysis, image processing, and machine learning to relate protein expression levels to image-derived cellular features in support of biomedical research.


Key Contributions:

  1. Data Preprocessing and Analysis:

    • Processed protein expression datasets with 38 features, filtering and preparing data by specimen groups for training and testing.

    • Conducted exploratory data analysis (EDA), including histograms of key proteins (e.g., NESTIN, cMYC, MET), identifying distribution patterns across specimen groups.

  2. Image Processing:

    • Converted RGB cellular images to Hematoxylin-Eosin-DAB (HED) color space to isolate the Hematoxylin channel, enhancing the visualization of cellular nuclei.

    • Analyzed channel intensity to derive key metrics such as mean and variance, enabling cellular morphology analysis.

  3. Feature Extraction and Engineering:

    • Extracted statistical features (mean, variance) from HED channels and RGB images to model correlations between image intensity and protein expression levels.

    • Applied Principal Component Analysis (PCA) to reduce dimensionality, then assessed which components were most predictive of NESTIN expression.

  4. Machine Learning:

    • Built linear regression models to predict protein expression from the extracted image features.

    • Evaluated models using the correlation coefficient, Mean Squared Error (MSE), and R-squared to assess prediction quality.

  5. Visualization and Insights:

    • Created scatter plots with fitted regression lines to show the relationship between H-channel intensity and protein expression levels, identifying potential predictive features.

    • Used histograms and other plots to communicate patterns in protein expression and cellular characteristics.
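
The image-to-expression steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual code: the images and expression values are synthetic, the helper name hematoxylin_features is hypothetical, and a simple hold-out split stands in for the split by specimen group.

```python
import numpy as np
from skimage.color import rgb2hed
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def hematoxylin_features(rgb_image):
    """Return (mean, variance) of the Hematoxylin channel of an RGB image."""
    hed = rgb2hed(rgb_image)  # channels: Hematoxylin, Eosin, DAB
    h_channel = hed[..., 0]
    return float(h_channel.mean()), float(h_channel.var())


# Synthetic stand-in data, NOT the project's images or measurements.
rng = np.random.default_rng(0)
n_images = 40
brightness = rng.uniform(0.2, 1.0, size=n_images)
images = [rng.random((32, 32, 3)) * b for b in brightness]  # floats in [0, 1]

# Feature matrix: one row per image, columns = [H mean, H variance].
X = np.array([hematoxylin_features(img) for img in images])

# Hypothetical expression values, constructed to depend on the H-channel mean.
y = 5.0 * X[:, 0] + rng.normal(scale=0.05, size=n_images)

# Simple hold-out split (an assumption; the project split by specimen group).
X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# The three evaluation metrics named in the list above.
r = np.corrcoef(y_test, y_pred)[0, 1]
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"r = {r:.3f}, MSE = {mse:.4f}, R^2 = {r2:.3f}")
```

With real data, the synthetic images would be replaced by loaded RGB micrographs and y by the measured expression of a protein such as NESTIN. The numbers printed here say nothing about the project's results.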

 

Results:

  • Identified a moderate correlation between H-channel intensity and NESTIN expression, demonstrating the potential of image features in predicting protein expression.

  • PCA identified a set of principal components that together retained 95% of the variance, enabling dimensionality reduction without significant information loss.
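
As a rough illustration of the 95% variance criterion, scikit-learn's PCA accepts a fraction as n_components and keeps the smallest number of components that reaches it. The sketch below runs on synthetic data with 38 features to mirror the dataset's width. The sample count, the number of hidden factors, and the standardization step are assumptions, not details taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 38-feature expression table (NOT the project's data):
# 5 hidden factors mixed into 38 correlated features, plus a little noise.
rng = np.random.default_rng(0)
n_samples, n_features, n_factors = 200, 38, 5
latent = rng.normal(size=(n_samples, n_factors))
mixing = rng.normal(size=(n_factors, n_features))
data = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

# Standardize so that no feature dominates because of its scale.
scaled = StandardScaler().fit_transform(data)

# A float n_components (with the full SVD solver) selects the smallest number
# of components whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(scaled)

retained = pca.explained_variance_ratio_.sum()
print(f"{pca.n_components_} components retain {retained:.1%} of the variance")
```

Because the synthetic features are driven by only a few hidden factors, far fewer than 38 components are needed here. How many a real expression table needs depends on how correlated its features are.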

 

Skills and Technologies:

  • Programming & Tools: Python, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn

  • Image Processing: scikit-image (skimage), RGB-to-HED conversion, Hematoxylin channel analysis

  • Machine Learning: Linear regression, PCA, feature engineering

  • Data Visualization: Scatter plots, histograms, and regression analysis for actionable insights

 

Impact:
This project highlights expertise in combining data mining and image processing with machine learning to address biological questions. The approach demonstrates the ability to analyze multidimensional datasets and extract meaningful insights, which is directly relevant to data science and machine learning roles in biomedical domains.
