Data-driven analyst with experience in computational biology, machine learning, SQL analytics, and risk-oriented data analysis. Interested in financial analytics, fraud detection, cybersecurity analytics, and AI-driven data science applications.
Genomic data
→ RNA-seq processing
→ Differential expression
→ Functional pathway analysis
→ Survival modeling
→ Prognostic biomarker discovery
→ Clinical interpretation
https://github.com/ag48665/sql-data-analysis-portfolio
SQL analytics project focused on banking KPIs, fraud detection, customer segmentation, transaction analysis, ranking functions, window functions, and risk-oriented financial analytics using SQLite.
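As a rough illustration of the ranking and window-function work described above, the sketch below ranks customers by total spend in an in-memory SQLite database. The `transactions` table, its columns, and the toy rows are hypothetical and are not taken from the repository; window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical toy data: (customer_id, amount). Not from the repository.
rows = [
    ("C1", 120.0), ("C1", 80.0),
    ("C2", 500.0),
    ("C3", 200.0), ("C3", 300.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

# Aggregate per customer in a CTE, then rank the totals with a window function.
query = """
WITH totals AS (
    SELECT customer_id, SUM(amount) AS total_spend
    FROM transactions
    GROUP BY customer_id
)
SELECT customer_id,
       total_spend,
       RANK() OVER (ORDER BY total_spend DESC) AS spend_rank
FROM totals
ORDER BY spend_rank, customer_id
"""
ranked = conn.execute(query).fetchall()
conn.close()

for customer_id, total_spend, spend_rank in ranked:
    print(customer_id, total_spend, spend_rank)
```

`RANK()` leaves a gap after ties, so two customers tied for first place are followed by rank 3; `DENSE_RANK()` would be the choice if consecutive ranks are preferred.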
Interactive Tableau dashboard analyzing airline customer satisfaction, delays, service quality, and passenger experience trends.
Data analysis project focused on airline passenger satisfaction using exploratory data analysis, customer segmentation, and visualization techniques.
Business and data analytics internship project involving data cleaning, reporting, visualization, and analytical insights for business decision-making.
https://github.com/ag48665/hospital-readmission-risk-sql-python
Healthcare analytics and machine learning project focused on hospital readmission prediction, SQL healthcare analytics, ICU risk profiling, cost analysis, CTEs, window functions, and predictive modeling using Python and scikit-learn.
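The CTE and window-function pattern mentioned above can be sketched as a 30-day readmission flag. The `admissions` table, its columns, the toy dates, and the 30-day cutoff are assumptions made for illustration, not the project's actual schema.

```python
import sqlite3

# Hypothetical admissions: (patient_id, admit_date, discharge_date).
admissions = [
    ("P1", "2023-01-01", "2023-01-05"),
    ("P1", "2023-01-20", "2023-01-25"),  # 15 days after the previous discharge
    ("P2", "2023-02-01", "2023-02-03"),
    ("P2", "2023-04-15", "2023-04-18"),  # 71 days after the previous discharge
    ("P3", "2023-03-10", "2023-03-12"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admissions (patient_id TEXT, admit_date TEXT, discharge_date TEXT)"
)
conn.executemany("INSERT INTO admissions VALUES (?, ?, ?)", admissions)

# The CTE attaches each stay's previous discharge date with LAG();
# the outer query keeps stays that begin within 30 days of that discharge.
query = """
WITH ordered AS (
    SELECT patient_id,
           admit_date,
           LAG(discharge_date) OVER (
               PARTITION BY patient_id ORDER BY admit_date
           ) AS prev_discharge
    FROM admissions
)
SELECT patient_id,
       admit_date,
       CAST(julianday(admit_date) - julianday(prev_discharge) AS INTEGER)
           AS days_since_discharge
FROM ordered
WHERE prev_discharge IS NOT NULL
  AND julianday(admit_date) - julianday(prev_discharge) <= 30
ORDER BY patient_id, admit_date
"""
readmissions = conn.execute(query).fetchall()
conn.close()

print(readmissions)
```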
https://github.com/ag48665/healthcare-claims-risk-analytics-pipeline
Healthcare analytics and risk modeling project focused on insurance claims processing, fraud analytics, patient risk scoring, cost optimization, ETL pipelines, SQL analytics, and predictive modeling using Python and machine learning techniques.
https://github.com/ag48665/healthcare-executive-dashboard-powerbi
Interactive Power BI healthcare dashboard focused on executive KPI monitoring, healthcare claims analytics, ICU utilization, hospital performance tracking, diagnosis cost analysis, readmission risk reporting, and healthcare business intelligence visualization.
https://github.com/ag48665/mimic-iv-icu-risk-analytics
Real-world healthcare analytics project using MIMIC-IV clinical ICU data, SQL, Python, SQLite, mortality analytics, ICU KPI monitoring, healthcare data visualization, clinical risk analysis, and exploratory healthcare data analytics.
https://github.com/ag48665/multimodal-cancer-foundation-models
Objective: Characterize the cellular composition, signaling interactions, and transcriptional dynamics of the lung tumor microenvironment using single-cell RNA sequencing (scRNA-seq) data from lung cancer, metastatic, and normal tissue samples.
Data: GSE131907 lung cancer single-cell RNA-seq dataset including normal lung (nLung), normal lymph node (nLN), brain metastasis (mBrain), and tumor lung/bronchus (tL/B) samples.
Methods:
- Single-cell quality control and preprocessing using Scanpy
- Highly variable gene selection and dimensionality reduction
- Batch integration using scVI-tools
- UMAP visualization and Leiden clustering
- Cell-type and subtype annotation
- Differential gene expression analysis
- GO Biological Process enrichment analysis
- Ligand-receptor interaction inference using LIANA
- Cell-cell communication network reconstruction
- Diffusion pseudotime (DPT) analysis
- PAGA trajectory inference
- Multisample comparison of metastatic and normal tissues
- Identification of metastasis-associated transcriptional programs
Technologies:
- Python
- Scanpy
- scVI-tools
- LIANA
- GSEApy
- AnnData
- Pandas
- NumPy
- Seaborn
- Matplotlib
Key Findings:
- Identification of major tumor microenvironment populations including epithelial, immune, endothelial, fibroblast, and stromal cells
- Discovery of prominent ligand-receptor interactions such as APP→AGER, APOE→ABCA1, ARPC5→ADRB2, and ACTR2→ADRB2
- Reconstruction of epithelial differentiation trajectories using pseudotime analysis
- Detection of terminal epithelial states enriched for AGER and AQP5 expression
- Identification of brain metastasis-associated genes including APLP1, APOD, AIF1L, ANLN, and ABCA2
- Functional enrichment revealing vesicle trafficking, phagosome maturation, intracellular pH regulation, and metastatic adaptation pathways
Applications:
- Single-cell transcriptomics
- Tumor microenvironment analysis
- Cancer systems biology
- Computational oncology
- Cell-cell communication analysis
- Metastatic progression studies
- Biomarker discovery
https://github.com/ag48665/gatk-somatic-variant-calling-demo
Objective:
Build a reproducible tumor-normal somatic variant calling workflow demonstrating practical next-generation sequencing (NGS) analysis skills used in computational genomics and cancer bioinformatics.
Methods:
- Tumor-normal paired sequencing workflow design
- BWA and minimap2 read alignment
- BAM sorting and indexing using SAMtools
- Somatic SNV/indel calling using GATK Mutect2
- Variant filtering using FilterMutectCalls
- YAML-based workflow configuration
- GitHub Actions CI/CD testing
- Docker-based reproducible environment setup
Technologies:
- Python
- GATK Mutect2
- SAMtools
- BWA
- minimap2
- Docker
- GitHub Actions
- YAML
- Linux command-line workflows
Applications:
- Cancer genomics
- Somatic mutation analysis
- NGS workflow engineering
- Reproducible bioinformatics pipelines
- Computational oncology
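A minimal sketch of how a configuration could be turned into the Mutect2 and FilterMutectCalls commands listed in the methods above. The config keys, file paths, and the `mutect2_commands` helper are hypothetical; in the repository the configuration is described as YAML-based, and a plain dictionary stands in for it here. The commands are only built as strings, not executed.

```python
import shlex

# Hypothetical configuration standing in for the YAML file.
config = {
    "reference": "ref/genome.fa",
    "tumor_bam": "bam/tumor.sorted.bam",
    "normal_bam": "bam/normal.sorted.bam",
    "normal_sample": "NORMAL",  # sample name in the normal BAM read group
    "out_prefix": "results/somatic",
}


def mutect2_commands(cfg):
    """Build the calling and filtering commands as shell-safe strings."""
    unfiltered = f"{cfg['out_prefix']}.unfiltered.vcf.gz"
    filtered = f"{cfg['out_prefix']}.filtered.vcf.gz"
    call = [
        "gatk", "Mutect2",
        "-R", cfg["reference"],
        "-I", cfg["tumor_bam"],
        "-I", cfg["normal_bam"],
        "-normal", cfg["normal_sample"],
        "-O", unfiltered,
    ]
    filt = [
        "gatk", "FilterMutectCalls",
        "-R", cfg["reference"],
        "-V", unfiltered,
        "-O", filtered,
    ]
    return [shlex.join(call), shlex.join(filt)]


commands = mutect2_commands(config)
for cmd in commands:
    print(cmd)
```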
https://github.com/ag48665/aws-cancer-survival-pipeline
Objective:
Build a cloud-ready, reproducible bioinformatics pipeline for RNA-seq statistical analysis, biomarker discovery, and survival modeling in lung cancer using TCGA-LUAD data.
Methods:
- TCGA RNA-seq and clinical data processing
- Differential expression analysis using DESeq2
- Exploratory transcriptomic analysis and PCA
- Kaplan–Meier survival analysis
- Cox proportional hazards modeling
- LASSO Cox feature selection
- Reproducible workflow structure with Docker, Nextflow, GitHub Actions, and AWS-ready architecture
Technologies:
- R
- Bioconductor
- DESeq2
- TCGAbiolinks
- survival / survminer
- glmnet
- Nextflow
- Docker
- GitHub Actions
- AWS-ready project structure
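The project itself performs survival analysis in R with `survival` / `survminer`. Purely as an illustration of what the Kaplan–Meier step computes, here is a small product-limit estimator in plain Python; the follow-up times and event flags are made-up values, and `kaplan_meier` is a hypothetical helper.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group every subject that shares the same follow-up time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit update: multiply by the conditional survival at t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t.
        n_at_risk -= removed
    return curve


# Made-up follow-up times (months) and event indicators.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)

for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects reduce the risk set without lowering the curve, which is why the estimate only steps down at observed event times.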
https://github.com/ag48665/genomic_identification
Objective: Develop a computational genomics framework for forensic-style human identification from degraded and mixed DNA samples using machine learning, Bayesian inference, and deep learning.
Data: 1000 Genomes Project Phase 3 genotype data (chromosome 20 SNP profiles)
Methods:
- Genomic SNP matrix construction
- PCA and UMAP population structure analysis
- Population-aware ancestry inference
- DNA degradation robustness simulation
- Bayesian posterior identity estimation
- Entropy-based uncertainty analysis
- Explainable AI using SNP feature importance
- Deep autoencoder latent genomic embeddings
- Interactive Streamlit dashboard development
Technologies:
- Python
- PyTorch
- scikit-allel
- scikit-learn
- NumPy
- Matplotlib
- Streamlit
Applications:
- Forensic genomics
- Population genetics
- Probabilistic genomic inference
- AI-assisted human identification
- Explainable genomic machine learning
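The Bayesian posterior and entropy steps listed in the methods above can be sketched as follows. This is a simplified model under stated assumptions: a uniform prior over candidates, independent SNPs, a single symmetric genotyping error rate, and dropped-out SNPs encoded as `None`. The profiles, the error rate, and both function names are hypothetical.

```python
import math


def posterior_identity(observed, candidates, error_rate=0.05):
    """Posterior over candidates given a degraded genotype profile (0/1/2 or None)."""
    log_like = {}
    for name, profile in candidates.items():
        ll = 0.0
        for obs, true in zip(observed, profile):
            if obs is None:  # dropped-out SNP carries no information
                continue
            # Assumed error model: correct call with prob 1 - e,
            # each of the two wrong genotypes with prob e / 2.
            p = 1 - error_rate if obs == true else error_rate / 2
            ll += math.log(p)
        log_like[name] = ll
    # Uniform prior: the posterior is the normalized likelihood.
    # Subtracting the max log-likelihood keeps the exponentials stable.
    m = max(log_like.values())
    weights = {k: math.exp(v - m) for k, v in log_like.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def entropy_bits(posterior):
    """Shannon entropy of the posterior, in bits (0 means full certainty)."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


# Hypothetical reference profiles and a degraded query sample.
candidates = {
    "A": [0, 1, 2, 1, 0],
    "B": [0, 1, 1, 1, 0],
    "C": [2, 0, 0, 1, 2],
}
observed = [0, None, 2, 1, 0]

posterior = posterior_identity(observed, candidates)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 4), round(entropy_bits(posterior), 4))
```

As more SNPs drop out, the posterior flattens and the entropy rises toward `log2(number of candidates)`, which gives a simple uncertainty readout for degraded samples.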
https://github.com/ag48665/lusc-transcriptomic-prognostic-signature
Objective:
Develop and validate a robust survival prediction model for lung squamous cell carcinoma (LUSC) patients.
Data:
TCGA (training cohort), GEO external validation cohorts (GSE30219, GSE37745)
Methods:
- Cox proportional hazards modeling
- Elastic-net feature selection
- Kaplan–Meier survival analysis
- Time-dependent ROC analysis
- Multivariable Cox regression
- Calibration and decision curve analysis
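To make the decision curve analysis step concrete, the sketch below computes net benefit at one threshold probability for a binary outcome. It is a simplification: it treats the outcome as known for every patient at a fixed time horizon and ignores censoring, which a survival-based decision curve would need to handle. The outcomes, predicted risks, and function names are hypothetical.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def treat_all_net_benefit(y_true, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = event by the horizon) and model-predicted risks.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
risk = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.7, 0.35, 0.05]

nb_model = net_benefit(y_true, risk, threshold=0.5)
nb_all = treat_all_net_benefit(y_true, threshold=0.5)
print(f"model: {nb_model:.2f}, treat all: {nb_all:.2f}, treat none: 0.00")
```

A decision curve repeats this calculation across a range of thresholds and plots the model against the treat-all and treat-none strategies.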
https://github.com/ag48665/nasa-spaceflight-ai
Objective: Analyze NASA GeneLab RNA-seq transcriptomics datasets using bioinformatics, machine learning, and data visualization techniques to investigate gene expression variability under spaceflight conditions.
Data: NASA GeneLab RNA-seq datasets (GLDS-168, GLDS-245)
Methods:
- RNA-seq transcriptomics preprocessing
- Principal Component Analysis (PCA)
- Heatmap visualization
- Highly expressed gene analysis
- K-Means clustering
- Comparative transcriptomics
- Machine learning-based exploratory analysis
Technologies:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- scikit-learn
- Jupyter Notebook
Applications:
- Space medicine research
- Biomarker discovery
- Precision medicine
- AI-assisted bioinformatics
- Transcriptomic anomaly detection
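The K-Means clustering step in the methods above can be illustrated with a small pure-Python version of Lloyd's algorithm. The project lists scikit-learn, whose `KMeans` would normally be used instead; this sketch initializes centroids from the first `k` points for determinism, which is an assumption made here for reproducibility. The expression values are invented.

```python
def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with a deterministic start (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented two-gene expression profiles for six samples.
samples = [
    (1.0, 1.2), (5.0, 5.1), (0.8, 1.0),
    (5.2, 4.9), (1.1, 0.9), (4.8, 5.0),
]
labels, centroids = kmeans(samples, k=2)
print(labels)
print(centroids)
```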
https://github.com/ag48665/tcga-lusc-biomarker-analysis
Objective:
Identify survival-associated gene expression programs and potential prognostic biomarkers in LUSC.
Methods:
- TCGA data acquisition using TCGAbiolinks
- Differential expression analysis (DESeq2)
- Functional enrichment analysis (GO / KEGG)
- Survival analysis
https://github.com/ag48665/lusc-immune-escape-analysis
Objective:
Characterize immune heterogeneity in lung squamous cell carcinoma by analyzing immune activation, checkpoint signaling, and exhaustion-associated tumor states using TCGA and GEO datasets.
Methods:
- Immune gene signature scoring
- T-cell exhaustion profiling
- Checkpoint signaling analysis
- UMAP visualization
- Survival analysis
- External cohort validation
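One common way to score an immune gene signature is the mean of per-gene z-scores across samples; the sketch below shows that approach and is not necessarily the scoring method used in the repository. The gene list is an illustrative set of well-known checkpoint genes, and the expression values are invented.

```python
from statistics import mean, pstdev


def signature_scores(expression, signature):
    """Mean z-score per sample over the signature genes present in the data.

    expression maps gene -> list of values, one per sample.
    """
    genes = [g for g in signature if g in expression]
    if not genes:
        raise ValueError("no signature genes found in the expression data")
    z_rows = []
    for g in genes:
        values = expression[g]
        mu, sd = mean(values), pstdev(values)
        # Genes with zero variance contribute a neutral score of 0.
        z_rows.append([(v - mu) / sd if sd > 0 else 0.0 for v in values])
    # Average the z-scores gene-wise for each sample (column).
    return [mean(col) for col in zip(*z_rows)]


# Invented expression values for three samples.
expression = {
    "PDCD1": [1.0, 2.0, 3.0],
    "CTLA4": [2.0, 4.0, 6.0],
    "LAG3": [0.0, 5.0, 10.0],
    "ACTB": [9.0, 9.0, 9.0],  # not part of the signature
}
exhaustion_signature = ["PDCD1", "CTLA4", "LAG3"]  # illustrative gene set

scores = signature_scores(expression, exhaustion_signature)
print([round(s, 3) for s in scores])
```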
https://github.com/ag48665/spaceflight-rnaseq-analysis
Objective:
Investigate transcriptomic changes associated with spaceflight exposure using RNA-seq analysis to identify altered biological pathways and adaptive molecular responses.
Methods:
- RNA-seq preprocessing and normalization
- Differential gene expression analysis
- Functional enrichment analysis (GO / KEGG)
- Pathway-level interpretation
- Exploratory transcriptomic visualization
- Reproducible bioinformatics workflow development
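GO / KEGG enrichment tools often rest on an over-representation test. The sketch below computes the one-sided hypergeometric p-value for the overlap between a differentially expressed gene list and a pathway; whether the repository uses this exact test is an assumption, and the counts are invented. Multiple-testing correction is left out.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Invented counts: 20 background genes, 5 in the pathway,
# 4 differentially expressed genes, 3 of which fall in the pathway.
p_value = hypergeom_enrichment_p(
    n_background=20, n_pathway=5, n_selected=4, n_overlap=3
)
print(f"p = {p_value:.4f}")
```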
https://github.com/ag48665/tcga-lung-immune-evasion-scRNAseq
Objective:
Explore immune cell populations and functional states within the tumor microenvironment at single-cell resolution.
Methods:
- Scanpy preprocessing and normalization
- PCA / UMAP dimensionality reduction
- Clustering analysis
- Cell-type annotation
https://github.com/ag48665/nextflow-variant-calling-pipeline
Objective:
Build a reproducible, containerized bioinformatics workflow for genomic variant calling using Nextflow and Docker.
The pipeline demonstrates core steps used in next-generation sequencing (NGS) analysis, including:
- FASTQ quality control
- sequence alignment
- BAM processing
- variant calling
- reproducible workflow execution
This project also aimed to develop practical skills in:
- workflow orchestration with Nextflow
- Linux/WSL2 bioinformatics environments
- Docker containerization
- genomics data processing
- Git and GitHub version control
Methods:
The workflow was implemented using Nextflow DSL2 and executed with Docker containers to ensure reproducibility.
Pipeline steps:
- FASTQ Quality Control: FastQC was used to assess sequencing read quality.
- Sequence Alignment: BWA MEM was used to align sequencing reads to the reference genome.
- BAM Processing: SAMtools converted SAM to sorted BAM format.
- Variant Calling: BCFtools was used to identify genomic variants and generate VCF files.
Technologies used:
- Nextflow
- Docker
- FastQC
- BWA
- SAMtools
- BCFtools
- Linux / WSL2
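The four pipeline steps above map onto shell commands roughly as sketched below. This only builds command strings for inspection; it is not the Nextflow DSL2 implementation from the repository. The sample name, file paths, output layout, and the `pipeline_commands` helper are hypothetical, and the reference is assumed to be indexed with `bwa index` beforehand.

```python
import shlex


def pipeline_commands(sample, fastq, reference, outdir="results"):
    """Return the per-step shell commands for one single-end sample."""
    q = shlex.quote  # protect paths that contain spaces or shell characters
    sam = f"{outdir}/{sample}.sam"
    bam = f"{outdir}/{sample}.sorted.bam"
    vcf = f"{outdir}/{sample}.vcf.gz"
    return [
        # 1. FASTQ quality control
        f"fastqc {q(fastq)} -o {q(outdir)}",
        # 2. Alignment (reference must already be indexed with `bwa index`)
        f"bwa mem {q(reference)} {q(fastq)} > {q(sam)}",
        # 3. BAM processing: sort, then index
        f"samtools sort -o {q(bam)} {q(sam)}",
        f"samtools index {q(bam)}",
        # 4. Variant calling: pileup piped into the multiallelic caller
        f"bcftools mpileup -f {q(reference)} {q(bam)} | "
        f"bcftools call -mv -Oz -o {q(vcf)}",
    ]


commands = pipeline_commands("s1", "data/s1.fastq", "ref/genome.fa")
for cmd in commands:
    print(cmd)
```

In a Nextflow workflow each of these commands would typically sit in its own process, with channels passing the intermediate files between steps.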
https://github.com/ag48665/scrna-pbmc-cell-atlas
Objective:
Reconstruct immune cell populations from human PBMC single-cell RNA-seq data using an unsupervised Scanpy workflow.
Data:
Public PBMC3k dataset from 10x Genomics, available through Scanpy (~2,700 human peripheral blood mononuclear cells)
Methods:
- Single-cell RNA-seq quality control and filtering
- Normalization and highly variable gene selection
- PCA / UMAP dimensionality reduction
- Leiden clustering
- Marker gene identification
- Cell-type annotation using canonical immune markers
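The quality control and filtering step can be pictured with the plain-Python sketch below, which keeps cells that express enough genes and have a low mitochondrial fraction. In the actual workflow Scanpy handles this on an AnnData object; the dictionary layout, the toy counts, and the thresholds used here are assumptions for illustration.

```python
def qc_filter(cells, min_genes=200, max_pct_mito=5.0):
    """Keep cells with at least min_genes detected genes and a mitochondrial
    count fraction of at most max_pct_mito percent.

    cells maps barcode -> {gene: raw count}.
    """
    kept = {}
    for barcode, counts in cells.items():
        total = sum(counts.values())
        n_genes = sum(1 for c in counts.values() if c > 0)
        # Human mitochondrial genes carry the "MT-" prefix.
        mito = sum(c for g, c in counts.items() if g.startswith("MT-"))
        pct_mito = 100.0 * mito / total if total else 0.0
        if n_genes >= min_genes and pct_mito <= max_pct_mito:
            kept[barcode] = {"n_genes": n_genes, "pct_mito": pct_mito}
    return kept


# Toy counts for three hypothetical cells.
cells = {
    "cell_ok": {"CD3E": 40, "IL7R": 30, "LYZ": 28, "MT-CO1": 2},
    "cell_high_mito": {"CD3E": 10, "IL7R": 10, "LYZ": 10, "MT-CO1": 70},
    "cell_low_genes": {"CD3E": 5, "IL7R": 0, "LYZ": 0, "MT-CO1": 0},
}

# min_genes is lowered to 3 only because the toy data has four genes.
kept = qc_filter(cells, min_genes=3, max_pct_mito=5.0)
print(kept)
```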
https://github.com/ag48665/Pilot-Hypoxia-Detection-using-Physiological-Signals
Objective:
Develop a machine learning–based system for early hypoxia detection in pilots using physiological signals.
Data:
Multimodal physiological signals including heart rate, oxygen saturation, and respiration measurements
Methods:
- Signal preprocessing and feature extraction
- Time-series analysis of physiological data
- Machine learning classification models
- Model evaluation (accuracy, ROC, confusion matrix)
- Data visualization and pattern detection
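The evaluation metrics named above (accuracy, ROC, confusion matrix) can be computed from first principles as in this sketch. The labels, predicted probabilities, and the 0.5 decision threshold are invented for illustration; scikit-learn provides equivalent functions for real use.

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for y, p in pairs if y == 1 and p == 1),
        "fp": sum(1 for y, p in pairs if y == 0 and p == 1),
        "tn": sum(1 for y, p in pairs if y == 0 and p == 0),
        "fn": sum(1 for y, p in pairs if y == 1 and p == 0),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive scores higher than
    a random negative (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented labels (1 = hypoxic) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

cm = confusion_matrix(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
auc = roc_auc(y_true, scores)
print(cm, round(accuracy, 3), round(auc, 3))
```

For a safety-oriented detector, the false-negative count in the confusion matrix usually matters more than overall accuracy, so the threshold is worth tuning against it.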
https://github.com/ag48665/soc-analyst-portfolio
Cybersecurity analytics portfolio focused on SOC analysis, threat detection, incident investigation, SIEM-style log analysis, alerts, risk triage, and security monitoring workflows.
RNA-seq analysis • differential expression (DESeq2) • survival modeling (Cox, Kaplan–Meier) • functional enrichment (GO / KEGG) • prognostic modeling • single-cell RNA-seq (Scanpy) • TCGA / GEO data analysis • biomarker discovery
Supervised learning • classification models • feature selection • model evaluation (ROC, AUC, confusion matrix) • time-series analysis • physiological signal processing
R (tidyverse, survival, DESeq2) • Python (pandas, numpy, scikit-learn, scanpy)
Linux • Git • reproducible workflows • statistical modeling • data visualization (ggplot2, matplotlib, seaborn) • data preprocessing • pipeline development
Email: agatagabara@gmail.com
LinkedIn: https://www.linkedin.com/in/agatha-gabara-06494a37/