Agata Gabara

Data Analytics | Computational Biology | SQL | Finance & Risk Analytics | Cybersecurity

Data-driven analyst with experience in computational biology, machine learning, SQL analytics, and risk-oriented data analysis. Interested in financial analytics, fraud detection, cybersecurity analytics, and AI-driven data science applications.


Research Focus

Genomic data
→ RNA-seq processing
→ Differential expression
→ Functional pathway analysis
→ Survival modeling
→ Prognostic biomarker discovery
→ Clinical interpretation


Featured Analytics Projects

SQL Banking & Fraud Analytics Portfolio

https://github.com/ag48665/sql-data-analysis-portfolio

SQL analytics project focused on banking KPIs, fraud detection, customer segmentation, transaction analysis, ranking functions, window functions, and risk-oriented financial analytics using SQLite.
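
As an illustration of the ranking and window functions mentioned above, here is a minimal sketch using Python's built-in `sqlite3` module. The `transactions` table, its columns, and the sample rows are hypothetical and not taken from the repository. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database with a hypothetical transactions table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 500.0), (2, 40.0), (3, 60.0)],
)

# A CTE aggregates spend per customer; RANK() then orders customers by total spend.
query = """
WITH totals AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM transactions
    GROUP BY customer_id
)
SELECT customer_id,
       total_spent,
       RANK() OVER (ORDER BY total_spent DESC) AS spend_rank
FROM totals
ORDER BY spend_rank
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

The same pattern extends to per-customer windows by adding `PARTITION BY` inside the `OVER` clause, for example to rank each customer's transactions by amount.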

Airline Customer Satisfaction Tableau Dashboard

Interactive Tableau dashboard analyzing airline customer satisfaction, delays, service quality, and passenger experience trends.

🔗 View Project

Airline Satisfaction Analysis

Data analysis project focused on airline passenger satisfaction using exploratory data analysis, customer segmentation, and visualization techniques.

🔗 View Project

Victoria Data Analytics Internship

Business and data analytics internship project involving data cleaning, reporting, visualization, and analytical insights for business decision-making.

🔗 View Project

Predicting Hospital Readmission Risk using SQL & Python

https://github.com/ag48665/hospital-readmission-risk-sql-python

Healthcare analytics and machine learning project focused on hospital readmission prediction, ICU risk profiling, and cost analysis, using SQL (CTEs, window functions) and predictive modeling with Python and scikit-learn.

Healthcare Claims Risk Analytics Pipeline

https://github.com/ag48665/healthcare-claims-risk-analytics-pipeline

Healthcare analytics and risk modeling project focused on insurance claims processing, fraud analytics, patient risk scoring, cost optimization, ETL pipelines, SQL analytics, and predictive modeling using Python and machine learning techniques.

Healthcare Executive Dashboard (Power BI)

https://github.com/ag48665/healthcare-executive-dashboard-powerbi

Interactive Power BI healthcare dashboard focused on executive KPI monitoring, healthcare claims analytics, ICU utilization, hospital performance tracking, diagnosis cost analysis, readmission risk reporting, and healthcare business intelligence visualization.

MIMIC-IV ICU Risk Analytics

https://github.com/ag48665/mimic-iv-icu-risk-analytics

Real-world healthcare analytics project using MIMIC-IV clinical ICU data with SQL, Python, and SQLite, covering mortality analytics, ICU KPI monitoring, clinical risk analysis, data visualization, and exploratory analysis.


Research & Bioinformatics Projects

Lung Tumor Microenvironment Single-Cell RNA-seq Analysis

https://github.com/ag48665/multimodal-cancer-foundation-models

Objective: Characterize the cellular composition, signaling interactions, and transcriptional dynamics of the lung tumor microenvironment using single-cell RNA sequencing (scRNA-seq) data from lung cancer, metastatic, and normal tissue samples.

Data: GSE131907 lung cancer single-cell RNA-seq dataset including normal lung (nLung), normal lymph node (nLN), brain metastasis (mBrain), and tumor lung/bronchus (tL/B) samples.

Methods:

  • Single-cell quality control and preprocessing using Scanpy
  • Highly variable gene selection and dimensionality reduction
  • Batch integration using scVI-tools
  • UMAP visualization and Leiden clustering
  • Cell-type and subtype annotation
  • Differential gene expression analysis
  • GO Biological Process enrichment analysis
  • Ligand-receptor interaction inference using LIANA
  • Cell-cell communication network reconstruction
  • Diffusion pseudotime (DPT) analysis
  • PAGA trajectory inference
  • Multisample comparison of metastatic and normal tissues
  • Identification of metastasis-associated transcriptional programs

Technologies:

  • Python
  • Scanpy
  • scVI-tools
  • LIANA
  • GSEApy
  • AnnData
  • Pandas
  • NumPy
  • Seaborn
  • Matplotlib

Key Findings:

  • Identification of major tumor microenvironment populations including epithelial, immune, endothelial, fibroblast, and stromal cells
  • Discovery of prominent ligand-receptor interactions such as APP→AGER, APOE→ABCA1, ARPC5→ADRB2, and ACTR2→ADRB2
  • Reconstruction of epithelial differentiation trajectories using pseudotime analysis
  • Detection of terminal epithelial states enriched for AGER and AQP5 expression
  • Identification of brain metastasis-associated genes including APLP1, APOD, AIF1L, ANLN, and ABCA2
  • Functional enrichment revealing vesicle trafficking, phagosome maturation, intracellular pH regulation, and metastatic adaptation pathways

Applications:

  • Single-cell transcriptomics
  • Tumor microenvironment analysis
  • Cancer systems biology
  • Computational oncology
  • Cell-cell communication analysis
  • Metastatic progression studies
  • Biomarker discovery

GATK Somatic Variant Calling Demo

https://github.com/ag48665/gatk-somatic-variant-calling-demo

Objective:
Build a reproducible tumor-normal somatic variant calling workflow demonstrating practical next-generation sequencing (NGS) analysis skills used in computational genomics and cancer bioinformatics.

Methods:

  • Tumor-normal paired sequencing workflow design
  • BWA and minimap2 read alignment
  • BAM sorting and indexing using SAMtools
  • Somatic SNV/indel calling using GATK Mutect2
  • Variant filtering using FilterMutectCalls
  • YAML-based workflow configuration
  • GitHub Actions CI/CD testing
  • Docker-based reproducible environment setup

Technologies:

  • Python
  • GATK Mutect2
  • SAMtools
  • BWA
  • minimap2
  • Docker
  • GitHub Actions
  • YAML
  • Linux command-line workflows

Applications:

  • Cancer genomics
  • Somatic mutation analysis
  • NGS workflow engineering
  • Reproducible bioinformatics pipelines
  • Computational oncology

AWS Cancer Survival Pipeline

https://github.com/ag48665/aws-cancer-survival-pipeline

Objective:
Build a cloud-ready, reproducible bioinformatics pipeline for RNA-seq statistical analysis, biomarker discovery, and survival modeling in lung cancer using TCGA-LUAD data.

Methods:

  • TCGA RNA-seq and clinical data processing
  • Differential expression analysis using DESeq2
  • Exploratory transcriptomic analysis and PCA
  • Kaplan–Meier survival analysis
  • Cox proportional hazards modeling
  • LASSO Cox feature selection
  • Reproducible workflow structure with Docker, Nextflow, GitHub Actions, and AWS-ready architecture
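
To make the Kaplan–Meier step concrete, here is a minimal pure-Python sketch of the product-limit estimator. It is an illustration only: the project itself lists R with `survival` / `survminer`, and the function name and toy follow-up data below are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times:  follow-up durations.
    events: 1 if the event (e.g. death) was observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Product-limit update: S(t) = S(t-) * (1 - d / n).
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= removed
    return curve


# Hypothetical toy cohort: five patients, two of them censored.
curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Censored subjects never lower the curve directly; they only shrink the risk set used at later event times.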

Technologies:

  • R
  • Bioconductor
  • DESeq2
  • TCGAbiolinks
  • survival / survminer
  • glmnet
  • Nextflow
  • Docker
  • GitHub Actions
  • AWS-ready project structure

Genomic Identification AI

https://github.com/ag48665/genomic_identification

Objective: Develop a computational genomics framework for forensic-style human identification from degraded and mixed DNA samples using machine learning, Bayesian inference, and deep learning.

Data: 1000 Genomes Project Phase 3 genotype data (chromosome 20 SNP profiles)

Methods:

  • Genomic SNP matrix construction
  • PCA and UMAP population structure analysis
  • Population-aware ancestry inference
  • DNA degradation robustness simulation
  • Bayesian posterior identity estimation
  • Entropy-based uncertainty analysis
  • Explainable AI using SNP feature importance
  • Deep autoencoder latent genomic embeddings
  • Interactive Streamlit dashboard development
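
The Bayesian posterior and entropy steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the candidate labels, likelihood values, and function names are all hypothetical.

```python
import math


def posterior(likelihoods, priors=None):
    """Bayes' rule over candidate identities.

    likelihoods: {candidate: P(observed genotypes | candidate)}.
    priors:      {candidate: prior probability}; uniform if omitted.
    """
    names = list(likelihoods)
    if priors is None:
        priors = {name: 1 / len(names) for name in names}
    unnormalized = {name: likelihoods[name] * priors[name] for name in names}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


def entropy_bits(probabilities):
    """Shannon entropy in bits; higher values mean a less certain identification."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical likelihoods for three candidate individuals.
post = posterior({"A": 0.9, "B": 0.09, "C": 0.01})
uncertainty = entropy_bits(post.values())
print(post, round(uncertainty, 3))
```

An entropy near zero indicates that one candidate dominates the posterior, while a value near `log2(number of candidates)` indicates that the sample, for example a heavily degraded one, barely discriminates between them.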

Technologies:

  • Python
  • PyTorch
  • scikit-allel
  • scikit-learn
  • NumPy
  • Matplotlib
  • Streamlit

Applications:

  • Forensic genomics
  • Population genetics
  • Probabilistic genomic inference
  • AI-assisted human identification
  • Explainable genomic machine learning

Transcriptomic Prognostic Signature for Lung Squamous Cell Carcinoma

https://github.com/ag48665/lusc-transcriptomic-prognostic-signature

Objective:
Develop and validate a robust survival prediction model for lung squamous cell carcinoma (LUSC) patients.

Data:
TCGA (training cohort), GEO external validation cohorts (GSE30219, GSE37745)

Methods:

  • Cox proportional hazards modeling
  • Elastic-net feature selection
  • Kaplan–Meier survival analysis
  • Time-dependent ROC analysis
  • Multivariable Cox regression
  • Calibration and decision curve analysis

NASA Spaceflight AI – RNA-seq Transcriptomics Analysis

https://github.com/ag48665/nasa-spaceflight-ai

Objective: Analyze NASA GeneLab RNA-seq transcriptomics datasets using bioinformatics, machine learning, and data visualization techniques to investigate gene expression variability under spaceflight conditions.

Data: NASA GeneLab RNA-seq datasets (GLDS-168, GLDS-245)

Methods:

  • RNA-seq transcriptomics preprocessing
  • Principal Component Analysis (PCA)
  • Heatmap visualization
  • Highly expressed gene analysis
  • K-Means clustering
  • Comparative transcriptomics
  • Machine learning-based exploratory analysis

Technologies:

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • Jupyter Notebook

Applications:

  • Space medicine research
  • Biomarker discovery
  • Precision medicine
  • AI-assisted bioinformatics
  • Transcriptomic anomaly detection

TCGA Lung Squamous Cell Carcinoma Transcriptomic Analysis

https://github.com/ag48665/tcga-lusc-biomarker-analysis

Objective:
Identify survival-associated gene expression programs and potential prognostic biomarkers in LUSC.

Methods:

  • TCGA data acquisition using TCGAbiolinks
  • Differential expression analysis (DESeq2)
  • Functional enrichment analysis (GO / KEGG)
  • Survival analysis

Immune Landscape of Lung Squamous Cell Carcinoma

https://github.com/ag48665/lusc-immune-escape-analysis

Objective:
Characterize immune heterogeneity in lung squamous cell carcinoma by analyzing immune activation, checkpoint signaling, and exhaustion-associated tumor states using TCGA and GEO datasets.

Methods:

  • Immune gene signature scoring
  • T-cell exhaustion profiling
  • Checkpoint signaling analysis
  • UMAP visualization
  • Survival analysis
  • External cohort validation
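
One simple way to score an immune gene signature is to z-score each gene across samples and average the z-scores per sample. The sketch below assumes that approach; the repository may use a different method (for example a rank-based enrichment score), and the expression values are made up for illustration.

```python
from statistics import mean, pstdev


def signature_scores(expression, genes):
    """Mean z-score per sample over the genes in a signature.

    expression: {gene: [expression value per sample]}, same sample order for every gene.
    genes:      gene symbols that make up the signature.
    """
    z_rows = []
    for gene in genes:
        values = expression[gene]
        mu, sd = mean(values), pstdev(values)
        # A gene with no variation across samples contributes zero to every score.
        z_rows.append([(v - mu) / sd if sd else 0.0 for v in values])
    # Transpose so each column holds one sample's z-scores, then average.
    return [mean(column) for column in zip(*z_rows)]


# Hypothetical expression values for three samples and three exhaustion-associated genes.
expression = {
    "PDCD1": [1.0, 2.0, 3.0],
    "LAG3": [2.0, 4.0, 6.0],
    "HAVCR2": [0.5, 1.0, 1.5],
}
scores = signature_scores(expression, ["PDCD1", "LAG3", "HAVCR2"])
print(scores)
```

Z-scoring first keeps highly expressed genes from dominating the signature, so each gene contributes on the same scale.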

Spaceflight-Induced Transcriptomic Adaptation Analysis

https://github.com/ag48665/spaceflight-rnaseq-analysis

Objective:
Investigate transcriptomic changes associated with spaceflight exposure using RNA-seq analysis to identify altered biological pathways and adaptive molecular responses.

Methods:

  • RNA-seq preprocessing and normalization
  • Differential gene expression analysis
  • Functional enrichment analysis (GO / KEGG)
  • Pathway-level interpretation
  • Exploratory transcriptomic visualization
  • Reproducible bioinformatics workflow development

Single-Cell RNA-seq Tumor Microenvironment Analysis

https://github.com/ag48665/tcga-lung-immune-evasion-scRNAseq

Objective:
Explore immune cell populations and functional states within the tumor microenvironment at single-cell resolution.

Methods:

  • Scanpy preprocessing and normalization
  • PCA / UMAP dimensionality reduction
  • Clustering analysis
  • Cell-type annotation

NGS Variant Calling Pipeline

https://github.com/ag48665/nextflow-variant-calling-pipeline

Objective:
Build a reproducible, containerized bioinformatics workflow for genomic variant calling using Nextflow and Docker.

The pipeline demonstrates core steps used in next-generation sequencing (NGS) analysis, including:

  • FASTQ quality control
  • sequence alignment
  • BAM processing
  • variant calling
  • reproducible workflow execution

This project also aimed to develop practical skills in:

  • workflow orchestration with Nextflow
  • Linux/WSL2 bioinformatics environments
  • Docker containerization
  • genomics data processing
  • Git and GitHub version control

Methods:

The workflow was implemented using Nextflow DSL2 and executed with Docker containers to ensure reproducibility.

Pipeline steps:

  1. FASTQ Quality Control

    • FastQC was used to assess sequencing read quality.
  2. Sequence Alignment

    • BWA MEM was used to align sequencing reads to the reference genome.
  3. BAM Processing

    • SAMtools was used to convert the SAM output to sorted BAM format.
  4. Variant Calling

    • BCFtools was used to identify genomic variants and generate VCF files.
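
The four steps above roughly correspond to the following command sequence, sketched here as a small Python helper that only builds the shell commands and does not run them. The actual workflow is written in Nextflow DSL2; the file names, the single-end read layout, and the exact flags are assumptions for illustration and may differ from what the repository uses.

```python
def variant_calling_commands(sample, reference):
    """Return shell commands for one single-end sample, in pipeline order.

    Assumes the reference is already indexed with `bwa index`
    and that the `qc` output directory exists.
    """
    fastq = f"{sample}.fastq.gz"
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        # 1. FASTQ quality control
        f"fastqc {fastq} -o qc",
        # 2. Sequence alignment (SAM written to stdout, redirected to a file)
        f"bwa mem {reference} {fastq} > {sam}",
        # 3. BAM processing: coordinate-sort, then index
        f"samtools sort -o {bam} {sam}",
        f"samtools index {bam}",
        # 4. Variant calling: pileup piped into the multiallelic caller
        f"bcftools mpileup -f {reference} {bam} | bcftools call -mv -Oz -o {vcf}",
    ]


commands = variant_calling_commands("sample1", "ref.fa")
for command in commands:
    print(command)
```

In the Nextflow version, each of these commands would typically live in its own process, with channels passing the intermediate files from one step to the next.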

Technologies used:

  • Nextflow
  • Docker
  • FastQC
  • BWA
  • SAMtools
  • BCFtools
  • Linux / WSL2

Single-cell RNA-seq Cell Atlas of Human PBMCs

https://github.com/ag48665/scrna-pbmc-cell-atlas

Objective:
Reconstruct immune cell populations from human PBMC single-cell RNA-seq data using an unsupervised Scanpy workflow.

Data:
Public PBMC3K dataset (10x Genomics), available through Scanpy (~2,700 human peripheral blood mononuclear cells)

Methods:

  • Single-cell RNA-seq quality control and filtering
  • Normalization and highly variable gene selection
  • PCA / UMAP dimensionality reduction
  • Leiden clustering
  • Marker gene identification
  • Cell-type annotation using canonical immune markers

Pilot Hypoxia Detection using Physiological Signals

https://github.com/ag48665/Pilot-Hypoxia-Detection-using-Physiological-Signals

Objective:
Develop a machine learning–based system for early hypoxia detection in pilots using physiological signals.

Data:
Multimodal physiological signals including heart rate, oxygen saturation, and respiration measurements

Methods:

  • Signal preprocessing and feature extraction
  • Time-series analysis of physiological data
  • Machine learning classification models
  • Model evaluation (accuracy, ROC, confusion matrix)
  • Data visualization and pattern detection
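
The evaluation metrics listed above can be illustrated with a small standard-library sketch. The labels and predictions are hypothetical (1 = hypoxic, 0 = normal), and the function name is illustrative rather than taken from the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical ground truth and classifier output for six time windows.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # share of hypoxic windows that were detected
specificity = tn / (tn + fp)  # share of normal windows correctly left alone
print(tp, fp, fn, tn, accuracy, sensitivity, specificity)
```

For an early-warning system, sensitivity is often the metric to watch, since a missed hypoxic episode (a false negative) is usually more costly than a false alarm.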

SOC Analyst Portfolio

https://github.com/ag48665/soc-analyst-portfolio

Cybersecurity analytics portfolio focused on SOC analysis, threat detection, incident investigation, SIEM-style log analysis, alerts, risk triage, and security monitoring workflows.


Technical Skills

Bioinformatics

RNA-seq analysis • differential expression (DESeq2) • survival modeling (Cox, Kaplan–Meier) • functional enrichment (GO / KEGG) • prognostic modeling • single-cell RNA-seq (Scanpy) • TCGA / GEO data analysis • biomarker discovery

Machine Learning & Data Analysis

Supervised learning • classification models • feature selection • model evaluation (ROC, AUC, confusion matrix) • time-series analysis • physiological signal processing

Programming

R (tidyverse, survival, DESeq2) • Python (pandas, numpy, scikit-learn, scanpy)

Tools & Methods

Linux • Git • reproducible workflows • statistical modeling • data visualization (ggplot2, matplotlib, seaborn) • data preprocessing • pipeline development


Contact

Email: agatagabara@gmail.com

LinkedIn: https://www.linkedin.com/in/agatha-gabara-06494a37/

Pinned

  1. lusc-transcriptomic-prognostic-signature (Public)

  2. healthcare-claims-risk-analytics-pipeline (Public, Python)

    End-to-end healthcare claims and risk analytics pipeline using SQL, Python, SQLite, Docker, and automated reporting for healthcare KPI monitoring, anomaly detection, and predictive analytics.

  3. healthcare-executive-dashboard-powerbi (Public)

    Power BI healthcare executive dashboard for hospital KPI monitoring, claims cost analysis, readmission risk, and ICU utilization.

  4. hospital-readmission-risk-sql-python (Public, Python)

    Healthcare analytics project predicting hospital readmission risk using SQL and Python with synthetic patient-level data inspired by Dutch healthcare analytics.

  5. mimic-iv-icu-risk-analytics (Public, Python)

    Healthcare analytics and ICU risk prediction project using real-world MIMIC-IV clinical data, SQL, Python, machine learning, and healthcare KPI analysis.

  6. sql-data-analysis-portfolio (Public)

    This project analyzes synthetic banking data to identify customer segments, transaction patterns, fraud risk indicators, and key financial KPIs using SQL.