LOEUF compatibility with gnomAD 4.1.1 #837

Open
likhitha-surapaneni wants to merge 1 commit into Ensembl:postreleasefix/116 from likhitha-surapaneni:update/LOEUF

@jamie-m-a jamie-m-a self-requested a review May 7, 2026 12:01
@jamie-m-a jamie-m-a self-assigned this May 7, 2026

@jamie-m-a jamie-m-a left a comment

There are some issues with the file around RefSeq transcripts, and the current sort drops a line.

Comment thread LOEUF.pm
These files can be tabix-processed by:
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (head -n 1 && tail -n +2 | sort -t$'\t' -k 9,9 -k 10,10n ) > loeuf_temp.tsv
sed '1s/.*/#&/' loeuf_temp.tsv > loeuf_dataset.tsv
bgzip loeuf_dataset.tsv

OK, first up: the sort and zip can be combined, and the current sort is losing a line. This is better and a bit faster:
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (sed -u 1q; sort -k 9,9 -k 10,10n) | sed '1s/.*/#&/' | bgzip -c - > loeuf_dataset.tsv.bgz
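
A minimal sketch of the header-preserving sort, run on toy data rather than the real gnomAD file. The three-column layout, the `T1`..`T3` identifiers and the variable names are hypothetical; the real command keys on columns 9 and 10. The explicit tab separator is carried over from the original command as an assumption, and `sed -u` assumes GNU sed. The idea is that `sed -u 1q` reads exactly one line unbuffered, so everything after the header is still available to `sort`:

```shell
# Literal tab, built portably so the sort separator is unambiguous.
TAB=$(printf '\t')

# Toy stand-in for the decompressed table (hypothetical layout):
# column 2 = chromosome, column 3 = start position.
toy=$(printf 'transcript\tchrom\tstart\nT3\tchr2\t50\nT1\tchr1\t200\nT2\tchr1\t30\n')

# sed -u 1q prints the header and quits without buffering past it;
# sort then receives every remaining row. The final sed comments the header.
sorted=$(printf '%s\n' "$toy" \
  | (sed -u 1q; sort -t "$TAB" -k 2,2 -k 3,3n) \
  | sed '1s/^/#/')

printf '%s\n' "$sorted"
```

Counting lines before and after the pipeline (for example with `wc -l`) is a quick way to check that no row was lost.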

Comment thread LOEUF.pm
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (head -n 1 && tail -n +2 | sort -t$'\t' -k 9,9 -k 10,10n ) > loeuf_temp.tsv
sed '1s/.*/#&/' loeuf_temp.tsv > loeuf_dataset.tsv
bgzip loeuf_dataset.tsv
tabix -f -s 9 -b 10 -e 11 loeuf_dataset.tsv.gz

However, when you try to tabix either file you run into a bunch of errors, because not every row in the file has chr and coordinate data: basically only the ENSG rows have that, while the RefSeq rows have NA, which breaks tabix. We either have to skip the RefSeq entries (grep -v NM* on transcript_id) or insert the correct coordinates for those.
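
One way the "skip" option could look, sketched on toy data. The four-column layout, the `ENST0001`/`NM_0001` identifiers and the variable names are hypothetical, and filtering on `NA` in the chromosome and start columns (rather than on the transcript_id prefix) is an assumption about how the affected rows can be recognised:

```shell
# Literal tab for awk's field separator.
TAB=$(printf '\t')

# Toy table (hypothetical layout): column 2 = chromosome, 3 = start, 4 = end.
toy=$(printf '#transcript\tchrom\tstart\tend\nENST0001\tchr1\t100\t200\nNM_0001\tNA\tNA\tNA\nENST0002\tchr2\t50\t90\n')

# Keep the header (line 1) plus rows whose chromosome and start are populated.
# c and s are the column numbers; the thread's real file would use 9 and 10.
filtered=$(printf '%s\n' "$toy" \
  | awk -F "$TAB" -v c=2 -v s=3 'NR == 1 || ($c != "NA" && $s != "NA")')

printf '%s\n' "$filtered"
```

On the real file, a filter like this would sit between the header-commenting step and `bgzip`, so that only rows with coordinates reach tabix. It drops the RefSeq rows rather than rescuing them, so it does not cover the second option of inserting correct coordinates.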
