LOEUF compatibility with gnomAD 4.1.1 #837
likhitha-surapaneni wants to merge 1 commit into Ensembl:postreleasefix/116 from
jamie-m-a
left a comment
There are some issues with the file around RefSeq transcripts, and the current sort drops a line.
These files can be tabix-processed by:
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (head -n 1 && tail -n +2 | sort -t$'\t' -k 9,9 -k 10,10n ) > loeuf_temp.tsv
sed '1s/.*/#&/' loeuf_temp.tsv > loeuf_dataset.tsv
bgzip loeuf_dataset.tsv
OK, first up: the sort and zip can be combined, and the current sort is losing a line. This is better and a bit faster:
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (sed -u 1q; sort -k 9,9 -k 10,10n) | sed '1s/.*/#&/' | bgzip -c - > loeuf_dataset.tsv.bgz
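For context, the likely cause of the lost data is that `head -n 1` can read more than one line from the pipe before exiting, so `tail -n +2` then starts from whatever is left and skips a further line. Below is a minimal sketch of the same header-preserving sort that does not rely on `sed -u`. The function name `sort_keep_header` is hypothetical, and the explicit tab separator is an assumption carried over from the original README command rather than part of the suggestion above.

```shell
# Hypothetical helper: comment out the header line, then sort the body
# by column 9 (sequence name) and numerically by column 10 (start).
sort_keep_header() {
    # The shell's read consumes exactly one line from stdin,
    # so every remaining line is still available to sort.
    IFS= read -r header || return 1
    printf '#%s\n' "$header"
    # Assumption: fields are tab-separated, as in the original command.
    LC_ALL=C sort -t "$(printf '\t')" -k 9,9 -k 10,10n
}

# Possible usage (requires the gnomAD file and bgzip):
# zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | sort_keep_header | bgzip -c > loeuf_dataset.tsv.bgz
```

Comparing the row count of the input with the row count of the output is a quick way to confirm that nothing was dropped.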
tabix -f -s 9 -b 10 -e 11 loeuf_dataset.tsv.gz
However, when you try to tabix either file you run into a bunch of errors, because not every row in the file has chr and sequence data: basically only the ENSG rows have that, while the RefSeq rows have NA, which breaks tabix. We either have to skip the RefSeq entries (grep -v NM* on transcript_id) or insert the correct coordinates for them.
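One way to sketch the "skip" option is to filter on the coordinate columns themselves rather than on the transcript ID pattern. This is an alternative to the `grep -v` idea, not what the PR currently does. It assumes missing coordinates appear as the literal string `NA` in columns 9 to 11 (the columns the tabix command uses), and the function name `drop_rows_without_coords` is hypothetical.

```shell
# Hypothetical helper: keep the header line and only those rows whose
# sequence name, start and end (columns 9, 10, 11) are all present.
drop_rows_without_coords() {
    # Assumption: missing values are written as the literal string "NA".
    awk -F '\t' 'NR == 1 || ($9 != "NA" && $10 != "NA" && $11 != "NA")'
}

# Possible usage ahead of the sort (requires the gnomAD file, bgzip and tabix):
# zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | drop_rows_without_coords \
#   | (sed -u 1q; sort -k 9,9 -k 10,10n) | sed '1s/.*/#&/' | bgzip -c - > loeuf_dataset.tsv.bgz
# tabix -f -s 9 -b 10 -e 11 loeuf_dataset.tsv.bgz
```

Filtering on the columns tabix actually reads would also catch any non-RefSeq row that lacks coordinates, at the cost of silently dropping those transcripts from the plugin data.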
#836