I've recently been working on further developing a bioinformatics pipeline called YMAP [1] that I wrote during grad school. YMAP was originally designed for genomes on the scale of the 16 Mbase genome of Candida albicans, and primarily handled short-read Illumina sequencing data (as well as data from a custom microarray I designed that basically nobody ever used).
Eventually, I'll release an updated version of the tool as YMAP2 with an associated research paper describing it in some detail. For now, the development copy is only running on my home system, where I can test it against various datasets as I implement feature improvements.
One of the major feature updates is to allow the tool to process the data produced by recent long-read DNA sequencing technologies. Recently, some users of the existing public YMAP server tried to process long-read data through it even though it was not built to handle such data. The result was that some tool or other in the pipeline would tie up all of the memory on the server and crash the web server. (I had to hot-fix the live server so that it recognizes and blocks such attempts.)
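A check of this kind could look roughly like the sketch below. It is not the actual hot-fix on the YMAP server. The function name `looks_like_long_reads`, the sample size, and the 600-base cutoff are all assumptions chosen for illustration, and the parser assumes plain four-line FASTQ records.

```python
from statistics import median


def looks_like_long_reads(fastq_lines, sample_size=1000, max_short_read_len=600):
    """Guess whether a FASTQ stream holds long reads.

    Hypothetical helper: samples the first `sample_size` reads and compares
    their median length to `max_short_read_len` (an assumed cutoff).
    Assumes strict four-line FASTQ records (header, sequence, '+', quality).
    """
    lengths = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            lengths.append(len(line.strip()))
            if len(lengths) >= sample_size:
                break
    if not lengths:
        raise ValueError("no reads found in input")
    return median(lengths) > max_short_read_len
```

Sampling only the first reads keeps the check cheap enough to run at upload time, before any memory-hungry alignment step starts. Passing an open file handle as `fastq_lines` works, since the function only iterates over lines.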
I've also been updating it to handle much larger genomes, like the 600 Mbase genome of Phaseolus vulgaris. To test this, I recently downloaded some PacBio sequence data from the Sequence Read Archive for P. vulgaris cv. Flavert [2]. The figure below shows the number of aligned sequence reads on the y-axis vs the genome position on the x-axis. The lines drawn down from the center of each chromosome cartoon represent areas where there were fewer reads than expected, and the lines drawn up represent areas where there were more reads than expected. In high-quality data from yeast experiments, this sort of variation would reflect different regions of the genome being present in different copy numbers, so duplication or deletion of genome regions would be easy to see. However, with this dataset, something else is going on.
| YMAP plot produced by analyzing SRR23332460.1 [2] data against the Phaseolus vulgaris cv. Flavert reference genome [3]. |
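The read-depth idea behind a plot like this can be sketched in a few lines of Python. This is a minimal illustration and not YMAP's actual code: the function name `binned_depth_ratios`, the fixed bin size, and the use of the genome-wide median bin count as the "expected" depth are all assumptions.

```python
from statistics import median


def binned_depth_ratios(read_starts, chrom_lengths, bin_size):
    """Count aligned reads per fixed-size bin and scale by the expected count.

    read_starts   : iterable of (chromosome, 0-based start position) pairs
    chrom_lengths : dict mapping chromosome name to its length in bases
    bin_size      : bin width in bases (illustrative choice)

    "Expected" here is the median count over all bins in the genome,
    which is an assumption made for this sketch.
    """
    counts = {
        chrom: [0] * ((length + bin_size - 1) // bin_size)
        for chrom, length in chrom_lengths.items()
    }
    for chrom, pos in read_starts:
        if chrom in counts and 0 <= pos < chrom_lengths[chrom]:
            counts[chrom][pos // bin_size] += 1

    all_bins = [c for bins in counts.values() for c in bins]
    expected = median(all_bins)
    if expected == 0:
        raise ValueError("median bin count is zero; use larger bins")

    # Ratios below 1 would be drawn downward, ratios above 1 upward.
    return {chrom: [c / expected for c in bins] for chrom, bins in counts.items()}
```

With real data, the read positions would come from the alignment file, and a ratio near 0 across half a chromosome would show up as the long run of downward lines seen on chromosome 6.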
Let's take a close look at chromosome 6. Roughly half of the chromosome had very little representation in the dataset, while the other half was over-represented.
I had recently come across a 2022 paper by McClean et al. [4] showing that chromosome 6 of P. vulgaris has a large region of heterochromatin covering its left half. Heterochromatin is a highly condensed form of genomic DNA that often surrounds the centromere, reduces gene expression, and prevents recombination.
| Figure 2 from McClean et al 2022. https://pmc.ncbi.nlm.nih.gov/articles/PMC9009181/ |


