Ants, Bees, Genomes & Evolution @ Queen Mary University London

Genome analyses for emerging model organisms

Using modern molecular tools on emerging (non-model) organisms makes it possible to address exciting new questions. But the data aren’t as perfect as they should be. In particular, genomes created from Roche 454, Illumina or ABI SOLiD sequence are fragmented. You wish you’d get a FASTA file with one long sequence per chromosome. Dream on! You get sequences for dozens to thousands of scaffolds. Each scaffold is a series of contigs, separated by stretches of unresolved NNNNNNNNN sequence (usually repetitive regions). But the assembler knows these contigs are adjacent thanks to paired reads.

[Figure: contigs joined into a scaffold]
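To make the scaffold/contig relationship concrete, here is a minimal Python sketch that splits each scaffold of a FASTA file back into its contigs at the runs of N and reports how long each unresolved gap is. The function names (`read_fasta`, `split_scaffold`) and the `min_gap` parameter are my own illustrative choices, not part of any assembler's output or tooling, and the toy sequences are made up.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# A gap is any run of one or more N (or n) characters.
GAP = re.compile(r"[Nn]+")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_scaffold(seq: str, min_gap: int = 1) -> Tuple[List[str], List[int]]:
    """Split one scaffold into contigs at N runs of at least min_gap bases.

    Returns (contigs, gap_lengths). Empty contigs (from leading or
    trailing N runs) are dropped, so the two lists are not guaranteed
    to differ in length by exactly one.
    """
    contigs, gaps = [], []
    start = 0
    for match in GAP.finditer(seq):
        gap_len = match.end() - match.start()
        if gap_len < min_gap:
            continue  # treat very short N runs as part of the contig
        contigs.append(seq[start:match.start()])
        gaps.append(gap_len)
        start = match.end()
    contigs.append(seq[start:])
    return [c for c in contigs if c], gaps


if __name__ == "__main__":
    # Toy data: scaffold1 has two gaps (10 and 5 Ns), scaffold2 has none.
    toy_fasta = [
        ">scaffold1 toy example",
        "ACGTACGT" + "N" * 10 + "GGCC",
        "TTAA" + "N" * 5 + "CCGG",
        ">scaffold2",
        "ACGTACGT",
    ]
    for name, seq in read_fasta(toy_fasta):
        contigs, gaps = split_scaffold(seq)
        print(name, len(contigs), "contigs; gap lengths:", gaps)
```

For a real assembly you would pass an open file handle to `read_fasta` instead of a list. Keep in mind that the number of Ns in a gap is the assembler's estimate (or sometimes just a fixed placeholder length), so treat the gap lengths as approximate.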

Genome fragmentation can make things challenging. Some tips from my experience with ant genomes:

How can you determine what is inside the unresolved poly-NNNNN sequence without genome walking or PCR and sequencing? Getting the whole thing will be difficult. But it’s easy to get a little:

**Are two scaffolds adjacent?**

How good is good enough? Some sequence/data/scaffolds/models are missing or mediocre! But no biological dataset is ever perfect. If you’re trying to make your emerging model organism’s data perfect… you’ll get nowhere fast. The 20% effort that brings you 80% of the way will probably be good enough to answer your exciting biological question.


September 21, 2011