Ants, Bees, Genomes & Evolution @ Queen Mary University London

UNIX/bioinformatics teaching cloud

November 1, 2020

Getting into big data science can be a big leap if you’re a biologist who is new to the command-line.

We try to cut that down into a series of smaller, more manageable steps.

As part of that, we run a hands-on genome bioinformatics course that introduces students to UNIX, and covers topics from Illumina read cleaning to genome assembly, annotation, population genomics and genome-wide association mapping.

For obvious 2020 reasons, we needed to do this online in a manner that:

  • has manageable costs but sufficient power for genomics analyses;
  • is easy for students to access autonomously;
  • provides students the flexibility to work when they want (timezones) from where they want;
  • can be easily modified by us as needed;
  • doesn’t require students to install complex software (Docker, VirtualBox, Linux subsystem…) that is difficult to troubleshoot.

We built it.

Process for students

A student wanting to access their Linux machine must:

  • connect to http://switch.genomicscourse.com (currently offline)
  • enter their login/password
  • click a button to switch on their virtual computer (this creates their personal virtual computer and shows its IP address & hostname)

The student can connect to the computer by ssh, and download or visualise files in a web browser by putting them in a designated folder in their home directory (~/www).

If a student forgets to switch off their computer, it is switched off automatically after 30 minutes of idle time.
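
The idle rule can be sketched in a few lines. This is only an illustration, not the code behind our setup: the function name and the idea of a single "last activity" timestamp are assumptions, and the call that actually stops the EC2 instance is left out.

```python
from datetime import datetime, timedelta

# 30 minutes, as stated in the text; everything else here is hypothetical.
IDLE_LIMIT = timedelta(minutes=30)


def should_switch_off(last_activity: datetime, now: datetime,
                      idle_limit: timedelta = IDLE_LIMIT) -> bool:
    """Return True once the machine has been idle for at least idle_limit."""
    return now - last_activity >= idle_limit


if __name__ == "__main__":
    last_seen = datetime(2020, 11, 1, 14, 0)
    # 29 minutes of idle time: keep the machine running.
    print(should_switch_off(last_seen, datetime(2020, 11, 1, 14, 29)))
    # 30 minutes of idle time: switch it off.
    print(should_switch_off(last_seen, datetime(2020, 11, 1, 14, 30)))
```

A check like this would need to run periodically (for example once a minute) for each running machine.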

For course administrators

This is great for us as organisers because:

  • it avoids paying for cloud computing instances that are not being used;
  • it allows us to give students more CPU and RAM;
  • it requires no physical rooms: anyone can connect from any computer;
  • it ensures that all students use the same setup.

This uses Amazon EC2 infrastructure, and thus scales easily to any number of students, and can use computers with small or large amounts of CPU power or RAM.

Interested?

We can potentially deploy our solution for other courses. If you’re interested, get in touch.

Example screenshots


/img/news/2020-11-01-unix_bioinf_cloud/panel.png

Press release: Degeneration of Gene Expression on Social Supergene

August 25, 2020

Scientists from Queen Mary University of London have found that harmful mutations accumulating in the fire ant social chromosome are causing its breakdown.

The chromosome, first discovered by researchers at the University in 2013, controls whether the fire ant colony has either one queen or multiple queens. Having these two different forms of social organisation means the species can adapt easily to different environments and has resulted in them becoming a highly invasive pest all over the world, living up to their Latin name Solenopsis invicta, meaning “the invincible”.

For the new study, published in eLife, the research team performed detailed analyses of the activity levels of all the genes within the social chromosome for the first time to understand how it works and its evolution. They found that damaging mutations are accumulating in one version of the social chromosomes, causing it to degenerate. The findings also showed that most of the recent evolution of these chromosomes stems from attempts to compensate for these harmful mutations. Natural selection is the main evolutionary mechanism that helps to optimise genes over generations but normally, it cannot simultaneously optimise genes for two different types of social organisation within one species.

To overcome this evolutionary conflict, social chromosomes group together genes adapted to each type of social form. The results of the new study show that this solution prevents the removal of harmful mutations from the genome and as a result, these mutations accumulate over time and begin to dominate the fate of the system.

The social chromosomes in fire ants are a rare example of a direct link between genes and social behaviour. They work in a similar way to the X and Y chromosomes in humans, which determine sex. This discovery has wider ecological and medical implications because genomic structures similar to social and sex chromosomes can not only help species adapt to changing environments but also underpin diseases such as cancer.

Dr. Martínez-Ruiz, lead author of the study from Queen Mary University of London, said: “Our results show that the initial benefit of nature combining genes into a social chromosome has a cost. One million years later, most of the differences we see between social chromosomes are due to the accumulation of negative mutations.”

“We also see that the rest of the genome adapts very quickly in response to negative mutations,” added Dr. Wurm, Reader in Bioinformatics at Queen Mary and senior author of the study. “This is how evolution works, by adding patches to imperfect solutions, rather than by finding the most efficient solution.”

“Despite the degeneration of the social chromosomes, the fire ants are unlikely to lose them anytime soon. This would require another major chromosomal reshuffling - such events are rare and usually lethal,” Dr. Wurm continued. “However, over long evolutionary timescales, anything is possible. Most of the 20,000 species of ants either have only single-queen colonies or only multiple-queen colonies. We are now trying to understand whether social chromosomes are required for changes in social organisation.”

The study builds on earlier research by the authors on the evolution of social chromosomes. They have previously identified differences in genes for chemical communication that may be responsible for perceiving queens, showed that one social chromosome supergene variant has doubled in size, and that this social chromosome supergene lacks genetic diversity.

Research paper: ‘Genomic architecture and evolutionary antagonism drive allelic expression bias in the social supergene of red fire ants’ Carlos Martinez-Ruiz, Rodrigo Pracana, Eckart Stolle, Carolina I. Paris, Richard A. Nichols, and Yannick Wurm, 2020. eLife, 9, p.e55862. Available at https://doi.org/10.7554/eLife.55862.

Fire ants on Illumina MiSeq chip

Workers of the red fire ant Solenopsis invicta on an Illumina MiSeq sequencing chip used to analyse the genes in their social chromosome (credit: E Favreau & Y Wurm).

Getting a good internet connection from out in the boonies

July 24, 2020

Ok so Covid made Yannick et al head for the woods. But what about internet access?

It took some time but I figured it out. It is now fast and reliable. Low ping. Fast download. Fast upload. Amazing.

I suspect others in remote places could benefit from what I learnt. Two things were needed.

chateau.png

A high performance 4G/LTE router

The MikroTik Chateau is incredible.

Put a SIM card inside, and it’s already much faster than:

  • a phone,
  • a mifi hotspot,
  • or a “simple” 90 euro 4G/LTE router like the Dlink DWR 921.

But plug in a dual antenna and it’s crazy fast.

fast_speedtest_chateau_lte_mimo.png

FYI, those 180 Mbps are despite having only 3 bars of reception.

It turns out that this router is Category 12 LTE, which means that it can combine 3 connections to the cell network at once (carrier aggregation). So you get more combined bandwidth, and more resilience to interruptions - say if one of those connections were to become overloaded or fail.

MikroTik design and build these in Latvia. Most of their stuff is geared towards professionals. So the user interface offers immense flexibility - but is not easy to use. And it didn’t just “plug and play”.

FWIW the MikroTik Chateau is ~200 GBP in the UK. If out of stock, or you want something more user friendly, the Netgear Orbi is super-fast, or the TP-Link Archer MR600 is cheap but slower (Cat 6). All of these have antennas built in - so if signal is strong enough, no external antenna is needed.


A Yagi MIMO directional antenna on the roof

This had to be pointed at the nearest 4G phone tower - which I located using this handy map.

This made a huge improvement in cell reception.

(A pair of directional antennas of this type is ~100 GBP.) If you want to avoid the hassle of precise pointing, at the cost of lower sensitivity, get an omni-directional antenna that can just be stuck to the wall or window.

coflex_mimo_lte_antenna.jpg

MikroTik Chateau configuration

Every small thing you could imagine wanting can be tuned on this router - and bazillions of things I am not even close to imagining (!).

However, it didn’t work right away. I had to specifically:

1. Set the APN

Following MikroTik’s help documents, I did this in the Terminal interface:

/interface lte apn add name=internet.it apn=internet.it use-network-apn=no
/interface lte set lte1 apn-profiles=internet.it

2. In the Quick interface

  • Specifically tell it to use both antennas
  • Rename network, add password
  • Update the OS to the latest development version

3. In the WebFig interface

  • Tell it to accept incoming SMS


Getting internet while traveling

On the road (train, hotel rooms…), tethering to the iPhone is sometimes ok… but throughput is really much better with a dedicated device.

I stick a SIM card into a Netgear Aircard 790. It’s wonderful and works right away; their newer Nighthawk M2 is likely even better. If signal is weak, I take a portable mini LTE-antenna that can just be plugged into the Aircard or Nighthawk and stick that to the window. Obviously, if signal is very weak, a bigger antenna setup (like above) is needed…

Research Highlight: Degenerative expansion of the fire ant supergene

April 28, 2020

Non-recombining variant of a young supergene is larger than the normally-recombining variant

You know how a Y chromosome is usually smaller than an X chromosome, and it contains fewer genes?

We tested whether this pattern holds for the fire ant social chromosomes: is the b (which is similar to Y) smaller than the B (which is similar to X)?

Surprisingly, the opposite is true: the non-recombining b is almost twice as big as the B.

/img/news/supergene_expansion.png

This is due to a process called “degenerative expansion”. We found the first evidence of it in an animal (there was a previous report in papaya).

So why is the b variant of the fire ant supergene expanding?

It turns out that immediately after recombination is suppressed, massive loss of genes is too costly.

Instead, degeneration of a non-recombining region occurs in 3 steps:

  1. Accumulation of “mildly” deleterious mutations (e.g., repeats) ➜ Slow degradation of functional elements ➜ Degenerative expansion
  2. Compensatory mutations in the rest of the genome are selected for (e.g., dosage compensation, gene relocation)
  3. Cost of losing genes is lower ➜ Bigger chunks of the genome can now be lost

Why is it rare to observe the growth of a non-recombining region?

In most sex chromosome and other supergene systems, we have only seen the end of the 3rd phase. The systems studied are typically millions to hundreds of millions of years old.

However, in the fire ant social supergene system we’ve been able to observe the first phase. This is probably for four reasons: the fire ant supergene is young; we used extra-long molecule sequencing; ant males are haploid, which makes it easy to detect differences between haplotypes; and this haploidy also creates strong purifying selection against gene loss.

For more details, check Eckart & Roddy’s paper: Degenerative expansion of a young supergene. Molecular Biology and Evolution 2019.

Successful PhD & MSc completions & new members

April 25, 2020

Long time no update (!)

Party Time!

Huge congratulations to the lab’s three new 2019 PhD graduates:

Similarly, big congrats to the four MSc students we hosted in 2019: Valentine Patterson and Richard Burns are now pursuing PhDs (respectively in Mainz and at King’s College). Iwo Pieniak and Catherine Okuboyejo are now data engineers.

All will be sorely missed.

Additional congrats to the lab’s former postdocs Joe Colgan and Eckart Stolle, who have both moved, or are moving, to new junior group leader positions in Germany.

Hellos

Simultaneously we have welcomed a bunch of happy faces: Gabriel Hernandez-Gomez, Alicja Witwicka and Guy Mercer as new PhD students, and Anindita Brahma as a new Marie Sklodowska-Curie fellow. We look forward to working together!

Opportunities

As detailed on our PhD and postdoc opportunities page, we are always eager to work with bright & motivated people, regardless of your background. Don’t hesitate to get in touch.

We will soon have an opening for a BBSRC-funded postdoctoral position looking at the molecular mechanisms underpinning the ability of pollinators to respond to environmental challenges. Stay tuned.

Open PhD studentship: Evolutionary genomics of social insects

February 27, 2019

We have an exciting PhD position open through the London NERC DTP.

Apply by March 18th on the QMUL website.

The studentship, funded by the London NERC DTP, will cover tuition fees and provide an annual tax-free maintenance allowance for 4 years at the Research Council rate (£17,009 in 2019/20). Candidates must meet RCUK eligibility criteria (I think this means ok for UK citizens and medium-term residents).

The project is highly interdisciplinary.

Great candidates fulfill at least 3 of the following 4 criteria:

  • smart
  • hard working
  • understands genomes or social insects
  • not scared of data analysis or coding.

We can adapt the project to the student’s interests and background.

If you have any questions regarding prerequisites, scope or nature of the project, please don’t hesitate to get in touch with me (Yannick).

Research context

We have two main lines of research, in collaboration with national and international colleagues and stakeholders.

Genetics of social behaviour. Social animals exhibit a broad range of behaviors, and some theoretical understanding exists of the tradeoffs between different forms of social organisation. However, we know little about the genes and processes underpinning social organisation or how it evolves. The diversity of social behaviors across the 20,000 species of ants represents a unique opportunity to empirically understand the mechanisms and tradeoffs involved in social change. We use highly molecular approaches, including genomics and bioinformatics, but also potentially behavioural or field work, to address major questions about social evolution. We aim to generate exciting new insights into genes and processes underpinning a major social transition, with implications for understanding the evolution of complex phenotypes.

Molecular diagnostics for pollinator health. Effective pollination is crucial for the stability of the ecosystem, and for crop productivity. Governments had approved what they thought were “safe” levels of pesticides. But in fact, the pesticides are generic neurotoxins: they reduce the learning abilities, dexterity, foraging ability and ultimately survival of pollinators who consume nectar or pollen. As a result, several commonly used pesticides have now been banned. However, the problem may just have been shifted: we lack a good way of understanding whether authorised pesticides are better. Thus there is an urgent need for approaches that are more powerful/sensitive. The 50,000-fold drop in the cost of DNA sequencing over the past 10 years has completely changed medical research and practice. Inspired by these changes, we are developing high-resolution molecular diagnostics approaches for pollinator health – these are poised to fundamentally change for the better how research on pesticides is performed and the mechanisms through which such crop chemicals are evaluated by regulatory agencies.

Training

The student will receive extensive training in big data bioinformatics, phylogenomics, data visualisation, and experimental research approaches in evolution and genomics. Furthermore, they will receive hands-on training in interdisciplinary project management, communicating science in writing and verbally, including by presenting at workshops and conferences.

IUSSI conference talk: Better analyses for social insect genomics

October 9, 2018

Social insect biology is now a data science!

I (Yannick) spent the week of August 5th at the 18th Congress of the International Society for the Study of Social Insects in Guarujá, Brazil. This is a big quadrennial conference uniting researchers from around the world who study ants, bees, wasps, termites and a few other animals.

Part of my trip was funded by the Software Sustainability Institute which lobbies for and helps people do better research through improving software. Hence this blog post.

The study of social insects has traditionally used approaches including behavioral observation and taxonomic sampling, with genetic analyses becoming more common since the mid 2000s. A pleasant surprise at the conference was the recent increase in highly molecular, genome-wide approaches where whole or partial genomes or transcriptome sequences of many individuals are obtained in order to make specific comparisons within species, or sometimes also between species.

This disruptive shift is largely due to the 50,000-fold drop in DNA sequencing costs over the past 10 years. See Émeline’s recent review on the genes and processes underpinning evolution of social behavior in ants.

With great power comes great responsibility.

A major challenge for small research labs now wielding large genomic datasets is that it is easy to make a small mistake that has high costs.

In light of this, as part of a workshop on genomics approaches organised with Tim Linksvayer and Alex Mikheyev, I gave an overview of some of the lessons we can transfer from the worlds of “other” data sciences to our expanding world of social insect genomics. This includes:

  • writing analysis code for humans;
  • respecting style guides for code (e.g., R style guide), and for how to structure a genomic analysis;
  • benefits of peer-reviewing code, and of peer-coding sessions;
  • using specific tools that increase productivity while decreasing risks (rmarkdown, fat machines, snakemake/nextflow);
  • benefits of visualising data in many different manners. Typically when people learn to do basic linear models they learn the importance of visually inspecting some plots (e.g. qqplot, residuals). But when we end up performing tens of thousands of such analyses (e.g. one for each gene or one for each SNP), many forgo doing this.

My slides are here:


It is worth highlighting three additional, important points raised during the congress that have more to do with interpretation, vocabulary and experimental design than anything technical:

  1. There is an occasional misconception (or mislabeling) that extant species may be representative of species that lived in the past. No: exactly as much time has passed between the most recent common ancestor of all ants and today’s Pheidole pallidula as between that ancestor and today’s Harpegnathos saltator. Similarly, no current species of great ape is “more similar” to any ancestor of humans - all are equally similar to their shared common ancestor.
  2. The definition of eusocial has become too fuzzy to be useful. Superorganismality is a much more precise and relevant concept that clearly identifies irreversible evolutionary transitions from context‐dependent reproductive altruism to unconditional differentiation of permanently unmated castes. See also Koos’ paper Superorganismality and caste differentiation as points of no return: how the major evolutionary transitions were lost in translation.
  3. Comparisons (e.g. of genome content) between two species are often confounded by many differences other than the first two that come to mind (ecology, lifespan, environment, demographic history etc…).

A fun and highly stimulating conference.

Project structures for genomics analyses

October 1, 2018

How do you structure your files and folders for genomics analyses?

One challenge is that many analyses actually require multiple steps, thus having all steps in one place becomes a mess.

So we should structure our analyses across multiple folders. But how should we name them and keep track of their order?

Another (key) challenge in performing genomics analyses is that we often have to perform analyses multiple times, because:

  • we need to try three different approaches because we don’t know which will perform best;
  • or we want to try a new version of the analysis software;
  • or we want to start with a small “test” dataset before scaling up to the full data;
  • or we want to redo everything on a completely different dataset;
  • or a reviewer asks for a minor adjustment in analysis or an additional plot on the data we analyzed months/years ago.

So how do we keep track of the different steps and versions of analyses?

The standard approach we use for all projects in the lab is derived from ideas initially proposed by William Noble in A Quick Guide to Organizing Computational Biology Projects. That initial model has been adjusted based on our experience of dozens of projects over the years, as well as discussions with Julien Roux, Anurag Priyam, and Roddy Pracana.

Stable link here.

Best to just illustrate with an example of how this works at the simplest level.

Example:

2016-04-14-bombus_variant_calling
├── input
│   ├── 2016-04-14-bombus_raw_28_samples
│   │   ├── sample1.fq    #  could link to /data/SBCS-WurmLab/archive/db/genomic/reads/...                 
│   │   ├── sample2.fq 
│   │   ├── sample3.fq
│   │   ├── bombus_genome.fa -> ~/db/genomic/B_terrestris/Bter20110317-genome.fa
│   │   └── WHATIDID.sh  # list of ln -s, cp or wget/curl commands 
│   └── 2016-04-16-cleaned_reads
│       ├── sample1.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample1.clean.fq.gz
│       ├── sample2.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample2.clean.fq.gz
│       ├── sample3.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample3.clean.fq.gz
│       └── WHATIDID.sh  # just the ln -s commands.
├── results
│   ├── 2016-04-14-read_cleaning
│   │   ├── input        -> ../../input/2016-04-14-bombus_raw_28_samples
│   │   ├── results                                # only few files here
│   │   ├── sratoolkit   -> ../../soft/sratoolkit-2.4.2/bin/
│   │   ├── tmp                                    # use real scratch dir if more appropriate
│   │   ├── ENVIRONMENT.sh                         # if any particular software, modules or containers need to be loaded
│   │   └── WHATIDID.txt                           # or equivalent .sh or .Rmd (or knitr/jupyter)
│   ├── 2016-04-16-mapping_to_reference
│   │   ├── input        -> ../../input/2016-04-16-cleaned_reads
│   │   ├── results                                # only few files here
│   │   ├── tmp                                    # use real scratch dir if more appropriate
│   │   ├── ENVIRONMENT.sh                         # if any particular software, modules or containers need to be loaded
│   │   └── WHATIDID.txt                           # or equivalent .sh or .Rmd (or knitr/jupyter)
│   └── WHATIDID.txt                               # for overall rationale
└── soft
    ├── sratoolkit-2.4.2                           # if installed locally
    ├── bwa              -> /share/apps/sbcs/bwa/0.6.2/bin/bwa
    └── # links to other software if needed
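
Setting up a new analysis step by hand gets repetitive, so here is a minimal sketch of how it could be scripted. It is not a tool we ship: the function name create_analysis_step and its arguments are made up for illustration, and it only creates the pieces shown in the tree above (a relative input link, results, tmp and an empty WHATIDID.txt).

```python
from datetime import date
from pathlib import Path
import tempfile


def create_analysis_step(project: Path, input_name: str,
                         description: str, day: date) -> Path:
    """Create results/YYYY-MM-DD-description/ inside an existing project."""
    step = project / "results" / f"{day.isoformat()}-{description}"
    (step / "results").mkdir(parents=True)
    (step / "tmp").mkdir()
    # Relative link, so the whole project can be moved without breaking it.
    (step / "input").symlink_to(Path("..") / ".." / "input" / input_name,
                                target_is_directory=True)
    # Empty log file, to be filled with the commands that were run.
    (step / "WHATIDID.txt").touch()
    return step


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "2016-04-14-bombus_variant_calling"
        raw = "2016-04-14-bombus_raw_28_samples"
        (project / "input" / raw).mkdir(parents=True)
        step = create_analysis_step(project, raw, "read_cleaning",
                                    date(2016, 4, 14))
        for path in sorted(step.iterdir()):
            print(path.name)
```

The link is written as ../../input/... on purpose, matching the tree above, rather than as an absolute path.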

Explicit (partial) conventions

Conventions include:

  • key directory names begin with YYYY-MM-DD date, followed by _underscore_delimited description. For example, a new project starting today should begin as follows: 2018-10-10-a_self_explanatory_name;
  • all subdirectory names should be self-explanatory;
  • link to files when appropriate. This can save tons of space AND reduce ambiguity/risks;
  • every results dir should contain a link named input to an input directory with a self explanatory name;
  • every directory in which you did something should contain a WHATIDID.txt (or an equivalent ruby/perl/jupyter/R/knitR/Sweave/Rmarkdown script) that contains all relevant commands required to get from input to results;
  • once you have created an “input” (i.e. “data”) folder, make it read-only because you don’t want any accidental edits while you are running your analysis.
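
Two of these conventions are easy to check or enforce with a script. The sketch below is one possible way to do it, not something we prescribe: both function names are hypothetical, and the pattern assumes descriptions made of letters and digits separated by single underscores.

```python
import re
import stat
from datetime import date
from pathlib import Path

# YYYY-MM-DD, a dash, then an underscore_delimited description (assumed form).
NAME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}-[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*")


def follows_naming_convention(name: str) -> bool:
    """Check the date prefix and the underscore-delimited description."""
    if NAME_PATTERN.fullmatch(name) is None:
        return False
    try:
        date.fromisoformat(name[:10])  # rejects impossible dates like 2018-13-40
    except ValueError:
        return False
    return True


def make_read_only(directory: Path) -> None:
    """Remove write permission from a directory and everything inside it."""
    no_write = ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    for path in [directory, *directory.rglob("*")]:
        if path.is_symlink():
            continue  # chmod would follow the link and change its target
        path.chmod(path.stat().st_mode & no_write)
```

For example, follows_naming_convention("2018-10-10-a_self_explanatory_name") is accepted while a bare "bombus_variant_calling" is rejected. Symlinks are skipped in make_read_only so that shared files outside the project keep their permissions.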

Open PhD studentship: Data science & machine learning for genomic analysis

July 5, 2018

Interested in supercharging the productivity of genome biologists?

We have an exciting 4-year bioinformatics PhD position open through the London BBSRC LiDO Doctoral Training Programme.

Apply by 5pm July 20th here at LiDO to start in September.

A description of the project is below. It is highly interdisciplinary - no need to already be able to understand all the details today.

Great candidates fulfill at least 3 of the following 4 criteria: smart, hard working, understands genomes, and not scared of data analysis or coding.

If you have any questions regarding scope or nature of the project, or whether your skills are potentially sufficient, please don’t hesitate to get in touch with me (Yannick).

(Standard UKRI eligibility criteria apply, i.e. I think one must be a UK resident; the LiDO people can explain this better.)

Project summary

(apologies for the use of domain-specific jargon!)

Inferring gene function for emerging model organisms

The first generation of molecular-genetic research focused on traditional model organisms including mouse, yeast, zebrafish, Drosophila, and C. elegans. Genetic research increasingly uses diverse organisms that are much more relevant models for specific questions. For example, some such emerging organisms exhibit unique phenotypes, including 100-fold intra-specific variation in lifespan or resistance to harsh environmental conditions; others represent novel animal models for disease or development, provide crucial ecosystem services, or are key to food security because they are crops or may pollinate them.

A major challenge when working with such “emerging” model organisms is making sense of the “gene lists” that result from genome-wide analyses (e.g., of gene expression or genome-wide associations).

Here, we will develop a bioinformatics tool that takes a list of genes or genomic locations from a new species as input, and transparently produces relevant functional information describing this list of loci. When presented with data for which no direct information exists, the tool will in a first instance identify relationships of orthology to regions of other species. This will create a trail of links to databases in which functional information for orthologous regions does exist. These databases will be interrogated following a hierarchical set of rules (initially defined based on human-curated examples). Using cutting-edge “learning to rank” machine learning techniques, the rulesets will be refined over time by tracking user behaviour (based on logs of which relationships/trails users retain) as well as explicitly allowing users to flag issues. The tool thereby makes it possible to extract significant value from large-scale datasets that would otherwise require laborious case-by-case engineering efforts to connect. Summary data will be returned to the user using visualisations, statistics and tables in a manner that facilitates interpretation. Inferences and relationship calculations taking seconds will be available immediately; those taking minutes (e.g., distant orthology) will appear asynchronously as they complete; and those taking longer will result in email notification.
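
To make the three response tiers concrete, here is a rough sketch of how a result could be routed by its expected running time. Nothing here is part of the planned tool: the function name and both cut-offs (one minute, one hour) are placeholders, since the description above only says "seconds", "minutes" and "longer".

```python
from datetime import timedelta

# Hypothetical cut-offs between the three tiers.
IMMEDIATE_LIMIT = timedelta(minutes=1)
ASYNC_LIMIT = timedelta(hours=1)


def delivery_mode(estimated_runtime: timedelta) -> str:
    """Decide how a result reaches the user, given its expected runtime."""
    if estimated_runtime < IMMEDIATE_LIMIT:
        return "immediate"      # shown as soon as the page loads
    if estimated_runtime < ASYNC_LIMIT:
        return "asynchronous"   # appears on the page once it completes
    return "email"              # user is notified by email when done


if __name__ == "__main__":
    for runtime in (timedelta(seconds=5), timedelta(minutes=10),
                    timedelta(hours=6)):
        print(runtime, delivery_mode(runtime))
```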

We will package our work in a manner that makes it accessible to biologists working with new or existing genomes. This builds on our extensive success with software including SequenceServer and OMA. Overall, our approach will substantially improve the ability of genome biologists to generate meaningful biological insight when working with new organisms.

This project is in collaboration with Christophe Dessimoz at UCL/Lausanne.

Posted to bioRxiv: Degenerative Expansion of a Young Supergene

May 23, 2018

We have just posted a new manuscript to bioRxiv, where we describe the structural differences between the SB and Sb versions of the fire ant social chromosome pair.

We find that Sb is larger than SB and discuss how the suppression of recombination of Sb would lead to this type of ‘degenerative expansion’, as hypothesised for Y chromosomes and other non-recombining chromosomal regions. Read the manuscript, and tell us what you think!