Population genetics in R
We have samples with two genotypes: the B genotype (associated with single-queen colony phenotype) and the b genotype (associated with multiple-queen colony phenotype). B and b actually mark a large supergene, a genomic region with strong linkage disequilibrium (Wang et al, 2013). The B and b variants of this region do not recombine with each other.
Our dummy assembly has two scaffolds, each from a different chromosome. The aim of our analysis is to test whether any part of this assembly is associated with the B and b supergene variants.
In the first part of the analysis, we are going to create a heat map of the genotypes of the individuals and we are going to run PCA on these genotypes. This will allow us to test if any of the individuals cluster by their B/b genotype. This will be done using the
adegenet package in R.
In the second part, we are going to measure genetic differentiation between the two groups (B and b). We will do this analysis over a sliding window, to see if the differentiation between B and b are specific to any portion of the genome. We will also measure the the genetic diversity among each of the groups, which may tell us something about the evolutionary history of the portions genome represented in our assembly. This will be done using the
PopGenome package in R.
Input into R
Again, make a directory for this practical. You should create a directory for the data and one for the results (with a link to the data directory). You will only need the
snp.vcf file we created in the last practical. It’s a good idea to note down the results of your analysis in the the results section, as well saving any graph you make.
adegenet uses a object called
r genlight. To create it, we need to input a matrix where each row is an individual and each column is a locus (i.e. a SNP position). We can do this using bcftools:
bcftools query snp.vcf -f '%CHROM\t%POS[\t%GT]\n' > snp_matrix.txt # get sample names bcftools query -l snp.vcf > sample_names.txt
You can open a new R session by typing
R in the terminal. Then:
# input the SNP data and the sample names snp_matrix <- read.table("snp_matrix.txt") sample_names <- read.table("sample_names.txt") sample_names <- sample_names$V1 # Keep the position of the loci loci <- snp_matrix[,1:2] colnames(loci) <- c("scaffold", "position") # Turn the matrix on its side (rows = individuals, columns = loci) snp_matrix <- snp_matrix[,3:ncol(snp_matrix)] snp_matrix <- t(snp_matrix) # add sample names sample_names <- gsub("\\.bam","", sample_names) row.names(snp_matrix) <- sample_names # reorder the rows by population bb <- sample_names[grep("B", sample_names)] lb <- sample_names[grep("b", sample_names)] snp_matrix <- snp_matrix[c(bb,lb),]
Once this is done, we can create a new
genlight object that contains all the SNP data
library(adegenet) snp <- new("genlight", snp_matrix, chromosome=loci$scaffold, position=loci$position, pop=as.factor(c(rep("B",7), rep("b",7)))) # You can access the data using the "@" sign: snp email@example.com snp@gen snp@chromosome # although there are some functions that may also be helpful: ploidy(snp)
Now plot a heatmap showing the genotypes.
## Heat map of genotype glPlot(snp)
You can also perform principal component analysis (PCA) and plot the first few axes.
## PCA pca <- glPca(snp, nf=10) # you can select 10 axes # fast plot scatter(pca, posi="bottomright") # Plot of the first few axes coloured by population par(mfrow=c(1,2)) plot(pca$scores[,1], pca$scores[,2], col=c(rep("blue",7), rep("orange",7)),cex=2) text(pca$scores[,1], pca$scores[,2] + 0.7, labels=rownames(pca$scores), cex= 0.7) plot(pca$scores[,1], pca$scores[,3], col=c(rep("blue",7), rep("orange",7)),cex=2) text(pca$scores[,1], pca$scores[,3] + 0.7, labels=rownames(pca$scores), cex= 0.7)
The aim of these analysis is to test whether the B and the b individuals cluster together or separately. The first principal component separates B and b. But if you look in the heat map, the separation is not constant in the whole genome.
Each of the scaffolds has been retrieved from a different chromosome. Below, we can test whether the differentiation between B and b is only seen in one of the scaffolds.
# scaffold_1 scaffold_1_index <- which(snp@chromosome == "scaffold_1") scaffold_1 <- snp[,scaffold_1_index] glPlot(scaffold_1) pca1 <- glPca(scaffold_1, nf=10) # you can select 10 axes scatter(pca1, posi="bottomright") # scaffold_2 scaffold_2_index <- which(snp@chromosome == "scaffold_2") scaffold_2 <- snp[,scaffold_2_index] glPlot(scaffold_2) pca2 <- glPca(scaffold_2, nf=10) scatter(pca2, posi="bottomright")
PopGenome to measure differentiation and diversity
Another way of measuring differentiation between groups of individuals is using the fixation index, FST, which tests whether there is genetic structure in the population. FST can be used, for example, to test whether there is evidence of low gene flow between populations. In the case of the fire ant, the B and b supergene variants coexist in the population, but do not recombine with each other - so they should show strong differentiation.
An important population genetics measure is genetic diversity. Patterns of genetic diversity can be informative of a population’s evolutionary past - for example, low genetic diversity may be evidence for a recent population bottleneck. Furthermore, the variation of diversity within the genome can be informative of different evolutionary effects, such as the strength of selection in different parts of the genome.
We will measure FST and nucleotide diversity (a measure of genetic diversity) using the R package PopGenome.
In theory, the
r PopGenome can read VCF files directly, using the
readVCF function. However, because our samples are haploid, we need to use a different function,
r readData, which requires a folder with a separate VCF for each scaffold.
# Make new directory mkdir popgenome-vcf # compress and index the VCF bgzip snp.vcf tabix -p vcf snp.vcf.gz bcftools view snp.vcf.gz scaffold_1 > popgenome-vcf/scaffold_1 bcftools view snp.vcf.gz scaffold_2 > popgenome-vcf/scaffold_2
You can now load the data in
library(PopGenome) # Load the data snp <- readData("popgenome-vcf", format="VCF") # This is complex object, with several slots get.sum.data(snp) show.slots(snp) # You can access the different "slots" by using the "@" sign: firstname.lastname@example.org # Set populations pops <- get.individuals(snp)[] pop1 <- pops[grep("B\\.bam",pops)] pop2 <- pops[grep("b\\.bam",pops)] snp <- set.populations(snp, list(pop1,pop2)) snp@populations # check if it worked
Let’s calculate FST between the two populations and nucleotide diversity in each of the populations.
# Diversities and FST (by scaffold) snp <- F_ST.stats(snp) # this does the calculations and # adds the results to the appropriate slots # Print FST get.F_ST(snp) # each line is a scaffold snp@nucleotide.F_ST # Print diversities get.diversity(snp) get.diversity(snp)[] # pop1 (B) get.diversity(snp)[] # pop2 (b) email@example.com
Another useful tool is to do the calculations along a sliding window.
# Transform object into object divided by sliding window win_snp <- sliding.window.transform(snp, width=10000, jump=10000, type=2, whole.data=FALSE) # Measurements per window win_snp <- F_ST.stats(win_snp) win_snp@nucleotide.F_ST firstname.lastname@example.org # A simple plot win_fst <- win_snp@nucleotide.F_ST[,1] bb_div <- email@example.com[,1] # diversity among B (bb = "big B") lb_div <- firstname.lastname@example.org[,2] # diversity among B (lb = "little b") plot(1:length(win_fst), win_fst) par(mfrow=c(2,1)) win_fst <- win_snp@nucleotide.F_ST[,1] plot(1:length(bb_div), bb_div) win_fst <- win_snp@nucleotide.F_ST[,1] plot(1:length(lb_div), lb_div)
- Knowing that there is no recombination between B and b somewhere in the genome, why do you think there is high FST in one scaffold and not the other?
- What factors could make b have such low diversity in one of the scaffolds?