Today is Save the Koala Day and a perfect time to to tell you about some noteworthy and ground-breaking research that was made possible by AWS Research Credits and the AWS Cloud.
Five years ago, a research team led by Dr. Rebecca Johnson (Director of the Australian Museum Research Institute) set out to learn more about koala populations, genetics, and diseases. As a biologically unique animal with a limited appetite, maintaining a healthy and genetically diverse population are both key elements of any conservation plan. In addition to characterizing the genetic diversity of koala populations, the team wanted to strengthen Australia’s ability to lead large-scale genome sequencing projects.
Inside the Koala Genome
Last month the team published their results in Nature Genetics. Their paper (Adaptation and Conservation Insights from the Koala Genome) identifies the genomic basis for the koala’s unique biology. Even though I had to look up dozens of concepts as I read the paper, I was able to come away with a decent understanding of what they found. Here’s my lay summary:
Toxic Diet – The eucalyptus leaves favored by koalas contain a myriad of substances that are toxic to other species if ingested. Gene expansions and selection events in genes encoding enzymes with detoxification functions enable koalas to rapidly detoxify these substances, making them able to subsist on a diet favored by no other animal. The genetic repertoire underlying the accelerated metabolics also renders common anti-inflammatory medications and antibiotics ineffective for treating ailing koalas.
Food Choice – Koalas are, as I noted earlier, very picky eaters. Genetically speaking, this comes about because their senses of smell and taste are enhanced, with 6 genes giving them the ability to discriminate between plant metabolites on the basis of smell. The researchers also found that koalas have a gene that helps them to select eucalyptus leaves with a high water content, and another that enhances their ability to perceive bitter and umami flavors.
Reproduction – Specific genes which control ovulation and birth were identified. In the interest of frugality, female koalas produce eggs only when needed.
Koala Milk – Newborn koalas are the size of a kidney bean and weigh less than half of a gram! They nurse for about a year, taking milk that changes in composition over time, with a potential genetic correlation. The researchers also identified genes known to function as anti-microbial properties.
Immune Systems – The researchers identified genes that formed the basis for resistance, immunity, or susceptibility to certain diseases that affect koalas. They also found evidence of a “genomic invasion” (their words) where the koala retrovirus actually inserts itself into the genome.
Genetic Diversity – The researchers also examined how geological events like habitat barriers and surface temperatures have shaped genetic diversity and population evolution. They found that koalas from some areas had markedly less genetic diversity than those from others, with evidence that allowed them to correlate diversity (or the lack of it) with natural barriers such as the Hunter Valley.
Powered by AWS
Creating a complete gene sequence requires (among many other things) an incredible amount of compute power and vast amount of storage.
While I don’t fully understand the process, I do know that it works on a bottom-up basis. The DNA samples are broken up into manageable pieces, each one containing several tens of thousands of base pairs. A variety of chemicals are applied to cause the different base constituents (A, T, C, or G) to fluoresce, and the resulting emission is captured, measured, and stored. Since this study generated a koala reference genome, the sequencing reads were assembled using an overlapping layout consensus assembly algorithm known as Falcon which was run on AWS. The koala genome comes in at 3.42 billion base pairs, slightly larger than the human genome.
I’m happy to report that this groundbreaking work was performed on AWS. The research team used cfnCluster to create multiple clusters, each with 500 to 1000 vCPUs, and running Falcon from Pacific Biosciences. All in all, the team used 3 million EC2 core hours, most of which were EC2 Spot Instances. Having access to flexible, low-cost compute power allowed the bioinformatics team to experiment with the configuration of the Falcon pipeline as they tuned and adapted it to their workload.
We are happy to have done our small part to help with this interesting and valuable research!