Advancing Microbiome Data Analyses with Benjamin Callahan
With his widely used microbiome data analysis tools and dedication to data interoperability and accessibility, Benjamin Callahan, Ph.D., the 2023 ASM Microbiome Data Prize awardee and an associate professor of microbiomes and complex microbial communities at North Carolina State University, is a notable figure in the microbiome field.
His research helps scientists make better use of microbiome data, allowing them to identify and study microbes that play roles in all facets of life. Given his successes and contributions, it’s somewhat surprising that Callahan didn’t originally expect to work with microbiomes at all. As with all good scientific journeys, his is far from a straight line.
From Physics to Microbiome Science
As a kid, Callahan was convinced he’d become a marine biologist. This conviction was shaped by his love of stories penned by the famous French oceanographer, Jacques Cousteau. However, as childhood phases sometimes do, this one soon passed. While Callahan had an inkling he’d do something related to science, technology, engineering and mathematics (STEM), he “never really thought about being a scientist…for a long time thereafter.”
Fast forward to his college years at Iowa State University, and Callahan did, indeed, embark on a path in STEM, albeit a winding one. He started as a computer engineering major, then segued into linguistics before finishing with a degree in physics and math. It was as a physics Ph.D. student at the University of California, Santa Barbara, that Callahan’s foray into biology began. His mentor, Dr. Boris Shraiman, a theoretical physicist, was intrigued by biological questions related to population genetics and evolution, interests Callahan shared. After discovering that the most interesting results of his first Ph.D. project had already been published, Callahan embarked on a second theoretical and computational project exploring polyketide synthase pathways in bacteria, and how horizontal gene transfer affects their adaptive evolution. Toward the end of his degree, he also began working in genomic data analysis, a nod to his future scientific pursuits.
Of course, no path is without its bumps. During his postdoc in the lab of Dr. Daniel Fisher, a professor in applied physics at Stanford University, Callahan, who had mostly taken a theoretical approach to biology, had “the romantic idea that a scientist should do it all, which include[d] doing experiments.” He spearheaded a project focused on niche construction (i.e., when an organism modifies its local environment in a way that benefits its competitive fitness) of experimental microbial populations.
By the end of the project, he had “learned a lot of humility and [gained] an understanding of how important and hard it is to be a good experimentalist.” However, Callahan didn’t think he had the aptitude or passion to pursue a career at the bench. He was ready to leave academia altogether, and he might have, if Dr. Susan Holmes, a professor of statistics at Stanford, hadn’t recruited him to analyze data for a project investigating the vaginal microbiota and preterm birth. This marked the beginning of Callahan’s work with microbiome data. It is also where he began developing the analytical tools that would form the foundation of his career.
Championing New Approaches for Microbiome Data Analysis
Microbiome analyses use DNA sequencing to determine which microbes are present in a given environment, such as the gut or soil. This is often done by sequencing a region of a conserved gene that varies among microbes in a group, such as the 16S rRNA gene in bacteria. Callahan likens each of these gene amplicon sequences to a barcode that corresponds to a type of bacteria, just like barcodes at a store are assigned to specific items. Researchers use computational methods to look at all the barcodes (sequences) in a sample to uncover their identity.
With that in mind, Callahan’s research program focuses on understanding microbiome measurements to make better use of the data. The application of this approach to amplicon sequencing resulted in one of his most significant contributions to the field, an open-source software package called Divisive Amplicon Denoising Algorithm 2 (DADA2). The software allows researchers to parse out errors in amplicon sequencing data. It also helps organize the data to determine the microbial composition of a sample. The latter function is where the power of DADA2 lies.
Traditionally, microbiome composition was (and often still is) determined by assessing which sequences in a sample are most like each other, based on an arbitrary threshold, and clumping them together. This is akin to pairing a green apple with a red apple and labelling the group “apples”; there is no finer resolution than that. In microbial terms, similar sequences might be clustered into a group called “Neisseria” or “Pseudomonas,” with related species and strains lumped together inside it.
DADA2 champions an alternative approach. Rather than clustering sequences, DADA2 attempts to discriminate real sequences from one another, no matter how small the difference between them. This paints a more nuanced picture of microbial diversity by allowing for discrimination between different species in a genus and, in some cases, different strains of the same species. Yet, for Callahan, the greatest benefit of DADA2's approach is how it improves the utility of microbiome data.
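The loss of resolution from threshold clustering can be sketched with a toy example. The snippet below is illustrative only: the 40-base sequences, the 97% identity threshold and the greedy clustering routine are assumptions made for the sketch, and counting exact sequences merely stands in for DADA2's output. It does not reproduce the error-modelling step DADA2 uses to separate real sequences from sequencing errors.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold):
    """Count reads per centroid; a read joins the first centroid it resembles closely enough."""
    counts = Counter()
    for read in reads:
        # Fall back to the read itself when no existing centroid is similar enough.
        centroid = next((c for c in counts if identity(read, c) >= threshold), read)
        counts[centroid] += 1
    return counts


# Hypothetical barcodes: strain_b differs from strain_a at a single position.
strain_a = "ACGT" * 10
strain_b = strain_a[:19] + "A" + strain_a[20:]
reads = [strain_a] * 5 + [strain_b] * 3

clustered = greedy_cluster(reads, threshold=0.97)  # both strains share 97.5% identity
exact = Counter(reads)                             # every distinct sequence kept apart

print("groups after clustering:", len(clustered))
print("groups as exact sequences:", len(exact))
```

In this sketch the two strains collapse into one group under clustering, while the exact-sequence view keeps them as two separate entries with their own read counts.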
Advancing FAIR Data Practices
Indeed, DADA2 is a testament to the findable, accessible, interoperable and reusable (FAIR) data principles that inform Callahan’s research endeavors. First devised in 2016, FAIR practices support the idea that it “should be reasonably easy for a computer to find and reuse data in a novel way,” Callahan said. The goal is to make it “possible and feasible [for researchers] to go and grab data from 10 different studies, and then integrate them together and do a new analysis that we learn even more from.”
However, with certain standard clustering methods, “you’re going to get different [microbial] groups for every data set you go through the [clustering] process with,” he explained. For example, standard clustering might produce an “apple” group and a “pear” group from one dataset but a “green” group and a “red” group from another, even if the same red/green apples/pears were present in both datasets. As a result, it is not possible to directly compare results from one dataset with another.
This is where DADA2 shines. Because the method produces exact DNA sequences that exist independently of the rest of the dataset, those sequences will correspond to the same group of microbes, regardless of which dataset they came from. Thus, the data are inherently interoperable (i.e., can be exchanged or used across different systems/applications).
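A small sketch can make the interoperability point concrete. Everything here is hypothetical (the sequences, the two "studies" and the 97% threshold), and the greedy clustering routine is a simplified stand-in for de novo clustering rather than any particular tool's implementation. It shows how cluster identities depend on which reads happen to be in a dataset, whereas exact sequences act as stable labels that can be joined across datasets.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Toy de novo clustering: the first read seen in a group becomes its centroid."""
    counts = Counter()
    for read in reads:
        centroid = next((c for c in counts if identity(read, c) >= threshold), read)
        counts[centroid] += 1
    return counts


def mutate(seq, pos, base):
    """Return a copy of seq with one position replaced."""
    return seq[:pos] + base + seq[pos + 1:]


# Hypothetical sequences: b is one change away from a, c is one change away from b.
seq_a = "ACGT" * 10
seq_b = mutate(seq_a, 5, "T")
seq_c = mutate(seq_b, 30, "A")

study_1 = [seq_a] * 4 + [seq_b] * 2
study_2 = [seq_b] * 3 + [seq_c] * 3

# Clustering: each study gets its own centroid, so the groups cannot be matched up.
clusters_1 = greedy_cluster(study_1)
clusters_2 = greedy_cluster(study_2)
shared_clusters = set(clusters_1) & set(clusters_2)

# Exact sequences: the same sequence is the same label in every study.
exact_1, exact_2 = Counter(study_1), Counter(study_2)
shared_exact = set(exact_1) & set(exact_2)
combined = {s: (exact_1[s], exact_2[s]) for s in sorted(set(exact_1) | set(exact_2))}

print("shared cluster labels:", len(shared_clusters))
print("shared exact sequences:", len(shared_exact))
```

Here seq_b is present in both studies, yet the clustered results share no label, because it is absorbed into a different centroid each time. With exact sequences, the combined table lines up its counts from both studies under one key.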
This has important practical implications. Callahan references the development of microbial therapeutics, such as probiotics or fecal microbiota transplants, as an example. Here, consistent measurements are needed to determine what concentration of the microbes in a product persists in the body. The standardization of microbial sequence data fostered by Callahan’s software is a step in the right direction.
Making a Tangible Impact
Callahan’s research has clearly made a splash: DADA2 is the most widely used tool of its kind. He has continued to spend a substantial amount of time developing, supporting and maintaining the package, including adding new functionalities that keep pace with the evolution in sequencing technologies.
“I’ve probably put more time into DADA2 after [it was published] than I did beforehand,” he said, emphasizing the long-term commitment required to develop successful software. This commitment is evident in Callahan’s development of data workflows, workshops and courses that make DADA2 understandable and usable for a broad audience. And DADA2 is only one of the software packages and analysis pipelines his lab has contributed to the research community.
The ability to make a difference in others’ lives is one of the most rewarding parts of Callahan’s career so far. Often, as a scientist, “you publish a paper [and think] ‘what did that do?’ Starting out in theory, I felt that most acutely,” he said. Through his analytical tools, Callahan feels he is directly helping people advance their own scientific endeavors.
Still, the diverse phases of his career have played an important role in getting him where he is today. When asked what he would tell early-career scientists, Callahan highlighted the value of entertaining a broad range of interests. “There are a lot of advances that come from seeing how they do something in one field and borrowing some of those ideas into another field,” he said. Sound advice coming from a physicist turned microbiome aficionado.