Counting the invisible
John Bunge is a navigator in a sea of numbers. An associate professor in Cornell's Department of Statistical Science (http://www.stat.cornell.edu/) who holds a joint appointment with CIS, he is developing innovative methods to help solve a knotty problem: how many classes or species should we expect to find in a population, given incomplete information?
The research is currently being applied in biology, but it should help answer questions in many other areas as well. To cite just three examples: eliminating needless duplication in digital library catalogs, deriving good estimates of the number of unreported crimes in a society, and limning ancient economies based on coin collections excavated by archaeologists. He's looking for students of varying expertise to help, including programmers, catalogers, and ambitious souls interested in improving the technique itself (http://www.stat.cornell.edu/~bunge/current_research_directions.htm).
Current work in biology is exploring how many species of fish there are on a particular reef. It's not practical to count every animal. Instead, researchers sample (http://www.stat.cornell.edu/~bunge/basic_theory.htm) the population in a certain place and time. Just by chance, however, members of some species can be missed altogether. That's where statisticians like Bunge come in: they tabulate and plot the number of times each species was seen. From those counts, Bunge can construct a curve and extrapolate it back to estimate how many species were observed zero times, that is, the ones missed completely by sampling.
In most cases, Bunge is working with populations that contain hundreds or thousands of different species. Some of them appear in the sample many times. Others appear only once or twice. When researchers find that many species were seen that rarely, it follows that there are probably other species they didn't see at all.
“The idea is to take a model of the count data and project downward into the invisible … and then you add the zero count to the actual count and you get the total,” Bunge says.
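The arithmetic in that quote can be sketched in a few lines of Python. This is only an illustration: it uses the Chao1 estimator, a standard textbook lower bound on the number of unseen species, as a stand-in, and it is not Bunge's own model. The reef sample and the function names are invented for the example.

```python
from collections import Counter

def frequency_counts(sightings):
    """Return a mapping k -> number of species seen exactly k times."""
    per_species = Counter(sightings)        # species -> times seen
    return Counter(per_species.values())    # times seen -> number of species

def chao1_total(sightings):
    """Return (observed species, estimated unseen species, estimated total)."""
    freq = frequency_counts(sightings)
    observed = sum(freq.values())
    f1 = freq.get(1, 0)   # species seen exactly once
    f2 = freq.get(2, 0)   # species seen exactly twice
    if f2 > 0:
        unseen = f1 * f1 / (2 * f2)
    else:
        unseen = f1 * (f1 - 1) / 2   # bias-corrected form when nothing was seen twice
    return observed, unseen, observed + unseen

# Hypothetical reef sample: each entry is one fish sighting.
sample = (["wrasse"] * 5 + ["goby"] * 3 + ["blenny"] * 2 + ["tang"] * 2
          + ["eel", "puffer", "grouper", "snapper"])

observed, unseen, total = chao1_total(sample)
print(observed, unseen, total)
```

Here eight species were observed, four of them only once and two exactly twice, so the estimated zero count is 4 and the estimated total is 12. The more species that turn up only once relative to those seen twice, the larger the projected number of species missed entirely.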
In statistics, such procedures yield two results: the estimated number of species and a range of possible error. Narrowing that range to something reasonable is computationally intensive, and Bunge uses a supercomputer in his work to refine his models and reduce the error.
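One generic way to attach a range of error to a species estimate is a bootstrap: resample the sightings with replacement many times, recompute the estimate each time, and read off the middle 95 percent of the results. The sketch below is a rough illustration under stated assumptions, not the model fitting Bunge runs on a supercomputer. The Chao1 estimator, the sample data, and all function names are stand-ins chosen for the example, and a naive bootstrap like this is only a crude guide for richness problems.

```python
import random
from collections import Counter

def chao1(sightings):
    """Chao1 estimate of total species (observed plus estimated unseen)."""
    per_species = Counter(sightings)
    f1 = sum(1 for n in per_species.values() if n == 1)   # seen once
    f2 = sum(1 for n in per_species.values() if n == 2)   # seen twice
    unseen = f1 * f1 / (2 * f2) if f2 > 0 else f1 * (f1 - 1) / 2
    return len(per_species) + unseen

def bootstrap_interval(sightings, reps=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the Chao1 estimate."""
    rng = random.Random(seed)               # fixed seed for repeatable runs
    n = len(sightings)
    estimates = sorted(
        chao1(rng.choices(sightings, k=n))  # resample with replacement
        for _ in range(reps)
    )
    lo_idx = max(0, round((1 - level) / 2 * reps))
    hi_idx = min(reps - 1, round((1 + level) / 2 * reps) - 1)
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical sample: each entry is one sighting of a species.
sample = (["wrasse"] * 5 + ["goby"] * 3 + ["blenny"] * 2 + ["tang"] * 2
          + ["eel", "puffer", "grouper", "snapper"])

point = chao1(sample)
low, high = bootstrap_interval(sample)
print(point, low, high)
```

Even this toy version recomputes the estimate a thousand times for a sample of sixteen sightings, which hints at why the cost grows quickly for real data sets with thousands of species and more elaborate models.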
The implications of such techniques are wide-ranging. In the near term, Bunge's statistics promise to refine our knowledge of biological diversity even when limited resources are available to inventory endangered or inaccessible places. Past efforts have focused on diversity in populations of birds in Florida and butterflies in Malaysia; Bunge and his graduate students are collaborating now with biologists studying habitats from the microbial world at sea (http://www.atsweb.neu.edu/s.epstein/) to the canopies of redwood forests (http://seattletimes.nwsource.com/pacificnw/2005/0130/cover.html). Forthcoming work will focus on estimating numbers of expressed genes in DNA and the number of kinds of working proteins we should expect to see in living cells. Like any expert navigator, Bunge is eager to take his ship toward new and exciting horizons.