Informatics
Page contents:
- Ontology for evolutionary biology.
- Digital republication of the first edition of Darwin’s Origin of Species.
- Literature of evolutionary biology.
These projects are intended to supplement or be integrated with the Darwin Manuscripts Project, of which I am Associate Editor.
Ontology for evolutionary biology
I am currently at work on two ontologies which, together, will be powerful tools for organizing the literature of evolutionary biology. The general aim is to create topic-oriented pathways through that literature for scientists asking fundamental questions about evolution, for instance, about the origin of species, adaptation, or the explanation of organic diversity.
Ontology of evolutionary processes
Organic evolution has a history; reconstructing that history is one of evolutionary biology’s most important tasks. What came from what? Almost without exception, ontologies about evolution model phylogeny. Nonetheless, there is more to evolution than phylogeny. Evolution occurs by processes such as natural selection, random drift, mutation, speciation, extinction, and adaptation. Understanding these processes, as part of a general theory of organic evolution, is required for answering questions about how and why the history of life on Earth has taken the particular shape it has now, and how it can be expected to proceed in the future.
The Evolution Ontology (EO) is intended to describe these processes rather than phylogeny. Built on the Basic Formal Ontology, EO will provide a relatively coarse-grained model of evolutionary processes, representing each as a class. It will make use of digital representations of neighboring domains, for instance, gazetteers (e.g., the GAZ Project), environment ontologies (e.g., the EnvO project), the Ontology for Biomedical Investigations (OBI), the Information Artifact Ontology (IAO), resources for organizing biological names (e.g., uBio), resources describing homology (e.g., the Homology Ontology), ontologies representing phenotypes, anatomy, and morphology (e.g., Phenoscape), and ontologies representing descent among lineages (e.g., CDAO). In the future, as semantic modeling tools become more sophisticated, a finer-grained view is intended. At present, for instance, natural selection and its subclasses will be represented as primitive classes, with no further definition. The longer-term aim is to represent genes, phenotypes, and the like as having fitness values and other attributes, so that whether natural selection has occurred in a particular case can be determined by computing over the ontology.
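As a rough illustration of what "computing over the ontology" might eventually look like, here is a minimal Python sketch. It is not EO itself. The class names, the `Variant` type, and the `fitness` attribute are assumptions made for illustration, and a fitness difference among variants is treated here only as a necessary condition for natural selection, not as a test that it occurred.

```python
from dataclasses import dataclass


# Hypothetical coarse-grained process classes; actual EO class names may differ.
class EvolutionaryProcess:
    """Root of the illustrative process hierarchy."""


class NaturalSelection(EvolutionaryProcess):
    """Treated as primitive: no further definition is attached."""


class RandomDrift(EvolutionaryProcess):
    """Another primitive process class."""


@dataclass(frozen=True)
class Variant:
    """A gene or phenotype variant with an assumed relative-fitness value."""
    name: str
    fitness: float


def fitness_differs(variants, tolerance=1e-9):
    """Return True if at least two variants differ in fitness.

    Heritable fitness differences are a precondition for natural
    selection; this check says nothing about whether selection occurred.
    """
    values = [v.fitness for v in variants]
    if len(values) < 2:
        return False
    return max(values) - min(values) > tolerance


# Invented example values, not data.
population = [Variant("A", 1.0), Variant("a", 0.9)]
print(fitness_differs(population))
```

A finer-grained model would need far more than this (heritability, population structure, time), which is one reason the current design keeps natural selection as a primitive class.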
In addition to organizing bibliographic records (see below) according to topic, EO will be useful for organizing and exploring biological data, by supplementing information about genes and their products with information about how and why those genes and products became the way they are. For instance, Gene Ontology records enriched with EO classes would provide insight into an amino acid’s history of natural selection, which will aid scientists’ reverse-engineering efforts to discover what’s important about that amino acid’s role in biological processes.
Evolutionary Process Ontology would be a better name for this ontology. Evolution Ontology will have to do, because this was the name under which it was established, and, among those few who know about it, the name by which it is known.
Ontology of explicit discourse
The class names in EO can be used to tag bibliographic records for works about evolution, categorizing them by subject. This is useful, but it would be more useful still to organize the literature on a given topic so that it offers the researcher a pathway that follows the topic’s development. The “explicit discourse” of a topic is what those working on it say to one another, in print, in conversation, or in their manuscripts and notes. The explicit discourse of a topic in evolutionary biology (whether geographic isolation is required for speciation, for instance) shows how the thinking on that topic developed. This is especially important in evolutionary biology, because themes recur. Darwin’s theory of speciation was something like what we would now call a sympatric theory; others at the time objected, claiming that geographic isolation (allopatry of some form) is required. The issue is still very much in dispute, and, although much of the context has changed, many of the arguments and observations considered over the topic’s history since 1859 keep coming back to life. Efficiently retrieving the literature on a topic and arranging it in the order in which the explicit discourse developed would give researchers a quick way to survey the history of thought on that topic.
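The tagging and ordering idea can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the project's implementation. The `Record` schema and the topic labels are hypothetical, and sorting by publication year is only a crude stand-in for the order of the explicit discourse, which would also have to account for who was responding to whom.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A bibliographic record tagged with topic labels (hypothetical schema)."""
    author: str
    year: int
    title: str
    topics: set = field(default_factory=set)


def discourse_pathway(records, topic):
    """Return the records tagged with `topic`, oldest first."""
    tagged = [r for r in records if topic in r.topics]
    return sorted(tagged, key=lambda r: (r.year, r.author))


# Placeholder records: apart from Darwin (1859), the authors, years,
# and titles are invented for illustration.
records = [
    Record("Author C", 1942, "Placeholder title C", {"speciation"}),
    Record("Darwin, C.", 1859, "On the Origin of Species",
           {"speciation", "natural_selection"}),
    Record("Author B", 1868, "Placeholder title B", {"speciation"}),
    Record("Author D", 1930, "Placeholder title D", {"natural_selection"}),
]

for r in discourse_pathway(records, "speciation"):
    print(r.year, r.author, r.title)
```

In practice the topic labels would be EO class names, and the ordering could be refined with citation or correspondence links where those are available.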
The ever-increasing volume of digital texts makes the ontology of explicit discourse, paired with EO, a particularly exciting project. Many publications begin life as digital texts, for instance, those published in recent years; and the Biodiversity Heritage Library has scanned some tens of millions of pages of the literature on natural history, taken from some of the world’s deepest and most complete libraries on the topic, for instance, those of The Natural History Museum in London, and the Harvard Museum of Comparative Zoology.
Documents
Code, other files, the wiki, and the mailing list are available through the Google Code project and the Google Group for EO.
Digital republication of the 1859 Origin
I am republishing the 1859 first edition of Darwin’s Origin of Species, published by John Murray. The plain text was obtained from the Oxford Text Archive and is being proofread three times against the facsimile of the 1859 text edited by Ernst Mayr. The plain text is then typeset with LaTeX. The result is a word-for-word replica of the 1859 Origin in PDF. The typography is in a more modern style than that of the Murray publication, but markers indicating the page breaks of the original are inserted into the PDF document automatically.
The text is currently undergoing its third round of proofreading. Completion of the final draft of the text is anticipated in June 2010.
This text is intended for two uses.
- Offer the general reading public a lightweight, professionally proofread and typeset copy of the Origin. A range of formats is planned, including PDFs sized to read on an iPod or iPhone and e-book formats.
- Offer informatics researchers access to a plain-text copy of the Origin. This plain text can be searched with text-processing tools to identify key words and patterns. Passages identified by these methods can be marked up by the text processor, and the entire text, with this additional markup, typeset into a human-readable form. This approach combines the speed, accuracy, and flexibility of a machine reader with the readability of a text that looks like a print book.
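The second use can be illustrated with a small Python sketch that finds key words in plain text and wraps each match in a LaTeX macro, so that the marked-up text can then be typeset. The function name, the `\hl` macro (which the LaTeX preamble would have to define or load), and the sample sentence are assumptions for illustration. The sentence is not a quotation from the Origin.

```python
import re


def highlight(text, keywords, macro="hl"):
    """Wrap each whole-word keyword match in \\macro{...}, ignoring case."""
    if not keywords:
        return text
    # Try longer phrases first so that "natural selection" is preferred
    # over a shorter keyword such as "selection" at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "\\" + macro + "{" + m.group(0) + "}", text)


# Invented sample sentence, used only to exercise the function.
sample = "Natural selection acts on variation; selection alone explains little."
print(highlight(sample, ["natural selection", "variation"]))
```

Because the markup is added to a copy of the plain text, the proofread source stays untouched and different keyword sets can be typeset from it independently.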
Documents
The appendices to the text are available for download. These include the editor’s statement on the methods used to produce the text and an argument for the need for such a text, the text’s licensing and permissions, and the release history and to-do list.
The very curious can browse the svn repository, but be warned: the text is still in draft form, much of it having been proofread only twice, and the document design is still fluid. LaTeX documents are not guaranteed to be well-formed.
Literature of evolutionary biology
I am creating a comprehensive bibliography of works about evolutionary biology. Its scope encompasses any work published on the subject since Darwin, and also includes works by those recognized to be Darwin’s predecessors. The bibliography is at an early stage and contains a mere 3,500 or so records. Nonetheless, these records are highly significant, because they include works referred to by Darwin in his correspondence and works he owned. Extraction of relevant records from the 16 million or so records in PubMed is under way, and there are plans to extract records from WorldCat. Other mechanisms, such as the digitization and ingestion of print bibliographies, are being planned.
If a work is known to be held in the Biodiversity Heritage Library, a link is provided to the full text.
The bibliography is intended for informatics work (see “Ontology for evolutionary biology,” above).
Documents
The bibliography can be processed with BibTeX using a custom style file, amnh-print.bst. A rudimentary guide to the fields, each of which represents an attribute of a work, will help users understand the meaning of the raw data.
