Special Report 19

Natural Language Processing (NLP)

Literature text-mining is an important aspect of informal ontology development. Although many database projects are underway to manually curate data on developmental endpoints, unstructured data presents a different challenge. This information often holds the key to the major themes or ideas associated with the structured data, but it must be extracted within its proper context and managed differently from structured data.

NLP can capture a good deal of information about molecular and pathway activity from the scientific literature, starting with curated databases such as GO (Gene Ontology), EMAGE (mouse embryo gene expression), GXD (mouse gene expression), MPO (Mammalian Phenotype Ontology), ZFIN (zebrafish model organism database) and OMIM (Online Mendelian Inheritance in Man). NLP enhances the coarse semantic search for specific concepts and then provides a way to automatically extract the key facts, relationships and quantitative information from the literature. The results are then presented to an analyst, who performs manual quality assurance/quality control (QA/QC) and data cleaning.

Extensible Markup Language (XML) conveys information about text or other data using embedded codes that are not easily read by humans. Because XML syntax rules can represent data from any subject domain, unstructured data encoded in XML must be parsed with common software tools that read these universal rules. Rule-based indexing, an extensible thesaurus, document classification and document filters go beyond simple keyword searches to summarise information as major themes or main ideas for developmental processes and toxicities. It is important to use consistent terms when populating the ontology with such information.

As a specific example of NLP, consider:

<observation>: “over expression of” | “under expression of” | “co-regulation of”

<gene>: “PKA” | “PKB” | “PCNS” | “RAP” | (any gene related to development)

<stage or location>: “in the liver” | “in gastrulation” | “during gastrulation”

<effect>: “causes” | “resulted in” | “activates” | “controls” | “regulates”

When a regular expression parser is applied to abstracts available in PubMed, entries such as the following excerpts are flagged as potentially important:

“... Overexpression of PCNS resulted in gastrulation failure but conferred little if any specific adhesion on ectodermal cells. Loss of function accomplished independently with two non-overlapping antisense morpholino oligonucleotides resulted in failure of CNC migration, leading to severe defects in the craniofacial skeleton. ...” (Rangarajan et al., 2006).

“...We used Affymetrix microarrays to examine temporal gene expression patterns during chondrogenic differentiation in a mouse micromass culture system. ... One gene that was up-regulated at later stages of chondrocyte differentiation was Rgs2. Overexpression of Rgs2 in the chondrogenic cell line ATDC5 resulted in accelerated hypertrophic differentiation, thus providing functional validation of microarray data. ...” (James et al., 2005).
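The pattern and the two excerpts above can be sketched as a small regular expression parser. This is a minimal illustration, not the parser used for the report: the names `OBSERVATION`, `GENES`, `EFFECTS`, `PATTERN` and `flag_relations` are hypothetical, Rgs2 is added to the gene list only so that the second excerpt can be matched, the observation pattern is loosened to accept "overexpression" written as one word, and the <stage or location> slot is approximated by a short free-text gap rather than a fixed vocabulary.

```python
import re

# Assumption: accept "over expression", "over-expression" and "overexpression".
OBSERVATION = r"(?:over|under)[\s-]?expression\s+of|co-?regulation\s+of"
# Gene list from the text; "Rgs2" is an illustrative addition.
GENES = ["PKA", "PKB", "PCNS", "RAP", "Rgs2"]
EFFECTS = ["causes", "resulted in", "activates", "controls", "regulates"]


def alternation(terms):
    """Join terms into a regex alternation; spaces match any whitespace run."""
    return "|".join(r"\s+".join(re.escape(w) for w in t.split()) for t in terms)


PATTERN = re.compile(
    rf"\b(?P<observation>{OBSERVATION})\s+"
    rf"(?P<gene>{alternation(GENES)})\b"
    # Stand-in for <stage or location>: up to 80 characters within one clause.
    rf"(?P<context>[^.;]{{0,80}}?)"
    rf"\b(?P<effect>{alternation(EFFECTS)})\b",
    re.IGNORECASE,
)


def flag_relations(text):
    """Return one dict per match, with whitespace normalised in every slot."""
    return [
        {slot: " ".join(value.split()) for slot, value in m.groupdict().items()}
        for m in PATTERN.finditer(text)
    ]


excerpt_1 = ("Overexpression of PCNS resulted in gastrulation failure but "
             "conferred little if any specific adhesion on ectodermal cells.")
excerpt_2 = ("Overexpression of Rgs2 in the chondrogenic cell line ATDC5 "
             "resulted in accelerated hypertrophic differentiation.")

for excerpt in (excerpt_1, excerpt_2):
    for hit in flag_relations(excerpt):
        print(hit["gene"], "|", hit["context"], "|", hit["effect"])
```

Each flagged match would still go to an analyst for QA/QC. A pattern this loose will also flag sentences that are not relevant to the model under investigation.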

NLP can reliably capture the complex relationships for an <observation><gene><stage or location><effect>. However, QC must address two issues: information that is captured but is not relevant to the model under investigation (noisy data), and key information that is not being identified (incomplete data). NLP can also assist with running deeper queries. For example, a fact-based formal ontology of embryo development can be used as an automated core for the application of an informal ontology that is easier to navigate but less automated.

This can be illustrated by considering the case of <hypospadias>. The advantage of a simple hierarchy linking the defect to a functional system, such as <genitourinary system>, is the straightforward path to define a subhierarchy for each part, the <urethra> for example, that can be easily navigated by non-experts. Triples (subject, predicate, object) can describe almost any concept and can be expressed in standard formats that are recognised by machines (Murray-Rust, 2008):

Hypospadias {is a defect of genitourinary development that affects the male urethra}.

Written in this way, the sentence is about hypospadias (subject) and the {predicate} tells something about it. The same sentence can be written in different ways with the same meaning. As a triple, we can consider the subject <hypospadias> and the predicate <is a defect of genitourinary development that affects> linked to an object <male urethra>.
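The triple can be sketched as a small data structure. This is an illustration under stated assumptions, not a standard triple format: the names `Triple`, `TRIPLES`, `statements_about` and `path_from` are hypothetical, and the second triple ("is part of") is an assumed link added only to show how the hierarchy from defect to functional system could be navigated.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


TRIPLES = [
    # Triple taken from the text.
    Triple("hypospadias",
           "is a defect of genitourinary development that affects",
           "male urethra"),
    # Assumed link, added for illustration only.
    Triple("male urethra", "is part of", "genitourinary system"),
]


def statements_about(subject, triples=TRIPLES):
    """Render every triple with the given subject as a plain sentence."""
    return [f"{t.subject} {t.predicate} {t.object}"
            for t in triples if t.subject == subject]


def path_from(term, triples=TRIPLES):
    """Follow subject -> object links; assumes one outgoing link per subject."""
    next_term = {t.subject: t.object for t in triples}
    path = [term]
    # Stop at a term with no outgoing link, or before revisiting a term.
    while path[-1] in next_term and next_term[path[-1]] not in path:
        path.append(next_term[path[-1]])
    return path


print(statements_about("hypospadias"))
print(" -> ".join(path_from("hypospadias")))
```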

In a broader sense, the relevant endpoints that comprise critical effects in developmental toxicology studies can be captured with a search string, which might be adapted from the ToxML[1] language, of the form:

<observation>: “malformation of” | “litter size” | “evaluation of” | “weighing” |

<target>: “eye” | “face” | “liver” | “foetal weight” | (term in keywords_target) |

<description>: “hydronephrosis” | “microphthalmia” | (term in keywords_description) |

<effect>: “reduced” | “results in” | “increased” | “deficiency” | “duplication of” |
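One way to apply a search string of this form is to tag each sentence with the slots whose terms it contains. The sketch below is an assumption-laden illustration rather than ToxML tooling: `VOCAB`, `tag_slots` and `is_candidate` are hypothetical names, the keyword lists contain only the terms quoted above (a real `keywords_target` or `keywords_description` list would be longer), the rule that a candidate needs an effect plus a target or description is an assumed heuristic, and the test sentence is invented.

```python
import re

# Terms quoted in the search string; the open-ended keyword lists are omitted.
VOCAB = {
    "observation": ["malformation of", "litter size", "evaluation of", "weighing"],
    "target": ["eye", "face", "liver", "foetal weight"],
    "description": ["hydronephrosis", "microphthalmia"],
    "effect": ["reduced", "results in", "increased", "deficiency", "duplication of"],
}


def compile_slot(terms):
    """Compile one slot's terms into a whole-word, case-insensitive regex."""
    body = "|".join(r"\s+".join(re.escape(w) for w in t.split()) for t in terms)
    return re.compile(rf"\b(?:{body})\b", re.IGNORECASE)


SLOTS = {slot: compile_slot(terms) for slot, terms in VOCAB.items()}


def tag_slots(sentence):
    """Map each slot name to the list of its terms found in the sentence."""
    return {slot: [m.group(0).lower() for m in rx.finditer(sentence)]
            for slot, rx in SLOTS.items()}


def is_candidate(sentence):
    """Assumed heuristic: an effect plus a target or a description."""
    tags = tag_slots(sentence)
    return bool(tags["effect"]) and bool(tags["target"] or tags["description"])


# Invented sentence, used only to exercise the tagger.
sentence = "Foetal weight was reduced and hydronephrosis was increased."
print(tag_slots(sentence))
print(is_candidate(sentence))
```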

[1] ToxML: an open-source, XML-based data exchange standard that represents toxicological data in a structured electronic format.