Multi-source and ontology-based retrieval engine for maize mutant phenotypes Open Access

Statistics on ontology occurrence in the text sources

Text Source	Size	Number with PO terms (%)	Number with GO terms (%)
Phenotype description	4103	3009 (73.3)	615 (15.0)
Locus description	1539	584 (38.0)	598 (38.9)
Gene product description	32	4 (12.5)	9 (28.1)

Text Source	Size	Number with PO terms (%)	Number with GO terms (%)
Phenotype description	4103	3009 (73.3)	615 (15.0)
Locus description	1539	584 (38.0)	598 (38.9)
Gene product description	32	4 (12.5)	9 (28.1)

Table 1.

Statistics on ontology occurrence in the text sources

Text Source	Size	Number with PO terms (%)	Number with GO terms (%)
Phenotype description	4103	3009 (73.3)	615 (15.0)
Locus description	1539	584 (38.0)	598 (38.9)
Gene product description	32	4 (12.5)	9 (28.1)

Text Source	Size	Number with PO terms (%)	Number with GO terms (%)
Phenotype description	4103	3009 (73.3)	615 (15.0)
Locus description	1539	584 (38.0)	598 (38.9)
Gene product description	32	4 (12.5)	9 (28.1)

Implementation

Multi-source retrieval concept

In standard information retrieval systems, a query is submitted to a search engine, which then attempts to retrieve and rank the most relevant documents from the underlying corpus. However, in the case of multi-source retrieval, where a document is linked to other documents from other sources, the traditional retrieval model formulations cannot be directly applied, as the engine must be capable of retrieving and ranking groups of documents.

To make this more concrete, consider the set of three sources for the maize mutant phenotype search engine depicted in the top half of Figure 1. Each phenotype variation is linked to zero or more documents from each of the text sources. Thus, when a phenotype search is submitted, the retrieval engine will need to consider sets of related documents (the bottom half of Figure 1) from various sources, e.g. consisting of all related phenotype, locus and gene product descriptions. Collectively, these sets of documents, referred to as a folders, contain the information used for retrieval, and so retrieval and ranking are performed on folders rather than individual documents. For the remainder of this article, the term folder will be used to designate the set of documents from all sources related to a specific mutant phenotype.

Figure 1.

Depiction of the main grouping source and its relationship to the various text sources (top) as well as the folder view (bottom) where related documents for phenotypic variations are grouped.

The simplest solution to retrieving and ranking groups of documents is to merge the text from each document of a folder, referred to as a member or component document, into one mega-document. While this solution does have the advantage of allowing the use of the standard vector space retrieval model (19), it does not, however, reflect the data making up these folders very well. This is because the phenotype descriptions, locus descriptions and gene product descriptions are describing very different entities in the maize domain. Thus, there is variability in the distribution of specific terms, both in terms of usage and rarity, between text sources. This can be illustrated concretely by examining the variability of ontology terms across these sources. Because the GO contains terms related to genes and gene products, we would expect these terms to appear more often in locus and gene product descriptions than in phenotype ones. Likewise, since the PO contains plant anatomy and morphology terms, we expect these to appear most often in the phenotype descriptions and, to a lesser extent, in the locus descriptions. By examining Table 2, this is precisely the situation we find in the MaizeGDB mutant phenotype collection. The trend shown is that PO terms appear less frequently as we move from phenotype to locus to gene product descriptions, with precisely the opposite pattern present with GO terms.

Table 2.

Ontology term breakdown per text source

Average terms/document	Text source	Average terms/document
0.18	Phenotype description	1.60
0.55	Locus description	1.02
0.34	Gene product description	0.16

Average terms/document	Text source	Average terms/document
0.18	Phenotype description	1.60
0.55	Locus description	1.02
0.34	Gene product description	0.16

Table 2.

Ontology term breakdown per text source

Average terms/document	Text source	Average terms/document
0.18	Phenotype description	1.60
0.55	Locus description	1.02
0.34	Gene product description	0.16

Average terms/document	Text source	Average terms/document
0.18	Phenotype description	1.60
0.55	Locus description	1.02
0.34	Gene product description	0.16

For these reasons, merging all the components of a folder into a single document does not make the most logical sense. The proposed approach is a transformation of this multi-source retrieval problem so that groups of related documents can be retrieved using the vector space model (7).

Approach

Representing folders

To be able to retrieve groups of documents, two issues must be decided. First, the representation of a folder, i.e. a group of related documents, must be addressed, and second, a novel similarity measure between folders and the query must be formulated.

Let K be the set of text sources relevant to a particular query. Let D^k be the set of all documents within source k with D¹ denoting the base set, that is, the seed documents or identifiers that are used to build folders. The other document sets D^k, k > 1, are secondary sources with documents related to one or more folders. In this search engine, D¹ corresponds to the phenotype variation identifiers, with D², D³ and D⁴ being the phenotype image, locus and gene product descriptions, respectively. Let denote document j within document set D^k, and represent the vector associated with user's text query.

In order to formally define a folder, the document relationships between text sources must be modeled. Let f be the mapping relationship [Equation (1)] that, given a document j from D¹ and a text source k, returns the list of documents in source k that are related to

⁠.

(1)

where a ∼ b means that a is related to b, which in this search engine means one document is related to another through a series of database joins (i.e. common keys). Using this mapping relationship, we can define the folder corresponding to the base document

as F(j, K), which is the set of all documents mapped to

for a given set of text sources K.

(2)

In the context of this search engine, F(j, K) would correspond to a single phenotypic variation and would contain all the phenotype, locus and gene product descriptions related to that variation.

Constructing folder vectors

With the notion of folders formalized, considerations for folder representation and similarity can be addressed. Clearly, to use any variant of the vector space model, each folder must be represented as a vector of constituent term weights. Our approach is to treat the folder as a collection of |K| documents, one for each of the text sources being searched. We form the representative vector for each source by applying some function g(.) to the documents in f(j, k). Two classes of functions that can be used for this purpose are merge and selection.

Merge: With this approach, all the documents in f(j, k) are combined together and treated as a large document. Conventional vector space weighting schemes can be applied to the conglomerate document vector or a summarized version of the combined document. This option may be appropriate when the documents in f(j, k) are more homogeneous, i.e. when each contains a textual description of the same entity. As an example, consider a mutant phenotype that is linked to a single genetic locus, but there may be multiple descriptions of that locus in the database. In this case, the amalgamation or summarization of the set of descriptions is more likely to represent the concept as a whole (e.g. the locus, in this example) than any one document in the set.

Selection: Instead of merging the documents in f(j, k) together, one may alternatively choose to select a particular document from the set to represent the text source. This function class can be used in any situation, but is particularly appropriate when the documents in f(j, k) are more heterogeneous. As an alternative example, consider a mutant phenotype that is linked to several different gene products in the database, each of which has a single description in the database. Each of these descriptions is describing a completely different entity and, therefore, merging all these descriptions does not make sense from a retrieval standpoint. However, if only one of the gene product descriptions represents a good match for a query, then the user may still want to see that mutant phenotype. This can be accomplished by selecting only the best match from f(j, k), i.e. the document from this text source with the maximum similarity to the query. One could consider other schemas for selection that are based on the most average document or simply one selected at random.

The chosen function class is highly dependent on the text source. In the case of this maize mutant phenotype search engine, both classes of functions could be applied. The locus source would have a good candidate for the merge option. This is because in the current data set, each mutant phenotype is linked to a single locus, which may itself have several descriptions. On the other hand, the gene product source is more heterogeneous and is thus better suited for the selection option, as each phenotype description may be linked to several gene product descriptions, each describing a different gene product. In our search engine, the selection function class was chosen for all text sources and utilizes the function defined in Equation (3). Briefly, it selects the document from f(j, k) that has the highest cosine similarity to the query.

(3)

where q is the query vector and the cosine similarity is defined as

(4)

Measuring folder similarity

In addition to choosing the function g(.) for each text source, a decision must be made on how to use the |K| vectors comprising the folder to define a similarity measure. Two alternatives (weighted average and folder vector) for the folder similarity are discussed.

Weighted average: the first method to combine the |K| document vectors is to perform a weighted average as follows:

(5)

With this approach, a weight for each text source w_k can be provided to reflect the importance of text source k relative to the other sources. Also, the traditional cosine similarity measure can be applied between the query q and the formed vector (by merge or selection) for each text source.

Folder vector: alternatively, one could form a folder vector by concatenating the |K| component document vectors [see Equation (6)].

(6)

The similarity between the query and a folder can then be expressed as a reformulation of the cosine similarity measure in terms of text sources and component documents. The similarity formula is provided for both the merge and selection approaches.

For the merge case, the similarity measure is given by Equation (7). In keeping with the idea behind merge, each component document from each source is involved in the calculations of the cross product (numerator) and the folder normalization factor (left term in the denominator).

(7)

where q is the query vector, the weight of term i in document d of text source k,

⁠, is expressed by

(8)

and the analogous query term weights are given by

(9)

Note that term frequency (TF) and inverse document frequency (IDF) are calculated using standard definitions, as in ref. (7). These quantities give higher weight to terms that appear frequently within a single document and to those that are rare across all documents, respectively. It should be noted that while many documents in this data set are rather short, the use of a TF*IDF weighting scheme is rationalized by the fact that there is a sizable number of terms that appear more than once in a single document (9343 out of 85 193, or roughly 11%), and a few words that appear up to 25 times in a single document. This variability makes TF a worthwhile factor for those terms. For the rest of the terms that appear once per document, the IDF is still a very useful parameter to determine term importance, as the more rare terms in the entire text source become the higher weighted and hence more important terms.

The similarity measure when using the selection function may be computed as follows:

(10)

Recall that with selection, only one document from each text source, chosen using the function g(.), is used in the similarity calculation of the document case to the query.

System details

Offline and online processing

Using the derived similarity measure, a multi-source retrieval engine can then be implemented using the vector space model. The construction of the search mechanism can be partitioned into two main steps: (i) offline preprocessing of the entire document corpus and (ii) online query processing and retrieval. For offline preprocessing, each document from each source within MaizeGDB proceeds as in Figure 2. Documents are parsed into words and phrases to facilitate matching to ontology concepts. Phrases that do not match ontology terms and synonyms are removed, as are stop words. TFs are calculated for the remaining words and phrases in each document and stored in a MySQL database. Once all the documents have been processed, IDFs are calculated for each word and phrase from each text source and also stored. IDFs are calculated separately for each text source to maintain the differences in term discrimination between sources as shown in Table 1. A PHP script is used to automate the offline processing of all text sources and insert the appropriate information into the MySQL database.

Figure 2.

Preprocessing flow chart for the system, including parsing of text descriptions into individual term, matching terms to ontological concepts, calculation of needed quantities for determining term weights, and insertion of terms and quantities into the MySQL database.

Once the above offline procedure has been completed, the retrieval engine is functional. A submitted query can proceed as in Figure 3 to retrieve appropriate results for the user. The query follows a similar procedure as a document in offline processing. It is parsed into words and phrases that are matched to the available ontologies. Matched concepts undergo query expansion, in which term synonyms, parents and children may be added to the query. Unmatched phrases and stop words are removed, and TFs are calculated for the remaining words and phrases. Query term weights are then calculated using Equation (11).

(11)

where TF_q_,_i is the query TF defined in the standard manner by Equation (12), IDF_i is the IDF of term i in source k, w_source is the weight of the text source, w_ont is the weight of the matched ontology, w_rel is the weight of the expansion relationship type (term, synonym, parent, child).

(12)

Figure 3.

Flow chart for search engine retrieval, from query processing, to similarity calculations for individual text descriptions and folders, to ranking and highlighting of the search results.

The default weights for the various sources, ontologies, and relationships are given in Table 3, though each parameter can be manually adjusted by the user before each search. Default weights for the query expansion terms were based on experimental results from ref. (18).

Table 3.

Default weights of text sources, ontologies and relationships

Symbol	Weight description	Default weight
w_phenotype	Phenotype source weight	1.00
w_locus	Locus source weight	0.20
w_{gene_product}	Gene product source weight	0.10
w_GO	Gene ontology weight	1.00
w_PO	Plant ontology weight	1.00
w_unmatched	Nonontology terms	0.50
w_synonym	Synonym expansion weight	0.20
w_parent	Parent expansion weight	0.10
w_child	Child expansion weight	0.05

Symbol	Weight description	Default weight
w_phenotype	Phenotype source weight	1.00
w_locus	Locus source weight	0.20
w_{gene_product}	Gene product source weight	0.10
w_GO	Gene ontology weight	1.00
w_PO	Plant ontology weight	1.00
w_unmatched	Nonontology terms	0.50
w_synonym	Synonym expansion weight	0.20
w_parent	Parent expansion weight	0.10
w_child	Child expansion weight	0.05

Table 3.

Default weights of text sources, ontologies and relationships

Symbol	Weight description	Default weight
w_phenotype	Phenotype source weight	1.00
w_locus	Locus source weight	0.20
w_{gene_product}	Gene product source weight	0.10
w_GO	Gene ontology weight	1.00
w_PO	Plant ontology weight	1.00
w_unmatched	Nonontology terms	0.50
w_synonym	Synonym expansion weight	0.20
w_parent	Parent expansion weight	0.10
w_child	Child expansion weight	0.05

Symbol	Weight description	Default weight
w_phenotype	Phenotype source weight	1.00
w_locus	Locus source weight	0.20
w_{gene_product}	Gene product source weight	0.10
w_GO	Gene ontology weight	1.00
w_PO	Plant ontology weight	1.00
w_unmatched	Nonontology terms	0.50
w_synonym	Synonym expansion weight	0.20
w_parent	Parent expansion weight	0.10
w_child	Child expansion weight	0.05

Query search interface

The query interface for the search engine is depicted in Figure 4 and contains the standard text box for the user's free-text query as well as all the weighting options for each type of term, located in the ‘search options’ box underneath these controls. Each column of controls represents a related set of options.

Figure 4.

Query interface for the multi-source ontology-based retrieval engine for maize mutant phenotypes.

The first column regards the available searchable text sources. The position of the slider bars can be used to select how much emphasis to place on each source relative to the other sources, ranging from 0, which corresponds to no weight (slider at the far left), to 1, which corresponds to the greatest amount of emphasis (slider at the far right). The default values for the text sources are given in Table 3.

Similarly, the second column controls the emphasis of different types of terms present in the query. The top two sliders control the importance of terms matching to GO and PO, respectively; the bottom slider in this column controls the weight for query terms that did not match an ontology. This gives the user the flexibility to really stress specific types of terms.

The third and fourth columns control the emphasis of terms added through query expansion from the GO and PO, respectively. These sliders determine what kinds of ontology terms are used to expand the query and how much weight is given to each type of term (again, relative to the other term types). Initially, the expansion terms are given fairly low weights to guide the search primarily towards the query terms. Of the expansion terms, synonyms are defaulted with the most weight, followed by parent terms and then children, in accordance with the results in ref. (18).

Ranked results interface

After the query is submitted, the search engine retrieves and ranks folders in decreasing order of similarity to the query. For the sample query ‘small kernel’, the results page is shown in Figure 5. At the top of the page, the user's query and the current system weights are redisplayed in the search interface. This interface is repeated so that the user can make adjustments to the query if desired and resubmit.

Figure 5.

Sample search results for our retrieval engine using the query ‘small kernels’. Highlighted terms indicate matches to the query (including original terms and terms included through query expansion), with the color providing information about the term.

Below the search interface is the list of ranked folders. With each folder, the variation name, along with a link to the corresponding MaizeGDB page, is provided to the user as well as a phenotype image, the top-ranking descriptions from each text source, and the folder's relevance. Relevance is calculated using Equation (13), which consists of two factors. The first is a normalization term for the folder's cosine similarity value, and the second penalizes a folder for each query term that is not matched by the folder.

(13)

Only the best match from each text source is displayed in the ranked results because these are the documents that were used to calculate the folder similarity measure. To view all the documents associated with the phenotypic variation, including the ones that were not used to calculate the folder score, the user can click on the ‘view all documents in folder’ link in the last column (Figure 5).

When performing query expansion with synonyms, parents and/or children, it is possible that a query will be enriched with many terms from the available ontologies. As such, results containing few words from the original query may appear in the top-ranked results if the expansion weights are set sufficiently high. To aid the user in determining which terms were added and how they relate to the query, color highlighting of terms is provided. Terms matched to or added from a specific ontology are given the same general color. Different shades of that color will distinguish the relationship as matched term, synonym, parent, or child and are defined in the legend. For example, the term ‘seed’ appears in the second result in Figure 5 in an orange color, indicating a child relationship to the PO synonym ‘kernel’.

Results and discussion

The developed multi-source ontology-based maize mutant phenotype search engine utilizes interconnected free-text fields from MaizeGDB, specifically descriptions of phenotypes, loci and gene products, as well as two domain ontologies, the GO and PO. Users are given the ability to search with any combination of text sources and ontologies and can adjust the weights of these entities as desired to customize retrieval of specific queries. This is the first retrieval tool in the plant community that utilizes sophisticated information retrieval techniques to search free-text fields and provide ranking of results according to similarity to the query and that utilizes domain ontologies in this manner for query expansion.

The search engine was evaluated according to retrieval performance, in terms of speed and accuracy, as well as scalability. The setup and results of the corresponding experiments are described below. All these experiments were conducted on a standalone development server with Intel X5570 dual 2.93 GHz quad core CPU and 72 GB of RAM.

Retrieval performance

Experiments were conducted to measure both the speed and accuracy of the developed search engine. For all experiments, the list of 29 test queries in Table 4 was used. These test queries included a set of anatomical terms, most of which were linked to phenotypic variations in MaizeGDB, anatomical terms plus one or more modifiers and other miscellaneous queries.

Table 4.

List of experimental queries

ID	Query
1	Lysine
2	Gibberellins
3	Aleurone
4	Pericarp
5	Mottled appearance
6	Tassel
7	Purple aleurone
8	Endosperm
9	Tassel branch
10	Andromonoecious plant
11	Leaf
12	Leaf blade
13	Necrotic tissue
14	Seedling
15	Narrow leaf
16	Kernel
17	Yellow leaf
18	Ear
19	Broad leaf
20	Segregating ears
21	Collapsed kernels
22	Floury kernel
23	Immature ear
24	Small kernel
25	Broad green leaf
26	Opaque dented kernel
27	Small floury kernel
28	Small yellow kernel
29	Leaf blade with white stripes

ID	Query
1	Lysine
2	Gibberellins
3	Aleurone
4	Pericarp
5	Mottled appearance
6	Tassel
7	Purple aleurone
8	Endosperm
9	Tassel branch
10	Andromonoecious plant
11	Leaf
12	Leaf blade
13	Necrotic tissue
14	Seedling
15	Narrow leaf
16	Kernel
17	Yellow leaf
18	Ear
19	Broad leaf
20	Segregating ears
21	Collapsed kernels
22	Floury kernel
23	Immature ear
24	Small kernel
25	Broad green leaf
26	Opaque dented kernel
27	Small floury kernel
28	Small yellow kernel
29	Leaf blade with white stripes

Table 4.

List of experimental queries

ID	Query
1	Lysine
2	Gibberellins
3	Aleurone
4	Pericarp
5	Mottled appearance
6	Tassel
7	Purple aleurone
8	Endosperm
9	Tassel branch
10	Andromonoecious plant
11	Leaf
12	Leaf blade
13	Necrotic tissue
14	Seedling
15	Narrow leaf
16	Kernel
17	Yellow leaf
18	Ear
19	Broad leaf
20	Segregating ears
21	Collapsed kernels
22	Floury kernel
23	Immature ear
24	Small kernel
25	Broad green leaf
26	Opaque dented kernel
27	Small floury kernel
28	Small yellow kernel
29	Leaf blade with white stripes

ID	Query
1	Lysine
2	Gibberellins
3	Aleurone
4	Pericarp
5	Mottled appearance
6	Tassel
7	Purple aleurone
8	Endosperm
9	Tassel branch
10	Andromonoecious plant
11	Leaf
12	Leaf blade
13	Necrotic tissue
14	Seedling
15	Narrow leaf
16	Kernel
17	Yellow leaf
18	Ear
19	Broad leaf
20	Segregating ears
21	Collapsed kernels
22	Floury kernel
23	Immature ear
24	Small kernel
25	Broad green leaf
26	Opaque dented kernel
27	Small floury kernel
28	Small yellow kernel
29	Leaf blade with white stripes

To determine the overall speed of the retrieval system, each query was executed 20 times against one, two and all three text sources. The overall average retrieval time for all the test queries in all situations was measured at 2.16 s/query. The fastest query was ‘lysine’, which completed in 0.28 s but only had seven matching documents in the database. The slowest query was ‘leaf blade with white stripes’, which without query expansion partially matched 452 phenotypic variations, and finished in 5.67 s.

The quality and accuracy of the search mechanism were also measured. We designed an experiment that utilized eight body part terms (‘kernel’, ‘ear’, ‘leaf’, ‘tassel’, ‘aleurone’, ‘pericarp’, ‘endosperm’ and ‘seedling’) to query the system, with the expectation of retrieving variations whose affected body part matches the query. The queries were executed in four environments: with one source (phenotype captions) or all three sources and with or without query expansion. In each situation, precision was measured at 10% recall intervals and was also measured for the top 10, 20, 30, 40 and 50 results. The affected body parts, which have been manually curated by humans, in MaizeGDB for each variation were used as the standard to assess the relevance of each ranked result. The standard equations for precision and recall (2) were used and are shown below.

(14)

and

(15)

The precision values at the top 10 through 50 results for the four scenarios are shown in Figure 6. It is noteworthy to mention that the scenario with the best performance is the one that uses all three text sources but does not perform query expansion. This result helps to establish the usefulness of one of the major components of this search engine—the integration of multiple text sources to improve search—as demonstrated in this situation by the increased precision when all the text sources are utilized. The scenario with the second best performance occurs when there is the phenotype caption source is used alone in conjunction with query expansion, which gives credence to the utilization of ontologies, particularly with respect to query expansion, to help improve search results.

Figure 6.

Precision was measured at the top 10, 20, 30, 40 and 50 results for four scenarios: one text source with query expansion, one text source without query expansion, all three text sources with query expansion, and all three text sources without query expansion. The precision values at each of those levels for each scenario are shown.

These ideas are reinforced further by looking at the some of the precision–recall plots for the individual queries. Consider Figure 7, which shows the results for the ‘endosperm’ and ‘ear’ queries. While all the scenarios for ‘endosperm’ perform comparably at the early recall levels, it is the scenario with all sources and with query expansion that maintains the highest precision for the higher recall levels. In addition, by utilizing these features, the result set is able to pick up more than 90% of all the variations whose affected body part is ‘endosperm’ versus roughly 45% for the single source search without expansion. The ‘ear’ query is a quite clear example of how extra text sources and query expansion can improve the search results, as the top 10 results all retrieve correct variations, and the precision stays above all the other scenarios for all but one recall level.

Figure 7.

Precision–recall plots for two individual queries: (A) ‘ear’ and (B) ‘endosperm’. For the ‘endosperm query, all four scenarios have similar precision values at low recall levels; however, the scenario with all the text sources and query expansion includes many more of the relevant documents in the result sets than the others, as indicated by the higher precision at the higher recall levels. For the ‘ear’ query, the best performance is achieved by all the text sources with no query expansion.

System scalability

The scalability of the system was assessed using three separate experiments. First, we designed an experiment to examine the effect of query length on retrieval time. Retrieval speed was measured for each of the test queries, which were composed of one, two or three terms. The average query speeds were then used to illustrate the trend in retrieval time as the length of the query increases. Figure 8 shows the average query time for each query using only the caption text source. The figure is organized by query length with the single-word queries on the left, the two-word queries in the middle and the three-term queries on the right. While the average times did increase as query length increased (0.43 s between query lengths of one and two terms, and 1.55 s between query lengths of two and three words), there were overlaps in the retrieval times of individual queries across the sets. It was noted that the more dominant factor in retrieval time was not the number of query terms, but rather the number of descriptions matched by the query terms.

Figure 8.

Average retrieval speeds for queries, ordered by query length. The leftmost group corresponds to the single-term queries, and these have the quickest execution times. The middle group contains queries consisting of two terms, and the query speeds for these are on average slightly slower than the single-term queries. The rightmost group of queries all contain three terms, and these are the most time consuming of the queries tested.

Second, in order to determine the effect of the number of text sources on query performance, each of the test queries was executed using just one text source (phenotype captions), two sources (captions and locus descriptions) and all three text sources. For each scenario, each query was executed 20 times with the average retrieval speeds measured and shown in Figure 9. This figure shows slight increases in retrieval time with the inclusion of additional text sources, but not a significant increase. The average increase in retrieval time for all the test queries in adding the second and third text sources was found to be 0.49 and 0.12 s, respectively.

Figure 9.

Comparison of retrieval speeds for each of the test queries (see Table 4 for the link between query identifier and query text) executed on one text source (phenotype captions only) (right bars), two sources (phenotype caption + locus descriptions) (middle bars) and three sources (left bars). There is a slight increase in execution as the number of sources increases.

Finally, we designed an experiment to examine how the size of the database, in terms of the number of documents, affects retrieval time. To perform this experiment, smaller databases were generated from the current data set by decreasing the size of the phenotype caption table. Six versions of each phenotype caption table size, which included 20, 25, 33, 50 and 75% of the original table size, were constructed. Each query was executed five times on each of the constructed databases. Figure 10 shows the trend in query performance as the number of documents in the database increased. The 5-fold increase in database size depicted shows an increase in retrieval time from 1.19 s to only 2.42 s. Should the size of the database eventually cause retrieval times to become unacceptable, various strategies could be employed to speed up the search. Performing the search on each text source in parallel and then merging the results of each of those threads could make significant inroads in decreasing retrieval speed. In addition, because the search engine is currently implemented via a PHP application that communicates with a MySQL database, porting portions of the search procedure to a higher performance language, like C++, could also allow for substantial gains in query speeds.

Figure 10.

Effect of database size on retrieval speeds. Six different-sized test databases were constructed from subsets of the original data set. All the test queries were executed against each of these test databases, with the minimum, average and maximum query times measured. The trend suggests a nonlinear complexity for the search task.

Limitations

As with any system, there are some limitations to this search mechanism. The first limitation is the requirement for exact matches when pairing terms to ontology concepts. This, at times, prevents the matching of concepts in documents to ontologies and thus prevents the ability to perform query expansion on those terms. For example, the term ‘aleurone’ on its own is not in the PO; however, ‘aleurone layer’ is a concept. Due to the large number of concepts that can be returned by performing a partial match to an ontology concept, it was decided that missing an occasional ontology match was a better alternative than falsely identifying many ontology pairings. Though stemming is performed on words in the text documents, ontologies and queries in an effort to reduce each term to its base word, variations in terms do exist that stemming does not account for. The result of this is the inability to match some variations of words (e.g. ‘necrosis’ and ‘necrotic’ map to ‘necrosi’ and ‘necrot’, respectively). A third limitation of this approach is the potential steep learning curves for users. Because of the built in flexibility, it may take some time for users to get accustomed to the available weighting options and it may take some exploration of the capabilities of the system to learn how to best utilize it.

Future work

There are several places where this project could be expanded through future work. First, we could investigate the use of linked grammars in this type of search engine, which may allow us to weight terms better by considering their word classes (e.g. noun, verb or adjective). Query expansion is performed automatically in our search engine; however, we also in the future plan to incorporate user-assisted query expansion, which allows the user to see the list of candidate expansions to the query and remove any unwanted terms from that list before the search is performed. We are also investigating the possibility of giving the user the flexibility to weigh individual terms in the query (including candidate expansions) rather than one weight for an entire class of terms. In addition, we would like to investigate using relevance feedback techniques to automatically determine user-driven weights for the various parameters. Currently only free-text sources are searchable through this mechanism; however, future development to integrate other kinds of sources into this search engine is planned. First, attribute fields, like those searched in current plant retrieval tools, will be incorporated. This would allow a phenotype description search to be limited to, for example, only those phenotypes linked to a specific trait or anatomical body part or to gene products of certain types. Leveraging attribute searches in conjunction with free-text information is promising. We also hope to explore the possibility of including nontext sources, like images and sequences, into the search mechanism as well. Phenotype searches using these modalities have already been developed; MaizeGDB has a sequence search mechanism called POPcorn available, and we have already explored in a previous work (20) how to represent phenotype images as feature vectors and perform image searches. In both cases, however, an approach to intelligently combine these various types of sources remains unstudied. Nevertheless, such a hybrid search mechanism that merges varied information sources would have great potential.

Finally, the MaizeGDB resource itself is in the early stages of a full redesign. One component of the redesign will be the inclusion of the system described here as MaizeGDB's phenotype search tool. Full deployment of the new resource including the VPhenoDBS: Maize search is anticipated for March of 2013.

Conclusions

We have developed a novel retrieval engine for maize mutant phenotypes. The main contribution is the development of multi-source retrieval, in which several interconnected text sources are utilized for searching and folders are retrieved and ranked for the user. Folder representation was defined, and a similarity measure was constructed so that traditional information retrieval techniques could be employed. The developed search engine also utilized domain ontologies for query expansion to help improve the accuracy and coverage of search results. This is the first such retrieval tool in the plant community. As plant genome databases continue to increase in size, next-generation search mechanisms will be needed to meet the needs of plant science researchers in a timely fashion.

Funding

National Science Foundation (DBI-0447794) and the National Library of Medicine (NLM) Bioinformatics and Health Informatics Training (BHIRT) (2T15LM007089-17 to J.G.). Funding for open access charge: Shumaker Endowment for Bioinformatics.

Conflict of interest. None declared.

Acknowledgements

The authors would like to acknowledge MaizeGDB for contributing the data for the retrieval engine as well as the MaizeGDB team itself for fruitful discussions.The multi-source retrieval was developed and implemented by J.G., who also prepared the manuscript. J.H. contributed to the development of multi-source retrieval and also was involved with manuscript preparation. M.S. provided maize expertise, ideas for making the retrieval engine most useful to the plant community, and provided guidance during manuscript preparation. C.J.L. provided insight regarding MaizeGDB, outlined unmet needs of maize researchers that were addressed by this work, and provided guidance during manuscript preparation. C.R.S. directed the project and provided essential guidance during development of the system as well as during preparation of the manuscript.

References

Lawrence

Harper

Schaeffer

, et al. ,

MaizeGDB: The maize model organism database for basic, translational, and applied research

Int. J. Plant Genomics

2008

, vol.

2008

Article ID 496957

Jaiswal

Yap

, et al. ,

Gramene: a bird's eye view of cereal genomes

Nucleic Acids Res.

2006

, vol.

(pg.

D717

D723

)

Swarbreck

Wilks

Lamesch

, et al. ,

The arabidopsis information resource (TAIR): gene structure and function annotation

Nucleic Acids Res.

2008

, vol.

(pg.

D1009

D1014

)

Mueller

Solow

Taylor

, et al. ,

The SOL genomics network: a comparative resource for solanaceae biology and beyond

Plant Physiol.

2005

, vol.

138

(pg.

1310

1317

)

SoyBase and the Soybean Breeder's Toolbox. http://soybase.org/index.php (7 April 2011, date last accessed)

Kurata

Yamazaki

. ,

Oryzabase: an integrated biological and genome information database for rice

Plant Physiol.

2006

, vol.

140

(pg.

)

Baeza-Yates

Ribeiro-Nero

. ,

Modern Information Retrieval

1999

Reading, MA

Addison Wesley

The Gene Ontology Consortium. (2000) Gene Ontology: tool for the unification of biology. Nature Genet., 25, 25–29

Jaiswal

Avraham

Ilic

, et al. ,

Plant ontology: a controlled vocabulary of plant structures and growth stages

Comp. Funct. Genomics

2005

, vol.

(pg.

388

397

)

Banerjee

Pedersen

. ,

An adapted lesk algorithm for word sense disambiguation using WordNet

Computational Linguistics and Intelligent Text Processing

2002

, vol.

2276

Berlin, Heidelberg

Springer

(pg.

117

171

)

Szpakowicz

Matwin

. ,

A WordNet-based algorithm for word sense disambiguation

Proceedings for the IJCAI'95

1995

Canada

Montreal

(pg.

1368

1374

)

Widdows

Peters

Cederberg

, et al. ,

Unsupervised monolingual and bilingual word-sense disambiguation of medical documents using UMLS

Proceedings of ACL 2003 Workshop on NLP

2003

Stroudsburg, PA

Aronson

. ,

Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program

Proc. AMIA Symp.

2001

(pg.

)

Harabagiu,S. and Lacatusu,F. (2005) Topic themes for multi-document summarization. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, pp. 202–209

Lee

C-S

Jian

Z-W

Huang

L-K

. ,

A fuzzy ontology and its application to news summarization

IEEE Trans. Systems, Man, Cybernetics

2005

, vol.

(pg.

859

880

)

Crossref

Jones

Abdelmoty

. ,

Ontology-based spatial query expansion in information retrieval

Lecture Notes in Computer Science

2005

, vol.

3761

(pg.

1466

1482

)

Navigli

Velardi

. ,

An analysis of ontology-based query expansion strategies

Proceedings of the Internationl Workshop & Tutorial on Adaptive Text Extraction and Mining

2003

CavtatDubrovnik, Croatia

Wollersheim

Rahayu

. ,

Ontology based query expansion framework for use in medical information systems

J. Web Inf. Systems

2005

, vol.

(pg.

101

115

)

Crossref

Salton

Wong

Yang

. ,

A vector space model for automatic indexing

Information Retrieval and Language Processing

1975

, vol.

(pg.

613

620

)

Shyu

C-R

Harnsomburana

Green

, et al. ,

Searching and mining visually-observed phenotypes of maize mutants

J. Bioinformatics Comput. Biol.

2007

, vol.

(pg.

1193

1213

)

Crossref