About GNI

Names Indexing environment (GNI)

The names index provides a list of all names that have been used for organisms.  Within this list lie all of the nomenclaturally correct names, all of the names that are accepted as tokens for taxa, and all of the taxonomic metadata for biodiversity informaticians. 

 Names are found in books, papers, and databases, so logically the names could be derived from the Usage Bank (GNUB). A usage pairs a name with the reference in which it appears: if "Aus bus" appears 14 times within an article (i.e. there are 14 appearances of the binomial), then there would be three usages: "Aus in Smith 1955", "bus in Smith 1955", and (arguably) "Aus bus in Smith 1955". The combination of the name and the reference is a chresonym. In the event that a usage bears on the identity of the taxon, it may also be referred to as "Aus sec Smith 1955" or "Aus bus sensu Smith 1955".

 Yet GNI appeared before GNUB, partly because, from the perspective of the global names architecture, names are the common denominator and so are the first part of the system that we need to manage, and partly because a variety of initiatives had already compiled lists of names for their own purposes. Not surprisingly, the index jumped to almost 20,000,000 names within a few months of being created.

 The index holds not only the names, but also some information about the names provider and links to the resources held by the provider. One benefit is a capacity to interconnect the information held by the different contributors. NameLink is a tool that takes advantage of this. It recognizes names in documents and inserts an anchor to which other items can be attached. Those attachments may come from other sites (such as images), if they offer appropriate APIs. NameLink is currently set up to connect names to the relevant pages of the Encyclopedia of Life.

 GNI is not intended for people to visit, but is set up to provide information through web services to other machines.

 The names index contains, at this time, 18,000,000 names. The excess of names compared to species gives us a good indication of the scale of the many-names-to-one-species problem. It's important to reconcile those names so that we have a single group of names for each species. As those groups grow, semantic technologies allow services to reach out to more specialist names providers, who can then add further information, such as which names meet nomenclatural standards or validly represent species.

 There are several ways by which reconciliation and other value-adding is being achieved.

-       Firstly, there are many authoritative web sites that provide lists of species or identify the nomenclaturally correct names.  As they engage with GNA, their content can be used to annotate or filter the very big lists.

-       Some names are simply errors of data conversion, and are really not names at all, such as the Zz entries listed to the right.  We can develop algorithms to check all of the names and apply various rules to distinguish non-names, vernaculars, and surrogates.  We can create white lists of good names, black lists of words that are not (yet) names (such as Anorexia nervosa), and grey lists where the uncertainty is referred to experts for their attention, so that we may better focus on the management of the scientific names.  Rules can also use the peculiar binomial format of species names to separate them from the single-word names of higher taxa or polynomial names, and we can use the standardized endings of families, orders, classes, etc. to identify names at higher ranks.

 

-       Most of the excess of names comes from alternative spellings of the same name. Some variations are entirely consistent with the codes; others come from misspellings, OCR errors, or truncations. Fuzzy matching algorithms, such as those developed by Tony Rees, provide a means of automatically reconciling various representations of names: typographic errors, sound-alikes, variations in how diphthongs and accented letters may be spelled out, etc. When applied to the 18 million names in GNI, Tony's algorithm renders the names down to about 6 million reconciliation groups.

-       A second component of the reconciliation process is to parse name strings into their components.

In the case of species, this separates the genus and species names and the authority information, as well as annotations or other markers (such as sp., aff., cf., ex, etc.).  Treated in this way, we can then reconstruct the canonical version of the name (for example, Fagus sylvatica orientalis).

  • Mycosphaerella eryngii (Fr. ex Duby) Johanson ex Oudem. 1897
  • Mycosphaerella eryngii (Fr. Duby) ex Oudem. 1897
  • Mycosphaerella eryngii (Fr.ex Duby) ex Oudem. 1897

These can all be rendered to the canonical form Mycosphaerella eryngii.  In turn, this allows us to link all of these names and any lexical variants of them.
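
The parsing step described above can be illustrated with a minimal Python sketch. It is not the parser GNI uses: the function names (canonical_form, group_by_canonical), the marker list, and the rule "stop at the first token that does not look like a lowercase epithet" are all assumptions made for illustration, and real name strings contain many cases this heuristic would miss (hybrid signs and unusual author particles, for instance).

```python
import re
from collections import defaultdict

# Hypothetical marker list: rank and qualifier markers that are skipped
# rather than kept in the canonical form.
MARKERS = {"sp.", "spp.", "aff.", "cf.", "subsp.", "ssp.", "var.", "f."}
# Lowercase words that usually belong to an authority string, not the name.
AUTHOR_WORDS = {"ex", "et", "in", "de", "van", "von"}
# An epithet is assumed to be a lowercase word, optionally hyphenated.
EPITHET = re.compile(r"^[a-z][a-z-]+$")


def canonical_form(name_string):
    """Return genus plus epithets, dropping authorities, years and markers."""
    tokens = name_string.split()
    if not tokens or not tokens[0][0].isupper():
        return None  # not shaped like a scientific name
    parts = [tokens[0]]
    for token in tokens[1:]:
        if token in MARKERS:
            continue  # skip the marker, keep looking for epithets
        if token in AUTHOR_WORDS or not EPITHET.match(token):
            break  # authority, year or punctuation: the name proper has ended
        parts.append(token)
    return " ".join(parts)


def group_by_canonical(name_strings):
    """Collect name strings that share the same canonical form."""
    groups = defaultdict(list)
    for name_string in name_strings:
        groups[canonical_form(name_string)].append(name_string)
    return dict(groups)


names = [
    "Mycosphaerella eryngii (Fr. ex Duby) Johanson ex Oudem. 1897",
    "Mycosphaerella eryngii (Fr. Duby) ex Oudem. 1897",
    "Mycosphaerella eryngii (Fr.ex Duby) ex Oudem. 1897",
]
# All three variants fall into one group keyed by "Mycosphaerella eryngii".
groups = group_by_canonical(names)
```

Grouping on the canonical form only links strings whose name part is spelled identically; spelling variants would still need the fuzzy matching mentioned earlier.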