Glossary

information beads

Terms explained

A

A/B Testing

Originally a concept from the advertising industry, A/B testing is a method for optimizing an existing system. Part of the user traffic is randomly redirected from the live system to a second system that differs from the first in one specific feature. A/B testing may be used, for example, to compare two different search engines on otherwise identical websites.
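The traffic split described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the definition: the function name `assign_variant`, the parameter `share_b` and the use of a hash of the user ID (a pseudo-random but repeatable assignment, so that a returning user always sees the same system) are all hypothetical choices.

```python
import hashlib


def assign_variant(user_id: str, share_b: float = 0.1) -> str:
    """Return "A" (live system) or "B" (second system) for a user.

    Hypothetical helper: share_b is the fraction of traffic sent to B.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < share_b else "A"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid))
```

Hashing instead of drawing a fresh random number per request is a design choice: it keeps the assignment stable without storing any state.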

Associative Relationship

In a Controlled Vocabulary, especially a Thesaurus, one potential relationship is the associative one. Association is usually expressed by the abbreviation RT for "related term". An example of an associative relationship between two terms might be "wine" - "grapes". Since this kind of relationship is less specific than an Equivalence or Hierarchical Relationship, it should be used with care.

B

Boolean Query

A combination of words connected by Boolean operators (AND, OR, NOT), for example "taxonomy NOT biology". See also Query.
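As a rough sketch, Boolean operators can be thought of as set operations on the documents that contain each word. The tiny index below, its document numbers and the helper `postings` are made up for illustration; a real search engine handles this very differently in detail.

```python
# Hypothetical index: word -> set of IDs of the documents containing it.
INDEX = {
    "taxonomy": {1, 2, 3},
    "biology": {2, 4},
    "thesaurus": {3, 5},
}


def postings(word: str) -> set:
    """Return the document IDs for a word (empty set if unknown)."""
    return INDEX.get(word, set())


# "taxonomy NOT biology": documents with "taxonomy" but without "biology".
taxonomy_not_biology = postings("taxonomy") - postings("biology")
# "taxonomy AND thesaurus": documents containing both words.
taxonomy_and_thesaurus = postings("taxonomy") & postings("thesaurus")
# "taxonomy OR biology": documents containing at least one of the words.
taxonomy_or_biology = postings("taxonomy") | postings("biology")

print(sorted(taxonomy_not_biology))  # [1, 3]
```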

C

Categorization

The term "categorization" is often used as a synonym for "classification". The two terms are sometimes also contrasted to distinguish the filing of content items within a flexible, context-dependent structure (categorization) from filing within a rigid, context-independent structure (classification). Examples of classification in this sense are a physical library filing system or a website navigation, where each book or page has exactly one place in the structure, i.e. belongs to one specific class. Categorization, in contrast, refers to content items being associated with several categories, as for instance in an information retrieval system (see Information Retrieval), where the categories help to refine searches (see Query).

Computational Linguistics

Computational linguistics is a field of study where human language and computer science intersect, covering topics like Information Retrieval, text mining, text categorization, speech recognition or machine translation. Natural language processing is often used synonymously.

Controlled Vocabulary

A controlled vocabulary is a vocabulary that is maintained according to certain principles (defined in guidelines) and is used, for example, to categorize content (see Categorization). Controlled vocabularies use one canonical term for each concept and differ in the types of relationships that link their terms. Examples of controlled vocabularies are Taxonomies and Thesauri. Various standards describe controlled vocabularies, such as ANSI/NISO Z39.19-2005, "Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies", or ISO 25964-1, "Thesauri for information retrieval". Before creating a controlled vocabulary, it is important to understand its overall purpose, the needs of its users and how they will interact with the vocabulary, and the content it will cover.

Conversion Rate

To optimize the site search of an e-commerce platform, statistics on the conversion rate of products are usually monitored. The conversion rate relates the number of product orders (often measured as additions to the shopping basket rather than completed orders) to the number of search queries for a given search term.
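Read as a formula, the conversion rate for a search term is the number of orders divided by the number of queries for that term. The sketch below assumes exactly this ratio; the function name and the sample figures are invented for illustration.

```python
def conversion_rate(orders: int, queries: int) -> float:
    """Orders (or basket additions) per search query for one search term."""
    if queries == 0:
        # No queries for this term: report 0.0 rather than dividing by zero.
        return 0.0
    return orders / queries


# Hypothetical figures: 200 searches for a term led to 5 orders.
rate = conversion_rate(orders=5, queries=200)
print(f"{rate:.1%}")  # 2.5%
```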

D

Dictionary

A dictionary is a reference work which lists words alphabetically. In Lexicography, the difference between a dictionary and a Thesaurus is that the latter will usually just list synonyms for each entry, whereas the dictionary may provide a description, translations (bilingual dictionary) as well as information on the pronunciation of a word.

E

Equivalence Relationship

In a Controlled Vocabulary, preferred terms are linked to equivalent non-preferred terms. These non-preferred terms may be Synonyms, near-synonyms, lexical variants (singular / plural, British / American spelling, etc.) and other similar terms.

G

Gazetteer

A gazetteer is a geographical dictionary. In addition to place names, it usually contains the geographical footprint of each place (coordinates), the type of place (city, town, lake, etc.) and other information (for example statistics). Gazetteers are used as works of reference, in geographical information systems (GIS), for geographical information retrieval (see Information Retrieval) and for geoparsing (Toponym recognition and disambiguation).

Glossary

A glossary lists words alphabetically and gives a definition for each entry. Glossaries are usually related to a specific subject.

H

Hierarchical Relationship

In a controlled vocabulary, the hierarchical relationship links a narrower term (NT) to a broader term (BT). An example is the narrower term "robin" and its broader term "bird".

Homograph

Homographs have the same spelling, i.e. written form, but represent different concepts. The term "pool" can for example refer to a game or a place for swimming.

I

Information Retrieval

Information retrieval focuses on making electronic, unstructured information (i.e. text) findable. Besides documents and texts, information retrieval may also cover other types of information objects such as images or videos. The term is sometimes used as a synonym for search, although it is broader: it usually also includes navigation / browsing, filtering and search refinement as ways of finding information.

L

Lemma / Lemmatization

A lemma is the dictionary form of a word. Reducing a word form to its lemma is called lemmatization (mice --> mouse). In Information Retrieval, reducing tokens to a base form is more commonly implemented with Stemming than with lemmatization.

Lexicography

Lexicography refers to the study of dictionaries. Lexicology is sometimes used as a synonym; however, it usually refers to the study of words and thus encompasses lexicography.

O

Onomasiology

Lexical resources can be structured onomasiologically (meaning --> word) or semasiologically (word --> meaning). The former type of lexical resource (for example a Taxonomy or a Thesaurus) will display the semantic context of an entry and is thus usually arranged in a hierarchy or network. The latter type usually lists entries in alphabetical order and tries to explain the meaning of a word (for example a Dictionary).

Onomastics

Onomastics is the study of proper names. A subcategory of onomastics is toponymy (see Toponym).

Q

Query

A query is a request that a user sends to a system to express an information need.

S

Semantics

Semantics is the study of meaning. Lexical semantics deals with the meaning of words and tries to understand the relationship between a word (for example represented in written form) and the concept it stands for. In a Controlled Vocabulary, the objective is to select a preferred term for every concept such that synonymous terms (see Synonym) can be linked and a term with several meanings can be disambiguated (see Homograph).

Semasiology

See Onomasiology

Stem / Stemming

A word consists of its stem plus an inflectional ending. The word "going" is made up of the stem "go" and the inflectional ending "ing". Stemming is the process of reducing a word to its stem. In Information Retrieval, stemming is a rough process in which the endings of tokens are cut off without the morphological analysis that Lemmatization would involve.
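The difference between the two approaches can be illustrated with a deliberately crude sketch. The suffix list, the minimum stem length and the one-entry lemma table are assumptions made for this example only; they do not correspond to any established stemming algorithm or lexicon.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lemma table; a real lemmatizer relies on morphological analysis.
LEMMAS = {"mice": "mouse"}


def naive_stem(token: str) -> str:
    """Cut off the first matching ending, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token: str) -> str:
    """Look up the dictionary form; fall back to the token itself."""
    return LEMMAS.get(token, token)


print(naive_stem("going"))  # go
print(naive_stem("mice"))   # mice (no ending to cut off)
print(lemmatize("mice"))    # mouse
```

The irregular plural "mice" shows the limit of cutting off endings: only the lookup returns the dictionary form.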

Synonym

Synonymous words are words with the same meaning, i.e. words that represent the same concept. In a controlled vocabulary, a preferred term is selected to represent each concept, and its synonyms are linked to it as non-preferred terms via an Equivalence Relationship. In practice, true synonyms like "sodium chloride" and "NaCl" are difficult to identify, since each word has specific connotations and contexts in which it is used. Terms like "salt" and "sodium chloride" could be considered only near-synonyms: they refer to the same thing, but they are used in different contexts (cooking versus chemistry).

T

Taxonomy

The term "taxonomy" has Greek origins: it is formed from "taxis" (arrangement) and "nomos" (law). In biology, it refers to the classification of organisms in a hierarchical system. This concept has been adapted in information science to refer to a hierarchical grouping of terms, which is used as a structure for categorizing and browsing. In a wider sense, the term "taxonomy" is also used to denote Controlled Vocabularies of any kind.

Thesaurus

In information and library science, a thesaurus is a Controlled Vocabulary with predefined relationships (hierarchical, equivalence and associative) as well as scope notes or definitions. In Lexicography, a thesaurus is a reference work that lists words and their synonyms (usually grouped by subject).
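A thesaurus in the first sense can be pictured as a set of concepts, each with a preferred term and its three kinds of relationships. The sketch below reuses examples from this glossary ("bird" / "robin", "wine" / "grapes", "sodium chloride" / "NaCl"); the class `Concept`, its field names and the function `resolve` are hypothetical and do not follow any particular standard or exchange format.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred: str
    non_preferred: set = field(default_factory=set)  # equivalence relationship
    broader: set = field(default_factory=set)        # hierarchical (BT)
    narrower: set = field(default_factory=set)       # hierarchical (NT)
    related: set = field(default_factory=set)        # associative (RT)
    scope_note: str = ""


THESAURUS = {
    c.preferred: c
    for c in (
        Concept("bird", narrower={"robin"}),
        Concept("robin", broader={"bird"}),
        Concept("wine", related={"grapes"}),
        Concept("grapes", related={"wine"}),
        Concept("sodium chloride", non_preferred={"NaCl"}),
    )
}


def resolve(term: str):
    """Return the preferred term for a term, or None if it is unknown."""
    for concept in THESAURUS.values():
        if term == concept.preferred or term in concept.non_preferred:
            return concept.preferred
    return None


print(resolve("NaCl"))  # sodium chloride
```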

Token / Type

In computational linguistics, a token is an individual occurrence of a linguistic entity (usually words and punctuation marks). The sentence "A word is a word is a word." contains 9 tokens (8 words and 1 full stop). In contrast, a type is a class of tokens. The same sentence contains 4 types ("a", "word", "is" and "."; "A" and "a" are here considered to belong to the same type).

Tokenizer

For many tasks in computational linguistics, tokenization is the very first step in processing a text. A tokenizer splits text into tokens (see Token / Type).
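A minimal tokenizer can be sketched with a regular expression. The rule used here (runs of word characters are tokens, and every other non-space character is a token of its own) is a simplifying assumption; real tokenizers treat abbreviations, numbers, hyphens and similar cases with far more care. Applied to the example sentence from the Token / Type entry, it yields 9 tokens and, when upper and lower case are merged and the full stop is counted, 4 types.

```python
import re


def tokenize(text: str) -> list:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one character
    # that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "A word is a word is a word."
tokens = tokenize(sentence)
# Lowercasing merges "A" and "a" into a single type.
types = {token.lower() for token in tokens}

print(len(tokens))    # 9
print(sorted(types))  # ['.', 'a', 'is', 'word']
```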

Toponym

A toponym is a place name. Toponymy, the study of place names, is a subcategory of onomastics, the study of proper names.

Translation Memory

Translation memories are usually a component of computer-aided human translation, but are sometimes also used in machine translation. They store pairs of parallel text segments so that a source segment and its translation can be offered to the translator as a suggestion when the same or a similar segment occurs again.
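The lookup can be sketched as a search for the most similar stored source segment. The two segment pairs, the function name `suggest` and the similarity threshold are invented for this example, and the similarity measure from Python's standard `difflib` module is used only for convenience; translation tools use their own matching methods.

```python
import difflib

# Hypothetical translation memory: source segment -> stored translation.
MEMORY = {
    "Open the file.": "Öffnen Sie die Datei.",
    "Save the file.": "Speichern Sie die Datei.",
}


def suggest(segment: str, cutoff: float = 0.7):
    """Return (stored source, stored translation) for the closest match,
    or None if no stored segment is similar enough."""
    matches = difflib.get_close_matches(segment, list(MEMORY), n=1, cutoff=cutoff)
    if not matches:
        return None
    source = matches[0]
    return source, MEMORY[source]


# An exact match and a near match both return the stored pair.
print(suggest("Open the file."))
print(suggest("Open the files."))
```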

W

Word

The term "word" is ambiguous, as it can refer to words in spoken or written form, to words as grammatical units, and to words as dictionary entries (lemmas). In computational linguistics, the more clearly defined terms "token", "type", "word form", "lexeme" and "lemma" are usually preferred.

