07 Oct 2004

Dictionary structure

I have started to rewrite the dictionaries in the Ellinika project. The idea is to use XML instead of the definition language I have been using so far. The latter was designed to be simple, short (I hate to type) and suitable for describing dictionary entries. Its parser is written in C and is therefore fairly fast. Presently, the dictionary created with it contains 1150 entries, and the process of its creation has confirmed that the dictionary structure is right and that the input language is generally suitable for the purpose. However, it has also exposed some drawbacks of the language.

The principal drawback is that a dictionary entry is currently supposed to contain only one part of speech 1), i.e. the assumed entry structure is:

 (key part-of-speech articles)
However, there are words that belong to several parts of speech at once and that change their meaning according to the part of speech. For example, κρυώνω, when used as a transitive verb, means to refrigerate, whereas as an intransitive verb it means to feel cold, to freeze. The number of such words (verbs in particular) in Greek is fairly large.

So, I have decided to redesign the input language, but instead of simply fixing the existing language, I've chosen to rewrite the dictionary sources entirely in XML.

External representation

In the new definition language each entry is represented as follows:

<NODE>
 <K>string</K>+
 [<F>string</F>]
 <P ID="string">
  <M>string</M>+
  <A>string</A>*
  <X>string</X>*
  <T ID="string" />*
  <X>string</X>*
 </P>+
 <X>string</X>*
</NODE>

(As usual, optional elements are enclosed in brackets, * means zero or more occurrences of the element, and + means one or more occurrences of the element.)

Elements have the following meaning:

NODE
Starts the definition of a dictionary entry.
K
Introduces the dictionary key, i.e. the word of the source language that is explained by this entry. There may be several keys if the notion in question has several synonyms.
F
Introduces grammatical forms of the key, whenever these are not formed by standard rules. In the future I expect to write a proper verb conjugator; this field will then probably hold a reference to it or an invocation of it.
P
Part of speech and meanings associated with it (see below). Attribute ID introduces the name (usually abbreviated) of the part of speech.
M
Translation of the word (M stands for Meaning)
A
Antonym
X
Cross-reference for this entry. Usually this is a reference to a synonym or some semantically related key.
T
Topic or group this entry pertains to. ID identifies the topic. When many entries pertain to the same topic, their definitions can be enclosed in
<T ID="name">
...
</T>
construct.
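
To make the full notation more concrete, here is a minimal Python sketch that reads one full-form entry with the standard xml.etree.ElementTree module. It is not the project's translator (that one is written in Scheme), only an illustration. The sample entry, the part-of-speech IDs "vt" and "vi", and the function name parse_full_node are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample entry. The IDs "vt" and "vi" are assumed
# abbreviations, not taken from the actual dictionary sources.
SAMPLE = """
<NODE>
 <K>κρυώνω</K>
 <P ID="vt">
  <M>to refrigerate</M>
 </P>
 <P ID="vi">
  <M>to feel cold</M>
  <M>to freeze</M>
 </P>
</NODE>
"""

def parse_full_node(node):
    """Convert a full-form NODE element into a plain dictionary."""
    return {
        "keys": [k.text for k in node.findall("K")],
        "forms": node.findtext("F"),  # None when the optional F is absent
        "pos": [
            {
                "id": p.get("ID"),
                "meanings": [m.text for m in p.findall("M")],
                "antonyms": [a.text for a in p.findall("A")],
                "xrefs": [x.text for x in p.findall("X")],
                "topics": [t.get("ID") for t in p.findall("T")],
            }
            for p in node.findall("P")
        ],
        # X elements placed directly under NODE apply to the whole entry.
        "xrefs": [x.text for x in node.findall("X")],
    }

entry = parse_full_node(ET.fromstring(SAMPLE.strip()))
for pos in entry["pos"]:
    print(pos["id"], pos["meanings"])
```

Each P element becomes its own record, so one key can carry a separate list of meanings for each part of speech, which is exactly what the old (key part-of-speech articles) structure could not express.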

There are two special forms of this notation. The first is a shortcut for words that have only one part of speech (as I said, I hate to type, so I'm trying to save as much typing as possible):

<NODE>
 <K>string</K>+
 [<F>string</F>]
 <P>string</P>
 <M>string</M>*
 <A>string</A>*
 <X>string</X>*
 <T ID="string" />*
</NODE>
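
One way to handle the shortcut is to expand it into the full form before any further processing. The Python sketch below does that under two assumptions of mine: the short form is recognized by a single P element that has no ID attribute, and every M, A, X and T element of such an entry moves inside the generated P. The function name expand_short_form and the ID "n" are hypothetical.

```python
import xml.etree.ElementTree as ET

def expand_short_form(node):
    """Return the full-form equivalent of a short-form NODE.

    Assumption: the short form has exactly one P element, without an ID
    attribute, whose text names the part of speech.
    """
    parts = node.findall("P")
    if len(parts) != 1 or parts[0].get("ID") is not None:
        return node  # looks like the full form already; leave it alone
    full = ET.Element("NODE")
    for tag in ("K", "F"):
        full.extend(node.findall(tag))
    # The text of the short-form P becomes the ID of the full-form P.
    p = ET.SubElement(full, "P", ID=(parts[0].text or "").strip())
    for tag in ("M", "A", "X", "T"):
        p.extend(node.findall(tag))
    return full

# Hypothetical short-form entry; "n" is an assumed abbreviation.
short = ET.fromstring("<NODE><K>το ποτάμι</K><P>n</P><M>river</M></NODE>")
full = expand_short_form(short)
print(ET.tostring(full, encoding="unicode"))
```

After this step the rest of a program only has to deal with one entry layout.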

The second introduces an entry that is a reference to another entry in the dictionary:

<NODE>
 <K>string</K>+
 [<F>string</F>]
 <P>string</P>
 <X>string</X>*
 <T ID="string" />*
</NODE>

This is useful for such pairs as "ο ποταμός" and "το ποτάμι", both meaning "river" but having different genders. The special form <X /> means a reference to the immediately preceding node definition which has at least one translation (M element).
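
The rule for the empty X element can be sketched in Python as a single pass over the entries that remembers the last one having a translation. Several details here are assumptions of mine: the DICT wrapper element, the part-of-speech IDs "m" and "n", the function name resolve_empty_xrefs, and the choice to resolve the reference to the first key of the target entry.

```python
import xml.etree.ElementTree as ET

def resolve_empty_xrefs(nodes):
    """Fill each empty X element with the first key of the nearest
    preceding NODE that contains at least one M element."""
    last_translated = None
    for node in nodes:
        # Resolve first, so that a node never refers to itself.
        for x in node.iter("X"):
            if not (x.text or "").strip() and last_translated is not None:
                x.text = last_translated
        # Only nodes with a translation can become reference targets.
        if node.find(".//M") is not None:
            last_translated = node.findtext("K")
    return nodes

# Hypothetical input; DICT is an assumed wrapper element.
SOURCE = """
<DICT>
 <NODE><K>ο ποταμός</K><P>m</P><M>river</M></NODE>
 <NODE><K>το ποτάμι</K><P>n</P><X /></NODE>
</DICT>
"""

nodes = resolve_empty_xrefs(ET.fromstring(SOURCE.strip()).findall("NODE"))
print(nodes[1].find("X").text)
```

Because reference-only entries contain no M element, a run of several such entries in a row all resolve to the same translated entry.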

Examples

A working example can be found here.

Internal representation

The initial version of the dictionary translator is already available. See its heading comment for a short description of the internal representation (in Scheme, of course).

1) The term part of speech is used here in a broader sense: e.g. for dictionary purposes, transitive verbs and intransitive verbs are regarded as different parts of speech. I may have to introduce finer granularity here (e.g. part-of-speech/subpart or something similar), but currently I am not sure it will be worth the effort.