Lexical Database of Pakistani Regional Languages
Chapter 2 Literature Review
2.1 LANGUAGES OF PAKISTAN
Pakistan has many local or regional languages, including Punjabi, Pashto, Sindhi, Saraiki, Urdu and Balochi [30, 32]. Most of these languages do not have standardized alphabets; only Sindhi, Pashto and Urdu have consistent alphabets, and the Urdu or Sindhi alphabets are used to write the other languages. This research focuses on Punjabi. Below we discuss the writing style, grammar and morphology of Punjabi in detail.
Punjabi is the twelfth most widely spoken language in the world and the most commonly spoken language in Pakistan. According to the Census Report of Pakistan in 2008, there are 76,335,300 Punjabi speakers, roughly 75% of the total population of Pakistan, who have Punjabi as their first or second language.
The Punjabi language has two dialects: Eastern Punjabi, mostly spoken by the people of Punjab in India, and Western Punjabi, mostly spoken by the people of Punjab in Pakistan [1, 22, 36]. The Perso-Arabic (Shahmukhi) script is used in Pakistan, whereas the Gurmukhi / Devanagari script is used in India [30, 31, 35, 37, 41].
The Punjabi language traces back to the Indo-Aryan languages [38, 40]. Over time, however, Persian, Arabic and Turkish words have entered the Punjabi vocabulary. Its alphabet is also a problem: there is no standardized Punjabi alphabet, and the language is usually written with the Urdu alphabet. Punjabi, especially as spoken in Pakistan, is a less-resourced language, and very little work has been done on it in either Gurmukhi or Shahmukhi [1, 22]. The main aim of this research is to create a large resource for this language in the Shahmukhi script that can be used by language learners and users, and by linguists to collect information for particular NLP applications.
Shahmukhi is written from right to left and is based on the Nastalique style of the Persian and Arabic script. The shape of a character is context sensitive: a letter takes a different shape depending on whether it occurs at the start, in the middle or at the end of a word.
This script has thirty-eight letters, which include four long vowels, Alif ا [ɘ], Vao و [v], Choti-ye ى [j] and Badi-ye ے [j]; three short vowels, Zer ِ, Pesh ُ and Zabar َ; diacritical marks such as Shad ّ, Khari-Zabar ٰ, do-Zabar ً [ən] and do-Zer ٍ [ɪn]; and the symbol hamza ء [ɪ]. Ten aspirated consonants (بھ،پھ،تھ،ٹھ،جھ،چھ،دھ،ڈھ،کھ،گھ) are used far more frequently than the remaining six aspirates (رھ، ڑھ، لھ، مھ، نھ، وھ).
Text written in the Shahmukhi script generally omits short vowels and diacritical marks. The same written word with and without diacritics, or with a different set of diacritics, can carry different meanings, which creates ambiguity for machines and especially for non-speakers of Punjabi. For example, the word ترنا can be written in two ways: تَرنا (to swim/float) and تُرنا (to walk/move). Similarly, the word بیل means ‘the vine’, whereas بَیل means ‘the bull’. Diacritics are therefore necessary to remove ambiguity between word meanings/senses. The current implementation handles this as follows: take the text from the text box, remove the diacritics and search for the diacritic-free word in the database. If the word is found, retrieve its ID(s) and then search for the same word with aerab (diacritics) in the subsequent fields of the matching entries. Alternatively, offer the user the candidate words with aerab so that the user can select the exact word; otherwise, show all the matching words without aerab.
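The lookup procedure described above can be sketched as follows. This is a minimal illustration rather than the actual implementation: the in-memory `LEXICON` dictionary stands in for the database, its field layout is an assumption, and the set of code points treated as aerab is likewise assumed.

```python
import re

# Combining marks treated as aerab here (an assumed set):
# U+064B..U+0652 covers do-Zabar, do-Zer, Zabar, Pesh, Zer, Shad and sukun;
# U+0670 is Khari-Zabar.
AERAB = re.compile("[\u064B-\u0652\u0670]")

def strip_aerab(word: str) -> str:
    """Return the word with all diacritical marks removed."""
    return AERAB.sub("", word)

# The example word from the text, spelled with explicit code points.
BARE = "\u062A\u0631\u0646\u0627"              # tarna/turna without aerab
WITH_ZABAR = "\u062A\u064E\u0631\u0646\u0627"  # with Zabar: to swim/float
WITH_PESH = "\u062A\u064F\u0631\u0646\u0627"   # with Pesh: to walk/move

# Hypothetical stand-in for the database:
# entry ID -> (bare form, form with aerab, gloss)
LEXICON = {
    1: (BARE, WITH_ZABAR, "to swim/float"),
    2: (BARE, WITH_PESH, "to walk/move"),
}

def search(query: str) -> list:
    """Return matching entry IDs, narrowed by aerab when the query has them."""
    bare = strip_aerab(query)
    ids = [i for i, (plain, _, _) in LEXICON.items() if plain == bare]
    if query != bare:
        # The query carried diacritics: prefer entries whose aerab form matches.
        exact = [i for i in ids if LEXICON[i][1] == query]
        if exact:
            return exact
    # No diacritics (or no exact match): return every candidate so that
    # the user can choose among them.
    return ids
```

A query without aerab returns both entries, while a query with Pesh narrows the result to the second entry only.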
b) Punjabi Grammar
Here we discuss various morphological and syntactic structures of the Punjabi language. Like Urdu, Punjabi follows the canonical Subject-Object-Verb word order. It has two genders (masculine and feminine), two numbers (singular and plural), six cases (accusative, nominative, instrumental, ablative, dative and locative), two types of adjectives (inflected and uninflected) and two types of affixes (prefix and suffix). It uses postpositions rather than prepositions: where English has ‘on the roof’, Punjabi has ‘چھت تے’ or ‘چھت اُتے’.
c) Punjabi Classes
Punjabi words may be inflected or uninflected. Inflection is mostly expressed by suffixes that carry grammatical information such as number, person and tense. Nouns are inflected for number and case, e.g. مُنڈا ‘boy’ is singular and مُنڈے ‘boys’ is plural. Sometimes the same word form is used for both singular and plural, depending on the context in which it occurs, e.g.
in مُنڈے چھت اُتے چڑھ گئے۔ the word مُنڈے is plural, whereas in مُنڈے نے اپنا گلا کٹ لیا۔ the word مُنڈے is singular.
d) Gender Rule for Nouns
If a noun ends in alif (ا), it is masculine, whereas if it ends in Choti-ye (ی), it is feminine. For example, the word مُنڈا denotes a masculine entity and کُڑی denotes a feminine entity. Most Punjabi nouns follow this rule, with some exceptions.
e) Number Rule for Nouns
If a noun ends in alif (ا), it is singular, whereas if it ends in یاں or Badi-ye (ے), it is plural. For example, the word مُنڈا denotes a singular entity and مُنڈے، کُڑیاں denote plural entities. Most Punjabi nouns follow this rule, with some exceptions.
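The two ending-based rules for nouns can be expressed as a small heuristic. The sketch below is illustrative only: the function names are hypothetical, the words are spelled with explicit code points (assuming U+06CC for Choti-ye and U+06D2 for Badi-ye), and, as noted above, real nouns have exceptions that such a rule cannot capture.

```python
ALIF = "\u0627"               # alif
CHOTI_YE = "\u06CC"           # Choti-ye
BADI_YE = "\u06D2"            # Badi-ye
YAAN = "\u06CC\u0627\u06BA"   # the plural ending: Choti-ye + alif + noon ghunna

def guess_gender(noun: str):
    """Ending-based gender guess; returns None when no rule applies."""
    if noun.endswith(ALIF):
        return "masculine"
    if noun.endswith(CHOTI_YE):
        return "feminine"
    return None

def guess_number(noun: str):
    """Ending-based number guess; returns None when no rule applies."""
    if noun.endswith(YAAN) or noun.endswith(BADI_YE):
        return "plural"
    if noun.endswith(ALIF):
        return "singular"
    return None

# The example nouns from the text.
MUNDA = "\u0645\u064F\u0646\u0688\u0627"          # boy
MUNDE = "\u0645\u064F\u0646\u0688\u06D2"          # boys
KURI = "\u06A9\u064F\u0691\u06CC"                 # girl
KURIYAN = "\u06A9\u064F\u0691\u06CC\u0627\u06BA"  # girls

print(guess_gender(MUNDA), guess_number(MUNDA))
print(guess_gender(KURI), guess_number(KURIYAN))
```

Because the rule looks only at the ending, it labels the Badi-ye form as plural even in sentences where context makes it singular; resolving such cases needs information beyond the word form.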
Adjectives can likewise be divided into inflected and uninflected categories. Inflected Punjabi adjectives change form for singular and plural, masculine and feminine, etc. The adjective form must agree with the noun, so, following the rules given above for nouns, inflected adjectives are marked through their endings for the gender, number and case of the noun they qualify. For example, in کالا بکرا the word کالا is a masculine adjective and بکرا is a masculine noun; in کالی بکری the word کالی is a feminine adjective and بکری is a feminine noun. In both cases the adjective agrees with the gender of the noun it qualifies. Uninflected adjectives are invariant and may end in either consonants or vowels, for example اُداس، بدشکل.
g) Tonal Features of Punjabi
Modern Punjabi is a tonal language: the same written word pronounced differently yields a different word or sense. For example, the word کوڑا can be pronounced as KoRa (horse), KoRRa (leprosy, a disease) or KoRaa (whip). The tonal / melodic features are inherent properties of the pronunciation of a word. Punjabi has three tones: high-falling (high tone), low-rising (low tone) and mid (level tone).
2.2 LEXICAL DATABASE / WORDNET PRINCIPLE
WordNets are crucial structures in natural language processing. Princeton WordNet (PWN) is an online lexical reference system for English whose design is inspired by psycholinguistic theories of the human mind. A traditional dictionary lists lexical entries alphabetically, whereas a WordNet (WN) is organized by word meaning: all the words that can express a particular sense are grouped into a synonym set (called a synset). Synsets are created within the grammatical categories noun, verb, adjective and adverb: all the words of the same grammatical category that share a concept form the members of one synset. Synsets are further connected to each other through lexico-semantic relations. In 1985, linguists and psychologists at Princeton University began developing a lexical database for English along these lines, known as Princeton WordNet. Following the principles of PWN, EuroWordNet was developed for European languages, and IndoWordNet was built for Indian languages.
2.2.1 LEXICAL MATRIX
The lexical matrix is a vital part of the human language acquisition and learning system. It gives the connection between word forms and word meanings, and the structure of a WN can be described by it. Word forms, referring to the physical utterance, head the columns, whereas word meanings, referring to the lexicalized concepts, head the rows. Word forms and meanings are then mapped to each other: an entry in a cell indicates that the word form in that column can be used, in a suitable context, to express the meaning (sense) of that row. For example, in Table 2.1 the entry E1,1 shows that the word form F1 has the meaning M1. If a column contains more than one entry, the word form is polysemous: under the heading F2 the entries E1,2 and E2,2 occur, so F2 is polysemous (the same word F2 with the senses M1 and M2). If a row contains more than one entry, the word forms are synonymous (different words with the same sense) relative to that context: F1 and F2 are synonyms because the entries E1,1 and E1,2 share the sense M1.
Thus some forms may be polysemous (having several different meanings), and several different forms may be synonymous (having the same meaning).
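The lexical matrix described above can be modelled as a mapping from meanings (rows) to the set of word forms (columns) that have an entry in that row. The following sketch uses only the three entries named in the text (E1,1, E1,2 and E2,2); the function names are illustrative.

```python
# Rows are meanings, columns are word forms. A form appears in a row's
# set exactly when the matrix has an entry in that cell.
MATRIX = {
    "M1": {"F1", "F2"},   # entries E1,1 and E1,2
    "M2": {"F2"},         # entry E2,2
}

def senses_of(form: str) -> list:
    """All meanings (rows) that have an entry in this form's column."""
    return sorted(m for m, forms in MATRIX.items() if form in forms)

def is_polysemous(form: str) -> bool:
    """A form is polysemous when its column holds more than one entry."""
    return len(senses_of(form)) > 1

def synonyms_of(form: str) -> list:
    """Other forms that share at least one row with this form."""
    shared = set()
    for forms in MATRIX.values():
        if form in forms:
            shared |= forms
    shared.discard(form)
    return sorted(shared)

print(senses_of("F2"))       # column F2 has entries in rows M1 and M2
print(is_polysemous("F1"))   # column F1 has a single entry
print(synonyms_of("F1"))     # F1 shares row M1 with F2
```

Reading down a column gives polysemy and reading across a row gives synonymy, which is the same distinction the paragraph draws.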
2.2.2 SYNSET AND SENSE
Synonymous words that share the same sense or concept are grouped into synsets. A word can have different meanings depending on the context in which it is used; this context-sensitive meaning is called a word sense. For example, the word آ has different senses on different occasions: (1) calling a person / animal / bird; (2) the voice at the start of a song / ghazal / qawwali, etc. A word with multiple senses appears in more than one synset. For example, the Punjabi word بتّی can have the senses وٹّی and بجلی, etc.
2.3 RELATIONS IN WORDNET
As discussed above, a WN groups word forms according to their meanings; these groups are called synsets. Relations between one synset and another are called semantic relations, and relations between the members of different synsets are called lexical relations. The database is confined to relations motivated by psycholinguistic evidence. A WN groups words by part of speech (grammatical category), and different relations apply to the different categories. For example, nouns are organized mainly by synonymy and adjectives by antonymy, whereas verbs are organized by a variety of entailment relations. These relations are discussed below.
2.3.1 SEMANTIC RELATIONS
Semantic relations hold between whole synsets. Below we discuss the different semantic relations.
WordNets are organized around the concept of synonymy, which is the most important semantic relation. Two words are true synonyms if one can be substituted for the other in every context. True synonyms are very hard to find, if they exist at all, so a weaker definition of synonymy is used instead: two words are synonymous in a particular context if one can be substituted for the other in that context without changing the overall meaning of the sentence. This relation is symmetric, i.e. if one word is a synonym of another, then the other is equally a synonym of the first. Figure 2.2 shows some example synonym groups.
• Hypernymy/Hyponymy (a-kind-of or is-a relation)
WordNet is organized as a hierarchy of superset (superordinate) and subset (subordinate) classes. Hypernymy/hyponymy is the most widely recognized relation among noun synsets. It is an a-kind-of relation, e.g. Red is a (kind of) Color, so Red is a hyponym and Color is its hypernym. In the hierarchical structure, a hyponym sits beneath its superordinate. A hypernym is a generalized concept, whereas a hyponym is a particular case of it. The hyponym inherits all the properties of its hypernym (superordinate) and has at least one additional feature that distinguishes it from its superordinate and from all its co-hyponyms. All the hyponyms of a single superordinate are called co-hyponyms.
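The is-a hierarchy can be illustrated with a child-to-parent mapping. In the sketch below only the pair Red/Color comes from the text; the entries "blue" and "attribute" and the function names are hypothetical, added so that the chain and the co-hyponym lookup have something to return.

```python
# Each hyponym points to its hypernym (superordinate).
# "blue" and "attribute" are hypothetical additions for illustration.
HYPERNYM = {
    "red": "color",
    "blue": "color",
    "color": "attribute",
}

def hypernym_chain(word: str) -> list:
    """Walk upward from a word through its successive superordinates."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def co_hyponyms(word: str) -> list:
    """Other words that share the same immediate superordinate."""
    parent = HYPERNYM.get(word)
    if parent is None:
        return []
    return sorted(w for w, p in HYPERNYM.items() if p == parent and w != word)

print(hypernym_chain("red"))   # from the most specific to the most general
print(co_hyponyms("red"))      # siblings under the same superordinate
```

A single parent per word is a simplification chosen for brevity; a fuller structure would allow a synset to have several hypernyms.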
• Meronymy/Holonymy (part-whole or has-a relation)
This relation describes the part-whole relation between two entities. If X and Y are two entities such that X is a part of Y, then X is called a meronym of Y and Y is called a holonym of X. This relation can be used to maintain a part hierarchy in the LDB. Our body has different parts such as the head, eyes and hands, so there is a meronymy/holonymy relation between the body and its parts. In this example the body is the holonym and the body parts are all its meronyms.
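A part hierarchy of this kind can be sketched with a part-to-whole mapping, using the standard convention that the part is the meronym and the whole is the holonym. The body parts come from the example in the text; placing the eye under the head, and the function names, are assumptions made for illustration.

```python
# Each part (meronym) points to the whole (holonym) it belongs to.
# Nesting "eye" under "head" is an assumption for illustration.
PART_OF = {
    "eye": "head",
    "head": "body",
    "hand": "body",
}

def holonyms(part: str) -> list:
    """All wholes that contain this part, from nearest to farthest."""
    wholes = []
    while part in PART_OF:
        part = PART_OF[part]
        wholes.append(part)
    return wholes

def meronyms(whole: str) -> list:
    """All parts that belong, directly or indirectly, to this whole."""
    return sorted(p for p in PART_OF if whole in holonyms(p))

print(holonyms("eye"))    # the eye is part of the head, which is part of the body
print(meronyms("body"))   # every part recorded beneath the body
```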
Verbs do not behave in the same way as nouns. Troponymy is the most commonly found semantic relation among the verbs of a language: hyponymy (the “kind-of” relation) is used for nouns, and troponymy (the “manner-of” relation) is used for verbs.
Entailment is another semantic relation between the verbs of a language. A verb ‘A’ entails a verb ‘B’ if the sense of verb ‘A’ necessarily includes the sense of verb ‘B’.
2.3.2 LEXICAL RELATIONS
An important lexical relation between the words of different synsets is antonymy. It appears to be a simple relation between words, but it is quite difficult to determine. The antonym of a word ‘A’ is sometimes defined as ‘not A’, but this does not hold in every context. For example, rich is the antonym of poor, but we cannot say that someone who is not rich must be poor; people are commonly described as neither rich nor poor. This relation is less frequent among nouns but is commonly found among adjectives. Adjectives are grouped around conceptual oppositions expressed by pairs of antonyms: if an adjective has no direct antonym, a synonym of it that does have a direct antonym is found.
Gradation is a mechanism for representing an intermediate value between two antonyms; for example, noon lies between the two antonyms morning and evening.
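The fallback for adjectives without a direct antonym can be sketched as a two-step lookup. Only the pair rich/poor comes from the text; the synonym entries "wealthy" and "needy" and the function name are hypothetical.

```python
# Direct antonym pairs, stored in both directions.
DIRECT_ANTONYM = {"rich": "poor", "poor": "rich"}

# Hypothetical synonyms, each pointing to an adjective that has a
# direct antonym.
SYNONYM_OF = {"wealthy": "rich", "needy": "poor"}

def find_antonym(adjective: str):
    """Return (antonym, kind), or None when no antonym can be reached."""
    if adjective in DIRECT_ANTONYM:
        return DIRECT_ANTONYM[adjective], "direct"
    # No direct antonym: go through a synonym that has one.
    head = SYNONYM_OF.get(adjective)
    if head in DIRECT_ANTONYM:
        return DIRECT_ANTONYM[head], "indirect"
    return None

print(find_antonym("rich"))
print(find_antonym("wealthy"))
```

Labelling the result as direct or indirect keeps the distinction visible to a caller, since an indirect antonym is a conceptual opposite rather than a lexical one.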
2.4 IMPLEMENTED LDB RELATIONS
The lexical relation antonymy and the semantic relation synonymy are implemented in the current lexical database structure for the Punjabi language. In future, we hope that this structure will be used for other local languages too and that the remaining relations will also be implemented in the database.
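One possible relational layout for these two relations is sketched below with an in-memory SQLite database. It is an assumed schema for illustration, not the structure actually used in this work: the table names, column names and English placeholder words are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: words, synset membership (synonymy),
    -- and word-to-word antonym links.
    CREATE TABLE word (id INTEGER PRIMARY KEY, form TEXT NOT NULL);
    CREATE TABLE synset_member (
        synset_id INTEGER NOT NULL,
        word_id   INTEGER NOT NULL REFERENCES word(id),
        PRIMARY KEY (synset_id, word_id)
    );
    CREATE TABLE antonym (
        word_id    INTEGER NOT NULL REFERENCES word(id),
        antonym_id INTEGER NOT NULL REFERENCES word(id),
        PRIMARY KEY (word_id, antonym_id)
    );
""")
conn.executemany("INSERT INTO word VALUES (?, ?)",
                 [(1, "rich"), (2, "wealthy"), (3, "poor")])
conn.executemany("INSERT INTO synset_member VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3)])
# Antonymy is stored in both directions so either word can be looked up.
conn.executemany("INSERT INTO antonym VALUES (?, ?)", [(1, 3), (3, 1)])

def synonyms(form: str) -> list:
    """Other words that share a synset with the given word form."""
    rows = conn.execute("""
        SELECT DISTINCT w2.form
        FROM word w1
        JOIN synset_member m1 ON m1.word_id = w1.id
        JOIN synset_member m2 ON m2.synset_id = m1.synset_id
        JOIN word w2 ON w2.id = m2.word_id
        WHERE w1.form = ? AND w2.id != w1.id
        ORDER BY w2.form
    """, (form,))
    return [r[0] for r in rows]

def antonyms(form: str) -> list:
    """Words linked to the given word form by the antonym table."""
    rows = conn.execute("""
        SELECT w2.form
        FROM word w1
        JOIN antonym a ON a.word_id = w1.id
        JOIN word w2 ON w2.id = a.antonym_id
        WHERE w1.form = ?
        ORDER BY w2.form
    """, (form,))
    return [r[0] for r in rows]

print(synonyms("rich"), antonyms("rich"))
```

Synonymy is modelled through shared synset membership because it holds between whole synsets, while antonymy links individual words, matching the semantic/lexical distinction drawn earlier in the chapter.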
2.5 EXISTING LEXICAL DATABASES
We studied a number of lexical databases in the literature thoroughly to understand their underlying structure and storage mechanisms. Some of these are discussed below.
2.5.1 English WordNet
The pioneering and most popular lexical database is the English WordNet developed at Princeton University, usually known as PWN. Details of this lexical database can be found at http://wordnet.princeton.edu/. Its smallest unit is the synset (a set of synonyms), a grouping of words that share the same lexical concept. Two kinds of relations are defined: semantic relations between synsets and lexical relations between words. PWN stores the base forms of lexemes and has no embedded morphological model; an external lemmatizer named Morphy is used to reach lexical senses from inflected word forms [2, 29]. This technique is used because English is not a morphologically rich language. WordNet is used for many computational linguistics tasks such as Word Sense Disambiguation (WSD), Information Retrieval (IR) and Information Extraction (IE). One major outgrowth of this work is EuroWordNet, developed in Europe for multiple languages such as Czech, Dutch, Estonian, French, German, Italian and Spanish, following the same design as the English WordNet.
GermaNet, a lexical and semantic resource for the German language, was built with the aim of interoperability and compatibility with other lexical resources and is structured along the lines of PWN. Some databases use the merge approach, while others adopt the expansion approach. In the expansion approach, the synsets are translated into synsets of the target language. GermaNet follows the merge approach: already existing German synsets were linked to the Interlingual Index by creating the appropriate equivalence relations. GermaNet uses an XML representation, in line with the Semantic Web concept. It was later converted to the WordNet-LMF format, following the LMF specifications proposed by the KYOTO project [45, 46].
Lexique stores not only classical word-related information for French words, such as gender, number and part-of-speech tag, but also word frequencies counted from a corpus and from web pages, as well as word lemmas.
The database structure designed at Princeton University for English (PWN) has been extended to build a lexical database of multiple languages called MultiWordNet, using the technique of concept alignment. There are two models for developing a MultiWordNet: in the first, a WordNet is developed separately and its correspondences with PWN are then identified; in the second, the development of the MultiWordNet is aligned with PWN from the start.
The functionality of WordNets is extended by the addition of phrasets (sets of free combinations of words), which are helpful for managing lexical gaps in multilingual lexical databases.
2.5.5 Kannada WordNet
The structure of the Kannada WordNet is inspired by the well-known English WordNet and the Hindi WordNet. In addition, this lexical resource has graded antonymy and meronymy relations, verbal compounding and complex verb constructions. It consists of two components: a lexical interface, where lexicographers add the lexical entries, and a web interface, where the end user searches for a word in the database.