Lexical Database of Pakistani Regional Languages
Chapter 1 – Introduction
The internet contains many kinds of information in many languages. Not everyone can take full advantage of it, either because they do not know those languages or because the information is not available in their native language. This situation motivates the development of technologies that address the problem; new technologies and theories make all this information more accessible than ever before. Using linguistics and Natural Language Processing (NLP), researchers have built tools with natural language interfaces to computer systems that ease man-machine interaction. Lexical resources are among the most important assets in computational linguistics and NLP. A very popular lexical resource is the English WordNet. A WordNet is a lexical database in which words of different grammatical categories, such as nouns, verbs, adjectives and adverbs, are grouped lexically and semantically on the basis of related concepts with identical or nearly identical meanings. This research adopts similar concepts to build a rich Punjabi lexical database. The same structure can later be adopted for other local languages.
1.1 NATURAL LANGUAGE PROCESSING (NLP)
Machines use Natural Language Processing (NLP) in two directions: Natural Language Understanding (NLU) and Natural Language Generation (NLG). In the former, human languages are used to communicate with computers, while in the latter, information from computers is converted into human languages. In NLP, the text is analyzed by the computer using the available technologies and theories.
NLP builds on several other fields, such as computer science, psycholinguistics, statistics, electrical and electronic engineering, and artificial intelligence. Example applications include machine translation, text summarization and information retrieval [3, 4].
1.1.1 NLP TASKS
When processing natural languages, programmers and technologists face a number of problems. Some of these are discussed below.
a) Text Segmentation
A major problem in natural text processing is text tokenization, or text segmentation: how is the text broken into words? This is only possible if we can find word boundaries, which is quite difficult for languages such as Chinese, Japanese and Thai.
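For languages written without spaces, one common baseline is dictionary-based maximum matching: greedily take the longest known word at each position. Below is a minimal sketch; the tiny lexicon and the example string are hypothetical, and the output also illustrates the weakness of greedy matching ("theman" is not split as "the man").

```python
def max_match(text, lexicon, max_len=4):
    """Greedily match the longest known word at each position.
    Falls back to a single character when nothing in the lexicon matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"them", "the", "man", "an"}
print(max_match("theman", lexicon))  # ['them', 'an']
```

Real segmenters for Chinese or Thai combine such dictionaries with statistical models precisely because of this kind of greedy error.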
b) Speech Segmentation
A major problem in natural (human) speech processing is identifying the sound of each character pronounced. Characters are pronounced differently depending on their position in the sentence. There are also problems with pauses between successive words when a sentence is pronounced. To solve such issues, a system should be designed that considers the grammar, sentence structure, word senses and context; frequency and pitch are also important.
c) Word Sense Disambiguation (WSD)
Words have different meanings (senses) depending on the position in which they appear among other words. A repository of these words along with their senses is therefore needed. A WordNet or lexical database provides this, and can in turn be used by a WSD system.
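One classic way a lexical database supports WSD is gloss overlap (simplified Lesk): choose the sense whose gloss shares the most words with the context. A minimal sketch follows; the two senses of "bank" and their glosses are illustrative entries, not drawn from a real LDB.

```python
def lesk(context_words, senses):
    """Return the sense id whose gloss overlaps most with the context."""
    def overlap(gloss):
        return len(set(context_words) & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(senses[s]))

senses = {
    "bank.n.1": "a financial institution that accepts deposits of money",
    "bank.n.2": "sloping land beside a body of water such as a river",
}
context = "he sat on the bank of the river watching the water".split()
print(lesk(context, senses))  # bank.n.2
```

The richer the glosses and example sentences stored in the LDB, the better this overlap signal becomes.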
d) Syntactic Ambiguity
A natural language sentence is syntactically ambiguous when it has more than one parse tree. The semantics and the particular context of the sentence's words can reduce these ambiguities, and a lexical resource helps greatly in resolving them.
e) Machine Translation (MT)
The translation of one human language into another by a computer is called machine translation. This is not an easy task, but translation between closely related languages that share similarities such as alphabets, structures and grammar is somewhat easier.
f) Named Entities Recognition (NER)
Named entities play a significant role in NLP systems. Identification, analysis, extraction, mining and transformation of named entities, such as names of persons, organizations, locations and concepts in a given natural language, are all challenging NLP tasks [5, 6]. NER is performed through two kinds of approaches: linguistic approaches (rule-based models) and machine learning approaches, e.g. Maximum Entropy Models, Decision Trees, Support Vector Machines and Conditional Random Fields. Hidden Markov Models (HMMs) have also been used to perform NER, and transliteration can also support NER.
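A minimal sketch of the rule-based (linguistic) approach is a gazetteer lookup plus a capitalization heuristic. The gazetteer entries and the tag names below are illustrative only, not a real resource.

```python
import re

# Hypothetical gazetteer: known names mapped to entity types.
GAZETTEER = {"Lahore": "LOCATION", "Punjab": "LOCATION"}

def tag_entities(sentence):
    """Tag gazetteer hits; flag other capitalized tokens as candidates."""
    entities = []
    for token in re.findall(r"\w+", sentence):
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
        elif token[0].isupper():
            entities.append((token, "NAME?"))  # unresolved candidate
    return entities

print(tag_entities("Ali travelled from Lahore to Islamabad"))
```

Machine learning approaches replace the hand-written rules with models trained on annotated text, but the gazetteer often remains as a feature.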
1.1.2 NLP LEVELS
There are different levels of NLP: phonology, morphology, lexical, syntactic, semantic, pragmatic and discourse. Each level provides some information about the language that is applicable and useful in some NLP application.
When we deal with a natural language, we need two types of processes:
- Morphological Analysis (Word Level Analysis)
- Syntactic Analysis (Sentence Level Analysis)
a) MORPHOLOGICAL ANALYSIS (Word Level Analysis)
Morphological analysis is the identification of the internal structure of words. The fundamental unit of the lexicon of a language is called a lexeme, and morphologically a word is the actual realization of a lexeme. For example, dies, died, dying and die are all words, or word forms, of the lexeme ‘DIE’. Some lexicons are lexeme based while others are word based. In this research we develop a word-based lexicon, or lexical database.
Words can be categorized into two general classes: open-class words and closed-class words.
Open-class words allow the creation of new members, so they are unlimited in number, whereas closed-class words are highly resistant to the addition of new members.
The task of morphological analysis is to find the basic building blocks (morphemes) from which open-class words are constructed. Morphemes may be free or bound; free morphemes are also called roots. Some examples of the morphology of words:
- The word ‘farm’ is a root, and also a stem
- The word ‘farms’ is a root/stem + inflectional affix
- The word ‘farmer’ is a root/stem + derivational affix = a new stem
- The word ‘farmers’ is a stem (‘farmer’) + inflectional affix
Note that all the above words (farm, farms, farmer, farmers) share the same root, farm.
The question now arises: how should morphology be encoded? Some researchers suggest implementing morphology inside the lexicon, whereas others, such as PWN, include no morphological component in their lexicon system.
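The farm/farms/farmer/farmers example above can be sketched as a tiny rule-based suffix stripper. The ordered suffix list is a simplification for illustration; real stemmers (e.g. the Porter stemmer) use larger, conditioned rule sets.

```python
# Ordered longest-first so "ers" is tried before "er" and "s".
SUFFIXES = ["ers", "er", "s"]

def strip_root(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

for w in ["farm", "farms", "farmer", "farmers"]:
    print(w, "->", strip_root(w))  # all reduce to "farm"
```

A lexicon with no morphological component would instead store each of the four forms as a separate entry, as the PWN approach mentioned above does.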
b) SYNTACTIC ANALYSIS (Sentence Level Analysis)
Syntactic analysis is the analysis of sentence structure. It is usually accomplished via grammatical rules, statistical learning, or both.
1.2 LEXICAL DATABASE (LDB)
Lexical resources are collections of lexical items (a word, a morpheme, or even a complete sentence) along with standardized terms describing linguistic information. Lexical items carry linguistic information such as pronunciation, part of speech, gender, number, gloss and etymology [11, 18].
A lexical item and its associated information form a lexical entry. In general, a lexical resource might be called a dictionary, a lexicon or a lexical database. A dictionary refers to a compiled work consulted directly by humans, whereas a lexicon or lexical database is a required and integral part of natural language processing systems [9, 12, 13, 14]. These lexicons hold information about the words of a particular language and can be constructed for a single language (a mono-lingual lexicon) or for more than one language concurrently (a multi-lingual lexicon). In a mono-lingual lexicon, the stored lexical entries belong to one language only, whereas in a multi-lingual lexicon each lexical entry is stored with its corresponding lexical entries in multiple languages. We will use the term LDB for both lexicon and lexical database. A mono-lingual LDB is easier to construct and manage than a multi-lingual LDB, because a multi-lingual LDB faces one of the biggest problems in the field: conceptual mismatches between languages, or lexical gaps.
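The distinction between a mono-lingual and a multi-lingual lexical entry can be sketched as a simple record. The field names below are illustrative, modeled on the information listed above (pronunciation, part of speech, gloss, etc.), not a fixed standard; the translation equivalents are likewise only examples.

```python
# A mono-lingual entry: one language, one bundle of linguistic information.
mono_entry = {
    "lemma": "farm",
    "pos": "noun",
    "pronunciation": "/fɑːrm/",
    "gloss": "an area of land used for growing crops or keeping animals",
}

# A multi-lingual entry additionally links to corresponding entries in
# other languages; a missing link would mark a lexical gap.
multi_entry = dict(mono_entry, translations={"ur": "کھیت", "pa": "کھیت"})

print(sorted(multi_entry))
```

Managing the `translations` links is exactly where the conceptual-mismatch problem appears: some senses simply have no single-word equivalent in another language.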
1.3 DIFFERENCE BETWEEN TRADITIONAL DATABASE AND LDB
Traditional databases, or dictionary databases of natural languages, are essentially digital versions of printed dictionaries: they are based primarily on the compilation of lexicographers' work, and there are no relationships between the words. Such a resource is usually known as a Machine Readable Dictionary (MRD). A lexical database (LDB), by contrast, is a lexical resource with a specific structure that focuses on computational exploitation, so that it can be used both in NLP systems and for human consultation [9, 20].
1.4 NEED OF LEXICAL DATABASE
Why do we want to build lexical databases of natural languages? We want to store information about the words of a natural language so that we can process them; the repository of these words along with their attributes is called a lexical database. An LDB is an essential and central component of natural language processing applications [11, 17]. NLP tasks such as word sense disambiguation, machine translation and part-of-speech tagging (POST) require large repositories of information about the words of a language. An LDB should contain information usable directly by both humans and computers [9, 17]. The English WordNet is one such lexical resource, developed at Princeton University's Cognitive Science Laboratory. A complete and coherent LDB provides facilities such as:
- Updating and maintaining dictionaries
- Checking coherency within or across related dictionaries
- Exchanging and sharing information among such projects
- Automatic generation of printed versions of a dictionary
- A basis for electronic dictionaries on CD-ROMs
- Online consultation by scholars and common users
1.5 ROLE OF WORDNET / LEXICAL DATABASE IN NLP
Lexical resources, especially WordNets, have already proved their importance in NLP applications, e.g.:
Inter-lingual Translation Systems: EuroWordNet is a multilingual database of European languages based on the English WordNet developed at Princeton University;
Word Sense Disambiguation Systems: Word sense disambiguation is the process of resolving the ambiguity of word senses in a particular context and selecting the appropriate sense with the help of rich lexical resources such as WordNets. WSD is very important for text processing, speech processing and machine translation applications;
Semantic Similarity: Any natural language word may have more than one interpretation, called its senses. Similarity among the various senses of words can be calculated with the help of more general semantic classes, which requires the lexico-semantic information provided by a WordNet. For adjectives, WordNet includes a dedicated relation of semantic similarity.
Thus a WordNet (a lexical database), having enough information about the words of a language, can be used in various NLP applications. These lexical databases may be monolingual, like the English WordNet, or multilingual, like EuroWordNet, MultiWordNet and BalkaNet. An Urdu WordNet was developed at CRULP (Center for Research in Urdu Language Processing) using the expansion approach from the Hindi WordNet. To the best of our knowledge, there is no standard structure defined for lexical databases of Pakistani regional languages. We have therefore adopted our own database structure, which will be refined through further experimentation.
1.6 BASIC STORAGE ENTITY OF LDB
Lexical resources may be word based or morpheme based. According to the theory of the morpheme-based lexicon, only those morphemes that cannot be derived from others by rule should be stored in the lexicon, and complex (or compound) words are generated using morphological or syntactic rules; in word-based lexicons, by contrast, all simple and complex words are stored as autonomous lexical items. The next step is grouping these words: an LDB / WordNet groups words into sets of synonyms called synsets to support different artificial intelligence applications [22, 23].
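Synset grouping can be sketched as a mapping from a sense id to its member word forms, plus a reverse index from word form to sense ids, which is the usual lookup direction. The sense ids, member words and gloss below are illustrative, not taken from a real WordNet.

```python
# Synsets keyed by a hypothetical sense id, in the WordNet style.
synsets = {
    "happy.a.01": {
        "members": ["happy", "glad", "cheerful"],
        "gloss": "feeling or showing pleasure",
    },
}

# Reverse index: word form -> list of sense ids it participates in.
index = {}
for sid, data in synsets.items():
    for w in data["members"]:
        index.setdefault(w, []).append(sid)

print(index["glad"])  # ['happy.a.01']
```

A polysemous word would simply appear in several synsets, so its index entry would list several sense ids.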
1.7 STANDARD LEXICON MODELS
A great number of divergent formats and lexical structures have been proposed and adopted for creating lexical resources, because of the wide range of linguistic information involved and the differences among languages [15, 19]. Adopted formats include relational database format, plain-text files, MS Word format, etc. Data structures also vary from system to system: e.g. Comlex uses typed feature structures, Celex uses relational tables, and WordNet uses flat files (an unnormalized relational format).
The MULTILEX and GENELEX (GENEric LEXicon) systems were developed by the Expert Advisory Group on Language Engineering Standards (EAGLES). The OLIF, MILE, ISLE, LIFT, OWL and DICT formats are also commonly used. Another standard, which focuses particularly on encoding linguistic data in dictionaries for humans, is ISO 1951, called LEXml; it never became very popular. The ISO encoding standard for natural language processing and machine readable dictionaries is ISO 24613:2008 (LMF). In modeling lexicographic data, the underlying language structures are conceptualized as tree-like structures, which promotes the use of XML for representing linguistic data. From a software engineering point of view, UML modeling with serialization into an XML vocabulary is another option; the authors of LMF adopted this approach.
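The tree-like, XML-serialized view of a lexical entry can be sketched as follows. The element names loosely follow LMF's feature-structure style (`feat` elements with `att`/`val` pairs); this is an illustration, not a validated LMF document.

```python
import xml.etree.ElementTree as ET

# Build one entry as a small tree: entry -> lemma and sense nodes,
# each carrying attribute/value features.
entry = ET.Element("LexicalEntry")
lemma = ET.SubElement(entry, "Lemma")
ET.SubElement(lemma, "feat", att="writtenForm", val="farm")
sense = ET.SubElement(entry, "Sense")
ET.SubElement(sense, "feat", att="gloss",
              val="an area of land used for crops or animals")

print(ET.tostring(entry, encoding="unicode"))
```

The same tree could equally be drawn as a UML object diagram, which is why the two serializations are interchangeable in the LMF approach.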
1.8 PROPOSED LDB CREATION PROCESS
We use MySQL database tables at the backend to store the lexical and semantic information of our local and regional languages. The frontend has two views: one for the DB manager, who can store and update the DB, and the other for the end user, who can query a word. We store words, their forms with possible part-of-speech tags, senses, sets of synonyms (synsets) based on word sense, glosses, and example sentences showing the typical usage of an entry in the LDB. In addition, word forms are linked with the main word along with their own unique and distinguishable sense. For example, in Punjabi the word وٹ has many forms, such as متھے تے وٹ، کپڑے تے وٹ، ڈھڈ وچ وٹ (a crease on the forehead, a fold in cloth, a cramp in the stomach), each showing a different sense. At this time we are not storing the phonetic and phonemic features usually required by speech processing systems, and no morphological model is embedded in the construction of the current lexical database.
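A minimal sketch of the described table structure follows, using SQLite for illustration (the thesis uses MySQL). The table and column names are assumptions, since the actual schema is not given in the text; the inserted Punjabi example mirrors the one above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- words with their part-of-speech tags
CREATE TABLE word   (word_id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT);
-- each sense carries a gloss and an example sentence
CREATE TABLE sense  (sense_id INTEGER PRIMARY KEY, word_id INTEGER,
                     gloss TEXT, example TEXT,
                     FOREIGN KEY (word_id) REFERENCES word(word_id));
-- synsets group senses that share a concept
CREATE TABLE synset (synset_id INTEGER, sense_id INTEGER,
                     FOREIGN KEY (sense_id) REFERENCES sense(sense_id));
""")
con.execute("INSERT INTO word VALUES (1, 'وٹ', 'noun')")
con.execute("INSERT INTO sense VALUES (1, 1, 'a crease or fold', "
            "'کپڑے تے وٹ')")
row = con.execute("""SELECT w.lemma, s.gloss FROM word w
                     JOIN sense s ON s.word_id = w.word_id""").fetchone()
print(row)  # ('وٹ', 'a crease or fold')
```

The end-user query view described above would be a join of this kind, while the DB-manager view would issue the inserts and updates.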
This structure has been proposed and implemented with the basic aim of creating an online repository for language users and language learners and, in the future, of creating dictionaries and inter-lingual translation systems based on this online resource.