
speech_nlp_ntua

Lab exercise at NTUA (2018) for the 7th-semester course Speech and Natural Language Processing.

Greeklish to Greek converter

Greek is a demanding language to write: the writer has to keep track of many details, such as spelling and accentuation. Hence, it is common nowadays for Greek speakers not to write in Greek while texting, but to use a different scheme called Greeklish.

  • What is Greeklish?
    Greeklish is a style of writing in which the Greek alphabet is replaced with the Latin one. This way, the writer no longer needs to worry about spelling or accent marks, so typing becomes faster.

    Examples:
    ΚΑΛΗΜΕΡΑ → KALIMERA
    ΘΑΛΑΣΣΑ → THALASA or 8ALASSA
    ΓΕΙΑ ΣΟΥ → GIA SU

    Notice that the same Greek word can be romanised in more than one way; the small sketch below illustrates why the reverse direction is ambiguous.
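A minimal sketch of this ambiguity in Python; the variant table below is illustrative, not the project's actual mapping:

```python
# Hypothetical illustration: each Greek symbol may surface as several
# Greeklish spellings, so mapping Greeklish back to Greek is ambiguous.
GREEK_TO_GREEKLISH = {
    "θ": ["th", "8"],
    "η": ["i", "h"],
    "ι": ["i"],
    "υ": ["y", "u", "i"],
    "ου": ["ou", "u"],
}

# Reversing the table exposes the ambiguity: "i" alone could stand
# for η, ι, or υ, which is exactly what the converter must resolve.
greeklish_to_greek = {}
for greek, variants in GREEK_TO_GREEKLISH.items():
    for v in variants:
        greeklish_to_greek.setdefault(v, []).append(greek)

print(greeklish_to_greek["i"])  # ['η', 'ι', 'υ']
```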

In this project we had to create a converter that takes Greeklish text as input and transliterates it into Greek. To complete this task we built a pipeline of Finite State Machines (FSMs), in which every FSM plays a different role. The approach was as follows:

  • 1st Step: Training the FSM
    The matching between the Latin and the Greek alphabet is not one-to-one. For instance, the letter Θ may appear as 8 or as TH, and conversely a single Latin spelling can correspond to several Greek letters. To distinguish between the options, the FSM carries a weight on each edge so that it can choose the cheapest path. The weights were estimated from a training text: each edge was assigned the negative logarithm of the relative frequency of the corresponding match between the Greeklish and the Greek text. The logarithm turns products of probabilities into sums of costs, which simplifies the calculations (see the first sketch after this list).

  • 2nd Step: Architecture of the Converter
    The final converter is a serial connection (composition) of the transliterator and an orthographer. First, each word is fed into the FSM and all possible paths are collected along with their weights, so we obtain a set of candidate translations scored by how likely each one is. These candidates are then passed through the orthographer, an FSM built from a dictionary of Greek words. The candidate that yields the smallest total cost is chosen and returned as the output (see the second sketch after this list).
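A minimal sketch of the weight estimation in the 1st step, assuming character-level alignments between the Greeklish training text and its Greek reference are already available (the aligned pairs below are made up):

```python
import math
from collections import Counter

# Hypothetical aligned (greeklish, greek) pairs extracted from a
# training text; the real project derives these from its corpus.
aligned_pairs = [
    ("th", "θ"), ("th", "θ"), ("8", "θ"),
    ("i", "ι"), ("i", "ι"), ("i", "η"),
]

pair_counts = Counter(aligned_pairs)
source_totals = Counter(src for src, _ in aligned_pairs)

# Edge weight = -log P(greek | greeklish): frequent matches get low
# costs, so the cheapest path through the FSM is the most likely one,
# and the log turns products of probabilities into sums of costs.
edge_weights = {
    (src, dst): -math.log(count / source_totals[src])
    for (src, dst), count in pair_counts.items()
}

print(round(edge_weights[("i", "ι")], 2))  # 0.41 (i usually means ι)
print(round(edge_weights[("i", "η")], 2))  # 1.1  (i means η less often)
```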
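And a minimal sketch of the 2nd step, replacing the OpenFst composition with plain-Python dynamic programming; the rule costs and the one-word lexicon are hypothetical:

```python
# Hypothetical weighted rules: greeklish substring -> (greek, cost),
# with cost = -log(relative frequency) as in the training sketch above.
RULES = {
    "k": [("κ", 0.1)],
    "a": [("α", 0.1)],
    "l": [("λ", 0.1)],
    "i": [("ι", 0.7), ("η", 1.1), ("υ", 1.6)],
    "m": [("μ", 0.1)],
    "e": [("ε", 0.2)],
    "r": [("ρ", 0.1)],
}

def candidates(word, rules):
    """Enumerate (greek, cost) readings of a Greeklish word by dynamic
    programming over every segmentation, mimicking all FSM paths."""
    partial = {0: [("", 0.0)]}  # position in word -> partial outputs
    for i in range(len(word)):
        for src, outputs in rules.items():
            if i in partial and word.startswith(src, i):
                bucket = partial.setdefault(i + len(src), [])
                for prefix, cost in partial[i]:
                    for greek, w in outputs:
                        bucket.append((prefix + greek, cost + w))
    return sorted(partial.get(len(word), []), key=lambda c: c[1])

def convert(word, rules, lexicon):
    """Pick the cheapest candidate found in the lexicon (the
    orthographer); fall back to the cheapest candidate overall."""
    cands = candidates(word, rules)
    in_lexicon = [c for c in cands if c[0] in lexicon]
    return (in_lexicon or cands)[0][0] if cands else word

# Unaccented toy lexicon standing in for the Greek dictionary FSM.
print(convert("kalimera", RULES, {"καλημερα"}))  # καλημερα
```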

Implementation:
→ Python3
→ OpenFst
→ bash

Further improvements:
The model could be improved by using N-grams, so that each letter is translated taking into account the translations of the N previous letters. N-grams could also be applied between words, making the translation sensitive to context. Other machine learning techniques worth examining are Markov Models, LSTMs, and bag-of-words (BoW) techniques. A bigram sketch of the first idea follows.
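For example, a bigram (N = 2) version of the weight estimation could condition each match on the previously emitted Greek symbol; the aligned triples below are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical aligned triples (previous greek, greeklish, greek);
# "<s>" marks the start of a word.
aligned = [
    ("<s>", "k", "κ"), ("κ", "a", "α"), ("α", "l", "λ"),
    ("λ", "i", "η"), ("<s>", "i", "ι"), ("λ", "i", "ι"),
]

triple_counts = Counter(aligned)
context_counts = Counter((prev, src) for prev, src, _ in aligned)

# cost = -log P(greek | greeklish, previous greek): the letter i is
# now translated with statistics specific to what came before it.
bigram_cost = {
    (prev, src, dst): -math.log(n / context_counts[(prev, src)])
    for (prev, src, dst), n in triple_counts.items()
}

print(round(bigram_cost[("λ", "i", "η")], 2))  # 0.69
```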

Contributors:

  • alexkaf
  • manzar96
