speech_nlp_ntua

Lab exercise in NTUA (2018) for the course Speech and Natural Language Processing 7th semester

Greeklish to Greek converter

Greek is a complicated language in a way that the writer has to take many parameters into account while writing, like the spelling, punctuation, etc. Hence, it is a trend nowadays for greek people not to write greek while texting, but to use a different approach which is called Greeklish.

What is Greeklish?
Greeklish is a type of writing, in which the greek alphabet is replaced with the latin. In this way, the writer does not need to be concerned about the spelling and the punctuation and as a result the writing can be accelerated.

Examples:
ΚΑΛΗΜΕΡΑ → KALIMERA
ΘΑΛΑΣΣΑ → THALASA or 8ALASSA
ΓΕΙΑ ΣΟΥ → GIA SU

In this project we had to create a converter that would take Greeklish texts as inputs and translate them to Greek texts. In order to complete this task we created a pipeline of Finite State Machines (FSMs), in which every FSM had a different role. The approach followed was:

1st Step: Training the FSM
The matching of letters from the latin alphabet to the greek one is an injective function. For instance, the letter Θ can be found in a form of 8 or TH. Hence, there can be more than one possible options for a match from latin to greek. In order to distinguish each case, the FSM should have weights on its edges so it can choose the most economic path. The weights were calculated from a training text and the values they got was the logarithm of the frequency of appearence of each matching between the greeklish and the greek text. The purpose of the logarithm was to simplify the calculations.
2nd Step: Architecture of the Converter
The final converter is created with a serial connection of the translator and an orthographer. At first, each word is inserted in the FSM and all the possible paths are selected along with the respective weights. Now, we have some possible translations for the word and we know how possible each translation could be valid. Then, those translations are inserted in the orthograpger, which is an FSM created from a dictionary of greek words. The word that generates the smaller cost is finally chosen and given as an output.

Implementation:
→ Python3
→ OpenFSM
→ bash

Further improvements:
The model may be improved if N-grams are used so each letter translation is done taking into account the translation of the N previous letters. Those N-grams could also be applied between words, so the translation would be sensitive to context. Other machine learning techniques worth of examination is the use of Markov Models, LSTMs and BoG techniques.

Contributors:

alexkaf
manzar96

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
converter		converter
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

speech_nlp_ntua

Greeklish to Greek converter

About

Releases

Packages

Languages

License

manzar96/speech_nlp_ntua

Folders and files

Latest commit

History

Repository files navigation

speech_nlp_ntua

Greeklish to Greek converter

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages