The task introduces the Levenshtein distance, a measure that is useful in tasks such as approximate string matching.
Task objectives (8 points):
- Use the FIQA-PL corpus, including the documents, the questions and the relations between them.
- Use SpaCy tokenizer API to tokenize the text in the documents.
- Compute a frequency list for each of the processed files.
- Aggregate the results to obtain one global frequency list. This list gives you unigram statistics of the words appearing in the corpus (see the tokenization and frequency sketch after this list).
- Apply a distortion function to the queries part of the corpus. In each query, randomly draw one word and remove, add or change one letter in it (in the 2024/2025 edition only changing one letter was required); see the distortion sketch after this list.
- Compute nDCG@10 for the distorted queries, using the same approach as in lab 2. This result will be the baseline for the other methods.
- Install Morfeusz (the Python binding) and use it to find all words from the queries that do not appear in its dictionary. Only these words should be corrected in the next step (see the dictionary-check sketch after this list). This step is not obligatory: if there is a problem installing Morfeusz, you can use one of the raw text files provided with Morfeusz ("Słownik Polimorf" or "Słownik SGJP") and treat the first column as the list of all valid forms in Polish.
- Use the Levenshtein distance and the frequency list to determine the most probable correction of the query words identified as invalid. (Note: you do not have to apply the distance directly; any method that is more efficient than scanning the whole dictionary will be appreciated.)
- Compute nDCG@10 for your implementation of the spelling correction method.
- Use ElasticSearch's fuzzy match and compute nDCG@10 for this approach.
- Compare the baseline results with the two implemented methods, taking into account both the nDCG score and the runtime performance of the methods.
- Use an LLM of your choice (you can use Bielik) to fix the first 30 queries from the distorted set and compare the results manually with the Levenshtein-based method.
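
A minimal sketch of the tokenization and frequency-list steps is given below. It assumes the FIQA-PL documents have already been downloaded to a local `corpus.jsonl` file with a `text` field (the file name and field name are assumptions about your setup) and uses only the tokenizer of a blank Polish spaCy pipeline.

```python
import json
from collections import Counter

import spacy

nlp = spacy.blank("pl")  # blank Polish pipeline: tokenizer only, no tagger or parser

def doc_frequencies(text: str) -> Counter:
    """Frequency list for a single document (lower-cased alphabetic tokens)."""
    return Counter(tok.text.lower() for tok in nlp(text) if tok.is_alpha)

global_freq = Counter()  # aggregated, corpus-wide unigram statistics
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        global_freq.update(doc_frequencies(doc["text"]))

print(global_freq.most_common(10))
```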
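One possible distortion function is sketched below: it draws one word of the query uniformly at random and applies one of the three edits. The alphabet used for insertions and substitutions and the fixed random seed are assumptions.

```python
import random
import string

ALPHABET = string.ascii_lowercase + "ąćęłńóśźż"  # letters used for insertions/substitutions

def distort_query(query: str, rng: random.Random) -> str:
    """Randomly remove, add, or change one letter in one word of the query."""
    words = query.split()
    i = rng.randrange(len(words))
    word = words[i]
    pos = rng.randrange(len(word))
    op = rng.choice(["remove", "add", "change"])
    if op == "remove" and len(word) > 1:
        word = word[:pos] + word[pos + 1:]
    elif op == "add":
        word = word[:pos] + rng.choice(ALPHABET) + word[pos:]
    else:  # change one letter (also the fallback for one-letter words)
        word = word[:pos] + rng.choice(ALPHABET) + word[pos + 1:]
    words[i] = word
    return " ".join(words)

rng = random.Random(42)  # fixed seed so the distorted query set is reproducible
print(distort_query("jak obliczyć podatek dochodowy", rng))
```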
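A sketch of the dictionary check, assuming the `morfeusz2` Python binding, whose analyser labels unknown words with the `ign` tag (verify this against the documentation of the binding version you install).

```python
import morfeusz2

morf = morfeusz2.Morfeusz()

def is_known(word: str) -> bool:
    """True if Morfeusz has at least one analysis other than 'ign' (unknown)."""
    analyses = morf.analyse(word)
    return any(interp[2] != "ign" for _, _, interp in analyses)

query = "jak obliczyc podatek dochodwy"
to_correct = [w for w in query.split() if not is_known(w)]  # only these get corrected
print(to_correct)
```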
Draw conclusions regarding (2 points):
- the distribution of words in the corpus,
- the performance of your method compared to ElasticSearch,
- the results provided by your method compared to ElasticSearch,
- the validity of the obtained corrections,
- the ability of an LLM to fix invalid queries.
- Levenshtein distance (edit distance) is a measure defined for any pair of strings. It is the minimal number of single-character edits (insertions, deletions or substitutions) needed to transform one string into the other. The measure is symmetric.
- The algorithm is usually implemented with dynamic programming; see the Wikipedia article for details (a reference implementation is sketched after this list).
- The distance may be used to fix an invalid word by inspecting candidate words in increasing order of distance: the words n edits away from the invalid word are examined first; if there are none, the words n+1 edits away are inspected.
- The frequency list may be used to select the most popular word at a given distance if there are many candidate corrections.
- Usually the correction algorithm does not use the edit distance directly, since that would require comparing the invalid word with every word in the dictionary. Practical algorithms work the other way around: they generate candidate words that are 1 or 2 edits away from the invalid word (cf. P. Norvig's article for details; see the candidate-generation sketch below). A different approach is to use a Levenshtein automaton to find the corrections efficiently.
- ElasticSearch has a fuzziness parameter for finding approximate matches of a query (see the fuzzy query example below).
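
Below is a standard two-row dynamic-programming implementation of the distance, for reference.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion of ca
                            curr[j - 1] + 1,      # insertion of cb
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```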
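A sketch of the candidate-generation approach following Norvig's idea: instead of comparing the invalid word against the whole dictionary, generate every string one edit away, keep those present in the frequency list, and pick the most frequent one. Here `global_freq` is the aggregated frequency list from the earlier step, and the Polish alphabet constant is an assumption.

```python
ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"

def edits1(word: str) -> set[str]:
    """All strings exactly one insertion, deletion or substitution away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    inserts = {left + c + right for left, right in splits for c in ALPHABET}
    replaces = {left + c + right[1:] for left, right in splits if right for c in ALPHABET}
    return deletes | inserts | replaces

def correct(word: str, freq) -> str:
    """Most frequent known word within 1 edit, falling back to 2 edits."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    if not candidates:
        candidates = [w for e in edits1(word) for w in edits1(e) if w in freq]
    return max(candidates, key=lambda w: freq[w], default=word)

# e.g. correct("podatkek", global_freq) would be expected to return "podatek"
```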
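One way to issue such a fuzzy query from Python is sketched below, assuming the 8.x `elasticsearch` client, a local cluster, and that the corpus was indexed under the index `fiqa-pl` with the document text in the `text` field (the index name, field name and address are assumptions about how you indexed the corpus in the earlier labs).

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="fiqa-pl",
    size=10,  # top-10 hits, matching the nDCG@10 cut-off
    query={
        "match": {
            "text": {
                "query": "jak obliczyc podatek dochodwy",  # a distorted query
                "fuzziness": "AUTO",  # allow matches within a small edit distance
            }
        }
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```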