Skip to content

Instantly share code, notes, and snippets.

@planetis-m
Last active September 8, 2024 00:46
Show Gist options
  • Save planetis-m/eb4de8a5b70afff65056ed5f94aba8eb to your computer and use it in GitHub Desktop.
Save planetis-m/eb4de8a5b70afff65056ed5f94aba8eb to your computer and use it in GitHub Desktop.

COMP-498: Final Year Project I Proposal

Project Title: Using LLMs to Automate KDE Translations

Mission Statement

Translating FOSS software projects is often a thankless job, with fewer and fewer active contributors. Currently, the KDE-el translation team is dissolved. The steady decline in approved translations, as evident from the statistics, will likely lead to the Greek translation being excluded from the KDE release in the near future. Since this decision will be based on the percentage of translated messages according to the guidelines outlined here: https://l10n.kde.org/docs/translation-howto/translation-howto.docbook, there is a pressing need for action to prevent this decline.

Fortunately, with the advent of numerous open LLMs that excel at translation tasks, we may be able to reverse this trend.

Project Phases

The project is divided into four phases:

  1. Compile a Glossary of Terms Related to KDE Software
  2. Develop Comprehensive Rules or Instructions for Translating UI Messages to Greek
  3. Use an LLM to Translate All Remaining UI Messages
  4. Submit the Generated Translations for Review

Details of Each Phase

  1. Compile a Glossary of Terms

    Utilize various sources, such as:

    Additionally, by using classical machine learning algorithms such as WordNetLemmatizer, we can extract terms from the KDE source repository. These terms should take precedence over those from other sources to ensure consistency with existing translations.

  2. Develop Comprehensive Rules or Instructions for Translation

    Instructions guide LLMs toward producing better output. We should create an optimized prompt based on rules from sources such as:

    as well as our own rules. Such as:

    - Χρησιμοποιoύμε την υποτακτική έναντι της προστακτικής (πχ. "Να κλείσει το αρχείο" αντί "Κλείσε το αρχείο").
    - Προτιμούμε τα ουσιαστικά έναντι των ρημάτων σε μενού/κουμπία. (πχ. "Επεξεργασία επαφής" αντί "Επεξεργάσου την επαφή").
    - Χρησιμοποιoύμε την παθητική φωνή αντί για την ενεργητική, ή τη γραφή σε α' πρόσωπο (πχ. "Δε βρέθηκε το αρχείο", αντί "Δε βρήκα το αρχείο").
    - Αλλάζουμε την σειρά του ρήματος/υποκειμένου σε μηνύματα σφάλματος (πχ. "Δε βρέθηκε το αρχείο στον απομακρυσμένο διακομιστή" αντί "Το αρχείο δε βρέθηκε στον απομακρυσμένο διακομιστή").
    - Δεν ξεχνάμε να χρησιμοποιούμε τα απαραίτητα άρθρα (ο/η/το). Εκτός από τα μενού/κουμπιά, ώστε να εξοικονομούμε χώρο στο UI.
    
  3. Use an LLM to Translate the Messages

    All the previous steps will contribute to the quality of the output produced in this phase. The compiled list of rules will be used as a system message, and the glossary will be augmented with the prompts using a RAG framework.

    LLMs to consider for the project that have shown strong performance in English to Greek translation include:

    We can incorporate these models from Hugging Face into our project.

    The translated messages will be saved in .po files using the Python library polib, with the status set as "fuzzy" or "needs review."

  4. Submit the Generated Translations for Review

    The final step involves coordinating with the current project maintainer (if still available) to discuss the possibility of merging the changes upstream. This requires someone with a KDE developer account.

    Then, a human translator must review each AI-generated UI message and approve it using KDE's translation app.

    This means the translation will be included in the next KDE release!

    This step is not part of the project and will take some time to complete.

Forseen Challenges

  1. Choosing the Appropriate Prompt Language

    Meltemi-7B is designed for Greek, so prompts can be in Greek. However, for other LLMs like Mistral-Nemo, the Greek translation guidelines may need to be translated into English to ensure proper understanding by the model.

  2. Grouping Messages for Prompts

    Deciding how to group translation messages is important for context and accuracy. The metadata in the .po files, such as source code references (e.g., "lib/welcomeview/welcomeview.ui:242") and extracted comments, can help identify related messages. More details on this can be found here.

  3. Running the Models

    Running inference for large-scale translation tasks requires significant computational resources. While Hugging Face and Google Colab offer free-tier options, these are limited in terms of compute power and model size. It may be necessary to leverage cloud providers (Google Cloud, AWS, or Azure) that offer GPU instances that could be used to run the models, although this costs. Alternatively, university resources could be used if they are available.

Existing Work

  1. AI Glossary Generators

    Several projects, like the GPT Store's Glossary Generator, use LLMs for generating glossaries. However, the results are often unreliable and limited in scope.

  2. DeepL Translator

    DeepL claims to outperform Google Translate and ChatGPT-4 but is proprietary. While it offers some customization, its glossary feature is unavailable for Greek, and options are limited to choosing between formal and informal language.

Implementation Hurdles

  1. Glossary Generator
    • The gender of adjectives depends on and is determined by the gender of the nouns they modify. However, the process of lemmatization removes the endings. Therefore, multigrams should be checked and, if necessary, this information should be re-added.

By undertaking this project, I will have the opportunity to research topics in machine learning, such as RAG and transformers, while making a meaningful contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment