Teuken 7B Instruct
Multilingual, open source models for Europe – instruction-tuned and trained in all 24 EU languages
1. Key principles driving Teuken
1.1. Model training “Made in Europe”
To maintain Europe’s scientific and economic competitiveness now and in the future, we believe it is essential to develop large AI language models from the ground up. Therefore, we did not consider it sufficient to base the models developed in OpenGPT-X solely on “wrappers” of existing models and to limit the scope of our scientific research to models developed by third parties.
As discussed in the following sections, the main challenges in creating competitive European language models are the availability of computational resources and high-quality data.
We believe that collaboration is essential to overcome these challenges and to strengthen the European GenAI landscape. Therefore, OpenGPT-X invites researchers, developers and AI enthusiasts to join and contribute. To support this collaboration, we have set up a dedicated Discord server, providing a space for technical discussions, idea exchange, and direct interaction with the development team. In addition, resources such as research publications and the European LLM Leaderboard provide insight into the performance and technical specifications of our Teuken-7B models. We encourage continued community engagement and collaborative exploration as the project evolves.
1.2. Multilingual by design: embracing Europe’s linguistic diversity
A fundamental principle in the development of Teuken-7B-v0.4 was to ensure that it was multilingual by design, specifically considering the diverse linguistic landscape of Europe. By prioritizing the representation of non-English European languages, our goal was to create a model that stands apart from those developed in the US and China.
Focus on European languages
As we prepared for model training, we identified a gap in the world of large language models (LLMs): most were heavily weighted toward English or Chinese, with minimal inclusion of other languages. Even so-called multilingual models often contained as little as 5% non-English data. To support the European AI ecosystem, we made a conscious decision to emphasize all 24 official languages of the European Union (EU) in the development of Teuken-7B-v0.4. This led to the creation of a custom multilingual tokenizer.
Innovating with a multilingual tokenizer
Our first major technical breakthrough was the development of a multilingual tokenizer. English-centric tokenizers tend to fragment non-English documents, leading to inefficiencies in training and higher costs during inference. To address this issue, we created a tokenizer specifically optimized for all 24 EU languages. This innovation significantly reduced training costs compared to using an off-the-shelf monolingual tokenizer (see Section 2.3 for more on tokenizer efficiency). It also proved to be more effective than other multilingual tokenizers, such as those from Mistral and Llama3 (unpublished results). This tokenizer lays the foundation for further language-specific pre-training and fine-tuning, ensuring that both training and inference remain efficient across multiple languages.
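To make the fragmentation effect concrete, here is a minimal sketch that compares how an English-centric tokenizer and the Teuken tokenizer split the same German sentence. The Hugging Face repository ID for Teuken and the use of GPT-2 as the English-centric baseline are assumptions for illustration; consult the model cards for the exact loading options.

```python
# Minimal sketch: compare how an English-centric tokenizer and a multilingual
# tokenizer split the same German sentence. Repository IDs are illustrative
# assumptions; see the respective model cards for exact loading options.
from transformers import AutoTokenizer

sentence = "Die Donaudampfschifffahrtsgesellschaft wurde 1829 gegründet."

english_centric = AutoTokenizer.from_pretrained("gpt2")
multilingual = AutoTokenizer.from_pretrained(
    "openGPT-X/Teuken-7B-instruct-research-v0.4"  # may need use_fast=False / trust_remote_code=True
)

for name, tok in [("English-centric", english_centric), ("Multilingual", multilingual)]:
    tokens = tok.tokenize(sentence)
    print(f"{name}: {len(tokens)} tokens")
```

Fewer tokens for the same text translate directly into less work per training step and lower per-query cost at inference time.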
Training on >50% non-English data
Teuken-7B-v0.4 was trained on over 50% non-English data. As language is both a reflection of culture and a cornerstone of identity, this focus allows our model to better reflect Europe’s rich linguistic and cultural diversity. Our team collected data from across Europe to ensure robust representation of lesser-used languages. This commitment extended beyond training data: we created multilingual evaluation datasets for 21 EU languages, which are essential for accurately assessing model performance in languages where evaluation data is scarce (see Section 2.1 for more on the European LLM Leaderboard). This extensive process required significantly more computational resources, and more effort to automatically analyze the results, than evaluating in English alone would have.
Tackling the challenge of under-represented languages
The limited availability of training and evaluation data for less widely spoken languages presented a major challenge. A considerable amount of our time was spent building evaluation datasets to fill these gaps. Our decision to go beyond the “top five” European languages (English, German, French, Italian, Spanish) represents a meaningful contribution to the European AI community, allowing for greater inclusion of Europe’s linguistic diversity.
1.3. Data-driven innovation: research at the core
A defining feature of Teuken’s development is its research-focused, data-driven approach. In a rapidly evolving generative AI landscape, we continually adapted our methods based on experimental findings, underscoring our commitment not only to leveraging cutting-edge technology but also to actively contributing to its advancement.
Experimentation at the heart of every decision
At each stage of development – whether building, training, or evaluating the model – key decisions were driven by carefully framed research questions. Experimentation was fundamental to our approach. For example, a critical early research question was the impact of a multilingual tokenizer on model performance. Through experimentation, we found that using a multilingual tokenizer significantly improved performance in non-English languages. This finding led to a published paper detailing our approach and results. Another important decision we explored was whether to limit the model to a smaller subset of languages or to fully support all 24 official EU languages. Choosing the latter, we expanded our research scope to develop a fully multilingual model. This process of framing key decisions as research questions and adapting based on evidence ensured that Teuken-7B-v0.4 is built on a solid research foundation.
Leveraging scaling laws for smarter training
In addition to internal experimentation, we closely monitored advances in the wider AI community. Breakthroughs in scaling laws that emerged after the project was underway – notably from studies such as Hoffmann et al. 2022 and Touvron et al. 2023 – provided valuable insights into how to effectively allocate our limited computational resources. Applying these insights, we made a strategic choice: instead of training a massive model on a smaller dataset, we chose to train a smaller model on a larger dataset of 4 trillion tokens. While this is still considerably less than Llama3’s 15 trillion tokens, it is significantly larger than GPT-3’s 300 billion tokens, which was state of the art when we started. This approach allowed us to optimize for performance while balancing model size and dataset size without overextending our resources.
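As a rough illustration of this trade-off, the sketch below applies the common C ≈ 6·N·D approximation for training compute (N parameters, D training tokens) to the figures quoted above; the GPT-3 parameter count of 175B is public knowledge, and all numbers are back-of-the-envelope estimates rather than measured values.

```python
# Back-of-the-envelope training-compute estimate using the common
# approximation C ≈ 6 * N * D, where N is the parameter count and
# D the number of training tokens. Figures are those quoted in the text.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6 * params * tokens

teuken = training_flops(7e9, 4e12)      # 7B parameters, 4T tokens
gpt3   = training_flops(175e9, 300e9)   # 175B parameters, 300B tokens

print(f"Teuken-7B (4T tokens): ~{teuken:.2e} FLOPs")
print(f"GPT-3 (300B tokens):   ~{gpt3:.2e} FLOPs")
print(f"Teuken tokens per parameter: {4e12 / 7e9:.0f} (Chinchilla-optimal is roughly 20)")
```

The roughly 570 tokens per parameter show that Teuken was trained far beyond the Chinchilla-optimal ratio, trading extra training data for a smaller model that is cheaper to run and fine-tune.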
Building a comprehensive evaluation framework
Our data-driven approach extended beyond model development to include the creation of a comprehensive evaluation framework. This was essential for assessing performance and ensuring that Teuken-7B-instruct was both powerful and robust across all 24 languages. One of the notable results of these efforts is the European LLM Leaderboard – a new means for the AI community to compare the performance of multilingual models across European languages (see Section 2.1 for more details on the European LLM Leaderboard).
2. Teuken-7B-instruct: performance and evaluation
2.1. Benchmarks and European LLM Leaderboard: a new standard for multilingual LLMs
In the development of large AI language models, training and evaluation are inextricably linked. However, most of the commonly used evaluation benchmarks have traditionally focused on English, leaving gaps in assessing the performance of multilingual models. To fill this gap, we developed the European LLM Leaderboard – a ranking system that evaluates models in nearly all official EU languages. The Leaderboard allows direct comparison of models with up to 70 billion parameters, enabling a comprehensive assessment of multilingual performance beyond English-only datasets – a crucial requirement in Europe’s linguistically diverse landscape.
Explore the European LLM Leaderboard on Hugging Face: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard
To construct the Leaderboard, we translated a selection of well-known benchmarks into 20 EU languages using DeepL (with Irish, Maltese and Croatian still in progress). In addition to these translated benchmarks, we included two existing multilingual benchmarks – Belebele and FLORES-200.
Benchmarks featured in the European LLM Leaderboard:
- ARC – Tests elementary school-level science questions, evaluating reasoning and knowledge application.
- Belebele – Assesses multilingual reading comprehension across 122 language variants.
- FLORES-200 – Evaluates machine translation quality across 200 languages, including many low-resource languages.
- HellaSwag – Contains sentence completion tasks to evaluate commonsense reasoning and Natural Language Inference (NLI).
- MMLU – A diverse set of multiple-choice questions that test multitask language understanding across various subjects.
- GSM8K – Measures mathematical problem-solving abilities at the elementary school level.
- TruthfulQA – Evaluates the model’s ability to provide truthful answers and avoid false statements.
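As an illustration of how such benchmarks are typically run, the hedged sketch below uses the open-source EleutherAI lm-evaluation-harness with the English originals of three of the tasks listed above. The translated Leaderboard variants and the project’s own evaluation framework use their own task names, and the model ID and loading arguments here are assumptions.

```python
# Hedged sketch using the EleutherAI lm-evaluation-harness (v0.4+ API) with the
# English originals of three benchmarks listed above. The translated Leaderboard
# variants use different task names; model ID and arguments are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openGPT-X/Teuken-7B-instruct-research-v0.4,trust_remote_code=True",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2"],
    num_fewshot=0,
    batch_size=4,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```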
2.2. Performance metrics
Teuken-7B-instruct demonstrates competitive performance against leading multilingual language models in the 7-8 billion parameter range. This is evidenced by its scores on key benchmarks such as HellaSwag, ARC and TruthfulQA, where it performs head-to-head with models such as Meta-Llama-3.1-8B.
The following performance results are based on task accuracy averaged over the 21 EU languages included in the European LLM Leaderboard. The models included in the comparison are:
- Mistral-7B-Instruct-v0.3 (trained on 8 trillion tokens)
- Meta-Llama-3.1-8B-Instruct (trained on 15 trillion tokens)
- Salamandra-7b-instruct (trained on 7.8 trillion tokens)
- Occiglot-7b-eu5-Instruct (based on Mistral-7B-v0.1 trained on 8 trillion tokens with an additional 293 billion tokens of multilingual and code data)
- Pharia-1-LLM-7B-control-aligned (trained on 7.7 trillion tokens)
Averaged across 21 languages and over the “Europeanized” ARC, HellaSwag and TruthfulQA benchmarks combined, Teuken-7B-instruct outperforms all comparison models. On the individual benchmarks, it is second only to Salamandra-7b-instruct on ARC and HellaSwag, and second only to Mistral-7B-Instruct-v0.3 on TruthfulQA.
Despite instruction-tuning, large language models may still generate content that is inappropriate, offensive, or harmful. Our bias and toxicity evaluations show that Teuken-7B-instruct-research-v0.4 ranks in the middle of the field compared to other models, indicating potential for improvement on these benchmarks.
Teuken-7B-instruct’s ability to maintain consistent performance across both widely spoken and less commonly represented EU languages is underscored by its low standard deviation in task accuracy across languages on each benchmark:
On ARC and HellaSwag, Teuken-7B-instruct-research-v0.4 is second only to Salamandra-7b-instruct in terms of standard deviation. On TruthfulQA, all models achieve comparably low standard deviations, with Meta-Llama-3.1-8B achieving the lowest, closely followed by Teuken-7B-instruct-research-v0.4. Averaged over all three tasks, Salamandra-7b-instruct has the lowest standard deviation, closely followed by Teuken-7B-instruct-research-v0.4.
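The aggregation behind these consistency figures is simple: per benchmark, average the task accuracy over languages and take the standard deviation across languages. The sketch below illustrates this with made-up placeholder values, not actual Leaderboard scores.

```python
# Minimal sketch of the aggregation described above: average task accuracy
# over languages plus its standard deviation as a consistency measure.
# Accuracy values are made-up placeholders, not Leaderboard numbers.
from statistics import mean, stdev

accuracy_by_language = {
    "de": 0.61, "fr": 0.60, "pl": 0.58, "fi": 0.57, "mt": 0.52,  # placeholders
}

scores = list(accuracy_by_language.values())
print(f"mean accuracy: {mean(scores):.3f}")
print(f"std deviation: {stdev(scores):.3f}  # lower = more consistent across languages")
```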
For a detailed breakdown of Teuken-7B-instruct’s performance across all tested languages and benchmarks, see the European LLM Leaderboard.
Improved tokenizer: reducing fragmentation, enhancing multilingual performance
Teuken-7B-instruct is released with a custom-built multilingual tokenizer, specifically designed to optimize model performance across European languages. The role of the tokenizer is to break down words into smaller units, called tokens, allowing the language model to process text. A key measure of tokenizer efficiency is fertility – the average number of tokens into which a word is split. Lower fertility means fewer tokens, which reduces computational demands (measured in GigaFLOPS) and lowers usage costs, as costs are directly related to the number of input and output tokens.
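As a minimal sketch, fertility can be measured as shown below, assuming the tokenizer is available under the Hugging Face ID used here (an assumption; see the model card for loading options) and using a few illustrative sentences instead of a proper per-language corpus.

```python
# Minimal sketch of computing tokenizer fertility (average tokens per word)
# on a small multilingual sample. Sentences are illustrative; a real
# measurement would use a held-out corpus per language.
from transformers import AutoTokenizer

samples = {
    "en": "The weather in Brussels is unusually warm today.",
    "de": "Das Wetter in Brüssel ist heute ungewöhnlich warm.",
    "fi": "Sää Brysselissä on tänään epätavallisen lämmin.",
}

# Assumed repository ID; replace with the tokenizer you want to measure.
tok = AutoTokenizer.from_pretrained("openGPT-X/Teuken-7B-instruct-research-v0.4")

for lang, text in samples.items():
    n_tokens = len(tok.tokenize(text))
    n_words = len(text.split())
    print(f"{lang}: fertility = {n_tokens / n_words:.2f} tokens/word")
```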
Our research revealed that tokenizing non-English European languages generally requires more computational power than tokenizing English text. For multilingual training data, such as the datasets used in OpenGPT-X, multilingual tokenizers like the one developed for Teuken-7B are more efficient, reducing the extra compute required to process non-English text. This efficiency can be illustrated by comparing the additional compute other models’ tokenizers require to process non-English text against the baseline compute needed for English text (using Llama3 as the reference point).
The improved tokenizer used in Teuken reduces computational requirements and both training and inference costs across all 24 supported languages. This efficiency is particularly beneficial for European languages with relatively long words, such as German, Finnish and Hungarian, where the tokenizer allows longer queries to be processed without exceeding Teuken’s context length, making it a more adaptable and cost-effective solution for real-world multilingual applications.
3. Building Teuken
3.1. Overcoming technical obstacles: engineering Teuken for scale
Building large multilingual models like Teuken-7B-v0.4 presented numerous technical challenges. From scaling training infrastructure to processing massive datasets, overcoming these obstacles was key to shaping the model’s capabilities and performance.
Scaling the model
The size of Teuken-7B-instruct required rapid scaling of both infrastructure and expertise. Training a model with billions of parameters on up to 512 NVIDIA A100 GPUs was a significant undertaking given the limited precedent for such tasks in Germany at the start of the project. We leveraged compute resources from two of Germany’s leading High Performance Computing (HPC) centers: the JUWELS system operated by the Jülich Supercomputing Center (JSC) at Forschungszentrum Jülich (hardware overview) and the HPC systems operated by the Center for Information Services and High Performance Computing (ZIH) at TU Dresden (hardware overview).
To optimize the efficiency of the available infrastructure, we conducted numerous experiments and ablation studies, ensuring that the model could be trained at scale without sacrificing performance. These efforts are documented in several publications [1, 2, 3]. Because training was parallelized across multiple dimensions (data, tensor, and pipeline parallelism), we also ran benchmark tests to find the parallelism layout that maximizes efficiency on the JUWELS Booster system.
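For illustration, the sketch below enumerates the kind of 3D-parallelism layouts (tensor × pipeline × data parallelism) such a benchmark would compare for a 512-GPU budget. It is not the project’s benchmarking code, and the layer count is an assumed value.

```python
# Illustrative enumeration of 3D-parallelism layouts for a fixed GPU budget.
# A sketch of the search space benchmarked on JUWELS Booster, not the
# project's actual benchmarking code; the constraints are simplified.
N_GPUS = 512
N_LAYERS = 32          # assumed transformer depth for a ~7B model

layouts = []
for tp in (1, 2, 4, 8):                 # tensor parallelism (within a node)
    for pp in (1, 2, 4, 8, 16):         # pipeline parallelism
        if N_GPUS % (tp * pp) == 0 and N_LAYERS % pp == 0:
            dp = N_GPUS // (tp * pp)    # remaining GPUs go to data parallelism
            layouts.append((tp, pp, dp))

for tp, pp, dp in layouts:
    print(f"TP={tp:2d}  PP={pp:2d}  DP={dp:3d}  (TP*PP*DP = {tp*pp*dp})")
```

In practice the best layout also depends on interconnect topology and micro-batch size, which is why empirical benchmarking on the target system was necessary.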
Selecting the right training framework
Another critical challenge was selecting and mastering the optimal training framework. Initially, we chose Megatron-DeepSpeed, but as the project progressed, we found that Megatron-LM – with its ongoing development – was a better choice, especially for handling large-scale distributed training. The complexity of these frameworks, especially given their active development, meant that building expertise was critical. Mastering these tools allowed us to take full advantage of the HPC infrastructure and run large-scale training efficiently.
Data processing and deduplication
Handling vast amounts of multilingual data presented its own challenges. With limited prior experience in processing datasets of this scale, data deduplication – the process of removing redundant or repeated data – became a priority. Effective deduplication is crucial for minimizing memorization of text extracts and ensuring that training is performed on diverse, high-quality data, thereby improving the overall performance of the model and reducing unnecessary computational costs.
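A minimal sketch of exact document-level deduplication is shown below; production pipelines typically also apply near-duplicate detection (for example MinHash-based), which is omitted here.

```python
# Minimal sketch of document-level exact deduplication via content hashing.
# Real pipelines usually add near-duplicate detection (e.g. MinHash/LSH);
# this only removes documents that are identical after light normalization.
import hashlib
from collections.abc import Iterable, Iterator

def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    seen: set[str] = set()
    for doc in docs:
        normalized = " ".join(doc.split()).lower()   # collapse whitespace, lowercase
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = ["Hello  world.", "hello world.", "Bonjour le monde."]
print(list(deduplicate(corpus)))   # -> ['Hello  world.', 'Bonjour le monde.']
```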
3.2. Optimizing with constraints: strategic decisions in a resource-limited environment
A consistent challenge throughout the development of Teuken-7B-v0.4 was working with limited computational resources compared to other large-scale LLM initiatives. This required us to make pragmatic and strategic decisions about the model’s size, dataset composition, and training processes to maximize efficiency and make the best use of our available resources.
Navigating limited compute resources
The computational power available for training Teuken-7B-base was a fraction of that available for efforts such as Meta’s Llama models. Specifically, we had just 18% of the compute used to train Meta’s Llama3 8B model, and less than 1% of the resources used for the Llama3 405B model. Nevertheless, we optimized our approach by focusing on efficiency at every stage – from training to data processing – to ensure that we balanced model size and dataset scale in a computationally optimal way.
To maximize resource efficiency while managing risk, we conducted medium-scale ablation experiments, carefully adjusting individual variables to observe their effect relative to a baseline. Our experiments examined a range of factors in the training process, such as variations in model architecture, optimizers, and learning rates, with a primary focus on computational efficiency – how each method impacted the overall computational demand. For example, if a method doubled the performance of our model but required four times as much compute, it was not considered useful. To avoid misleading comparisons, we normalized our ablation runs by training time, so that the time cost of each adjustment was reflected directly in the analysis.
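The cost-normalized reasoning above can be written as a toy calculation; the numbers and the break-even threshold of 1.0 are purely illustrative.

```python
# Toy illustration of the cost-normalized comparison described above: a change
# is only attractive if its quality gain outweighs its extra compute.
def gain_per_compute(quality_factor: float, compute_factor: float) -> float:
    """Relative quality improvement per unit of additional compute."""
    return quality_factor / compute_factor

print(gain_per_compute(2.0, 4.0))  # 0.5  -> doubles quality but at 4x compute: rejected
print(gain_per_compute(1.2, 1.1))  # ~1.09 -> modest gain at little extra compute: kept
```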
Strategic decisions driven by resource constraints
Our resource limitations led to key decisions that shaped the final model. One of the most significant was to limit Teuken-7B-base to 7 billion parameters. This decision allowed us to stay within computational limits, while making the model easier for researchers and developers in the community to use by requiring fewer GPUs for inference, fine-tuning, and experimentation. We also focused on optimizing the data processing pipeline and maximizing the efficiency of each training cycle.
3.3. Crafting a technology stack: contributing to AI development in Europe
We expect one of the lasting contributions of OpenGPT-X to be not only the Teuken model itself but the comprehensive technology stack developed around it. This stack empowers future researchers and developers to efficiently train, fine-tune and evaluate large language models, laying the groundwork for the next generation of AI innovation in Europe.
Data pre-processing and filtering: ensuring high-quality inputs
At the core of this technology stack is a robust data pipeline designed to ensure that only high-quality data is used during training. We developed a pre-processing system that filtered out over 90% of the data, preserving the integrity and quality of the model’s training set, which includes large amounts of multilingual content. This pipeline was especially critical for handling noisy web data, where ensuring relevance and quality is essential. No annotated pre-training data was purchased.
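The sketch below shows the kind of heuristic filters commonly applied to noisy web data (language identification, length, character composition, repetition). The specific rules and thresholds are assumptions for illustration, not the actual OpenGPT-X filter set.

```python
# Illustrative heuristic quality filters of the kind used on noisy web data.
# Thresholds and rules are assumptions for this sketch, not the actual
# OpenGPT-X pipeline configuration.

def passes_quality_filters(text: str, expected_lang: str, detected_lang: str) -> bool:
    words = text.split()
    if detected_lang != expected_lang:        # language-identification filter
        return False
    if len(words) < 50:                       # too short to be useful
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.7:                     # mostly symbols, markup, or numbers
        return False
    if len(set(words)) / len(words) < 0.3:    # heavily repetitive boilerplate
        return False
    return True

print(passes_quality_filters("word " * 60, "en", "en"))  # False: highly repetitive
```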
Optimized implementation
While the Megatron-LM code base provided a solid foundation, we aimed to incorporate advanced methods that research already supported but that were not yet standard. For example, we implemented state-of-the-art, highly optimized distributed neural network layers.
Building custom frameworks for training and evaluation
A significant achievement enabled by insights gained in OpenGPT-X was the development by the Lamarr Institute of Modalities, a fully open source framework for end-to-end training of large language models. Modalities provides a complete workflow for data tokenization, data packing, training, and evaluation of foundation models. First presented at the Lamarr Conference in September 2024, this framework enables future AI projects to develop models with greater control and efficiency.
Alongside the training framework, we created a comprehensive evaluation framework that includes multilingual evaluation datasets, essential for assessing Teuken-7B-v0.4’s performance across a wide range of European languages. This evaluation framework addresses a significant gap in the LLM landscape, where most models are evaluated primarily in English. Integrated with the European LLM Leaderboard, this framework provides a benchmark for multilingual models across Europe (see Section 2.1 for more on the European LLM Leaderboard).
Long-term impact: a future-proof technology stack
The technology stack developed in OpenGPT-X represents a robust infrastructure that we expect to drive future AI research and development. While today’s models may soon be surpassed by technological advances, the expertise and tools established during Teuken’s development provide a foundation that supports future innovations such as continued pre-training, language-specific model forking, and specialized fine-tuning. This infrastructure enables progress in both multilingual and monolingual AI research.
4. Find out more and connect with the developers
We invite researchers, developers, and AI enthusiasts to dive deeper into the development of Teuken-7B-v0.4 and actively engage with the team behind it. Whether you have technical questions, innovative ideas, or insights to share, there are multiple ways to connect and contribute to the growing community.
Join the conversation on Discord
Our dedicated OpenGPT-X Discord server provides a collaborative space for technical discussions, idea exchange, and direct interaction with the developers working on Teuken-7B-v0.4. Whether you’re curious about the model’s architecture or want to discuss multilingual LLM development, this is the ideal platform to engage.
Join us on Discord: https://discord.com/invite/RvdHpGMvB3
Explore research and benchmarks
As a research-driven initiative, the development of the Teuken-7B-v0.4 models is grounded in academic research and technical experimentation. The research contributions of key project partners – including Fraunhofer IAIS/IIS, DFKI, Forschungszentrum Jülich, and TU Dresden – provide a comprehensive perspective on the technical approaches and methodologies that shaped Teuken-7B.
Explore the full list of research publications by OpenGPT-X partners here: https://opengpt-x.de/en/news-en/
See how Teuken-7B performs on multilingual benchmarks alongside other European LLMs by visiting the European LLM Leaderboard: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard
Stay up to date
Stay informed about Teuken-7B’s ongoing development, future updates, and opportunities for collaboration by following the project’s website and LinkedIn page. We look forward to engaging with you.