Workbook 3
Limitations and risks of large language models
Read about how the output of large language models can be problematic, about the sustainability concerns they raise, and about how these issues can be mitigated
3.1 Introduction
In 2017, a Palestinian worker posted a picture of himself leaning against a bulldozer together with the caption “good morning”. Facebook’s LM-driven translation service, however, falsely translated it as “attack them”. This triggered a response by Israeli police and led to the man’s arrest (Guardian). He was released a few hours later, and Facebook apologised. While no greater harm was caused in this case, it is not hard to conceive of a much more violent outcome given the notoriously tense situation in the region.
This was one of the first examples of machine-learning-based NLP technology failing in a way that could have caused irreversible harm to humans, and language models have evolved significantly since then.
We have seen the advent and rapid improvement of systems which are capable of a) performing “well” on a wide range of problems for which an “understanding” of natural language is deemed necessary, and b) producing output that is increasingly indistinguishable from human output. While their emergent capabilities are undeniably impressive and promise a wide range of applications, the potential changes to how we share knowledge and communicate are fundamental. Given the speed of current developments, identifying and addressing the arising risks is therefore a pressing task.
So far, most (though by no means all) of what is learned by LLMs is acquired from natural language utterances (social media, prose of all kinds and eras, poetry, customer support interactions, websites, etc.) produced by humans, each situated in their respective societal role and intent of engagement. Moreover, the world views expressed in those texts and encoded by the LM cannot be expected to adequately reflect the heterogeneity of views held in society; instead, they will be biased. Accounting for this inherent bias in the training data has not been the focus of most actors in the extremely dynamic research and development environment, which is heavily dominated by big technology companies (as opposed to public-interest academia). At the same time, training corpora have become so large that effective “curation”, as called for by Bender and Gebru, 2021, can be seen as an as yet intractable problem that requires further research into effective tooling. It must therefore be expected that LLMs trained on contemporary large corpora like “The Pile” (880 GB, Gao et al., 2020) or C4 (2.3 TB, Raffel et al., 2020) exhibit views on race, gender, religion, sexual orientation, etc. which may not be tolerable and can be considered toxic.
This matters all the more given the ever-increasing quality of the linguistic form of LLM outputs: when processing linguistic form, humans have been shown to infer meaning, grounded in a common human perception of reality, as long as the form looks plausible enough. To what extent LMs are inherently able to capture a notion of meaning that corresponds to the human one is subject to ongoing research and debate. It is certain, however, that by “parroting” toxic views of the world, LLMs can reinforce those views at a societal level, creating a kind of toxic feedback loop (Weidinger et al., 2021).
When considering the development process of an LM-driven application, it is important to stress that, in contrast to traditional rule-based software, it is inherently impossible to guarantee desired output (or an upper bound on undesired output). The solution to a problem learned by a neural network cannot be “debugged” and “fixed” in a way comparable to a traditional software development cycle. This fundamental shift in the paradigm of developing software-driven solutions (also with regard to “agility”) should always be part of the assessment of whether LMs are in fact the right tool for the job: while they are indeed powerful, they are hard to control.
3.2 What are the limitations or risks of large language models in terms of their output?
Large language models are becoming increasingly common in chatbots, including popular services such as Apple’s Siri, Amazon’s Alexa and Google’s Assistant. However, there have been several instances where AI chatbots have exhibited unintended and harmful behaviour, leading to them being shut down. In 2016, for example, Microsoft’s chatbot Tay was taken offline after it began using racist language on Twitter just 16 hours after its release (see image gallery below). Similarly, the Korean chatbot Luda was shut down in January 2021, just one month after its release, due to its vulgar language. As the use of LLMs in chatbots increases, it is therefore important to be aware of the potential risks and limitations of the content they generate, and to take steps to mitigate them.
Bias
A bias is an inclination or prejudice for or against something or someone, which may be acquired or innate. Different types of bias can affect the training and use of large language models. In particular, statistical bias can occur during data collection and lead to unbalanced or misleading results. LLMs rely on large amounts of data, and bias occurs when models are trained on data that is not sufficiently diverse.
As a case study to illustrate the importance of being aware of such biases in the training data of LLMs, OpenAI’s image generator DALL·E 2, when given the prompt “a happy family”, generated images exclusively of heterosexual couples with at least one child, with same-sex parents or childless couples conspicuously absent (see figure below). Awareness of such biases is all the more important as these models become more prevalent in everyday interactions, being used in chatbots, voice assistants or for content generation in news and social media.
[Figure: DALL·E 2 images generated for the prompt “A happy family”]
Despite awareness of the problems of bias, the development of effective tools for its mitigation is still an open challenge. Current datasets such as “The Pile” at 880GB or C4 at 2.3TB contain unfiltered text scraped from the internet, which may contain offensive, if not outright illegal, content on issues such as race, gender, religion or sexual orientation. However, the size of these corpora for LLM training has grown to such an extent that effective “curation” is seen as an intractable problem. There is therefore a need for further research and development of effective tools for dealing with bias in large language models.
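To give a feel for why such curation tooling is hard, here is a deliberately naive sketch (in Python) of a blocklist filter over a corpus shard. The blocklist entries and the documents are placeholders invented for illustration; real curation pipelines rely on trained classifiers, deduplication and human review rather than keyword matching.

```python
# Deliberately naive curation sketch: drop documents containing blocklisted terms.
# Placeholder data only; keyword matching both over-filters (e.g. medical or support text)
# and under-filters (misspellings, coded language), which is one reason curation stays hard.
BLOCKLIST = {"badword1", "badword2"}  # placeholder terms

def keep_document(doc: str) -> bool:
    tokens = {token.strip(".,!?").lower() for token in doc.split()}
    return not (tokens & BLOCKLIST)

corpus_shard = [
    "A perfectly ordinary sentence about the weather.",
    "An offensive sentence containing badword1 somewhere.",
]
cleaned = [doc for doc in corpus_shard if keep_document(doc)]
print(f"kept {len(cleaned)} of {len(corpus_shard)} documents")
```

Even this trivial filter has to touch every document, and at terabyte scale the hard part is not the matching but deciding what should be on the list in the first place.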
Toxic content generation
Toxic language has the power to instigate hatred, violence, or offense. There is, however, a difference between human beings using toxic language and an LM generating toxic content: when humans use it, there is usually an intent behind it, while an LM has none. This can make things even worse. As the examples mentioned before, the Korean chatbot Luda and Microsoft’s Tay, show, a model that simply learns from what people say, without any moral or ethical perspective, can go wrong in ways that are hard to anticipate. An LM that performs worse for some social groups than for others can also hurt underprivileged groups, for instance when such models serve as the foundation for technologies that have an impact on those groups.
One extreme example is GPT-4chan, created by Yannic Kilcher, which can no longer be accessed and has been called the worst AI ever. According to its creator: “The model was good, in a terrible sense … It perfectly encapsulated the mix of offensiveness, nihilism, trolling, and deep distrust of any information.” GPT-4chan attracted a lot of attention and discussion in the academic AI community. In the end, Stanford professor Percy Liang called for a condemnation of the deployment of GPT-4chan, which gathered signatures from 360 researchers and professors from top universities, because they believed Yannic Kilcher’s deployment of GPT-4chan was a clear example of irresponsible practice. Before development of GPT-4chan was stopped, it had generated and deceptively posted over 30,000 posts.
Similar to the bias problems, these dangers are largely attributable to the use of training corpora in which certain social identities are overrepresented and which contain offensive language. With GPT-4chan in mind, whoever trains a new model should never undermine the responsible practice of AI science.
(Mis)information hazards
Misinformation is incorrect or misleading information. It is a general problem throughout the modern world, and especially relevant when we think about using LLMs in real-world applications. A lot of the information offered by LMs may be inaccurate or deceptive, with the potential to worsen user ignorance and erode confidence in shared knowledge. It becomes very hard to tell what is true, especially amid all the misleading or fake news.
Other examples of misinformation, such as poor legal or medical advice, can be extremely dangerous in such sensitive fields. Users who receive inaccurate or incomplete information may also be persuaded to do things that are immoral or illegal which they otherwise would not have done.
For example, here is evidence that you cannot always trust even one of the best LLMs. Below you can see what happens when you ask ChatGPT simple math questions, even though ChatGPT otherwise shows very impressive results in generating code and chatting with humans about a wide range of topics, from history and everyday cooking to politics.
When you ask ChatGPT itself why it is so bad at math, it explains that it does not have access to most mathematical functions or to a calculator, which is a reasonable answer. More fundamentally, the procedures by which LMs learn to represent language contribute to the dangers of misinformation, because the underlying statistical approaches are ill-suited to distinguishing factually accurate from false information.
What could become really dangerous is young students starting to use ChatGPT or other LLM applications and trusting them to always be smart and tell the truth; teachers would then very likely have a hard time doing their job.
ChatGPT seems to be very careful not to give misinformation: it always reminds you that it is only a machine, and when it talks about something dangerous, such as breaking windows to get into a house, it will also tell you that “these actions are illegal and dangerous in real life”. Unlike ChatGPT, which was created for chatting, a less famous LLM from Meta called Galactica was described by the media as the ‘most dangerous thing Meta’s ever made’. Why? Because of a fundamental problem with Galactica: Meta promoted it as a shortcut for researchers and students, but it is not able to distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text.
3.3 Sustainability risks
Over the last four years, the size of state-of-the-art language models has doubled every 3-4 months. With this rapid growth of LLMs, it is crucial to consider how large language models may affect the environment as they become more commonplace, especially in a time of climate crisis when carbon emissions must be drastically cut.
One major research question in recent years has been whether the large compute budgets invested in training these models are justified. To answer it, an in-depth evaluation of the footprint of large models is crucial. We collected a range of research on this question and try to give you an overview from the following perspectives: energy cost, computational needs, and the resulting carbon footprint. Finally, we offer some recommendations for keeping LLM training sustainable, which we will also try to take into account in the OpenGPT-X project.
Energy and Computational needs
Training a new LLM does not only mean feeding in data and spending time; there is also a large energy cost behind it, perhaps larger than you would expect. And the energy cost of training an LM is never a single factor to measure: it depends on what kind of computing resources you train on, how often you train, and what powers all the machines behind those computational resources.
Let us start with the computational power required and used in training. The most recent language models contain many billions or perhaps trillions of weights. The widely used model GPT-3 has 175 billion machine learning parameters. It was trained on NVIDIA V100 GPUs, but researchers estimate that training it on NVIDIA A100s would have required 1,024 GPUs, 34 days, and $4.6 million. Although the amount of energy used has not been made public, GPT-3 is thought to have consumed 936 MWh. The Pathways Language Model (PaLM), which has 540 billion parameters, was recently announced by Google AI. The need for servers to process these models grows rapidly as they become bigger and bigger in order to handle more complex tasks.
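To show roughly where such MWh figures come from, here is a minimal back-of-envelope sketch in Python. The per-GPU power draw and the PUE value are assumptions chosen for illustration, not figures reported by OpenAI or NVIDIA; the sketch also counts only the GPUs, not the rest of the servers (CPUs, memory, networking), which is one reason it lands below the ~936 MWh estimate quoted above.

```python
def training_energy_mwh(num_gpus: int, days: float, gpu_power_kw: float, pue: float) -> float:
    """Very rough training-energy estimate: devices x power x time, scaled by data-center overhead (PUE)."""
    gpu_kwh = num_gpus * gpu_power_kw * days * 24   # energy drawn by the accelerators alone
    return gpu_kwh * pue / 1000                     # facility-level energy, converted to MWh

# Assumed values: 1,024 A100s for 34 days, ~0.4 kW per GPU under load, PUE of 1.1.
# Counting only the GPUs, this already gives a few hundred MWh:
print(f"{training_energy_mwh(1024, 34, 0.4, 1.1):.0f} MWh")  # ~368 MWh
```

The point is less the exact number than how quickly GPU count, runtime and power draw multiply into grid-scale energy consumption.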
Next, we can take a quick look at the general energy cost behind these computing needs when training NLP models. The choice of energy source not only has a major impact on cost, but also matters for our next and final discussion of the carbon footprint. To get an impression, a 2019 study compared the relative energy sources of China, Germany, and the United States with those of the top three cloud service providers (see Table 1 of that study). This comparison provides a reasonable estimate of CO2 emissions per kilowatt-hour of compute energy used, because the energy breakdown in the United States is comparable to that of the most popular cloud compute service, Amazon Web Services.
On the other hand, we may question whether this computing power is really being used efficiently. AI and NLP researchers often rely on HPC data centers managed by cloud computing providers or, where available, by their own institutions. The efficiency of a data center varies through the day as well as through the year. A common metric used across the data-center community to measure this efficiency is Power Usage Effectiveness (PUE), the ratio of total facility energy to the energy actually delivered to the computing equipment. To get a sense of the compute involved: according to So et al. (2019), their base model needs 10 hours to train for 300k steps on one TPUv2 core, and their whole architecture search comprised a total of 979M training steps, equivalent to 32,623 TPU hours or 274,120 hours on 8 P100 GPUs. According to Peters et al. (2018), ELMo was trained for two weeks (336 hours) on three NVIDIA GTX 1080 GPUs. The BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours), according to Devlin et al. (2019). NVIDIA claims that a BERT model can be trained using 4 DGX-2H servers with a total of 64 Tesla V100 GPUs in under 3.3 days (79.2 hours) (Forster et al., 2019). By 2019, a large model like GPT-2, described in Radford et al. (2019), had 1542M parameters and was reported to require 1 week (168 hours) of training on 32 TPUv3 chips.
The estimation of the necessary resources is based on empirical data from the creation of big, existing AI models like GPT-3. A multilingual model’s data pre-processing takes roughly 5,000 to 10,000 CPU cores or 150-300 CPU servers (for example, for data-cleaning steps such as HTML splitting). GPT-3 training required 355 GPU years. The LEAM project estimates an order of magnitude of roughly 460 specialized AI servers needed not only to catch up with the state of the art but also to advance beyond it and to ensure flexibility for experimentation and real innovation in the AI field. Approximately 10 TB of storage must also be budgeted for each AI model.
Carbon footprint
Can you imagine how much carbon is produced by training a single AI model just once? In 2019, an answer was provided by a research paper from the University of Massachusetts Amherst: training a single AI model can produce as much carbon dioxide as five cars do over their entire lifetimes. Bear in mind that this covered only one training run, and with the model sizes of 2019. Energy consumption increases significantly as models grow (as mentioned in Workbooks 1 and 2), and a further factor is how frequently a model is retrained. Many large organizations, which have the capacity to train tens of thousands of models each day, are taking the problem seriously. A recent article by Meta is a good illustration of one such business investigating the environmental impact of AI, researching solutions, and issuing calls to action.
Factors influencing the carbon footprint of large models (combined into a rough estimate in the sketch after this list):
- Model Size: The larger the number of operations, the more energy is needed to train the model
- Hardware Characteristics: The amount of time needed to complete the work will depend on the throughput that the hardware can handle. Throughput per Watt will increase as hardware becomes more efficient.
- Data Center Efficiency: Besides powering the computers, the energy used also cools the data center and meets other electrical needs. Waste heat from data centers can in turn be reused for collective water heating, driving down the PUE (Power Usage Effectiveness).
- Electricity Mix: An important consideration is the mix of energy sources used to power a data center, which is mostly determined by its location. The carbon emissions per kWh of power depend on this electricity mix. The average carbon emission per kilowatt-hour of electricity produced today is 475 gCO2e/kWh, while a rising number of cloud providers’ data centers power their hardware with only renewable or nuclear energy. Using Google Cloud as an example, their Montreal facility reports 27 gCO2e/kWh, roughly 20 times lower than the global average.
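These factors can be combined into a rough estimate of training emissions. The sketch below is a simplified illustration rather than an established accounting methodology; the assumed per-GPU power draw and PUE are invented example values, while the GPU-hours and grid intensities are the figures quoted above.

```python
def training_co2_kg(gpu_hours: float, gpu_power_kw: float, pue: float, grid_gco2e_per_kwh: float) -> float:
    """Rough CO2e estimate: accelerator energy scaled by data-center overhead (PUE) and grid carbon intensity."""
    facility_kwh = gpu_hours * gpu_power_kw * pue
    return facility_kwh * grid_gco2e_per_kwh / 1000  # grams CO2e -> kilograms

# Example: the BERT run quoted above (64 V100 GPUs for 79.2 hours), with an assumed
# ~0.3 kW per GPU and a PUE of 1.1, placed on two different electricity mixes.
gpu_hours = 64 * 79.2
for label, intensity in [("global average grid (475 gCO2e/kWh)", 475),
                         ("Google Cloud Montreal (27 gCO2e/kWh)", 27)]:
    print(f"{label}: {training_co2_kg(gpu_hours, 0.3, 1.1, intensity):,.0f} kg CO2e")
```

The same training run differs by more than an order of magnitude in emissions depending only on where it is placed, which is why the electricity mix deserves as much attention as the model itself.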
Recommendations
After comparing numerous studies and experiments, we compiled some ideas for how future projects can reduce their carbon impact.
Modeling & Engineering
Efficient Training
How can we keep model quality uncompromised while reducing the number of model parameters as much as possible? This could be the key to efficient training. Additionally, parameter compression together with model speed optimization is important to make LLMs viable for business demands.
Efficient Inference
Making the model leaner for inference is interesting since inference costs can quickly surpass training expenses. Quantization, which reduces numerical precision at inference time and thereby speeds up inference (Yang et al., Nature Sustainability), has not yet been widely used with large models. A promising approach is distillation, which involves training a smaller model on the outputs of a bigger one; distillation has already been proven for Transformers in the context of vision (Touvron et al., 2021).
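To make the distillation idea concrete, here is a minimal sketch of the standard distillation loss (soft targets from a teacher blended with hard labels). It is a generic textbook formulation written in PyTorch, not the specific recipe used by Touvron et al. (2021).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (match the teacher's softened distribution) with ordinary cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a batch of 4 examples and 10 classes
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The smaller student pays the teacher's inference cost only once, during training, and is then cheaper to serve for every subsequent request.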
Efficient Implementations
Distributed training implementations must be as efficient as possible in order to amortize the significant idle hardware consumption (on MeluXina, known as the European Union’s greenest supercomputer, for example, idle power is around 150 W per GPU when accounting for CPU cores, infrastructure, etc.). To obtain the highest achievable throughput, this includes taking into account fine-grained, architecture-dependent effects such as wave and tile quantization.
Hardware
Data Center Choice
Based on recommendations from xxx, a data center with a PUE of 1.1 will, on a global scale, use 39% less energy than a facility with a PUE of 1.8 (facility energy scales with PUE: 1 − 1.1/1.8 ≈ 39%). Platforms with a low PUE should therefore be favored.
Local Carbon Intensity
The ultimate footprint is strongly affected by the carbon intensity of the electricity mix. The footprint of a project can therefore be significantly reduced by placing training in a region with a clean mix. On cloud computing platforms, which offer a wide choice of regions, this is very simple to achieve.
Efficient Inference
The footprint of model utilization can be reduced by carefully selecting an appropriate AI accelerator for managed inference workloads.
Other Practices
Minimizing Exogenous Impact
Surprisingly, exogenous impacts also play an important role in the sustainability of training new LLMs. Based on research related to Noor, an ongoing project aiming to develop the largest multi-task Arabic language models, although the main training runs were found to dominate the final footprint, the overseas flights conducted throughout the collaboration had a considerable impact (20% of the final footprint). It is crucial to reduce such high-intensity cost centers.
Costs Reporting and Offset
The full cost of model development is rarely reported. We definitely need more transparency and awareness of this topic during the rise of LLMs in the AI community.
3.4 Conclusion
Contemporary LLMs leverage world knowledge and linguistic structure in a way that makes their output oftentimes indistinguishable from human output in both relevance and coherence. Independently of whether current methods can ever approach the representation of “meaning” as it exists in humans, LMs have proven able to handle very complex linguistic structure and relations between things that exist in the world. This makes them undoubtedly very powerful tools, suited to tasks that only a short time ago seemed too complex to be solved by software alone.
OpenAI’s recent release of ChatGPT has pushed the standard further: it is able to answer questions and handle the implications of human conversation (conversation history, identifying intent, etc.) in a previously unseen way, across a very broad range of (even specialized) topics. Unfortunately, neither the data methodology nor other model details have been open-sourced to this day, which would be a crucial enabler for research-community-driven advancements in risk mitigation. Perhaps the most complex task solved with the help of LLMs today is ranking in the top 10% in the strategy game “Diplomacy”, where language must be used to convince and deceive others. Cicero, developed and open-sourced by Meta, is an impressive example of an engineered solution to a highly complex problem using an ensemble of interacting ML and rule-based systems.
While we have not yet seen widespread adoption in user-facing products (machine translation and Google Search notwithstanding), the risks arising from unmitigated biases in the data, from (ab)use of LLMs as knowledge sources, and from humans being misled about the identity of their interaction partner cannot be ignored.
When deploying an LLM, it is therefore advisable to embed it in use cases where supervision, and therefore liability, inherently stays with the human (as in writing tools). At the same time, we advocate frameworks that enforce maximal transparency towards the engaging user about the nature of the system they are interfacing with, in order to counter the dangers of eroding trust and misinformation at scale.