August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
With advanced generative AI models like the Generative Pre-trained Transformer 3 (GPT-3) producing human-like responses to user queries, AI is progressing toward generative tools that create realistic content, including text, video, images, and audio.
And with open source becoming the norm, many AI models are publicly available for research and experimentation. Below, we explore some of the most recent open-source generative AI models that demonstrate the ever-expanding applications of AI.
Within the text-generation domain, Large Language Model Meta AI (LLaMa 2) is a revolutionary technology that surpasses OpenAI's GPT-3 in safety and response quality. The family consists of models with 7, 13, 34, and 70 billion parameters (Meta has publicly released all but the 34-billion-parameter version). Although these parameter counts are smaller than the more recent GPT-4, rumored to be a mixture of eight roughly 220-billion-parameter expert models, LLaMa 2 was trained on a much larger dataset than its predecessor (roughly 2 trillion tokens) and offers a 4,096-token (roughly 3,000-word) context window.
And like other Large Language Models (LLMs), LLaMa 2’s most significant use case is superior chatbots that can provide relevant answers to different user prompts. Enterprises can download it directly onto their servers to build customer-centric applications that help visitors engage with businesses more effectively.
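For teams that want to try this, here is a minimal sketch of serving a LLaMa 2 chat model with the Hugging Face Transformers library. It assumes you have been granted access to the gated meta-llama checkpoints on the Hugging Face Hub and have a GPU that fits the 7B variant; the prompt and generation settings are illustrative, not prescriptive.

```python
from transformers import pipeline

# Load the 7B chat-tuned variant; device_map="auto" (requires the accelerate
# package) spreads the weights across available GPUs and CPU memory.
chatbot = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
)

# Llama-2-chat models expect the [INST] ... [/INST] instruction format.
prompt = "[INST] How can I track my order on your website? [/INST]"
result = chatbot(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```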
But the real game-changer is how LLaMa 2 manages the trade-off between safety and helpfulness. Traditional chatbots maximize helpfulness by attempting to answer almost any question, even dangerous ones such as requests for instructions on how to harm someone.
LLaMa 2 changes the generative AI landscape by incorporating two reward models to control its responses. One model rewards LLaMa 2 for being helpful, while the other rewards it for being safe. This is part of the Reinforcement Learning from Human Feedback (RLHF) approach, where the reward models stand in for humans assessing the quality of LLaMa 2's responses; in effect, the model learns to maximize the reward and thereby improve its output.
During fine-tuning, if a prompt is flagged as potentially unsafe, the safety reward model scores the response; for other prompts, the helpfulness reward model does. As such, LLaMa 2's architecture is revolutionary and paves the way for AI to interact more safely with the real world.
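To make the two-reward-model idea concrete, here is a purely illustrative sketch of a piecewise reward for RLHF fine-tuning. The function, its arguments, and the 0.15 threshold are stand-ins loosely modeled on the scheme described in the LLaMa 2 paper, not Meta's actual code.

```python
def combined_reward(safety_score: float,
                    helpfulness_score: float,
                    prompt_flagged_unsafe: bool,
                    safety_threshold: float = 0.15) -> float:
    """Pick which reward model's score drives the RLHF update.

    Responses to prompts flagged as unsafe, or responses the safety
    reward model scores poorly, are optimized for safety; everything
    else is optimized for helpfulness.
    """
    if prompt_flagged_unsafe or safety_score < safety_threshold:
        return safety_score
    return helpfulness_score
```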
Yet another innovation in the text-generation space is the BigScience Large Open-science Open-access Multilingual Language Model (BLOOM), a multilingual language model from the BigScience workshop coordinated by Hugging Face that can also tackle mathematical and programming problems.
With 176 billion parameters, BLOOM supports 46 natural languages and 13 programming languages. However, running BLOOM on a local machine is slow and memory-intensive due to its sheer size.
Its architecture is similar to GPT-3's: a decoder-only transformer that predicts the next token using a stack of 70 transformer blocks. Each block contains an attention layer and a multi-layer perceptron, and the model consumes its input in the form of word embeddings.
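Because the full 176-billion-parameter checkpoint is impractical to run locally, the sketch below uses bloom-560m, a small sibling from the same family, to show next-token prediction in action; the prompt is arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# bloom-560m shares BLOOM's architecture at a laptop-friendly scale.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# The tokenizer maps text to token IDs (embedded internally as word
# embeddings); generate() then predicts one token at a time.
inputs = tokenizer("Translate to French: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```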
This generative AI model has several use cases. It can solve arithmetic problems, translate between languages, generate code, and produce general-purpose content per user requirements. Users can also conveniently deploy it in production through the Hugging Face Accelerate library, which simplifies distributed training and inference.
As one of the only open-source models with more than 100 billion parameters, BLOOM extends AI's boundaries, providing accurate and relevant responses within an easy-to-implement framework. And being open source, it can be fine-tuned through the Hugging Face Transformers library, expanding its applications to fields such as education, eCommerce, and research.
MosaicML recently launched its MosaicML Pretrained Transformer (MPT) 30B language model, which outperforms several other LLMs, such as GPT-3, StableLM-7B, and LLaMA-7B, on standard benchmarks. It's an open-source, decoder-only transformer model that improves upon the previous version, MPT-7B.
As the name suggests, the model consists of 30 billion parameters, with a context window of 8,000 tokens. This means it can process long input sequences and generate appropriate responses.
It also uses the Attention with Linear Biases (ALiBi) technique, enabling the model to handle sequences even longer than 8,000 tokens at inference time. This feature makes MPT-30B highly valuable in the legal domain, where experts may use it to analyze long contracts written in complex legal diction.
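MosaicML's model cards show how ALiBi makes the context length a load-time setting rather than a hard architectural limit. The sketch below follows that pattern; the 16,384-token value is an arbitrary example, and the 30B checkpoint needs multiple GPUs (swap in mosaicml/mpt-7b to experiment on smaller hardware).

```python
import transformers

name = "mosaicml/mpt-30b"

# MPT ships custom modeling code, hence trust_remote_code=True.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 16384  # ALiBi lets us extrapolate past the 8k training window

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
)
```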
In addition, MPT-30B-Instruct is a purpose-built variant fine-tuned to follow user instructions given as input prompts. It is applicable wherever users want the model to precisely follow a set of instructions.
In contrast, MPT-30B-Chat is a conversational variant that generates relevant, human-like responses. It also performs well at generating code in several programming languages, reportedly beating other code generators, such as StarCoder-GPTeacher, on the HumanEval benchmark. However, MPT-30B-Chat is licensed for non-commercial use only.
The defining development of the MPT-30B model is that it's reportedly the first LLM to use NVIDIA's H100 GPUs for part of its training, increasing training throughput by 2.44 times per GPU.
With the increasing popularity of AI text generators, several text-to-image models are also emerging, with advanced architectures producing realistic visuals.
DALL-E by OpenAI is a text-to-image model that grew out of the GPT-3 line, with 12 billion parameters. The company released the original version on January 5, 2021, and followed with DALL-E 2, which became publicly available on September 28, 2022, offering better speed and image resolution.
While DALL-E 2's weights are not publicly available, DALL-E mini, an independent open-source recreation now known as Craiyon, generates simple images from textual prompts read through a bidirectional encoder.
The generative AI model is built on a transformer neural network that uses the attention mechanism, which lets the network weigh the most significant parts of a given sequence.
Attention helps the model produce more accurate results and draw better connections between abstract elements, yielding unique images.
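For readers who want to see the mechanism itself, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside these transformers; the toy matrices stand in for learned query, key, and value projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores every key; a softmax turns the scores into
    weights, and the output is a weighted sum of the values, so the
    model focuses on the most relevant parts of the sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```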
Craiyon and the more advanced DALL-E versions are invaluable in the fashion industry, where companies showcase many outfits and products. With image-generating technology, they can produce relevant photos without hiring models, photographers, and other professional staff.
With the ability to create entirely new images of animals, humans, nature, and fantastical creatures, the DALL-E line of image generators can interpret abstract textual descriptions and produce multiple variations by combining distinct concepts, extending human creativity to new levels with generative AI.
Boasting faster generation and more realistic images, Stable Diffusion is a more sophisticated generative AI model that turns textual prompts into high-quality visual art.
As the name suggests, Stable Diffusion creates images with a diffusion model, which has two elements: forward diffusion and reverse diffusion. In forward diffusion, the model gradually adds random noise to an image; in reverse diffusion, it learns to remove that noise to recover the original image.
The architecture features a noise predictor (a U-Net) that takes a text prompt and random noise in latent space, a low-dimensional space holding compressed representations of images.
The model then subtracts the predicted noise from the latent image and repeats this step many times. Finally, a variational autoencoder (VAE) decodes the latent image into an actual picture.
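The loop below is a toy sketch of that reverse-diffusion process; the tensor shapes mirror Stable Diffusion's, but the noise predictor is a trivial stand-in for the real U-Net, and the update rule is deliberately simplified.

```python
import torch

text_embedding = torch.randn(1, 77, 768)   # shape of a CLIP text encoding
latents = torch.randn(1, 4, 64, 64)        # random noise in latent space

def noise_predictor(latents, text_embedding, step):
    # Placeholder for the U-Net that estimates the noise in `latents`.
    return 0.02 * latents

for step in reversed(range(50)):           # repeat the denoising step
    predicted_noise = noise_predictor(latents, text_embedding, step)
    latents = latents - predicted_noise    # subtract the predicted noise

# In the real pipeline, the VAE decoder now turns `latents` into pixels.
```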
While DALL-E 2 also relies on a diffusion model, it is slower than Stable Diffusion. Stable Diffusion is also open source and lets users tweak several options through Stability AI's DreamStudio app, providing more control over how an image is generated.
For example, you can increase the number of denoising steps, provide different seeds, and control the prompt strength.
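Those same knobs are exposed programmatically in Hugging Face's diffusers library; the sketch below shows one common setup, with the checkpoint name, prompt, and settings chosen purely for illustration and a CUDA GPU assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a portrait photo of an astronaut riding a horse",
    num_inference_steps=50,   # number of noise-subtraction steps
    guidance_scale=7.5,       # prompt strength
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed
).images[0]
image.save("astronaut.png")
```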
Stable Diffusion suits users who want images grounded in the real world. It is perfect for generating photographs, portraits, 3D-style images, and more.
Try this full code tutorial using Stable Diffusion generative AI from Comet.
Although still in its early stages, generative AI's capabilities are extending into the audio domain, with technologies such as OpenAI's Jukebox and Stability AI's Harmonai for music generation. But the most recent advancement is AudioCraft by Meta: an open-source framework for text-to-music and text-to-audio generation.
AudioCraft can take textual prompts, such as “rock music with electronic sounds,” and generate high-fidelity soundtracks without background noise. This is an impressive leap in generative AI, as previous models typically required audio inputs and produced short, often low-quality clips.
The technology comprises three models: MusicGen, AudioGen, and EnCodec. MusicGen is an autoregressive transformer model that creates music clips from textual prompts. AudioGen, in contrast, generates environmental sounds (such as a dog barking, a child crying, or whistling) from textual prompts.
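Meta distributes these models through its open-source audiocraft Python library; the sketch below follows the usage shown in the project's README, with the small MusicGen checkpoint and an 8-second clip chosen just for illustration.

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)        # generate 8 seconds of audio

descriptions = ["rock music with electronic sounds"]
wav = model.generate(descriptions)             # one waveform per prompt

for idx, one_wav in enumerate(wav):
    # Save a loudness-normalized .wav file for each generated clip.
    audio_write(f"clip_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```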
But the foundation of it all is Meta's audio compression codec, EnCodec: a neural network that learns discrete audio tokens (similar to word tokens in large language models), effectively creating a vocabulary for music.
The audio tokens then feed into the autoregressive language models, which generate new tokens. Finally, EnCodec decodes those tokens, mapping them back onto the audio space to produce realistic musical clips.
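The Transformers library exposes EnCodec directly, which makes the token round trip easy to see. The sketch below compresses a synthetic one-second tone into discrete audio tokens and decodes it back; the checkpoint is the 24 kHz model Meta published on the Hugging Face Hub.

```python
import math
import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

# One second of a 440 Hz sine wave stands in for real audio.
sample_rate = 24_000
t = torch.arange(sample_rate) / sample_rate
audio = torch.sin(2 * math.pi * 440 * t)

inputs = processor(raw_audio=audio.numpy(), sampling_rate=sample_rate,
                   return_tensors="pt")

# encode() yields the discrete audio tokens; decode() maps them back.
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
print(encoded.audio_codes.shape)  # the "vocabulary" entries for this clip
decoded_audio = model.decode(encoded.audio_codes, encoded.audio_scales,
                             inputs["padding_mask"])[0]
```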
With EnCodec, generative AI can finally analyze music containing long audio sequences at different frequencies. For perspective, a song spanning a couple of minutes contains millions of audio samples, whereas the text sequences used to train LLMs consist of mere thousands of tokens.
AudioCraft can help artists and other creative professionals conveniently generate unique soundtracks for videos, podcasts, and other media, letting them experiment with different melodies and speed up the production process.
With generative AI entering the scene, the technology landscape is changing rapidly, allowing businesses to find more cost-effective ways to run their operations. It also means organizations must adopt AI quickly to stay ahead of the competition.
And the Comet Machine Learning (ML) platform will help you get up to speed by letting you quickly train, test, and manage ML models in production. The tool allows businesses to easily leverage ML’s power and boost productivity through its experiment-tracking features, interactive visualizations, and monitoring capabilities.
So, create a free account now to benefit from Comet’s full feature stack.