To gain a deeper understanding of Southeast Asia’s demographic and cultural diversity, Google’s research division has partnered with AI Singapore on a Large Language Model (LLM). Together, the two have introduced Project Southeast Asian Languages in One Network Data. Here is what we know about it.
The collaboration aims to develop an LLM that better serves and comprehends Southeast Asia’s diverse demographic and cultural makeup.
To expand the datasets available for training, fine-tuning, and evaluating AI models in specific languages, Google’s research division will collaborate with AI Singapore. The program, named Project Southeast Asian Languages in One Network Data (SEALD), aims to enhance the cultural context of LLMs developed for the region, according to a statement released by AI Singapore on Monday.
These languages will be supported first
- According to the government agency, the partnership will first focus on Indonesian, Thai, Tamil, Filipino, and Burmese, with both parties working together to build translation and localization models.
- They will also provide tools to streamline dataset-tailoring procedures and large-scale translation and localization work. In addition, pre-training criteria for languages spoken in Southeast Asia will be presented.
- AI Singapore announced that all Project SEALD datasets and results will be made available as open source. The project will assist with model training initiatives under the Singapore government agency’s SEA-LION (Southeast Asian Languages in One Network) program, which was introduced last year.
How will the model work?
- SEA-LION is built on two base models and is driven by open-source LLMs pre-trained with the region’s social and cultural nuances in mind.
- It includes one model with three billion parameters and another with seven billion. The training set contains 981 billion language tokens.
- These tokens, according to AI Singapore, are word fragments produced by segmenting text during tokenization.
- The training set comprises 623 billion English tokens, 128 billion Southeast Asian tokens, and 91 billion Chinese tokens.
- Project SEALD is currently developing a use case to improve communication with migrant workers in Singapore, many of whom speak regional languages more fluently than English.
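To illustrate what “tokens” means in the counts above, here is a toy sketch of tokenization. The greedy longest-match approach and the tiny vocabulary below are purely hypothetical (real LLM tokenizers, such as BPE, learn their vocabularies from data), but they show how text is segmented into word fragments:

```python
# Toy illustration of tokenization: segmenting text into word fragments.
# The vocabulary here is hypothetical; real tokenizers learn theirs from data.
TOY_VOCAB = {"sing", "apore", "lang", "uage", "model", "s", " "}

def tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Character not covered by the vocabulary: emit it on its own.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("singapore language models"))
# ['sing', 'apore', ' ', 'lang', 'uage', ' ', 'model', 's']
```

A token count like SEA-LION’s 981 billion is simply the length of such a token list summed over the whole training corpus.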
Dive into AI with Machine Learning Singapore! Explore in-context learning, LLM pre-training for Southeast Asian languages, and extended context training. See how LLMs enhance summarization systems.
Register now! 🚀 → https://t.co/JL1yWNsEgw
— Google Developers Space, Singapore (@DevSpaceSG) April 18, 2024
- To facilitate community engagement, Project SEALD datasets and outputs will be integrated with Google Cloud and with generative AI applications created under the Singapore government’s AI Trailblazers program.
- Project SEALD partners will also collaborate with industry, academia, and government agencies on projects including data collection and quality assurance.
- AI Singapore also intends to make the SEA-LION LLM available in Model Garden on Vertex AI, Google Cloud’s catalog of pre-verified AI models.
- The regional LLM will also be available on Hugging Face, an open-source repository for AI tools and pre-trained models.
- AI Singapore also revealed on Monday that it has agreements and letters of intent in place to build datasets and applications for LLM models with a number of enterprises in Malaysia, Vietnam, and Indonesia.
- In addition, AI Singapore is developing resources on the syntax and semantics of regional languages in collaboration with partners in the Philippines, Thailand, and Indonesia. These include the Ateneo Social Computing Science Laboratory in the Philippines and the Vidyasirimedhi Institute of Science and Technology in Thailand.