To the central content area
Toggle Dark/Light Mode Dark Mode
:::

Taiwan Sovereign AI Training Corpus Goes Online! MODA Collaborates with 200 Agencies to Create Local Language Resources

With the rapid development of artificial intelligence (AI) globally, Taiwan certainly cannot be left behind! The Ministry of Digital Affairs (MODA) released the Taiwan Sovereign AI Training Corpus today (24th). Using a wide range of high-quality traditional Chinese data to support AI model training that is more relevant to Taiwan's linguistic, cultural and life contexts, the corpus facilitates AI models in cultivating higher localized recognition and semantic comprehension capabilities to meet the needs of Taiwan's society and industry.

Currently, more than 200 government agencies have participated, and over 2,000 datasets and 600 million tokens have been uploaded. These are high-quality datasets with Taiwanese cultural characteristics from various agencies, covering fields such as language, culture, education, biology, and geography. These datasets serve as teaching materials for AI, helping AI models better understand Taiwan and learn more authentic and natural Taiwanese linguistic expression. Director Wei from the Department of General Planning, Ministry of Culture pointed out that the public art and cultural asset datasets released by the Ministry of Culture embody Taiwan's rich and diverse arts and culture, and can serve as important materials for training AI models in learning about Taiwan culture. Section Chief Deng from the Department of Information and Technology Education, Ministry of Education said that the language dictionary data provided by the Ministry of Education covers Taiwanese, Hakka, and Mandarin, which helps to enhance the accuracy of AI models in lexical precision and semantic comprehension.

Meanwhile, in order to enable government agencies and the private sector to release data with confidence and use corpora with peace of mind, MODA and the Intellectual Property Office of the Ministry of Economic Affairs have collaborated to promulgate the Taiwan Sovereign AI Training Corpus Licensing Terms - Version 1. It delineates the licensing terms for the corpus released, which will reduce huge administrative costs of individual copyright negotiations, and minimize potential copyright disputes arising from usage of AI training data. Through the standardized licensing terms, the development and application of sovereign AI can be accelerated.

It is worth mentioning that the corpus connects the results of more than ten years of Government Open Data, synchronizing the enriched open data in text diligently accumulated in the past to the corpus. Users can query and download the corpus data as needed, making the search and application functions of the corpus easier to use.

MODA said that the corpus content will continue to be expanded in the future, extending from central government agencies to local governments and private organizations, allowing more people to participate and jointly promote the development of sovereign AI through public-private partnerships. We sincerely invite AI model trainers to apply for utilizing the corpus at https://taic.moda.gov.tw. Let us create AI that understands Taiwan using Taiwanese corpora!
 

Go Top