
Taiwan AI Training Corpus

Background

To strengthen the infrastructure required for AI development and accelerate the formation of Taiwan's sovereign AI ecosystem, we are collecting high-quality, semantically coherent Traditional Chinese corpora that reflect Taiwan's cultural characteristics. The goal is to enhance the diversity and quality of Chinese-language training corpora in support of AI model training and application development. We will also establish a lawful and compliant mechanism for sharing corpus resources to promote the development of sovereign AI.

Building the Taiwan AI Training Corpus

Because large language models (LLMs) depend on natural language expression capabilities, training data should focus on high-quality content that is semantically coherent, structurally complete, and fluent. The Ministry has prioritized mobilizing government agencies to provide high-quality Traditional Chinese corpora. In collaboration with more than 200 agencies, including the Ministry of Education, the Ministry of Culture, the Council of Indigenous Peoples, and the Hakka Affairs Council, it will jointly build the "Taiwan AI Training Corpus". The corpus will cover diverse topics such as culture and the arts, language and vocabulary, history and cultural heritage, local culture, tourism, and education and learning, thereby supporting the development and application of Taiwan's sovereign AI.

Promoting a lawful and compliant mechanism for sharing corpus resources to advance the development of sovereign AI.

Licensing Mechanism First

To accelerate improvements in the diversity and quality of AI training data, reduce copyright disputes related to that data, and encourage agencies to release more of it, the Ministry has drafted and published the "Taiwan Sovereign AI Training Corpus License - Version 1". The license terms help establish a workable licensing mechanism between data providers and AI model trainers, thereby facilitating lawful data circulation and strengthening technological autonomy. In addition, the Ministry has worked with the Intellectual Property Office of the Ministry of Economic Affairs to develop licensing application cases for agencies to consult, with the aim of balancing the expanded release of training data against the rights and interests of original copyright holders.
