Alice Group has released two Korean-language educational datasets on the Hugging Face platform to support AI research and applications. The first, “Korean FineWeb-Edu Demo,” is a 5% sample of a large corpus designed for training Korean large language models (LLMs) in academic domains. The second, “Korean-webtext-edu,” is curated for educational value from existing Korean web texts. The Korean FineWeb-Edu Demo features around 190 billion tokens translated from English and is intended as a preliminary tool for verifying whether the data is suitable for large-scale training. The Korean-webtext-edu dataset emphasizes factuality, contextual consistency, and educational relevance, which suits it to developing Korean AI models. Alice Group intends the release to support both domestic and international AI research by giving developers and researchers high-quality data. Soo-In Kim, CRO of Alice Group, emphasized the company’s commitment to strengthening Korean AI research and industry growth through innovative data solutions.