The MIT-IBM Watson AI Lab and the University of Washington have introduced TOUCAN, the largest open dataset for training AI agents, featuring 1.5 million real tool interactions. The dataset aims to improve open models by showing them how real-world tools are used effectively, filling a significant gap left by existing datasets. Built from tools hosted on public MCP servers, TOUCAN documents detailed interactions involving over 2,000 tools across domains such as finance and web development, and it captures realistic errors and context dependencies because the tool calls were actually executed against live APIs. Earlier datasets relied on simulated tool responses, so TOUCAN gives a more faithful picture of how tool execution behaves in practice. In testing, Qwen-2.5 models fine-tuned on TOUCAN showed notable gains on multiple benchmarks and outperformed larger systems, suggesting that high-quality data can extend the capabilities of smaller models. TOUCAN is available on GitHub and Hugging Face and underscores the importance of high-quality training data for open-source AI development. Future plans include tool simulation and benchmarking enhancements.