
DataStax, a provider of an AI platform, has announced a partnership with Wikimedia Deutschland, the organization behind German Wikipedia and the development of Wikidata and Wikibase. Using the DataStax AI Platform, powered by NVIDIA AI technologies such as NeMo Retriever and NIM microservices, Wikimedia Deutschland is transforming Wikidata into a vectorized database, making it easier for developers to access and use.
Wikidata, the world’s largest collaborative knowledge graph, supports over 300 languages and is maintained by a global community of 24,000 volunteers who have contributed more than 114 million entries. These entries, widely used by open-source developers, provide a vast dataset of structured knowledge. The collaboration aims to make this data more accessible to the open-source AI/ML community, fostering innovation in artificial intelligence.
A significant challenge addressed by this partnership was generating vector embeddings for Wikidata’s massive and dynamic dataset. With DataStax’s solution, Wikimedia Deutschland can now embed this data, enabling advanced semantic search in multiple languages. Initial implementations focused on English, achieving embedding speeds nearly ten times faster than previous on-premises GPU solutions. This capability, combined with real-time updates, supports rapid experimentation and integration for AI/ML applications.
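To illustrate what semantic search over embedded entries means in practice, here is a minimal sketch in Python. It is not the DataStax or Wikimedia Deutschland implementation: the three-dimensional vectors, the entry labels, and the function names `cosine_similarity` and `semantic_search` are all assumptions made up for illustration. In a real system an embedding model would produce much longer vectors and a vector database would do the ranking.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, index, top_k=2):
    """Rank entries in `index` by similarity to `query_vec`, best first."""
    scored = [
        (label, cosine_similarity(query_vec, vec))
        for label, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones come from an embedding model.
TOY_INDEX = {
    "Berlin (city)": [0.9, 0.1, 0.0],
    "Douglas Adams (author)": [0.0, 0.2, 0.9],
    "Hamburg (city)": [0.8, 0.3, 0.1],
}

# A made-up query vector standing in for an embedded question about cities.
results = semantic_search([1.0, 0.2, 0.0], TOY_INDEX)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

With these toy values the two city entries rank above the author entry, because their vectors point in nearly the same direction as the query. This is the sense in which embeddings allow entries to be retrieved by meaning rather than by exact keyword match.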
Dr. Jonathan Fraine, Chief Technology Officer of Wikimedia Deutschland, highlighted the significance of the collaboration, stating, “DataStax’s platform has enabled us to deliver real-time, high-quality data at unprecedented speeds, laying the foundation for multilingual, multicultural, and dynamic AI-driven applications.”
The platform demonstrated its power by vectorizing over 10 million entries in under three days using DataStax’s Astra DB on AWS. The vectorized data remains freely available under a CC0 license, in keeping with Wikimedia’s mission of open access. The architecture handles the hundreds of thousands of daily updates contributed by a global community, keeping the data accessible in real time for millions of users.
Lydia Pintscher, Portfolio Lead for Wikidata at Wikimedia Deutschland, emphasized the impact, stating, “DataStax’s innovative approach has unlocked new capabilities, streamlining processes and improving the speed and accuracy of insights delivered to our global audience.”
The collaboration not only simplifies AI development but also enhances the utility of Wikidata. DataStax’s Astra DB serverless model and NVIDIA’s NeMo technologies provide the scalability and near-zero latency required to handle Wikidata’s vast and constantly evolving dataset. Future plans include expanding the platform’s multilingual capabilities and integrating advanced features like graphRAG to improve search reliability.
Ed Anuff, Chief Product Officer at DataStax, expressed enthusiasm for the partnership: “We’re proud to support Wikimedia Deutschland in advancing open-source AI development. This collaboration underscores the transformative potential of open and high-quality data in driving innovation for the public good.”
As part of its broader efforts, DataStax continues to enhance its AI offerings on AWS, including Astra Vectorize and Langflow, which simplify embedding processes and provide developers with powerful tools to test and deploy foundational AI models. These innovations reduce costs and improve efficiency, making AI development more accessible to a global audience.
The partnership between Wikimedia Deutschland and DataStax represents a significant step toward democratizing AI-driven access to knowledge, paving the way for future advancements in multilingual, open-source AI applications.