Tuesday, 02 January 2024 12:17 GMT

Wikidata Unveils Open Vector Database For AI Use


(MENAFN- The Arabian Post)

Wikimedia Deutschland has introduced a new vector database designed to make Wikidata's knowledge graph directly usable by AI systems. The initiative, known as the Wikidata Embedding Project, aims to convert structured facts into vector representations so that large language models and related AI tools can conduct semantic queries grounded in verified data.

Under this system, Wikidata's more than 119 million entries are embedded into high-dimensional vectors using a model developed in collaboration with Jina AI. The vector embeddings are hosted on DataStax's Astra DB, which serves as the scalable backend. The data snapshot currently captures Wikidata information up to September 18, 2024. Entries created after that date are not yet incorporated, but minor edits to existing items are unlikely to disrupt the vector representation, as the embeddings encode a general “idea” of each item.

The key innovation lies in replacing or augmenting traditional SPARQL queries and keyword searches with semantic similarity methods. AI systems can now issue natural-language queries and retrieve contextually related items rather than relying solely on exact-match lookups, a shift intended to reduce hallucinations and improve the traceability of AI output. The embedding infrastructure also supports the Model Context Protocol (MCP), an open standard for connecting AI systems to external data sources, which makes the vector database easier for AI models to query.
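
To illustrate the difference between exact-match lookup and semantic similarity, here is a minimal Python sketch. It is not the project's code: the item labels, the three-dimensional vectors and the query vector are all hand-written assumptions, whereas real embeddings are produced by a trained model and have far more dimensions.

```python
import math

# Toy 3-dimensional "embeddings", hand-written for illustration only.
# Real embeddings come from a trained model and are much larger.
ITEMS = {
    "Douglas Adams (English writer)": [0.9, 0.1, 0.0],
    "Berlin (capital of Germany)": [0.0, 0.2, 0.9],
    "The Hitchhiker's Guide to the Galaxy (novel)": [0.8, 0.3, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_search(query, items):
    """Exact-match style lookup: the query must appear in the label."""
    return [label for label in items if query.lower() in label.lower()]


def semantic_search(query_vector, items, top_k=2):
    """Rank items by cosine similarity to the query vector."""
    ranked = sorted(
        items,
        key=lambda label: cosine_similarity(query_vector, items[label]),
        reverse=True,
    )
    return ranked[:top_k]


# Assumed vector for a query such as "science fiction author".
# In practice the same embedding model would produce it from the text.
QUERY_VECTOR = [0.85, 0.2, 0.05]

keyword_hits = keyword_search("science fiction author", ITEMS)
semantic_hits = semantic_search(QUERY_VECTOR, ITEMS)
```

In this toy setup the keyword lookup finds nothing, because no label contains the query string, while the similarity ranking returns the two items whose vectors point in roughly the same direction as the query.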

The project currently supports English, French, and Arabic, with further language support planned. Among its intended use cases are fact-checking, entity disambiguation, zero-shot classification, and hybrid search models that combine graph reasoning with vector retrieval. Wikimedia is hosting a webinar for developers interested in integration, and feedback is being solicited for future updates.
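
One way to picture a hybrid search is as a two-step process: a structured filter over the graph narrows the candidates, and vector similarity then ranks what remains. The Python sketch below works under that assumption. The triples, vectors and function names are invented for illustration and do not reflect the project's actual data or interface.

```python
import math

# Toy graph statements as (subject, property, value) triples,
# loosely modeled on how Wikidata expresses facts.
TRIPLES = [
    ("Berlin", "instance of", "city"),
    ("Paris", "instance of", "city"),
    ("Douglas Adams", "instance of", "human"),
]

# Hand-written toy vectors; a real system would use model-generated embeddings.
VECTORS = {
    "Berlin": [0.1, 0.9],
    "Paris": [0.3, 0.8],
    "Douglas Adams": [0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def graph_filter(triples, prop, value):
    """Graph step: keep only subjects that have the given statement."""
    return {s for s, p, v in triples if p == prop and v == value}


def hybrid_search(query_vector, prop, value, top_k=2):
    """Vector step: rank the graph-filtered candidates by similarity."""
    candidates = graph_filter(TRIPLES, prop, value)
    ranked = sorted(
        candidates,
        key=lambda s: cosine_similarity(query_vector, VECTORS[s]),
        reverse=True,
    )
    return ranked[:top_k]


# Assumed query vector; "Douglas Adams" is close to it in vector space
# but is excluded because the graph says it is not a city.
results = hybrid_search([0.1, 0.9], "instance of", "city")
```

The point of the combination is that the graph constraint enforces a hard fact (only cities qualify), while the similarity score handles the fuzzy part of the query.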


Wikidata has long been a backbone of Wikimedia's open knowledge ecosystem. It is a collaboratively edited, multilingual knowledge graph that feeds into other projects such as Wikipedia and makes structured data available under a public domain dedication (CC0). The challenge for AI systems has been that, while the data is machine-readable, it has not always been in a form well suited to semantic search or generative AI workflows. The new embedding layer is intended to bridge that gap.

Philippe Saadé, the project's AI manager, emphasised the goal of providing fair access. “This Embedding Project launch shows that powerful AI doesn't have to be controlled by a handful of companies,” he said, underscoring the project's open ethos. Lydia Pintscher, Wikidata Portfolio Lead, described the move as a step toward more trustworthy, transparent AI founded on verifiable data.
