guptaneeru
9 min read · Jun 19, 2024
Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025

Enrich Metadata Using GenAI and Foundation Models in IBM Knowledge Catalog.

With the recent advances in Generative AI, we expect data to grow significantly in the next few years. It is estimated that the volume of data created and consumed will reach 180+ zettabytes by 2025. Growing data volumes and the prevalence of data silos can be a bottleneck to the timely delivery of data-driven business insights.

In many organizations, data is owned by different departments, each with its own naming conventions. As a result, data elements are often represented with seemingly cryptic names that provide little insight into their actual content or purpose. This ambiguity makes it difficult to consume the data efficiently. Forrester estimates that between 60% and 73% of all data generated in organizations remains unused for analytics because of this poor quality.

IBM Knowledge Catalog (IKC) makes it easy to discover data across data silos by automatically adding metadata-based business context to data tables. Once data is enriched with well-defined business context, both business users and data scientists can easily discover and understand it.

Traditionally, metadata is enriched through manual data annotation and automation techniques such as linguistic matching and machine-learning-based term assignment. With the advances in LLMs and GenAI technology, IKC now takes this to the next level by using large language models to give contextual business meaning to otherwise cryptic metadata. In this article, we will learn how IBM Knowledge Catalog uses LLMs and Generative AI capabilities to semantically enrich metadata.

LLM-based Metadata Enrichment:

Semantic Name Expansions: As discussed, data is not always well defined and often has cryptic technical names. This prevents data consumers from locating and using data effectively. Presently, data stewards invest significant effort in making data usable, particularly when the original metadata names are cryptic or abbreviated.

The LLM-based Metadata Enrichment addresses this issue by intelligently matching technical metadata with predefined business concepts. It augments technical metadata with more descriptive and meaningful names, drawing on the customer-defined, domain-specific glossary concepts.

Furthermore, it incorporates a built-in dictionary designed to expand commonly used abbreviations. This functionality allows for the expansion of abbreviated names such as “cust” to “customer” and “txn” to “transaction,” among others.

Moreover, the capability offers the flexibility to accommodate customer-defined abbreviations, incorporating them into the expansion process. This feature significantly reduces the time required for customers to develop comprehensive glossaries, thereby streamlining the data management process.

In the above example, the raw metadata is ingested into IKC. It is then matched against the customer-uploaded glossary. Columns such as EMPE_ID and CMM_ID are expanded to Employee Identifier and Communication Identifier, and PLN_END_DT is expanded to Planned End Date. The system uses the customer-defined glossary to match the metadata and expand the cryptic names. Clearly defined business names make it easy to search for and find the data using natural language. If the metadata names are not expanded, these datasets are difficult to find, which hinders the self-service capability of the data fabric platform.
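To make the expansion idea concrete, here is a minimal Python sketch of dictionary-driven name expansion. It is only an illustration of the concept and not how IKC implements it: the `ABBREVIATIONS` table, its entries, and the `expand_name` helper are all hypothetical, and the real capability also matches names against glossary concepts rather than relying on a token lookup alone.

```python
import re

# Hypothetical abbreviation table: a stand-in for a built-in dictionary
# merged with customer-defined abbreviations. Entries are illustrative only.
ABBREVIATIONS = {
    "EMPE": "Employee",
    "CMM": "Communication",
    "PLN": "Planned",
    "END": "End",
    "DT": "Date",
    "ID": "Identifier",
    "CUST": "Customer",
    "TXN": "Transaction",
}


def expand_name(technical_name: str, abbreviations: dict[str, str]) -> str:
    """Expand a cryptic column name token by token.

    Tokens that are not in the table are kept in title case, so unknown
    abbreviations stay visible for a data steward to review.
    """
    # Split on underscores and any other non-alphanumeric separators.
    tokens = [t for t in re.split(r"[_\W]+", technical_name) if t]
    return " ".join(abbreviations.get(t.upper(), t.title()) for t in tokens)


if __name__ == "__main__":
    for column in ("EMPE_ID", "CMM_ID", "PLN_END_DT"):
        print(column, "->", expand_name(column, ABBREVIATIONS))
```

Keeping unknown tokens instead of dropping them is a deliberate choice in this sketch: a partially expanded name is still more searchable than the raw one, and it shows the steward which abbreviation is missing from the table.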

Semantic Description Generation: Clearly defined dataset names are useful, but they can still be ambiguous without the right context. The LLM-based Metadata Enrichment can generate context-aware descriptions for tables and columns. These descriptions take into account the surrounding columns and the context of the table. For example, a column labeled ‘AGE’ within a table named ‘Customer’ will be described as ‘Age of the customer,’ whereas within a table named ‘Building,’ it will be described as ‘The age of the building in years.’

These context-aware descriptions help consumers understand a dataset at a glance and make it easy to discover data using natural language phrases.

To achieve this, IKC leverages IBM’s Granite 13-billion-parameter foundation model, which has been fine-tuned using Open Government Data.

The foundation model is hosted on the IBM watsonx platform, which provides a seamless way to run inference against the model.

LLM based description generation

In the above example, the input asset has a column named PLN_END_DT. This name is expanded using the business glossary to Planned End Date. The expanded name is then fed to the LLM. The model infers from the context that the column relates to a communication, so it generates a meaningful description: ‘The date the communication is to be completed’. This adds context to the column and makes it easier for data stewards to understand the metadata. The descriptive information is indexed in IKC search, which enables data consumers to find the assets easily.
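One way to picture how table context can steer a generated description is to look at the input the model might receive. The sketch below only assembles a prompt string and makes no model call. The prompt layout, the `build_description_prompt` helper, and the sibling column names are assumptions made for illustration; the article does not describe the prompt IKC actually sends to the model.

```python
def build_description_prompt(
    table_name: str, column_name: str, sibling_columns: list[str]
) -> str:
    """Assemble a prompt that gives the model the table name and the
    neighbouring columns as context for one target column."""
    # Leave the target column out of the context list.
    siblings = ", ".join(c for c in sibling_columns if c != column_name)
    return (
        f"Table: {table_name}\n"
        f"Other columns: {siblings}\n"
        f"Column: {column_name}\n"
        "Write a one-sentence business description of this column."
    )


if __name__ == "__main__":
    # The same column name appears in two different tables.
    print(build_description_prompt("Customer", "AGE", ["CUSTOMER_ID", "AGE", "EMAIL"]))
    print()
    print(build_description_prompt("Building", "AGE", ["BUILDING_ID", "AGE", "FLOOR_COUNT"]))
```

The two prompts differ only in their context lines, which is the information a model would need to tell the age of a customer apart from the age of a building.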

IKC also generates detailed descriptions for datasets that explain at a glance what kind of information each dataset contains.
Some sample dataset descriptions are shown below:

Semantic Term Assignments: The primary goal of metadata enrichment is to give business meaning to data. With the Cloud Pak for Data (CPD) 5.0 release, IKC now uses Generative AI technology to assign business concepts to technical metadata. The expanded names and generated descriptions become the basis for assigning business terms to the data. Because the assignment works on the semantic meaning of the expanded metadata, it becomes more accurate.

IKC leverages vector embeddings to match the technical metadata (expanded names + descriptions) with the business terms. For this, IKC uses another model, the IBM Slate 30m encoder, which is fine-tuned on CDO data to generate embeddings for the term assignment use case. IKC first computes the embeddings of the glossary using the encoder model and persists them. During the enrichment phase, the semantic layer computes the embeddings for the incoming metadata and runs KNN searches against the glossary embeddings. Candidates are ranked by the cosine similarity between the metadata embedding and each glossary embedding. The top K matches are then assigned to the technical metadata.
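The matching step described above can be sketched as a brute-force nearest-neighbour search ranked by cosine similarity. This is a toy illustration, not IKC's implementation: the three-dimensional vectors are hand-made stand-ins for real encoder output, and the names `GLOSSARY_EMBEDDINGS`, `cosine_similarity`, and `top_k_terms` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_terms(
    query: list[float], glossary: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score every glossary term against the query and return the k best."""
    scored = [(term, cosine_similarity(query, vec)) for term, vec in glossary.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-made embeddings. A real system would compute these once with an
# encoder model and persist them, as described in the text.
GLOSSARY_EMBEDDINGS = {
    "Communication Expected End Date": [0.9, 0.1, 0.2],
    "Employee Identifier": [0.1, 0.9, 0.1],
    "Customer Age": [0.1, 0.2, 0.9],
}

if __name__ == "__main__":
    # Stand-in for the embedding of "Planned End Date" plus its description.
    column_embedding = [0.8, 0.2, 0.1]
    for term, score in top_k_terms(column_embedding, GLOSSARY_EMBEDDINGS, k=2):
        print(f"{term}: {score:.3f}")
```

A linear scan is enough to show the idea. With a large glossary, an approximate nearest-neighbour index would typically replace the loop, but the ranking criterion stays the same.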

Vector matching enables IKC to assign semantically matched concepts. Traditionally, business concepts are matched by linguistic name matching. That approach fails for cryptic names, because linguistic matching does not work when the business term is not an exact match. For instance, to match a term like Finance to Bank account, stewards need to define both terms and then author relationships between them.
With IKC, there is now no need to author a complex glossary and establish glossary relationships. Vector matching can understand the semantics of both the term and the technical data and find a good match without the relationships being defined explicitly. A term like Client will be matched to Customer or Customer details. A term like Finance will be matched to Bank, Account, Loan, and so on. Sales Information will be matched to Revenue or Sales.

Semantic Term Assignments

In the above example, the term ‘Communication Expected End Date’ is assigned to the column Planned End Date. As you can see, the column originally named PLN_END_DT is now matched to the right business term.

These are the three main LLM-based capabilities introduced in IKC with the Cloud Pak for Data 5.0 release. They are designed to make it easy for data producers to enrich their metadata so that data consumers can easily discover it. Semantically enriched metadata is the foundation for discovering assets with natural language queries.

Let’s see how to import metadata and then enrich it using LLM-based enrichment in IBM Knowledge Catalog. Here are the main steps to run LLM-based metadata enrichment in IKC. For detailed steps, see the links to the official documentation below.

Steps:
1) Create a project.

2) Ingest metadata into the project by running a Metadata Import job.

3) Run Metadata Enrichment and select the ‘Expand Metadata’ and ‘Semantic Term assignment’ options.

4) Review the results of semantic enrichments.

5) Find your data by running natural language searches in the search bar.

Step 1: To get started with semantic enrichment, create a project and a connection pointing to your data source. Once you have set up the data source, create a metadata import asset and select the assets you would like to enrich from the data source.

Different types of assets

Step 2: Once the assets are imported, you need to run an enrichment job. To use semantic term assignments, change the project settings and choose the ‘Assign Terms’ > ‘Semantic Term Assignment’ option. You can also change the thresholds for automatically accepting the AI-generated content.

Metadata Enrichment Settings

Once all the settings are in place for the project, create a new asset of type ‘Metadata enrichment asset’ (MDE). In the MDE asset, choose the data assets you imported in Step 1. You can choose either the entire metadata import (MDI) asset or specific assets. If you choose the MDI asset, all the assets imported by the import job will be enriched.

In the metadata enrichment job, select the ‘Expand Metadata’ and ‘Assign Terms’ objectives and follow the wizard to create the job.

Metadata enrichment objectives for semantic enrichments

Depending on the amount of data, it can take a few minutes for the job to finish. Once it has finished, you can review the enrichment results. You can see the AI-generated suggestions and manually accept or reject them.
Note that the scores for semantic term assignments are sometimes low. If you do not see any semantic term assignments, consider lowering the threshold values for term assignments.
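To see why lowering a threshold can make missing term assignments appear, consider this small sketch of score-based triage. It is a simplified model of the behaviour described above, not IKC's actual logic: the `TermSuggestion` type, the `triage` helper, the threshold names, and all score values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermSuggestion:
    column: str
    term: str
    score: float  # similarity score, assumed here to lie between 0 and 1


def triage(
    suggestions: list[TermSuggestion],
    assign_threshold: float,
    suggest_threshold: float,
) -> dict[str, list[TermSuggestion]]:
    """Split suggestions into auto-assigned, shown-for-review, and dropped."""
    if suggest_threshold > assign_threshold:
        raise ValueError("suggest_threshold must not exceed assign_threshold")
    buckets: dict[str, list[TermSuggestion]] = {
        "assigned": [], "suggested": [], "dropped": []
    }
    for s in suggestions:
        if s.score >= assign_threshold:
            buckets["assigned"].append(s)   # accepted automatically
        elif s.score >= suggest_threshold:
            buckets["suggested"].append(s)  # left for a steward to review
        else:
            buckets["dropped"].append(s)    # never shown
    return buckets


if __name__ == "__main__":
    results = [
        TermSuggestion("PLN_END_DT", "Communication Expected End Date", 0.82),
        TermSuggestion("CMM_ID", "Communication Identifier", 0.55),
        TermSuggestion("EMPE_ID", "Customer", 0.30),
    ]
    for suggest_at in (0.6, 0.5):
        buckets = triage(results, assign_threshold=0.8, suggest_threshold=suggest_at)
        print(suggest_at, {name: [s.term for s in items] for name, items in buckets.items()})
```

In this toy run, the match for CMM_ID is hidden while the review threshold is 0.6 and becomes visible once it drops to 0.5. A lower threshold also lets weaker matches through, so the steward review step matters more.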

Once the data is enriched, a data steward can review the enriched metadata and accept or reject the assigned terms and generated content. Once the enrichment results are reviewed and approved, the steward can publish them to a catalog, where they become available to other users.

The enriched metadata powers the search index. The data consumers can easily find the data assets by running searches against the enriched metadata in natural language.

Enriched metadata ready to be reviewed

Conclusion: In this article, we learned how IBM Knowledge Catalog harnesses LLMs and Generative AI to enrich metadata. Business terms assigned using Generative AI are more accurate and reduce the work for data stewards. The enriched metadata is easy to discover and can readily be used to drive business insights.

Call To Action: Register for our July 10 webinar to learn how your organization can start harnessing gen AI-based metadata enrichment to scale data governance.

How to try the IKC enrichment capabilities: You can try the IKC enrichment capabilities on SaaS by provisioning a free instance. https://dataplatform.cloud.ibm.com/registration/stepone?apps=data_catalog,cos&context=cpdaas

IBM has launched the new Knowledge Catalog Standard and Premium Cartridges to help organizations scale data governance with a modern, LLM-powered data intelligence solution. Customers can access these LLM-based capabilities through the new Cartridges or through the Cloud Pak for Data SaaS platform on IBM Cloud.

For the official metadata enrichment documentation, see https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=assets-creating-enrichment-asset

See details here on how to tweak the settings for semantic enrichments.
https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=assets-default-enrichment-settings#ai-name

Check out: https://coreykeyser.medium.com/enrich-your-data-with-ibm-knowledge-catalogs-large-language-model-powered-metadata-enrichment-44153df3f912

For step-by-step instructions with sample assets and a glossary, check out:
https://medium.com/@rakhi.sa/run-semantic-expansion-in-ibm-knowledge-catalog-36b47f7b1d25