This article was first published on Tech in Asia.
Back in 2017, data was deemed a commodity more valuable than oil. The importance of data has only grown since then, with companies across industries storing and utilising more of it than ever before.
In today’s data-driven companies, every employee is a data consumer who regularly uses in-house data for a variety of tasks, ranging from decision-making via dashboards and reports to training machine learning models on historical data.
This obsession with data comes at a cost. The more data a company stores, the bigger the haystack its employees must search through before finding a dataset that meets their specific needs.
(Watch: How to leverage data for success with Grab’s data science team)
Keyword-based search tools have traditionally been used to find such datasets, but large language models (LLMs) offer a better approach. Here’s how they work and how they can make data discovery easier for various types of businesses.
Streamlining data discovery requires solving problems on multiple levels, from the granular dataset level to organisation-wide mental model shifts.
Here are some of the most significant hurdles faced by companies:
Lack of documentation
Let’s face it: Humans are lazy. This behaviour may have helped us conserve energy in the wild, but it has also turned into one of the biggest issues in organisations that are larger than a couple of hundred people.
Employees responsible for handling data in a company are usually stretched thin. They create and manage multiple datasets, usually with only their team’s use cases in mind. Unless a dataset is highly popular, adding documentation for it is not prioritised. This makes it very difficult for those not in the know to discover and use datasets.
Reliance on tribal knowledge
The lack of widespread, high-quality documentation limits any search tool’s data discovery capabilities. As a result, most employees turn directly to data teams for help, usually via internal messaging tools like Slack.
This reliance on tribal knowledge comes at a high cost of speed and efficiency. Staff who work with data usually don’t have the time to answer every question within a reasonable timeframe. This means a process that should take seconds can end up dragging on for days.
Data silos
When companies cross a critical threshold or start producing multiple products, they naturally subdivide into large suborganisations. Each suborganisation can be considered a data domain with its own massive set of datasets.
Companies this large unlock their true potential when employees are able to utilise data across domains for various use cases. One good example of this would be analysing how traffic seen in mapping data may cause delays in food delivery.
However, this cross-domain data usage is impossible without good data discovery mechanisms in place.
Staff in charge of handling data should spend their time making good data, while those who need to use it should focus on utilising that data well. Everything else should be automated as much as possible.
(Read more: The machine learning magic that powers Grab’s marketplace)
At Grab, we worked to create a system that could understand a data consumer’s natural language query and guide them to the right dataset in mere seconds, no matter which domain the dataset belonged to. LLMs were key to making this vision a reality.
Besides improving our existing data search tools, we launched two major LLM applications that helped us make significant progress on these challenges.
Documentation generator
If the economist Vilfredo Pareto were a data engineer, he’d advise us to focus on improving documentation coverage for datasets. This 20 per cent effort would solve 80 per cent of our problems.
With this in mind, we built a documentation generation engine that uses LLMs to create output based on dataset schemas and sample data automatically. We also put feedback mechanisms in place to ensure that staff responsible for data were always reviewing AI-generated content and that other employees who used that data knew which documents were created by AI.
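The core of such an engine can be sketched in a few lines. The snippet below is an illustration of the approach described above, not Grab’s actual implementation: it assembles a prompt from a dataset’s schema and sample rows, passes it to a hypothetical `call_llm` stand-in (any chat-completion client would do), and tags the result as AI-generated so the human-review feedback loop can pick it up.

```python
def build_doc_prompt(table_name, schema, sample_rows):
    """Assemble an LLM prompt describing a dataset's schema and sample data."""
    columns = "\n".join(f"- {name}: {dtype}" for name, dtype in schema.items())
    samples = "\n".join(str(row) for row in sample_rows[:5])  # cap sample size
    return (
        f"You are a data steward. Write concise documentation for the "
        f"table `{table_name}`.\n"
        f"Columns:\n{columns}\n"
        f"Sample rows:\n{samples}\n"
        f"Describe the table's purpose and each column in plain English."
    )

def generate_docs(table_name, schema, sample_rows, call_llm):
    """Generate a documentation draft and flag it for human review."""
    draft = call_llm(build_doc_prompt(table_name, schema, sample_rows))
    # Mark the draft as AI-generated and unreviewed so both reviewers
    # and downstream data users can see its provenance.
    return {"table": table_name, "doc": draft,
            "source": "ai_generated", "reviewed": False}
```

Keeping the prompt builder separate from the LLM call makes the engine easy to test and lets the model be swapped without touching the prompt logic.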
This feature drove a dramatic increase in documentation coverage, which naturally made Grab’s data landscape more comprehensible. At the very least, data users could understand what a dataset contained and how it should be used without having to reach out to colleagues.
However, this feature is far from perfect, and we’re continuing to refine it through several follow-up initiatives.
Data discovery co-pilot
Asking busy staff to digest extensive documentation for multiple datasets every time they have a new data requirement is a tall order. LLMs, however, can understand large documents in milliseconds.
With a retrieval augmented generation system in place, LLMs can even parse every document a company has to offer. Moreover, they can provide conversational interfaces for discovering datasets, making you feel like you’re reaching out to a colleague who can help you with any question.
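At its simplest, a retrieval augmented generation loop ranks dataset documents against a natural-language query and stuffs the best matches into the LLM prompt. The toy sketch below uses bag-of-words cosine similarity so it runs anywhere; it is an illustration only, and a production system like the one described here would use learned embeddings and a vector store instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]

def build_rag_prompt(query, docs):
    """Build an LLM prompt grounded in the retrieved documentation."""
    context = "\n---\n".join(d["text"] for d in retrieve(query, docs))
    return f"Answer using only this documentation:\n{context}\n\nQuestion: {query}"
```

Because the model only sees retrieved documentation, its answers stay grounded in what the company actually has, rather than what the model imagines it might have.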
Thus, with good documentation in place, we set out to create this conversational data discovery experience for our users. Today, staff at Grab can use this data discovery co-pilot either via our data search tool or via Slack.
We are now working on integrating this co-pilot into the various Slack channels where staff usually seek assistance from colleagues in charge of data. The co-pilot aims to help answer queries without human involvement.
When a question can’t be answered using existing documentation and an employee needs to step in, our co-pilot lets that employee regenerate the relevant documents with a single click, using the entire conversation with the querying colleague as context.
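The regeneration step can be sketched as folding the resolved thread back into the documentation prompt, so the human answer is captured for every future reader. The function and field names below are illustrative assumptions, not Grab’s internal API.

```python
def regenerate_doc(existing_doc, thread, call_llm):
    """Regenerate dataset docs using a resolved Q&A thread as extra context."""
    # Flatten the messaging thread into a plain transcript the LLM can read.
    transcript = "\n".join(f"{m['author']}: {m['text']}" for m in thread)
    prompt = (
        "Update this dataset documentation so that the question in the "
        "conversation below is answered by the documentation itself.\n"
        f"Current documentation:\n{existing_doc}\n"
        f"Conversation:\n{transcript}"
    )
    return call_llm(prompt)
```

Each answered question thus becomes a permanent improvement to the docs, instead of knowledge that stays buried in a chat thread.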
LLM-powered data discovery has made it much easier to find data within Grab. Staff can now locate datasets faster than ever before, cross-domain data usage is on the rise, and the company as a whole has become more data-driven.
It is worth pointing out, however, that Grab-level resources aren’t a prerequisite for success. You can start small and iterate by experimenting with documentation generation for a subset of your most important datasets.
You can then leverage open-source LLMs to keep costs down and focus your efforts on the datasets and domains that are most critical to your business. Finally, you can engage staff who produce data and those who use it early in the process to ensure you aren’t wasting effort and are only building what your company needs.
These principles and approaches can be adapted to fit any data-driven company, no matter the size or industry, with the smart use of LLMs. Your employees – and your bottom line – will thank you for taking this step.