{"id":225293,"date":"2024-10-03T13:55:45","date_gmt":"2024-10-03T05:55:45","guid":{"rendered":"https:\/\/www.grab.com\/sg\/?post_type=editorial&#038;p=225293"},"modified":"2024-10-03T13:55:51","modified_gmt":"2024-10-03T05:55:51","slug":"internal-data-llms-engineering-software","status":"publish","type":"editorial","link":"https:\/\/www.grab.com\/sg\/inside-grab\/stories\/internal-data-llms-engineering-software\/","title":{"rendered":"Finding internal data is tough. Here\u2019s how LLMs make it easier"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"225293\" class=\"elementor elementor-225293\" data-elementor-post-type=\"editorial\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-4628f63 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4628f63\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6cb0de0\" data-id=\"6cb0de0\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap\">\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-d09457c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d09457c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1db8071\" data-id=\"1db8071\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2bc9896 gr21-boxed-content  
editorial-gr21-boxed-content elementor-widget elementor-widget-text-editor\" data-id=\"2bc9896\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><i>This article was first published on <a href=\"https:\/\/www.techinasia.com\/finding-internal-data-tough-heres-llms-easier\">Tech in Asia<\/a>.<\/i><\/p><p>Back in 2017, data was deemed a commodity\u00a0<a href=\"https:\/\/www.economist.com\/leaders\/2017\/05\/06\/the-worlds-most-valuable-resource-is-no-longer-oil-but-data\" target=\"_blank\" rel=\"nofollow noopener\">more valuable than oil<\/a>. The importance of data has only grown since then, with companies across industries storing and utilising more of it than ever before.<\/p><p>In today\u2019s data-driven companies, every employee is a data consumer who regularly uses in-house data for a variety of tasks, ranging from decision-making via dashboards and reports to training machine learning models by feeding them with historical info.<\/p><p>This obsession with data comes at a cost. The more data a company stores, the bigger the haystack its employees must search through before finding a dataset that meets their specific needs.<\/p><p><strong>(Watch: <a href=\"https:\/\/www.grab.com\/sg\/inside-grab\/stories\/watch-leveraging-data-for-success-with-grabs-data-science-team\/\">How to leverage data for success with Grab&#8217;s data science team<\/a>)<\/strong><\/p><p>Keyword-based search tools have traditionally been used to find such datasets, but\u00a0<a href=\"https:\/\/www.ibm.com\/topics\/large-language-models\" target=\"_blank\" rel=\"nofollow noopener\">large language models<\/a>\u00a0(LLMs) offer a better approach. 
Here\u2019s how they work and how they can make data discovery easier for various types of businesses.<\/p><h5>The data discovery challenge<\/h5><p>Streamlining data discovery requires solving problems on multiple levels, from the granular dataset level to organisation-wide mental model shifts.<\/p><p>Here are some of the most significant hurdles faced by companies:<\/p><p><strong>Lack of documentation<\/strong><\/p><p>Let\u2019s face it: Humans are lazy. This behaviour may have helped us conserve energy in the wild, but it has also turned into one of the biggest issues in organisations that are larger than a couple of hundred people.<\/p><p>Employees responsible for handling data in a company are usually stretched thin. They create and manage multiple datasets, usually with only their team\u2019s use cases in mind. Unless a dataset is highly popular, adding documentation for it is not prioritised. This makes it very difficult for those not in the know to discover and use datasets.<\/p><p><strong>Reliance on tribal knowledge<\/strong><\/p><p>The lack of widespread, high-quality documentation limits any search tool\u2019s data discovery capabilities. Naturally, most employees turn directly to those in data teams for help, usually via internal messaging tools like Slack.<\/p><p>This reliance on\u00a0<a href=\"https:\/\/helpjuice.com\/blog\/tribal-knowledge\" target=\"_blank\" rel=\"nofollow noopener\">tribal knowledge<\/a>\u00a0comes at a high cost to speed and efficiency. Staff who work with data usually don\u2019t have the time to answer every question within a reasonable timeframe. This means a process that should take seconds can end up dragging on for days.<\/p><p><strong>Data silos<\/strong><\/p><p>When companies cross a critical threshold or start producing multiple products, they naturally subdivide into large suborganisations. 
Each suborganisation can be considered a data domain with its own large collection of datasets.<\/p><p>Companies this large unlock their true potential when employees are able to utilise data across domains for various use cases. One good example of this would be analysing how traffic seen in mapping data may cause delays in food delivery.<\/p><p>However, this cross-domain data usage is impossible without good data discovery mechanisms in place.<\/p><h5>Work smarter, not harder<\/h5><p>Staff in charge of handling data should spend their time producing high-quality data, while those who need to use it should focus on utilising that data well. Everything else should be automated as much as possible.<\/p><p><strong>(Read more: <a href=\"https:\/\/www.grab.com\/sg\/inside-grab\/stories\/the-machine-learning-magic-that-powers-grabs-marketplace\/\">The machine learning magic that powers Grab\u2019s marketplace<\/a>)<\/strong><\/p><p>At Grab, we worked to create a system that could understand a data consumer\u2019s natural language query and guide them to the right dataset in mere seconds, no matter which domain the dataset belonged to. LLMs were key to making this vision a reality.<\/p><p>Besides improving our existing data search tools, we launched two major LLM applications that helped us make significant progress on these challenges.<\/p><p><strong>Documentation generator<\/strong><\/p><p>If the economist\u00a0<a href=\"https:\/\/www.britannica.com\/money\/Vilfredo-Pareto\" target=\"_blank\" rel=\"nofollow noopener\">Vilfredo Pareto<\/a> were a data engineer, he\u2019d advise us to focus on improving documentation coverage for datasets. This 20 per cent effort would solve 80 per cent of our problems.<\/p><p>With this in mind, we built a documentation generation engine that uses LLMs to automatically generate documentation from dataset schemas and sample data. 
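A minimal sketch of how such a schema-plus-samples generator might be wired up, assuming an OpenAI-compatible chat client; the prompt wording, model name, and provenance tags here are illustrative assumptions, not Grab's actual implementation:

```python
# Sketch of an LLM-backed dataset documentation generator.
# The prompt, model name, and client interface are assumptions.
import json


def build_doc_prompt(table_name, schema, sample_rows):
    """Assemble a prompt asking an LLM to document a dataset
    from its column schema and a few sample rows."""
    schema_lines = "\n".join(f"- {col}: {dtype}" for col, dtype in schema.items())
    samples = "\n".join(json.dumps(row, default=str) for row in sample_rows)
    return (
        f"You are documenting the table `{table_name}`.\n"
        f"Schema:\n{schema_lines}\n"
        f"Sample rows:\n{samples}\n"
        "Write a concise description of the table and each column, "
        "and note likely use cases."
    )


def generate_documentation(client, table_name, schema, sample_rows,
                           model="gpt-4o-mini"):
    """Call any OpenAI-compatible chat client; swap in your provider."""
    prompt = build_doc_prompt(table_name, schema, sample_rows)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    # Tag the result so reviewers know its provenance and that it
    # still needs a human sign-off.
    return {"table": table_name,
            "doc": resp.choices[0].message.content,
            "source": "ai-generated",
            "reviewed": False}
```

Keeping the provenance tag on every generated document is what makes the review loop described below possible: data owners can filter for unreviewed AI output, and consumers can see which docs a human has not yet vetted.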
We also put feedback mechanisms in place to ensure that staff responsible for data were always reviewing AI-generated content and that other employees who used that data knew which documents were created by AI.<\/p><p>This feature drove a dramatic increase in documentation coverage, which naturally made Grab\u2019s data landscape more comprehensible. At the very least, data users could understand what a dataset contained and how it should be used without having to reach out to colleagues.<\/p><p>However, this feature is far from perfect. To improve it further, we\u2019re working on several initiatives, including:<\/p><ul><li>Adding Slack conversations and wikis that mention the dataset in question as context to the documentation generation prompt<\/li><li>Creating an LLM-based evaluator that can rate documents and give improvement suggestions \u2013 something that can also be applied to human-created documents<\/li><li>Experimenting with novel frameworks like\u00a0<a href=\"https:\/\/www.promptingguide.ai\/techniques\/reflexion\" target=\"_blank\" rel=\"nofollow noopener\">Reflexion<\/a><\/li><\/ul><h5>Data discovery co-pilot<\/h5><p>Asking busy staff to digest extensive documentation for multiple datasets every time they have a new data requirement is a tall order. LLMs, however, can process large documents in seconds.<\/p><p>With a\u00a0<a href=\"https:\/\/www.promptingguide.ai\/techniques\/rag\" target=\"_blank\" rel=\"nofollow noopener\">retrieval-augmented generation (RAG) system<\/a>\u00a0in place, LLMs can even parse every document a company has to offer. Moreover, they can provide conversational interfaces for discovering datasets, making you feel like you\u2019re reaching out to a colleague who can help you with any question.<\/p><p>Thus, with good documentation in place, we set out to create this conversational data discovery experience for our users. 
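A toy sketch of the retrieval step behind such a RAG co-pilot: embed the user's question, rank dataset documents by similarity, and ground the LLM's answer in only the top matches. The embedding function, client, and prompt are assumptions for illustration, not Grab's production stack:

```python
# Minimal RAG loop for dataset discovery. Real systems would use a
# vector database and a proper embedding model; this shows the shape.
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_docs(query_vec, doc_index, k=3):
    """doc_index: list of (doc_text, embedding) pairs.
    Returns the k documents most similar to the query."""
    ranked = sorted(doc_index, key=lambda d: cosine(query_vec, d[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]


def answer(client, embed, question, doc_index, model="gpt-4o-mini"):
    """Retrieve the most relevant dataset docs, then ask the LLM to
    answer strictly from that retrieved context."""
    context = "\n---\n".join(top_k_docs(embed(question), doc_index))
    prompt = ("Using only the dataset documentation below, point the "
              f"user to the right dataset.\n{context}\n\n"
              f"Question: {question}")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```

Because the model is instructed to answer only from retrieved documentation, the quality of the co-pilot's answers tracks the quality of the generated docs, which is why the documentation generator above comes first.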
Today, staff at Grab can use this data discovery co-pilot either via our data search tool or via Slack.<\/p><p>We are now working on integrating this co-pilot into the various Slack channels where staff usually seek assistance from colleagues in charge of data. The co-pilot aims to help answer queries without human involvement.<\/p><p>When a question can\u2019t be answered using existing documentation and an employee needs to step in, our co-pilot allows the employee to regenerate relevant documents with just a click of a button, using their entire conversation with the querying colleague as context.<\/p><h5>Data = found<\/h5><p>LLM-powered data discovery has made it much easier to find data within Grab. Staff can now find datasets faster than ever before, making the firm more data-driven and increasing cross-domain data usage.<\/p><p>It is worth pointing out, however, that Grab-level resources aren\u2019t a prerequisite for success. You can start small and iterate by experimenting with documentation generation for a subset of your most important datasets.<\/p><p>You can then leverage open-source LLMs to keep costs down and focus your efforts on the datasets and domains that are most critical to your business. Finally, you can engage staff who produce data and those who use it early in the process to ensure you aren\u2019t wasting effort and are only building what your company needs.<\/p><p>These principles and approaches can be adapted to fit any data-driven company, no matter the size or industry, with the smart use of LLMs. 
Your employees \u2013 and your bottom line \u2013 will thank you for taking this step.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-e5e249f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e5e249f\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7b5e08b\" data-id=\"7b5e08b\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap\">\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"parent":180237,"menu_order":0,"template":"grab21-default","acf":[],"_links":{"self":[{"href":"https:\/\/www.grab.com\/sg\/wp-json\/wp\/v2\/editorial\/225293"}],"collection":[{"href":"https:\/\/www.grab.com\/sg\/wp-json\/wp\/v2\/editorial"}],"about":[{"href":"https:\/\/www.grab.com\/sg\/wp-json\/wp\/v2\/types\/editorial"}],"version-history":[{"count":27,"href":"https:\/\/www.grab.com\/sg\/wp-json\/wp\/v2\/editorial\/225293\/revisions"}],"predecessor-version":[{"id":225470,"href":"https:\/\/www.grab.com\/sg\/wp-json\/wp\/v2\/editorial\/225293\/revisions\/225470"}],"up":[{"embeddable":true,"href":"https:\/\/www.grab.com\/sg\/wp-json\/wp\/v2\/editorial\/180237"}],"wp:attachment":[{"href":"https:\/\/www.grab.com\/sg\/wp-json\/wp\/v2\/media?parent=225293"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}