
[arxiv] KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs

Source

Authors: Shangshang Zheng, He Bai, Yizhe Zhang, Yi Su, Xiaochuan Niu, Navdeep Jaitly

Abstract: Large Language Models (LLMs) might hallucinate facts, while curated Knowledge Graphs (KGs) are typically factually reliable, especially with domain-specific knowledge. Measuring the alignment between KGs and LLMs can effectively probe the factualness and identify the knowledge blind spots of LLMs. However, verifying LLMs over extensive KGs can be expensive. In this paper, we present KGLens, a Thompson-sampling-inspired framework aimed at effectively and efficiently measuring the alignment between KGs and LLMs. KGLens features a graph-guided question generator for converting KGs into natural language, along with a carefully designed importance sampling strategy based on a parameterized KG structure to expedite KG traversal. Our simulation experiment compares the brute force method with KGLens under six different sampling methods, demonstrating that our approach achieves superior probing efficiency. Leveraging KGLens, we conducted in-depth analyses of the factual accuracy of ten LLMs across three large domain-specific KGs from Wikidata, comprising over 19K edges, 700 relations, and 21K entities. Human evaluation results indicate that KGLens can assess LLMs with a level of accuracy nearly equivalent to that of human annotators, achieving a 95.7% accuracy rate.

URL: https://arxiv.org/abs/2312.11539#

Conversation

▲ Please explain the paper to a reader with no relevant technical or research background; be clear and concise.

This paper explores how to effectively test the factual knowledge of large language models (LLMs), which are powerful AI systems that can generate human-like text. LLMs are great at writing, translating, and answering questions, but they can sometimes make up facts or get things wrong.

The authors propose a new method called KGLens to test LLMs using knowledge graphs (KGs). KGs are like giant databases that store facts about the world in a structured way. Think of them as a network of interconnected facts, where each fact is a relationship between two things.

KGLens works by converting facts from a KG into natural language questions and asking the LLM to answer them. The authors use a clever technique called Thompson sampling to focus on the facts that the LLM is most likely to get wrong, making the testing process more efficient.

The paper shows that KGLens can accurately assess the factual knowledge of LLMs, identifying their strengths and weaknesses. It also demonstrates that KGLens can be used to analyze the knowledge of LLMs across different domains, such as movies, sports, and countries.

The authors believe that KGLens is a valuable tool for researchers and developers who want to build more reliable and factual AI systems. By understanding the limitations of LLMs, we can work towards creating AI that is more trustworthy and helpful.

Data Source Content

# KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs

Shangshang Zheng, He Bai*, Yizhe Zhang, Yi Su,
Xiaochuan Niu, Navdeep Jaitly
Apple
{dan.z, hbai22, yizhe_zhang, yi_su, xniu, njaitly}@apple.com
*Equal contribution.

Abstract

Large Language Models (LLMs) might hallucinate facts, while curated Knowledge Graphs (KGs) are typically factually reliable, especially with domain-specific knowledge. Measuring the alignment between KGs and LLMs can effectively probe the factualness and identify the knowledge blind spots of LLMs. However, verifying LLMs over extensive KGs can be expensive. In this paper, we present KGLens, a Thompson-sampling-inspired framework aimed at effectively and efficiently measuring the alignment between KGs and LLMs. KGLens features a graph-guided question generator for converting KGs into natural language, along with a carefully designed importance sampling strategy based on a parameterized KG structure to expedite KG traversal. Our simulation experiment compares the brute force method with KGLens under six different sampling methods, demonstrating that our approach achieves superior probing efficiency. Leveraging KGLens, we conducted in-depth analyses of the factual accuracy of ten LLMs across three large domain-specific KGs from Wikidata, comprising over 19K edges, 700 relations, and 21K entities. Human evaluation results indicate that KGLens can assess LLMs with a level of accuracy nearly equivalent to that of human annotators, achieving a 95.7% accuracy rate.

1 Introduction

The factualness of Large Language Models (LLMs) is crucial for their reliability and utility in various applications. Nonetheless, studies have shown that LLMs can produce information that is nonfactual, hallucinated, or outdated Perez et al. (2022); Ji et al. (2023); Lee et al. (2022); Wang et al. (2021).

To evaluate the factualness of LLMs, fact-checking Thorne et al. (2018); Augenstein et al. (2023) and fact-QA Petroni et al. (2020); Press et al. (2022); Dhingra et al. (2022) approaches are commonly used, but several challenges persist. For fact-checking, distinguishing faithful from unfaithful statements is different from evaluating the factualness of the generation. For fact-QA, scaling up the evaluation data is challenging due to the expensive nature of the annotation process. For both approaches, it is hard to exclude their data from the web-crawled pretraining corpus Deng et al. (2023) to ensure the fairness of the evaluation. Last but not least, LLMs may respond differently to the same fact when the query is phrased in different forms, a challenge that existing fact-checking and fact-QA datasets are not equipped to handle.

In contrast, probing LLMs knowledge by transforming knowledge graph (KG) data into natural language questions addresses these limitations. First, KGs are inherently scalable, factually reliable, and tailored to specific domains. This scalability allows for extensive and efficient evaluation of LLMs across various domains. Second, question generation (QG) from KGs can be automated, enabling the rapid and large-scale creation of evaluation datasets. Moreover, generating various questions from the same set of facts allows us to assess the robustness of LLMs to different phrasings.

However, there are several challenges for QG-based KG evaluation. The first is efficiency. Existing knowledge probing methods Dong et al. (2023); Wang et al. (2023), which mainly use a brute-force approach that exhaustively evaluates all KG edges, are computationally expensive, time-consuming, and non-scalable. The second challenge concerns the transformation of ambiguous KG triplets into natural language queries. Given a subject and predicate, the correct object may not be unique, so the answer to the question is ambiguous. Furthermore, some subjects may lack specificity, resulting in confusing questions. Examples illustrating these issues are shown in Table 1. In these cases, both text-cloze-based methods Petroni et al. (2019); Jiang et al. (2020); Dong et al. (2023) and prompt-based QG methods Wang et al. (2023) fail to evaluate factualness properly.

In this study, we present a novel knowledge probing framework named KGLens (Fig. 1), to measure the knowledge alignment between KGs and LLMs, and to identify LLM’s knowledge blind spots. To efficiently probe the LLM, KGLens features a Thompson sampling inspired method to rapidly identify knowledge blind spots. More specifically, we first introduce a parameterized knowledge graph (PKG), where each KG edge is augmented with a beta distribution, serving as an indicator of the LLM’s deficiency on that specific edge. We apply Thompson sampling to select the edges based on edge deficiency and then evaluate LLMs with the sampled edges, update the PKG with the evaluation results, and iterate this process until the running metrics converge. Our simulation experiments show that our sampling method with PKG is more efficient than random sampling and brute-force methods.

To accurately probe the LLM, KGLens features a graph-guided question generator for converting KGs into natural language with GPT-4 OpenAI (2023). We design two types of questions (fact-checking and fact-QA) to reduce the ambiguity of the expected answer, where the question type is controlled by the graph structure. We also include the entity aliases during the question generation to provide additional context and reduce the subject ambiguity. Human evaluation results show that 97.7% of our generated questions are sensible to human annotators.

To assess the whole framework of KGLens, we probed 10 popular LLMs with three domain-specific KGs collected from Wikidata, encompassing over 700 relations and 21K entities. We introduce three evaluation metrics to measure the knowledge alignment between an LLM and a KG: zero-sense rate (percentage of facts that an LLM never answered correctly in all rounds), all-sense rate (percentage of facts that an LLM always answered correctly in all rounds), and win rate (percentage of facts that an LLM answered correctly in more than half of the rounds). Human evaluation shows that KGLens can assess LLMs with a level of accuracy nearly equivalent to that of human annotators, with a 95.7% accuracy rate. Our key contributions are as follows:

1. We introduce a novel, efficient knowledge probing framework that identifies LLMs’ knowledge blind spots across diverse topics and relationships.

2. Our QG strategy leverages KG structure to reduce ambiguity, thus providing accurate and effective LLM evaluation, surpassing existing methods that focus solely on atomic triplets.

3. Human evaluation verifies our framework’s efficacy, with human-level accuracy (95.7%) in assessing LLMs’ knowledge, nearly matching human annotator performance.

4. We propose three metrics to quantify knowledge alignment between LLMs and KGs, successfully identifying knowledge gaps and reliable facts.

5. Our contributions advance the development of more reliable and factual AI systems, promoting trustworthy user experiences and efficient model improvement. We will open-source our framework and datasets.

2 Method

In probing the knowledge of LLMs, we face a unique challenge: while we have access to a vast corpus of KG edges, we lack insight into how well the LLM understands each of them. This scenario bears a striking resemblance to the multi-armed bandit problem in reinforcement learning, where we must balance exploration and exploitation to maximize our understanding of the LLM’s knowledge gaps.

Our approach, which we call KGLens, aims to efficiently identify the edges where the LLM’s knowledge is weakest or most limited. This allows us to focus our efforts on the areas where the LLM needs the most improvement. The framework of KGLens is illustrated in Fig. 1, with its workflow summarized in the caption. Key aspects of our method include Thompson Sampling Inspired Parameterized KG, Graph-Guided Question Generator, and Answer Verification.

In the following subsections, we will delve into each of these components in detail, explaining how they work together to create an effective knowledge probing system for LLMs.

2.1 Parameterized Knowledge Graph

A knowledge graph $\mathcal{G}$ is a set of triplets $\{(s_j, p_j, o_j)\}_{j=1\cdots i}$, where each tuple describes a relationship (predicate) $p_j$ between a subject $s_j$ and an object $o_j$.

Intuitively, if an LLM failed in answering a question, there is a higher chance that the LLM also lacks knowledge of the related topics. To reflect this intuition, we propose a parameterized KG (PKG), by augmenting each edge $(s_j, p_j, o_j)$ with an additional error probability $\theta_j$ reflecting the probability that an LLM may fail on this edge. We use a Beta distribution to model $\theta$:

$$\theta_j \sim Beta(\alpha_j, \beta_j), \qquad (1)$$

where $\alpha$ and $\beta$ can be interpreted as the number of times the targeted LLM failed or succeeded in answering the question. The prior of each $\theta_j$ is set to $Beta(1, 1)$.

The estimation of the posterior $\{\alpha_j, \beta_j\}_{\forall j}$ is done in an iterative manner based on the outcome from the LLM. Each iteration consists of two stages: edge sampling and parameter updating.

Edge sampling

The edge sampling process favors the edges with larger $\theta$ values. In each iteration, we sample the top-n edges ranked by $\theta$ drawn from the Beta distributions of the PKG, and then send these edges to the LLM for examination and verification. The signal regarding the correctness of the LLM’s response is collected for each edge.
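As a minimal sketch of this sampling stage (assuming the PKG is kept as a Python dict from edge to its (alpha, beta) counts; the function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def sample_edges(pkg: dict, n: int, rng: np.random.Generator) -> list:
    """Thompson-sampling edge selection: draw one theta per edge from its
    Beta(alpha, beta) posterior and keep the n edges with the largest draws,
    i.e. the edges the LLM is currently believed most likely to fail on."""
    edges = list(pkg.keys())
    alphas = np.array([pkg[e][0] for e in edges], dtype=float)
    betas = np.array([pkg[e][1] for e in edges], dtype=float)
    thetas = rng.beta(alphas, betas)        # one sampled error probability per edge
    top = np.argsort(-thetas)[:n]           # rank by sampled theta, take the top-n
    return [edges[i] for i in top]

# Every edge starts from the uninformative Beta(1, 1) prior.
pkg = {("Q65", "P17", "Q30"): (1, 1), ("Q65", "P1082", "3.9M"): (1, 1)}
print(sample_edges(pkg, n=1, rng=np.random.default_rng(0)))
```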

Parameter estimation and updating

For each edge, $\alpha$ and $\beta$ are updated based on the new observation of whether the response from the LLM is correct, following the standard Beta distribution posterior updates. In addition, we also propagate the signal to the neighboring edges to account for the high correlation in error probability among connected edges. To optimize the computational process, signal propagation is restricted to one degree. Specifically,

$$\alpha_j = \alpha_j + \mathbb{I}(\text{response is incorrect}) + M_j, \qquad (2)$$

$$\beta_j = \beta_j + \mathbb{I}(\text{response is correct}) + N_j, \qquad (3)$$

where $M_j = \lvert\text{incorrect neighborhood edges}\rvert$ and $N_j = \lvert\text{correct neighborhood edges}\rvert$. An updated PKG (Fig. 2) is then obtained by repeating the edge sampling and parameter updating process iteratively until the running metrics (3.2) converge.
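One possible reading of this update stage, Eqs. (2)-(3), in code (a sketch only: the adjacency map, the batching, and the exact propagation bookkeeping are assumptions, not the authors' implementation):

```python
def update_pkg(pkg: dict, adjacency: dict, results: dict) -> None:
    """Posterior update with one-degree signal propagation.
    `results` maps each probed edge to a bool (True = LLM answered correctly);
    `adjacency` maps an edge to the edges sharing an entity with it.
    For edge j: alpha_j += own failure + M_j, beta_j += own success + N_j,
    where M_j / N_j count incorrectly / correctly answered neighbouring edges."""
    for edge, (alpha, beta) in list(pkg.items()):
        own_fail = int(edge in results and not results[edge])
        own_succ = int(edge in results and results[edge])
        neighbours = adjacency.get(edge, [])
        m = sum(1 for nb in neighbours if nb in results and not results[nb])
        n = sum(1 for nb in neighbours if nb in results and results[nb])
        pkg[edge] = (alpha + own_fail + m, beta + own_succ + n)
```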

2.2 Graph-guided Question Generation

We use GPT-4 to transform the sampled edge $K_i$ into natural questions with few-shot in-context learning. The prompts and demonstrations are shown in Appendix A.6. We design two types of questions for KGLens: Yes/No Questions (judgement) and Wh-Questions (generative), where the question type is controlled by the graph structure (out degree). In addition, to reduce the ambiguity of entities, we provide entity aliases for question generation.

2.2.1 Yes/No Questions

Each KG edge can be transformed into a question by asking whether the subject’s relation is the object. But in this way, the answer would always be Yes for all edges. To formulate hard negative examples, we build a ground truth answer set $\mathbf{T_j}$ for each $(s_j, p_j)$ and a candidate answer set $\mathbf{C_j}$ for each $p_j$. Both $\mathbf{T_j}$ and $\mathbf{C_j}$ are derived from the full Wikidata knowledge graph to ensure completeness. Then, for a tuple $(s_j, p_j, o_j)$, we use $o_j$ to constitute the Yes question, and sample a random $o_x$ from $\mathbf{C_j} - \mathbf{T_j}$ to formulate the No question. Since our QG process is on-the-fly during the evaluation, KGLens can formulate different QA pairs for the same fact. The sampling rate between Yes and No questions is evenly split, with a 50-50 distribution.
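A sketch of this negative sampling (the set names mirror the notation above; everything else is illustrative, not the authors' code):

```python
import random

def make_yes_no_probe(s, p, o, truth_sets, candidate_sets, rng=random.Random(0)):
    """Build a Yes/No probe for edge (s, p, o): half the time keep the true object
    (expected answer "Yes"), otherwise draw a hard negative o_x from C_p - T_(s,p)
    (expected answer "No"). Both sets are assumed to be built from full Wikidata."""
    if rng.random() < 0.5:
        return (s, p, o), "Yes"
    negatives = sorted(candidate_sets[p] - truth_sets[(s, p)])
    return (s, p, rng.choice(negatives)), "No"
```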

2.2.2 Wh-Questions

Another type of question queries the LLMs to generate the object(s) given the subject and the predicate, where the questions usually begin with when/where/who/what. While this question type presents greater difficulty, it is not universally applicable. Wh-questions may yield hundreds of correct objects, rendering exhaustive enumeration impractical and uninformative.

In KGLens, we opt to generate Wh-Questions only when the out degree of an entity is less than 10. Otherwise, the Yes/No Questions prompt is adopted.
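Putting the two question types together, the routing rule is purely structural; a sketch (reusing make_yes_no_probe from the previous snippet, with illustrative names):

```python
def build_probe(s, p, o, out_degree, truth_sets, candidate_sets, rng):
    """Graph-guided question-type selection: Wh-questions only when the subject's
    out degree is below 10, otherwise fall back to a Yes/No probe."""
    if out_degree < 10:
        # Wh-question: the LLM must produce the object(s) for (s, p);
        # all objects in T_(s,p) (plus aliases) count as correct.
        return {"type": "wh", "subject": s, "predicate": p,
                "answers": truth_sets[(s, p)]}
    edge, label = make_yes_no_probe(s, p, o, truth_sets, candidate_sets, rng)
    return {"type": "yes_no", "edge": edge, "answer": label}
```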

2.3 QAV: Question Answering Verification

Our QA testing involved two difficulty levels: EASY, comprising solely Yes/No questions; and HARD, with each question type generated at a 50% chance. Few-shot in-context learning was employed to test the LLMs.

To verify the response, we guide the LLMs to generate either “Yes” or “No” at the beginning of the response for Yes/No Questions and subsequently generate accompanying explanations. This approach facilitates a straightforward verification process by examining the correspondence of the initial word. For Wh-Questions, we instruct the LLM to list all the correct answers. In this case, the assessment of the answer cannot be done by string matching. Therefore, we employ a GPT-4 model to check the correctness of a response given the question, the ground truth objects and their aliases. The prompts are listed in Appendix A.6.
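A sketch of the verification logic for Yes/No questions (prefix matching only; the GPT-4 judge used for Wh-Questions is a model call and is not reproduced here):

```python
def verify_yes_no(response: str, expected: str) -> bool:
    """Check whether the LLM's reply starts with the expected "Yes" or "No";
    the prompt instructs the model to put that word first, so a prefix check suffices."""
    words = response.strip().split()
    if not words:
        return False
    first = words[0].strip(".,:;!").lower()
    return first == expected.lower()

assert verify_yes_no("Yes, Paris is the capital of France.", "Yes")
assert not verify_yes_no("No, that is incorrect.", "Yes")
```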

2.4 Evaluation Efficiency Study

In this section, we conduct a simulation experiment to evaluate the efficiency of different methods in identifying the most challenging edges in a domain-specific KG (the NBA KG in Table 2): Thompson sampling, Epsilon Greedy, Monte Carlo, and the straightforward iteration method (Brute Force), which iterates over all edges multiple times. We first estimate the ground truth $\theta$ by iterating over all edges 20 times using the Brute-Force method. The top-k most difficult edges can then be identified by ranking the estimated $\theta$. Since we are interested in the most challenging edges, the Mean Square Error (MSE) over these top-k edges is computed between the estimated $\theta$ and the ground truth $\theta$ for each method.

In Fig. 3, we plot each method’s MSE for the top-200 and top-600 most challenging edges against the number of API requests. The vertical epoch lines represent the cumulative number of API requests needed to complete a full iteration over the PKG edges using the Brute-Force method. From this figure, we observe that Thompson sampling demonstrates superiority in pinpointing the most challenging edges. Notably, both Thompson sampling and Epsilon-Greedy exhibit advantages over the Monte Carlo and Brute-Force methods by striking an effective balance between exploration and exploitation. For example, from Fig. 3, we can see that Thompson sampling with propagation achieves the accuracy of a 4.5-epoch Brute-Force run while utilizing only 65% of the compute resources. This finding underscores the potential of our approach to swiftly identify critical edges and optimize resource allocation in large-scale KG applications. In addition, we find the signal propagation mechanism further enhances these methods, enabling them to reach better accuracy more rapidly.
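The simulation metric is straightforward to reproduce; a sketch, assuming the $\theta$ estimates are stored as NumPy arrays aligned by edge index:

```python
import numpy as np

def topk_mse(theta_true: np.ndarray, theta_est: np.ndarray, k: int) -> float:
    """MSE restricted to the k edges with the largest ground-truth error probability,
    i.e. how well a sampling method has pinned down the most challenging edges."""
    topk = np.argsort(-theta_true)[:k]
    return float(np.mean((theta_est[topk] - theta_true[topk]) ** 2))
```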

3 Experiments

In this paper, we develop three domain-specific KGs using Wikidata to evaluate the knowledge accuracy and reliability of two widely used LLM APIs (GPT-3.5-turbo and GPT-4), two legacy LLMs (Babbage-002 and Davinci-002), and three variations of GPT-4 (GPT-4-1106-preview, GPT-4-turbo, GPT-4o). We also evaluate three open-source LLMs: Vicuna-33b-v1.3 Chiang et al. (2023), Xwin-LM-13B-V0.2 Team (2023), and Yi-34B-Chat (https://www.01.ai). Web browsing was unavailable for OpenAI API calls at the time the experiments were conducted.

3.1 Dataset

We prepare three test KGs with the Wikidata Query Web Service (https://query.wikidata.org) on three topics: country, NBA, and movie. The country KG includes knowledge about 16 countries. The NBA KG contains knowledge related to the 30 NBA teams. The movies are sampled from films released after 2015.

The statistics of our KGs are shown in Tab. 2. The term “dead edges” refers to edges that are less intriguing to inquire about but are still crucial for displaying entity relations. For example, certain predicates such as “member of” and “domestic relation” exemplify links between entities, but they are less captivating to inquire about and are too prevalent. Conversely, significant and meaningful edges are referred to as “active edges”, and we use them to generate questions. More details of KG construction are provided in Appendix A.2.

3.2 Metrics

To measure the alignment between KGs and LLMs, here we introduce three edge-level metrics, where each edge is evaluated multiple times with $m$ successes and $n$ failures.

Win rate. For each edge, the LLM wins if the number of successes surpasses the number of failures, namely $m > n$. The win rate signifies the portion of winning edges out of all the examined edges.

Zero-sense rate. An LLM has zero-sense about an edge (fact) if the model has never answered the edge correctly, namely $m = 0$. The zero-sense rate signifies the portion of edges with zero-sense.

All-sense rate. An LLM has all-sense about an edge (fact) if the model has never failed to answer the edge, namely $n = 0$. The all-sense rate signifies the portion of edges with all-sense.

Based on the definitions above, the win rate is the portion of edges that an LLM has a higher chance of answering correctly, indicating the reliability of the LLM. The zero-sense rate is the portion of edges that an LLM always fails to answer. The all-sense rate is the portion of edges that an LLM always succeeds in answering.
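Given per-edge success/failure tallies, the three metrics reduce to simple counts; a sketch:

```python
def alignment_metrics(counts):
    """Compute win rate, zero-sense rate, and all-sense rate from a list of
    per-edge (m, n) pairs, where m = successes and n = failures over all rounds."""
    total = len(counts)
    return {
        "win_rate": sum(m > n for m, n in counts) / total,
        "zero_sense_rate": sum(m == 0 for m, n in counts) / total,
        "all_sense_rate": sum(n == 0 for m, n in counts) / total,
    }

print(alignment_metrics([(3, 1), (0, 4), (2, 0)]))
# {'win_rate': 0.666..., 'zero_sense_rate': 0.333..., 'all_sense_rate': 0.333...}
```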

3.3 Main Results

To evaluate the KGs, we run KGLens across LLMs for 60 iterations with a batch size of 64 for each graph. Tab. 3 shows the win rate results. We put the zero-sense rate and all-sense rate results in Appendix Tab. 6 and 7, and visualize five models’ results in Fig. 4.

Across varying difficulty levels, knowledge graphs, and tested models, the GPT-4 family consistently outperforms the others in both metrics. The performance across GPT-4 variations is close: GPT-4, GPT-4o, and GPT-4-turbo exhibit the most comparable performance, followed by GPT-4-1106-preview. We find GPT-4o exhibits a more cautious disposition in disclosing personal information; it frequently refrains from providing specific details about individuals, which may be correlated with its performance disparity on the NBA KG, where GPT-4o underperforms GPT-4. Further investigation is warranted. We also find the gap between GPT-3.5-turbo and GPT-4 is relatively large across all domains and all difficulty levels, and GPT-3.5-turbo is even worse than the legacy LLMs under NBA KG EASY mode. Upon investigating the evaluation logs, GPT-3.5 exhibits a conservative approach, abstaining from generating answers when lacking confidence rather than providing speculative responses. Responses following this protocol consistently begin with phrases such as “I am sorry, but I couldn’t find any information on/about…” or “I’m sorry, but as an AI assistant, I do not have the capability to provide real-time information …”. In such cases, the edge is marked as failed when the model declines to answer a question. We also observed such behavior in Yi-34B-Chat and Vicuna-33b-v1.3.

Lastly, we find the two legacy models exhibit comparable performance across evaluations. The random-guessing baseline for the win rate is 50% for EASY evaluation and 25% for HARD evaluation. We find Babbage-002 and Davinci-002 perform only slightly better than random guessing, clearly showing the gap between the legacy LLMs and the recent LLMs. We also provide examples of different error types in Tab. 5.

3.4 Results Analysis by Edge Attributes

Another advantage of evaluating LLMs with a KG is that the results can be aggregated by different edge attributes. In this section, we show that KGLens can support two different evaluation focuses: temporal groups and entity groups.

3.4.1 Temporal Analysis of Results

We conduct temporal analysis with the movie KG. We group the results by movie release year in Fig. 5; full results are in Fig. 10 in the Appendix. From Fig. 5, we observe that both GPT-3.5 and GPT-4-1106 perform worse on questions about movies released after 2020, which is reasonable as they were mainly pretrained on data from before September 2021. On the other hand, we find that GPT-4o significantly outperforms the other models in terms of zero-sense rate and win rate. In Fig. 10, we find all models get worse when evaluated in HARD mode, but GPT-3.5 is more robust. This is because a big portion of GPT-3.5’s failures are caused by refusing to answer the questions instead of providing wrong answers, which explains its results in EASY and HARD testing. Finally, it should be noted that it is reasonable that the rankings in Fig. 10 are not strictly aligned with the years, as the temporal difference is not the only factor that affects the evaluation results.

3.4.2 Entity Group Analysis

We also show results where we group the Country KG edges by entity type in Fig. 8 in the appendix.

The proficiency levels across countries can be visualized using a color-coded table, where a darker color signifies a higher zero-sense rate and thus a lower level of proficiency. Taking GPT-4 evaluated against the country KG at the HARD difficulty level as an example, Austria, Mexico, and Italy are the best-recognized countries, ranked 1, 2, and 3 respectively. In contrast, countries such as Canada, the Philippines, and the United Kingdom are positioned at the lower end of the ranking scale.

The rationale behind the ranking can be elucidated by examining the dotted heatmap in the appendix (Fig. 6). In this figure, the size of each dot corresponds to the number of edges within the predicate sub-group, normalized by the total number of edges in the entire group. Additionally, the color of each dot serves as an indicator of the knowledge proficiency associated with the predicate sub-group for the respective country. Contrary to the table color theme, a darker color here indicates a lower zero-sense rate and thus a higher level of proficiency.

We find KGLens can identify where the errors come from for each country group. Concentrating on Austria and Canada, which represent the highest- and lowest-ranked countries respectively, it becomes evident that GPT-4 exhibits enhanced proficiency for specific predicate sub-groups. Notably, these sub-groups include “located in time zone”, “located in the administrative territorial entity”, “electrical plug type”, “emergency phone number”, and “head of state”.

3.5 Human Evaluation

We conduct a human evaluation to verify the question generation module and the question answering verification module of KGLens. A random sample of 300 instances was obtained (100 per domain, 50 per question type), and human annotations were acquired through five rounds of rating. The assessment was conducted instance by instance, where the annotators evaluated two aspects (QG and QAV): first, the clarity of the generated question’s intent, and second, the correctness of the LLM’s response in relation to the ground truth answer and its synonymous expressions. The second objective is to verify whether the annotators’ judgement agrees with KGLens’s judgement, and it is only conducted for Wh-Questions, as there is no need for humans to verify Yes/No questions. After collecting the ratings, a majority voting mechanism was employed for each instance, wherein a label was assigned as "True" if at least three annotators concurred on the evaluation criterion. The evaluation results are presented in Tab. 4: KGLens demonstrates robust performance in human evaluation across domains, with 96% accuracy in QAV and 98% accuracy in QG. We also report the overall accuracy of KGLens. For this evaluation, we define an instance as correct when two conditions are met: the generated question is marked as correct by humans, and the QA correctness judged by KGLens aligns with human judgment. Our findings show a 95.7% accuracy rate for KGLens, indicating its ability to approximate human-level performance.

4 Related Work

It’s an established fact that pre-trained models have the ability to learn and retain knowledge. For example, Petroni et al. (2019) discovered that BERT Devlin et al. (2018), even without finetuning, harbors relational knowledge comparable to traditional NLP methods. With LLMs showcasing superior in-context learning and knowledge retention, evaluating their knowledge becomes pivotal to bolster performance and mitigate hallucination.

The knowledge assessment often tests the model with specific knowledge-related datasets Lewis et al. (2021); Petroni et al. (2020); Roberts et al. (2020); Peng et al. (2023); Press et al. (2022); Mallen et al. (2023). However, given the fact that LLMs are trained on web-crawled corpora and the data is constantly evolving, it is hard to exclude the test examples from the pretraining corpus. For example, Deng et al. (2023) use fill-in probing and multi-choice probing to check the data leakage of pretrained LLMs. Their results show that GPT-3.5-turbo exhibited a noteworthy ability to guess the missing option. Another concern is that the knowledge is dynamic, and the evaluation datasets remain fixed, which makes it challenging to evaluate the LLMs’ knowledge accurately. Dhingra et al. (2022) propose a diagnostic dataset that pairs the text and timestamp together and jointly models text and time. However, their dataset is static and designed for 2010 to 2020, which is not suitable for evaluating the LLMs’ knowledge in the future. Finally, the predominant metric employed by these datasets revolves around the test set accuracy, making it challenging to identify solutions for enhancing the LLM and reducing the hallucination.

On the other hand, knowledge graphs have the advantages of customization to specific domains, evolving knowledge, and reduced potential for test set leakage. They have been employed as a structured knowledge source for LLMs Lin et al. (2019); Agarwal et al. (2020); Rosset et al. (2020) and also as a tool to probe knowledge in LLMs. LAMA Petroni et al. (2019) is the first work to probe a pretrained model with KGs, where they use the KG to generate cloze statements and evaluate the LM’s knowledge with accuracy. However, a cloze statement is not a natural question, and the correct answer is not unique in many cases, making the evaluation inaccurate. LPAQA Jiang et al. (2020) proposes to mine the relation words from the web for each subject-object pair, which is impractical for large knowledge graphs. In addition, these methods mainly focus on accuracy but neglect that LLMs may respond differently to the same fact, so reliability should also be considered. KaRR Dong et al. (2023) proposes to solve this issue by using multiple prompts for each KG edge and using the output logits of LLMs to measure knowledge reliability. However, KaRR is inefficient for large graphs, and it is not generalizable because LLMs’ output logits may be unavailable. Moreover, transforming KG triplets into questions is more natural than the text cloze task, but previous works mainly adopt the cloze task for simplicity. Finally, to the best of our knowledge, there is no existing work that visualizes an LLM’s knowledge with a KG (Fig. 2).

5 Conclusion

In this work, we introduced KGLens, a novel and efficient method tailored for visualizing and evaluating the factual knowledge embedded in LLMs. Our proposed Thompson-sampling-inspired framework with a parameterized KG offers a more efficient way to reveal an LLM’s knowledge blind spots than existing brute-force iteration methods. By evaluating various LLMs with our developed domain-specific KGs, we show KGLens provides adaptable and customizable views of an LLM’s knowledge. Human evaluation results indicate that KGLens can assess LLMs with a level of accuracy nearly equivalent to that of human annotators, achieving a 95.7% accuracy rate. Furthermore, our tool KGLens, together with our assessment KGs sourced from Wikidata, will be available to the research community, fostering collaboration and serving as a valuable resource for future investigations into language models. For businesses employing LLMs, our contributions facilitate the development of more reliable and factual AI systems, fostering trustworthy user experiences and efficient processes for improving model knowledge.

6 Limitation

KG plays a pivotal role in our approach, and its quality significantly impacts the effectiveness of this method. A high-quality KG is essential not only for the Question Generation step to generate meaningful questions but also for signal propagation. If the KG is fragmented and scattered, signal propagation then becomes less beneficial.

While our current method incorporates counting updates for alpha and beta, we acknowledge the potential for improvement. Exploring alternative methods for updating these parameters is an area of active research for us.

The signal propagation method is another direction we could dive into: instead of propagating only to neighbouring edges, should we also propagate to more distant edges? Instead of updating the neighbouring edges equally, should we decay the signal?

Question generation is currently limited to just one hop; being able to generate complicated questions that involve multiple edge hops would enable our method to evaluate the model not only on factual knowledge retrieval, but also on complex reasoning capability.

7 Ethical Considerations

We foresee no ethical issues originating from this work.

References

  • Agarwal et al. (2020) Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2020. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. arXiv preprint arXiv:2010.12688.
  • Augenstein et al. (2023) Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy, et al. 2023. Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
  • Deng et al. (2023) Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2023. Benchmark probing: Investigating data leakage in large language models. In NeurIPS 2023 Workshop on Backdoors in Deep Learning - The Good, the Bad, and the Ugly.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dhingra et al. (2022) Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W. Cohen. 2022. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273.
  • Dong et al. (2023) Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Zhifang Sui, and Lei Li. 2023. Statistical knowledge assessment for large language models. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  • Jiang et al. (2020) Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
  • Lee et al. (2022) Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Factuality enhanced language models for open-ended text generation. Advances in Neural Information Processing Systems, 35:34586–34599.
  • Lewis et al. (2021) Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021. PAQ: 65 million probably-asked questions and what you can do with them.
  • Lin et al. (2019) Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. arXiv preprint arXiv:1909.02151.
  • Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. ArXiv, abs/2303.08774.
  • Peng et al. (2023) Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813.
  • Perez et al. (2022) Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
  • Petroni et al. (2020) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: A benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
  • Press et al. (2022) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
  • Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.
  • Rosset et al. (2020) Corby Rosset, Chenyan Xiong, Minh Phan, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020. Knowledge-aware language model pretraining. arXiv preprint arXiv:2007.00655.
  • Team (2023) Xwin-LM Team. 2023. Xwin-LM.
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.
  • Wang et al. (2021) Cunxiang Wang, Pai Liu, and Yue Zhang. 2021. Can generative pre-trained language models serve as knowledge bases for closed-book QA? arXiv preprint arXiv:2106.01561.
  • Wang et al. (2023) Weixuan Wang, Barry Haddow, Alexandra Birch, and Wei Peng. 2023. Assessing the reliability of large language model knowledge. arXiv preprint arXiv:2310.09820.

Appendix A

A.1 Cost Analysis

Here we highlight that the cost of GPT-4 is not counted by the number of queries but by the number of tokens. After doing a cost analysis, we conclude that GPT-4 costs only about $20 per graph, which is acceptable.

GPT-4 in KGLens is only used to construct the questions and verify the LLM's answers, and all generation is based on knowledge graph triplets. For Yes/No questions, we simply use string matching to verify the answer. Our QG prompt using GPT-4 is around 60-100 input tokens plus 10-30 output tokens, and answer verification takes about 60 input tokens plus 8 output tokens. Currently, based on the OpenAI website, gpt-4-0125-preview and gpt-4-1106-preview cost $10 per 1M input tokens and $30 per 1M output tokens. Taking an upper bound of 100 input tokens and 30 output tokens per request, each request costs less than $0.002. The final cost also depends on the size of the knowledge graph: in our case, the NBA knowledge graph has 2689 active edges, so 22.5K API requests are sufficient to iterate over the entire edge set 8 times, which translates to about $45. In practice, KGLens reaches a decent theta estimation with fewer than half that many API requests, i.e., about $20 per evaluation.
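The arithmetic behind those numbers, for reference (token counts and list prices as quoted above; treat it as an upper-bound estimate, not billing guidance):

```python
# Upper-bound tokens per request and list prices quoted in the text.
input_tokens, output_tokens = 100, 30
price_in, price_out = 10 / 1_000_000, 30 / 1_000_000   # dollars per token

per_request = input_tokens * price_in + output_tokens * price_out
print(per_request)                 # 0.0019 -> under $0.002 per request
print(22_500 * per_request)        # ~ $43, in line with the ~$45 quoted for 8 full passes
print(22_500 // 2 * per_request)   # fewer than half the requests -> roughly $20
```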

The major cost is actually the hosting cost of the target LLM. For reference, one A100 GPU costs around $5 per hour, and this cost scales up easily when evaluating larger LLMs.

In addition, we believe the answer verification step does not require GPT-4 (we chose it for simplicity); a lower-cost model could be used instead.

A.2 Knowledge Graph Building and Cleaning

Given Wikidata’s vastness and inherent noise, we implement multiple strategies to maintain focus, relevance, and precision. The following techniques allow us to delve into specialized domains and ensure a targeted and reliable exploration of the data.

A.2.1 Sampling Strategies and Preserving Data Distribution

Maintaining the original data distribution is important when cleaning a knowledge graph. To achieve this, random walks in both the forward (A.7.1) and backward (A.7.2) directions are employed. By sorting each queried edge by a random value, the sub-knowledge graph contains representative samples that mirror the diversity of the original knowledge graph, preserving the inherent distribution of entities and relationships. This approach guarantees that our cleaned knowledge graph remains a faithful representation of the underlying data, enabling us to draw accurate conclusions from our research.

The extent of the random walk is flexible and tailored to specific requirements. Within our sub-knowledge graphs, we conduct random walks spanning three steps, encompassing both nodes and edges within this range for analysis.

A.2.2 Focus and Curated Relevance

In the realm of knowledge graphs, Wikidata stands out as a repository of extensive information. However, our research necessitates a more nuanced approach. While Wikidata offers comprehensive knowledge, our focus lies in curated topics and entities tailored for specific purposes. This distinction is vital as it allows us to delve deeper into specialized domains, ensuring the precision and relevance of the data we analyze.

To address this issue, the parameterized knowledge graph begins by establishing a set of human-selected central entities, from which it initiates random walks to explore neighboring entities. Additionally, we perform predicate analysis to discern and exclude predicates of lesser importance or those that are overly common. This approach ensures the focus on pertinent data while filtering out less relevant information.

A.2.3 Filtering Less Relevant Entities

The other challenge we encounter with Wikidata pertains to the noise within its knowledge graph. This noise manifests as entities that are rarely mentioned or of lesser importance in the context of our research objectives. To maintain the integrity of our analysis, it is important to identify and filter out these less relevant entities.

Filtering by language count: entities mentioned in multiple languages are often more significant and relevant to a broader audience. By focusing on such multilingual entities, we ensure the inclusion of globally relevant information in our analysis.

Filtering by word frequency: entities that are frequently mentioned in various contexts are likely to hold greater importance. By considering word frequency, we prioritize entities that are central to discussions, thereby enhancing the relevance and significance of the data included in our analysis.

Filtering out entities with no alias: entities without aliases are less likely to be widely recognized or referenced. By excluding these entities, we focus our analysis on well-known and frequently mentioned entities, aligning our research with more meaningful and impactful data points.
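Combined, these filters can be expressed as one predicate over candidate entities; a sketch with hypothetical thresholds (the paper does not state exact cut-offs):

```python
def keep_entity(entity: dict) -> bool:
    """Keep an entity only if it passes all three relevance filters.
    The numeric thresholds here are hypothetical placeholders."""
    return (
        entity["language_count"] >= 5          # described in several Wikidata languages
        and entity["mention_frequency"] >= 10  # frequently mentioned in text
        and len(entity["aliases"]) > 0         # has at least one alias
    )
```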

A.3 Uncovered Error Types

A.4 Zero-sense rate and all-sense rate

A.5 Human Evaluation

We conduct the human evaluation with an internal paid crowdsourcing service, where 5 annotators participated in the annotation process with their consent to the use of the data. All the annotators are from English-speaking countries. The annotation instructions are shown below.

A.6 Prompt

A.7 Wikidata Web Query

A.7.1 Forward Walk

SELECT DISTINCT ?subject ?subjectLabel ?subjectDesc ?predicate ?predicateLabel ?predicateDesc ?object ?objectLabel ?objectDesc
WHERE {{
  VALUES ?subject {{
    {values}
  }}
  ?subject ?predicate ?object .
  ?subject rdfs:label ?subjectLabel .
  ?subject schema:description ?subjectDesc .
  ?property wikibase:directClaim ?predicate .
  ?property rdfs:label ?predicateLabel .
  ?property schema:description ?predicateDesc .
  ?object rdfs:label ?objectLabel .
  ?object schema:description ?objectDesc .
  FILTER(lang(?subjectLabel) = "en")
  FILTER(lang(?subjectDesc) = "en")
  FILTER(lang(?predicateLabel) = "en")
  FILTER(lang(?predicateDesc) = "en")
  FILTER(lang(?objectLabel) = "en")
  FILTER(lang(?objectDesc) = "en")
}}
ORDER BY UUID()
LIMIT {limit}

A.7.2 Backward Walk

SELECT DISTINCT ?subject ?subjectLabel ?subjectDesc ?predicate ?predicateLabel ?predicateDesc ?object ?objectLabel ?objectDesc
WHERE {{
  VALUES ?object {{
    {values}
  }}
  ?subject ?predicate ?object .
  ?subject rdfs:label ?subjectLabel .
  ?subject schema:description ?subjectDesc .
  ?property wikibase:directClaim ?predicate .
  ?property rdfs:label ?predicateLabel .
  ?property schema:description ?predicateDesc .
  ?object rdfs:label ?objectLabel .
  ?object schema:description ?objectDesc .
  FILTER(lang(?subjectLabel) = "en")
  FILTER(lang(?subjectDesc) = "en")
  FILTER(lang(?predicateLabel) = "en")
  FILTER(lang(?predicateDesc) = "en")
  FILTER(lang(?objectLabel) = "en")
  FILTER(lang(?objectDesc) = "en")
}}
ORDER BY UUID()
LIMIT {limit}
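The doubled braces suggest these queries are kept as Python format-string templates; a hedged usage sketch for running one against the public Wikidata SPARQL endpoint (the helper below is illustrative, not part of KGLens):

```python
import requests

def run_walk(template: str, entity_qids, limit: int = 100):
    """Fill the {values}/{limit} placeholders and execute the query against
    the Wikidata Query Service, returning the raw JSON result bindings."""
    values = " ".join(f"wd:{qid}" for qid in entity_qids)   # e.g. ["Q65", "Q100"]
    query = template.format(values=values, limit=limit)
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kg-walk-sketch/0.1 (research use)"},
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```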

A.8 Additional Figures
