How Google’s Knowledge Graph Updates Itself by Answering Questions
How A Knowledge Graph Updates Itself
Those of us who are used to doing Search Engine Optimization (SEO) have been looking at URLs filled with content, and at the links between that content. We have focused on how algorithms such as PageRank (based upon links pointed between pages) and information retrieval scores (based upon the relevance of that content) determine how well pages rank in search results in response to queries entered into search boxes by searchers. Web pages connected by links have been seen as information points (nodes) connected by links (edges). This was the first generation of SEO.
Chances are good that many of the methods that we have been using to do SEO will remain the same as new features appear in search, such as knowledge panels, rich results, featured snippets, structured snippets, search by photography, and expanded schema covering many more industries and features than it does at present.
Search has been going through a transformation. Back in 2012, Google introduced something it refers to as the knowledge graph, and told us that it would begin focusing upon indexing things instead of strings. By “strings,” they were referring to words that appear in queries, and in documents on the Web. By “things,” they were referring to named entities, or real and specific people, places, and things. When people searched at Google, the search engine would show Search Engine Results Pages (SERPs) filled with URLs to pages that contained the strings of letters that they were searching for. Google still does that, and is slowly changing to showing search results that are about people, places, and things.
Google started showing us in patents how they were introducing entity recognition to search, as I described in this post:
How Google May Perform Entity Recognition
They now show us knowledge panels in search results that tell us about the people, places, and things they recognize in the queries we perform. In addition to crawling webpages and indexing the words on those pages, Google is collecting facts about the people, places, and things it finds on those pages.
A Google Patent that was just granted in the past week tells us about how the Google knowledge graph updates itself when it collects information about entities, their properties and attributes, and the relationships involving them. This is part of the evolution of SEO that is taking place today – learning how Search is changing from being based upon search to being based upon knowledge.
What does the patent tell us about knowledge? The following section details what a knowledge graph looks like, and the kind of information Google might collect for one when it indexes pages these days:
Knowledge graph portion includes information related to the entity [George Washington], represented by [George Washington] node. [George Washington] node is connected to [U.S. President] entity type node by [Is A] edge with the semantic content [Is A], such that the 3-tuple defined by nodes and the edge contains the information “George Washington is a U.S. President.” Similarly, “Thomas Jefferson Is A U.S. President” is represented by the tuple of [Thomas Jefferson] node 310, [Is A] edge, and [U.S. President] node. Knowledge graph portion includes entity type nodes [Person], and [U.S. President] node. The person type is defined in part by the connections from [Person] node. For example, the type [Person] is defined as having the property [Date Of Birth] by node and edge, and is defined as having the property [Gender] by node 334 and edge 336. These relationships define in part a schema associated with the entity type [Person].
Note that SEO is no longer just about how often certain words appear on pages of the Web, what words appear in links to those pages, in page titles, and headings, alt text for images, and how often certain words may be repeated or related words may be used. Google is looking at the facts that are mentioned about entities, such as entity types like a “person,” and properties, such as “Date of Birth,” or “Gender.”
Note that the quote also mentions the word “Schema,” as in “These relationships define in part a schema associated with the entity type [Person].” As part of the transformation of SEO from strings to things, the major search engines joined forces to offer us information on how to use Schema for structured data on the Web, which provides a machine-readable way of sharing information with search engines about the entities that we write about, their properties, and their relationships.
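To make the node-and-edge description from the patent more concrete, here is a minimal sketch in Python of how those 3-tuples could be stored and looked up. This is only an illustration and not how Google stores its knowledge graph. The “Has Property” edge label and the function names build_index and objects are my own assumptions; the patent passage only says that the [Person] type is defined as having those properties by a node and an edge.

```python
from collections import defaultdict

# Each fact is a (subject, edge, object) 3-tuple, as in the patent's example.
TRIPLES = [
    ("George Washington", "Is A", "U.S. President"),
    ("Thomas Jefferson", "Is A", "U.S. President"),
    # Schema facts: the [Person] type is defined in part by its properties.
    # "Has Property" is a hypothetical edge label chosen for this sketch.
    ("Person", "Has Property", "Date Of Birth"),
    ("Person", "Has Property", "Gender"),
]


def build_index(triples):
    """Group triples by subject, then by edge label, for simple lookups."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, edge, obj in triples:
        index[subject][edge].add(obj)
    return index


def objects(index, subject, edge):
    """Return the nodes reached from `subject` over `edge`, sorted by name."""
    return sorted(index.get(subject, {}).get(edge, set()))


index = build_index(TRIPLES)
print(objects(index, "George Washington", "Is A"))  # ['U.S. President']
print(objects(index, "Person", "Has Property"))     # ['Date Of Birth', 'Gender']
```

The second lookup is the part that matters for what follows: the properties attached to an entity type act as a checklist of facts that an entity of that type could be expected to have.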
I’m writing about this patent because I am participating in a Webinar online about the Google Knowledge Graph and how it is being used, and updated. The Webinar is tomorrow at:
#SEOisAEO: How Google Uses The Knowledge Graph in its AE algorithm. I haven’t been referring to SEO as Answer Engine Optimization, or AEO, and it’s unlikely that I will start, but I see it as an evolution of SEO.
I’m writing about this Google Patent, because it starts out with the following line which it titles “Background:”
This disclosure generally relates to updating information in a database. Data has previously been updated by, for example, user input.
This line points to the fact that, under this approach, a knowledge graph no longer needs to be updated by users. Instead, the patent describes how Google’s knowledge graph updates itself.
Updating a Knowledge Graph
I attended a Semantic Technology and Business conference a couple of years ago, where the head of Yahoo’s knowledge base presented, and he was asked a number of questions in a question and answer session after he spoke. Someone asked him what happens when information in a knowledge graph changes, involves very sensitive information, and needs to be updated.
His answer was that a knowledge graph would have to be updated manually to have new information placed within it.
That wasn’t a satisfactory answer because it would have been good to hear that the information from such a source could be easily updated, and it was a little difficult hearing that a search engine would need to be edited like a newspaper would be. This may have been the answer that the people from Yahoo believed was the proper answer, and I’ve been waiting for Google to answer a question like this to see what their answer would be. That made seeing a line like this one from this patent interesting:
In some implementations, a system identifies information that is missing from a collection of data. The system generates a question to provide to a question answering service based on the missing information, and uses the response from the question answering service to update the collection of data.
This would be a knowledge graph update, so the patent provides details using language that reflects that exactly:
In some implementations, a computer-implemented method is provided. The method includes identifying an entity reference in a knowledge graph, wherein the entity reference corresponds to an entity type. The method further includes identifying a missing data element associated with the entity reference. The method further includes generating a query based at least in part on the missing data element and the type of the entity reference. The method further includes providing the query to a query processing engine. The method further includes receiving information from the query processing engine in response to the query. The method further includes updating the knowledge graph based at least in part on the received information.
How does the search engine do this? The patent provides more information that fills in such details.
The approaches to achieve this include:
…Identifying a missing data element comprises comparing properties associated with the entity reference to a schema table associated with the entity type.
…Generating the query comprises generating a natural language query. This can involve selecting, from the knowledge graph, disambiguation query terms associated with the entity reference, wherein the terms comprise property values associated with the entity reference, or updating the knowledge graph by updating the data graph to include information in place of the missing data element.
…Identifying an element in a knowledge graph to be updated based at least in part on a query record. Operations further include generating a query based at least in part on the identified element. Operations further include providing the query to a query processing engine. Operations further include receiving information from the query processing engine in response to the query. Operations further include updating the knowledge graph based at least in part on the received information.
A knowledge graph updates itself in these ways:
(1) The knowledge graph may be updated based on one or more previously performed searches.
(2) The knowledge graph may be updated with a natural language query, using disambiguation query terms associated with the entity reference, wherein the terms comprise property values associated with the entity reference.
(3) The knowledge graph may compare the properties associated with the entity reference to the schema for its entity type, to identify missing data elements and fill them in with new information.
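The steps the patent describes can be sketched in a few lines of Python: compare an entity’s properties to a schema table, generate a natural language query with disambiguation terms, send it to a query processing engine, and update the graph with the answer. Everything here is a hypothetical illustration under my own assumptions, not Google’s implementation: the SCHEMA and GRAPH contents, the function names, the choice of “Occupation” as a disambiguation property, and especially stub_engine, which stands in for a real query processing engine with a single canned answer. The sketch also leaves out the first approach, which picks elements to update based on a record of previous queries.

```python
# Hypothetical schema table: properties each entity type is expected to have.
SCHEMA = {"Person": ["Date Of Birth", "Gender", "Place Of Birth"]}

# A tiny knowledge graph: entity -> its type and known property values.
GRAPH = {
    "George Washington": {
        "type": "Person",
        "properties": {"Gender": "Male", "Occupation": "U.S. President"},
    },
}


def missing_elements(graph, schema, entity):
    """Compare the entity's properties to the schema table for its type."""
    record = graph[entity]
    return [p for p in schema[record["type"]] if p not in record["properties"]]


def generate_query(graph, entity, prop, disambiguators=("Occupation",)):
    """Build a natural language query, using known property values
    of the entity as disambiguation terms."""
    known = graph[entity]["properties"]
    extra = [known[d] for d in disambiguators if d in known]
    subject = entity + (" (" + ", ".join(extra) + ")" if extra else "")
    return f"What is the {prop.lower()} of {subject}?"


def stub_engine(query):
    """Stand-in for a query processing engine (an assumption for this sketch).
    It only knows one canned answer and returns None otherwise."""
    canned = {
        "What is the date of birth of George Washington (U.S. President)?":
            "February 22, 1732",
    }
    return canned.get(query)


def update_graph(graph, schema, entity, engine):
    """Fill in each missing element for which the engine returns an answer."""
    filled = []
    for prop in missing_elements(graph, schema, entity):
        answer = engine(generate_query(graph, entity, prop))
        if answer is not None:
            graph[entity]["properties"][prop] = answer
            filled.append(prop)
    return filled


print(missing_elements(GRAPH, SCHEMA, "George Washington"))
print(update_graph(GRAPH, SCHEMA, "George Washington", stub_engine))
print(missing_elements(GRAPH, SCHEMA, "George Washington"))
```

In this sketch, the date of birth is filled in on the first pass, while the place of birth stays missing because the stub engine has no answer for it. A real system would presumably also need to decide how confident it is in an answer before writing it into the graph.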
The patent that describes how Google’s knowledge graph updates itself is:
Question answering to populate knowledge base
Inventors: Rahul Gupta, Shaohua Sun, John Blitzer, Dekang Lin, and Evgeniy Gabrilovich
US Patent: 10,108,700
Granted: October 23, 2018
Filed: March 15, 2013
Methods and systems are provided for a question answering. In some implementations, a data element to be updated is identified in a knowledge graph and a query is generated based at least in part on the data element. The query is provided to a query processing engine. Information is received from the query processing engine in response to the query. The knowledge graph is updated based at least in part on the received information.
Nicolas Torzec tweeted me a link to a paper published on the Google AI Blog, which shares a number of authors with this patent. It was posted in 2014 (a year after the patent this post is about was filed). The paper explains in more detail how a knowledge graph might become more complete. As the Abstract of the paper tells us:
We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.
The paper is Knowledge Base Completion via Search-Based Question Answering. Reading this paper in addition to the patent is recommended. It presents a much more nuanced look at some of the issues that the people working on this problem came across, and some of the solutions that they found to address them. One of the problems that they use to illustrate how this system works involves identifying the parents of Frank Zappa (his band was named “The Mothers of Invention,” which gave that task some unique issues as well).
Updating a knowledge graph using questions and answers like this does seem to be a difficult task, and one that faces some challenges. It is interesting to see what stage we are at in having problems like this addressed, so read this paper carefully along with the patent.
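To give a feel for what aggregating candidate answers across multiple queries into probabilistic predictions could mean, here is a deliberately simple sketch in Python. It is not the method from the paper: it just sums a score per candidate over all queries and normalizes the totals so they add up to 1. The function name aggregate_answers, the query results, and the scores are all invented for illustration.

```python
from collections import defaultdict


def aggregate_answers(results_per_query):
    """Combine scored candidate answers from several queries.

    results_per_query holds one list per query, each containing
    (candidate, score) pairs. Scores are summed per candidate and
    then normalized so that the returned values sum to 1.
    """
    totals = defaultdict(float)
    for results in results_per_query:
        for candidate, score in results:
            totals[candidate] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {candidate: score / grand_total for candidate, score in totals.items()}


# Invented scores for two hypothetical queries about Frank Zappa's mother.
# The band name shows up as a distractor for one of the queries.
results = [
    [("Rose Marie Colimore", 2.0), ("The Mothers of Invention", 1.0)],
    [("Rose Marie Colimore", 1.0)],
]
print(aggregate_answers(results))
```

With these made-up scores, the person’s name ends up with a probability of 0.75 and the band with 0.25, which shows how evidence from a second query can help outweigh a distractor that only appears for one phrasing of the question.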
We have been seeing other approaches that look at a knowledge graph from other directions such as:
3 Ways Query Stream Ontologies Change Search – This is about Google looking at query stream information to identify data that it can extract from the Web to use to build ontologies. By looking at searchers’ queries, it is in effect crowdsourcing information about topics that may be helpful in building those ontologies.
Constructing Knowledge Bases with Context Clouds – This tells us about how Google could look at unstructured content that it might be able to use to build up knowledge bases. We see statements like this from the patent the post is about:
Extending the number of attributes known to a search engine may enable the search engine to answer more precisely queries that lie outside a “long tail,” of statistical query arrangements, extract a broader range of facts from the Web, and/or retrieve information related to semantic information of tables present on the Web.
We haven’t reached the point where updating or building a knowledge base can be automated, and manually updating some knowledge graph information about sensitive topics that change may still be necessary, but we have some examples of approaches that are underway towards making such updates a possibility.