Google’s Linguistic Analysis Is Not All About Web Search
Google’s research branched out well beyond Web search over a decade ago. Unfortunately, many Web marketers with no formal training in the theory of computation, search engine design, learning systems, or other related disciplines take sound bites from occasional Google announcements, news stories, and SEO blog posts about various research projects and transform those sound bites into stipulations of pseudo-fact.
While I think everyone interested in learning more about how search engines work should spend some time browsing the research papers made available via Google Scholar, just reading the abstracts and tying the titles of papers together isn’t enough. Your insights are limited if you don’t look into the backgrounds of the people publishing this research.
Across the years the SEO community has taken many Google patent applications developed by PPC and/or Google News and/or other non-Web search teams and treated those patents as if they describe unpaid (organic) Web search functions. These are important distinctions to note because each team is testing and evaluating ideas that may never be used by the other teams. Furthermore, research or patents that were published years ago may have been utilized at one time but then discarded or replaced by more modern mechanisms.
The team members may go on to work on other projects, so you can’t assume that someone who worked on query translation one year is still working on query translation five years later.
How Many Ways Does Google Deal with Language?
If we ignore all the other companies Alphabet manages under its umbrella, here are a few areas where Google integrates linguistic analysis into its systems. These systems are distinct and separate from one another, even though your browser creates the illusion of tight integration because everything is located on “google.com”.
AdSense – Google AdSense analyzes your Web content to identify contextual properties that can be matched to its advertising inventory. Furthermore, they analyze page and site structure to determine which kinds of ads they can fit into your pages. They also analyze sites to determine if those sites are violating AdSense guidelines.
The AdSense service made machine-learning algorithms available to all Website partners years ago. If you haven’t run any AdSense experiments or activated auto ads, you’re missing out on an opportunity to use free turnkey machine learning tools developed by Google. There is literally nothing you need to do other than activate the service. They’ll now even design experiments for your sites if you allow them to.
AdWords – Google AdWords is the other side of the advertising coin. Although AdWords always took a contextual approach, Google has bought a number of companies to enhance its linguistic analysis and processing capabilities. AdWords doesn’t just power Web search advertising. It powers the AdSense network, too.
How you match your advertising copy to on-page context either in search or across the partner network is driven by Google’s linguistic algorithms.
Book Search – Google Book Search has been controversial from the beginning. It’s also the most frustrating and limiting of Google’s search tools. I’m sure that’s because they were sued over the service from the very beginning. Google’s goal of making millions of books accessible to the world conflicts with the still-active intellectual property rights of millions of owners.
Book Search integrates OCR (optical character recognition), one of the earliest forms of computerized linguistic analysis, with full-blown text and language analysis.
The book index needs to differentiate between documents by year, format (book or magazine or newspaper), and by structure. Book Search is capable of doing some interesting things but you can’t always get what you want out of the service because of intellectual property rights.
Google Assistant – Although Google is still a very distant no. 2 in the voice search world (Bing powers about 2/3 of English-language voice search), there is no question about how innovative Google has been in this area. All serious comparisons between the two platforms seem to agree that Google’s algorithms understand human speech better. But there is more to it than simply understanding speech.
Voice search still relies heavily on Featured Snippets to answer questions as well as product markup for identifying products relevant to item-specific queries. In many ways Voice Search is the most limited and infantile of search technologies. Voice Search is akin to meta search. I don’t know if Google Assistant and comparable services will ever keep their own indexes of Web content. So far the research appears to be pointing away from that.
Image Search – While everyone knows about the failures of Google Image Search (such as their notoriously bad labeling scheme that matched photos of African Americans with the word “gorilla”), the Image Search teams have continued to refine their integrated pattern recognition systems. Google Image Search can recognize objects within photos. So if you’re looking for a picture of boxes of various sizes and shapes, Google Image Search will try to understand your query (e.g., “pictures of boxes of various sizes and shapes”) and find images that match that description.
While Image Search still depends very much on the attributes that Website publishers associate with their images, over the past few years Image Search has shown me some very interesting, unexpected, and clearly unoptimized but highly accurate results. One example of these kinds of results are the YouTube image captures I occasionally find for very obscure, highly specific queries. While it’s possible that people are annotating these videos either on YouTube or other sites in ways that Image Search can use, the point is that Google’s Image Search (and Bing’s) does a very good job of connecting a lot of not-very-obvious dots.
They need to understand words to do that.
News Search – Google’s News Search is a completely separate index and set of algorithms. They’ve been doing linguistic analysis since the very early days, even though things have changed much since Krishna Bharat developed the Hilltop algorithm (one of the algorithms most poorly understood by the SEO community).
One area where the Google News team has been developing linguistic tools is for the creation of article abstracts. I don’t know if this is driven by Europe’s ridiculous intellectual property rights laws but I suspect that Googlers have been trying to figure out how to report the news without actually quoting any of the source articles.
Scholar – The Google Scholar team works in a very limited subset of the Web, but over the past few years their index has drawn intense focus from Web marketing bloggers and conference presenters. Your work sounds more authoritative if you can point to a Google patent or research paper and quote from it.
Of course, if you cannot explain how all that stuff fits into the bigger picture you’re not doing anyone (including yourself) any favors by persuading a significant proportion of the Web marketing world to start using new buzzwords. Just quoting abstracts and summaries is not good enough. You still need to explain how these algorithms work, who uses them, why they would be used, and what evidence there is for assuming they have been deployed into a live system.
I’d like to think I do all that very well but it’s not easy and no one is expert enough in all these types of systems to competently explain them all. There are some patent applications and research papers I won’t comment on for months, or even years, because they are so hard to understand. And yet Web marketers may glibly point to them as proof that Google Web search is using some algorithm.
Google Scholar has to understand the most difficult and challenging words on the Web. The system’s linguistic capabilities are, in my opinion, limited, but they function at a minimally acceptable level. And you still need to utilize their advanced search form to get the best results.
Translate – Google Translate has been both the bane and the boon of the Internet. You can still find some laughably bad translations via the Translate service (and competitive tools still struggle as well). A lot of the linguistic research I see Web marketers citing in their blog posts and conference presentations have more to do with Google Translate and less (if anything) to do with Google Web search.
That doesn’t mean you cannot or should not infer that if Translate’s team can do something, some other team can do it too. These kinds of inferences are fine, but not only should you stipulate all the usual caveats when you create these posts and presentations, as an audience member you should absolutely be skeptical of the relevance of these algorithms to Web search.
Web Search – And then there’s the general Web search that everyone knows so well. Google’s Web Search algorithms fall into many sub-categories (most of which we don’t know about). You have the spam-filtering algorithms, the duplicate content/canonical identifying algorithms, the document relevance algorithms, and algorithms for extracting special markup or incidental information. I’m just talking about linguistic analysis functions. The actual selection and scoring algorithms fall into a different area.
Unfortunately, Web marketers have become familiar with so many names (like PageRank, RankBrain, Panda, Penguin, etc.) that they feel very confident in their understanding of how Google works. But everyone’s understanding is unique to their own perspective.
Most people can’t tell you the difference between RankBrain and neural matching (and I’m afraid Danny Sullivan’s explanations are as unhelpful as anything I’ve read in the Web marketing blogosphere). Although Google may have an algorithm they call “the Neural Matching algorithm”, the phrase neural matching has been used for decades to describe a vector-based machine learning process. There are MANY “neural matching” algorithms. I wrote an article last year, “What Is Neural Matching? Google Just Changed How You Search the Web”, on my personal blog. If you’re curious about how neural matching has been used through the years, start there. Google didn’t invent it and they don’t own it.
How Should You Study These Patents and Research Papers?
One thing I’d like to see people do is talk about who is writing these patents and papers. If you take the time to research the authors you’ll start to see that a lot of advertising-related processes have been mistaken for Web search processes. You should resist the temptation to defend or excuse past mistakes in interpretation by saying, “Sure, but they could be using this stuff in Web search”.
Most likely they are NOT using those processes in Web search because it’s a different kind of system.
If you want to write about patents and research papers, it will benefit your readers if you explain to them where these people work and when they did their work. You can’t always find their bios online but it’s been easier for me to find them over the past 1-2 years. We share a lot of these patents and research papers in the SEO Theory Premium Newsletter and I do my best to identify who is writing them and what their specialties are for our subscribers.
If you cannot explain how the processes actually work, where they came from, or why they are important you should take the time to dig deeper and learn about them. Googlers have developed many new ideas and concepts over the past 20 years, but a lot of the machine learning stuff they have been publishing about was developed before they got to it. These phrases and concepts are much older than most of you realize.
I still cringe every time I read how “Neural Matching is Google’s ranking algorithm” (or something equivalent to that). It’s obvious when SEO specialists write explanations about things they don’t understand. You may know how to get to the top of search results by following a rote pattern of steps, but your explanations of how search works are wrong. Remember, many spammers rank in spite of their spam, not because of it. The same is true for how you optimize: you may rank in spite of your ideas about how search works, not because of them.
There is only one authority about how any given search engine works and that is the search engine’s developers; and they usually don’t share all they know.
Quoting your favorite SEO bloggers and conference presenters in your attempts to explain how search patents and research papers fit into the scheme of things weakens your explanation. Peer-reviewed research is taken out of context too easily and too often by the Web marketing community. Many experimental ideas are not intended for integration into Web search. Just because some Googler’s name is at the top of the paper doesn’t mean it really provides insight into what Google search systems are doing.
Use the abstracts and summaries to check your comprehension. You really need to read through these documents several times and try to explain to yourself what they are saying. Go back to the abstracts and conclusions several times to recalibrate what you think you have learned.
You’re not going to become expert enough in these systems to be able to explain them just by reading patents and research presentations. There are whole textbooks of explanation behind these concepts. You’ll have to read at least a chapter’s worth of material before you can begin to understand what all the gobbledy-gook means. And as someone who has been reading this kind of stuff for decades, I am still of the opinion that most computer science textbooks are poorly written.
So good luck getting the machine learning algorithms to help you understand them.
There Are Other Ways to Learn about This Stuff
You can sign up for (sometimes free) online courses to study machine learning, but if you’re unable to make that time commitment you can (and should) head over to YouTube and watch some of the thousands of hours of presentations and educational videos available there. There are lectures that explain the basics of machine learning and lectures that go into highly detailed minutiae. You won’t understand it all. I don’t. But you can find almost anything you need to understand what is going on in the patents and research papers.
Just avoid all the SEO-produced explanations about machine learning. Every one I have watched so far is worthless video drivel.
How Much Do You Need to Know about “Vectors” and Such?
Every time I see a news story about Google and its vectors I think of the scene from “Airplane” where the flight crew is preparing for takeoff and Peter Graves says, “What’s our vector, Victor?” The whole conversation is a play on the multiple meanings and similar sounds of words (including “roger, Roger”, “clearance, Clarence”, etc.).
I learned about vector math in Linear Algebra. To understand what search engineers at Baidu, Bing, Google, and Yandex are doing you don’t need a great deal of math. A vector is just an ordered sequence of things, like a sentence. We tend to represent vectors with parenthetical markup, such as (1, 2, 3) or (x, y, z).
In software the elements of a vector are its coordinates. The 1st element is coordinate 1, the 2nd element is coordinate 2, etc. A vector can have more than 1 dimension. If you’ve done any programming where you use 2-dimensional arrays, those are 2-dimensional vectors with 2 coordinates, typically represented as (x, y). The classic (x, y) coordinate format is also called an ordered pair, and you could have a vector that consists of ordered pairs, such as ((1, 2), (3, 4), (5, 6)).
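To make the shapes above concrete, here is a minimal sketch in Python (illustrative only, not anything a search engine actually runs):

```python
# A one-dimensional vector: an ordered sequence of values, like a sentence.
sentence = ("the", "cat", "sat", "on", "the", "mat")

# Coordinates are just positions: the 1st element is coordinate 1, and so on.
first_word = sentence[0]  # "the"

# A two-dimensional array is a vector of vectors, addressed by (x, y).
grid = [
    [1, 2, 3],
    [4, 5, 6],
]
value = grid[1][2]  # row 1, column 2 -> 6

# A vector whose elements are themselves ordered pairs.
pairs = [(1, 2), (3, 4), (5, 6)]
```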
The elements of a vector can be assigned attributes. The attributes are also usually represented as numbers and they may be depicted this way:
[(element value):(attribute value)] or [23782:2474]
So why is everything represented as a number? Many of you probably conclude that the numbers are indexes in tables. That’s good enough. The vectors only need to be constructed from concise, distinct data. If the number “3464” occurs in more than 1 element of a vector or in more than 1 vector, it just needs to refer to the same thing. So assuming it’s an index pointer is fair enough and you don’t need to know any more about how the algorithms are fed their data than that.
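That index-pointer idea can be sketched in a few lines. This is a hypothetical illustration (the word-to-number assignments are invented, not Google’s), showing only that the same number consistently refers to the same thing:

```python
# Hypothetical sketch: assign each distinct word a stable integer id,
# so any occurrence of that id, in any vector, refers to the same word.
vocabulary = {}

def word_id(word):
    """Look up (or assign) the distinct index for a word."""
    if word not in vocabulary:
        vocabulary[word] = len(vocabulary)
    return vocabulary[word]

# Two sentences become vectors of index pointers.
v1 = [word_id(w) for w in "the cat sat".split()]
v2 = [word_id(w) for w in "the mat sat".split()]

# "the" and "sat" receive the same id wherever they appear.
print(v1)  # [0, 1, 2]
print(v2)  # [0, 3, 2]
```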
Machine learning algorithms need a lot of data. Computer scientists have been organizing that data into vectors for decades. Search engines didn’t invent all this vector-based analysis. Graphical mapping systems used in industrial design, mapping, optical analysis, and more have worked with vector-based data for many years. Your smartphones and computers use vector math to “paint their screens”. Vectors permeate your daily life. You’re reading this blog through an interface managed by vector math.
Neural matching is a vector-based method for identifying and mapping patterns in large sets of vectors. The earliest use I could find for neural matching in computer technology was to match satellite photos with maps. There may have been earlier (possibly still-classified) applications of which I am not aware.
Pattern-matching is an interesting and very broad concept in computer science. You can have fuzzy patterns and precise patterns. You account for the fuzziness through tolerances, modulations, and other stuff that sounds like it came right out of an audio engineer’s oscillator manual. We can use vectors to analyze waves and signals. It’s hard for me to think of something that cannot be analyzed through vectors. If it’s analog it can usually be digitized, and if it’s digital data then it can be packed into vectors.
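One common way to express that “fuzziness with a tolerance” over vectors is cosine similarity: two vectors count as a match if the angle between them is small enough. This is a generic textbook technique, not a description of any search engine’s actual matching code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def fuzzy_match(a, b, tolerance=0.9):
    """A 'fuzzy' pattern match: vectors match if similarity clears a threshold."""
    return cosine_similarity(a, b) >= tolerance

# Nearly parallel vectors match; orthogonal vectors do not.
print(fuzzy_match((1.0, 2.0, 3.0), (1.1, 2.1, 2.9)))  # True
print(fuzzy_match((1.0, 0.0), (0.0, 1.0)))            # False
```

Raising or lowering `tolerance` is exactly the kind of knob the “tolerances and modulations” language refers to.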
Simply knowing what these things are and how they are done doesn’t provide one with any special insight into which algorithms a search engine uses or what those algorithms do. But it does make it easier to understand what these patents and research papers are saying.
Language Has Been Vectorized
Although I could go on about this, I’ll try to wrap it up quickly. The many different language projects I’ve read about use distinct data sets. The training data behind these models is not meant to be representative of the broader Web. What we as marketers should be taking away from these patents and papers is that the search engines look for ways to classify and organize meaningful information.
A city name like Paris is data. An ordered pair of data items like (Paris, France) is meaningful data. There is more than one city named Paris, and Paris is also used as a personal name. But there is only one Paris, France.
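A quick sketch of why the ordered pair matters (the entries and population figures here are rough placeholders, invented for illustration):

```python
# A bare name is ambiguous; an ordered (city, region) pair is not.
cities = {
    ("Paris", "France"): {"population": 2_100_000},     # approximate
    ("Paris", "Texas"): {"population": 24_000},          # approximate
    ("Paris", "Tennessee"): {"population": 10_000},      # approximate
}

# "Paris" alone matches several entries...
matches = [key for key in cities if key[0] == "Paris"]
print(len(matches))  # 3

# ...but the ordered pair picks out exactly one.
print(("Paris", "France") in cities)  # True
```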
That’s the point of sorting all our words into vectors. The search engineers are developing algorithms to extract meaningful information from random text. Anything to do with relevance to queries, authority, and sorting search results comes much, much later in a different part of the corporate system.
These language vectors you read about may not be used by the indexing, selection, and ranking algorithms that provide real-time search results. Maybe some day that will be useful in an unforeseeable way, but real-time machine learning is not practical. Just as your clicks on search results are not being used to compute new ranking positions for those results, the vectorized data you read about isn’t really what these algorithms are working with.
You won’t be able to optimize for search by thinking about how search engines understand language. If anything, you’ll make their jobs much more difficult if you try to do that. They MUST learn to understand natural language. They’ve been fooled all too often by clever Web marketers who identified linguistic patterns that were favored by the algorithms. Now Bing, Google, and the others have the resources to develop many different selection, relevance, and ranking algorithms. That means their search systems are more flexible than you realize.
And while it’s cool to know they are vectorizing all our text, that doesn’t in any way provide you with real advantages over anyone else in search. If anything, you’ll have to succeed in spite of your misunderstandings about linguistics and vectors.