Today Google revealed RankBrain, an AI system that it has recently started using for a "very large" fraction of the millions of searches it handles per second. RankBrain has more than a few commonalities with Harvest.ai's Macie analytics. What's special about RankBrain? Google says that it's particularly useful when processing searches that its systems have never seen before.
Google has not released many details about how RankBrain's AI works, but it likely involves some fantastic research that Google has released over the past year on word and paragraph vectors, the same algorithms that form the underpinnings of Harvest.ai's Macie analytics. Word and paragraph vectors are vector-based representations of words, paragraphs, and documents generated by neural networks, often referred to as neural language models.
Why is this relevant? Moving from a search engine searching for statistically relevant terms to a more knowledge-based system requires not just understanding which terms are statistically interesting, but also the context in which they were used. Take the example sentence below:
I like to stream TV shows on Hulu and stream movies on Netflix
In a traditional word-count (token n-gram, or bag-of-words) model, this sentence would get turned into:
The "token_1gram" example is a typical bag-of-words approach that is useful for many NLP applications, such as search: it maintains a count of the number of times each word appears in an input document. It's great for identifying statistically interesting terms like "stream", "hulu", and "netflix", but it loses key semantics that are critical for harder problems such as conveying knowledge. For example, using this model it's impossible to answer whether the user likes to watch movies or shows on Netflix.
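As a rough illustration of the bag-of-words counting described above, here is a minimal Python sketch. It assumes simple lowercase, whitespace-based tokenization; the function name `token_1gram` is borrowed from the label used in the text and is hypothetical, not an actual Macie or Google API.

```python
from collections import Counter


def token_1gram(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())


sentence = "I like to stream TV shows on Hulu and stream movies on Netflix"
# Same words, but with the services swapped (a different meaning).
swapped = "I like to stream movies on Hulu and stream TV shows on Netflix"

counts = token_1gram(sentence)
print(dict(counts))

# Word order is discarded, so both sentences produce identical counts.
print(counts == token_1gram(swapped))
```

Because the two sentences yield the same counts, the model cannot tell which service goes with movies and which with TV shows.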
In contrast to a BoW model, neural language models maintain a stronger association between "TV shows" and "Hulu", and between "movies" and "Netflix", capturing the semantics of the original statement.
Let's do a quick example and see what we can learn from word vectors trained on a very large corpus of data, such as the full content of Wikipedia or a 3-month data dump of Google News. Given an input search, such as "Hulu" and "Netflix" from the example above, we can find other terms that are often located close to our input query. This can be used to expand the user's original search, which may be something that Google has never seen before, to include other relevant terms that may bring in better search results.
Pretty amazing! Our original query of "Netflix" and "Hulu" became:
- Amazon Unbox
How does this work? In this case, we ran the terms "Netflix" and "Hulu" against a trained word vector model generated from 3 months of Google News articles (approximately 100 billion words), using the word2vec implementation open sourced by Google. As shown above, we can use this model to build on existing queries (even ones we've never seen before) with intelligent recommendations. Users don't have to know, or spend the time to think of, which terms to put into a Google query; it can be expanded automatically.
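To make the idea concrete, here is a minimal sketch of query expansion by nearest-neighbor lookup in a vector space. The vectors and vocabulary below are made up purely for illustration (real word2vec vectors are learned from text and typically have hundreds of dimensions), and averaging the query vectors before ranking by cosine similarity is an assumption, not necessarily what RankBrain or Macie does.

```python
import math

# Made-up 3-dimensional vectors for illustration only; these are NOT
# values from the Google News model.
TOY_VECTORS = {
    "netflix": [0.9, 0.8, 0.1],
    "hulu": [0.8, 0.9, 0.2],
    "streaming": [0.85, 0.75, 0.15],
    "blockbuster": [0.7, 0.6, 0.3],
    "banana": [0.1, 0.0, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def expand_query(terms, vectors, topn=2):
    """Return the topn vocabulary words closest to the averaged query vector."""
    dims = len(next(iter(vectors.values())))
    # Represent the whole query as the mean of its term vectors.
    query = [sum(vectors[t][i] for t in terms) / len(terms) for i in range(dims)]
    # Rank every non-query word by similarity to the query vector.
    candidates = [w for w in vectors if w not in terms]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:topn]


expanded = expand_query(["netflix", "hulu"], TOY_VECTORS)
print(expanded)
```

With a real pretrained model, a library such as gensim offers a comparable lookup through `KeyedVectors.most_similar`, so the hand-written similarity code here would not be needed.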
At Harvest.ai, our Macie analytics use an approach similar to RankBrain's to identify and protect important data across an organization, even with just a small set of training data. Go machines!
POST BY ALEX WATSON, FOUNDER, CEO HARVEST.AI
Former Senior Director of Security Research at Websense. CTO at BTS. Co-Founder of APX Labs. Over 10 years of experience in the US Intelligence Community.