If it were just click data, how would they get the terms?
They're either parsing the query out of the url, or violating robots.txt to fetch the result page, almost certainly the former. This seems like a pretty clear indication that they've special-cased clicks from google. It's theoretically plausible that they are treating all query parameters the same for all sites, but very unlikely given how much noise that would introduce into their results. Even so, they would have to know that most clicks with meaningful query parameters come from Google. This isn't something that's going to happen by accident.
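For what it's worth, the URL-parsing route is trivial. A minimal sketch of what a Google-special-cased version might look like (function name and example URLs are mine, not anything MS has published):

```python
from urllib.parse import urlparse, parse_qs

def extract_google_query(referrer):
    """Hypothetical special-cased parser: pull the search term out of a
    Google result-page referrer URL; return None for anything else."""
    parts = urlparse(referrer)
    if not parts.netloc.endswith("google.com"):
        return None
    if not parts.path.startswith("/search"):
        return None
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None

print(extract_google_query("http://www.google.com/search?q=torsoraphy"))
# → torsoraphy
```

No page fetch needed, so no robots.txt question ever arises: everything comes from the referrer string the client already has.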
> If it were just click data, how would they get the terms?
Exactly. It's not "click data" at all. It is monitoring user behavior on search engines, using both the clicks and the queries. Maybe it's not monitoring just Google as a search engine (although we have no proof of that yet: it seems it's just watching Google) but given Google's market share in search it doesn't make much difference.
From the article:
> I don’t even work in search and I could spot the real situation
This sentence is at the same time arrogant and funny. He doesn't work in search, but he's certain he's spotted "the real situation". How did he fact check it, besides asking himself if he was correct and answering "yes, obviously I'm right. I'm always right -- and I don't even know anything! I amaze myself."
> certain he's spotted "the real situation". How did he fact check it
By reading the official MS blog post[1], watching the Farsight conference live, and being confident in my own judgements based on the observed evidence. No other conclusion is supported by the evidence - you say "It is monitoring user behavior on search engines" - where's the evidence that it's exclusively search engines they're monitoring? Let alone exclusively Google? There has been no evidence yet presented. Therefore all we can conclude from Google's sting is that Bing use URL/click data. Which is exactly what I said in the original thread[2], in my blog post, and lo and behold it's what MS later said. That's how I know it's the real situation.
[1] like where Harry says "A small piece of that is clickstream data we get from some of our customers, who opt-in ... To be clear, we learn from all of our customers".
You checked it against the assertion of the most interested party?
It's not "click" data, it's the correlation between search term and SERP. That someone visited such and such a page after a click isn't the issue. That someone Googled for a unique term before clicking is at issue.
> You checked it against the assertion of the most interested party?
There are currently exactly two sources: The Google post and the MS post. Who are more likely to know about what MS are doing? MS. I check my theory about what MS are doing with the source most likely to contain correct information.
> It's not "click" data
It's 'clickstream' data (MS's term). A 'click' comes from a page and goes to a page. That's the data MS were capturing. The page the click happened on (query happens to be included in URL), and the page the click went to. It's click data.
Your assertion Bing is most likely to be accurate about Bing ignores self-interest and spin.
However, agreed -- clickstream means series of clicks, and the actual data is a series of URLs.
The query "happening" to be in the URL has no "search > result" meaning without a parser being told to look for Google's particular keyword query indicators and correlate the subsequent page. As most URLs are not searches, this is not emergent behavior; it's programmed.
People also talk about this being a "weak" signal, but given search volume (or clickstream volume if you prefer) on Google versus other sources, even if this code is generic (e.g., recognize all "q=blah" or "search=blah" as keywords and correlate the following URL), it seems the signal would be strong indeed. Google's weak signal would provide several times more correlative data to Bing than Bing's own clicks.
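The generic scheme just described is barely more code than a special case. A sketch, with the caveat that the parameter list is my guess and nothing here is confirmed to match what Bing actually does:

```python
from urllib.parse import urlparse, parse_qs

# Guessed list of parameter names commonly used for search terms.
QUERY_PARAMS = ("q", "query", "search", "s")

def extract_any_query(referrer):
    """Treat every site's query string the same and return the first
    recognizable search-term parameter, if any -- no Google special case."""
    qs = parse_qs(urlparse(referrer).query)
    for name in QUERY_PARAMS:
        if name in qs:
            return qs[name][0]
    return None
```

Even this generic code would get the bulk of its hits from Google referrers, purely because of traffic share -- which is exactly the point about signal strength above.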
Not that there's anything wrong with that! But Bing's blog assertions feel disingenuous -- they play this game well.

> Your assertion Bing is most likely to be accurate about Bing ignores self-interest and spin.
We can't apply skepticism to one source and not the other. Either Google and Bing are not blogging with self-interest and spin - and thus Bing are more trustworthy because they're blogging about themselves, or they both are blogging with self-interest and spin - and still Bing are more trustworthy because they're blogging about themselves.
You just can't legitimately discount what Bing say because of self-interest and spin without also discounting what Google say for the same reasons.
> The query "happening" to be in the URL has no "search > result" meaning without a parser being told to look for Google's particular keyword query indicators and correlate the subsequent page.
No. Remove non-alpha from the entire URL, with no preconception about search queries or any of that. You're left with "google com search q QUERYTERM". All the words apart from QUERYTERM have plenty of other signals in Bing's system. If QUERYTERM is a highly unusual word then all Bing have to go on is the data they gleaned from Google.
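That strip-everything approach really is that simple. A sketch:

```python
import re

def url_tokens(url):
    """Remove everything non-alphabetic and split -- no notion of query
    strings, parameters, or Google at all."""
    return [t for t in re.split(r"[^a-zA-Z]+", url) if t]

print(url_tokens("http://www.google.com/search?q=zxczxczxczxczx"))
# → ['http', 'www', 'google', 'com', 'search', 'q', 'zxczxczxczxczx']
```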
> We can't apply skepticism to one source and not the other. Either Google and Bing are not blogging with self-interest and spin - and thus Bing are more trustworthy because they're blogging about themselves, or they both are blogging with self-interest and spin - and still Bing are more trustworthy because they're blogging about themselves.
Libel laws prohibit indiscriminate accusation, while there's no law against puffery.
The premise that a company's own public relations messaging is more trustworthy than an outsider's, because the company's PR is about themselves, seems without merit -- otherwise we would deem companies more trustworthy whenever an outsider points fingers, and send all the journalists, whistleblowers, and wiki-leakers home. "Nope, sorry, I believe the company, because they're talking about themselves."
Google presented incontrovertible data. Bing's PR tactic is "Google does this or that worse thing and profits off it" -- distracting hand waving -- "plus we're not copying anyway" -- deliberately disingenuous.
Responsive would be:
Of course our toolbar is recognizing search terms across the top N search sites, and correlating human selected results as an indicator of search intent and result quality for those search terms. This is the same thing you do when you look at your own web stats and check inbound search terms for your own pages: 'How relevant are my pages, and am I showing my users what they are looking for?'
This is the very definition of 'improving your search experience' as outlined when you install our toolbar. We're thrilled so many of you chose our Internet Explorer browser and Bing Toolbar that this provides us meaningful data on user search intent. We want to thank Google for demonstrating we are truly 'improving your search experience' using well accepted Internet crowd-sourcing techniques.
We agree, however, that generating correlations solely from competitor listings -- when we have no existing correlation in our own data corpus -- could be misperceived, so going forward, we will not create correlations solely from competitor results where none existed in our data. However, like every webmaster, we will continue to use crowd-sourced search term and results data from across the web to refine our suggestion order towards best predicting the information you want to find.
> where's the evidence that it's exclusively search engines they're monitoring? Let alone exclusively Google? There has been no evidence yet presented
The evidence that has been presented by Google does show that MS is monitoring Google's search results, or at the very least users' behavior when using Google.
I agree that there's no evidence that Google is the only search engine that's being monitored, but it makes little difference since for all practical purposes Google == search.
There's also no evidence that MS monitors other websites / behavior other than just search engines, but it's unclear how that would work. You can't just associate two websites because some users go from one to the other (correlation vs. causation, etc.).
- - -
At this point we're still in the "he said / she said" phase, but Google has more evidence and MS is coming out as incredibly defensive (e.g., raising an inquiry from the European Commission[1]: what does that have to do with anything!!?!)
> The evidence that has been presented by Google does show that MS is monitoring Google's search results, or at the very least users' behavior when using Google.
I said "exclusively".
> there's no evidence that Google is the only search engine that's being monitored
There's no evidence that it's only search engines.
> You can't just associate two websites because some users go from one to the other
Sure you can, if they go via a link. Now you can just notice the link and say "great, there's an association", or you can monitor which links people click on and weight the graph accordingly: "great, there's an association, and this link is more popular than that one". Nice data to capture. Not constrained to search.
Google (and pretty much every other search engine) puts your search terms right up in the page title, and they show up in several other places around the page. The Bing scraper couldn't miss them.
Honestly, if Bing isn't giving Google special treatment, this whole debacle shows that their click tracking model works.
IE (with certain settings on) is sending page data back to Microsoft. If it sends the URL, title and referrer back then the following session is pretty easy to reverse engineer.
It's really just an extension of page rank by seeing what links are being clicked on and not just which links exist. Whether MS should be capturing this data under false pretenses is another issue.
If this is the case, then it's rather easy to stop Microsoft from doing this. Just use POST instead of GET in the search page if you detect the browser is IE8. The referrer will always be the generic http://www.google.com/search with no search term information.
Using POST instead of GET is not a good idea for a search results page. Most likely, the user would have to click through a dialog box ("are you sure you want to resubmit this form") every time the browser back button is used to return to the search results. Even if Google used redirects to circumvent this problem, searches wouldn't be saved in the browser's history, which is a bit inconvenient.
Changing from GET to POST in IE8 would stop Microsoft from mining data in the short term, but would drastically degrade the UX of Google for a large portion of its users, who could very easily switch to Bing.
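To make the GET/POST distinction concrete (URLs illustrative): with GET the terms are encoded into the URL itself, which is what later rides along in the Referer header of any click-through; with POST the same terms travel in the request body, invisible to the destination site.

```python
from urllib.parse import urlencode

params = {"q": "chicken enchilada"}

# GET: terms end up in the URL, hence in the referrer of any click-through.
get_url = "http://www.google.com/search?" + urlencode(params)

# POST: terms go in the request body; a click-through site would only
# ever see the bare path as the referrer.
post_url = "http://www.google.com/search"
post_body = urlencode(params)

print(get_url)   # http://www.google.com/search?q=chicken+enchilada
print(post_url)  # http://www.google.com/search
```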
OK, so for IE8 users return a single-page AJAX app instead as a variant of what's already done with Instant. Still no referrers, but no POST warning messages (which, BTW, have to be one of the most annoying things about ASP.Net web forms -- postback was a boneheaded design decision from someone who didn't understand HTTP).
Topic drift: but I'm convinced that ASP.NET postback was a very deliberate design decision from someone who perfectly understood HTTP - and is intentionally obscuring HTTP in order to keep programmers and implementers ignorant of it and dependent on the Microsoft ecosystem.
Twenty Google engineers were charged with getting Bing to hit the honeypots and given several weeks to do so. Balancing the oddity of the query terms with the level of resources Google threw into the operation, I'm not sure if 7/100 is impressive SEO or unimpressive.
No, Wikipedia admin Nihiltres intentionally created the "torsoraphy" page (at 3:03 today, no less) and made it redirect to "Tarsorrhaphy". It's not the search algorithm being clever.
"a pretty clear indication that they've special-cased clicks from google" - just like pretty much every piece of web analytics software since awstats has done for over 10 years, right? To imagine that _any_ half-serious search engine isn't special-casing any referrer information it can get its hands on from the top half of http://www.alexa.com/topsites is surely naive?
Imagine I am the Bing toolbar and I am spying on my users.
The Google search result for fake word zxczxczxczxczx is then a web page with a single link, and the user I am observing is clicking that link. So I am Bing, and I make a connection: zxczxczxczxczx (which appears on this web page) is related to said link.
Since the word is artificial and doesn't appear anywhere else, it's eventually going to produce that page as a search result.
So I think the author is right - what Google has done is they have proven that Bing toolbar tracks what users are doing, and sends the results back to Bing to improve their search.
They have not proven that Bing treats google.com differently from any other page. For that they should have seeded 100 random web pages from different domains with artificial words, and have these pages contain a link which the user with Bing toolbar installed then clicks. I bet that it would yield the same result as the Google test, e.g. those fake words would get matched to the linked web pages.
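The end-to-end loop being described here is tiny. A toy version (all names and URLs are mine, purely for illustration):

```python
from collections import defaultdict

# term -> {clicked_url: click_count}
index = defaultdict(lambda: defaultdict(int))

def observe_click(terms_seen_on_referrer, clicked_url):
    """Toolbar-style observation: associate every term seen around the
    click with the destination URL."""
    for term in terms_seen_on_referrer:
        index[term][clicked_url] += 1

def search(term):
    """Rank destinations for a term by observed click count."""
    hits = index.get(term, {})
    return sorted(hits, key=hits.get, reverse=True)

# One user, one honeypot: the fake word now resolves to the clicked page,
# because no other signal for it exists anywhere.
observe_click(["zxczxczxczxczx"], "http://example.com/honeypot")
print(search("zxczxczxczxczx"))  # ['http://example.com/honeypot']
```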
zxczxczxczxczx isn't on the web page. It's on the search results page, but that page is off limits to crawlers per robots.txt. Either MS is ignoring robots.txt, or they're parsing the URLs.
From my reading of the article, zxczxczxczxczx isn't in the search results page either, it was special cased at google to display the honeypot page for that specific query.
But is Bing "parsing the url" any different to what Google's doing when I go into Google Analytics -> Traffic Sources -> Search Engines, and select Keyword in the second dropdown (after source)? Google are clearly showing me the search terms parsed out of the referrer urls from people who found my site in Bing.
I think this is perhaps not obvious behaviour to the general public, but surely pretty much anyone who goes to the trouble of installing the Bing (or Google) toolbar has worked out for themselves that they're choosing to send data like this to Microsoft (and Google), and that it'll get used to "improve search" if it's found to be useful for that?
If that were the explanation, I would think they would come out and say it. (Maybe they will shortly.) Even so, I find it hard to believe that they wouldn't notice that most of the benefit of the technique comes from recreating google results, at which point we're back at the original ethical question of whether that's ok.
> most of the benefit of the technique comes from recreating google results
I'm not 100% sure this is the case.
Consider that a user wants a chicken enchilada recipe.
1. They go to a site that they trust for high quality data.
(I like food&wine, saveur, epicurious, rick bayless for mexican, ...)
2. They search for "chicken enchilada"
3. They select the most appealing result (not necessarily the first one), based on author, snippet, rating, photo, etc. Domain knowledge.
If you can associate (via the referrer) "chicken enchilada" with that page, you've encapsulated a lot of information - manual selection of site, manual selection of a page within that site. It's potentially useful (especially for long tail - people who have to go directly to sites because search engines fail them).
Is there too much noise? Perhaps - you'd really have to look at the raw data to find out. Maybe you only want to look at "q=", "search=", and whatever phpBB, etc use. Maybe the signal that rises out of that noise is valuable.
One big downside is that the results would be biased towards the preferences of people who install toolbars. :) The other downside is that SEOs could game the system by feeding MS bad data. (True if they're only looking at google data, too.)
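One crude way to "look at the raw data" and see whether signal rises above the noise, sketched here with an arbitrary cutoff of my own choosing:

```python
from collections import Counter

pair_counts = Counter()

def record(term, clicked_url):
    pair_counts[(term, clicked_url)] += 1

def strong_associations(min_clicks=3):
    """Keep only (term, url) pairs enough users clicked; some threshold
    like this is how signal could be lifted out of the noise."""
    return {pair: n for pair, n in pair_counts.items() if n >= min_clicks}

# Three users pick the same page; one stray click goes elsewhere.
for _ in range(3):
    record("chicken enchilada", "http://www.epicurious.com/recipe-page")
record("chicken enchilada", "http://example.com/spam")

print(strong_associations())  # only the 3-click pair survives
```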
Not only does it not seem unreasonable, but for a company like Google, I find it incredibly hard to believe that at some point they wouldn't have tried it themselves, at least as an experiment.
Actually, unless I'm mistaken, parsing the URL may well yield good search terms, but the query string is much less likely to. Look at e.g. HN -- you'd parse out id = some long number. I may be wrong, but I don't recall many sites that would have good information for a search engine in that string.
What seems far more likely is bing is using click tracking on G's results. This was explicitly not denied by their VP on the search engine panel today. If not many sites except search engines have useful keywords in the query string, that pretty much validates Google's complaint.
In fact, if you go to Google's blog post [1], the Bing toolbar opt-in specifically calls out monitoring "the searches you do, the websites you visit, [...]". And the MS guy doesn't deny using clicks on G's search results [2]. In fact, he pretty much just says they copy G on long tail searches.
> I don't recall many sites that would have good information for a search engine in that string.
Well, Google is one such site - as in, if you have bing toolbar installed and are on google.com/search?q=keyword and click on example.com, then Bing can easily extract "google com search q keyword" and associate it with example.com - without anything explicitly or intentionally relating to Google in their code.
I think their terms of use say they may pass along the contents of form fields. On a Google search results page, your search query is sitting in a text field.
Suppose it uses all text fields on the web without treating google specially.
Being as charitable as possible here, I'm willing to believe that some well-meaning engineer coded this up without special casing google. Tested it out, found that it worked amazingly well, and then launched it. This seems unlikely, but possible.
However, somewhere along the line someone must have known that the biggest benefit of this signal was recreating Google results. I'm not willing to believe that no one figured this out even if it wasn't the initial intent. At which point there's an ethical dilemma. At Google, a system like this wouldn't launch.
Regardless of the mechanism, I don't believe that nobody at bing knows that this is what was going on. Maybe it's a cynical attempt to get around robots.txt. Maybe it's an honest mistake that gradually became a dishonest mistake, but I'm not willing to believe that they are oblivious.
If you type the query [site:nytimes.com] into Google News, you've recreated a different presentation of a feed of latest news from the NYTimes. It is inherent in the search business that you're collaging material from elsewhere. And for certain heavily-qualified searches – long-tail, few mentions, hapax legomenon/'googlewhack' – a single source is likely to stick out.
Google is unavoidably a giant signal-source on the web. Even if Microsoft instead sent unique keywords to contract writers to build out findable summary/directory web pages one-by-one, what would those writers do? Research via other search engines, starting with Google, and be heavily influenced by the few (or top) results they found, highlighting the same sites. So your results would still percolate outward, via a slower, more expensive, more manual process. (Would that process, laundered through time and multiple agents, meet your ethical standards?)
Such is the nature of Google's position today. As Rich Skrenta of Blekko has put it: "The net isn't a directed graph. It's not a tree. It's a single point labeled G connected to 10 billion destination pages."
A little of Google's proprietary wisdom is leaking back out. The amount seems small compared to all the freely-offered info Google sucked in to create that wisdom. And the proprietary wisdom is leaking back out via the same sort of bulk, automated mining of implicitly expressed preferences for which Google itself is famous. So to me this seems more like karmic balance than an ethical transgression against Google.
There is a huge difference between recreating Google results and incorporating a Google honeypot link into their results through legitimate means. One link isn't a search result; a search result is an ordered list of results. If it's whole search results, then Bing's got some questions to answer and deserves some bad press. Otherwise, not so much.
Really? I doubt replicating Google long tail queries is the biggest benefit. User clicks seem like a useful signal of relevance for the same reason links from other sites are a useful signal of relevance.
Long tail queries are the ones which Google could easily demonstrate the effect without any confounding variables. I'd imagine this is affecting all of their ranking, as much as their own click data weighted by volume.
"They're either parsing the query out of the url, or violating robots.txt to fetch the result page, almost certainly the former."
Where's your evidence for any of that?
There's another way. Store the input of any text box which is immediately followed by a form submit (the search term), and then store the href of the subsequent click (the result which the user finds helpful).
The above is a way of handling it (naive and abstract) which doesn't target Google, and in fact would work for insite searches as well, or any search engine.
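Concretely, that naive, engine-agnostic scheme might look like the following (event format invented for illustration):

```python
def capture_pairs(events):
    """events: time-ordered ("submit", text) and ("click", href) tuples,
    as a toolbar might observe them. Pairs each submitted text box value
    with the first subsequent click -- no URL parsing, no search-engine
    special-casing, works for in-site search too."""
    pairs, pending = [], None
    for kind, value in events:
        if kind == "submit":
            pending = value
        elif kind == "click" and pending is not None:
            pairs.append((pending, value))
            pending = None
    return pairs

print(capture_pairs([
    ("submit", "chicken enchilada"),
    ("click", "http://www.saveur.com/enchilada-recipe"),
]))
# → [('chicken enchilada', 'http://www.saveur.com/enchilada-recipe')]
```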
Since you've already made up your mind that your two ways were the only ways of doing things, what's your opinion given the above?
I'm not sure why you are being voted down, the situation could be this. Also, Bing toolbar could simply be collecting tuples "(search_string_entered_in_toolbar, subsequent_href_clicked)" regardless of the search engine the user has specified and phoning these home... this seems like the easiest implementation (no parsing nonsense) and does not target google in particular.
Considering they're a search engine, whose job is to crawl the internet and determine the main keywords of any random page on the internet, I would imagine they have text parsing abilities that make it possible to infer the topic of any arbitrary page.