The Barbell Effect of Machine Learning (medium.com/nickbeim)
62 points by wallflower on June 6, 2016 | hide | past | favorite | 18 comments


>What allowed Google to rapidly take over the search market was not primarily its PageRank algorithm or clean interface, but these factors in combination with its early access to the data sets of AOL and Yahoo, which enabled it to train PageRank on the best available data on the planet and become substantially better at determining search relevance than any other product.

This seems like nonsense. Am I wrong in thinking that when Google still used PageRank, it wasn't based on ML?


Apart from PageRank not being based on ML, there's also the fact that Google did better search than those other services which also had the data, and it does better search than Bing now though Microsoft has no shortage of data, processing power, or money to acquire more of both.

Generally, how much data you need for ML varies with the problem. Saying that it's all in the data and large server farms is perhaps a nice talking point for VCs, but it can be completely true, completely false, or anywhere in between depending on the problem.


Yeah, it's mostly nonsense.

PageRank meant the system was less easily gamed and produced higher quality results. The real magic was combining that with their revolutionary data center operations and being able to treat a whole lab of machines as a single device which ran via map/reduce.

The clean design helped as well, but it was incremental. Google's advantage was that they produced better results faster with lower operating costs. Better and faster were good for customers. Then the rollout of relevant ads meant that Google could turn a profit even at a lower CPM than competitors, and their ads didn't have to be intrusive or distracting. The combination put them on much surer footing than their competitors, so they kept growing, increasing revenue and profit while their competitors struggled.


I think that this post sums it up and illustrates why it's hard to replicate this kind of success: Google won because Google. There were a slew of innovations (AdWords, PageRank, clean design, big data, 20% time, management style) that added up to a win.

The replication recipe - be really clever and try really hard.


Well, to be fair, you can adopt some of their component strategies and still be successful, which I'm sort of surprised isn't more common. Traditional data center management is still ubiquitous even though it's been like a decade and a half now; I'm surprised the "Google way" isn't more common. Though that is partly dependent on vertical integration in system manufacturing.

The interesting thing about a lot of the big successful tech companies is that they blend things that are often difficult to find together naturally, which makes their success difficult to replicate. Google basically runs on alien technology, they're sort of the ... I guess Tesla or SpaceX of web-apps. It's possible to do what they do, but it's so outside the range of the traditional way of doing things that it would take a tremendous amount of talent, vision, and determination to pull it off. Though google is also very dysfunctional in places so from a business sense it's not actually as hard to compete against them as it may seem. Apple is built on a weird marriage of solid aesthetic sensibilities and solid engineering fundamentals. And then Amazon which marries cutting edge fulfillment to a solid web store front-end.

For a lot of these things it'd be straightforward to try to build one aspect, but getting all of them together is a huge challenge.


I like your binaries for Apple and Amazon. My observation in other corporates is that these opposed capabilities often steal resources and opportunity from one another. Somehow that isn't happening at Apple and Amazon; I guess the question is: "why not?"


PageRank doesn't really get "trained" per se, since it's not going to be used to make predictions about other, unseen data.

However, it is learnt from the data: links from one document to another are recorded in a huge adjacency matrix (sort of), and the PageRanks are the dominant eigenvector of that matrix.
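To make that concrete, here's a minimal power-iteration sketch of the idea, assuming the usual damped formulation (the toy web and all names are my own, not from the original paper):

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Power iteration on the column-stochastic link matrix."""
    n = adj.shape[0]
    # Normalize columns so each page splits its "vote" among its out-links.
    out = adj.sum(axis=0)
    out[out == 0] = 1  # guard against dangling pages with no out-links
    M = adj / out
    rank = np.full(n, 1.0 / n)
    while True:
        new = (1 - damping) / n + damping * M @ rank
        if np.abs(new - rank).sum() < tol:
            return new
        rank = new

# Toy 3-page web: adj[i, j] = 1 means page j links to page i.
# Page 0 links to 1 and 2; pages 1 and 2 both link back to 0.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
print(pagerank(adj))  # page 0 collects the most rank
```

No labels, no training targets; the ranks fall out of the link structure alone, which is why calling it "trained" is a stretch.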

The original paper never mentions AOL or Yahoo (except as an example of a popular website), but it does describe how they built their own crawler (see here: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf). I suppose it's possible they got data from those companies later.


It's a kind of unsupervised learning, an intrinsic measure of the network.


It's wrong - PageRank isn't trained.

But the traffic Google got from the Yahoo and AOL deals did allow them to see user behavior, and that did allow them to improve their search ranking algorithms (which of course wasn't really PageRank anymore).


PageRank still "learned" from the data, and it wouldn't be able to do that without a ton of data. So while the data wasn't the most important thing to get Google off the ground, it was arguably equally important as the algorithm.

Though as others have said, after the starting gun went off, a lot of other factors came into play, because PageRank, like other algorithms, is easily gamed on its own.


PageRank doesn't learn from the data. It just calculates metrics on the data. It's totally different from Machine Learning.


It's definitely different from supervised learning. However, I think you could make a case for it being some form of unsupervised learning, and it's not as though there's a codified list of what is "Machine Learning" and what isn't.


Well, in that case calculating the sample mean of a dataset is machine learning too. So is sorting a list alphabetically. Regardless of whether it's supervised or unsupervised, how can you speak of machine learning if the metric is fixed before seeing any data? PageRank is human learning: people look at a bunch of pages they want categorized, figure out an intuitively appealing way to categorize them, and then encode those rules in an algorithm.

Edit: after thinking about this a bit more, I guess you could in fact think of e.g. k-means clustering as just a very advanced form of descriptive statistics, perhaps not fundamentally different from calculating a mean or a kernel density estimate. And in that sense PageRank would be unsupervised learning too, but it still feels to me like that's obscuring rather than clarifying how it works?
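A bare-bones Lloyd's-algorithm sketch makes the "advanced descriptive statistics" reading concrete: the only statistic the update step ever computes is a per-cluster sample mean (toy data and names made up for illustration):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternately assign points and recompute means."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(
            ((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # The update is nothing fancier than a per-cluster sample mean.
        centers = np.array([points[labels == j].mean(axis=0)
                            for j in range(k)])
    return centers, labels

# Two obvious blobs, around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(0, 0.5, (50, 2)),
                      rng.normal(10, 0.5, (50, 2))])
centers, labels = kmeans(pts, 2)
```

So "descriptive statistics computed iteratively" is a defensible framing, even if, as the parent says, it may obscure more than it clarifies.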


It's not obvious that the data came from AOL or Yahoo. The initial papers describe Brin and Page's attempts to design their own web crawler. Google collecting the data itself, instead of getting it from an incumbent, undermines the article's thesis quite a bit.

That said, one interpretation of the PageRank model is of a user randomly clicking from page to page, occasionally giving up and starting over. AOL would have access to data like this, which might help them refine the "damping" term in the PageRank model.
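That random-surfer reading can be simulated directly. A toy Monte Carlo sketch (hypothetical three-page web, my own names): with probability `damping` the surfer follows a random out-link, otherwise they give up and restart on a random page, and long-run visit frequencies approximate PageRank:

```python
import random

def surf(links, steps=200_000, damping=0.85, seed=42):
    """Monte Carlo random surfer: visit frequencies approximate PageRank."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < damping and links[page]:
            page = rng.choice(links[page])   # follow a random out-link
        else:
            page = rng.choice(pages)         # give up and start over
    return {p: v / steps for p, v in visits.items()}

# Toy web: page 0 links to 1 and 2; pages 1 and 2 link back to 0.
links = {0: [1, 2], 1: [0], 2: [0]}
print(surf(links))  # page 0 gets visited roughly half the time
```

Real click logs would tell you how often users actually "give up and start over," which is exactly the kind of evidence you'd want for choosing the damping term.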


I don't think data is as important as they make it out to be. Yes, it was important at some point, when they scaled supervised learning from thousands to millions or billions of examples. But after that, did they continue scaling up to trillions of examples? No, because there are diminishing returns. Maybe the model can't handle that much data, or the accuracy doesn't increase much, or the data is just too big for current computing power. The kind of data used today, even at Google scale, is just millions or billions of examples.

On the other hand, what kind of data could be so secret that only the big companies could accumulate? If you have access to a service, you can mine it and construct a dataset. Then you can train your own model to imitate their results. If this process is done right, it becomes easy to do transfer learning from other AI systems, even if they keep their algorithms secret.

For example, Google's latest POS tagger SyntaxNet, which made a splash a few days ago, was trained on the results of the Stanford parser. Interestingly, the student model surpassed the teacher model - probably because it was better at cancelling out errors in the training set and generalizing better from examples.
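The mine-a-service-and-imitate idea can be sketched in miniature. Everything below is made up for illustration: a "teacher" function stands in for a black-box service you can only query, and a logistic-regression "student" is trained purely on the teacher's answers:

```python
import numpy as np

# Hypothetical black-box "teacher": stand-in for a service we can only
# query, not inspect (here just a thresholded linear rule).
def teacher(x):
    return (2 * x[:, 0] - x[:, 1] > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))   # "mined" queries against the service
y = teacher(X)                   # its answers become our training labels

# Student: plain logistic regression fit by gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= X.T @ (p - y) / len(X)
    b -= (p - y).mean()

# How often does the student reproduce the teacher's verdict?
agreement = ((X @ w + b > 0) == (y > 0.5)).mean()
```

The student never sees the teacher's internals, only its outputs, which is the sense in which keeping the algorithm secret doesn't protect you once the data flows.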

If data is the endgame for these companies, then it will be a target of espionage, leaks and public disclosures. A hard drive containing the prized dataset of some company can be copied in a few minutes, and if one copy escapes, it circulates and effectively becomes public domain. So it's hard to protect data. Data likes to flow (remember the OkCupid dataset that was crawled without permission from behind the login wall?). Flowing data can be turned back into machine learning models.


I assume people claim IP protection of one kind or another on these datasets, and even if their claims were spurious, they would still make trouble.

That's not going to stop some hacker from crawling OkCupid and blackmailing people. But if Google or Microsoft suspects FooStartup Inc. is using one of their datasets in its service, then the lawyers will descend.


> Proprietary algorithms can help, but they are secondary in importance to the data sets themselves.

This is flat wrong. No algo, no intelligence or synthetic cognition.

> The dramatic rise of Google provides a glimpse into what this kind of privileged access can enable. What allowed Google to rapidly take over the search market was not primarily its PageRank algorithm or clean interface, but these factors in combination with its early access to the data sets of AOL and Yahoo, which enabled it to train PageRank on the best available data on the planet and become substantially better at determining search relevance than any other product.

This is so wrong on so many fronts. A) They had open access to the web, just like DMOZ and Yahoo and many others, via crawling systems. B) They were attractive to software engineers, who in turn made their IT depts in large corps switch to Google as the default search engine. C) They stole the ad matching algo from Bill Gross, which in turn made them successful. D) Too many other factors to list.

Let's also not forget that PageRank was a variant of an earlier link analysis algo that had been around a while.


On the first statement, I think it's mostly right; he's merely re-stating what Peter Norvig has said all along, albeit not as precisely.

On the second statement, I believe you both missed the point. Google was able to acquire click-through data, which helped train its ranking algorithms, by establishing the partnerships with Yahoo and AOL. It was the implicit feedback, not the crawled data, that helped. But that obviously wasn't the only factor: having very smart engineers/scientists matters. I think the OP's thesis is that now that a lot more is open sourced, computing power is cheaper, etc., data is the hardest thing to come by, and on this I fully agree.



