
This is super interesting to me. Is there anything else you can share about how you approached this? In my experience scraping Google, I have found roughly the same thing: once you've passed the captcha test, you can scrape a lot more.

Were you scraping with real browsers or something like Mechanize/curl? Were you rate limiting at all? Proxies or real servers?



We leased our own pool of IPs; proxy services get abused and end up poisoned.

We didn't rate limit. We would just increase the size of the cookie pool if a captcha was hit, which was rare: each session only scraped n pages, up to a threshold, so that it wouldn't get captcha'ed and we wouldn't have to solve a captcha for it.

We had two pools, the primary pool and the "chilling" pool. Cookies near their captcha life would cool off for a few hours before returning to the active pool, which behaves just like any other resource pool: every page scraped would "borrow" a cookie out of the pool, customize the encrypted location key, and make the request with a common user agent string.
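
For anyone trying to picture the two-pool setup, here is a minimal sketch in Python (not the Erlang we actually ran, and not our real code). The names CookiePool, Session, page_threshold and cooldown_seconds are all made up for illustration, and the default numbers are placeholders, not values from our system.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Session:
    cookie: str          # opaque cookie string (hypothetical representation)
    pages_used: int = 0  # pages scraped since the last cool-off
    cooling_until: float = 0.0


class CookiePool:
    """Active pool plus a "chilling" pool, as described above (illustrative only)."""

    def __init__(self, cookies, page_threshold=50, cooldown_seconds=3 * 3600,
                 clock=time.monotonic):
        # page_threshold and cooldown_seconds are assumed placeholder values.
        self.active = deque(Session(c) for c in cookies)
        self.chilling = deque()
        self.page_threshold = page_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock

    def _thaw(self):
        # The cooldown is constant, so the chilling queue is already ordered
        # by wake-up time and only the front needs checking.
        now = self.clock()
        while self.chilling and self.chilling[0].cooling_until <= now:
            session = self.chilling.popleft()
            session.pages_used = 0
            self.active.append(session)

    def borrow(self):
        """Take a session out of the active pool for one page request."""
        self._thaw()
        if not self.active:
            raise LookupError("no active sessions; the pool needs to grow")
        return self.active.popleft()

    def give_back(self, session):
        """Return a session; park it in the chilling pool once it hits the threshold."""
        session.pages_used += 1
        if session.pages_used >= self.page_threshold:
            session.cooling_until = self.clock() + self.cooldown_seconds
            self.chilling.append(session)
        else:
            self.active.append(session)
```

Each page fetch would call borrow(), make its request, then call give_back(). Passing the clock in as a parameter is just a convenience so the cool-off logic can be exercised without waiting hours.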

Scaling it was difficult, but once we had it figured out, Erlang proved invaluable to us. Our dependence on IPs also dropped once we figured out the cookie methodology.

Solving captchas is cheaper than renting IPs.


Thank you for the reply! Sorry, but one last question: can you share any more about this statement, "customize the encrypted location key"?


When you set the location in Google, it customizes the cookie with a field named with a capital L, I believe. That field is encoded or encrypted and I could never figure it out, so I just built a lookup table (a "rainbow table" of sorts): I used PhantomJS to set a location and scrape the cookie out, then paired the known location value with the encrypted value so that we could customize "the location of the search".
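
Roughly, the table is nothing more than a dictionary from a known location to the opaque value that was observed for it. A small Python sketch follows, with the caveat that the field name "L" is only my recollection, the class and function names are hypothetical, and the encoded values in the usage example are made-up placeholders rather than real cookie contents.

```python
class LocationTable:
    """Maps a known location string to the opaque cookie value observed for it."""

    def __init__(self):
        self._encoded_by_location = {}

    @staticmethod
    def _normalize(location):
        # Normalize so "Austin, TX" and " austin, tx " share one entry.
        return " ".join(location.lower().split())

    def record(self, location, encoded_value):
        """Store a pair harvested by setting the location in a real browser."""
        self._encoded_by_location[self._normalize(location)] = encoded_value

    def lookup(self, location):
        """Return the observed value; raises KeyError for unseen locations."""
        return self._encoded_by_location[self._normalize(location)]


def with_location(cookie_fields, table, location, field="L"):
    """Return a copy of the cookie fields with the location field swapped in.

    `field` defaults to "L", which is an assumption about the field name.
    """
    updated = dict(cookie_fields)
    updated[field] = table.lookup(location)
    return updated


# Usage with placeholder values (not real cookie data):
table = LocationTable()
table.record("Austin, TX", "opaque-value-1")
borrowed = {"SID": "session-placeholder", "L": "opaque-value-0"}
request_cookie = with_location(borrowed, table, " austin, tx ")
```

Nothing is ever decoded here; the table only replays values that were seen before, which is why a location has to be harvested once before it can be used.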


Oh, and I just used a generic HTTP client in Erlang and an XPath/HTML parser to extract what we needed.



