
We had a pool of IPs we were leasing on our own, proxy services get abused and are poisoned.

We didn't rate limit; instead, we would increase the size of the cookie pool whenever a captcha was hit. That was rare, because we would only scrape n pages per session, staying under the threshold that would get the session captcha'd, so we never had to solve one. We kept two pools: the primary pool and a "chilling" pool. Cookies near the end of their captcha life would cool off for a few hours before returning to the active pool, which behaved just like any other resource pool: every page scraped would "borrow" a cookie out of the pool, customize the encrypted location key, and make the request with a common user-agent string.
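A minimal sketch of the two-pool rotation described above, in Python for illustration (the original was Erlang). The class and constant names (CookiePool, CAPTCHA_THRESHOLD, COOL_OFF_SECONDS) are assumptions, as are the specific threshold and cool-off values; the comment only says cookies "near their captcha life" cool off "for a few hours".

```python
import time
from collections import deque

# Assumed values, not from the original system.
CAPTCHA_THRESHOLD = 50           # pages a cookie may serve before cooling off
COOL_OFF_SECONDS = 3 * 60 * 60   # "a few hours" in the chilling pool

class CookiePool:
    def __init__(self, cookies):
        # Primary pool: (cookie, pages scraped with it so far).
        self.primary = deque((c, 0) for c in cookies)
        # Chilling pool: (cookie, time it entered cool-off).
        self.chilling = []

    def _reactivate(self, now):
        # Cookies whose cool-off has elapsed rejoin the primary pool
        # with a reset usage counter.
        still_cooling = []
        for cookie, since in self.chilling:
            if now - since >= COOL_OFF_SECONDS:
                self.primary.append((cookie, 0))
            else:
                still_cooling.append((cookie, since))
        self.chilling = still_cooling

    def borrow(self, now=None):
        """Borrow a cookie and its current usage count for one page scrape."""
        now = time.time() if now is None else now
        self._reactivate(now)
        return self.primary.popleft()

    def give_back(self, cookie, used, now=None):
        """Return a cookie; send it to cool off if it is near its captcha life."""
        now = time.time() if now is None else now
        if used + 1 >= CAPTCHA_THRESHOLD:
            self.chilling.append((cookie, now))
        else:
            self.primary.append((cookie, used + 1))
```

Each scrape would borrow a cookie, customize the location key, make the request, and give the cookie back; the pool itself decides when a cookie needs to chill.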

Scaling it was difficult, but Erlang proved invaluable once we got it working, and our dependence on IPs dropped after we figured out the cookie methodology.

Solving captchas is cheaper than renting IPs.



Thank you for the reply! Sorry, but one last question: can you share any more about this statement, "customize the encrypted location key"?


When you set the location in Google, it customizes the cookie with a named field (a capital L, I believe). That field is encoded or encrypted, and I could never figure it out, so I constructed a rainbow table instead: I used PhantomJS to set a location and scrape the cookie out, pairing the known location value with the encrypted value, so that we could customize "the location of the search".
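A sketch of how that lookup table could be used, assuming ":"-separated cookie fields and a field named "L" (the commenter's recollection, not a documented format). The encrypted values below are made-up placeholders; in practice each entry would be scraped via a headless browser as described.

```python
# Hypothetical precomputed table: plain location -> opaque encrypted value.
# The values are placeholders, not real Google cookie data.
LOCATION_TABLE = {
    "new york, ny": "ENC_PLACEHOLDER_NYC",
    "austin, tx": "ENC_PLACEHOLDER_ATX",
}

def set_search_location(cookie: str, location: str) -> str:
    """Rewrite the (assumed) 'L' field of a cookie to target a known location."""
    encrypted = LOCATION_TABLE[location.lower()]
    fields = []
    replaced = False
    for field in cookie.split(":"):
        if field.startswith("L="):
            fields.append("L=" + encrypted)  # swap in the table's value
            replaced = True
        else:
            fields.append(field)
    if not replaced:
        fields.append("L=" + encrypted)      # cookie had no location yet
    return ":".join(fields)
```

The point of the rainbow table is that the encryption never has to be reversed: as long as each target location has been observed once, its opaque value can be replayed into any borrowed cookie.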



