Straight screen scrapin', yo. I worked for a similar startup that collected more detailed information than Yodlee/Mint; it was a product for financial managers instead of consumers. We collected over a million transactions per night from over 3,000 financial institutions. It was no joke. You might think screen scraping is silly, but the bottom line is that even when a bank had an API (OFX, and very few offer OFX) or formatted data downloads (CSV, XLS), that data tended to be stale or incorrect. The reasoning behind that is more eyeballs are on the web pages, so bugs and inconsistencies get noticed quicker. There was more of an expectation for the web pages to be accurate.
Would you please share some code samples or favorite blog posts on how you're using phpQuery as part of a scrape app? What set of tools/libraries are you using?
I'd really like to hear more about the current state of the art (without you telling any company secrets). I have quite a lot of experience scraping utility and government websites (no JavaScript) in Perl + LWP... but I'm getting a little tired of Perl and am looking to give a new toolset a try. Preferably one that can handle a broader range of modern websites.
I will throw you one bone, but past that, I'm careful, because my work here is not my own; it's my employer's.
    libxml_use_internal_errors(true);
This little function call is the secret to solving what at first seem to be intractable memory leaks. The trouble is that the scraper uses libxml, and libxml issues a notice/warning every time the HTML is malformed. Without this call, those errors bubble up to the PHP error handler, and that murders performance and memory usage.
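To make that concrete, here's a minimal sketch of the pattern (the URL and the phpQuery setup are illustrative, not from my actual code):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // Silence libxml's complaints about malformed HTML up front.
    libxml_use_internal_errors(true);

    $html = file_get_contents('http://example.com/statement.html'); // made-up URL
    $doc  = phpQuery::newDocumentHTML($html);

    // ... extract what you need with pq() selectors ...

    // Drop the errors libxml buffered internally so they don't pile up.
    libxml_clear_errors();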
One more, I suppose...
If you scrape inside a loop (and unless you're using a distributed job queue, if you're scraping more than one URL at a time, you almost certainly are), a missing unloadDocument() call is going to cost you on each iteration. The objects it creates, IIRC, have some circular-reference issues, and if you don't explicitly call unloadDocument() you'll run into trouble. (I suppose it should be OK, though, if you've enabled the circular-reference garbage collector in PHP 5.3.)
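A rough sketch of what that loop discipline looks like (the URLs and selector are made up, and you should check the exact unloadDocument() method name against your phpQuery version):

    <?php
    require_once 'phpQuery/phpQuery.php';

    libxml_use_internal_errors(true);

    $urls = array('http://example.com/a.html', 'http://example.com/b.html'); // illustrative

    foreach ($urls as $url) {
        // Load the page into a fresh phpQuery document.
        $doc = phpQuery::newDocumentFileHTML($url);

        foreach ($doc->find('table.transactions tr') as $row) {
            // ... pull out the cells you care about ...
        }

        // Break the circular references before the next iteration.
        $doc->unloadDocument();
        libxml_clear_errors();
    }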
And, generally, a tip... sometimes it's tempting to write a simple regex instead of a chain like, say, pq($this->node->find('a')->get(0))->attr('href').
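For the sake of illustration, here's roughly what the two approaches look like side by side (the markup here is hypothetical):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // Hypothetical markup, just to compare the two styles.
    $html = '<div class="row"><a href="/account/123">Checking</a></div>';

    // The phpQuery chain from above:
    $doc  = phpQuery::newDocumentHTML($html);
    $href = pq($doc->find('a')->get(0))->attr('href');

    // The tempting regex shortcut:
    preg_match('/<a[^>]+href="([^"]*)"/', $html, $m);
    $href = $m[1];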
Thanks for the tips. You most likely saved me a bunch of headaches, and I appreciate it.
I'm pretty intrigued by a library that can apparently handle AJAX/JSON updates and content creation. Heh... I thought PHP was only for page generation and had no idea you could use it for something like web scraping.
So it should be fun playing with it.
My email's in my profile if you ever want to talk shop.
Changes happen, and that's why I was employed. But this is mitigated by the fact that most big banks probably have miles of red tape just to deploy a fix for a typo, and the smaller banks used off-the-shelf products that were rarely upgraded. When they were, we identified these products and could group together the banks that used similar software, so the gathering of data was essentially the same.
edit: we also built and deployed to production every night, so we had no problem keeping up. Sometimes we'd even deploy midday if we felt a fix needed to go out immediately.
Most "screen scraping" these days is just extracting content from web pages.
1) Write a program that can load webpages as if it was a user of the site.
2) Have it save everything it loads.
3) Write a program that can extract the data you care about out of the html and put it into a more useful format (or into a database or something)
Many languages have libraries for this, or you can use a tool like cURL or wget. I do this a lot with Perl and the LWP family of modules, but the sites I work on don't use JavaScript or DOM manipulation... There's so much JavaScript and AJAX out there now, though, that I'm not sure whether you can scrape those kinds of sites with Perl.
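Sticking with PHP since that's what the phpQuery discussion above uses, here's a hedged sketch of those three steps (the URL, file path, and selector are all invented):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // Step 1: load the page the way a user's browser would (cURL here).
    $ch = curl_init('http://example.com/data.html'); // illustrative URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // look like a normal visitor
    $html = curl_exec($ch);
    curl_close($ch);

    // Step 2: save everything you load, so you can re-parse without re-fetching.
    file_put_contents('/tmp/page-' . date('Ymd-His') . '.html', $html);

    // Step 3: extract the data you care about into a more useful format.
    libxml_use_internal_errors(true);
    $doc  = phpQuery::newDocumentHTML($html);
    $rows = array();
    foreach ($doc->find('table#data tr') as $tr) {
        $rows[] = trim(pq($tr)->text());
    }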
Screen scraping means: you write a web crawler which loads up the web page (in this case, takes your bank login username and password, puts them into the login form on the bank's website, pretends to be you, and loads up the relevant web pages). Then you write an HTML parser which grabs the relevant bits from the bank's web page (account balance, number, name, etc.) and stores those bits somewhere useful in the local database.
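A rough sketch of the "pretends to be you" half, using PHP's cURL extension (the URLs, field names, and credentials are all invented; every bank's login form differs):

    <?php
    $user = 'alice';   // placeholder credentials
    $pass = 'hunter2';

    // POST the login form and keep the session cookie.
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => 'https://bank.example.com/login',
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(array(
            'username' => $user,
            'password' => $pass,
        )),
        CURLOPT_COOKIEJAR      => '/tmp/bank_cookies.txt', // persist the session
        CURLOPT_COOKIEFILE     => '/tmp/bank_cookies.txt',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ));
    curl_exec($ch);

    // Now fetch the accounts page with the authenticated session.
    curl_setopt($ch, CURLOPT_URL, 'https://bank.example.com/accounts');
    curl_setopt($ch, CURLOPT_POST, false);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... hand $html off to the parsing step ...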
"Screen Scraping" seems to be a sort of misnomer here. Essentially they are just loading a URL and extracting the information from whatever is returned. Whether that happens intelligently, or if they are just making specific scrapers for each bank, I have no idea.
But basically: "look for the number in this div region; this is the account balance", etc.
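In phpQuery terms, that rule of thumb might look something like this (the selector and markup are entirely hypothetical):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // "Look for the number in this div region, this is the account balance."
    $html = '<div class="account-balance">$1,234.56</div>'; // hypothetical markup

    $doc     = phpQuery::newDocumentHTML($html);
    $text    = trim(pq('div.account-balance')->text());          // "$1,234.56"
    $balance = (float) str_replace(array('$', ','), '', $text);  // 1234.56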