Straight screen scrapin', yo. I worked for a similar startup that collected more detailed information than Yodlee/Mint; it was a product for financial managers instead of consumers. We collected over a million transactions per night from over 3,000 financial institutions. It was no joke. You might think screen scraping is silly, but the bottom line is that even when a bank had an API (OFX, and very few offer OFX) or formatted data downloads (CSV, XLS), that data tended to be stale or incorrect. The reasoning behind that is more eyeballs are on the web pages, so bugs and inconsistencies get noticed quicker. There was more of an expectation for the web pages to be accurate.
Would you please share some code samples or favorite blog posts on how you're using phpQuery as part of a scrape app? What set of tools/libraries are you using?
I'd really like to hear more about the current state of the art (without you telling any company secrets). I have quite a lot of experience scraping utility and government websites (no JavaScript) in Perl + LWP... but I'm getting a little tired of Perl and am looking to give a new toolset a try. Preferably one that can handle a broader range of modern websites.
I will throw you one bone, but past that, I'm careful, because my work here is not my own; it's my employer's.
    libxml_use_internal_errors(true);
This little function call is the secret to solving what at first seem to be intractable memory leaks. The trouble is that the scraper uses libxml, and libxml issues a notice/warning every time the HTML is malformed. Without this call, those errors bubble up to the PHP error handler, and that murders performance and memory usage.
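To make that concrete, here's a minimal sketch of the pattern (the URL and the phpQuery setup are illustrative, not from my actual code):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // Silence libxml's complaints about malformed HTML up front.
    libxml_use_internal_errors(true);

    $html = file_get_contents('http://example.com/statement.html'); // made-up URL
    $doc  = phpQuery::newDocumentHTML($html);

    // ... extract what you need with pq() selectors ...

    // Drop the errors libxml buffered internally so they don't pile up.
    libxml_clear_errors();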
One more, I suppose...
If you scrape inside a loop (and unless you're using a distributed job queue, if you're scraping more than one URL at a time, you almost certainly are), a missing unloadDocument() call is going to cost you on each iteration. The objects it creates, IIRC, have some circular-reference issues, and if you don't explicitly call unloadDocument() you'll run into trouble. (I suppose it should be OK, though, if you've enabled the circular-reference garbage collector in PHP 5.3.)
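A rough sketch of what that loop discipline looks like (the URLs and selector are made up, and you should check the exact unloadDocument() method name against your phpQuery version):

    <?php
    require_once 'phpQuery/phpQuery.php';

    libxml_use_internal_errors(true);

    $urls = array('http://example.com/a.html', 'http://example.com/b.html'); // illustrative

    foreach ($urls as $url) {
        // Load the page into a fresh phpQuery document.
        $doc = phpQuery::newDocumentFileHTML($url);

        foreach ($doc->find('table.transactions tr') as $row) {
            // ... pull out the cells you care about ...
        }

        // Break the circular references before the next iteration.
        $doc->unloadDocument();
        libxml_clear_errors();
    }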
And, generally, a tip... sometimes it's tempting to write a simple regex instead of a chain like, say, pq($this->node->find('a')->get(0))->attr('href').
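For the sake of illustration, here's roughly what the two approaches look like side by side (the markup here is hypothetical):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // Hypothetical markup, just to compare the two styles.
    $html = '<div class="row"><a href="/account/123">Checking</a></div>';

    // The phpQuery chain from above:
    $doc  = phpQuery::newDocumentHTML($html);
    $href = pq($doc->find('a')->get(0))->attr('href');

    // The tempting regex shortcut:
    preg_match('/<a[^>]+href="([^"]*)"/', $html, $m);
    $href = $m[1];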
Thanks for the tips. You most likely saved me a bunch of headaches, and I appreciate it.
I'm pretty intrigued by a library that can apparently handle AJAX/JSON updates and content creation. Heh... I thought PHP was only for page generation and had no idea you could use it for something like web scraping.
So it should be fun playing with it.
My email's in my profile if you ever want to talk shop.
Changes happen, and that's why I was employed. But this is mitigated by the fact that most big banks probably have miles of red tape just to deploy a fix for a typo, and the smaller banks used off-the-shelf products that were rarely upgraded. When they were, we identified these products and could group together the banks that used similar software, so the gathering of data was essentially the same.
edit: we also built and deployed to production every night, so we had no problem keeping up. Sometimes we'd even deploy midday if we felt a fix needed to go out immediately.
Most "screen scraping" these days is just extracting content from web pages.
1) Write a program that can load webpages as if it was a user of the site.
2) Have it save everything it loads.
3) Write a program that can extract the data you care about out of the html and put it into a more useful format (or into a database or something)
Many languages have libraries for this, or you can use a tool like cURL or wget. I do this a lot with Perl and the LWP family of modules, but the sites I work on don't use JavaScript or DOM manipulation... There's so much JavaScript and AJAX out there now, though, that I'm not sure whether you can scrape those kinds of sites with Perl.
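Sticking with PHP since that's what the phpQuery discussion above uses, here's a hedged sketch of those three steps (the URL, file path, and selector are all invented):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // Step 1: load the page the way a user's browser would (cURL here).
    $ch = curl_init('http://example.com/data.html'); // illustrative URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // look like a normal visitor
    $html = curl_exec($ch);
    curl_close($ch);

    // Step 2: save everything you load, so you can re-parse without re-fetching.
    file_put_contents('/tmp/page-' . date('Ymd-His') . '.html', $html);

    // Step 3: extract the data you care about into a more useful format.
    libxml_use_internal_errors(true);
    $doc  = phpQuery::newDocumentHTML($html);
    $rows = array();
    foreach ($doc->find('table#data tr') as $tr) {
        $rows[] = trim(pq($tr)->text());
    }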
Screen scraping means: you write a web crawler which loads up the web page (in this case, takes your bank login username and password, puts them into the login form on the bank's website, pretends to be you, and loads up the relevant web pages). Then you write an HTML parser which grabs the relevant bits from the bank's web page (account balance, number, name, etc.) and stores those bits somewhere useful in the local database.
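A rough sketch of the "pretends to be you" half, using PHP's cURL extension (the URLs, field names, and credentials are all invented; every bank's login form differs):

    <?php
    $user = 'alice';   // placeholder credentials
    $pass = 'hunter2';

    // POST the login form and keep the session cookie.
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => 'https://bank.example.com/login',
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(array(
            'username' => $user,
            'password' => $pass,
        )),
        CURLOPT_COOKIEJAR      => '/tmp/bank_cookies.txt', // persist the session
        CURLOPT_COOKIEFILE     => '/tmp/bank_cookies.txt',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ));
    curl_exec($ch);

    // Now fetch the accounts page with the authenticated session.
    curl_setopt($ch, CURLOPT_URL, 'https://bank.example.com/accounts');
    curl_setopt($ch, CURLOPT_POST, false);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... hand $html off to the parsing step ...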
"Screen Scraping" seems to be a sort of misnomer here. Essentially they are just loading a URL and extracting the information from whatever is returned. Whether that happens intelligently, or if they are just making specific scrapers for each bank, I have no idea.
But basically: "look for the number in this div region; this is the account balance", etc.
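In phpQuery terms, that rule of thumb might look something like this (the selector and markup are entirely hypothetical):

    <?php
    require_once 'phpQuery/phpQuery.php';

    // "Look for the number in this div region, this is the account balance."
    $html = '<div class="account-balance">$1,234.56</div>'; // hypothetical markup

    $doc     = phpQuery::newDocumentHTML($html);
    $text    = trim(pq('div.account-balance')->text());          // "$1,234.56"
    $balance = (float) str_replace(array('$', ','), '', $text);  // 1234.56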