I've just launched No-NSFW (an NSFW content warning system), which relies on user feedback to determine site ratings.
I'm now thinking of introducing a Bayesian filter to classify site content. Does this make sense?
Also, where do I hunt for seed data? I'm using nsfw.reddit for NSFW data (thanks kirubakaran), but what do I use for SFW data?
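For anyone unsure what a Bayesian filter would look like here, below is a minimal sketch of a naive Bayes text classifier with add-one smoothing. It is not No-NSFW's actual code: the class name `NaiveBayesFilter`, the labels `"nsfw"` and `"sfw"`, the tokenizer, and the training strings are all made up for illustration. Real seed data would be page text pulled from the two sources.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Hypothetical two-class naive Bayes filter (illustration only)."""

    LABELS = ("nsfw", "sfw")

    def __init__(self):
        # Per-label word frequencies and per-label document counts.
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    @staticmethod
    def tokenize(text):
        # Crude tokenizer (an assumption): lowercase alphabetic runs.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def log_scores(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        tokens = self.tokenize(text)
        scores = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Smoothed prior, so an untrained label never gives log(0).
            score = math.log((self.doc_counts[label] + 1)
                             / (total_docs + len(self.LABELS)))
            for word in tokens:
                # Add-one smoothing; the extra +1 reserves mass for unseen words.
                score += math.log((counts[word] + 1)
                                  / (total_words + len(vocab) + 1))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, made-up seed data standing in for scraped page text.
f = NaiveBayesFilter()
f.train("explicit adult photo gallery", "nsfw")
f.train("adult explicit video", "nsfw")
f.train("python programming tutorial", "sfw")
f.train("startup funding news", "sfw")
print(f.classify("explicit adult video"))
```

The point about seed data shows up directly in the sketch: the filter only knows words it has seen under each label, so the SFW corpus needs to be as broad as the sites users actually visit, otherwise ordinary pages just look "unknown" rather than safe.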
I'm not sure what you are looking for in terms of safe-for-work data; maybe Technorati tags?