
The sample images are of two types: images which are mostly of the subject (cat or dog), and images which have a cat or dog in them, but are not necessarily focused on them.

In computer vision, these two types of images are traditionally handled separately. First, a detector for a class (like "dog" or "cat") is run across the image at all locations and multiple scales to find where the things are. Once you have the locations, then an image classification algorithm is run for each detection window to either confirm it, or to give you more information about the object.
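That detect-then-classify pipeline can be sketched as a multi-scale sliding-window scan. Everything below is illustrative: `score_fn` stands in for a real trained detector, and the scale loop emulates an image pyramid by enlarging the window in the original image rather than resampling it.

```python
import numpy as np

def sliding_windows(image, window=64, stride=32, scales=(1.0, 0.75, 0.5)):
    """Yield (x, y, scale, patch) for detection windows at several scales.

    Scanning at scale s with a fixed window is emulated here by taking a
    larger window (size window / s) in the original image, which keeps the
    sketch dependency-free; a real system would build an image pyramid.
    """
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        size = int(window / s)  # window size mapped back to original pixels
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                y0, x0 = int(y / s), int(x / s)
                yield x0, y0, s, image[y0:y0 + size, x0:x0 + size]

def detect(image, score_fn, threshold=0.5):
    """Score every window with a per-window detector and return confident
    hits, strongest first. A real detector would also apply non-maximum
    suppression to merge overlapping hits."""
    hits = [(score_fn(p), x, y, s) for x, y, s, p in sliding_windows(image)]
    return sorted([h for h in hits if h[0] > threshold], reverse=True)
```

With a stub scorer like `lambda patch: patch.mean()`, a bright square on a dark background comes back as the top hit; in a real pipeline `score_fn` would be the trained cat/dog detector, and each surviving window would then be passed to the classification stage.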

The latter often takes the form of giving more fine-grained category information, such as what species of dog/cat it is. Both Leafsnap [1] and Dogsnap [2] are programs of this type; i.e., they both assume that you've captured a single subject, roughly centered in the photo window, and that you already know that it's a plant/dog.

Sometimes you can skip the detector even when the object is not the focus of the image, because the context/setting narrows down the answer for you. For example, if you were deciding between dogs and airplanes, it would be pretty unlikely to see a dog on a runway or a plane in a living room, so just by classifying the entire image you can do reasonably well. That's not the case here, as dogs and cats will, for the most part, appear in pretty similar environments.

So if I were attacking this problem, I'd first see how many images were of the non-focused type. If not many, I'd basically ignore them and focus on building a classification system. Note also that if you're constrained to make a hard choice between only two classes, that's a much easier problem than a more open-ended "what is this?"

As many have pointed out, deep learning approaches seem to be the current state of the art on classification tasks such as these. But deep learning requires a lot of training data to be effective. A procedure I've heard many people use to great success is to use the ImageNet [3] hierarchy and images to train a deep learning classifier (i.e., as if you were going to compete in the ImageNet Large Scale Visual Recognition Challenge [4]). Then take the trained network, chop off the last stage (which makes the final prediction), and replace it with an SVM trained on your specific training data. In this way, you'd be using the network only as a feature extractor.
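A rough sketch of how those pieces connect, with big caveats: a real pipeline would push images through a network actually pretrained on ImageNet and use a library SVM. Here a fixed random projection plus ReLU stands in for the truncated network (an assumption purely for illustration), and the SVM is a minimal hinge-loss sub-gradient trainer, so the example stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the chopped network: in practice you'd run each image
# through a pretrained ImageNet model with its final prediction layer
# removed, and take the penultimate-layer activations as features.
# A fixed random projection + ReLU plays that role here (illustration only).
W_net = rng.normal(size=(256, 4096))

def extract_features(image_vec):
    """Map a flattened image to a fixed-length, L2-normalized feature vector."""
    f = np.maximum(0.0, W_net @ image_vec)
    return f / (np.linalg.norm(f) + 1e-8)

def train_linear_svm(X, y, lam=0.01, epochs=50, lr=0.1):
    """Minimal linear SVM: sub-gradient descent on the regularized hinge
    loss  lam/2 ||w||^2 + mean(max(0, 1 - y (Xw + b))),  with y in {-1, +1}
    (e.g. cat = -1, dog = +1)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1.0           # samples violating the margin
        if mask.any():
            grad_w = lam * w - (y[mask][:, None] * X[mask]).sum(axis=0) / n
            grad_b = -y[mask].sum() / n
        else:
            grad_w, grad_b = lam * w, 0.0
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, X):
    return np.sign(X @ w + b)
```

Training then amounts to extracting features once for your labeled cat/dog images and fitting the SVM on those vectors; in practice you'd reach for an off-the-shelf SVM implementation rather than the hand-rolled one above.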

I'm happy to try and answer other questions.

[1] http://leafsnap.com or see my project page for more details on how it works: http://homes.cs.washington.edu/~neeraj/projects/leafsnap/

[2] https://itunes.apple.com/app/dogsnap/id532468586?mt=8

[3] http://www.image-net.org/

[4] http://www.image-net.org/challenges/LSVRC/2013/index



As so often happens, the top rated comment on HN is more interesting than the article itself. Leafsnap and Dogsnap are so awesome!



