Hacker News | jme3's comments

I never cease to be amazed at how people who work with data large enough to make tools like plyr slow assume that everyone works with data like that.

I have been using R daily for 7-8 years now and have only occasionally turned to something like data.table for performance reasons. "Big data" receives far more attention and hype than there are actual human beings working on data of that scale. I can assure you that for the vast majority of R users worldwide, plyr is plenty fast for their needs.


This pretty much perfectly illustrates my comment below, pointing out that these sorts of recommendations are entirely subjective and useless.

Many of your points are quite subjective. I could do the same thing with Matlab. For instance, I find it mind-boggling that anyone could get anything done when you have to devote a separate file to every single function. That seems incomprehensible to me. And yet, I realize that it's probably a mostly subjective thing that you get used to.

Personally, I find R's documentation excellent. When people complain about it, it's usually because they have mistaken it for a tutorial. It's not. It's documentation.

Without any data, I seriously doubt your claim that Matlab has a much larger user base. (There is considerably more activity in R on StackOverflow than in Matlab.)

Your complaint about matrices, lists and data frames is similar. Data frames exist for the same reason that there's a mean() function: a columnar data structure that holds a different data type in each column comes up so often and is considered so useful that it is built in.

pandas in Python went out of its way to specifically _mimic_ these data structures, because data frames are considered such a vital aspect of R.
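As a sketch of that point (assuming pandas is installed; the column names are made up for illustration), a pandas DataFrame, like an R data.frame, is a columnar structure where each column can hold a different type:

```python
import pandas as pd

# Each column has its own dtype, mirroring R's data.frame.
df = pd.DataFrame({
    "name": ["a", "b", "c"],        # character/object column
    "score": [1.5, 2.0, 3.5],       # numeric column
    "passed": [True, True, False],  # logical column
})

print(df.dtypes)           # one dtype per column
print(df["score"].mean())  # columnar operations are built in, like R's mean()
```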

And keep in mind that these criticisms are all coming from someone who _also_ recommended against switching...!


You make good points, however I have to take issue with the documentation.

> When people complain about it, it's usually because they have mistaken it for a tutorial. It's not. It's documentation.

I don't really see the distinction. Documentation is supposed to explain to you how to use the code. You can call it whatever you want; if that takes a tutorial, so be it. R, and especially the non-standard packages you download through CRAN, has very terse documentation that barely explains how each function works on its own, much less how it works in the context of the rest of the package. You can't just tell the user what goes into the black box and what comes out and expect people to be able to use your software.

Sure, there are vignettes (I think that's the term), but they're really inadequate because they only scratch the surface of how the package is meant to be used.

Anyway, that's my two cents. I've spent so many hours fighting with R documentation, trying to figure out how to get done what I needed. Sometimes, months later, I would find out there was a much better way to do something that simply wasn't explained anywhere. I'm OK at R now, but I went through a lot of pain to get here. I'd never wish it on anyone else.

My experience with MATLAB, on the other hand, has always been very pleasant. I spent about 3 hours going over the tutorial on how to use it (much better than R's "Introduction to R") and I hit the ground running. When I needed something, a quick search through the help or online always turned up results.


From my memory, MATLAB's documentation not only discusses the implementation, but also the statistical/engineering methodology. It's overkill and can be pretty annoying (paging back and forth between different parts of the help can be time-consuming) when you actually know the statistics but just want to understand the implementation. Hence the distinction between "documentation" and "a tutorial".

I don't know whether it's an explicit or implicit design choice or just a happy accident, but I'm grateful that the R documentation doesn't try to hold anyone's hand and guide them through data analysis beyond their training.


You shouldn't, necessarily.

I use R almost exclusively and I absolutely love it. But people can get a little carried away with picking and recommending the "best" language for a particular task.

If you feel very comfortable and skilled in Matlab, and you aren't finding that there are regular situations where Matlab can't do what you want, I wouldn't really advocate switching. Same goes for someone working primarily in Python.

The biggest reason I would recommend switching away from a language you're already very comfortable in would be the availability of statistical methods that aren't present in your language of choice. Statisticians tend to work in R, so much of the cutting edge work ends up in packages on CRAN.

So I wouldn't dump Matlab if you're happy with it, unless you're just looking to learn something new, out of intellectual curiosity, which is always fun.


I think your response here and to the other poster above is quite sound. I also program regularly in Matlab, R, and Python, but when it comes to data analysis, I do find that R is just much more concise than the other two (the data manipulation and statistical analysis tools are more high-level). Learning R and going through its tutorials (the MASS book and actually the S-PLUS Statistics Guides) and learning about functions available to me made me learn a lot more about stats. I sometimes use Matlab for image processing and optimization, or some simple simulation but less and less these days (trying to replace it with SciPy since I use Python a lot in my workflow).

But I also agree that if you're already proficient with Matlab and happy, then maybe you don't need to learn R (though often you can be blissfully ignorant of your possibilities if you are unaware of the vast libraries that another language/environment offers).


Given your clarification...

The two methodologies can give somewhat different results, but not as often, and not by as much, as you might think. In my experience, the instances where you get radically different answers from Bayesian/frequentist methods are quite rare, and tend toward pathological examples invented purely to demonstrate the "superiority" of one over the other.

That said, sometimes one or the other is significantly more convenient or easier to apply for a given model or type of data.


I see, thanks. Are the differences hard to reconcile, given that we can do Monte Carlo on a model and see whose predictions are correct?

I'm just annoyed by there being two camps in science, where one gets slightly different results from the other. It seems to me that one is obviously wrong, since there's only one truth.


You think that's bad, you should check out physics: http://en.wikipedia.org/wiki/Theory_of_everything

Statistics is more about data sets and interpretation than "right" vs. "wrong".


> It seems to me that one is obviously wrong, since there's only one truth.

A bold claim.



The point of Gelman's reply is that the comic is actually comparing a Bayesian to an absurdly incompetent frequentist, so there's really no conflict. No (modestly intelligent) frequentist would misapply this methodology in this circumstance.


Sorry, I wasn't talking about the comic. In general, don't the two approaches give different results? Surely, only one is "correct".


Both are correct, but they target different things. The disagreement is about what the target should be and the advantages and disadvantages of choosing each target. Bayesians are interested in p(unknown|data) and frequentists are interested in p(data|unknown = H0). Inference can be framed either way but means different things.
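A toy sketch of those two targets (a made-up coin-flip example, standard-library Python only): the frequentist quantity conditions on H0, while the Bayesian quantity conditions on the observed data.

```python
import math
import random

# Hypothetical data: 60 heads in 100 flips. Is the coin biased toward heads?
n, k = 100, 60

# Frequentist: one-sided p-value = P(X >= k | p = 0.5), i.e. how surprising
# the data are under the null hypothesis H0: the coin is fair.
p_value = sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n

# Bayesian: with a flat Beta(1, 1) prior, the posterior for the heads
# probability p is Beta(1 + k, 1 + n - k); estimate P(p > 0.5 | data)
# by Monte Carlo sampling from that posterior.
random.seed(0)
draws = [random.betavariate(1 + k, 1 + n - k) for _ in range(100_000)]
posterior_prob = sum(d > 0.5 for d in draws) / len(draws)

print(f"p-value P(X >= {k} | H0):      {p_value:.4f}")
print(f"posterior P(p > 0.5 | data): {posterior_prob:.4f}")
```

Both numbers point the same way here, but they answer different questions: one is about the data given a fixed hypothesis, the other about the unknown parameter given the data.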


Are there any situations where you want to use a frequentist procedure?

I've concluded that given a perfect, infinite-power MCMC simulator, I would always do a Gelman-style Bayesian analysis (with model falsification and improvement), but in practice, frequentist methods are computationally convenient.

> Inference can be framed either way but means different things.

A Bayesian posterior P(H|D,M) is the probability that hypothesis H is true given data D and modelling assumptions M.

What does a frequentist p-value mean?


Sure, see my link above (http://stats.stackexchange.com/a/2287/1122). If you want to put an upper bound on the worst-case probability of making a mistake, you use a p-value. If you want to express the conditional probability of a particular hypothesis given the observation (and given a prior belief), you use a posterior probability. Bayesians can also do silly things (see the cookie example with the inept Bayesian robots). In the end there is no free lunch.


The frequentist p-value is about H0, not (directly) the hypothesis you are testing. More specifically, it is the probability, assuming H0 is true, of observing data at least as extreme as what was actually observed. (The related significance level is the probability of rejecting H0 even though it's true.)
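To make that concrete, here is a small simulation sketch (standard-library Python, made-up numbers): when data really do come from H0, a test that rejects at alpha = 0.05 commits that error at roughly (at most) that rate.

```python
import math
import random

n, alpha, trials = 100, 0.05, 2000

def p_value(heads):
    # One-sided p-value: P(X >= heads) under X ~ Binomial(n, 0.5), i.e. H0 true.
    return sum(math.comb(n, i) for i in range(heads, n + 1)) / 2**n

random.seed(1)
rejections = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(n))  # data generated under H0
    if p_value(heads) < alpha:  # the test rejects a true H0: a Type I error
        rejections += 1

# The long-run rejection rate of a true H0 stays at or below alpha.
print(f"Type I error rate: {rejections / trials:.3f}")
```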


Wow thank you, this is the clearest and most straightforward explanation of the difference between the two camps in this thread.


They are both models and as such, you might consider that neither of them are "correct." But they are both useful, sometimes in different circumstances.

"Essentially, all models are wrong, but some are useful." — George Box


I see, thank you all very much for the clarifications.


I don't see how, really. Gelman himself is a (rather famous) Bayesian, and if you read the comments, you'll see that Randall himself pops up and basically concedes Gelman's point.

