Showing posts with label statistics. Show all posts

Thursday, October 08, 2009

Language Learning

Today Iris and I are each trying to learn a language. Iris is out, walking around Rostock, investigating the various schools in town that offer German classes to auslanders. I am focusing on learning a much more broadly used language, R. Now German is certainly used by more people than is R, and R isn't really anybody's first language, but R is used all around the world, and by a surprising range of people. Yesterday a colleague who has been in Germany for a year and not yet learned German said to me, "I'm not staying in Germany forever, and German isn't going to do me a whole lot of good outside of a few countries, but R I will need for every job I might ever have."

R is a simple programming language intended for statistics and data analysis. It is rapidly becoming the standard for advanced data analysis, in the natural and social sciences, from advanced college students to statistics professors, and in every country where people with internet connections need to analyze data.

Back in the 1970s, Bell Labs developed a programming language called S (for "statistical"), and in the mid '90s Ross Ihaka and Robert Gentleman at the University of Auckland had the wisdom to release an open source implementation of it, called R. R has the wonderful property of being easy to extend. Any user can invent new words for this language and tell the computer exactly what to do when users use those words. This is equivalent to English's allowance of the sentence, "From now on, let's use the word reflop to mean 'to flip something over, and then flip it back to its original position.'" Users can also find something they don't think works well, look at the underlying language, and tell the computer, "From now on, I want this word to mean X, not Y as it did before." Users have added and modified graphical user interfaces, made implementations that work inside other programs, and compiled packages for every major operating system.
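For the curious, here is a minimal sketch of what inventing and then redefining a word might look like in R. The name reflop is just the made-up word from the paragraph above, not anything built into R, and the variable names are my own choices for illustration.

```r
# Invent a new word: reflop() flips a vector over, then flips it back.
reflop <- function(x) {
  flipped <- rev(x)  # flip it over
  rev(flipped)       # flip it back to its original position
}

before <- reflop(c(1, 2, 3))  # same order as the input

# Change our minds: from now on, reflop() means "flip once", not "flip twice".
reflop <- function(x) {
  rev(x)
}

after <- reflop(c(1, 2, 3))  # now the order is reversed
```

Assigning a new function to an existing name is all it takes to change what the word means for the rest of the session.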

These extensions and modifications can be uploaded to the R website, and other users can decide which bits and pieces they want. Every once in a while a pre-fab version is released, with all the most recommended bits and pieces, and with someone having checked that they all work well together. So every user is necessarily a programmer, and every programmer can fairly straightforwardly improve on the model. It is as though every user of an open source browser such as Firefox, in learning how to use the browser, also necessarily learned how to make improvements to the browser. By this model R quickly and clearly outstripped S and S+. I am sure there is someone out there who still uses S, but not many. R is more versatile, more widely used, has elegant add-ons in fields from architecture to phylogenetics, and is entirely free. It's the feel-good statistics package of the decade, and a serious threat to the business model of anyone who makes money selling data analysis software (which can often cost hundreds of dollars for a single user).

At the Max Planck Institute for Demographic Research, where I've recently started working, everybody uses R. The simulations are in R, the data queries are in R, the statistics are in R, and the figures and graphs are created in R. R is more necessary than German for anyone at the Institute, which is why I am dedicating the next couple of weeks to learning it. As with any language, the largest part of learning R is trying to use it, failing to be understood, and trying again. So I've given myself a task, outlining in English a simple simulation I need to perform for a paper I'm revising. Programming this requires about 50 steps. So far I've figured out the first three, and I'm stumped on the fourth. Even so, I think my R is already better than my German.