For the last six months I have been trying to learn R, for several reasons. It covers a large part of the data science spectrum, and a basic knowledge of the language is necessary to understand much of the work published online, especially in the data science and statistics community. As it turns out, understanding the way R works also helped me see pandas and the Python data science ecosystem from a new perspective. I have never worked with S or similar statistics packages, so I searched for literature, enrolled in some online courses and ended up with a sort of curriculum that enabled me to do the following:
- load data into R
- understand functions, the apply family of functions (sapply and the like)
- use Wickham’s new-school packages: dplyr, tidyr, ggplot2
- understand the way machine learning algorithms are applied
- produce modest notebooks (R Markdown) and move towards reproducible analysis
- wrangle data in an easier way (I hate to admit it)
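To make the first two items on that list concrete, here is a minimal sketch in base R. It is only an illustration under my own assumptions: the inline CSV, the `people` data frame and the `bmi` helper are made up for this example and do not come from any of the books or courses discussed here.

```r
# Load data: inline text stands in for a real file path (made-up sample data)
csv_text <- "name,height_cm,weight_kg
Ann,170,65
Bob,182,80
Cy,165,58"
people <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# A user-defined function (hypothetical helper for this example)
bmi <- function(weight_kg, height_cm) weight_kg / (height_cm / 100)^2

# sapply: apply a function to each column and simplify the result to a vector
col_means <- sapply(people[c("height_cm", "weight_kg")], mean)

# vapply: like sapply, but the expected return type is declared up front
name_lengths <- vapply(people$name, nchar, integer(1))

# mapply: walk several vectors in parallel, one pair of values per call
people$bmi <- mapply(bmi, people$weight_kg, people$height_cm)

print(col_means)
print(people)
```

In practice `read.csv()` would point at a file, and `bmi(people$weight_kg, people$height_cm)` would work directly because the arithmetic is already vectorised; `mapply` is used here only to show the pattern.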
While there is a sort of hyperinflation of books covering just about any software engineering/data science topic in the world, it is difficult to pinpoint a couple of really good books. Three textbooks, however, proved phenomenal and, in my opinion, offer a great introduction and cover the basics in a way that should enable you to start playing with data, lay out a project and then search around and Stack Overflow your way out of it.
R for Data Science by Wickham and Grolemund is an online and offline gem that drops you directly into the ocean (off the shore of New Zealand, I guess) and gives you just enough information to keep you afloat. It is an interesting journey and a great way of exposing you to the intricacies and peculiarities of the R tidyverse, written by the authors of the software themselves. The book covers in great detail the packages that you will end up using most of the time: dplyr, tidyr, readr, stringr and others. There are also some videos on the topic, presented by Grolemund, and they should be worthy of your time. If you have to start somewhere, start there.
Beginning Data Science in R by Thomas Mailund is a real gem! It takes you right from the basics, following a more conventional teaching approach - starting from simple (base) R and introducing the basic data types, operations, function definitions etc. - but without being slow or boring. At just over 350 pages it packs quite a punch with examples and nice code snippets. Mailund leaves no stone unturned, and the chapter on reproducible data analysis is one of the best in the book and one of the most straightforward approaches to the topic. There is also nice coverage of ML algorithms and of the most common operations that you could face doing a project. Highly recommended!
Finally, there is Machine Learning with R by Brett Lantz. Once I was able to find my way around the R ecosystem, Lantz’s book came in very handy with a thorough approach to the most widely used algorithms. The book spends quite some pages on the explanation of different algorithms, but it is a very welcome diversion. Some will say that it is overly verbose and prosaic, but I enjoyed the style and the code snippets that followed, illustrating the underlying concepts.
If you’re a book-type learner, these three books should enable you to do some simple and not-so-simple projects. You should also grab an A3 printer, print the RStudio cheat sheets and stick them in front of you. Afterwards, it is just a matter of practice, Stack Overflow and more practice.
I should also mention another couple of books that I had the opportunity to skim over.
Beginning Data Science with R is a dense and rather short book, concentrating on the bare basics of the language and covering some interesting workflows in machine learning.
Finally, An Introduction to Statistical Learning is much more than an introductory R book. It is widely considered the bible of statistical learning and thus one of the theoretical pillars of data science in general. I will dedicate a separate post to the book, but it is worth noting that it is probably a very good idea to chew on chunks of it while learning from one of the books on the list, as there is a nice correspondence between its theoretical concepts and math and their practical implementations.
I truly believe online courses are one of the biggest disruptive factors of our age, for better and for worse. Having gone through a long, old-school scientific/engineering curriculum myself, I often sigh at some of the opportunities for online learning that didn’t exist just a couple of years ago. I shall try to write at least a small review of each and every course that I have seen, watched or audited, but today we are talking about R.
DS and ML Bootcamp with R is taught by one of Udemy’s all stars - Jose Portilla. It is a very good course, covering the basics of R and the new dplyr/ggplot2 packages, and it has some interesting projects and solutions. I have very little time for courses - I might grab half an hour in the morning and maybe an hour or so in the evening, so I tend to be really careful and try not to waste time. This course is very good and very fast, especially in the part concerning dplyr, tidyr and the apply family of functions. The only thing missing that I could think of would probably be an introduction to knitr and Shiny (R’s web application framework), but there are nice tutorials on YouTube and in Mailund’s book.
I was able to see some of O’Reilly’s videos on R, presented by Grolemund, and they look awesome. If you get a chance to use that material, do it, by all means.
I collected a ton of snippets, guides, solutions, articles etc. using my favorite online clipping tool and, as soon as I create a nice pipeline for extracting the links, I shall update this post.