A Data Scientist's Resources

Written by Matt Dancho



Getting up and running in data science is tough. It’s easy to get overwhelmed, and your biggest asset is time (don’t waste it). Here are some resources to help speed you along. I’ll continually update these as I get time. Feel free to comment or email me if I’m missing something.

About the photo: A sextant is a device used in navigation. By measuring the angle between the horizon and the stars, one can fix a ship’s position and set its course. Data science can be used in much the same way. By applying data science concepts, one can develop strategies that set the direction of an organization.

-Matt

Data Science


These are the best resources I’ve found to get you up and running quickly with analysis. The resources cover both programming and statistics.

Free Books from Profs and Pros that Know Their Stuff:

Hadley Wickham’s website - Hadley has developed many of the most commonly used R packages, such as dplyr, tidyr, and ggplot2, and has written several books (some free) and technical papers on various topics.

Introduction to Statistical Learning - Five stars. This is by no means an easy read, but it is probably the best book I’ve read bridging the gap between statistical analysis and R. The book is geared towards people who want to apply statistical learning to real-world problems. It spends less time on theory, and more time on the actual tools.

Elements of Statistical Learning - I have not read this one, but I imagine it as the next step in the progression from ISL. I’ll report back.

R Programming for Data Science - Good book you can get for free. This is pure R, whereas the others are more statistics driven. It accompanies the R Programming course on Coursera.

Exploratory Data Analysis with R and Art of Data Science - The focus is on data visualization… the art of communicating data. Exploratory Data Analysis shows the R programming side of data communication, while the Art of Data Science focuses on the fundamentals behind data communication.

Courses:

Intro to Statistical Learning (Stanford) - Free course that accompanies the ISL book mentioned above. Uses R as the language. I have not taken it yet, but it gets very good feedback. I’ll report back after I finish it.

Machine Learning - Coursera (Stanford) - Free course by Professor Andrew Ng. The course uses Octave as the programming language, which is the only downside in my opinion (I wish he used R!). With that said, Professor Ng is the best professor I can imagine for such a course. He masterfully explains the nuts and bolts of complex topics in machine learning.

Data Science Specialization (Johns Hopkins University) - This specialization costs $423 as of April 9, 2016. It will set you back a bit, but the price is on the order of one credit-hour at a major institution… well worth it in my opinion. Update: I’m getting further into the courses, and I’d give the lectures a 5/10. The redeeming quality is the companion books that Dr. Peng has put together. Overall, I’d still say the program is worth it, but it’s a closer call than I’d like…

Data Analysis and Statistical Inference (Duke University) - Free course that will certainly help with introductory statistics. Course uses R as the programming language.

Tools of the Trade


You need to be part programmer if you want to process and analyze data. These are the tools that I have found most useful.

R:

A language dedicated to statistics. Comes out of the box with everything you need to begin analyzing data. The real power is in R’s libraries and in the open-source IDE RStudio. Has a huge community supporting it - check out R-Bloggers if you don’t believe me. RStudio’s cheatsheets save you from having to remember all the commands. The website I use for searching main packages is rdocumentation.org, which has package categorization and searching (a necessity given the number of R packages out there). In writing this post, I found yet another good reference website for books and tutorials to get one up and running: r-dir.com.

Python:

Fast, and ideal for big data applications. Also has a huge community of support and great libraries such as matplotlib, scikit-learn, numpy, and pandas. Probably one of the best environments for learning is Jupyter Notebook (strictly speaking, it’s an interactive notebook rather than a full IDE). Before you get started, I recommend checking out the Anaconda distribution, which comes with many of the most widely used packages pre-installed.

Excel:

It may not be sexy, but it’s great for visualizing and pivoting data. It has issues with larger datasets, some of which can be managed with Power Pivot and Power Query. Excel integrates nicely with relational databases. A great reference for Excel with lots of good tutorials is Chandoo.org. Chandoo is amazing, and his website will help you “become awesome in excel”. I use it a ton.

Databases (SQL / MS Access / MySQL):

You need a way to collect data. Databases are the key. For those interested in SQL, a free tutorial/book is Learn SQL the Hard Way.
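If you want a feel for what collecting and querying data in a database looks like, here is a minimal sketch using Python’s built-in sqlite3 module. SQLite is not one of the databases named above; I’m assuming it here only because it ships with Python and needs no setup. The `sales` table, its columns, the function names, and the numbers are all made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up sales table."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    conn.execute(
        "CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    # Hypothetical rows, purely for illustration.
    rows = [
        ("East", 120.0),
        ("West", 80.0),
        ("East", 30.0),
        ("West", 20.0),
        ("North", 50.0),
    ]
    # Parameterized inserts keep the data separate from the SQL text.
    conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def total_by_region(conn):
    """Return (region, total amount) pairs, sorted by region name."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


conn = build_demo_db()
totals = total_by_region(conn)
print(totals)  # [('East', 150.0), ('North', 50.0), ('West', 100.0)]
```

The GROUP BY query does roughly what a pivot table does in Excel, and the same SQL carries over to MySQL and most other relational databases with little or no change.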

Git & Github:

Version control is key, which is where Git comes in. GitHub helps with the communication/collaboration aspects. And, this website is built on GitHub! You can download it here.

Now go forth and learn. Onward and upward.