Git for data science at iHeartRadio

Last year, I walked iHeartRadio's Data Science team through the theory and practice of git.

Version control falls into the category of "practical stuff you learn after finishing grad school," and many folks -- such as Software Carpentry, Trey Causey, and Kaggle -- have written about it at length. I figured I would toss in my two cents, with a focus on our situation and goals at the time.

Some context: around the time I joined iHeartRadio, collaboration within the Data Science team had been limited to analyses and dashboards. We had not operationalized much software, and most of our work lived in a single GitHub repository with folders named after their owners.

Our team had (and continues to have!) ambitious plans for different projects, but this approach to version control didn't scale to projects with multiple owners (either within or across teams). We agreed to adopt best practices for version control, and I presented git's theory and tools for collaboration. A lot has changed since then, and we now use Jenkins and Kubernetes to test our various repositories on commit, build releases after merges to master, and push package updates to PyPI.

Here are the slides for my talk:
