3 Stolen Analytics Team Workflows

Data science team photos were scarce, so here’s a very serious-looking photo I found by googling “data science team photo”: the 1876 Yale Bulldogs, national champions. Apparently this predates photographers yelling, “Look at the camera, and say cheese!” (Courtesy Wikipedia)

TL;DR: Data science teams don’t need to invent a new way to work together. It’s better to steal ideas on how to collaborate from older, more established disciplines. Below are three possible models your data science team can use to improve collaboration on your next project.


Obama on Data Science Team Sports

At Strata + Hadoop 2015, our Commander-in-Chief had a very important message to share with us data scientists. In this video presentation, President Obama decreed, “data science is a team sport.”

Data science is no different than any other activity where multiple brains are better than one. Its close relative, software engineering, has already explored and established ways to work together as a team. We don’t need to re-invent the wheel. We can borrow and steal collaborative approaches from those disciplines that struggled before us.

Here are three examples you can use to improve data science team effectiveness, or simply to collaborate better with others on your next analytics project.

Relay Race Pipeline Model

This model is easy to understand and easy to implement, but it has some drawbacks. It works well when the pipeline has clear parts: a beginning, a middle, and an end. In this example, you would have three people, each responsible for one of those three parts. You want to make sure someone is responsible for each leg of the relay race to get your project across the finish line.

The problem is that real-world data science projects rarely move in one direction. Just like software projects attempting waterfall, they get messy and often need to backtrack. That’s OK, though: the clear division of labor still makes communication easy, with one person in charge of the front-end visualization, another in charge of cleaning and munging, and another responsible for feeding the raw original source data into the top of the pipeline. It’s really about the roles; ignore the other parts of the relay race metaphor.
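As a minimal sketch, the relay race model can be pictured as three hand-off functions, one per owner. The stage names and the tiny dataset below are made up purely for illustration:

```python
# A hypothetical three-leg pipeline: each function is one person's leg
# of the relay race. Names and data are illustrative only.

def ingest(raw_rows):
    """Leg 1: feed the raw source data into the top of the pipeline."""
    return [row.strip() for row in raw_rows if row.strip()]

def clean(rows):
    """Leg 2: clean and munge the rows into usable records."""
    return [row.lower().split(",") for row in rows]

def summarize(records):
    """Leg 3: the front end -- a text summary standing in for a chart."""
    return f"{len(records)} records ready to visualize"

raw = ["  Alice,SF ", "", "Bob,NYC"]
print(summarize(clean(ingest(raw))))  # 2 records ready to visualize
```

The hand-off points (the function signatures) are exactly where the communication between owners happens, which is why making them explicit helps even when the project has to backtrack.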

Microservices Delegation Model

This is an upgrade over the relay race model because it’s multi-dimensional. I credit Tom DeMarco with explaining it clearly.

There’s an author named Tom DeMarco who wrote Peopleware and The Deadline. In one of those books (and possibly both), DeMarco argues that work should be delegated in an isometric fashion, meaning that if you picture a project as one 100% blob, it should be carved up into sub-components small and well-defined enough that each one can be owned by a single person.

I would argue that what DeMarco wrote about is the basis for microservice architectures—except that he wrote about the concept many, many years before that word became all the hype it is today.

Another way to imagine this is that each part of your project is a sub-component, and that sub-component is a black box that accepts some input and provides some output. The inputs and outputs don’t have to be in any particular order. For instance, someone can make a little mini-app that simply receives a zipcode and a latitude and longitude as input. This app can return true if they match, otherwise false. A tight little single-purpose program like that is modular, which makes it easy to reuse in future projects, too.
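A sketch of that zipcode black box might look like the following. The centroid table and tolerance are made-up stand-ins; a real version would query a geocoding dataset or service:

```python
# Hypothetical zipcode-to-centroid lookup; a real implementation would
# consult a geocoding dataset or service instead of hard-coded values.
ZIP_CENTROIDS = {
    "94103": (37.7726, -122.4099),
    "10001": (40.7506, -73.9972),
}

def zip_matches(zipcode, lat, lon, tolerance=0.05):
    """Black box: take a zipcode plus a latitude/longitude, and return
    True if the point falls near that zipcode's centroid, else False."""
    center = ZIP_CENTROIDS.get(zipcode)
    if center is None:
        return False
    return abs(lat - center[0]) <= tolerance and abs(lon - center[1]) <= tolerance

print(zip_matches("94103", 37.77, -122.41))  # True
print(zip_matches("94103", 40.75, -73.99))   # False
```

Because the contract is just “zipcode and coordinates in, boolean out,” the person who owns this sub-component can change everything inside it without coordinating with anyone else.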

Open Source Model

This is the most sophisticated of the three examples, but perhaps the most important. The software world has enabled distributed teams to collaborate and make great software for many years. There are successes and failures from that experience worth considering when trying to work as a team on data science.

This is even more important when working with volunteers, as on projects like Linux, or Firefox, or Code For America, where I have the honor of being a part of the Data Science Working Group of the SF Brigade.

I really like the world of open source as a model for effectively collaborating with other people. Someone makes a copy of your code or analytics package, changes it, and then shares it back with you. As the owner of your original masterpiece, you can choose to accept their suggestion or ignore it. If you accept it, the change is merged automatically; you don’t have to futz with the code. And when something does need futzing, you can file it as an “issue,” and someone who wants to be considered a contributor can put their name down.
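That accept-or-ignore loop can be played out with plain git. This is a rough local sketch with made-up directory and file names, where two local clones stand in for a service like Github:

```shell
#!/bin/sh
# Hypothetical demo of the fork-and-merge flow: an "owner" repo, a
# "contributor" copy, a proposed change, and the owner accepting it.
set -e
rm -rf /tmp/oss-demo && mkdir -p /tmp/oss-demo && cd /tmp/oss-demo

# Owner publishes the original masterpiece
git init -q owner && cd owner
git config user.email owner@example.com && git config user.name Owner
echo "raw analysis" > analysis.txt
git add analysis.txt && git commit -qm "initial analysis"
cd ..

# Contributor makes a copy and changes it
git clone -q owner contributor && cd contributor
git config user.email contrib@example.com && git config user.name Contrib
echo "cleaned analysis" > analysis.txt
git commit -qam "clean up the analysis"
cd ..

# Owner chooses to accept: the change is merged, no manual futzing
cd owner
git pull -q ../contributor HEAD
cat analysis.txt
```

Here `git pull ../contributor HEAD` plays the role of clicking the merge button on a pull request; on a hosted service the mechanics differ, but the accept-or-ignore decision is the same.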

Version Control is Mandatory

I’ve heard an argument against this approach for volunteer projects like the ones we do at Code For America: when your volunteers may be entry-level analysts and data scientists, requiring version control, or a particular service like Github, is an unnecessary barrier to entry. At the same time, the same individuals who think Github is too complicated are usually also the ones who struggle with merging everyone’s analytical contributions at the end of the project.

I would challenge that argument this way. A data scientist is jokingly defined as one who knows more statistics than the average programmer, and more programming than the average statistician. There are certain skills you must learn in order to be a data scientist. For example, you must learn a programming language like R or Python. You must understand basic statistics. I would argue that understanding version control is just as important.

You may not make a lifelong career out of data science. Regardless, understanding the basics of version control will give you a major advantage over those who do not. If you know how to contribute to an open source project with a system like Github, you have a powerful skill to add to your resume, and if you actually make contributions to the open source community, that is great experience. Are you volunteering with a team like the one we have at Code For San Francisco because you want to learn? Because you want to collaborate with other great people? Why would you limit yourself by choosing not to learn a tool as powerful as version control?

There are plenty of resources that explain how Github works and how to contribute to open source with it, and plenty of articles about getting started with open source in general. By the way, you don’t need to be a programmer to contribute to open source projects. Also, Github is not the only place to host version-controlled projects, but it is arguably the most popular.

Better Teams are Thieves

The point of this post was to jog your brain about theft. Steal ideas from the successes and failures of disciplines that already figured out how to play as a team. Learn how those non-data-science models work, and emulate the parts that will make your next project better. This isn’t just about improving data science: being a thief of teamwork models will reward you in unexpected ways down the road. In the meantime, go forth and be a more successful, happier collaborator on your next project.