Git For Everything
Git is a core tool most developers use every day. Yet, it’s rarely used by non-developers. Even adjacent professionals such as data scientists and analysts don’t always know how or when to use it. Let alone project, product, or people managers!
As a developer converted to a manager it was clear to me that a lot of disparate tools and processes used for tasks such as tracking progress, writing documentation, maintaining a roadmap, providing changelogs, etc. could be replaced by a single tool and approach: Git with pull requests / Github Flow. Once I took it to the logical extreme and moved most of the team’s operation to Git, the result was the Git For Everything approach that I describe here.
This article will be relevant primarily to team leads/engineering managers/heads of data or engineering teams and departments, so I will not spend much time talking about what Git is or other technicalities. If you’re not sure what Git is, you should check out this article first. Let’s dive in.
Before we go to the specifics we need to paint a bit of a background. I was working as a head of data in a mid-sized company. Before I joined, the data team had limited resources (only 1 full-time data engineer, 1 full-time and 1 part-time data analyst) and an insurmountable volume of incoming tasks.
Naturally, plates got dropped, and the “pre” state was roughly this:
- there was a ticketing system, but it was not really used
- priorities changed daily
- uptime was miserable; outages were happening every week, some of them with data loss
- no metrics definitions – and, as a result, different metrics were called by the same name, which made reports and dashboards non-interpretable by most business stakeholders
- there were multiple versions of “dashboards” scattered across tools such as Excel, Google Sheets, Tableau, and Superset
- code quality was poor; no docs, no tests
- the team was demotivated, since no matter how much they worked, the “To Do” list only grew bigger.
Not Only Technical Fixes
And no, Git by itself will not solve those problems! However, it has proven to be invaluable in combination with:
- growing the team: in the end we had 1 product manager, 2 data engineers, 1 data scientist, and 3 data analysts
- enforcing usage of a ticketing system by the data team itself and other teams who required services provided by the team
- stricter triage and prioritization of tasks
- individual career development plans and regular 1-on-1s with all team members
- major re-architecture of the data system and gradual rewrite that touched data ingestion, processing, storage, and analysis components to varying degrees
- dockerizing all components and implementing CI/CD
- introduction of reusable queries and dashboards packaged in a data app built on top of Streamlit.
As usual, the greatest effect comes from non-technical changes. However, when you consider that the team was still comparatively small and that we implemented those changes while searching for people and keeping the system alive, the technology choices and the effectiveness of approaches do play a major role. Thus, Git.
Git comes with a unique combination of features that are not matched by any popular “project management” tools. Git is built to enable asynchronous work by a possibly very large team of individuals working on a set of shared assets.
It only really works for text, but when you consider a software engineering or data science/analytics department, pretty much everything is text: code behind the apps, dashboards, data ingestion pipelines, or machine learning models, issues, task lists and roadmaps, infrastructure and deployment logic, documentation, etc.
Since Git is a version control system, you can ensure that nothing ever gets lost and that reverting or comparing with the previous state is straightforward. Every single change and every single line of every file at any point in time can be referenced.
Git can be used from the command line, with an advanced text editor such as vim or Visual Studio Code for those comfortable with those tools, and via a GUI that Github / Gitlab provide for those who are not. There’s still a learning curve for sure, but it’s not that big of a jump compared to learning to use Jira and co. for non-technical folks anyway!
Let’s take a look at some of the ways I used Git to make the team’s work more manageable.
Code, Internal Libraries, and The Monorepo
Before the changes, the team had some code and related artifacts stored in various repositories, and another set that was either not under version control at all or only partially versioned in weird locations such as Google Drive.
Unifying everything in a monorepo was the logical first step. “Monorepo” just means that instead of having a Git repository per project or component you have one Git repository in which all projects/components coexist. This reduces cognitive load for everybody involved, allows you to update several components simultaneously (which makes things like updates to shared libraries easier), reduces the number of places people need to look for things, and enables better code reuse (esp. when we talk about things such as CI/CD pipelines).
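One practical consequence of a monorepo is that CI can decide what to rebuild from the list of changed paths alone. The following is a minimal sketch, not the setup described in this article: it assumes a hypothetical layout where each component lives in its own top-level directory (`ingestion/`, `processing/`, `dashboards/`) and shared code lives in `libs/`. All of these names are made up for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical layout: top-level directories are components,
# and "libs" holds shared code that every component depends on.
COMPONENTS = {"ingestion", "processing", "dashboards"}
SHARED = {"libs"}


def affected_components(changed_paths):
    """Return the set of components that need a rebuild for the given changed files."""
    affected = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if not parts:
            continue
        top = parts[0]
        if top in SHARED:
            # A change to shared code touches every component.
            return set(COMPONENTS)
        if top in COMPONENTS:
            affected.add(top)
    return affected
```

The list of changed paths would typically come from something like `git diff --name-only main...HEAD` in the CI job.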
Since all the code now lives in a monorepo, setting up automated testing and continuous integration / continuous delivery using Github Actions or (since the company used Gitlab) Gitlab CI/CD becomes straightforward.
This naturally required unifying the build & deploy story. Docker is the default option here, and that’s what I went with.
Storing your CI/CD pipelines (and possibly other regular/manually triggered automated tasks!) in YAML files in the same repository as your code is probably one of the most significant improvements to the developer workflow in the last decade. I don’t think most other models of devops, such as having separate infrastructure-as-code repositories, make much sense nowadays.
In fact, at another job I implemented a set of development environments that you could deploy to by attaching a label to a pull request - thus, multiple developers/data scientists could show their work in a live app to business stakeholders while spending no time on deploys. And when those pull requests are merged, the main publicly accessible version of the app updated without going down. A very similar system powers this website too: you shouldn’t notice any downtime when I add a new blog article or such, because Kubernetes magic. The sky is the limit.
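To illustrate the label-driven idea, here is a minimal sketch of the piece that turns pull request labels into environment names. It is not the system described above: the `preview:` label prefix and the function name are assumptions made for the example, and the actual deploy step is left out entirely.

```python
import re

# Assumption: a label such as "preview:alice-churn" requests a preview environment.
PREVIEW_PREFIX = "preview:"


def preview_environments(labels):
    """Map pull request labels to sanitized, de-duplicated environment names."""
    names = set()
    for label in labels:
        if not label.startswith(PREVIEW_PREFIX):
            continue
        raw = label[len(PREVIEW_PREFIX):].lower()
        # Keep only characters that are safe in a hostname-like identifier.
        name = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
        if name:
            names.add(name)
    return sorted(names)
```

A CI job triggered on label changes could call this with the labels of the pull request and deploy the branch to each returned environment.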
A data team in most organizations will provide the following services to other teams in the org:
- on-demand analysis to investigate a specific question or problem; the resulting artifacts can be anything like a single sentence, a SQL query, a notebook showing some charts with explanations, or a full blown dashboard.
- dashboard development; this can be done according to the internal roadmap and priorities of the data team or by request from other teams
- developing & integrating machine learning models into a product developed by the organization or developing an ML-based product for internal use (e.g. things like automatic triage of customer support tickets).
The common requirement of all those varying tasks is being able to query some database, do some data processing, draw some charts, and, possibly, train an ML model. Jupyter Notebooks are the standard way to do all of those and are typically used by data scientists and analysts. Developers and data engineers can also use them to prototype and test ideas.
In recent times, a lot of proprietary solutions have appeared that try to provide a Jupyter-like experience, but in practice most of them are not compatible with the most natural place to store notebooks: Git.
Both Github and Gitlab allow you to preview notebooks in the browser, making the workflow straightforward for the people doing data work and for the people requesting it:
- data scientist/analyst creates a notebook that answers a particular question or shows specific data
- they push it into a Git repo for review
- once the notebook is merged into the main branch, a link to the notebook is shared with the internal client and they can view the result in a browser without installing any additional tools.
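Because the internal client reads the rendered outputs directly, it can help to check during review that the notebook was executed top to bottom before it is merged. A minimal sketch of such a check, assuming the standard nbformat 4 JSON layout (`cells`, `cell_type`, `source`, `execution_count`); the function name is hypothetical:

```python
import json


def is_cleanly_executed(notebook_json):
    """Return True if every non-empty code cell was run in order, starting from 1."""
    notebook = json.loads(notebook_json)
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        # "source" may be a string or a list of strings; join handles both.
        if not "".join(cell.get("source", "")).strip():
            continue  # empty cells never get an execution count
        counts.append(cell.get("execution_count"))
    return counts == list(range(1, len(counts) + 1))
```

A check like this could run in CI on every `.ipynb` file changed in a pull request, so that stale or out-of-order outputs are caught before a link is shared.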
Since both Github and Gitlab provide SSO integrations, enabling sign-on by non-tech colleagues with reader access is typically easy to do for the IT department as well. And since the notebooks are stored in a centralized location, versioned, and visible, there’s more opportunity to reuse them and a larger incentive to keep them up-to-date, especially if you work with evolving data and regular requests from other teams.
Lately, some notable open-source Git-compatible Jupyter alternatives started to appear too. I suggest you check out Livebook for example.
A lot of companies pay for SaaS solutions of various levels of usability for creating and working with dashboards. In my opinion, all those solutions invariably suck.
Some of those tools target a non-technical audience, some of them try to attract both tech and non-tech folks. Non-tech folks typically know Excel/Google Sheets and use those in their own departments. If you’re developing a product, you often will have integrated analytics in your product too. And data science folks and more complex data analysis will require data wrangling and visualizations that are not doable or impractical to do in “no code” tools/Excel. So adopting those tools ends up creating 3-5 ways of “making a dashboard”, and yes, in most orgs you will end up with 3-5 incompatible versions of dashboards showing the same data in different and conflicting ways.
This becomes even more problematic when your data itself is a moving target. There’s no problem in using GUI/no code tools to build a dashboard – you probably can do it slightly faster (but not by much and only in “supported” cases) than coding it, right? However, the moment your data changes (and it will change as long as your product develops) you will discover that most of the no-code dashboarding tools have exactly 0 support for data evolution and migration. Your dashboards will just break or (worse!) show incorrect data with no easy way for you to track which dashboards could be affected by a data change.
Testing and simultaneous asynchronous work (merging changes developed by multiple people in the same dashboard) are usually either not supported or supported only as a checkbox feature that is not really usable for any non-trivial case.
Finally, assuming that data analysis can be done by people not specialized in doing data analysis is usually unrealistic. The number of times I saw (perfectly smart and amazing at their professions) colleagues make basic statistical/data interpretation mistakes at multiple companies is pretty large. I rarely saw those mistakes made by people who did spend some time learning how to analyze data. Simply put, data analysis is a profession for a reason, and there’s a world of difference between dashboards made by people who specialize in that vs people who have a completely different field of expertise and are often pressured by their bosses to do some charts in Excel pronto for a standup.
So code is not a problem. I posit that to learn how to do data analysis properly and with any degree of certainty in the results you must have some basic coding knowledge. Thus, the question is: how do we enable people to create dashboards quickly but without compromising their UI and data quality? And how do we evolve and maintain those dashboards?
Once you start looking at dashboards as “just another piece of code” the answers to those questions are clear: we store the dashboards’ code in the monorepo and use pull requests and peer reviews to update them and ensure quality!
When dashboards are implemented with code, you can easily search them to find references to the data fields that are changing, you can implement automated testing and CI/CD for them, reuse common data pre-processing and metric extraction libraries, and more.
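As an example of the "search for references" point, here is a minimal sketch that lists every place in a folder of dashboard code where a given data field is mentioned. It assumes the dashboards are plain Python files and that fields appear under their literal names; the function name and the folder layout are hypothetical.

```python
import re
from pathlib import Path


def find_field_references(root, field, pattern="*.py"):
    """Return sorted (relative_path, line_number) pairs where `field` appears as a whole word."""
    root = Path(root)
    # Word boundaries keep "order_total" from matching "order_total_net".
    word = re.compile(rf"\b{re.escape(field)}\b")
    hits = []
    for path in root.rglob(pattern):
        text = path.read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), start=1):
            if word.search(line):
                hits.append((path.relative_to(root).as_posix(), number))
    return sorted(hits)
```

Running this before renaming or dropping a column gives a concrete list of dashboards to update in the same pull request, which is exactly the kind of tracking that GUI-only tools make hard.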
Charts are only useful when people understand what metrics and dimensions in those charts mean. There are 2 main ways to make metrics understandable:
- metrics explainers integrated into dashboards themselves, for example “show on hover”
- a data glossary - a document (that can be interactive or a dashboard itself!) that summarizes and explains each metric, dimension, and piece of terminology used in dashboards.
If we store dashboards in Git as code, it’s easy to see that metrics computation & documentation can be easily reused.
If you want to have a data glossary, you can start with something as simple as a Markdown document stored in your Git monorepo. Note that Markdown naturally enables embedding tables and images, and both Github and Gitlab support diagram languages and extended Markdown features such as expanders and mathematical formulas.
And since we store our Markdown or dashboard-as-code glossary under Git, the standard pull request flow applies and allows us to easily version and asynchronously change the glossary.
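One way to get that reuse is to keep each metric's definition, description, and computation in a single place, and derive both the dashboard tooltips and the glossary from it. A minimal sketch under that assumption; the `Metric` class, the `conversion_rate` definition, and the `METRICS` registry are all hypothetical examples rather than anything from the system described here:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Metric:
    name: str
    description: str  # reused for hover text in dashboards and for the glossary
    compute: Callable[[Sequence[dict]], float]


def conversion_rate(rows):
    """Share of sessions that ended in a purchase (hypothetical definition)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["purchased"]) / len(rows)


METRICS = {
    "conversion_rate": Metric(
        name="Conversion rate",
        description="Sessions with a purchase divided by all sessions.",
        compute=conversion_rate,
    ),
}


def render_glossary(metrics):
    """Render the registry as a Markdown glossary, sorted by metric name."""
    lines = ["# Data glossary", ""]
    for metric in sorted(metrics.values(), key=lambda m: m.name):
        lines.append(f"- **{metric.name}**: {metric.description}")
    return "\n".join(lines)
```

With this shape, a change to a metric's definition and a change to its documentation land in the same pull request, so the two are reviewed together and cannot drift apart silently.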
The only place where implementation-level documentation should live is next to the code it documents. All modern programming languages support that, so this naturally ends up in your Git repository.
However, many orgs try to store the higher-level docs in “special” systems such as Confluence. I do not recommend it, especially for the technical departments. The “corporate wiki” is the place where things go to die, never to be seen or updated again. I’m yet to see an up-to-date corporate wiki, despite many orgs I worked for having teams and processes in place that are supposed to keep the wiki alive.
Thus, the best place to store all higher-level documentation such as “how to deploy the dashboards”, “how to provision local dev environment”, “what’s the high-level architecture of the data system”, “what are team processes”, etc. is the same exact monorepo where you store your code.
It gives you all the benefits of working with text in Git. And since it’s right in front of the people working with the system, in the main tool they work with day to day, it has a significantly higher chance of being updated than anything stored in an external system, no matter how much that external system is “optimized” or “nice” for the task.
Architectural Decision Records
One special case of documentation that I’d like to mention is a practice that a colleague on my team had used at a previous company and suggested we adopt: Architectural Decision Records, aka ADRs.
Adopting them has proven invaluable for us to keep track of “why” we did certain changes and simplified onboarding of new team members significantly. Check out the link above to learn more about this technique.
And, you guessed it, we just stored them as a folder of Markdown files in our monorepo :)
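A folder of ADRs stays tidy if new records get sequential numbers. Here is a minimal sketch of a helper that picks the next file name, assuming a `NNNN-short-title.md` naming convention. That convention is a common one, but it is an assumption here, not necessarily what our team used.

```python
import re
from pathlib import Path

# Assumed convention: four-digit number, dash, slug, ".md" (e.g. "0007-use-docker.md").
ADR_NAME = re.compile(r"^(\d{4})-.+\.md$")


def next_adr_path(adr_dir, title):
    """Return the path for a new ADR, numbered one past the highest existing record."""
    adr_dir = Path(adr_dir)
    numbers = []
    for path in adr_dir.glob("*.md"):
        match = ADR_NAME.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    number = max(numbers, default=0) + 1
    # Turn the title into a lowercase, dash-separated slug.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return adr_dir / f"{number:04d}-{slug}.md"
```

The new file then goes through a pull request like any other change, which gives the decision a reviewable discussion thread for free.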
Tasks, Roadmaps, OKRs
This was not implemented at that particular company, but I do use it for my personal projects, and if I need to manage a similar team again, I will do it this way for sure.
Typically, a company has a project management tool (be happy if it’s only one! usually it’s multiple incompatible ones) and spreadsheets or separate tools for tracking roadmaps and OKRs. Those tools usually have no versioning and don’t integrate well with other tooling.
What seems like a better approach is to, yup, use the “folder of Markdown files” approach yet again. You have checklists supported by Github-flavored Markdown already, and most other Markdown tools support checklists. On Github and Gitlab you can also @mention people and reference commits, issues, pull requests, etc.
For OKRs, you can either build a dashboard or express them as plain and simple Markdown files. A file per quarter will do, and version control improves accountability since all changes are public and it’s clear who authored which change where and what the scope of the change was. Same with roadmaps and changelogs.
Alternatively, it’s reasonable & easy to use the tools built into your Git GUI solution to track issues and connect them with pull requests.
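Since task lists are just text, progress can be computed straight from the Markdown files. A minimal sketch that counts checked and total items, assuming the usual task list syntax (`- [ ]` for open, `- [x]` for done); the function name is hypothetical:

```python
import re

# Task list item: a list marker, then "[ ]" or "[x]", then the task text.
TASK = re.compile(r"^\s*[-*+]\s+\[( |x|X)\]\s+(.*)$")


def checklist_progress(markdown):
    """Return (done, total) for the task list items in a Markdown document."""
    done = total = 0
    for line in markdown.splitlines():
        match = TASK.match(line)
        if match:
            total += 1
            if match.group(1).lower() == "x":
                done += 1
    return done, total
```

For instance, a sprint file containing one checked item, one unchecked item, and one checked nested item yields `(2, 3)`, and plain bullets are ignored.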
Github Flow as the “Default Process”
Apart from distributed version control, attribution (who did what to which assets), and storing related artifacts together, using Git provides you with a default process for all sorts of work: almost anything can be a pull request, go through a review, and be merged – the Github Flow.
- changes to code, dashboards, and notebooks are naturally done through the standard pull request process
- documentation, ADRs, data glossary - also works with this process!
- prioritization, selecting tasks for a sprint/iteration, assigning them - yup, just a bunch of changes to Markdown files that can be easily reviewed even by people with no technical knowledge.
And each version, comment, and artifact can be referenced and linked to. The access level can also be controlled from a central place. So much overhead and complexity erased compared to using a set of disjoint and incompatible tools for each of those.
If that sounds like something you’d like to experience, give the Git For Everything approach a try in your own team.