GitHub Flow, where feature branches are first-class citizens. It is important to acknowledge that other models exist. Always adjust to your current workflow.
A green branch is simply a Git branch in which all commits pass the tests, hence, are green. The reason why this is so important has to do with the three basic tasks we’re doing when facing other people’s code:
Let us explain why. For each aspect, green commits are so important, starting with the simplest during code review.
Submitting a pull request with red commits is akin to explicitly asking to merge code that do not behave as expected, or that the tests are not to be trusted. Either way, the code is obviously not ready to be merged.
Green commits by themselves don’t prove anything. It is possible that they are green for the wrong reason: false positives, flaky tests, misunderstanding of requirements. Red commits on the other hand, certainly prove that the original author does not believe its code to be correct.
In the absence of any test, the indication of correctness solely relies on the author’s word that it works. I don’t mind giving feedback on untested code, but I would probably not call this a code review, nor would I give my Approved on GitHub.
One could argue that it’s okay to have one red commit as long as the second one is explicitly labeled “Fix the tests”. I hope that by the end of this article, I will have changed your mind on that point.
Another important reason for having green commits is when we use Git to browse the history. Imagine you were to check out old commits in the so-called detached HEAD state, what would you do if your first noble reflex, to run the whole test suite, were to fail you?
This is what happens when a ‘yellow2__’ branch gets carelessly merged into development. People may have had a good reason to do it at the time - maybe the test was fixed in a subsequent commit. But how would you know if the commit message was not explicit? How can you branch off something from it in plain confidence?
Besides, having red commit will defeat the purpose of the git bisect command. Indeed, this command lets run use a divide & conquer approach to quickly find a commit that introduces a bug. If some of your commits are red for the wrong reason, this will prove really challenging to use it effectively.
Finally, the last reason why green branches are so important: when merging other people’s code into ours. In a perfect world, merges would always happen smoothly and Git would always automagically know what to do. But the world is far from perfect.
In the real world, merge conflicts happen. If code reviews are hard, merge conflicts are even harder.
First, conflicts require you to understand one of the multiple Git conflict-resolution layout styles which, depending on the layout you use, is arguably as hard as understanding monads. Then, you have to figure out which line to keep, which lines to drop, and, finally, which lines to actually merge by hand.
By merging by hand, I mean understanding that this conflict, presented in diff3:
function getUserById(id) {
change-signature
Should in all likelihood be resolved like this:
function getUserById(userId, opts) {
This implies to figure out that one side of the merge is concerned about making the signature more explicit (by diffing it with the common ancestor thanks to diff3) whereas the other side wants to allow for a second parameter and make the first parameter more explicit. Then, you need to figure out how to apply both changes at once! A tedious process that in practice is always both annoying and error-prone.
Now, after doing so, you’ll often want to run the tests to ensure your merge still makes sense. Unless…one of two sides of the merge did not pass the tests in the first place. In this case, resolving conflicts can prove to be really challenging.
The further you are to the ‘Unit’ end of the automated test spectrum, the less likely you will be to face this issue. However, if you test close to your code, this will happen frequently, which in practice will often result in coding the whole function again in pair-programming with the original author. People who have experienced this will tell you how annoying it is, both for the reviewer and the author.
Shade n°1 Each commit should pass the tests.
The second idea we came up with when figuring out what makes a git branch look great, were commit descriptions. By description, I mean commit messages and commit description bodies.
For commit messages, we settled for classic recommendations from seasoned Git users. This is the template we used:
Capitalized, short (50 char or less) summary
More detailed explanatory text, if necessary. Wrap it to about 72 characters or so. In some contexts, the first line is treated as the subject of an email and the rest of the text as the body. The blank line separating the summary from the body is critical (unless you omit the body entirely); tools like rebase can get confused if you run the two together.
A properly formed git commit subject line should always be able to complete the following sentence3:
If applied, this commit will …
The Capitalized present-tense first verb ensures that commits will look like Git automatic commit messages such as when doing a merge or a revert. The length limit of 50 characters will help ecosystem tools like GitHub always display the full message instead of truncating it with ‘…’, which is nice to have.
As do most habits, writing commit messages this way can take some time to get used to. However, I don’t think most of the value from a VCS tool like Git really lies with commit messages.
Description bodies are by far the most overlooked Git “feature” of all time, in my experience. In fact, they weren’t part of the guidelines my team wanted to enforce on each commit, but I decided to add it to this article as a personal note. I believe that each commit, even the ones consisting of what appear to be minor changes, should always have a description body.
I believe such a restrictive rule should be enforced because each single commit is always motivated by a reason. This reason is probably the most important piece of information after the code itself, and yet, by lack of proper description bodies, many commits fail to capture it. Eventually, the reason for the change will fade away its original author’s memory and nobody will ever remember it.
One common objection to the idea of always having description bodies is that sometimes, they really feel overkill. Cosmetic changes fall in this category. We’ll address this objection in guideline 5 on code churn.
Another objection is that, sometimes, the architecture of a project is so that features always go through the same series of steps, each one materialized by a different commit, which makes the description bodies quickly redundant. Here is an example of such well-defined steps that could occur:
I will admit that if your architecture is very mature and stable, and your codebase legacy-free, without a shred of dust, then it is true that commit messages such as:
0befe5d - Add a candidate table migration 1fe0f5c - Add a candidate model 2baef5b - Add a candidate repository 5fe0fbf - Add a candidate serializer
can encompass all the context of each change set. However in practice, I have yet to see a codebase like this. This is what i’m used to instead:
0befe5d - refactor mig directory and add candidate migration 1fe0f5c - Added a candidate model and some cleanup 22eff5a - Fixed tests after rebase 2baef5b - added a method users repository to fetch candidates 5fe0fbf - Change serializers signatures
Aside from the fact that these commits don’t follow the previous guideline regarding commit messages, there is nothing fundamentally wrong happening. The actual codebase could actually be very healthy, but the least we can say is that there are a lot more going on here than just creating a migration, a model, a repository and a serializer.
If those additional changes (refactoring, out-of-scope improvements…) were to be committed separately, maybe the branch history would resemble this:
1fe0f5c - Refactor the migration directory 0befe5d - Add a candidate table migration [x] 947ef42 - Leverage functional programming in model importer 1fe0f5c - Add a candidate model [x] 5fe0fbf - Add a method to user repository to fetch candidates 2baef5b - Extract candidate repo out of user repository 5fe0fbf - Update serializers to use jsonize2.js abc0fbf - Add a candidate serializer [x]
Then, perhaps commits [x] 0befe5d, 1fe0f5c and abc0fbf could skip having a description body without any loss of context. But the other ones beg some serious questions: why was the migration directory refactored? (assuming refactoring such a directory have any sense) Why the apparent need to leverage functional programming in the model importer? Why introducing the jsonize2 library in the serializer?
All these enigmas could probably be solved by the end of the day if their authors were still working on the project… but how about when they’re not around? How about when they are not working on the project anymore?
One could argue that at some point, some commits are too old to be relevant. Therefore, description bodies are overkill because by the time they could be useful, the modified code gets obsolete and newer versions of it take precedence; versions for which we can always be sure to find the author around, if the need for explanations arises.
To put it another way, “old” commits are virtually useless. I find this idea bothering for at least three reasons:
First, how old is old? There are absolutely no guarantees that the author will still be around at all, nor that anyone on the team for that matter, will be able to provide detailed explanations of why a change was made.
Second, being blocked on my task because I lack information that I can only get by taking a co-worker out of his zone is annoying at best, and unthinkable for other people.
Last but not least, the refactoring argument. Every refactoring usually requires first to understand the code subject to a refactoring. After all, if we are to change the code without altering the behavior, we better understand the initial behavior in the first place.
However, the code rarely says the whole truth about itself. Comments are notoriously hard to maintain, and self documenting code has limits. This is why we often find ourselves figuring out the code the hard way instead of simply reading it, just like we would read a book. This is where commit description bodies prove the most valuable to me. Understanding the whole truth about some code, why it was introduced in the first place, why it evolved this way.
For all these reasons, I believe that providing context to each commit, past and future, is a game changer for a project’s long term success.
Shade n°2 Each commit metadata should be templated (and mandatory).
No matter which side of the argument you’re on, one thing remains true either way: description bodies are related to the content of the commit. As we saw, depending on the maturity of your codebase, a description body may or may not be mandatory.
This brings us to our third recommendation: how should each commit be crafted?
As we just pointed out, the code rarely commits itself without human intervention. Each commit always has at least one reason to be. The third guideline is that each commit should have precisely one.
This is somewhat tantamount to the SRP, applied to Git commits. The only problem with this principle is that nobody really knows what a responsibility is. Although it is true that a few categories of responsibilities have reached consensus, such as the ones we used in our fictitious example before, heated arguments still happen everyday over what a responsibility is or is not. In fact, it is even likely that the original author had something completely different in mind when he coined the term.
For our purpose, we needed a more concrete definition. To specify what we meant by reason to be to avoid unnecessary debates. We came up with three useful criteria which we labeled typology, order and atomicity.
Commit’s typology refers to the type of commit. There can be only two kinds of commits : structurals and behaviorals. This binary way of classifying changes to the codebase happened to have a lot of benefits, as it was very easy to distinguish between the two.
Structurals commits were about changing the structure of the code without changing the behavior, a.k.a, refactorings.
Behavioral commits were commits that directly changed the behavior, a.k.a, new business requirements from the product owner.
Once commits are divided like this and commit messages are well written, the practice starts to feel more like Git grooming than Git hacking. This is a good sign. Let’s continue.
As is turns out the order in which these commits are written can also impact both the development and the code review productivity. Many of us had trouble figuring out what refactorings were in or out of scope of their current task. Just like with pull requests, before getting on the same page, we all did things differently, and behavioral commits showing up randomly between refactoring commits were quite common.
To readjust all these misalignments of expectations, we decided to follow a well-known craftsman’s adage that advocated first to make the change easy, then make the easy change, which became our second criterion. In practice, this led to the following history:
1fe0f5c - Add new algorithm to compute scoring (+250/-50)
To be rewritten:
1febf5c - Extract a scoring service from game service (+100/-100) 5ab025e - Add Nash’s algorithm to scoring service (+250/-50)
At this point, the history looks a little clearer, but that’s just splitting one commit in two. Nothing groundbreaking. Besides, the behavioral commit 5ab025e still contains a generous diff. This leads us to our third criteria regarding commit scope. Atomicity.
When going the atomic way, one must legitimately ask at which points a commit gets objectively too small. If Git could speak, his answer would be straightforward: never.
Git is designed to work well with the smallest possible diffs. Most of its job consists of tracking changes to the project’s directory and recording the tree as it grows4, which it does far more easily processing leaves of diffs than trunks.
But Git is no human. There are two competitive forces that tend to make commits expand bigger. Unfortunately, we believe that only one of them is a valid reason.
With behavior commits, we found that the line was easily drawn. Every method with its associated unit test seemed to be right candidates. Although remember, any block of code that gets in your way should be refactored and committed separately.
Structural commits on the other hand, are a different ball game, although not for the reason one might think. Coming up with a list of recurrent refactorings we do all the time and decide that those should appear in separate commits is the easy part. For a comprehensive list of good atomic commit candidates, Refactoring could prove useful.
The hard part is overcoming our own reluctances to making commits that atomic. Why? The reasons are unclear. However, I believe it is probably the same inner reason why we occasionally refrain ourselves from extracting one liners into a separate function.
Though these reasons have no sound basis. Furthermore, going against them for months have had really surprisingly great results. The initial psychological discomfort was quickly overruled by the mental comfort of reading crystal clear git log outputs and the productivity gain that followed. Besides sometimes, even with the slightest modification, making a separate commit just makes more sense.
For instance in the renaming of a file, high fan-in classes will cause a huge diff, and therefore a lot of noises when committed with another refactoring. Plus, if the renamed file includes more than 50%5 differences with the source file, Git will not treat it as a rename but as pair of addition/deletion of files, thereby disconnecting part of the history, unless the --follow flag is passed.
Building upon the previous example, such a history could be:
Make the change easy
1febfff - Extract a scoring service from game service (+100/-100) (struct.) 4febb5c - Extract a _computeScoring() method (+45/-40) (struct.) 2fbaac5 - Replace Promise calls by async/await (+15/-12) (struct.)
Make the easy change
5ab025e - Add Nash’s algorithm to scoring service (+50/0) (behavioral)
Ultimately, we would realize that the feature would be implemented by just adding fifty lines of code, which is much better than our initial monolithic commit with the diff 250+/50-.
As you gradually adopt these guidelines regarding the ideal shape of a commit, you will most likely realize that coding first and committing then may not be the ideal flow of operations. The git add --patch command certainly proves helpful in many situations, but it won’t let you chop chunks of code beneath a certain threshold, and even so, crafting commits _afte_r will quickly feel awkward.
Once you feel the awkwardness of trying to fit your diff in multiple commits after writing the code, it won’t be long before you find yourself hooked into what could be referred to as “Commit Driven Development”. Thinking about your next commit first, both its subject and description body, is the best way to ensure it will be as atomic and freestanding as possible - if only for the fact that it will force you to think thoroughly about the problem instead of rushing to code a solution.
Shade n°3 Each commit should be either structural or behavioral.
The three previous guidelines were somewhat agnostic of your git branching model. They could be considered solid advice for just about any commit. The following is specific to the GitFlow branching model, in which many feature branches originate from a main branch. When the feature is ready and peer approved after code review, the branch gets merged.
It is considered a best practice by many teams to always rebase its local branch the latest version of the common branch. This way, a developer can ensure that its latest work is compatible with the latest version of the code. The purpose of this practice is twofold:
First up is code review. Similarly to green commits, submitting your feature branch for review without ensuring first that it merges without conflict or that the resulting merge is functionally correct should be the responsibility of the team working on the branch, not the team reviewing the code.
This may be obvious for many people but I’ve seen developers argue about who should fix a merge conflict, as if it was a matter of courtesy. I don’t think that it is the case, nor that it should be. Such decisions should not be driven by a misplaced sense of politeness, but by economics instead. In this case, it is exceedingly easier for the team on the pull-request side to know how to reconcile two conflicting pieces of code than it is for the other side. Hence, always ensure your branch merges integrates smoothly.
Second is to have a linear history. When Git is seen as a tool to manage the source code history, we need to strive at making the output of:
git log --oneline master
to look like a poem instead of a rough draft. We want to be able to think of our application releases at three different levels of abstractions:
The application level:
v1.0 ---- v1.1 ---- v1.2 ---- v2.0 ---- v2.1 (origin/master)
The development level:
feat1 ---- feat2 ---- feat3 (origin/dev)
And finally, the feature level :
C1 ---- C2 ---- C3 ---- C4 ---- C5 (origin/my-feature)
When accidentally merging a branch without rebasing it first, the history becomes harder to read and the mental representation above distances itself from what really happened.
Shade n°4 Each pull-request should be based on dev.
Assuming your commits:
should we stop now, before being accused of Git fanaticism? On the one hand, the previous guidelines make for some really awesome branches. On the other, we’d be remiss to have gone this far and not discuss this last piece of advice - one that will make your Git branches shine so bright that your co-workers will need sunglasses to review your work.
This last one is best illustrated with an example. Consider the following history:
1febfff - Add a getByItemIds() method --------------------------------------- The order service needs to be able to fetch multiple products by item in one batch for performance reasons.
4febb5c - Refactor getByItemIds() to leverage lodash --------------------------------------- Most of the other repository methods leverage lodash in order to leverage the _(collection) syntax. This commit refactors the function to conform to what’s already present in the codebase.
54efa78 - Rename get into find to match semantics --------------------------------------- The current semantic for repository methods is that get should throw an error when nothing is found. The desired behavior here is to return an empty array [], hence the renaming getByItemIds() → findByItemIds()
What is wrong with this branch? All the previous guidelines have been duly respected, yet this three commits have a serious problem. Each commit obsoletes the previous ones, for a reason that has nothing to do with a change of business rules.
To put it another way, we are being presented the chaotic and tedious development phase the author went through. The problem being that this information has no added-value for the reviewer.
When multiple commits affect the same group of lines in a row, we increase the code churn. A churn like this is perfectly okay during development. However, before sharing your code with the rest of the world, eliminating these redundancies is much appreciated. To make the magic happen, simply rebase your work on top of the latest version of development, using the interactive flag:
git fetch -p git rebase -i origin/dev
This will open up an editor asking you what you want to do:
pick 1febf5c Add a getByItemIds() method pick 4febb5c Refactor getByItemIds() to leverage lodash pick 54efa78 Rename get into find to match semantics
In our case, we would like to merge all commits, and adjust the resulting commit message. We will instruct Git to fixup 4febb5c into 1febf5c, to fixup 54efa78 into the result of the previous fixup, and finally to reword the result:
reword 1febf5c Add a getByItemIds() method fixup 4febb5c Refactor getByItemIds() to leverage lodash fixup 54efa78 Rename get into find to match semantics
After editing the commit message interactively, our history now looks much cleaner, with one single purpose commit:
f42eabc - Add a findByItemIds() method --------------------------------------- The order service needs to be able to fetch multiple products by item in one batch for performance reasons.
Interestingly, only the commit message changed. The description body did not since all subsequent commits were simple refactorings and added no business value.
Shade n°5 Use interactive rebases to reduce the churn.
Once you engage in an interactive rebase, things can get messy faster than expected. Although a git rebase status command exists, reducing the cognitive load by any means necessary is a healthy practice. Here are some advice that will help you manage the complexity.
The first one is always to do one action per interactive rebase6. The reason is because if one of your actions were to fail and get you to a point where it would be easier to just start over, all your previous actions would be lost. Therefore, it is always a best practice to chain multiple interactive rebases instead of doing it all at once. To say it differently, use interactive rebases a lot, but keep the interactive part to a minimum every time.
During rebases, interactives or not, Git offerts a convenience command:
git rebase --abort
This commands allows you to cancel any modification happening with the current rebase. However, once the rebase has been done, the command does not work. To go back in the state you were at before doing multiple successful interactive rebases, simply hard reset your current local branch on the remote tracked one:
git reset --hard origin/<my-branch>
This command is useful when you are certain that you have messed something up. To ensure that you have not messed something up after doing multiple squash, fixup, edit and reword actions, simply ask for the diff between the result of your current branch and the one before rebase:
git diff <my-branch-after-interactive-rebase> origin/<my-branch>
If the diff gives you an empty output, you’re good to go. Otherwise, it probably means that you deviated a bit and need to double check on it before pushing to the remote.
Now, no Git article would be complete without a scary and, equally importantly, a virtually useless Git command. One that the author would use to insinuate that even Hamano has got nothing on him. Sadly for me, I did not shape the following… it was given as it is by a co-worker and as turns out, it is quite useful. I call it the GitHub-diff command, as it is the diff that GitHub presents you when you open a pull-request:
git diff --numstat HEAD $(git log --reverse --boundary --format=%h HEAD ^origin/dev | head -1) | awk '{plus+=$1;minus+=$2} END {print "+"plus" -"minus}'
The goal of this command7 is to find the diff between your branch and the main development branch. This is what GitHub presents you when your doing a pull-request. Let’s break it down into three parts.
First, we need to find the forking point. We do so by requesting all the commits accessible from the tip of our branch but not from the development branch. Then, we ask Git to include the boundary (the forking point), reverse the order to feed it to the head utility and take the first line.
$(git log --reverse --boundary --format=%h HEAD ^origin/dev|head -1)
Then, we ask git the diff between this forking point and our HEAD in a specific format with --numstat.
git diff --numstat HEAD ${forking point}
Finally, we feed it to the awk program… and, well, man awk for the rest.
What I like about this command is that it lets me precisely control the diff that I will send in each of my pull requests, to ensure they don’t get too large. I'm sure it can be used in many other creative ways that I haven't thought of.
Shade n°6 Sharpen your interactive rebases skills everyday.
For all the time we spent discussing these guidelines, there was one aspect of our workflow regarding Git, on which we could not reach consensus. The only commits that are allowed to land on the branch after the branch has been pushed, caused by the Changes requested label on GitHub. .
On the one hand, pushing a “Code review” commit will have both the benefit of being quick and preserving the information that was modified because of peer review. It will also ensure that no code pushed to origin and collectively discussed will disappear from the radars.
On the other hand, this can quickly stain the Git history with commits that can be both long and hard to read because they can get potentially lengthy, and whose sole context will be that they were caused by a code review. The naive fix would be to make multiple code review commits, but then cause a lot of churn, which, as per guideline #5, is something we would like to avoid.
Consequently, it is up to you to decide what you do with these changes. I personally tend to favor the history’s readability and lose track of code reviews discussions. Although occasionally, I consider the first one to be more appropriate.
Shade n°7 Figure out code-review commits for yourself ¯\_(ツ)_/¯
Past a certain degree of complexity, lone coders disappear. No single human mind can maintain millions of line of code on its own. Against popular belief, professional software development is a fundamentally collaborative discipline.
Thanks to my colleagues, I realized that Git could be more than just a tool to save and share my work… that should be more. It should be used as a tool to make the result of beautiful years of hard collaborative work shine brightly.
If you’ve read this far, I hope that this article has helped you realize that too.
^1 ^ That one made me feel like a code ninja everytime I typed it ^2 ^ Half green, half red, shameless attempt to coin a technical term ^3 ^ Some strategies put this to the next level by adding semantics to the subject. Git semver and conventional commits are populars implementations of this practice ^4 ^ Figuring out the diff between two files is a hard problem. Hard as in NP hard. In fact at its inception, Git delegated this task to the unix diff executable. 5 The default value. man git-diff for more information about the heuristics Git uses. 6 It is the exact opposite of what we did demonstrating how to reduce the churn. 7 There are less dramatic ways to find the forking point, such as git merge-base HEAD dev
The official Git documentation https://git-scm.com/doc
About the different branching models: Git flow vs trunk-based development https://www.toptal.com/software/trunk-based-development-git-flow
A great article on the advantages of a single big repository over many small ones. https://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.html
A 4-part article masquerading as a code review of Git itself: https://fabiensanglard.net/git_code_review/diff.php
A great talk discussing numerous aspects of Git.