Every non-trivial piece of software — every model, every pipeline, every research script that outlives the afternoon it was written — lives inside a version control system. For the last twenty years that system has, in practice, been Git. The tool is famously confusing, the mental model is worth acquiring, and the team habits built on top of it are the difference between a codebase that compounds and one that collapses under its own weight. This chapter covers Git at the level a practitioner needs, then the collaborative practices — branching strategies, code review, releases — that turn it from a personal tool into a team one.
The first five sections build the mental model: what version control is for, the snapshot-based way Git represents history, the three-state flow between working directory and repository, the day-to-day commands, and branches as movable pointers. Sections six through eleven cover the collaboration mechanics — merging and rebasing, resolving conflicts, working with remotes, the pull-request review loop, the main branching strategies, and the small discipline of good commits and messages. Sections twelve and thirteen are about working safely with a history that can feel fragile: how to undo, and how to rewrite. Sections fourteen through sixteen cover tags and semantic versioning, the special case of large binary artefacts (models, datasets), and the collaboration tooling — issues, project boards, CI hooks — that grows up around git. The last section brings it back to ML, where reproducibility and experiment tracking live or die by these habits.
Conventions: command-line examples assume git 2.40+ and a Unix-like shell. Where GitHub, GitLab, and Bitbucket differ materially, the difference is called out; otherwise a "pull request" and a "merge request" are the same object under two names. The goal is to leave a practitioner fluent enough to diagnose what git is doing, reason about branch history, and cooperate with a team of any size without dropping commits on the floor.
Almost every answer to "why do I need version control?" is really an answer to a quieter question: how do I work on this tomorrow without being afraid of today's mistakes? Version control is the tool that makes software engineering reversible — and it is that reversibility, more than any single feature, that changes how confidently a team can change its own code.
A codebase is a living document edited by many hands over months or years. Without a version control system, every change is committed to one shared present: if someone breaks a function, the previous working version is lost; if two people edit the same file, one silently overwrites the other; if the question "what did this file look like last Tuesday?" needs an answer, there is none. Version control stores the full history of every change, tracks who made it and why, and allows a project to go backward as cheaply as it goes forward.
The deepest effect of version control is behavioural, not archival. Once every change is recoverable, engineers are willing to try things — refactors, experiments, risky renames — that would otherwise feel too expensive to reverse. A codebase whose authors can safely experiment stays healthier than one whose authors cannot, and most of the engineering discipline this chapter describes is really a scaffolding around that single fact.
Git, designed by Linus Torvalds in 2005 to manage the Linux kernel, has won so completely that "version control" and "git" are, in practice, synonyms. It is distributed — every clone is a full repository, with the complete history — and it is fast enough that routine operations (commit, branch, diff) are essentially free. Its user interface is notoriously uneven, built for the author of the Linux kernel rather than a new developer on day one; the payoff for learning it is that almost every open-source project, every major company, and every cloud platform speaks it natively.
For ML practitioners, version control earns its place twice: once for the code that trains and serves models, and a second time for the artefacts around it — configurations, dataset snapshots, evaluation reports, experiment logs. A model that cannot be re-trained from a specific commit on a specific dataset is a model no one can reproduce; a research project whose notebooks are scattered across laptops is one whose claims no one can check. The discipline of putting things in git (or in systems that extend git, which the later sections cover) is the difference between research that compounds and research that has to be redone every quarter.
Version control is not, primarily, a way to save old copies of files. It is the infrastructure that makes change safe. Every practice in this chapter — branches, commits, reviews, rebases — is a refinement of that core: keep the project moving forward without anyone being afraid to press enter.
Most version control systems before Git stored history as a chain of diffs — "file X changed in these lines" — which is the intuitive model but the wrong one for understanding how Git behaves. Git stores snapshots of the whole tree at each commit and computes diffs on demand. Getting this inversion right is the prerequisite for almost every advanced operation.
When you run git commit, Git takes the state of every tracked file at that moment and stores it — in full, not as a patch — addressed by a SHA-1 hash of its contents. Files that have not changed since the previous commit are not re-stored; Git reuses the existing hashed blob by reference. The result is a content-addressable store of snapshots, linked together by parent pointers, with each snapshot identified by a forty-character hexadecimal hash.
A commit is a small object that records: the tree (a pointer to the root directory snapshot), one or more parent commits (usually one; two for merges), an author, a committer, a timestamp, and a commit message. That is all. Everything else — branches, tags, the log output — is derived from traversing the graph of commits backward through parent pointers.
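The anatomy is easy to verify first-hand. A minimal sketch in a throwaway repository (the file name and commit message are illustrative): git cat-file -p prints a commit object raw, and it contains exactly the fields listed above and nothing else.

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b main
git config user.email "dev@example.com"
git config user.name "Dev"

echo "hello" > readme.txt
git add readme.txt
git commit -q -m "Initial project skeleton"

# Print the raw commit object: tree pointer, author, committer, message.
# This first commit has no parent line; later commits would show one,
# and a merge commit would show two.
git cat-file -p HEAD
```

The output is a handful of short lines; everything else git shows you about a commit is derived from these fields plus graph traversal.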
Because commits have parent pointers, the history of a repository is a directed acyclic graph — not a linear timeline. Branches create new tips in the graph; merges create commits with two parents. Almost every git command is, under the hood, a query or a manipulation on this graph. Learning to read git log --graph --oneline is the single best investment for understanding what a repository's history actually looks like.
* a4f91b2 (HEAD -> main) Merge branch 'feature-x'
|\
| * 7c3d108 (feature-x) Add validation for edge case
| * 2b8e5f0 Wire new endpoint into router
|/
* 58fa221 Refactor config loader
* 91c0e7d Initial project skeleton
A commit's SHA is its permanent identifier; two commits with identical contents, identical trees, identical parents, and identical metadata have identical SHAs in any git repository in the world. This is the property that makes git's distributed model work: clones do not share version numbers; they share content-addressed objects. When a coworker references a commit by its first seven characters (a4f91b2), you know exactly which snapshot they mean, forever.
When something in git surprises you, re-state the question in terms of snapshots and parent pointers. Most of the weirdness disappears once you stop thinking of history as a list of diffs and start thinking of it as an append-only graph of content-addressed snapshots.
Git maintains three states for every tracked file, and nearly every command either inspects or moves a file between two of them. The three-state model is the small map that makes the daily commands make sense; without it, add, commit, restore, and reset feel like an arbitrary set of incantations.
The working directory is the files on disk you edit with your text editor. The index (also called the staging area) is a small manifest of what will go into the next commit — it holds the contents of the files you have marked as ready. The repository (under .git/) is the committed, permanent history: the graph of snapshots.
Files move between these three states with a small set of commands. git add copies changes from the working directory into the index. git commit takes the current state of the index and writes it as a new snapshot into the repository. git restore --staged unstages a change by resetting the index entry to match the last commit, without touching the file on disk; plain git restore discards a working-directory change by restoring the file from the index. The mental picture is three buckets and arrows between them; almost every everyday operation is one of those arrows.
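The arrows can be watched directly. A sketch in a throwaway repository (file names are illustrative), walking one change into the index and back out again:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b main
git config user.email "dev@example.com"; git config user.name "Dev"
echo "v1" > notes.txt
git add notes.txt && git commit -q -m "Add notes"

echo "v2" > notes.txt          # edit: working directory now differs from index
git add notes.txt              # working directory -> index (staged)
git restore --staged notes.txt # index reset to match HEAD; disk still says v2
git restore notes.txt          # working directory restored from index
cat notes.txt                  # -> v1
```

After the two restores, git status reports a clean tree: the change has been walked back out of both the index and the working directory without touching history.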
Developers new to git sometimes wonder why the index exists at all — why not just commit the working directory? The answer is that the index lets you shape a commit: stage only some of your changes, leave others in the working directory for a later commit, build up a coherent set of modifications even when your editor has been busy. git add -p (for "patch") lets you stage individual hunks within a file, which is the key to turning a week of scattered editing into a clean series of small, purposeful commits.
git status is the map: it shows what is in the working directory (modified, untracked), what is staged, and what the current branch is. git diff by default shows working-directory-versus-index changes; git diff --staged shows index-versus-last-commit. git log shows the repository's commit history. These four commands — status, diff, diff --staged, log — are the primary way you orient yourself before running anything that changes state.
When a git command surprises you, run git status first. It tells you which of the three states you are actually in and what the next legal operation is. Most "git is broken" moments dissolve within two minutes of reading it.
Ninety percent of a working engineer's git usage is a dozen commands in predictable combinations. If those dozen become reflex, the other ninety commands can stay in the manual until you need them. This section is that dozen, in the order you will type them most days.
A day typically starts with git pull (or git fetch followed by git merge/rebase; the difference appears later) to take in whatever has been committed to the shared branch overnight. Then git status to see what is in flight, git log --oneline -10 to see the last ten commits, and git branch to see which branch you are on.
Make your changes. Run tests. Run git diff to read your own change the way a reviewer would. Stage the pieces that belong together with git add path/to/file (or git add -p for hunks). Run git status again to confirm what is staged. Then git commit -m "message" — or, for a multi-line message, git commit with no flag, which opens your editor.
# Make and inspect changes
$ git status
$ git diff
# Stage the pieces that belong together
$ git add src/feature.py tests/test_feature.py
$ git diff --staged
# Commit with a real message
$ git commit -m "Add feature X and its tests"
When you are ready, git push sends your commits to the remote. If the remote has moved in the meantime, push will be refused, and you will need to git pull --rebase (or merge) before pushing again. This cycle — pull, work, commit, push — is the entire rhythm of working on a shared branch; everything else is refinement.
Most working engineers define a handful of aliases in ~/.gitconfig — git co for checkout, git st for status, git lg for a pretty log — and these become invisible infrastructure within weeks. The first alias worth adding is git lg = log --oneline --graph --decorate --all: it compresses the entire repository's recent history into one readable screen.
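The aliases live in the [alias] section of ~/.gitconfig. A typical starter set looks like the following (the short names are personal conventions, not git built-ins):

```ini
# ~/.gitconfig
[alias]
    co = checkout
    st = status
    lg = log --oneline --graph --decorate --all
```

Anything git accepts on the command line after its own name can go on the right-hand side, so an alias is just a saved invocation.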
Good git usage is not memorising exotic commands. It is running git status and git diff before every commit, committing small coherent changes, and pushing when tests pass. Every advanced operation in the later sections is built on this loop.
In most pre-git systems, a "branch" was a heavyweight operation — copying a tree of files, sometimes to a new server. In Git it is one of the cheapest things in the system: a branch is a forty-one-byte file in the repository whose content is the SHA of a commit. Creating and destroying branches is essentially free, and this is the feature on which modern workflows are built.
Under .git/refs/heads/ you will find one text file per branch, each containing a single commit hash. That is the branch. When you commit on a branch, two things happen: a new snapshot is written to the repository, and the branch's file is updated to point at the new commit. "Moving between branches" is really moving a pointer called HEAD between these files.
HEAD is the reference that says "which branch are you on right now?" — it is usually a pointer to a branch (e.g., ref: refs/heads/main). When you git checkout main, HEAD updates to point to the main branch; when you then commit, the commit is written, and the branch HEAD points to is advanced to the new commit. The indirection is what makes "commit on this branch" a simple operation even though branches share most of their history.
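Both facts — the branch as a file, HEAD as an indirection — are observable in any fresh repository (this sketch assumes git's default loose-ref storage; long-lived repositories may pack refs into a single file):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b main
git config user.email "dev@example.com"; git config user.name "Dev"
echo hi > f.txt && git add f.txt && git commit -q -m "Initial commit"

cat .git/HEAD              # -> ref: refs/heads/main
cat .git/refs/heads/main   # the branch itself: one commit hash
git rev-parse main         # the same hash, via the porcelain command
```

Committing again would rewrite .git/refs/heads/main with the new hash; nothing else about "the branch" changes, because there is nothing else to it.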
git branch feature-x creates a new branch pointing at your current commit; git switch feature-x (or the older git checkout feature-x) points HEAD at it. git switch -c feature-x does both in one step. git branch -d feature-x deletes a branch that has been merged; git branch -D feature-x deletes it even if it has not (with the caveat that you may strand commits reachable only from that branch).
# Create and switch in one step
$ git switch -c feature-x
# Do some work, commit a few times
$ git commit -m "Wire feature X into the handler"
# Go back to main, branch still exists
$ git switch main
# Delete when done (refuses if unmerged)
$ git branch -d feature-x
The consequence of branches being cheap is that feature work, experiments, and bug fixes each live on their own short-lived branch. The common shape is: branch off main, work, open a pull request, merge back, delete the branch. Long-lived branches and elaborate branching strategies — which the next few sections cover — are exceptions to this default, not the norm. A healthy repository typically has a handful of active branches and dozens of short-lived ones born and dying every week.
A branch is a disposable label pointing at a commit. Treat it as a note you are leaving for yourself, not a place you are moving into. Create branches freely, delete them freely, and let the commit graph — not the branch names — be the durable history.
When you are done on a feature branch, its commits have to rejoin the main line. Git offers two fundamentally different ways to do this — merge and rebase — and teams argue about which is better as if it were a religious question. It is not; they produce different shapes of history, and both are correct tools for different purposes.
git merge feature-x from the main branch creates a new merge commit — a commit with two parents, the previous tip of main and the tip of feature-x. The merge commit's tree is the result of combining both lines of development. The full history of the feature branch stays in the graph, with a visible fork and join. This is the "true history" view: every commit that ever happened is preserved in place.
main      A---B---C-------M
               \         /
feature         D-------E

after git checkout main && git merge feature,
M is a merge commit with two parents (C and E)
If the main branch has not moved since feature-x branched off, git merge feature-x does not create a merge commit at all — it simply moves main's pointer forward to the tip of feature-x. This is a fast-forward merge, and in the history it is indistinguishable from the feature having been committed directly on main. Teams that prefer linear histories often insist on fast-forwardable branches for exactly this reason.
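A sketch of the fast-forward case in a throwaway repository (branch and file names are illustrative); --ff-only makes the intent explicit by refusing to create a merge commit if one would be needed:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b main
git config user.email "dev@example.com"; git config user.name "Dev"
echo a > a.txt && git add a.txt && git commit -q -m "A"

git switch -q -c feature-x
echo b > b.txt && git add b.txt && git commit -q -m "B"

git switch -q main               # main has not moved since feature-x branched
git merge --ff-only feature-x    # fast-forward: no merge commit is created
git rev-parse main feature-x     # identical hashes: the pointer simply moved
```

After the merge, both branch files contain the same SHA; the only record that feature-x ever existed is the branch name itself.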
git rebase main (run on feature-x) takes each commit of the feature branch and re-applies it on top of main's current tip, producing a new linear sequence of commits. The old commits are abandoned (still reachable from the reflog for a while), and the branch now looks as if it had been written against the latest main all along. The history becomes a straight line; the original fork is erased.
main      A---B---C
               \
feature         D---E          (before rebase)

main      A---B---C
                   \
feature             D'---E'    (after rebase onto C)
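The rebase pictured here can be reproduced in a throwaway repository (names are illustrative). After the rebase, main's tip is an ancestor of the feature branch — the history is a straight line:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b main
git config user.email "dev@example.com"; git config user.name "Dev"
echo a > a.txt && git add a.txt && git commit -q -m "A"
echo b > b.txt && git add b.txt && git commit -q -m "B"

git switch -q -c feature
echo d > d.txt && git add d.txt && git commit -q -m "D"
echo e > e.txt && git add e.txt && git commit -q -m "E"

git switch -q main
echo c > c.txt && git add c.txt && git commit -q -m "C"   # main moves on

git switch -q feature
git rebase -q main     # re-apply D and E on top of C, as new commits
git log --oneline      # subjects read E, D, C, B, A -- but D and E have new SHAs
```

Comparing the SHAs of D and E before and after the rebase shows the rewrite: the old commits are abandoned (recoverable via the reflog for a while) and replaced by fresh ones with different parents.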
The working rule: rebase your own local work before sharing it — to clean it up, to put it on top of the latest main, to produce a tidy sequence of commits for review. Merge shared branches back into main, so that the merge commit records the fact that a body of work rejoined the trunk. Never rebase a branch that other people have pulled: their local commits will be orphaned and their next pull will produce a mess. This is the golden rule of rebase — do not rewrite public history.
Use rebase locally to make your own history readable; use merge to integrate shared branches. Linear history is a preference, not a moral position — some teams prefer merge commits as a record of what happened, others prefer rebase for its cleanliness. Both are fine; pick one and stay consistent.
A merge conflict happens when two branches have edited the same region of the same file and Git cannot decide which edit to keep. It is not an error, not a bug, and not cause for concern; it is Git asking for human judgement. Treating conflicts as routine, rather than as a crisis, is most of what separates comfortable Git users from anxious ones.
When a merge or rebase cannot auto-resolve a change, Git writes the file with both alternatives marked by conflict markers — <<<<<<<, =======, >>>>>>> — and pauses. The file on disk is not the final state; it is a document asking for your decision. Your job is to edit the file so the markers are gone and only the intended resolution remains, then git add it and continue the operation.
def compute_total(cart):
<<<<<<< HEAD
    return sum(item.price * item.qty for item in cart)
=======
    return sum(item.price * item.quantity for item in cart)
>>>>>>> feature-rename-qty
The tool that resolves a conflict is your text editor (or a three-way merge tool such as meld, kdiff3, or the IDE's built-in view). Decide what the file should say, write that, save, and stage. git status will tell you there are still conflicted files remaining; when all are resolved and staged, git merge --continue (or git rebase --continue) completes the operation.
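The whole cycle — conflict, resolve, continue — fits in a few commands. A sketch in a throwaway repository (the branch name echoes the example above; file contents are illustrative):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b main
git config user.email "dev@example.com"; git config user.name "Dev"
echo "qty" > field.txt && git add field.txt && git commit -q -m "Base"

git switch -q -c feature-rename-qty
echo "quantity" > field.txt && git commit -q -am "Rename to quantity"

git switch -q main
echo "qty_count" > field.txt && git commit -q -am "Rename to qty_count"

git merge feature-rename-qty || true   # conflict: both sides edited the same line
grep "<<<<<<<" field.txt               # conflict markers are now in the file

echo "quantity" > field.txt            # decide what the file should say
git add field.txt                      # staging marks the conflict resolved
GIT_EDITOR=true git merge --continue   # complete the merge non-interactively
```

In interactive use you would simply run git merge --continue and let your editor confirm the merge message; the GIT_EDITOR=true here just keeps the sketch scriptable.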
If a merge or rebase goes in a direction you did not intend, git merge --abort or git rebase --abort returns the repository to its pre-operation state, as if nothing had happened. This is the escape hatch. Use it freely; rebasing or merging is never a one-shot decision. Before aborting, however, check whether the conflicts are actually small — most are.
The cheapest conflict is the one that never happens. Short-lived branches conflict less than long ones; PRs under a few hundred lines conflict less than thousand-line monsters; rebasing frequently onto main keeps a branch current. A team whose branches all last under a week will see conflicts rarely, and trivially when they do; a team with six-month feature branches will be resolving conflicts for days when those branches eventually try to land.
If conflicts feel scary, your branches are too long-lived. The fix is shorter branches, not braver developers. A conflict that involves twenty lines is thirty seconds of work; one involving two thousand is two days of work and three meetings.
Git's headline design choice — the one distinguishing it from everything that came before — is that every clone is a complete, independent repository with the full history. A remote is simply another clone that yours knows about by name and URL. Understanding this shape makes terms like "origin", "upstream", "fetch", and "push" concrete instead of confusing.
When you git clone a repository, Git stores the URL you cloned from under the name origin. Your local branch main is created to track origin/main, which is your local cached reference to where main was on the remote when you last synced. The local and remote-tracking branches are two independent pointers that stay close to one another by convention, not by magic.
In open-source workflows, "upstream" conventionally refers to the original project, while "origin" is your personal fork. A typical open-source contributor has two remotes: origin (their own fork on GitHub), and upstream (the project they forked from). Changes flow: upstream/main → origin/main → local main, and pull requests go from your fork back toward the upstream project.
git fetch updates your local copy of the remote-tracking branches — it brings down new commits from the remote but does not touch your local working branch. Your local main stays where it was; only origin/main advances. git pull is git fetch followed by git merge (or git rebase, if configured): it updates your local branch with the new commits. Use fetch when you want to look before you leap; use pull when you are ready to update.
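The fetch/pull distinction can be watched with a local bare repository standing in for the hosted remote (directory names are illustrative): after one clone pushes a new commit, the other clone's fetch advances origin/main while leaving its local main behind.

```shell
set -e
base=$(mktemp -d); cd "$base"
git init -q --bare -b main hub.git         # stands in for the hosted "origin"

git clone -q hub.git alice 2>/dev/null     # first clone seeds the repository
( cd alice
  git config user.email alice@example.com; git config user.name Alice
  echo one > f.txt; git add f.txt; git commit -q -m "One"
  git branch -M main
  git push -q -u origin main )

git clone -q hub.git bob                   # bob's clone has commit "One"

( cd alice                                 # meanwhile, alice pushes "Two"
  echo two >> f.txt; git commit -q -am "Two"; git push -q )

cd bob
git fetch -q                               # origin/main advances; local main does not
git rev-list --count main..origin/main     # commits on the remote we lack: 1
git pull -q                                # now local main catches up
```

Between the fetch and the pull, bob can inspect the incoming commits (git log main..origin/main) before deciding to integrate them — that is the "look before you leap" that fetch buys.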
git push sends your new commits to the remote. If the remote branch has new commits that you do not have locally, Git refuses the push — not to be difficult, but because accepting it would erase those commits. The fix is to git pull (or fetch + rebase) first, integrate the remote changes, and then push again. Force-pushing (git push --force, or the safer git push --force-with-lease, which still refuses if the remote has moved since you last fetched) overrides this check and should be used only on branches you alone are working on.
Git is technically peer-to-peer — any two clones can sync with each other — but nearly every team in practice designates one remote (usually on GitHub, GitLab, or Bitbucket) as the canonical source of truth. The distributed architecture is still valuable: offline work, robustness to server outages, the ability to prototype a workflow without configuring the central server. But the day-to-day mental model is one central repository with many clones, and the workflow conventions that follow assume that.
When a remote command feels mysterious, check which refs are involved. git branch -vv shows each local branch alongside the remote-tracking branch it follows. Nine times out of ten, the surprise is that your mental picture of "the remote" and Git's local cached copy of it have diverged.
The pull request — "PR" on GitHub and Bitbucket, "merge request" on GitLab — is the social protocol layered on top of branches. A branch is proposed for integration; reviewers read it, comment, and approve; CI runs its checks; and then it is merged. Almost every team larger than two people has converged on some variant of this workflow, and the practices around it are more important than the specific platform.
Technically, a pull request is a request to merge one branch into another, usually a feature branch into main. The hosting platform shows the diff, runs configured CI checks, and collects comments and approvals. When enough approvals are gathered and checks pass, the PR is merged — by a button click, typically, but under the hood this is the same git merge or git rebase operation that would happen at the command line.
The single most important rule of PRs is that small ones get better reviews. Three hundred lines is a useful upper bound; a PR of a thousand lines is reviewed cursorily at best, because reviewer attention is finite. If a change genuinely needs more lines, it almost always decomposes into two or three PRs that land sequentially: rename first, then refactor, then add the feature. Each small PR gets read; the big PR gets rubber-stamped.
A good PR description explains what changed, why, and how to verify. Link the relevant issue or ticket. Call out anything unusual about the implementation. If the change touches config, migrations, or deployment, say so explicitly. The reviewer should not have to reverse-engineer your intent from the diff; you already have it in your head, so write it down.
Code review is where a team's standards are transmitted, and a good review is collaborative, not adversarial. Read the code for correctness, design, and safety — the things humans do well — and let linters and formatters argue about whitespace. Comment why, not just what. Acknowledge what is good, not just what needs to change; reviews that arrive as a wall of critiques lose authors faster than any other dynamic. And know when to approve: if a change is fine with small improvements, approve with comments rather than block.
Authors and reviewers sometimes disagree on design. The cheapest tie-break is: if it is a matter of taste with similar long-term cost, the author gets their way; if it is a matter of correctness or a commitment the team has made, the reviewer does. Disagreements that resist this rule — deep architectural choices, major refactors — deserve a thirty-minute conversation, not a twenty-comment thread. Nothing kills team velocity like PRs that become litigation.
A PR is a conversation about a change, not a submission to a judge. The reviewer's job is to help the author ship; the author's job is to make that job easy. Teams that internalise this have PRs that merge in hours, not days, and they ship quickly without skipping the read.
A branching strategy is the team's agreement about where code lives before it reaches production: how many long-lived branches there are, how features enter them, and how releases are cut. Three strategies dominate, and the right one depends more on how often the team releases than on any abstract preference.
Trunk-based development is the simplest and, for most modern teams, the default. There is one long-lived branch — main — and all work happens on short-lived feature branches that merge back within a day or two. Unfinished features hide behind feature flags rather than staying on a branch. The entire team is always within a day or two of every other team member's changes, and integration pain is amortised into small, routine merges. This is how Google and most continuous-deployment shops work.
GitHub Flow is trunk-based with a slightly more ceremonial shape: one main branch, feature branches off of it, pull requests for review, merge back to main, deploy whatever is on main. It is the dominant pattern for web applications and open-source projects, and it is what GitHub's UI encourages by default. The only meaningful difference from pure trunk-based development is the PR step, which adds a forced review pause.
GitFlow (Vincent Driessen, 2010) is a more elaborate model with two long-lived branches — main and develop — plus short-lived branches for features, releases, and hotfixes. It was designed for software with scheduled, versioned releases (installable desktop apps, on-premises software) and tends to add complexity that modern web teams do not need. Driessen himself has since noted that GitFlow is not the right default for continuously deployed software.
The pattern that makes trunk-based development practical is feature flags — boolean toggles in configuration that turn features on and off per user, per environment, or per deploy. With flags, a half-built feature can live on main without being visible, then be switched on for an internal test, then for a small rollout, then for everyone. Branches stay short; risk stays low; experiments become routine. Tools like LaunchDarkly, Unleash, and the open-source Flagsmith exist to manage flags at scale.
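The mechanics of a flag check are small. A minimal sketch in Python — the flag store, flag names, and percentage-rollout rule here are all illustrative, not the API of any of the products named above, which add persistence, targeting rules, and dashboards on top of this core idea:

```python
import hashlib

# Hypothetical in-process flag store; real systems load this from a service.
FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 25},
    "dark_mode":    {"enabled": False, "rollout_percent": 0},
}

def flag_enabled(name: str, user_id: str) -> bool:
    """A flag is on for a user if it is enabled and the user's stable
    hash bucket falls inside the rollout percentage. Hashing the user id
    makes the decision deterministic: the same user always gets the same
    answer for a given percentage."""
    flag = FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

# Half-built code ships on main but stays dark until the flag flips:
if flag_enabled("new_checkout", user_id="user-42"):
    pass  # new code path
else:
    pass  # existing behaviour
```

Raising rollout_percent from 25 to 100 is the "release"; no branch merges, no deploys. That is what keeps branches short under trunk-based development.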
For a team that deploys daily, trunk-based development with feature flags is the right default. For a team that deploys weekly, GitHub Flow is fine. For software released on a cadence that is measured in months — embedded firmware, packaged enterprise software — GitFlow or one of its descendants may earn its keep. In any case, the worst choice is to use GitFlow by ritual on software that does not need it; the complexity is real and pays for itself only in a narrow band of release patterns.
Pick the simplest strategy that supports your release cadence. Trunk-based development is usually that strategy, and the teams that resist it usually do so for reasons of history rather than engineering. If the question is "should we add a long-lived branch?", the first answer is almost always "not unless we have to".
Commits are the permanent record of a project, and the discipline of writing them well is one of the cheapest and highest-leverage habits a team can build. A year from now, the git log is the only thing left of most of today's decisions; what it says determines how fast anyone can understand why the system is the way it is.
A commit should do one thing. "Rename the function and add the feature that uses it" is two commits. "Refactor the handler and fix the bug" is two commits. The rule is: a commit should be individually revertable without breaking the build, and reviewable without having to split it in your head. This is the discipline that makes git bisect — the binary search over history that finds which commit introduced a bug — actually work.
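git bisect is worth seeing once to appreciate why one-thing-per-commit matters. A sketch in a throwaway repository: ten commits, a regression planted in the seventh, and git bisect run finding it automatically — the test command can be anything that exits 0 on a good commit and non-zero on a bad one.

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b main
git config user.email "dev@example.com"; git config user.name "Dev"

# Ten commits; the regression (the word "broken") lands in commit 7
for i in $(seq 1 10); do
  echo "$i" > version.txt
  if [ "$i" -ge 7 ]; then echo broken > app.txt; else echo ok > app.txt; fi
  git add . && git commit -q -m "Commit $i"
done

git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"  # bad tip, good root
git bisect run grep -qx ok app.txt     # exits 0 on good commits, 1 on bad ones
first_bad=$(git show -s --format=%s refs/bisect/bad)
git bisect reset >/dev/null            # leave bisect mode
echo "$first_bad"
```

Bisect checks out roughly log2(N) commits, so finding one bad commit among a thousand takes about ten test runs — but only if each commit builds and passes on its own, which is exactly what the one-thing-per-commit discipline buys.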
A conventional commit message has two parts separated by a blank line: a short subject line (ideally under fifty characters, imperative mood — "Fix pagination bug", not "Fixed pagination bug" or "Fixes pagination bug") and a longer body (wrapped to seventy-two characters) explaining why the change was made and anything a reviewer might wonder. The subject line becomes the one-line log entry; the body is the place for context that would not be obvious from the diff.
Fix pagination bug on empty result set

The previous implementation divided by zero when the filter
returned no rows, producing a 500. Guard with an early return
and an explicit empty-list response. Adds a test covering the
previously-crashing case.

Fixes #2104.
Conventional Commits is a lightweight convention — a subject line of the form type(scope): description (e.g., feat(auth): add 2FA, fix(api): handle missing header) — that allows tools to infer release notes, version bumps, and changelogs from the commit log. It pays off in automated-release workflows and is a small amount of discipline for a large amount of tooling.
The commit message is where the why lives. "Upgrade library X to 2.3" is a commit description; the message body should say why — "to get feature Y that we need for Z", or "to fix CVE-xxx that we were exposed to". Six months from now, the log is what tells someone looking at git blame why this line exists. Make that person's job easier.
Write commit messages for the engineer who will need to understand this change in a year — which is often you. "Fix bug" is not a message; "Fix off-by-one in pagination that silently dropped the last page" is. The extra twenty seconds pays back dozens of times across the life of the project.
Nearly every git mistake is recoverable, which is the feature that makes the tool worth learning in the first place. Knowing which "undo" command does what — and which are safe on shared branches versus local-only — is the difference between a thirty-second recovery and an hour of panic. The commands are few; the distinctions matter.
git restore path/to/file discards your uncommitted changes in the working directory and replaces the file with the version from the last commit. git restore --staged path/to/file unstages a file without touching its contents on disk. These are the two most common "I changed my mind" operations and they are safe: they never touch committed history.
git revert <sha> creates a new commit that undoes the changes introduced by the specified commit. The original commit stays in history; the revert is a forward-motion undo. This is the correct way to undo changes on a shared branch because it does not rewrite history — the reverted commit and the revert both exist, and everyone else's clone stays in sync.
git reset <sha> moves the current branch's pointer backward (or forward) to the specified commit. The three flavours are: --soft (move the pointer only; leave index and working directory untouched), --mixed (the default: move the pointer, reset the index to match, but keep working-directory changes), and --hard (reset everything, discarding your working-directory changes). --hard is powerful and dangerous: uncommitted working-directory changes it discards are gone for good, but the commits it moves the branch away from remain recoverable via the reflog for about ninety days.
# Undo the last commit but keep the changes staged
$ git reset --soft HEAD~1
# Undo the last commit; keep changes in working dir but unstaged
$ git reset HEAD~1
# Undo the last commit and DISCARD its changes (destructive locally)
$ git reset --hard HEAD~1
git reset rewrites the local history of a branch. If that branch has been pushed and others have pulled it, resetting locally and force-pushing erases commits that their clones still reference. This is the single most common way to lose shared work. The safe rule: use git reset only on branches you alone have seen; use git revert on anything shared.
git reflog is a local log of every HEAD change — every commit, every reset, every checkout — for the last ninety days by default. If you rebase or reset yourself into what looks like catastrophe, the reflog almost always has a line showing where HEAD was before the operation, and git reset --hard <reflog-sha> returns you there. This is the "undo button" for git itself.
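A recovery sketch, assuming the state you want is the previous reflog entry (`HEAD@{1}`); check the reflog output before resetting to anything:

```shell
# Every place HEAD has been, newest first
$ git reflog
# Jump back to where HEAD was one move ago, e.g. before a bad reset
$ git reset --hard HEAD@{1}
```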
On a shared branch, use revert. Locally, use reset. When in doubt, run git reflog first — it will tell you exactly which commits are still recoverable, and most "I just lost everything" moments turn out to be "I just need to check out a SHA from the reflog".
Git has powerful tools for reshaping history after the fact — squashing several commits into one, reordering them, fixing messages, dropping commits entirely. These are not daily operations, but they are the tools that let a messy local branch become a clean, reviewable series of commits. The cost of that power is that they rewrite SHAs, which is the one thing you must not do to shared history.
git commit --amend replaces the most recent commit with a new one that includes whatever is currently staged, plus (optionally) a new message. It is the right tool for "I forgot to add a file" or "I typed the wrong message" — as long as the commit has not yet been pushed. Once pushed and fetched by others, amending requires a force push and is no longer safe by default.
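Both uses in miniature, with a hypothetical forgotten file:

```shell
# Fold a forgotten file into the last commit, keeping its message
$ git add forgotten_file.py
$ git commit --amend --no-edit
# Or just replace the message of the last commit
$ git commit --amend -m "Corrected message"
```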
git rebase -i <sha> opens your editor with a list of every commit since the specified SHA, each prefixed with a keyword — pick, squash, reword, edit, drop. Changing the keywords and reordering the lines lets you reshape the branch's history: squash two commits into one, reword a message, drop a commit that turned out to be a mistake, reorder commits so related changes are adjacent. When you save and close the file, git replays the commits according to your instructions.
pick 7a8b3c1 Add feature X skeleton
squash 4c9e018 Fix typo in handler
squash 11d4e22 Rename variable for clarity
pick 9f201aa Wire up the tests
reword 2b5e331 Update docs
# (for reword, git reopens the editor to take the new message)
The typical use is: do your work on a feature branch across a dozen messy commits ("WIP", "fix tests", "try again"), then before opening the PR run git rebase -i main and reshape them into three or four clean commits that each tell a story. The PR reviewer sees the clean version; the messy experimentation history stays in your reflog for a few days and then disappears. This is the workflow rebasing was designed for.
Do not rewrite public history. If you have already pushed the branch and others have pulled it, rebasing and force-pushing will silently orphan their local work. There are escape valves — git push --force-with-lease refuses to overwrite changes it does not know about, for example — but the safest rule is: rebase before the first push, never after, unless you have coordinated with everyone else on the branch.
For deeper history rewrites — removing a file that never should have been committed, renaming an email across every commit, splitting a repository in two — git filter-repo (the modern replacement for the deprecated git filter-branch) is the tool. These operations rewrite every commit SHA in the repository, so they are equivalent to starting a new repository with a derived history; everyone else working on the project needs to re-clone. They are real, valuable tools for specific jobs, and not to be used casually.
Rewrite history to communicate more clearly, not to erase evidence. The goal is that a reviewer, or a future debugger, can follow the project's development as a series of coherent steps. Messy reality stays in the reflog; the public log tells a comprehensible story.
A branch is a moving pointer; a tag is a fixed one. When you want to name a specific commit — "this is the version we shipped on April 12" — a tag is the right tool. Tags, combined with semantic versioning and a lightweight release process, are how a project goes from a git history to an artefact that other people can depend on.
git tag v1.4.2 creates a lightweight tag pointing at the current commit. git tag -a v1.4.2 -m "Release 1.4.2" creates an annotated tag, which is a real object in the repository with its own author, date, and message. Annotated tags are the ones worth using for releases: they carry context, they are signable with GPG, and they show up in tools that distinguish the two. Tags are pushed to remotes explicitly: git push origin v1.4.2 or git push --tags.
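Putting those pieces together:

```shell
# Annotated tag: a real object with its own author, date, and message
$ git tag -a v1.4.2 -m "Release 1.4.2"
# Tags are not pushed with the branch; push them explicitly
$ git push origin v1.4.2
# List tags, showing messages for the annotated ones
$ git tag -n
```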
Semantic Versioning (SemVer) is the near-universal convention for library and application version numbers: MAJOR.MINOR.PATCH, where MAJOR bumps on breaking changes, MINOR on new features that do not break existing users, and PATCH on bug fixes that do not add functionality. The rule is simple and the discipline is harder than it looks: once a project is at 1.0.0, every breaking change must bump the major version, and every user relying on the promise expects the project to stick to it.
Pre-1.0 versions are by convention "unstable" and anything can change between them. Post-1.0, breaking a user's code in a minor version is a defect; it is why library authors sometimes sit on 0.x for years while they figure out the API they want to commit to.
A release is typically more than the tag itself: it is release notes (what changed, who contributed), artefacts (a Docker image, a wheel, a binary), and a changelog entry. GitHub Releases, GitLab Releases, and similar features wrap the tag with this metadata and provide a stable download URL. The convention that scales is: every tag is a release, every release has notes, every release has a changelog entry, and the release process is automated from the commit log.
The Keep a Changelog project (keepachangelog.com) defines a simple format for a human-readable CHANGELOG.md, with sections for Added, Changed, Deprecated, Removed, Fixed, and Security. Combined with Conventional Commits and tools like standard-version or semantic-release, changelog entries can be generated automatically from the commit log. The format is less important than the discipline; a project whose changelog is up to date is one whose users trust the version numbers.
For a library, invest the time to get semver right; your users' build systems depend on it. For an application, a calendar-based version scheme (2026.04.12) or a build number is often simpler and honest. In either case, a release that no one can describe in three bullet points is a release that should not ship yet.
Git was designed for source code — thousands of small text files with frequent, diffable edits. It is a poor fit for the other things ML projects accumulate: multi-gigabyte datasets, trained model checkpoints, high-resolution images. Two tools have grown up to solve this, and any practitioner working with non-trivial data needs to know both.
Because git stores snapshots of every version of every file, a five-gigabyte model checkpoint committed ten times becomes fifty gigabytes of repository. Clones become slow, pushes become painful, and hosting services start to refuse the project. The structural fix is not to store the big files in git at all — only pointers to them — with the actual bytes living in an artefact store.
Git Large File Storage (LFS) is the lightweight solution. Files matching a configured pattern (*.pt, *.h5, data/**/*.parquet) are replaced in git with a tiny pointer file, while the actual bytes go to an LFS-aware server. Clones pull down only the pointers by default and the full bytes on demand. LFS is native in GitHub, GitLab, and most hosting services; it is the right default for a handful of large binary artefacts in an otherwise normal repository.
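The setup is small; a sketch, with the checkpoint extension as a stand-in for whatever your project produces:

```shell
# One-time setup per machine
$ git lfs install
# Route matching files through LFS from now on
$ git lfs track "*.pt"
# The patterns live in .gitattributes; commit it like any other file
$ git add .gitattributes
$ git commit -m "Track model checkpoints with LFS"
```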
DVC (Data Version Control) goes further. It stores dataset and model artefacts in an external backend (S3, GCS, Azure, or a local cache) and records their content hashes in small .dvc files that live in git. DVC also models ML pipelines — dataset → preprocessing → training → evaluation — with commands like dvc repro that re-run only the stages whose inputs have changed. It is the closest thing the field has to "git for machine learning".
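A minimal sketch of the core loop, with a hypothetical dataset path (dvc add writes the pointer file and updates a local .gitignore so the bytes themselves stay out of git):

```shell
# Put the artefact under DVC control; git tracks only the small .dvc file
$ dvc add data/train.parquet
$ git add data/train.parquet.dvc data/.gitignore
$ git commit -m "Track training data with DVC"
# Upload the actual bytes to the configured remote (S3, GCS, ...)
$ dvc push
```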
The parallel for experiments is MLflow or Weights & Biases — neither is a version-control system in the git sense, but each keeps a queryable record of every training run (config, code SHA, metrics, artefacts). A well-run project pairs git (code) + DVC (data/pipelines) + an experiment tracker (runs and metrics), and every model can be traced back through all three.
A .gitignore file tells git which paths to leave alone. In an ML project, it should exclude: virtual environment directories, raw datasets, cached intermediate files, __pycache__ and compiled artefacts, editor swap files, logs, and — most importantly — secrets (.env, credentials). A single committed API key, once pushed, is effectively public forever; the repository's history has to be rewritten and the key rotated. Prevention beats remediation by a wide margin.
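A starting-point .gitignore along those lines (the specific paths are illustrative; adapt them to the project's layout):

```
# Environments and caches
.venv/
__pycache__/
*.pyc

# Data and artefacts (tracked by LFS/DVC instead)
data/raw/
*.ckpt
logs/

# Editor and OS noise
*.swp
.DS_Store

# Secrets — never commit these
.env
```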
Git handles code; LFS handles occasional large binaries; DVC handles dataset and pipeline lineage; experiment trackers handle runs and metrics. Each layer knows what it is good at. A project that tries to put multi-gigabyte artefacts into plain git will spend more time fighting the tool than doing the work.
Git itself is just a version-control engine; the ecosystem around it — issues, project boards, CI, release automation, code search — is what turns a repository into a collaboration platform. Fluency with one or two of these tools matters less than understanding what each layer does and how they compose.
Every major hosting platform (GitHub, GitLab, Bitbucket, Gitea) wraps a repository with an issue tracker — a simple queue of bugs, feature requests, and tasks, each with a numeric ID, a description, labels, and a conversation thread. Issues are the long-term to-do list of a project; PRs can reference them (Fixes #214) so that merging a PR auto-closes the issue. The discipline of filing good issues — reproducible bug reports, scoped feature descriptions — is a parallel craft to the discipline of writing good commit messages.
For anything larger than a side project, a simple kanban board — columns for To do, In progress, In review, Done — is the visible state of the team. GitHub Projects, GitLab Issue Boards, Linear, and Jira all implement some version of this; the tool matters less than the practice of actually updating it. The anti-pattern is a board that reflects what someone wishes were happening; the pattern is a board that reflects what is.
Continuous integration — GitHub Actions, GitLab CI, CircleCI, Buildkite — runs tests, linters, and type checkers on every PR, and reports a green or red status next to it. Branch protection rules (on GitHub: Settings → Branches) can require those checks to pass before a PR can merge, turning conventions into enforced gates. Together with required reviews, this is how "the test suite must pass before merge" goes from a team agreement to a mechanical fact.
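As a sketch, a minimal GitHub Actions workflow for a hypothetical Python project — the tool choices (ruff, pytest) and the dev extras are assumptions, not prescriptions:

```yaml
# .github/workflows/ci.yml — runs checks on every PR against main
name: ci
on:
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"   # assumes a dev extras group
      - run: ruff check .              # lint
      - run: pytest                    # tests
```

With a branch protection rule requiring the `test` job, this turns "tests must pass before merge" into a mechanical gate.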
pre-commit (pre-commit.com) is a framework for running formatters, linters, and custom checks at commit time, locally, before anything even reaches CI. A well-configured pre-commit config eliminates an entire class of code-review comments — "please run the formatter", "remove this debug print" — by making them impossible to commit. The cost is a thirty-second config file; the benefit is continuous.
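A minimal .pre-commit-config.yaml sketch; the hook repositories shown are common choices, and the rev values should be pinned to current release tags:

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-added-large-files   # catches accidental binary commits
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff                      # lint before anything reaches CI
```

Run `pre-commit install` once per clone and the hooks fire on every commit.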
The three major hosting platforms are more alike than different. GitHub has the largest open-source ecosystem and is the de facto public platform; GitLab has the most integrated DevOps offering (CI, container registry, deployment in one product); Bitbucket has the deepest integration with Atlassian's other products. For an internal team, any of the three is fine; for open-source, GitHub is where the audience already is.
Choose the minimum ceremony that keeps the team unblocked. A small team can ship with just PRs and CI; a larger team benefits from issues, boards, and release automation. The purpose of every tool in this list is to remove friction from "the change is done" → "the change is in production". Anything that adds friction without paying it back should be re-examined.
Version control discipline is almost invisible on a good day — the branch merged, the tests passed, the deploy went out. The payoff shows up over months, in the questions the team can actually answer and the mistakes they can recover from. For ML work specifically, that payoff shows up in a handful of predictable places.
Reproducibility. A model result is reproducible exactly when someone else can check out a commit, install a locked environment, run a command, and get the same numbers. Every term in that sentence — the commit, the environment lock, the command, the numbers — depends on version control discipline. A project that cannot cite a SHA for each claim is a project whose claims no one can verify.
Experiment lineage. When a model goes to production, the team will sooner or later ask: what exactly trained this? The answer is a SHA, a config file, and a dataset snapshot — all three under version control, all three linked. Teams that record this proactively can answer the question in seconds; teams that do not will spend a week reconstructing it from notebooks and Slack messages.
Bisecting regressions. git bisect — binary search over commit history — is the cheapest way to find when a bug was introduced, if the commits are atomic enough to bisect cleanly. A codebase of thousand-line commits cannot be bisected usefully; a codebase of focused commits can have a regression localised to a single commit in minutes. Commit hygiene from section 11 is not aesthetic; it is a debugging superpower.
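The loop can even be fully automated when a test distinguishes good from bad; a sketch, with the tag and test path as hypothetical placeholders:

```shell
# Mark the endpoints: current commit is broken, v1.4.0 was fine
$ git bisect start HEAD v1.4.0
# Let git drive the binary search with an automated check
$ git bisect run pytest tests/test_pagination.py
# git prints the first bad commit; then restore the branch
$ git bisect reset
```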
Safe refactors. The refactors that actually make a codebase better — renaming a misleading concept, splitting a module that grew too large, replacing a library that accumulated debt — happen only in projects where engineers are confident they can revert if something breaks. Version control is the scaffold that makes those refactors affordable; the engineers who know the git side are the engineers who make the cleanups.
Scaling the team. Two engineers can share a codebase with almost no process. Ten engineers need branches, reviews, CI, and a branching strategy. A hundred engineers need all of that plus release automation, issue hygiene, and a carefully considered trunk-based discipline. The investments in git-adjacent tooling that feel heavy at four engineers pay their own salary back at twenty, and at a hundred the absence of them is the main bottleneck.
The habits in this chapter compound. A single well-written commit is worth almost nothing; a thousand of them are the reason a codebase is legible three years later. The same is true for reviews, for branch hygiene, for releases. The point of version control is not to make any one change easy — it is to make the next hundred changes possible.
This is the last chapter of Part II: Programming & Software Engineering. The six chapters together — Python fluency, scientific computing, algorithms and data structures, software-engineering principles, databases and SQL, and version control — are the engineering foundation that every technique in the rest of the compendium rests on. The next part turns from engineering in the small to engineering at scale: data pipelines, distributed systems, and the infrastructure that moves ML from the laptop to the cluster.
Git has a near-infinite amount written about it, of sharply uneven quality. The list below picks the canonical references, a few classics on collaborative software development, the ML-specific tooling docs, and the pages that are worth bookmarking for the day something unexpected happens.
The official Git documentation (git-scm.com/docs) — the reference manual; the gitglossary and gittutorial pages are short and unusually good for a reference manual.
Semantic Versioning (semver.org) — the specification behind MAJOR.MINOR.PATCH version numbers. Short — one page — and precise. If you publish a library, you are committing to this or to something explicitly not this; either way, read the specification once.
Keep a Changelog (keepachangelog.com) — the format for a human-readable CHANGELOG.md. Two pages. Adopting it — even loosely — makes upgrades across library versions dramatically less painful for your users.
Conventional Commits (conventionalcommits.org) — the commit-message convention (feat:, fix:, BREAKING CHANGE:) that lets tools automate release notes and version bumps. Worth adopting as soon as your release cadence gets serious enough that changelogs are being hand-edited.
The Git LFS documentation — .gitattributes syntax, migration of existing large files, limitations and pitfalls (there are a few with force-pushing and squashing). Essential if your repository has any binary artefacts larger than a few megabytes.
The pre-commit documentation (pre-commit.com) — configuration via .pre-commit-config.yaml. The simplest single investment that raises the floor on a team's commit hygiene.