Why Git is so complicated

When you learn to program, people will often recommend learning Git. In theory, it sounds easy: a program to track changes to your code that helps you restore previous versions of the code when you need them. Besides that, it’s a tool that is used almost everywhere in the industry. In practice… Git can be confusing to say the least.

Why does it have to be this way?

Command-line interface—CLI

Most everyday users of computers are not used to the luxury of text-based interfaces. I call it luxury, because as I’ve written previously, there are some big advantages to it—although it can take some time to get used to the user experience. So if your associations match the ones on the list:

  • cat—a fluffy animal,
  • cd—how people used to listen to music, and
  • grep—some onomatopoeia,

then chances are your first experience with Git will have the additional difficulty of learning command line basics. Something that Windows 95 and its ilk took away from you.

Design goals of Git

Git was never meant to be simple. It was made with different goals in mind: more important and more challenging ones as well.

Securely storing the saved data

Git is meant to always give you back the data exactly as it was saved OR let you know the repository is corrupted. And it does an amazing job with it—you can be sure that the state that you get when checking out a commit is exactly as its author meant it to be. Of course, this assumes they knew what they were doing.

How does Git achieve such good data security? The answer lies in the clever way it stores everything: commits, folders, and files. Each object is identified by a cryptographic hash of its content, a number generated from that content which changes if anything inside the object changes. Because the references between objects are covered by those hashes too, there is almost no way of tampering with a repository without Git noticing that something is off. And even in the rare cases where an attacker manages it, the meaningful code would have to be replaced with long text gibberish.
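To make the idea of content-based IDs concrete, here is a minimal sketch of how Git derives the ID of a single file (a "blob"). It assumes a repository that uses the default SHA-1 object format; the function name git_blob_id is made up for this illustration and is not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID that Git gives to a file (blob) with this content."""
    # Git hashes a short header ("blob <size in bytes>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = git_blob_id(b"hello\n")
changed = git_blob_id(b"hello!\n")  # one extra character

print(original)
print(changed)
print("same ID:", original == changed)  # a one-character edit gives a different ID
```

If you have Git installed, running git hash-object on a file should print the same ID that this function computes for the file's bytes.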

Work across unsecured networks and with non-trusted participants

Because everything is identified by cryptographic hashes, Git doesn't have to trust much of the network to keep the data intact. This makes Git resistant to man-in-the-middle tampering: any change made to an object in transit changes its hash, so code injected between two Git nodes no longer matches the commit IDs the receiver expects. Of course, if someone can read commits that they were not meant to, this is also a problem, but one with less dangerous consequences than attackers injecting code into the codebase.
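A rough sketch of that check, under simplifying assumptions: object_id is a stand-in for Git's real object hashing (which also adds a type and size header), and accept_object is a hypothetical helper, not a Git API. The point is only that the receiver recomputes the hash instead of trusting the sender.

```python
import hashlib


def object_id(payload: bytes) -> str:
    """Simplified stand-in for Git's object hashing."""
    return hashlib.sha1(payload).hexdigest()


def accept_object(expected_id: str, payload: bytes) -> bool:
    """Accept data from an untrusted peer only if it hashes to the ID we expected."""
    return object_id(payload) == expected_id


# The ID we learned from a source we trust (for example, a reviewed commit).
trusted_id = object_id(b"print('hello')\n")

# What arrives over the network: once intact, once modified in transit.
intact = b"print('hello')\n"
tampered = b"print('pwned')\n"

print(accept_object(trusted_id, intact))
print(accept_object(trusted_id, tampered))
```

Note that this only helps when the expected ID comes from somewhere you trust. The hash proves the data matches the ID, not that the ID itself is the right one.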

For an extra layer of security, Git offers commit signing: an additional means of ensuring the commit was authored by the person that is set to be its author.

Maintaining backward compatibility

The Git interface is more complicated than it has to be because it respects the old habits of its users. I learned Git in 2011, and until last week I hadn't noticed that there is a newer git switch command recommended for changing branches. The way I’m used to doing it (git checkout + various flags) still works fine. Git prioritizes older users and their habits over simplicity for newer users, which takes us to the next point.

User experience for super-users

Git is a tool made with super-users in mind. It was written by Linus Torvalds to manage the Linux code base, which is not a beginner’s task. Git’s primary users develop operating systems, have decades of experience in programming, and live inside the command line. All other uses are incidental and not the main concern for the people developing Git.

Simplicity trade-offs

When you design systems, nothing is for free—simplicity for the users included.

Hiding the complexity

When you are dealing with a problem that is inherently complex, you make solutions easy by abstracting away complexity. Is it gone? No, it’s just hidden from the user. For example, in the case of an internet connection, I, like most people, understand connection quality only as the number of bars at 📶:

  • fewer is bad
  • more is good

This is enough for using the internet, but it makes it difficult to understand and troubleshoot any connection issues.

Git is focused on exposing all the complexity that comes with changing files in parallel and syncing them asynchronously. At the opposite end of the spectrum is direct access to a shared disk. It’s easy to use, but what happens when two people try to edit the same file at the same time? One will be lucky and keep their changes, and the other will lose theirs. Not a nice workflow, especially if the lost changes are worth hours of expensive labor. Because Git exposes all the problems that can appear in this situation, it makes it possible to resolve conflicts in files in an intelligent way, which in some cases requires users to do it manually.
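The difference between the two approaches can be sketched in a few lines. This is a deliberately simplified model that compares whole files; real Git merges work on smaller regions within a file, and both function names here are invented for the illustration.

```python
from typing import Optional


def shared_disk_save(base: str, ours: str, theirs: str) -> str:
    """Shared disk: whoever saves last wins, and the other edit is silently lost."""
    return theirs


def three_way_merge(base: str, ours: str, theirs: str) -> Optional[str]:
    """Whole-file three-way merge; returns None when a human has to decide."""
    if ours == theirs:
        return ours      # both sides made the same change
    if ours == base:
        return theirs    # only they changed the file
    if theirs == base:
        return ours      # only we changed the file
    return None          # both changed it differently: a conflict


base = "price = 10\n"
ours = "price = 12\n"
theirs = "price = 15\n"

print(repr(shared_disk_save(base, ours, theirs)))  # our edit disappears without warning
print(repr(three_way_merge(base, ours, theirs)))   # the conflict is reported instead
print(repr(three_way_merge(base, ours, base)))     # only one side changed: merged cleanly
```

Knowing the common ancestor (base) is what allows the second function to tell "only one side changed" apart from "both sides changed", and that is the extra information Git keeps and exposes.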

Easy vs. secure

Another part of Git that confuses users is commit IDs: very long numbers that are impossible to memorize. Even in the user-friendly, abbreviated form, they look like 0828ae1, and the full ID is 0828ae10b20500fbc777cc40aa88feefd123ea5e. Could we have just numbers in order instead, or only use branch names? The problem is that commit IDs are hashes that guarantee data integrity: they guarantee that commit X on my machine is the same as commit X on the remote or on your machine. And mismatches between them can appear for different reasons:

  • intentional—rebase, amends, squash, or any other operations that change the history
  • unintentional—a mistake or an attack on the codebase
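To see why any rewrite of history shows up as a mismatch, consider a toy model of commit IDs. It is a sketch, not Git's actual commit format: the commit_id function and the snapshot strings are assumptions made for the example. What it shares with Git is that each ID is a hash that covers the parent's ID.

```python
import hashlib
from typing import Optional


def commit_id(parent: Optional[str], snapshot: str, message: str) -> str:
    """Toy commit ID: a hash over the parent ID, the file snapshot, and the message."""
    payload = f"parent {parent}\nsnapshot {snapshot}\n\n{message}\n"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


first = commit_id(None, "files v1", "Initial commit")
second = commit_id(first, "files v2", "Add feature")

# Amending only the message of the first commit gives it a new ID...
amended_first = commit_id(None, "files v1", "Initial commit (typo fixed)")
# ...and the commit built on top of it gets a new ID too, with identical files.
rebuilt_second = commit_id(amended_first, "files v2", "Add feature")

print(first[:7], "->", amended_first[:7])
print(second[:7], "->", rebuilt_second[:7])
```

With sequential numbers, both histories would simply be "commit 1, commit 2", and nothing would reveal that they differ.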

Simplifying the interface and hiding the commit ID from the user would take away the user's ability to understand what is happening in the codebase they are running on their machine. And because code is by definition run on the machine, any security exploit would be extremely dangerous.

Is this the right approach?

When I was learning Git, it was still a novelty: I was learning it in 2011, and Git was created in 2005 (the first commit is from Thu Apr 7 15:13:13 2005 -0700). At that time I was using SVN at work, and Mercurial was often considered a more user-friendly alternative to Git. Have you heard those names recently? Git has come to dominate the version control market almost completely. It gained a lot of popularity with the rise of GitHub, but even if it’s rough for beginners, it’s a very efficient tool, and it’s here to stay.

What to do as a beginner programmer?

My advice is to learn it sooner rather than later. It's highly unlikely that Git will become simpler soon. Or ever. Version control systems in general save a lot of headaches while you program. I cannot imagine working efficiently with code if you struggle to move between versions of your codebase. Learning Git well will help you focus on code challenges instead of struggling with lost files or fixing so-called “Git issues”, which are almost always just people messing up their repository themselves.

Besides that, learning Git has become a rite of passage for new developers. As a developer, you have to use it, and Git will make your life miserable if you try to use it without understanding it well. The smart choice is to learn it on your terms, and not wait until reviewer feedback or code issues force you to learn it while handling other responsibilities at the same time.
