2010-07-30

Git Internals: An Executive Summary in 30 Lines of Perl, for smart newbies.

Update: Modified the code a bit to handle the 'pack' specials. They're not so straightforward; I'll blog more on that later.

This blog post is not intended as a replacement for a real in-depth understanding of Git's command line interface. It aims instead to expose as much as possible of how Git works internally, because its internal logic is astoundingly simple, and anyone with a good background in graph theory and databases will quickly see the elegance in it. For more details, check out the excellent book Pro Git, especially the internals chapter.

The code

Git's core essentials are almost nothing more than a bunch of deflated (zlib) text files. I'm going to assume you've got enough intelligence to RTFM and get a copy of something gitty and text-based checked out. Perl modules are good examples of this. I'm using my Dist::Zilla::PluginBundle::KENTNL::Lite tree.

git clone git://github.com/kentfredric/Dist-Zilla-PluginBundle-KENTNL-Lite.git /tmp/SomeDirName

I'm going to show you the core of git's system, which is just the "object" store.

cd /tmp/SomeDirName/.git
find objects/

Woot, there are all your files and stuff in git. How does it work? That's where the Perl script comes in.

#!/usr/bin/perl
use strict;
use warnings;

use Compress::Zlib;
use Carp qw( croak );

sub inflate_file {
    my ( $filename , $OFH ) = @_;
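    # Check the handle separately: "my (...) = f() or croak" never croaks,
    # because a two-element list assignment is always true.
    my ( $inflator, $status ) = Compress::Zlib::inflateInit();
    croak("Cannot create inflator: $status") unless defined $inflator;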
    my $input = '';
    open my $fh, '<', $filename or croak("Can't open $filename: $!");
    binmode $fh;
    binmode $OFH;

    my ( $output );
    while ( read( $fh, $input, 4096 )) {
        ( $output , $status ) = $inflator->inflate( \$input );
        print { $OFH } $output if $status == Compress::Zlib::Z_OK or $status == Compress::Zlib::Z_STREAM_END;
        last if $status != Compress::Zlib::Z_OK;
    }
    croak( "Inflation of $filename failed: $status" ) unless $status == Compress::Zlib::Z_STREAM_END;
}

for ( @ARGV ) {
    next if $_ =~ /\.(idx|pack)|packs/;
    print qq{<--------BEGIN $_ --------->\n};
    inflate_file( $_ , *STDOUT );
    print qq{<--------END $_ --------->\n};

}

Pretend you cargo-cult dump that code to /tmp/deflate.pl.

Now check this out:

perl /tmp/deflate.pl $( find objects/ -type f ) | cat -v | less

Awesome, you're now seeing the guts of how your repository works. For real. All we did was inflate each and every loose object. You'll see 3 types of object: trees, blobs, and commits (with trees being the most complicated of all). Each object says at the front what type it is, followed by its size, before the ^@ (a NUL byte, as rendered by cat -v). If the repository has annotated tags, those show up as a fourth type, tag.
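To make that header concrete, here is a minimal sketch that splits an inflated object into its type, declared size, and body. The split_object helper and the in-memory sample object are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Compress::Zlib qw( compress uncompress );

# Split an inflated object into ( type, declared size, body ).
# The header is "<type> <size>\0"; everything after the first NUL is the body.
sub split_object {
    my ($raw) = @_;
    my ( $header, $body ) = split /\0/, $raw, 2;
    my ( $type, $size ) = split / /, $header;
    return ( $type, $size, $body );
}

# Hypothetical in-memory object, deflated the way a loose blob is on disk.
my $stored = compress("blob 6\0hello\n");

my ( $type, $size, $body ) = split_object( uncompress($stored) );
printf "%s, %d bytes declared, %d bytes read\n", $type, $size, length $body;
```

Run as-is it prints "blob, 6 bytes declared, 6 bytes read". Feeding it the inflated contents of a real file under objects/ should work the same way.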

Blobs are just a file's contents.
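The filenames you saw under objects/ come from the same data: an object's name is the SHA-1 of its header plus its body. A rough sketch follows, with blob_id and loose_path as made-up helper names, assuming the content is a plain byte string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Digest::SHA qw( sha1_hex );

# An object's name is the SHA-1 of "<type> <size>\0" followed by the body.
# Assumes $content holds bytes, not decoded characters.
sub blob_id {
    my ($content) = @_;
    return sha1_hex( 'blob ' . length($content) . "\0" . $content );
}

# Loose objects are filed under objects/<first 2 hex digits>/<remaining 38>.
sub loose_path {
    my ($id) = @_;
    return sprintf 'objects/%s/%s', substr( $id, 0, 2 ), substr( $id, 2 );
}

my $id = blob_id("hello\n");
print "$id\n", loose_path($id), "\n";
```

The ID printed here should match what git hash-object reports for a file containing the same bytes.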

Commits are just a blob of text: the commit message, author and committer names, timestamps, etc, plus text references (pretend they're like an a-href in a web page or something) to the preceding (parent) commits and to the commit's tree.
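Since a commit is plain text, pulling those references out takes only a few lines. This is a sketch, not a full parser: parse_commit is a made-up name, the sample commit is invented, and multi-line headers (such as the signature on a signed commit) are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a commit body (the text after the "commit <size>\0" header).
# Header lines are "<key> <value>"; a blank line separates them from the message.
sub parse_commit {
    my ($body) = @_;
    my ( $head, $message ) = split /\n\n/, $body, 2;
    my %commit = ( parent => [], message => $message );
    for my $line ( split /\n/, $head ) {
        my ( $key, $value ) = split / /, $line, 2;
        # A commit may have zero, one, or several parents, so collect them.
        if   ( $key eq 'parent' ) { push @{ $commit{parent} }, $value }
        else                      { $commit{$key} = $value }
    }
    return \%commit;
}

# Invented sample data; the IDs and the author are placeholders.
my $sample = join "\n",
    'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904',
    'parent 1111111111111111111111111111111111111111',
    'author A U Thor <author@example.com> 1280448000 +1200',
    'committer A U Thor <author@example.com> 1280448000 +1200',
    '',
    "Example commit message\n";

my $commit = parse_commit($sample);
printf "tree %s, %d parent(s)\n", $commit->{tree}, scalar @{ $commit->{parent} };
```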

Trees are probably the hardest to work out just by looking at them. A tree is more or less just another list of references, except these references point at either blobs or other trees, and each one is stored as 20 raw bytes rather than as hex text, which is why trees look like garbage in less. So, you can pretend a "tree" is like a "dir" in some ways. There's data besides this, like file/dir names and permissions, but that's the gist of it.
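A tree body is a run of entries, each laid out as an octal mode, a space, the name, a NUL byte, and then 20 raw SHA-1 bytes. Here is a minimal sketch of decoding that layout; parse_tree is a made-up name and the sample entries are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Carp qw( croak );

# Decode a tree body (the bytes after the "tree <size>\0" header).
# Each entry is "<octal mode> <name>\0" followed by 20 raw SHA-1 bytes.
sub parse_tree {
    my ($body) = @_;
    my @entries;
    my $pos = 0;
    while ( $pos < length $body ) {
        my $nul = index $body, "\0", $pos;
        croak('Malformed tree entry') if $nul < 0 or $nul + 21 > length $body;
        my ( $mode, $name ) = split / /, substr( $body, $pos, $nul - $pos ), 2;
        # Turn the 20 binary bytes into the usual 40 hex digits.
        my $id = unpack 'H40', substr( $body, $nul + 1, 20 );
        push @entries, { mode => $mode, name => $name, id => $id };
        $pos = $nul + 21;
    }
    return @entries;
}

# Invented sample: one file and one directory, with placeholder IDs.
my $sample = "100644 README\0" . pack( 'H40', 'ce013625030ba8dba906f756967f9e9ca394464a' )
           . "40000 lib\0"     . pack( 'H40', '4b825dc642cb6eb9a060e54bf8d69288fbee4904' );

printf "%-6s %s %s\n", @{$_}{qw( mode id name )} for parse_tree($sample);
```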

This has been your executive summary =)

3 comments:

  1. Unfortunately it is this easy only for the loose format. When you fetch from or clone other repositories, you get objects in packed format. It saves disk space thanks to deltification, and it improves performance because the IO pattern of one open, mmapped file is better than that of many small files.

    IIRC "Pro Git" (and other documentation) describes packed format quite well.

  2. @Jakub Narebski: Right you are. I should delve into decoding that stuff as well =)

  3. The trouble with tree objects is that, for some strange reason, the SHA-1s (and only the SHA-1s) of the items they consist of are in binary format instead of text.
