git-rs

Git-RS Internals Documentation 🧠

This document provides a deep dive into Git's internal mechanisms as implemented in git-rs.

📂 Repository Structure

Git-rs supports two directory structure modes for different use cases:

Educational Mode (Default): .git-rs/

Safe for learning - uses .git-rs/ to avoid conflicts with real Git repositories:

.git-rs/
├── objects/              # Object database (content-addressed storage)
│   ├── 5a/
│   │   └── 1b2c3d4e...  # Blob object (file content)
│   ├── ab/
│   │   └── cd1234ef...  # Tree object (directory listing)
│   ├── fe/
│   │   └── dcba9876...  # Commit object (snapshot + metadata)
│   ├── info/            # Object database metadata
│   └── pack/            # Packed objects (future feature)
├── refs/                # Reference storage
│   ├── heads/          # Branch references
│   │   ├── main        # Contains: commit hash
│   │   └── feature-x   # Contains: commit hash
│   └── tags/           # Tag references
│       └── v1.0        # Contains: commit hash
├── HEAD                 # Current branch pointer
├── git-rs-index        # Staging area (JSON format)
├── config              # Repository configuration
└── description         # Repository description

Git Compatibility Mode: .git/

Activated with --git-compat flag - uses standard Git structure for interoperability:

.git/
├── objects/              # Same object database structure
│   ├── 5a/
│   │   └── 1b2c3d4e...  # Identical object format
│   └── ...              # Same as educational mode
├── refs/                # Same reference structure
│   ├── heads/
│   └── tags/
├── HEAD                 # Same HEAD format
├── index               # Standard Git index name
├── config              # Same configuration format
└── description         # Same description format

Mode Selection

| Command | Directory Created | Index File | Use Case |
|---|---|---|---|
| git-rs init | .git-rs/ | git-rs-index | Safe learning |
| git-rs --git-compat init | .git/ | index | Git compatibility testing |
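
The mode selection above could be modelled roughly as follows. This is a minimal sketch, not git-rs's actual code: the `RepoMode` enum and its methods are hypothetical names, and only the directory and index file names come from the table.

```rust
use std::path::{Path, PathBuf};

/// Which on-disk layout to use (illustrative type, not taken from git-rs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RepoMode {
    Educational,
    GitCompat,
}

impl RepoMode {
    /// Map the presence of `--git-compat` to a mode.
    fn from_flag(git_compat: bool) -> Self {
        if git_compat {
            RepoMode::GitCompat
        } else {
            RepoMode::Educational
        }
    }

    /// Name of the repository directory created by `init`.
    fn dir_name(self) -> &'static str {
        match self {
            RepoMode::Educational => ".git-rs",
            RepoMode::GitCompat => ".git",
        }
    }

    /// Name of the staging-area file inside the repository directory.
    fn index_name(self) -> &'static str {
        match self {
            RepoMode::Educational => "git-rs-index",
            RepoMode::GitCompat => "index",
        }
    }

    /// Full path of the index file for a given working tree root.
    fn index_path(self, worktree: &Path) -> PathBuf {
        worktree.join(self.dir_name()).join(self.index_name())
    }
}
```

Keeping both names behind one enum means the rest of the code never hard-codes `.git-rs` or `.git`.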

🎯 Object Model

Git stores everything as objects in a content-addressed database:

Blob Objects (File Content)

Format: "blob <size>\0<content>"
Example: "blob 11\0Hello World"
SHA-1: 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
Storage: .git-rs/objects/5e/1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
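
To make the header format concrete, here is a small sketch that builds and parses the `<type> <size>\0<content>` layout. The function names `serialize_object` and `parse_object` are hypothetical and chosen for illustration only.

```rust
/// Prepend the `<type> <size>\0` header to an object's content.
fn serialize_object(kind: &str, content: &[u8]) -> Vec<u8> {
    let mut raw = format!("{} {}\0", kind, content.len()).into_bytes();
    raw.extend_from_slice(content);
    raw
}

/// Split a serialized object back into (type, content).
/// Returns `None` if the header is malformed or the size does not match.
fn parse_object(raw: &[u8]) -> Option<(&str, &[u8])> {
    // The header ends at the first NUL byte.
    let nul = raw.iter().position(|&b| b == 0)?;
    let header = std::str::from_utf8(&raw[..nul]).ok()?;
    let (kind, size) = header.split_once(' ')?;
    let content = &raw[nul + 1..];
    // The declared size must equal the actual content length.
    if size.parse::<usize>().ok()? != content.len() {
        return None;
    }
    Some((kind, content))
}
```

For `"Hello World"` this produces the 19 bytes `blob 11\0Hello World`, which is the input to the hash function rather than the file content alone.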

Tree Objects (Directory Listings)

Format: "tree <size>\0<entries>"
Entry: "<mode> <name>\0<20-byte-sha>"
Example: "tree 37\0100644 hello.txt\0[20-byte-hash]"
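
Reading a tree body back is the mirror image of the entry format above. The following is a sketch under the assumption that entries are laid out exactly as `<mode> <name>\0<20-byte-sha>`; `TreeEntry` and `parse_tree_entries` are hypothetical names.

```rust
/// One decoded tree entry (illustrative type).
#[derive(Debug, PartialEq, Eq)]
struct TreeEntry {
    mode: String,
    name: String,
    hash: [u8; 20],
}

/// Decode the entries that follow the `tree <size>\0` header.
/// Returns `None` on truncated or malformed input.
fn parse_tree_entries(mut body: &[u8]) -> Option<Vec<TreeEntry>> {
    let mut entries = Vec::new();
    while !body.is_empty() {
        // The first space ends the mode; the first NUL ends the name.
        let space = body.iter().position(|&b| b == b' ')?;
        let nul = body.iter().position(|&b| b == 0)?;
        if nul < space || body.len() < nul + 1 + 20 {
            return None;
        }
        let mode = std::str::from_utf8(&body[..space]).ok()?.to_string();
        let name = std::str::from_utf8(&body[space + 1..nul]).ok()?.to_string();
        // The hash is stored as 20 raw bytes, not as 40 hex characters.
        let mut hash = [0u8; 20];
        hash.copy_from_slice(&body[nul + 1..nul + 21]);
        entries.push(TreeEntry { mode, name, hash });
        body = &body[nul + 21..];
    }
    Some(entries)
}
```

The single-entry example works out to 6 + 1 + 9 + 1 + 20 = 37 bytes, which matches the `tree 37` header.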

Commit Objects (Snapshots)

Format: "commit <size>\0<content>"
Content:
tree <tree-hash>
parent <parent-hash>  # (optional, for non-initial commits)
author <name> <email> <timestamp> <timezone>
committer <name> <email> <timestamp> <timezone>

<commit message>
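
As an illustration of the commit layout, the sketch below assembles the content and adds the header. It assumes Git's convention of writing the email inside literal angle brackets; `Signature`, `commit_content`, and `serialize_commit` are hypothetical names, not git-rs API.

```rust
/// Author or committer identity plus timestamp (illustrative type).
struct Signature<'a> {
    name: &'a str,
    email: &'a str,
    timestamp: i64,    // seconds since the Unix epoch
    timezone: &'a str, // e.g. "+0000"
}

/// Format one `author` or `committer` line, including the trailing newline.
fn signature_line(role: &str, sig: &Signature) -> String {
    format!(
        "{} {} <{}> {} {}\n",
        role, sig.name, sig.email, sig.timestamp, sig.timezone
    )
}

/// Build the commit content; `parent` is `None` for the initial commit.
fn commit_content(
    tree: &str,
    parent: Option<&str>,
    author: &Signature,
    committer: &Signature,
    message: &str,
) -> String {
    let mut out = format!("tree {}\n", tree);
    if let Some(p) = parent {
        out.push_str(&format!("parent {}\n", p));
    }
    out.push_str(&signature_line("author", author));
    out.push_str(&signature_line("committer", committer));
    // A blank line separates the headers from the message.
    out.push('\n');
    out.push_str(message);
    out.push('\n');
    out
}

/// Prepend the `commit <size>\0` header; the size is the content's byte length.
fn serialize_commit(content: &str) -> Vec<u8> {
    let mut raw = format!("commit {}\0", content.len()).into_bytes();
    raw.extend_from_slice(content.as_bytes());
    raw
}
```

Because the parent hash is part of the hashed content, changing any ancestor changes the hash of every descendant commit.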

🔄 Three Trees Concept

Git manages content through three main areas:

1. Working Directory

2. Staging Area (Index)

3. Repository (HEAD)

State Transitions

Working Directory ──add──▶ Staging Area ──commit──▶ Repository
      ▲                                                 │
      └─────────────── checkout ────────────────────────┘

📋 Index Format

Our implementation uses JSON for educational clarity:

{
  "entries": {
    "README.md": {
      "hash": "5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689",
      "mode": "100644",
      "size": 11,
      "ctime": 1692000000,
      "mtime": 1692000000
    },
    "src/main.rs": {
      "hash": "a1b2c3d4e5f6789012345678901234567890abcd",
      "mode": "100644", 
      "size": 245,
      "ctime": 1692000100,
      "mtime": 1692000100
    }
  },
  "version": 1
}
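
In memory, the JSON layout shown above might map onto types like the following. This is only a sketch: the struct names are hypothetical, no JSON library is used, and the `looks_unchanged` shortcut (comparing size and mtime before rehashing, as Git's stat cache does) is an assumption about how the stored metadata could be used, not a statement about git-rs.

```rust
use std::collections::BTreeMap;

/// One staged file, mirroring the fields of a JSON index entry.
#[derive(Debug, Clone, PartialEq)]
struct IndexEntry {
    hash: String,
    mode: String,
    size: u64,
    ctime: u64,
    mtime: u64,
}

/// The whole staging area: path -> entry, plus a format version.
/// A BTreeMap keeps paths sorted, which gives stable output order.
#[derive(Debug, Default)]
struct Index {
    entries: BTreeMap<String, IndexEntry>,
    version: u32,
}

impl Index {
    /// Cheap pre-check: if size and mtime match the staged entry, the file
    /// is probably unmodified and rehashing can be skipped.
    fn looks_unchanged(&self, path: &str, size: u64, mtime: u64) -> bool {
        match self.entries.get(path) {
            Some(entry) => entry.size == size && entry.mtime == mtime,
            None => false,
        }
    }
}
```

A path that is missing from the index never "looks unchanged", so untracked files always fall through to the full comparison.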

File Modes:

🔗 Reference System

References are human-readable names pointing to objects:
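
Resolving HEAD can be sketched as below. This assumes git-rs follows Git's convention, where HEAD holds either `ref: refs/heads/<branch>` or a bare 40-character commit hash; the `Head` enum and `parse_head` function are hypothetical names.

```rust
/// What HEAD points at (illustrative type).
#[derive(Debug, PartialEq, Eq)]
enum Head {
    /// Symbolic reference, e.g. `refs/heads/main`.
    Branch(String),
    /// A raw 40-character commit hash (detached HEAD).
    Detached(String),
}

/// Parse the contents of the HEAD file.
fn parse_head(contents: &str) -> Option<Head> {
    let line = contents.trim();
    if let Some(target) = line.strip_prefix("ref: ") {
        return Some(Head::Branch(target.to_string()));
    }
    // Otherwise accept only a full hex SHA-1.
    let is_hash = line.len() == 40 && line.chars().all(|c| c.is_ascii_hexdigit());
    if is_hash {
        Some(Head::Detached(line.to_string()))
    } else {
        None
    }
}
```

For a `Branch` result, the commit hash is then read from the file at that path inside the repository directory, for example `refs/heads/main`.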

Branches

Tags

🧮 Hash Calculation

Git uses SHA-1 for content addressing:

Blob Hash Calculation

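use sha1::{Digest, Sha1};

// Assumes the RustCrypto `sha1` crate and the `hex` crate.
fn calculate_blob_hash(content: &[u8]) -> String {
    // Git hashes the "blob <size>\0" header together with the content.
    let header = format!("blob {}\0", content.len());
    let full_content = [header.as_bytes(), content].concat();
    // Sha1::digest returns the raw 20-byte digest; hex-encode it to 40 characters.
    hex::encode(Sha1::digest(&full_content))
}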

Tree Hash Calculation

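use sha1::{Digest, Sha1};

// Entries are (mode, name, hex hash). Git expects them sorted by name,
// because the entry order is part of the hashed bytes.
// Assumes the RustCrypto `sha1` crate and the `hex` crate.
fn calculate_tree_hash(entries: &[(String, String, String)]) -> String {
    let mut content = Vec::new();
    for (mode, name, hash) in entries {
        content.extend_from_slice(mode.as_bytes());
        content.push(b' ');
        content.extend_from_slice(name.as_bytes());
        content.push(b'\0');
        // Tree entries store the raw 20-byte hash, not its 40-character hex form.
        content.extend_from_slice(&hex::decode(hash).expect("hash must be valid hex"));
    }
    let header = format!("tree {}\0", content.len());
    let full_content = [header.as_bytes(), content.as_slice()].concat();
    hex::encode(Sha1::digest(&full_content))
}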

📊 Status Determination Algorithm

How git-rs determines file status:

1. Scan working directory → get current file hashes
2. Load staging area → get staged file hashes
3. Load HEAD commit → get committed file hashes
4. Compare:
   - staged_hash != committed_hash → "Changes to be committed"
   - working_hash != staged_hash → "Changes not staged for commit"
   - working_exists && !staged_exists → "Untracked files"
   - !working_exists && staged_exists → "deleted"

Status Matrix

| Working | Staged | HEAD | Status |
|---|---|---|---|
| A | A | A | Clean |
| A | A | - | New file (staged) |
| A | A | B | Modified (staged) |
| A | B | B | Modified (unstaged) |
| A | - | - | Untracked |
| A | - | B | Deleted (staged) |
| - | A | A | Deleted (unstaged) |
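
One way to express the matrix is as two pairwise comparisons per path: HEAD against the index (staged changes) and the index against the working directory (unstaged changes). The sketch below uses hypothetical names (`Change`, `diff`, `status`) and takes each side as an optional hash, with `None` meaning the file is absent.

```rust
/// Result of comparing an "old" and a "new" side of one path.
#[derive(Debug, PartialEq, Eq)]
enum Change {
    Unchanged,
    Added,
    Modified,
    Deleted,
}

/// Compare two optional hashes; `None` means the file does not exist on that side.
fn diff(old: Option<&str>, new: Option<&str>) -> Change {
    match (old, new) {
        (None, None) => Change::Unchanged,
        (None, Some(_)) => Change::Added,
        (Some(_), None) => Change::Deleted,
        (Some(a), Some(b)) if a == b => Change::Unchanged,
        (Some(_), Some(_)) => Change::Modified,
    }
}

/// Returns (staged change: HEAD -> index, unstaged change: index -> working dir).
/// An `Added` unstaged change corresponds to an untracked file.
fn status(working: Option<&str>, staged: Option<&str>, head: Option<&str>) -> (Change, Change) {
    (diff(head, staged), diff(staged, working))
}
```

With this split, the `A - B` row comes out as a staged deletion plus an untracked file, which is the state left behind when a file is removed from the index but kept on disk.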

🗜️ Object Storage Details

Compression

Objects are compressed using zlib deflate:

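use std::io::Write;

use flate2::{write::ZlibEncoder, Compression};

fn compress_object(content: &[u8]) -> std::io::Result<Vec<u8>> {
    // ZlibEncoder writes a zlib stream (header and checksum included),
    // growing the output Vec as needed.
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(content)?;
    // finish() flushes the remaining data and returns the underlying Vec.
    encoder.finish()
}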

Directory Structure

Objects are stored with the first two hex digits of the hash as the directory name and the remaining 38 digits as the file name.

This prevents having too many files in one directory.
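
The path derivation can be sketched as follows. `object_path` is a hypothetical helper; the only facts taken from this document are the `objects/` directory and the split after the first two hex digits.

```rust
use std::path::{Path, PathBuf};

/// Map a 40-character hex SHA-1 to its location under `<repo_dir>/objects/`.
/// Returns `None` for anything that is not a full hex hash.
fn object_path(repo_dir: &Path, hash: &str) -> Option<PathBuf> {
    // Validate first, so the split below cannot cut through a non-ASCII character.
    if hash.len() != 40 || !hash.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let (dir, file) = hash.split_at(2);
    Some(repo_dir.join("objects").join(dir).join(file))
}
```

Two hex digits give at most 256 subdirectories, so objects are spread roughly evenly across them.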

🔍 Educational Insights

Why Content Addressing?

  1. Deduplication: Identical content stored only once
  2. Integrity: Corruption changes hash, detectable
  3. Distributed: Objects transferable between repositories
  4. Immutability: Objects never change, only referenced

Why Three Trees?

  1. Flexibility: Stage partial changes
  2. Safety: Review before committing
  3. Efficiency: Only stage what changed
  4. Workflows: Support complex merge scenarios

Why SHA-1 (historically)?

  1. Collision resistance: Accidental collisions between different content are extremely unlikely (deliberately constructed collisions have since been demonstrated)
  2. Performance: Fast to calculate
  3. Fixed size: Always 40 hex characters
  4. Distributed: Works across different systems

🚀 Implementation Benefits

Our educational implementation:

🔬 Debugging Git Internals

Inspect Objects

# Find all objects
find .git-rs/objects -type f

# Examine object (compressed)
hexdump -C .git-rs/objects/5e/1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

# Decompress object (requires zpipe or similar)
zpipe -d < .git-rs/objects/5e/1c309dae... | hexdump -C

Inspect Index

# Pretty-print the staging area
jq . .git-rs/git-rs-index

# List staged paths
jq -r '.entries | keys[]' .git-rs/git-rs-index

Inspect References

# Current branch
cat .git-rs/HEAD

# All branches
find .git-rs/refs/heads -type f -exec echo {} \; -exec cat {} \;

# Branch content
cat .git-rs/refs/heads/main

🎯 Next Steps for Learning

  1. Implement log command: Display commit history and graph traversal
  2. Branch operations: Create, switch, merge branches
  3. Enhanced remote operations: Push, fetch, pull
  4. Advanced features: Rebasing, cherry-picking, submodules
  5. Performance optimization: Pack files, delta compression