Advanced Git branch filtering

Branch filtering is an advanced Git feature that is used rather infrequently, but there are situations where it can be quite handy. Branch filters can be used to manipulate (rewrite) multiple Git commits in the repository in a single step. The rewrite can cover a specific set of commits, an entire branch, or even all branches in the repository, including the tags. Branch filters can modify the commit messages, the authors, e-mails and dates of the commits, remove or rename files and folders, or even change the contents of files, all in a single run of the filter.

The general idea is that the commands entered as the Git filter parameter will be executed for each commit in the specified range, to update all the commits appropriately.

Such capabilities make this feature very powerful, but also potentially dangerous (make sure to read the remarks below). It is mostly useful in cases where all the commits in the repository need to be modified in some way and the repository has not been published yet, which is typically the case when migrating a repository from a different VCS such as Subversion or CVS. Another use case is sensitive information committed to the repository (such as a file containing a password or a certificate) that needs to be wiped from the entire history. Note, however, that if such commits were already published, the sensitive information must always be considered compromised.

Remarks on using the Git branch filters

Using this feature on Git commits that have already been published is strongly discouraged!

(This actually applies to all operations that change history, such as amend or rebase, on commits that are potentially already available to other people.)

The reason for this warning is that the Git branch filters do not change the original commits in place; they create new, different commits (with different SHA hashes), similarly to the "--amend" parameter of "git commit".
The modified (filtered) commits can still be re-published to the public central repository or to GitHub when the "--force" parameter of "git push" is used (the same as with amended commits), but such an action has some important consequences.

In particular, users who have already cloned the published repository need to fix their local clones manually and rebase their local changes appropriately (and of course they need to be notified of the change in the first place). Otherwise they will end up with a great mess in their local repositories after pulling the modified repository.
In principle, after the pull they will have all the modified commits twice in their local repositories (the original and the modified ones), and might be tempted to merge them if they are not aware of the change. That would be the worst thing to do: once such a merge is pushed, it brings the original unfiltered commits back into the central repository, which then contains both versions of the history and is severely damaged.

For a discussion of forcing modifications on published commits and the consequences, see e.g. the "On Re-tagging" section of the git-tag documentation:
http://git-scm.com/docs/git-tag#_on_re-tagging

In any case, before starting to fiddle with the Git history, always make a backup of the repository!

It is true that the old commits are not lost immediately (they stay in the repository until they become unreachable and "git gc" prunes them), and that the original branches are backed up under the ".git/refs/original" folder inside the repository metadata. Still, recovery through the reflog can be quite an ugly task, especially if multiple filters have already been applied in succession and the ".git/refs/original" backup has been overwritten.
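One simple way to make such a backup is a mirror clone, which copies all branches, tags and other refs. The following is only a minimal sketch: the repository, the file name and the backup location are placeholders created in a temporary directory, not part of any real project.

```shell
#!/bin/sh
# Sketch: back up a repository before rewriting its history.
# Everything below lives in a throwaway temporary directory (assumption).
set -eu

work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Example User"
git config user.email "user@example.com"
echo "hello" > file.txt
git add file.txt
git commit -q -m "Initial commit"

# A mirror clone is a bare copy that contains all refs of the original
git clone -q --mirror "$work/project" "$work/project-backup.git"

# Both repositories should now point at the same commit
original_head=$(git rev-parse HEAD)
backup_head=$(git --git-dir="$work/project-backup.git" rev-parse HEAD)
echo "original: $original_head"
echo "backup:   $backup_head"
```

If a rewrite goes wrong, the simplest recovery is then to throw the damaged repository away and clone it again from the backup.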

Types of Git branch filters

Git provides several types of branch filters, each with a different purpose and use case:

  • --env-filter: Modifies only the environment in which the commit is made. That covers changing the name, e-mail and date of the author and the committer.
  • --tree-filter: Modifies the tree and its contents, for example by moving, renaming or removing files and directories, or by changing the contents of files. This is the most general and most powerful filter for manipulating the files within multiple commits; it can make virtually any change to the file tree. In principle, each affected commit is checked out, the filter commands are applied, and the result is stored back as the new commit.
  • --index-filter: Similar to the tree-filter, but operates only on the commit index (it does not check out the files for each commit). That means it is naturally suited to only a subset of what the tree-filter can do (for example, removing a file or a directory from each commit), but in return it is much faster than the tree-filter.
  • --subdirectory-filter: Keeps only the history of a single specific subdirectory. The specified subdirectory becomes the new repository root, and all other files (outside of the specified directory) are removed.
  • --parent-filter: Modifies the list of commit parents (this can be handy when rebasing merges where both parents of the merge have also been modified).
  • --msg-filter: Modifies the commit messages.
  • There are some other filters as well; for the complete reference, see the git filter-branch documentation.

Environment filter (env-filter)

The following Git env-filter example can be used to fix or change the author e-mail and to reset the commit date to the original authored date (the commit date differs from the authored date if the commit has been amended, rebased, cherry-picked etc.):

git filter-branch -f --env-filter '

    # Get the current values of a single commit
    a_name="$GIT_AUTHOR_NAME"
    a_mail="$GIT_AUTHOR_EMAIL"
    a_date="$GIT_AUTHOR_DATE"
    c_name="$GIT_COMMITTER_NAME"
    c_mail="$GIT_COMMITTER_EMAIL"
    c_date="$GIT_COMMITTER_DATE"

    if [ "$a_name" = "Foo Hoo" ] || [ "$c_name" = "Foo Hoo" ] 
    then
        # Fix the author/committer name/mail
        a_name="Foo Hoo"
        a_mail="foo.hoo@gmail.com"
        c_name="Foo Hoo"
        c_mail="foo.hoo@gmail.com"

        # Restore the commit date
        c_date="$a_date"

        # Export the changed values back to the environment
        export GIT_AUTHOR_NAME="$a_name"
        export GIT_AUTHOR_EMAIL="$a_mail"
        export GIT_AUTHOR_DATE="$a_date"
        export GIT_COMMITTER_NAME="$c_name"
        export GIT_COMMITTER_EMAIL="$c_mail"
        export GIT_COMMITTER_DATE="$c_date"
    fi

' --tag-name-filter cat -- --all

The "--tag-name-filter cat" parameter makes the tags follow the rewrite: each tag is updated to point to the rewritten commit, while "cat" passes the tag name through unchanged. The "--all" parameter after "--" applies the filter to all refs (all branches and tags) in the current repository. If the filter is to be applied only to a specific branch, you can just specify the branch name instead. If the branch parameter is omitted, the filter is applied to the currently active branch.
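Besides a branch name, a revision range can be given after "--" to rewrite only a part of the history. The following sketch builds a throwaway repository with three commits and rewrites only the last two; the names and e-mail addresses are made up for illustration.

```shell
#!/bin/sh
# Sketch: restrict the rewrite to the last two commits of the current branch.
set -eu
# Skips the startup warning delay of newer Git versions (harmless on older ones)
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Foo Hoo"
git config user.email "old@example.com"
for n in 1 2 3
do
    echo "$n" > file.txt
    git add file.txt
    git commit -q -m "Commit $n"
done

first_before=$(git rev-parse HEAD~2)

# Only the commits in the range HEAD~2..HEAD are rewritten
git filter-branch -f --env-filter '
    export GIT_AUTHOR_EMAIL="new@example.com"
    export GIT_COMMITTER_EMAIL="new@example.com"
' -- HEAD~2..HEAD

first_after=$(git rev-parse HEAD~2)

# The oldest commit keeps its hash and its e-mail address
git log --format="%h %ae %s"
```

Commits outside of the range keep their original hashes, which is what makes this form usable when the older part of the history has already been published.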

The "-f" parameter forces the filter to run even if the ".git/refs/original" folder already exists inside the repository metadata, i.e. some branch filter has already been run before. The backup stored there by the previous run is then overwritten.
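As long as the backup refs under "refs/original" have not been overwritten by a later forced run, they can be used to undo the last rewrite. A minimal sketch on a throwaway repository (the commit message and file name are placeholders):

```shell
#!/bin/sh
# Sketch: inspect the backup refs, undo a rewrite, and clean the backup up.
set -eu
# Skips the startup warning delay of newer Git versions (harmless on older ones)
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"
echo "hello" > file.txt
git add file.txt
git commit -q -m "Update the docs"

branch=$(git symbolic-ref --short HEAD)
before=$(git rev-parse HEAD)

git filter-branch -f --msg-filter 'sed "s/Update/Improve/"' -- --all

# List the backup refs created by the filter run
git for-each-ref refs/original

# Undo the rewrite by resetting the branch to its backup ref
git reset -q --hard "refs/original/refs/heads/$branch"
after=$(git rev-parse HEAD)

# Delete the backup ref once it is no longer needed
git update-ref -d "refs/original/refs/heads/$branch"
```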

Message filter (msg-filter)

As the name suggests, this filter is intended for modifications of commit messages. For example:

git filter-branch -f --msg-filter '
    sed "
    s/Update the copyright/Actualize the copyright information/g
    s/Update/Improve/g
    "
' -- --all

Index filter (index-filter)

The index-filter is usually used to remove specific files directly from the index (without checking out the commit):

git filter-branch -f --index-filter '
    # Remove the directory Testing
    git rm --cached -r -f -q --ignore-unmatch "Testing"
    # Remove the file key.txt
    git rm --cached -f -q --ignore-unmatch "key.txt"
' --prune-empty --tag-name-filter cat -- --all

In particular, only commands that operate on the Git index have any effect (like "git rm", but not plain "rm").

Note the "--ignore-unmatch" parameter of "git rm", which skips the operation for commits that do not contain the removed files. Without it, the "git rm" command would fail and stop the Git filter processing.

Also note the "--prune-empty" parameter, which removes commits that become empty (a commit becomes empty if its only changes were made to the files being removed).
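After such a removal it is worth checking that the file is really gone from the rewritten history. The sketch below does that on a throwaway repository; the file names are made up. Note that "git log --all" would also walk the backup refs under "refs/original", which still contain the file, so the check looks at branches and tags only.

```shell
#!/bin/sh
# Sketch: remove a file from all commits and verify the result.
set -eu
# Skips the startup warning delay of newer Git versions (harmless on older ones)
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"
echo "code" > main.c
git add main.c
git commit -q -m "Add code"
echo "secret" > key.txt
git add key.txt
git commit -q -m "Add key"    # this commit touches only key.txt

git filter-branch -f --index-filter '
    git rm --cached -f -q --ignore-unmatch "key.txt"
' --prune-empty -- --all

# Commits on branches and tags that still touch key.txt (should be none)
remaining=$(git log --oneline --branches --tags -- key.txt)

# The commit that only added key.txt became empty and was pruned
commit_count=$(git rev-list --count HEAD)
echo "commits left: $commit_count"
```

The original commits stay reachable through "refs/original" (and the reflog) until those are cleaned up, so the file is not physically gone from the repository right after the filter run.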

Even though the files are not checked out (unlike with the tree-filter), you can still make some of the modifications that the tree-filter allows. The index just needs to be updated appropriately, or else the changes will be lost. For example, this will add the file "ReadMe.txt" to every existing commit:

git filter-branch -f --index-filter '
    # Copy the file from outside to the rewrite working directory
    cp -f "/full/path/ReadMe.txt" "ReadMe.txt"
    # Add the file to the index
    git add "ReadMe.txt"
' --tag-name-filter cat -- --all

Subdirectory filter (subdirectory-filter)

The subdirectory filter is used to filter out everything except a specific subdirectory and to make that subdirectory the new root of the repository. This can be used, for example, on a multi-project repository (which can be the result of a Subversion or CVS conversion) when splitting the repository into multiple per-project repositories.

Example:

git filter-branch -f --subdirectory-filter "projects/MyProject" --prune-empty --tag-name-filter cat -- --all

This will move the contents of the "projects/MyProject" folder in each commit into the root and remove everything else. Again, the "--prune-empty" parameter is used to remove the empty commits (which can be frequent in this scenario).
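The effect can be tried out safely on a throwaway repository. In the following sketch the second project folder and the file names are invented for illustration:

```shell
#!/bin/sh
# Sketch: split one project out of a multi-project repository.
set -eu
# Skips the startup warning delay of newer Git versions (harmless on older ones)
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"
mkdir -p projects/MyProject projects/Other
echo "other" > projects/Other/b.txt
git add projects/Other/b.txt
git commit -q -m "Add the other project"
echo "mine" > projects/MyProject/a.txt
git add projects/MyProject/a.txt
git commit -q -m "Add MyProject"

git filter-branch -f --subdirectory-filter "projects/MyProject" --prune-empty -- --all

# Only the contents of projects/MyProject remain, now at the repository root
files=$(git ls-files)
subjects=$(git log --format=%s)
echo "files:   $files"
echo "commits: $subjects"
```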

Tree filter (tree-filter)

As already mentioned, the tree-filter is the most general and most powerful branch filter and makes it possible to do virtually anything with the directory structure and file contents.

The following script walks through the entire directory structure and replaces some e-mail addresses with other ones in every file:

git filter-branch -f --tree-filter '
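    # Apply the replacements to every regular file in the checked-out tree
    # ("sed -i" as used here assumes GNU sed)
    find . -type f | while IFS= read -r file
    do
        sed -i "
        s/user1@company1\.com/user1@company2.com/g
        s/user2@company1\.com/user2@company2.com/g
        " "$file"
    done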
' --tag-name-filter cat -- --all

This was only a simple example of what the tree-filter can do. The following, more complex script does the following:

  • moves the "include" folder to "inc"
  • renames files with names starting with "ax" to start with "bx"
  • does various replacements in the files ("ax" to "bx", "include" to "inc", removing spaces after opening and before closing parentheses, etc.)
  • in all C++ source files, removes all single-line SVN tags (like "/* $Id ...$ */") and adds a copyright notice to every source file

git filter-branch -f --tree-filter '

    # Move "include" to "inc"
    [ -d "include" ] && git mv "include" "inc"

    # Rename files whose names start with "ax" to start with "bx"
    # (only the file name is changed, not the directories in its path)
    find . -type f -name "ax*" | while IFS= read -r file
    do
        git mv "$file" "$(echo "$file" | sed "s|/ax\([^/]*\)\$|/bx\1|")"
    done

    # Process all files, various replacements
    find . | while read file
    do
        if [ -f "$file" ]
        then
            sed -i "
            s|ax|bx|g
            s|include\([\/]\)|inc\1|g
            s|\([\/]\)include|\1inc|g
            s/AX_/BX_/g
            s/( *\([^ ]\)/(\1/g
            s/\([^ ]\) *)/\1)/g
            " "$file"
        fi
    done

    # Process C++ source files
    find . -name "*.h*" -o -name "*.c*" -o -name "*.inl" | while read file
    do
        # Remove the SVN tags
        sed -i "s| */\* \$.*\$ \*/ *||g" "$file"
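        # Add the licence comment to the top of the file; the awk command
        # at the end deletes all leading blank lines of the original file
        # (printf is used because "echo -e" is not supported by every sh)
        printf "%s\n" "/*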
    Copyright (C) 2014 The Company

    Licensed under the Apache License, Version 2.0 (the \"License\");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an \"AS IS\" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
    implied.
    See the License for the specific language governing permissions and
    limitations under the License.
*/

" > "$file.tmp" && awk "NF { X=1 } X" "$file" >> "$file.tmp" && mv "$file.tmp" "$file"
    done

' --tag-name-filter cat -- --all

This nicely illustrates what the tree-filter can do.

Note that many of these things would also be possible with other branch filters, such as the index-filter: in the script, you can always check out a particular file from the index, modify it and check it back into the index. However, if the intent is to modify multiple (or all) files as in the example above, it is actually faster to check out everything with a single checkout command (as the tree-filter does) than to check out the files one by one, e.g. with "git show". The index-filter is designed specifically for cases where a checkout is not necessary. If a checkout is necessary, the tree-filter is the better choice.
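To illustrate the index-based approach for a single file, here is a minimal sketch. It reads the file content from the index with "git cat-file", pipes it through sed, stores the result as a new blob and points the index entry at that blob. The file name "AUTHORS" and the addresses are assumptions made for this example, and the file mode is hard-coded to a regular non-executable file.

```shell
#!/bin/sh
# Sketch: modify one file in every commit using only the index.
set -eu
# Skips the startup warning delay of newer Git versions (harmless on older ones)
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"
echo "contact: user1@company1.com" > AUTHORS
git add AUTHORS
git commit -q -m "Add AUTHORS"
echo "other" > other.txt
git add other.txt
git commit -q -m "Add another file"

git filter-branch -f --index-filter '
    # Only act on commits whose index contains the file
    if git cat-file -e :AUTHORS 2>/dev/null
    then
        # Read the blob from the index, edit it, store the result as a new blob
        blob=$(git cat-file blob :AUTHORS |
               sed "s/user1@company1\.com/user1@company2.com/g" |
               git hash-object -w --stdin)
        # Point the index entry at the new blob (100644 = regular file)
        git update-index --cacheinfo "100644,$blob,AUTHORS"
    fi
' -- --all

git show HEAD:AUTHORS
```

For one or two known files this avoids a full checkout per commit; for changes across the whole tree, the tree-filter remains the simpler tool.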

Further reading

For further information regarding the Git branch filters, please refer to the git filter-branch manual.
