Migrate CVS to Git with cvs2svn

This article describes how to migrate an existing legacy CVS repository to the more recent Git version control system, using the cvs2svn script from Tigris.org.

Git has direct support for importing from CVS repositories in the form of the git cvsimport command. However, although this command is better suited to incremental conversion from CVS to Git than the alternatives, it has some known issues with one-time migrations. Therefore I prefer to use the cvs2svn bundle (specifically its cvs2git module) for one-time CVS-to-Git migrations.

Note that the cvs2git script only works for local repositories, as it needs direct access to the CVS repository to parse the repository structure. For remote repositories, the git cvsimport command needs to be used.

Step 1: Prerequisites

First, you need to get the cvs2svn bundle from Tigris.org:
http://cvs2svn.tigris.org/servlets/ProjectDocumentList

If you have Subversion installed, you can check out the current trunk directly from the Tigris.org SVN repository:

$ svn co --username=guest --password="" http://cvs2svn.tigris.org/svn/cvs2svn/trunk cvs2svn

cvs2svn is a Python script, so you need to download and install Python if you do not have it yet (cvs2svn was written for Python 2, so check the documentation shipped with it for the supported versions):
https://www.python.org/downloads/
(under Linux, refer to your packaging system)

You need to extract the script’s tar.gz archive. Under Unix/Linux, the tar and gzip commands are usually already installed; under Windows you need an archiving tool that can extract tar.gz archives, for example 7-Zip:
http://www.7-zip.org/download.html

Step 2: Prepare the script and repository

Extract the downloaded script package somewhere. You could also install the bundle right into the Python installation folder by running the install script, but that isn’t necessary in general.

The script accesses the CVS repository in read-only mode, so it should not change it. It is still wise to create a copy of the CVS repository and migrate the copy, for three reasons: the original is guaranteed to stay unmodified; you might need to make adjustments to the repository, as a damaged CVS repository is not an entirely rare issue (see Troubleshooting); and if the CVS repository is accessible externally, somebody might commit a change during the conversion.
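
The copy step can be scripted. The following is only a minimal sketch, assuming the repository is a plain directory tree on the local disk; the function name copy_cvs_repository is made up for this illustration and is not part of cvs2svn.

```python
import shutil
from pathlib import Path


def copy_cvs_repository(original, working_copy):
    """Copy the whole CVS repository tree and return the path of the copy.

    Hypothetical helper for illustration; not part of cvs2svn.
    """
    original = Path(original)
    working_copy = Path(working_copy)
    if working_copy.exists():
        # Refuse to overwrite, so an earlier (possibly hand-repaired) copy is not lost.
        raise FileExistsError(f"{working_copy} already exists")
    # copytree copies CVSROOT and all ",v" history files, preserving timestamps.
    shutil.copytree(original, working_copy)
    return working_copy
```

Pass the path of the copy, not of the original, to cvs2git in the next step.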

Step 3: Dump the CVS repository

Now we are ready to start the repository migration. The cvs2git script will create a dump of the CVS repository, which then can be imported into a Git repository. If the CVS repository contains multiple projects, you can either migrate the entire repository in a single step, or migrate the projects one by one (by specifying the path to the project after the path to the repository).

To create the dump, the following script parameters can be used:

$ python <path_to_cvs2svn>/cvs2git --blobfile=cvs2svn-tmp/git-blob.dat --dumpfile=cvs2svn-tmp/git-dump.dat "--username=Firstname Lastname" <PATH_TO_CVS_REPOSITORY>

If any names or log messages contain non-ASCII characters, you might need to specify the code page with the --encoding parameter, e.g. "--encoding=cp1252" (see Troubleshooting).

See the section on options file usage (Transformation of CVS author names, below) for extra possibilities that are only available through an options file, such as CVS author name transformation.

After a successful completion (note that it can take a lot of time, especially for large CVS repositories), the files "cvs2svn-tmp/git-blob.dat" and "cvs2svn-tmp/git-dump.dat" will be created. These can be imported into a Git repository afterwards.
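
If you run the dump repeatedly (for example while repairing the repository), it can help to assemble the command line in a small script. This is only a sketch that uses the same options as the command above; the function names and the example paths are assumptions made for illustration.

```python
import subprocess


def build_cvs2git_command(cvs2git_script, cvs_repository, username,
                          tmp_dir="cvs2svn-tmp", encodings=(), python="python"):
    """Return the argument list for the cvs2git dump step.

    Hypothetical helper; it only uses options shown in this article.
    """
    command = [
        python, cvs2git_script,
        f"--blobfile={tmp_dir}/git-blob.dat",
        f"--dumpfile={tmp_dir}/git-dump.dat",
        f"--username={username}",
    ]
    # One --encoding option per candidate code page, e.g. "cp1252".
    command += [f"--encoding={name}" for name in encodings]
    # The repository (or project) path goes last.
    command.append(cvs_repository)
    return command


def run_dump(*args, **kwargs):
    """Run cvs2git and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_cvs2git_command(*args, **kwargs), check=True)
```

For example, run_dump("cvs2svn/cvs2git", "/path/to/cvsrepo-copy", "Firstname Lastname", encodings=["cp1252"]) would run the same command as above with one extra encoding (both paths are placeholders). Passing the arguments as a list avoids shell quoting problems with the space in the user name.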

Step 4: Import the CVS dumps to the Git repository

First, create a new Git repository for the import (under Windows, it is recommended to continue in Git Bash instead of the regular command line):

$ mkdir NewGitRepo
$ cd NewGitRepo
$ git init

The dumps could actually be imported into any existing Git repository too, but that is highly discouraged (some adjustments, cleanup and rebasing might be needed after the import). If you want to join the migrated repo with another existing Git repository later, you can always do that with the git fetch command.

The --bare option of git init can also be used to create a bare Git repository (but I usually create a regular repository first, do the import, check it and perhaps do some rebases, amends, squashes, filtering etc. as needed).

Now the CVS dumps can be imported into the Git repository:

$ git fast-import --export-marks=../cvs2svn-tmp/git-marks.dat < ../cvs2svn-tmp/git-blob.dat
$ git fast-import --import-marks=../cvs2svn-tmp/git-marks.dat < ../cvs2svn-tmp/git-dump.dat
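
The order matters here: the blob file is imported first and its marks are exported, so that the second import can resolve the marks referenced by the dump file. The two-step import can be sketched in Python as follows; the function names are made up for this illustration, and all paths should be absolute, because git runs inside the new repository while the script opens the dump files from its own working directory.

```python
import subprocess


def fast_import_commands(marks_file):
    """Return the two git fast-import commands in the order they must run."""
    # Step 1 reads the blobs and writes the marks file.
    blob_cmd = ["git", "fast-import", f"--export-marks={marks_file}"]
    # Step 2 reads the marks file back, then imports the commits.
    dump_cmd = ["git", "fast-import", f"--import-marks={marks_file}"]
    return blob_cmd, dump_cmd


def import_dumps(git_repo, blob_file, dump_file, marks_file):
    """Feed both dump files to git fast-import inside git_repo.

    Hypothetical helper; use absolute paths for all three files.
    """
    blob_cmd, dump_cmd = fast_import_commands(marks_file)
    for command, stream in ((blob_cmd, blob_file), (dump_cmd, dump_file)):
        with open(stream, "rb") as data:
            # The dump is passed on standard input, like "<" in the shell.
            subprocess.run(command, stdin=data, cwd=git_repo, check=True)
```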

At this point, the migration is complete in principle, although you might want to check the state of the repository and make some additional adjustments and fixes if necessary.

After the repository adjustments are finished, you might also garbage-collect and re-pack the repository for optimal performance:

$ git gc --prune=now
$ git repack -a -d -f

Transformation of CVS author names

CVS uses simple author names (nicks), whereas Git uses an author/committer name (or nick) plus an email address. By default, the cvs2git script sets the Git author to "cvsauthor <cvsauthor>" (where “cvsauthor” stands for the actual CVS author name). It is possible, however, to translate (“transform”) the CVS author names into the appropriate Git author names.

The user name transformation can be set up in the options file. The options file should contain all the options needed for exporting the CVS repository, so it is then used instead of the command line options. You can take the sample "cvs2svn/cvs2git-example.options" file as a reference and make the required adjustments to it.

It is simple enough to adapt the example options file to your needs. To process the CVS repository with author name transformation, you need to set at least these options:

  • the CVS repository root directory (search for “run_options.set_project”)
  • the names to be translated (search for “author_transforms”)

Of course, you can set many other options as well. The following diff shows the values I usually change (besides the above, there is a code page option for the log messages):

diff --git a/cvs2git-example.options b/cvs2git-example.options
index 44646f6..2ffc678 100644
--- a/cvs2git-example.options
+++ b/cvs2git-example.options
@@ -195,9 +195,9 @@ ctx.cvs_author_decoder = CVSTextDecoder(
 ctx.cvs_log_decoder = CVSTextDecoder(
 [
 #'utf8',
 #'latin1',
- 'ascii',
+ 'cp1252',
 ],
 #fallback_encoding='ascii',
 eol_fix='\n',
 )
@@ -511,17 +511,15 @@ ctx.retain_conflicting_attic_files = False
 # values can either be strings in the form "name <email>" or tuples
 # (name, email). Please substitute your own project's usernames here
 # to use with the author_transforms option of GitOutputOption below.
 author_transforms={
- 'jrandom' : ('J. Random', 'jrandom@example.com'),
- 'mhagger' : 'Michael Haggerty <mhagger@alum.mit.edu>',
- 'brane' : (u'Branko Čibej', 'brane@xbc.nu'),
- 'ringstrom' : 'Tobias Ringström <tobias@ringstrom.mine.nu>',
- 'dionisos' : (u'Erik Hülsmann', 'e.huelsmann@gmx.net'),
+ 'username1' : 'First1 Last1 <username1@company.com>',
+ 'username2' : 'First2 Last2 <username2@company.com>',
+ 'username3' : 'First3 Last3 <username3@company.com>',

 # This one will be used for commits for which CVS doesn't record
 # the original author, as explained above.
- 'cvs2svn' : 'cvs2svn <admin@example.com>',
+ 'cvs2svn' : 'Default User <default.user@company.com>',
 }

 # This is the main option that causes cvs2svn to output to a
 # "fastimport"-format dumpfile rather than to Subversion:
@@ -559,9 +557,9 @@ run_options.profiling = False
 run_options.set_project(
 # The filesystem path to the part of the CVS repository (*not* a
 # CVS working copy) that should be converted. This may be a
 # subdirectory (i.e., a module) within a larger CVS repository.
- r'test-data/main-cvsrepos',
+ r'path_to_cvs_repository',

 # A list of symbol transformations that can be used to rename
 # symbols in this project.
 symbol_transforms=[

Once the options file is ready, the --options parameter is used to specify the options file on the command line:

$ python <path_to_cvs2svn>/cvs2git --options=<options_file_path>
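
To make the effect of the author_transforms mapping concrete, here is a small model of the lookup as it is described in this article. It is not cvs2git's own code: the function git_author and the sample entries are hypothetical, and the fallback simply reproduces the "cvsauthor <cvsauthor>" default mentioned above.

```python
def git_author(cvs_author, author_transforms):
    """Return the 'Name <email>' identity for a CVS author name.

    Illustrative model only; cvs2git's real implementation may differ.
    """
    value = author_transforms.get(cvs_author)
    if value is None:
        # No transform defined: fall back to the default described above.
        return f"{cvs_author} <{cvs_author}>"
    if isinstance(value, tuple):
        # Tuple form: (name, email).
        name, email = value
        return f"{name} <{email}>"
    # String form: already "Name <email>".
    return value


# Sample mapping using both value forms accepted by the options file.
transforms = {
    "username1": "First1 Last1 <username1@company.com>",
    "username2": ("First2 Last2", "username2@company.com"),
}
```

With this mapping, git_author("username2", transforms) gives "First2 Last2 <username2@company.com>", whereas an unlisted name such as "olduser" falls back to "olduser <olduser>". Listing all CVS authors up front therefore avoids meaningless email addresses in the Git history.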

Troubleshooting

Invalid CVS repository files

As mentioned above, there is a decent chance that some files in the original CVS repository are damaged. A damaged file normally doesn’t break the entire CVS repository; only operations on that specific file fail. Because of that, corrupted files often go unnoticed, which makes this situation surprisingly common in CVS repositories.

In such a case, you will see error messages like this:

ERROR: '..\\.CVSbase\\apps\\MyProject\\make\\linux\\sparc\\makefile,v' is not a valid ,v file
Traceback (most recent call last):
  File "cvs2svn\cvs2git", line 70, in <module>
    git_main(os.path.basename(sys.argv[0]), sys.argv[1:])
  File "c:\Users\emaskovsky\test\cvs2svn\cvs2svn_lib\main.py", line 119, in git_main
    main(progname, run_options, pass_manager)
  File "c:\Users\emaskovsky\test\cvs2svn\cvs2svn_lib\main.py", line 96, in main
    pass_manager.run(run_options)
  File "c:\Users\emaskovsky\test\cvs2svn\cvs2svn_lib\pass_manager.py", line 181, in run
    the_pass.run(run_options, stats_keeper)
  File "c:\Users\emaskovsky\test\cvs2svn\cvs2svn_lib\passes.py", line 109, in run
    walk_repository(project, file_key_generator, cd.record_fatal_error),
  File "c:\Users\emaskovsky\test\cvs2svn\cvs2svn_lib\collect_data.py", line 1165, in process_project
    self._process_cvs_file_items(cvs_file_items)
  File "c:\Users\emaskovsky\test\cvs2svn\cvs2svn_lib\collect_data.py", line 1140, in _process_cvs_file_items
    cvs_file_items.remove_unneeded_initial_trunk_delete(self.metadata_db)
AttributeError: 'NoneType' object has no attribute 'remove_unneeded_initial_trunk_delete'

You need to check the reported file (most likely it really is damaged in one way or another). If you cannot fix it (e.g. no backups are available and it can’t be repaired manually), it needs to be removed so that it does not block the conversion. To avoid losing the history, you can also “replace” it with a similar file, as I did in the case above: I couldn’t recover the original file, so I first replaced it with the versioned makefile for another platform and then fixed it in the final Git repository.
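
Since cvs2git stops at the first damaged file, a quick scan of the whole repository beforehand can save a few restarts. The sketch below is a crude heuristic of my own, not the check cvs2git performs: it only assumes that a well-formed ,v (RCS) file begins with the keyword "head", so a file that passes may still be damaged further on.

```python
import os


def find_suspicious_rcs_files(repository_root):
    """Return sorted paths of ',v' files that do not start with 'head'.

    Crude heuristic for illustration; cvs2git's parser is the real judge.
    """
    suspicious = []
    for directory, _subdirs, files in os.walk(repository_root):
        for name in files:
            if not name.endswith(",v"):
                continue  # not an RCS history file
            path = os.path.join(directory, name)
            with open(path, "rb") as handle:
                start = handle.read(64)
            # Empty, zero-filled or truncated-at-front files fail this test.
            if not start.lstrip().startswith(b"head"):
                suspicious.append(path)
    return sorted(suspicious)
```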

High ASCII characters in names or commit messages

If there are non-ASCII characters (above 0x7F) in names or commit messages, you might see error messages like this:

----- pass 2 (CleanMetadataPass) -----
Converting metadata to UTF8...
WARNING: Problem decoding log message:
---------------------------------------------------------------------------
The initial release of GŘnter C++ Framework

---------------------------------------------------------------------------
ERROR: There were warnings converting author names and/or log messages
to Unicode (see messages above).  Please restart this pass
with one or more '--encoding' parameters or with
'--fallback-encoding'.

As the message suggests, use the --encoding parameter with an appropriate encoding, e.g. "--encoding=cp1252" (or set the appropriate encoding in the options file).
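
If you do not know which code page the old commits used, you can test a few candidates on a problematic message before restarting the conversion. The following sketch assumes that candidate encodings are tried in the given order and the first one that decodes without error wins; the function name and the sample bytes are made up for illustration.

```python
def decode_with_candidates(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw.

    Returns (None, None) if every candidate fails.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    return None, None


# Hypothetical log message bytes containing one non-ASCII byte (0xFC).
sample = b"The initial release of G\xfcnter C++ Framework"
text, used = decode_with_candidates(sample, ["ascii", "utf8", "cp1252"])
```

Keep in mind that a single-byte encoding such as latin1 accepts every possible byte, so it never fails; a successful decode therefore does not prove the encoding is the right one, and the decoded text still needs to be checked by eye.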
