HTAR - Introduction

HTAR is a utility for aggregating a set of files from the local file system directly into HPSS, creating a file that conforms to the POSIX TAR specification.

It does this without having to first create an intermediate file on the local filesystem; instead, it uses a sophisticated multithreaded buffering scheme to write files directly into HPSS, thereby achieving a high rate of performance.
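The idea of writing an archive straight to its destination through a small pool of buffers can be illustrated with a minimal Python sketch. This is not HTAR's implementation: the names QueueWriter and stream_tar are hypothetical, an in-memory buffer stands in for HPSS, and a single consumer thread stands in for whatever threading scheme HTAR actually uses.

```python
import io
import queue
import tarfile
import threading


class QueueWriter:
    """Write-only file-like object that hands each chunk to a bounded queue."""

    def __init__(self, q):
        self.q = q

    def write(self, data):
        self.q.put(bytes(data))  # blocks when all buffers are in flight
        return len(data)


def stream_tar(members, sink, max_buffers=4):
    """Build a tar stream from (name, bytes) pairs and write it to sink.

    No intermediate archive file is created: tar blocks go from the
    producer (this thread) to the consumer thread through the queue.
    """
    q = queue.Queue(maxsize=max_buffers)

    def consumer():
        # Drain buffers into the sink until the None sentinel arrives.
        while True:
            chunk = q.get()
            if chunk is None:
                break
            sink.write(chunk)

    worker = threading.Thread(target=consumer)
    worker.start()
    try:
        # "w|" is tarfile's streaming mode: it only calls write(), never seek().
        with tarfile.open(fileobj=QueueWriter(q), mode="w|") as tar:
            for name, data in members:
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    finally:
        q.put(None)
        worker.join()


sink = io.BytesIO()  # assumption: stand-in for the remote archive file
stream_tar([("a.txt", b"hello"), ("b.txt", b"world!")], sink)
```

The bounded queue is what keeps memory use fixed: the producer stalls when the sink falls behind, instead of accumulating the whole archive locally.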

HTAR was originally developed by Michael Gleicher in 2000-2001, and is in use at many HPSS sites, providing the critical “file bundling” that is required in order to avoid storing potentially tens or hundreds of millions of small files, each of which would require separate metadata. HTAR was featured at Supercomputing 2007 for the IBM “Billion File” demonstration, and a specially modified version of HTAR is used for the aggregation capability of the IBM GHI feature.

HTAR is freely available to HPSS sites that have an HPSS license. Support is provided at no extra cost, and is in the process of transitioning from Gleicher Enterprises to the HPSS Collaboration Development partners.

When HTAR creates the TAR file, it also builds an index file, which is stored in the same directory as the TAR file, as shown by the diagram below:

[Figure: HTAR Archive/Index file creation]

The index filename is normally the same as the TAR file name, with a ".idx" suffix added. 
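To show why an index makes random extraction cheap, here is a minimal Python sketch that records the data offset and size of each member of a TAR archive and then reads one member with a single seek. The real HTAR index format is not described here, so the dictionary layout and the helper names (index_name, make_archive, build_index, read_member) are assumptions made purely for illustration.

```python
import io
import tarfile


def index_name(tar_name):
    """Default index file name: the TAR file name plus an ".idx" suffix."""
    return tar_name + ".idx"


def make_archive(members):
    """Create a small in-memory TAR archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(archive_bytes):
    """Map each regular member name to (data offset, size) in the archive."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as tar:
        for info in tar:
            if info.isfile():
                index[info.name] = (info.offset_data, info.size)
    return index


def read_member(archive, index, name):
    """Read one member by seeking straight to its data; no header scan."""
    offset, size = index[name]
    archive.seek(offset)
    return archive.read(size)


archive_bytes = make_archive({"a.txt": b"hello", "b.txt": b"world!"})
index = build_index(archive_bytes)
print(index_name("data.tar"))
print(read_member(io.BytesIO(archive_bytes), index, "b.txt"))
```

Without an index, locating a member means reading every header that precedes it in the archive; with one, the cost of extraction does not depend on the member's position.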

HTAR provides a number of commands to work with archive files, including:

  • actions to create, list and verify the contents of archive files, and to randomly extract files from within an archive file, using the offset(s) stored in the index file
  • new in V6: ability to optionally exclude classes of files during a create operation
  • ability to recreate an index file that has been accidentally deleted, and to create index files for TAR-format archive files that were created by other versions of TAR
  • ability to create and verify CRC checksums for member files within the archive file
  • ability to store member files of up to 8^12-1 bytes (approximately 68GB) within the archive
  • ability to specify the HPSS Class of Service for either the archive file, the index file, or both
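The member size limit quoted in the list above can be checked with a few lines of Python. The form 8^12 suggests a size field of 12 octal digits, compared with the 11 octal digits commonly used in the standard ustar header; that reading is an assumption, not something stated here.

```python
# Largest value that fits in N octal digits is 8**N - 1.
max_member_bytes = 8**12 - 1   # limit quoted for HTAR member files
ustar_max_bytes = 8**11 - 1    # 11 octal digits, for comparison

print(f"{max_member_bytes:,} bytes")             # 68,719,476,735 bytes
print(f"{max_member_bytes / 10**9:.1f} GB")      # 68.7 GB (decimal gigabytes)
print(f"{max_member_bytes / 2**30:.1f} GiB")     # 64.0 GiB (binary gigabytes)
print(f"{ustar_max_bytes:,} bytes")              # 8,589,934,591 bytes
```

So "approximately 68GB" is the decimal figure; in binary units the same limit is just under 64 GiB.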

HTAR was originally designed to work with 50,000 to 100,000 small (1 to 3 megabyte) files, but it has proven to be capable of working efficiently with very large archive files containing large numbers of member files. It is now routinely used at some sites to create archives containing tens of millions of files, with archive files reaching sizes of up to 10 terabytes.

HTAR is used as the aggregation mechanism for the newly developed IBM GPFS-HPSS interface, and was demonstrated as part of the Billion File Demo at Supercomputing 2007.