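Scanning the gzip data produces the structure below, which stores random access points. Each index entry records the offset of its segment in the gzip file (cmp_offset, the in field below) and the corresponding offset in the decompressed data (ucomp_offset, the out field below), plus a bits value that handles the case where the compressed data does not start on a byte boundary. The last part is the data: the dictionary for this segment (window, the 32K of uncompressed data preceding the point). To decompress a range, find the access point that covers the requested uncompressed offset, use cmp_offset to seek to the matching position in the compressed file, prime the decompressor with the stored bits and window, and decompress from there.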

/* Access point entry. */
struct point {
    off_t out;          /* corresponding offset in uncompressed data */
    off_t in;           /* offset in input file of first full byte */
    int bits;           /* number of bits (1-7) from byte at in-1, or 0 */
    unsigned char window[WINSIZE];  /* preceding 32K of uncompressed data */
};
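
The lookup described above can be sketched as a binary search over the access points. This is a minimal sketch, not the original implementation: find_point is a hypothetical helper name, and it assumes the points are kept in an array sorted by ascending out.

```c
#include <stddef.h>
#include <sys/types.h>   /* off_t */

#define WINSIZE 32768U

/* Same layout as the access point entry above. */
struct point {
    off_t out;          /* corresponding offset in uncompressed data */
    off_t in;           /* offset in input file of first full byte */
    int bits;           /* number of bits (1-7) from byte at in-1, or 0 */
    unsigned char window[WINSIZE];  /* preceding 32K of uncompressed data */
};

/* Hypothetical helper: return the last access point whose `out` is at or
   before `offset`, or NULL if there is no such point. Assumes `list` holds
   `have` entries sorted by ascending `out`. */
const struct point *find_point(const struct point *list, size_t have,
                               off_t offset)
{
    size_t lo = 0, hi = have;

    if (have == 0 || offset < list[0].out)
        return NULL;

    /* Invariant: list[lo].out <= offset, and every entry at or after
       index hi starts beyond offset. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid].out <= offset)
            lo = mid;
        else
            hi = mid;
    }
    return &list[lo];
}
```

The caller would then seek to the in field of the returned point, decompress from there, and discard output until it reaches the requested offset.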

The z_stream data structure:

typedef struct z_stream_s {
    z_const Bytef *next_in;     /* next input byte */
    uInt     avail_in;  /* number of bytes available at next_in */
    uLong    total_in;  /* total number of input bytes read so far */

    Bytef    *next_out; /* next output byte will go here */
    uInt     avail_out; /* remaining free space at next_out */
    uLong    total_out; /* total number of bytes output so far */

    z_const char *msg;  /* last error message, NULL if no error */
    struct internal_state FAR *state; /* not visible by applications */

    alloc_func zalloc;  /* used to allocate the internal state */
    free_func  zfree;   /* used to free the internal state */
    voidpf     opaque;  /* private data object passed to zalloc and zfree */

    int     data_type;  /* best guess about the data type: binary or text
                           for deflate, or the decoding state for inflate */
    uLong   adler;      /* Adler-32 or CRC-32 value of the uncompressed data */
    uLong   reserved;   /* reserved for future use */
} z_stream;

The Z_BLOCK option assists in appending to or combining deflate streams. To assist in this, on return inflate() always sets strm->data_type to the number of unused bits in the last byte taken from strm->next_in, plus 64 if inflate() is currently decoding the last block in the deflate stream, plus 128 if inflate() returned immediately after decoding an end-of-block code or decoding the complete header up to just before the first byte of the deflate stream. The end-of-block will not be indicated until all of the uncompressed data from that block has been written to strm->next_out. The number of unused bits may in general be greater than seven, except when bit 7 of data_type is set, in which case the number of unused bits will be less than eight. data_type is set as noted here every time inflate() returns for all flush options, and so can be used to determine the amount of currently consumed input in bits.

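At a block boundary, the low three bits of the z_stream's data_type hold the number of unused bits in the last byte consumed. 64 is added if inflate() is currently decoding the last block, and 128 is added if inflate() returned immediately after decoding an end-of-block code or after the complete gzip header. An index point is added only when a block has just ended (bit 7 set) and the stream is not in its last block (bit 6 clear), so the code typically checks this condition before recording a point: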

if ((strm.data_type & 128) && !(strm.data_type & 64) &&
    (totout == 0 || totout - last > span)) {
    index = addpoint(index, strm.data_type & 7, totin,
                     totout, strm.avail_out, window);
    if (index == NULL) {
        ret = Z_MEM_ERROR;
        goto deflate_index_build_error;
    }
    last = totout;
}
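
To make the bit tests in the condition above easier to follow, here is a small sketch that pulls them into named helpers. The helper names are hypothetical and not part of zlib; only the bit values (the low bits, 64, and 128) come from the zlib documentation quoted earlier.

```c
/* Hypothetical helpers that interpret strm.data_type after
   inflate(&strm, Z_BLOCK) returns. Not part of zlib. */

/* Nonzero if inflate() stopped right after an end-of-block code (or the
   header) and the stream is not in its last deflate block. */
int is_index_candidate(int data_type)
{
    return (data_type & 128) != 0 && (data_type & 64) == 0;
}

/* Number of unused bits in the last input byte. Only meaningful when
   bit 7 (128) is set, in which case the count is less than eight. */
int unused_bits_at_boundary(int data_type)
{
    return data_type & 7;
}
```

For example, a data_type of 131 (128 + 3) marks a block boundary with 3 unused bits, while 195 (128 + 64 + 3) is a boundary inside the last block and is skipped.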

The index is built as follows:

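  1. Read a chunk from the compressed data. The chunk size is configurable; the default is 16K.
  2. Create a z_stream and initialize it with inflateInit2, using a windowBits of 31 (gzip) or 47 (automatic gzip/zlib detection), then hand the chunk just read to the z_stream as input.
  3. First check whether the compressed data begins with the three bytes 31, 139, 8, which identify a gzip stream.
  4. Call inflate(&strm, Z_BLOCK) to decompress up to the next deflate block boundary and update the z_stream.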
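
Steps 2 and 3 can be illustrated without linking against zlib. The sketch below is an assumption-laden illustration: looks_like_gzip and the WBITS_* names are hypothetical, while the values follow the inflateInit2 convention of adding 16 (gzip only) or 32 (automatic detection) to the base window size of 15.

```c
#include <stddef.h>

/* Hypothetical names for the windowBits values passed to inflateInit2. */
enum {
    WBITS_GZIP = 15 + 16,   /* 31: decode gzip streams only */
    WBITS_AUTO = 15 + 32    /* 47: detect gzip or zlib automatically */
};

/* Hypothetical helper: return 1 if the buffer starts with the gzip magic
   bytes 31, 139 followed by compression method 8 (deflate), else 0. */
int looks_like_gzip(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 31 && buf[1] == 139 && buf[2] == 8;
}
```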

The decompression process: