ZEP 8 — URL pipeline syntax


Author: Jeremy Maitin-Shepard <jbms@google.com>, Google Research

Status: Draft

Type: Specification

Created: 2023-09-07

Abstract

This proposal defines a URL pipeline syntax for specifying how to locate a zarr node, a plain key-value store, or other resources.

Motivation and Scope

A URL syntax for zarr nodes that is common across multiple zarr implementations enables users to more easily share dataset locations between different tools.

In simple cases it is sufficient to specify the zarr node location using an existing, well-established URL scheme like file or http, or an existing but less-standard URL scheme like s3 or gs. However, there is no established URL syntax for nested storage mechanisms like the ZIP format, or for nodes within a zarr hierarchy.

Additionally, for implementations that support data formats other than zarr v3, it may be necessary to also indicate the data format as part of the URL syntax.

This proposal defines a new absolute and relative URL syntax that addresses this need.

Usage and Impact

Zarr implementations that support this URL syntax are expected to provide an API for opening a Zarr node at a given URL, and for obtaining the URL corresponding to an open Zarr node.

Using these APIs, data can easily be shared between different Zarr implementations that support the proposed URL syntax.

Implementations may also optionally support this syntax for uses beyond Zarr, such as specifying a bare key-value store or specifying data in non-zarr formats.

Detailed description

This ZEP defines a URL pipeline syntax that may be optionally supported by Zarr implementations in order to allow the location of a zarr array or group to be specified in a convenient, implementation-independent way.

More precisely, it defines a URL syntax that may specify:

  • A root key-value store, such as a path within an S3 bucket or a path on the local filesystem, e.g. s3://bucket/path;

  • A nested key-value store, such as a sub-directory within a ZIP file within an S3 bucket, e.g. s3://bucket/path/to/archive.zip|zip:path/within/zip/;

  • A zarr v2 or v3 node within a zarr hierarchy within a particular key-value store, e.g. s3://bucket/path/to/zarr/|zarr3:path/within/hierarchy/;

  • A non-zarr dataset within a particular key-value store, e.g. s3://bucket/path/to/image.tiff|tiff: or s3://bucket/path/to/dataset.n5/|n5:path/within/hierarchy/.

Resource kinds

The proposed URL pipeline syntax may refer to several different kinds of resources:

  • file: Single file within a key-value store, no specific data format.

  • directory: Single directory within a key-value store, no specific data format.

  • dataset: An array, group, or other dataset with a defined format e.g. a zarr array or group.

Depending on the URL schemes involved, in some cases the resource kind can be determined syntactically from the URL alone, while in other cases it can only be resolved by actually accessing the resource.

Absolute URL syntax

The proposed “zarr URL pipeline syntax” has the following ABNF grammar:

absolute_url_pipeline = root_url *( "|" adapter )

where a root_url specifies an absolute resource location, such as file:/path/to/local/file, and adapter specifies a nested resource using a specified protocol, such as zip:path/within/zip. The root_url and each of the adapter portions are considered “sub-URLs”.

For a given adapter, the sequence of root_url and prior adapter sub-URLs is called the “base URL”.
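
The grammar and the "base URL" definition above can be illustrated with a minimal Python sketch. The names SubUrl, parse_pipeline, and base_url_of are hypothetical and not part of this proposal, and the sketch assumes that a literal | never appears inside a sub-URL.

```python
from typing import List, NamedTuple


class SubUrl(NamedTuple):
    scheme: str  # text before the first ":" (e.g. "s3", "zip")
    rest: str    # text after the first ":" (may be empty)


def parse_pipeline(url: str) -> List[SubUrl]:
    """Split an absolute URL pipeline into its root URL and adapters."""
    sub_urls = []
    for part in url.split("|"):
        scheme, _, rest = part.partition(":")
        sub_urls.append(SubUrl(scheme, rest))
    return sub_urls


def base_url_of(url: str, index: int) -> str:
    """Return the base URL of the sub-URL at `index` (0 is the root_url)."""
    # The base URL is the root_url plus all adapters before `index`.
    return "|".join(url.split("|")[:index])
```

For example, base_url_of("gs://b/a.zip|zip:x|zarr3:y", 2) gives the base URL of the zarr3: adapter, namely gs://b/a.zip|zip:x.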

The following root_url schemes are defined:

  • file: as defined by RFC8089.

    file:/absolute/path and file:///absolute/path are supported.

    Implementations that have a current working directory MAY support the non-standard extension file:relative/path where relative/path is resolved relative to the current working directory.

    Implementations SHOULD NOT support file://relative/path since that is ambiguous with the file://hostname/path syntax defined by RFC8089.

    If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory.

    Otherwise, the resultant resource kind may be either file or directory.

  • http: and https: for generic HTTP servers

    In general only single-key read operations (corresponding to GET requests) are supported.

    Implementations may choose to heuristically support list operations by detecting server support for HTML directory listings and/or S3-compatible listing.

    The resultant resource kind may be either file or directory.

  • s3://bucket/path/within/bucket for AWS S3

    The endpoint, appropriate credentials, and bucket region (for non-anonymous access) must be determined automatically.

    If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory.

    Otherwise, the resultant resource kind may be either file or directory.

  • s3+http: and s3+https: for S3-compatible servers

    This URL syntax is ambiguous. A URL is interpreted as either:

    • s3+https://endpoint/path/within/bucket, assuming endpoint corresponds to a single bucket, e.g. s3+https://mybucket.s3.amazonaws.com/bucket/path, or;

    • s3+https://endpoint/bucket/path/within/bucket, assuming endpoint corresponds to multiple buckets, e.g. s3+https://s3.amazonaws.com/bucket/path.

    The appropriate credentials (and bucket region for non-anonymous access) must be determined automatically.

    For most operations, e.g. to GET a single object, this ambiguity does not matter, but for LIST operations the implementation must determine the path to the root of the bucket, e.g. by initially attempting the LIST operation with both possible paths and checking which one succeeds. For subsequent operations on the same endpoint the result can be cached to avoid overhead.

    If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory.

    Otherwise, the resultant resource kind may be either file or directory.

    For the purpose of relative URLs, the path component includes the bucket/ prefix even if the endpoint is in fact a multi-bucket endpoint.

  • gs://bucket/path/within/bucket for Google Cloud Storage (GCS)

    If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory.

    Otherwise, the resultant resource kind may be either file or directory.
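
The trailing-slash rule stated above for the file:, s3:, s3+http:, s3+https:, and gs: schemes can be sketched as follows. The function name and the string labels for resource kinds are illustrative assumptions only. For http: and https:, the text above always allows both kinds, so the sketch does not apply the rule to them.

```python
# Root schemes for which the text above states the trailing-slash rule.
TRAILING_SLASH_SCHEMES = {"file", "s3", "s3+http", "s3+https", "gs"}


def possible_kinds(scheme: str, path: str) -> set:
    """Return the resource kinds that are possible from syntax alone."""
    if scheme in TRAILING_SLASH_SCHEMES and (path == "" or path.endswith("/")):
        # Known syntactically to be a directory.
        return {"directory"}
    # Otherwise the kind can only be resolved by accessing the resource.
    return {"file", "directory"}
```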

The following adapter URL schemes are defined:

  • zip:path/within/zip for ZIP archive format

    The base URL must refer to a file resource (which is expected to be in ZIP format).

    If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory.

    Otherwise, the resultant resource kind may be either file or directory.

  • ocdbt: for OCDBT

    The base URL must refer to a directory resource (which is expected to be in OCDBT format).

    The URL syntax is ocdbt:path or ocdbt:@version/path, where version is either v123 or 2025-01-01T01:23:45.678Z.

    For example:

    • file:///tmp/dataset.ocdbt/|ocdbt:@v1/path/within/database
    • file:///tmp/dataset.ocdbt/|ocdbt:path/within/database
    • file:///tmp/dataset.ocdbt/|ocdbt:@2025-01-01T01:23:45.678Z/path/within/database

    While @ is normally allowed within the path component of a URL, with the ocdbt: URL scheme, if the path starts with @ it must be percent-encoded as %40 to avoid ambiguity with the @version component. For example, a path of @abc can be specified as ocdbt:%40abc.

    The resultant resource kind may be either file or directory.

    For the purpose of relative URLs, the path component does not include the @version/ prefix if present.

  • icechunk: for Icechunk

    The base URL must refer to a directory resource (which is expected to contain an Icechunk database).

    The following syntaxes are supported:

    • icechunk:path/to/node/
    • icechunk:@branch.BRANCH/path/to/node/
    • icechunk:@tag.TAG/path/to/node/
    • icechunk:@SNAPSHOT/path/to/node/

    For example:

    • file:///path/to/repo.zarr.icechunk/|icechunk:|zarr3:path/to/array/
    • file:///path/to/repo.zarr.icechunk/|icechunk:@branch.other/|zarr3:path/to/array/
    • file:///path/to/repo.zarr.icechunk/|icechunk:@tag.v5/|zarr3:path/to/array/
    • file:///path/to/repo.zarr.icechunk/|icechunk:@4N0217AZA4VNPYD0HR0G/|zarr3:path/to/array/

    While @ is normally allowed within the path component of a URL, with the icechunk: URL scheme, if the path starts with @ it must be percent-encoded as %40 to avoid ambiguity with the @version component. For example, a path of @abc/ can be specified as icechunk:%40abc/.

    If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory.

    Otherwise, the resultant resource kind may be either file or directory.

    For the purpose of relative URLs, the path component does not include the @version/ prefix if present.

  • zarr3:path/within/hierarchy/ to specify a zarr v3 node

    The base URL must refer to a directory resource, which is expected to contain a zarr v3 hierarchy.

    The resultant resource kind is always dataset.

  • zarr2:path/within/hierarchy/ to specify a zarr v2 node

    The base URL must refer to a directory resource, which is expected to contain a zarr v2 hierarchy.

    The resultant resource kind is always dataset.

  • zarr:path/within/hierarchy/ to specify a zarr v2 or v3 node

    The base URL must refer to a directory resource, which is expected to contain a zarr v2/v3 hierarchy.

    The implementation must determine the zarr format version automatically.

    The resultant resource kind is always dataset.

  • n5:path/within/hierarchy to specify an N5 group or array

    The base URL must refer to a directory resource, which is expected to contain an N5 hierarchy.

    Because N5 is defined to inherit attributes from ancestor groups in the hierarchy, it is recommended that the base URL refers to the root of the N5 hierarchy, and any path within the hierarchy be specified through the n5: scheme.

    The resultant resource kind is always dataset.

  • gzip:, zstd: for transparent access to compressed files

    The base URL must refer to a file resource, which is expected to be in the format indicated by the URL scheme.

    Currently no path is supported.

    The resultant resource kind is always file.

    For example:

    • gs://bucket/path/to/data.gz|gzip:
    • gs://bucket/path/to/data.zstd|zstd:

  • byte-range:start-end for specifying a byte range within a file

    The base URL must refer to a file resource, which is expected to support byte range access.

    The start and end components of the URL specify byte offsets in base 10. The start bound is inclusive while the end bound is exclusive; the total length is end - start.

    The resultant resource kind is always file.

    For example:

    • gs://bucket/path/to/data|byte-range:1000-2000

  • tiff:, jpeg:, png:, bmp:, avif:, webp:

    The base URL must refer to a file resource, which is expected to contain an image in the format indicated by the URL scheme.

    Currently no path is supported.

    The resource kind is always dataset.

  • neuroglancer-precomputed: for Neuroglancer precomputed

    The base URL must refer to a directory resource, which is expected to contain a neuroglancer precomputed dataset.

    Currently no path component is allowed by the neuroglancer-precomputed URL scheme.

    The resource kind is always dataset.

  • json:path

    The base URL must refer to a file resource, which is expected to contain an encoded JSON document.

    The path is in JSON pointer syntax and indicates a sub-value within the JSON document. An empty path corresponds to the entire JSON document.

    The resource kind is always dataset, specifically a rank-0 array with data type json.

  • ..:path/within/outer/sub-URL and ..:/path/within/outer/sub-URL may be used to traverse out from the prior adapter.

    This scheme is primarily useful within the relative URL pipeline syntax defined below.

    Any .. adapters are resolved in order. The presence of a .. adapter causes the prior adapter to be discarded. The path component is interpreted relative to the path of the sub-URL immediately prior to the discarded adapter sub-URL.

    It is an error if there are no remaining adapter sub-URLs when resolving the .. adapter.

    For security, implementations SHOULD place limits on where this scheme is permitted.
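
The @version prefix handling described above for the ocdbt: and icechunk: schemes can be sketched as follows. This is a minimal illustration under stated assumptions: split_version is a hypothetical helper, it receives the text after the scheme and colon, and it does not validate the version string itself.

```python
from typing import Optional, Tuple
from urllib.parse import unquote


def split_version(adapter_path: str) -> Tuple[Optional[str], str]:
    """Split the text after 'ocdbt:' or 'icechunk:' into (version, path)."""
    version = None
    if adapter_path.startswith("@"):
        # An unencoded leading "@" introduces the version component,
        # which ends at the first "/".
        version, _, adapter_path = adapter_path[1:].partition("/")
    # A path that itself starts with "@" arrives as "%40..." and is
    # only decoded after the version prefix has been removed.
    return version, unquote(adapter_path)
```

With this sketch, split_version("%40abc") yields no version and the path @abc, matching the percent-encoding rule above.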

If the adapter URL would otherwise consist of just the scheme followed by “:”, it is permitted to omit the final “:”. For example:

  • https://example.com/path/to/archive.zip|zip|zarr3 is equivalent to https://example.com/path/to/archive.zip|zip:|zarr3:.
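
A minimal sketch of expanding this shorthand, assuming a hypothetical helper named expand_shorthand and assuming that | only ever appears as the sub-URL delimiter:

```python
def expand_shorthand(url: str) -> str:
    """Append the omitted ":" to adapters that consist of just a scheme."""
    root, *adapters = url.split("|")
    # An adapter without any ":" is a bare scheme name.
    expanded = [a if ":" in a else a + ":" for a in adapters]
    return "|".join([root, *expanded])
```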

It is expected that additional URL schemes may be standardized in the future.

Examples

Examples:

  • https://server.example.com:1234/path/to/array

    Specifies a normal HTTPS URL.

  • s3://bucket/path/to/file

    Specifies:

    • within the AWS S3 bucket named bucket,
    • the path path/to/file.

  • gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|zip:path/to/zarr/hierarchy|zarr3:path/to/array

    Specifies:

    • within the GCS bucket named bucket,
    • within the ZIP file at the path path/to/outer.zip,
    • within the ZIP file at the path path/to/inner.zip,
    • within the Zarr v3 hierarchy at the path path/to/zarr/hierarchy/,
    • the Zarr v3 node at the path path/to/array/.

  • gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|..:other/zarr/hierarchy|zarr3:path/to/array

    Normalizes to:

    gs://bucket/path/to/other/zarr/hierarchy/|zarr3:path/to/array
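
The resolution of .. adapters used in the last example can be sketched in Python as follows. All function names are hypothetical. The sketch assumes that the path component of a sub-URL is whatever follows scheme://authority/ or scheme:, so it ignores scheme-specific rules such as the @version prefix, and it borrows the standard library's urljoin with a dummy HTTP base to apply ordinary relative-path resolution. It also does not add the trailing / shown in the normalized form above.

```python
from urllib.parse import urljoin

_DUMMY = "http://dummy/"  # placeholder base, used only for path resolution


def _split_sub_url(sub_url: str):
    """Split a sub-URL into (prefix, path); an assumed, simplified rule."""
    scheme, _, rest = sub_url.partition(":")
    if rest.startswith("//"):
        authority, _, path = rest[2:].partition("/")
        return scheme + "://" + authority + "/", path
    if rest.startswith("/"):
        return scheme + ":/", rest[1:]
    return scheme + ":", rest


def resolve_parent_adapters(url: str) -> str:
    """Resolve all ".." adapters in a URL pipeline, in order."""
    resolved = []
    for sub_url in url.split("|"):
        scheme, _, rel = sub_url.partition(":")
        if scheme != "..":
            resolved.append(sub_url)
            continue
        if len(resolved) < 2:
            raise ValueError("no adapter left for '..' to discard")
        resolved.pop()  # discard the prior adapter
        # Interpret the path relative to the sub-URL before the discarded one.
        prefix, path = _split_sub_url(resolved[-1])
        resolved[-1] = prefix + urljoin(_DUMMY + path, rel)[len(_DUMMY):]
    return "|".join(resolved)
```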

Format auto-detection

Implementations MAY support format auto-detection for certain adapter URL schemes.

For a given base URL specifying a file or directory resource, the implementation determines a set of matching adapter URLs:

  • For a base file resource, this is typically done by reading a prefix and/or suffix of the file in order to match expected signatures;

  • For a base directory resource, this is typically done by checking for the presence of certain files.

Given a base URL specifying a file or directory resource, to obtain a dataset resource using format auto-detection, the implementation:

  1. Determines the set of matching adapter URLs for the current base URL. If there is exactly one match, adds the matching adapter to the current base URL to obtain a new base URL. Otherwise, returns an error.

  2. If the new base URL is a dataset resource, returns the new base URL as the successful format auto-detection result. Otherwise, continues at step 1 with the new base URL as the current base URL.
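
The two steps above can be sketched as a loop. The callbacks match_adapters and is_dataset are hypothetical stand-ins for the implementation-specific signature checks, and the max_depth guard is an added assumption that is not part of this proposal.

```python
from typing import Callable, Iterable


def auto_detect(
    base_url: str,
    match_adapters: Callable[[str], Iterable[str]],
    is_dataset: Callable[[str], bool],
    max_depth: int = 16,
) -> str:
    """Repeatedly append the single matching adapter until a dataset is found."""
    current = base_url
    for _ in range(max_depth):
        # Step 1: require exactly one matching adapter.
        matches = list(match_adapters(current))
        if len(matches) != 1:
            raise ValueError(f"expected 1 matching adapter, got {len(matches)}")
        current = current + "|" + matches[0]
        # Step 2: stop once the new base URL is a dataset resource.
        if is_dataset(current):
            return current
    raise ValueError("format auto-detection exceeded max_depth")
```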

Context-dependent URL pipeline interpretation

Implementations MAY interpret URL pipelines in a context-dependent way. For example, consider the following hypothetical APIs (which may not all be part of the same software):

  • open_array: opens an arbitrary array from a URL

    If passed a URL that resolves to a file or directory resource, performs format auto-detection to obtain a dataset resource.

    If format auto-detection fails or the resultant dataset resource is not an array, fails with an error.

    Otherwise, opens the resolved URL as an array.

  • open_zarr_array: opens a zarr array from a URL with format auto-detection

    Same as open_array, except that if the resolved dataset resource is not a zarr array, fails.

  • open_zarr_array_without_auto_detection: opens a zarr array without format auto-detection

    If passed a URL that resolves to a file resource, fails with an error.

    If passed a URL that resolves to a directory resource, appends the zarr: adapter and opens it.

    If passed a URL that resolves to a dataset resource, opens it and fails if it is not in zarr format.

  • open_kvstore: opens a key-value store file or directory from a URL

    If passed a URL that resolves to a file or directory resource, opens it.

    If passed a URL that resolves to a dataset resource, returns an error.

  • open_file: opens a file from a URL

    If passed a URL that resolves to a file resource, opens it.

    Otherwise, returns an error.
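
As an illustration, the URL handling of the hypothetical open_zarr_array_without_auto_detection API could look like the sketch below. It assumes the caller has already determined the resource kind, and it approximates "is in zarr format" by checking the scheme of the last adapter. Both are simplifying assumptions.

```python
ZARR_SCHEMES = {"zarr", "zarr2", "zarr3"}


def zarr_url_without_auto_detection(url: str, kind: str) -> str:
    """Return the URL pipeline to open, or raise if it cannot be a zarr array."""
    if kind == "file":
        raise ValueError("a file resource cannot be opened as a zarr array")
    if kind == "directory":
        # Append the zarr: adapter rather than detecting the format.
        return url + "|zarr:"
    # kind == "dataset": the last adapter names the data format.
    last_scheme = url.split("|")[-1].partition(":")[0]
    if last_scheme not in ZARR_SCHEMES:
        raise ValueError(f"{last_scheme!r} is not a zarr format")
    return url
```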

Relative URL pipeline syntax

Relative URL pipelines permit the locations of resources to be specified relative to some base URL pipeline that is specified separately, potentially traversing through one or more layers of adapter.

For example:

A zarr attribute may be defined that specifies the location of some other
related array using the relative URL pipeline syntax.

The referencing array may be located at
`s3://bucket/path/to/dataset.zip|zip:path/within/zip/|zarr3:`.  Using only a
relative path, it could specify the path of another array within
`s3://bucket/path/to/dataset.zip|zip:`, e.g. the relative path
`../another/array/` would refer to
`s3://bucket/path/to/dataset.zip|zip:path/another/array/`.  To refer to
`s3://bucket/path/of/another.zip|zip:other/array/`, the relative URL
pipeline `..:../of/another.zip|zip:other/array/|zarr3:` can be used.

The relative URL pipeline syntax has the following ABNF grammar:

relative_url_pipeline = ( absolute_path / relative_path ) *( "|" adapter )
                      / absolute_url_pipeline

A relative zarr URL is always resolved relative to a specified base URL pipeline. The initial absolute_path or relative_path applies to the path component of the inner-most (last) sub-URL. If the relative_path is the empty string, the path component of the inner-most sub-URL remains unchanged. After applying the absolute_path or relative_path to the existing absolute URL, any specified adapter sub-URLs are appended.

Note: An absolute_path overrides any existing path component of the inner-most sub-URL of the base URL pipeline, but is still relative to the scheme and other components of the inner-most sub-URL of the base URL pipeline that precede its path component, if any. The specific scheme of the sub-URL defines what portion, if any, constitutes the path component.

As with regular URL syntax, it is not permitted for the first component of relative_path to contain a colon (:), e.g. a:b, since that would be ambiguous with specifying the base URL scheme for an absolute URL. Instead, such a relative URL must be prefixed with ./, e.g. ./a:b.

Examples

    • Base URL: gs://bucket/path/to/
    • Relative URL: file.zip|zip:path/within/zip
    • Resolved URL: gs://bucket/path/to/file.zip|zip:path/within/zip

    • Base URL: gs://bucket/path/to/file.zip|zip:path/within/zip
    • Relative URL: ..:/path/to/other.zip|zip:path/in/other/zip
    • Resolved URL: gs://bucket/path/to/other.zip|zip:path/in/other/zip
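
Resolution of a relative URL pipeline that starts with a relative_path or absolute_path can be sketched as follows. The function name is hypothetical, and the sketch assumes that the path component of a sub-URL is whatever follows scheme://authority/ or scheme:, which ignores scheme-specific rules such as the @version prefix. It uses the standard library's urljoin with a dummy HTTP base to perform ordinary relative-path resolution. Relative pipelines that begin with a ..: adapter or with an absolute URL are not covered.

```python
from urllib.parse import urljoin

_DUMMY = "http://dummy/"  # placeholder base, used only for path resolution


def resolve_relative_path(base: str, relative: str) -> str:
    """Resolve a path-based relative URL pipeline against a base pipeline."""
    rel_path, *adapters = relative.split("|")
    *outer, innermost = base.split("|")

    # Split the inner-most sub-URL of the base into prefix and path.
    scheme, _, rest = innermost.partition(":")
    if rest.startswith("//"):
        authority, _, path = rest[2:].partition("/")
        prefix = scheme + "://" + authority + "/"
    elif rest.startswith("/"):
        prefix, path = scheme + ":/", rest[1:]
    else:
        prefix, path = scheme + ":", rest

    # An empty rel_path leaves the path unchanged.
    new_path = urljoin(_DUMMY + path, rel_path)[len(_DUMMY):]
    # Any adapters given in the relative pipeline are appended afterwards.
    return "|".join([*outer, prefix + new_path, *adapters])
```

Applied to the first example above, resolve_relative_path("gs://bucket/path/to/", "file.zip|zip:path/within/zip") reproduces the listed resolved URL.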

Rationale

This proposal takes into account several key considerations:

  • The URL syntax must support specifying:
    • The underlying key-value store;
    • The path within the key-value store of the root Zarr node;
    • Optionally, a path within the Zarr hierarchy starting from the root Zarr node. Note: Currently, as no storage transformers have been defined, the path to any Zarr node may be specified directly as a path within the underlying key-value store, making this additional path unnecessary.
  • The URL syntax must support nested key-value stores, like one or more layers of a ZIP archive within some other key-value store.
  • The URL syntax must be compatible with interactive completion as the user types.
  • The URL syntax must also be extensible for use with non-zarr formats.

The use of outer-to-inner order for the sub-URLs enables completion of both paths and sub-URL schemes as the user types.

The sub-URL delimiter of | was chosen because it is not a valid URL character, and therefore does not have any existing valid interpretation within URLs, and also is evocative of POSIX shell pipe syntax.

Implementations

  • TensorStore (https://google.github.io/tensorstore/spec.html#json-TensorStoreUrl)

    Format auto-detection is also implemented.

    The relative_url_pipeline syntax is not supported.

  • Neuroglancer (https://neuroglancer-docs.web.app/datasource/index.html#url-syntax)

    The http: and https: schemes automatically detect and support HTML and S3-compatible directory listing.

    Format auto-detection is also implemented.

    The relative_url_pipeline syntax is not supported.

  • zarr-python (https://github.com/zarr-developers/zarr-python/pull/3369)

    The relative_url_pipeline syntax is not supported.

fsspec

The fsspec library is widely used with the zarr-python library to access a variety of storage systems, and includes support for ZIP files and other nested stores.

Like this proposal, the fsspec URL syntax consists of a sequence of sub-URLs separated by a delimiter, but differs as follows:

  • fsspec uses a delimiter of :: rather than | as used in this proposal.
  • fsspec orders sub-URLs from innermost to outermost, which is the opposite order from what is proposed here.

The use of :: as a delimiter of the sub-URLs means that fsspec URLs may conform to the syntax of a normal URL, because :: is permitted within the path, query, and fragment components of a URL. This has both advantages and disadvantages:

  • An fsspec URL may be accepted by existing URL parsers/matchers not specifically designed for fsspec.
  • Because the interpretation of the :: delimiter within an fsspec URL differs from the normal interpretation within a URL, operations such as relative path resolution designed to operate on URLs generically may execute without errors on an fsspec URL but produce an incorrect result. In contrast, the use of | within this proposal ensures that the resultant syntax will not be confused with a valid regular URL, because | is not a permitted character within URLs.

The inner-to-outer order of sub-URLs in the fsspec URL syntax is not compatible with the usual operation of text completion as the user types. It is also opposite to the outer-to-inner order used for specifying paths within URLs.

Apache Commons VFS

The Apache Commons VFS is a Java library that provides capabilities similar to those of the fsspec Python library.

The Apache Commons VFS URL syntax specifies the base scheme and all of the sub-schemes, in inner-to-outer order, delimited by :, followed by the paths for each scheme, in outer-to-inner order, delimited by !.

For example:

  • gz:tar:file:///extra/data/tryVfs/archive.tar!/tardir/content.txt.gz!content.txt, which under this proposal would be file:///extra/data/tryVfs/archive.tar|tar:tardir/content.txt.gz|gz (assuming the existence of tar and gz adapter schemes).

As with the fsspec syntax, this URL syntax conforms to the standard URL syntax but has a different interpretation, which has both advantages and disadvantages.

Separating the adapter scheme from the adapter path makes the association of adapter and path less obvious, particularly if there is more than one adapter.

While the outer-to-inner order of the nested paths makes text completion of the paths feasible, the URL syntax is not readily compatible with completion of the nested schemes.

GDAL Virtual File Systems

https://gdal.org/user/virtual_file_systems.html

This uses a path syntax rather than a URL syntax. It supports chaining but makes assumptions about paths (e.g. that a zip file always ends with .zip).

For example:

  • /vsizip//vsicurl/ftp://user:password@example.com/foldername/file.zip/example.shp, which under this proposal would be ftp://user:password@example.com/foldername/file.zip|zip:example.shp.

Backward Compatibility

If Zarr implementations wish to add support for this proposed URL syntax to an existing generic “open” interface that already supports other syntax, such as a plain non-absolute file path or the fsspec URL syntax, there are potential ambiguities:

  • A relative file path such as file:/abc can also be interpreted as a URL. Presumably implementations would disambiguate this as a URL, which may (in rare cases) change the behavior of existing code.
  • A nested fsspec URL is unlikely to be a valid URL under this proposal, but a non-nested fsspec URL may well be a valid URL under this proposal. In many cases the interpretation will also be the same, but in some cases it may be subtly different.

Discussion

None yet.

This document has been placed in the public domain.