ZEP 8 — URL pipeline syntax
Author: Jeremy Maitin-Shepard <jbms@google.com>, Google Research
Status: Draft
Type: Specification
Created: 2023-09-07
Abstract
This proposal defines a URL pipeline syntax for specifying how to locate a zarr node, a plain key-value store, or other resources.
Motivation and Scope
A URL syntax for zarr nodes that is common across multiple zarr implementations enables users to more easily share dataset locations between different tools.
While in simple cases it is sufficient to specify the zarr node location using an existing, well-established URL scheme like file or http, or existing but less-standard URL schemes like s3 and gs, for nested storage mechanisms like the ZIP format, and for nodes within a zarr hierarchy, there is no existing established URL syntax.
Additionally, for implementations that support other data formats than zarr v3, it may be necessary to also indicate the data format as part of the URL syntax.
This proposal defines a new absolute and relative URL syntax that addresses this need.
Usage and Impact
Zarr implementations that support this URL syntax are expected to provide an API for opening a Zarr node at a given URL, and for obtaining the URL corresponding to an open Zarr node.
Using these APIs, data can easily be shared between different Zarr implementations that support the proposed URL syntax.
Implementations may also optionally support this syntax for uses beyond Zarr, such as specifying a bare key-value store or specifying data in non-zarr formats.
Detailed description
This ZEP defines a URL pipeline syntax that may be optionally supported by Zarr implementations in order to allow the location of a zarr array or group to be specified in a convenient, implementation-independent way.
More precisely, it defines a URL syntax that may specify:
- A root key-value store, such as a path within an S3 bucket or a path on the local filesystem, e.g. s3://bucket/path;
- A nested key-value store, such as a sub-directory within a ZIP file within an S3 bucket, e.g. s3://bucket/path/to/archive.zip|zip:path/within/zip/;
- A zarr v2 or v3 node within a zarr v3 hierarchy within a particular key-value store, e.g. s3://bucket/path/to/zarr/|zarr3:path/within/hierarchy/;
- A non-zarr dataset within a zarr v3 hierarchy within a particular key-value store, e.g. s3://bucket/path/to/image.tiff|tiff: or s3://bucket/path/to/dataset.n5/|n5:path/within/hierarchy/.
Resource kinds
The proposed URL pipeline syntax may refer to several different kinds of resources:
- file: Single file within a key-value store, no specific data format.
- directory: Single directory within a key-value store, no specific data format.
- dataset: An array, group, or other dataset with a defined format, e.g. a zarr array or group.
Depending on the URL schemes involved, in some cases the resource kind can be determined syntactically from the URL alone, while in other cases it can only be resolved by actually accessing the resource.
Absolute URL syntax
The proposed “zarr URL pipeline syntax” has the following ABNF grammar:
absolute_url_pipeline = root_url *( "|" adapter )
where a root_url specifies an absolute resource location, such as file:/path/to/local/file, and adapter specifies a nested resource using a specified protocol, such as zip:path/within/zip. The root_url and each of the adapter portions are considered “sub-URLs”.
For a given adapter, the sequence of root_url and prior adapter sub-URLs is called the “base URL”.
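As a non-normative illustration, the grammar above can be applied by splitting a pipeline on "|" and separating each sub-URL into its scheme and the remainder. The names SubUrl and split_pipeline are hypothetical and are not part of this proposal. The sketch relies on "|" never appearing unescaped inside a sub-URL.

```python
from typing import NamedTuple


class SubUrl(NamedTuple):
    scheme: str  # text before the first ":"
    rest: str    # everything after "scheme:" (may be empty)


def split_pipeline(url: str) -> list[SubUrl]:
    """Split an absolute URL pipeline into its root_url and adapter sub-URLs."""
    sub_urls = []
    for part in url.split("|"):
        # An adapter written without ":" (e.g. "zip") yields an empty remainder.
        scheme, _, rest = part.partition(":")
        if not scheme:
            raise ValueError(f"sub-URL {part!r} has no scheme")
        sub_urls.append(SubUrl(scheme, rest))
    return sub_urls


def base_url_of(url: str, adapter_index: int) -> str:
    """Return the base URL for the adapter at the given position (1 = first adapter)."""
    return "|".join(url.split("|")[:adapter_index])
```

For the pipeline s3://bucket/path/to/archive.zip|zip:path/within/zip/, the first element is the root_url and the second is the zip adapter, whose base URL is s3://bucket/path/to/archive.zip.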
The following root_url schemes are defined:
- file: as defined by RFC8089. file:/absolute/path and file:///absolute/path are supported.
  Implementations that have a current working directory MAY support the non-standard extension file:relative/path, where relative/path is resolved relative to the current working directory.
  Implementations SHOULD NOT support file://relative/path since that is ambiguous with the file://hostname/path syntax defined by RFC8089.
  If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory. Otherwise, the resultant resource kind may be either file or directory.
- http: and https: for generic HTTP servers
  In general only single-key read operations (corresponding to GET requests) are supported.
  Implementations may choose to heuristically support list operations by detecting server support for HTML directory listings and/or S3-compatible listing.
  The resultant resource kind may be either file or directory.
- s3://bucket/path/within/bucket for AWS S3
  The endpoint, appropriate credentials, and bucket region (for non-anonymous access) must be determined automatically.
  If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory. Otherwise, the resultant resource kind may be either file or directory.
- s3+http: and s3+https: for S3-compatible servers
  This URL syntax is AMBIGUOUSLY either:
  - s3+https://endpoint/path/within/bucket, assuming endpoint corresponds to a single bucket, e.g. s3+https://mybucket.s3.amazonaws.com/bucket/path; or
  - s3+https://endpoint/bucket/path/within/bucket, assuming endpoint corresponds to multiple buckets, e.g. s3+https://s3.amazonaws.com/bucket/path.
  The appropriate credentials (and bucket region for non-anonymous access) must be determined automatically.
  For most operations, e.g. to GET a single object, this ambiguity does not matter, but for LIST operations the implementation must determine the path to the root of the bucket, e.g. by initially attempting the LIST operation with both possible paths and checking which one succeeds. For subsequent operations on the same endpoint the result can be cached to avoid overhead.
  If the path is empty or ends with /, the resource kind is known syntactically to be a directory. Otherwise, the resultant resource kind may be either file or directory.
  For the purpose of relative URLs, the path component includes the bucket/ prefix even if the endpoint is in fact a multi-bucket endpoint.
- gs://bucket/path/within/bucket for Google Cloud Storage (GCS)
  If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory. Otherwise, the resultant resource kind may be either file or directory.
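The trailing-slash rule that recurs in the root_url schemes above can be sketched as follows. This is a minimal, non-normative sketch: the function name root_resource_kind and the return value "file-or-directory" are hypothetical, and the path is taken to be whatever urllib.parse.urlsplit reports, which is an assumption about how an implementation isolates the path component.

```python
from urllib.parse import urlsplit

# Root schemes for which an empty path or a trailing "/" implies a directory.
# http: and https: are deliberately absent: their kind is never known syntactically.
_SLASH_MEANS_DIRECTORY = {"file", "s3", "s3+http", "s3+https", "gs"}


def root_resource_kind(root_url: str) -> str:
    """Return "directory" when the kind is known syntactically, else "file-or-directory"."""
    parts = urlsplit(root_url)
    if parts.scheme in _SLASH_MEANS_DIRECTORY:
        if parts.path == "" or parts.path.endswith("/"):
            return "directory"
    # The kind can only be resolved by actually accessing the resource.
    return "file-or-directory"
```

A result of "file-or-directory" means the implementation has to access the resource to decide between the two kinds.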
The following adapter URL schemes are defined:
- zip:path/within/zip for ZIP archive format
  The base URL must refer to a file resource (which is expected to be in ZIP format).
  If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory. Otherwise, the resultant resource kind may be either file or directory.
- ocdbt: for OCDBT
  The base URL must refer to a directory resource (which is expected to be in OCDBT format).
  The URL syntax is ocdbt:path or ocdbt:@version/path, where version is either v123 or 2025-01-01T01:23:45.678Z.
  For example:
  - file:///tmp/dataset.ocdbt/|ocdbt:@v1/path/within/database
  - file:///tmp/dataset.ocdbt/|ocdbt:path/within/database
  - file:///tmp/dataset.ocdbt/|ocdbt:@2025-01-01T01:23:45.678Z/path/within/database
  While @ is normally allowed within the path component of a URL, with the ocdbt: URL scheme, if the path starts with @ it must be percent-encoded as %40 to avoid ambiguity with the @version component. For example, a path of @abc can be specified as ocdbt:%40abc.
  The resultant resource kind may be either file or directory.
  For the purpose of relative URLs, the path component does not include the @version/ prefix if present.
- icechunk: for Icechunk
  The base URL must refer to a directory resource (which is expected to contain an Icechunk database).
  The following syntaxes are supported:
  - icechunk:path/to/node/
  - icechunk:@branch.BRANCH/path/to/node/
  - icechunk:@tag.TAG/path/to/node/
  - icechunk:@SNAPSHOT/path/to/node/
  For example:
  - file:///path/to/repo.zarr.icechunk/|icechunk:|zarr3:path/to/array/
  - file:///path/to/repo.zarr.icechunk/|icechunk:@branch.other/|zarr3:path/to/array/
  - file:///path/to/repo.zarr.icechunk/|icechunk:@tag.v5/|zarr3:path/to/array/
  - file:///path/to/repo.zarr.icechunk/|icechunk:@4N0217AZA4VNPYD0HR0G/|zarr3:path/to/array/
  While @ is normally allowed within the path component of a URL, with the icechunk: URL scheme, if the path starts with @ it must be percent-encoded as %40 to avoid ambiguity with the @version component. For example, a path of @abc/ can be specified as icechunk:%40abc/.
  If the path is empty or ends with /, the resultant resource kind is known syntactically to be a directory. Otherwise, the resultant resource kind may be either file or directory.
  For the purpose of relative URLs, the path component does not include the @version/ prefix if present.
- zarr3:path/within/hierarchy/ to specify a zarr v3 node
  The base URL must refer to a directory resource, which is expected to contain a zarr v3 hierarchy.
  The resultant resource kind is always dataset.
- zarr2:path/within/hierarchy/ to specify a zarr v2 node
  The base URL must refer to a directory resource, which is expected to contain a zarr v2 hierarchy.
  The resultant resource kind is always dataset.
- zarr:path/within/hierarchy/ to specify a zarr v2 or v3 node
  The base URL must refer to a directory resource, which is expected to contain a zarr v2/v3 hierarchy.
  The implementation must determine the zarr format version automatically.
  The resultant resource kind is always dataset.
- n5:path/within/hierarchy to specify an N5 group or array
  The base URL must refer to a directory resource, which is expected to contain an N5 hierarchy.
  Because N5 is defined to inherit attributes from ancestor groups in the hierarchy, it is recommended that the base URL refers to the root of the n5 hierarchy, and any path within the hierarchy be specified through the n5: scheme.
  The resultant resource kind is always dataset.
- gzip:, zstd: for transparent access to compressed files
  The base URL must refer to a file resource, which is expected to be in the format indicated by the URL scheme.
  Currently no path is supported.
  The resultant resource kind is always file.
  For example:
  - gs://bucket/path/to/data.gz|gzip:
  - gs://bucket/path/to/data.zstd|zstd:
- byte-range:start-end for specifying a byte range within a file
  The base URL must refer to a file resource, which is expected to support byte range access.
  The start and end components of the URL specify byte offsets in base 10. The start bound is inclusive while the end bound is exclusive; the total length is end - start.
  The resultant resource kind is always file.
  For example:
  - gs://bucket/path/to/data|byte-range:1000-2000
- tiff:, jpeg:, png:, bmp:, avif:, webp:
  The base URL must refer to a file resource, which is expected to contain an image in the format indicated by the URL scheme.
  Currently no path is supported.
  The resource kind is always dataset.
- neuroglancer-precomputed: for Neuroglancer precomputed
  The base URL must refer to a directory resource, which is expected to contain a neuroglancer precomputed dataset.
  Currently no path component is allowed by the neuroglancer-precomputed URL scheme.
  The resource kind is always dataset.
- json:path
  The base URL must refer to a file resource, which is expected to contain an encoded JSON document.
  The path is in JSON pointer syntax and indicates a sub-value within the JSON document. An empty path corresponds to the entire JSON document.
  The resource kind is always dataset, specifically a rank-0 array with data type json.
- ..:path/within/outer/sub-URL and ..:/path/within/outer/sub-URL may be used to traverse out from the prior adapter.
  This scheme is primarily useful within the relative URL pipeline syntax defined below.
  Any .. adapters are resolved in order. The presence of a .. adapter causes the prior adapter to be discarded. The path component is interpreted relative to the path of the sub-URL immediately prior to the discarded adapter sub-URL.
  It is an error if there are no remaining adapter sub-URLs when resolving the .. adapter.
  For security, implementations SHOULD place limits on where this scheme is permitted.
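As one illustration of the adapter rules above, the @version/ prefix handling described for the ocdbt: and icechunk: schemes could be sketched as below. The function name split_version_prefix is hypothetical, and the sketch assumes that the version component runs up to the first "/" and that only the path is percent-decoded; neither detail is stated normatively in this proposal.

```python
from typing import Optional
from urllib.parse import unquote


def split_version_prefix(rest: str) -> tuple[Optional[str], str]:
    """Split the text after "ocdbt:" or "icechunk:" into (version, path).

    A leading unencoded "@" introduces the version component; a path that
    itself starts with "@" must arrive percent-encoded as "%40".
    """
    if rest.startswith("@"):
        version, _, path = rest[1:].partition("/")
        return version, unquote(path)
    # No version prefix: the whole remainder is the path.
    return None, unquote(rest)
```

For relative URL resolution, only the returned path would take part, in line with the rule that the path component excludes the @version/ prefix.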
If the adapter URL would otherwise consist of just the scheme followed by “:”, it is permitted to omit the final “:”. For example:
https://example.com/path/to/archive.zip|zip|zarr3 is equivalent to https://example.com/path/to/archive.zip|zip:|zarr3:.
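An implementation might normalize this shorthand before further parsing. The following is a minimal sketch with a hypothetical function name; it assumes that an adapter sub-URL with no ":" at all is a bare scheme.

```python
def expand_adapter_shorthand(url: str) -> str:
    """Append the omitted ":" to adapters that consist of just a scheme."""
    root, *adapters = url.split("|")
    # The root_url is left untouched; only adapter sub-URLs may use the shorthand.
    expanded = [a if ":" in a else a + ":" for a in adapters]
    return "|".join([root, *expanded])
```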
It is expected that additional URL schemes may be standardized in the future.
Examples
Examples:
- https://server.example.com:1234/path/to/array
  Specifies a normal HTTPS URL.
- s3://bucket/path/to/file
  Specifies:
  - within the AWS S3 bucket named bucket,
  - the path path/to/file.
- gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|zip:path/to/zarr/hierarchy|zarr3:path/to/array
  Specifies:
  - within the GCS bucket named bucket,
  - within the ZIP file at the path path/to/outer.zip,
  - within the ZIP file at the path path/to/inner.zip,
  - within the Zarr v3 hierarchy at the path path/to/zarr/hierarchy/,
  - the Zarr v3 node at the path path/to/array/.
- gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|..:other/zarr/hierarchy|zarr3:path/to/array
  Normalizes to:
  gs://bucket/path/to/other/zarr/hierarchy/|zarr3:path/to/array
Format auto-detection
Implementations MAY support format auto-detection for certain adapter URL schemes.
For a given base URL specifying a file or directory resource, the implementation determines a set of matching adapter URLs:
- For a base file resource, this is typically done by reading a prefix and/or suffix of the file in order to match expected signatures;
- For a base directory resource, this is typically done by checking for the presence of certain files.
Given a base URL specifying a file or directory resource, to obtain a dataset resource using format auto-detection, the implementation:
1. Determines the set of matching adapter URLs for the current base URL. If there is exactly one match, add the matching adapter to the current base URL to obtain a new base URL. Otherwise, return an error.
2. If the new base URL is a dataset resource, return the new base URL as the successful format auto-detection result. Otherwise, continue back at step 1 with the new base URL as the current base URL.
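The two steps above amount to a loop, sketched here in a non-normative way. The matcher callback, the demo_matcher stand-in, and the max_steps bound are all assumptions made for illustration; a real matcher would inspect file signatures or directory contents as described earlier.

```python
from typing import Callable

# Hypothetical matcher: given a base URL and its resource kind, return one
# (adapter, resulting_kind) pair for every format that matches.
Matcher = Callable[[str, str], list[tuple[str, str]]]


def auto_detect(base_url: str, kind: str, matcher: Matcher, max_steps: int = 8) -> str:
    """Repeatedly append the single matching adapter until a dataset is reached."""
    for _ in range(max_steps):  # bound is an added safeguard, not part of the proposal
        matches = matcher(base_url, kind)
        if len(matches) != 1:
            raise ValueError(f"expected exactly one match for {base_url!r}, got {len(matches)}")
        adapter, kind = matches[0]
        base_url = f"{base_url}|{adapter}"
        if kind == "dataset":
            return base_url
    raise ValueError("format auto-detection did not reach a dataset resource")


def demo_matcher(url: str, kind: str) -> list[tuple[str, str]]:
    """Stand-in matcher that pretends every .zip file holds a zarr v3 hierarchy."""
    if kind == "file" and url.endswith(".zip"):
        return [("zip:", "directory")]
    if kind == "directory" and url.endswith("|zip:"):
        return [("zarr3:", "dataset")]
    return []
```

With the stand-in matcher, a file resource at gs://bucket/data.zip is first wrapped by the zip: adapter and then by zarr3:, at which point the loop stops because the kind is dataset.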
Context-dependent URL pipeline interpretation
Implementations MAY interpret URL pipelines in a context-dependent way. For example, consider the following hypothetical APIs (which may not all be part of the same software):
- open_array: opens an arbitrary array from a URL
  If passed a URL that resolves to a file or directory resource, performs format auto-detection to obtain a dataset resource.
  If format auto-detection fails or the resultant dataset resource is not an array, fails with an error.
  Otherwise, opens the resolved URL as an array.
- open_zarr_array: opens a zarr array from a URL with format auto-detection
  Same as open_array, except that if the resolved dataset resource is not a zarr array, fails.
- open_zarr_array_without_auto_detection: opens a zarr array without format auto-detection
  If passed a URL that resolves to a file resource, fails with an error.
  If passed a URL that resolves to a directory resource, append the zarr: adapter and open it.
  If passed a URL that resolves to a dataset resource, open it and fail if it is not in zarr format.
- open_kvstore: opens a key-value store file or directory from a URL
  If passed a URL that resolves to a file or directory resource, opens it.
  If passed a URL that resolves to a dataset resource, returns an error.
- open_file: opens a file from a URL
  If passed a URL that resolves to a file resource, opens it.
  Otherwise, returns an error.
Relative URL pipeline syntax
Relative URL pipelines permit the locations of resources to be specified relative to some base URL pipeline that is specified separately, potentially traversing through one or more layers of adapter.
For example:
A zarr attribute may be defined that specifies the location of some other
related array using the relative URL pipeline syntax.
The referencing array may be located at
`s3://bucket/path/to/dataset.zip|zip:path/within/zip/|zarr3:`. Using only a
relative path, it could specify the path of another array within
`s3://bucket/path/to/dataset.zip|zip:`, e.g. the relative path
`../another/array/` would refer to
`s3://bucket/path/to/dataset.zip|zip:path/another/array/`. To refer to
`s3://bucket/path/of/another.zip|zip:other/array/`, the relative URL
pipeline `..:../of/another.zip|zip:other/array/|zarr3:` can be used.
The relative URL pipeline syntax has the following ABNF grammar:
relative_url_pipeline = ( absolute_path / relative_path ) *( "|" adapter )
/ absolute_url_pipeline
A relative zarr URL is always resolved relative to a specified base URL pipeline. The initial absolute_path or relative_path applies to the path component of the inner-most (last) sub-URL. If the relative_path is the empty string, the path component of the inner-most sub-URL remains unchanged. After applying the absolute_path or relative_path to the existing absolute URL, any specified adapter sub-URLs are appended.
Note: An absolute_path overrides any existing path component of the inner-most sub-URL of the base URL pipeline, but is still relative to the scheme and other components of the inner-most sub-URL of the base URL pipeline that precede its path component, if any. The specific scheme of the sub-URL defines what portion, if any, constitutes the path component.
As with regular URL syntax, it is not permitted for the first component of relative_path to contain a colon (:), e.g. a:b, since that would be ambiguous with specifying the base URL scheme for an absolute URL. Instead, such a relative URL must be prefixed with ./, e.g. ./a:b.
Examples
- Base URL: gs://bucket/path/to/
  Relative URL: file.zip|zip:path/within/zip
  Resolved URL: gs://bucket/path/to/file.zip|zip:path/within/zip
- Base URL: gs://bucket/path/to/file.zip|zip:path/within/zip
  Relative URL: ..:/path/to/other.zip|zip:path/in/other/zip
  Resolved URL: gs://bucket/path/to/other.zip|zip:path/in/other/zip
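The two resolutions above can be reproduced with the following non-normative sketch. All function names are hypothetical. The sketch makes several simplifying assumptions: the path of a sub-URL is everything after the scheme and any //authority, @version/ prefixes and percent-encoding are ignored, empty path segments are collapsed, and a relative URL that begins with the .. scheme is treated as an adapter appended to the base URL pipeline.

```python
def _split_path(sub_url):
    """Return (prefix, path), where prefix is the scheme plus any //authority."""
    scheme, _, rest = sub_url.partition(":")
    if rest.startswith("//"):
        authority, _, path = rest[2:].partition("/")
        return f"{scheme}://{authority}", "/" + path
    return f"{scheme}:", rest


def _normalize(path):
    """Remove "." and ".." segments (simplified: empty segments are dropped too)."""
    keep_slash = path.endswith(("/", "/.", "/..")) or path in (".", "..")
    out = []
    for segment in path.split("/"):
        if segment == "..":
            if out:
                out.pop()
        elif segment not in ("", "."):
            out.append(segment)
    result = ("/" if path.startswith("/") else "") + "/".join(out)
    return result + "/" if keep_slash and out else result


def _apply_path(sub_url, ref):
    """Apply an absolute_path or relative_path to the path of one sub-URL."""
    if ref == "":
        return sub_url  # an empty relative_path leaves the path unchanged
    prefix, path = _split_path(sub_url)
    merged = ref if ref.startswith("/") else path[: path.rfind("/") + 1] + ref
    return prefix + _normalize(merged)


def resolve(base, relative):
    """Resolve a relative URL pipeline against a base URL pipeline."""
    subs = base.split("|")
    first, *adapters = relative.split("|")
    if ":" not in first.split("/")[0]:
        subs[-1] = _apply_path(subs[-1], first)  # plain path
    elif first.partition(":")[0] == "..":
        adapters.insert(0, first)                # leading ".." adapter
    else:
        subs = [first]                           # absolute URL pipeline
    for adapter in adapters:
        name, _, path = adapter.partition(":")
        if name == "..":
            if len(subs) < 2:
                raise ValueError("no adapter sub-URL left for '..' to discard")
            subs.pop()
            subs[-1] = _apply_path(subs[-1], path)
        else:
            subs.append(adapter)
    return "|".join(subs)
```

Because each scheme defines its own path component, a complete implementation would dispatch on the scheme inside _split_path rather than use a single generic rule.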
Rationale
This proposal takes into account several key considerations:
- The URL syntax must support specifying:
  - The underlying key-value store;
  - The path within the key-value store of the root Zarr node;
  - Optionally, a path within the Zarr hierarchy starting from the root Zarr node. Note: Currently, as no storage transformers have been defined, the path to any Zarr node may be specified directly as a path within the underlying key-value store, making this additional path unnecessary.
- The URL syntax must support nested key-value stores, like one or more layers of a ZIP archive within some other key-value store.
- The URL syntax must be compatible with interactive completion as the user types.
- The URL syntax must also be extensible for use with non-zarr formats.
The use of outer-to-inner order for the sub-URLs enables completion of both paths and sub-URL schemes as the user types.
The sub-URL delimiter of | was chosen because it is not a valid URL character, and therefore does not have any existing valid interpretation within URLs, and also is evocative of POSIX shell pipe syntax.
Implementations
- TensorStore (https://google.github.io/tensorstore/spec.html#json-TensorStoreUrl)
  Format auto-detection is also implemented.
  The relative_url_pipeline syntax is not supported.
- Neuroglancer (https://neuroglancer-docs.web.app/datasource/index.html#url-syntax)
  The http: and https: schemes automatically detect and support HTML and S3-compatible directory listing.
  Format auto-detection is also implemented.
  The relative_url_pipeline syntax is not supported.
- zarr-python (https://github.com/zarr-developers/zarr-python/pull/3369)
  The relative_url_pipeline syntax is not supported.
Related Work
fsspec
The fsspec library is widely used with the zarr-python library to access a variety of storage systems, and includes support for ZIP files and other nested stores.
Like this proposal, the fsspec URL syntax consists of a sequence of sub-URLs separated by a delimiter, but differs as follows:
- fsspec uses a delimiter of :: rather than | as used in this proposal.
- fsspec orders sub-URLs from innermost to outermost, which is the opposite order from what is proposed here.
The use of :: as a delimiter of the sub-URLs means that fsspec URLs may conform to the syntax of a normal URL, because :: is permitted within the path, query, and fragment components of a URL. This has both advantages and disadvantages:
- An fsspec URL may be accepted by existing URL parsers/matchers not specifically designed for fsspec.
- Because the interpretation of the :: delimiter within an fsspec URL differs from the normal interpretation within a URL, operations such as relative path resolution designed to operate on URLs generically may execute without errors on an fsspec URL but produce an incorrect result. In contrast, the use of | within this proposal ensures that the resultant syntax will not be confused with a valid regular URL, because | is not a permitted character within URLs.
The inner-to-outer order of sub-URLs in the fsspec URL syntax is not compatible with the usual operation of text completion as the user types. It is also opposite to the outer-to-inner order used for specifying paths within URLs.
Apache Commons VFS
The Apache Commons VFS is a Java library that provides capabilities similar to those of the fsspec Python library.
The Apache Commons VFS URL syntax specifies the base scheme and all of the sub-schemes, in inner to outer order, delimited by :, followed by the paths for each scheme, in outer-to-inner order, delimited by !.
For example:
gz:tar:file:///extra/data/tryVfs/archive.tar!/tardir/content.txt.gz!content.txt, which under this proposal would be file:///extra/data/tryVfs/archive.tar|tar:tardir/content.txt.gz|gz (assuming the existence of tar and gz adapter schemes).
As with the fsspec syntax, this URL syntax conforms to the standard URL syntax but has a different interpretation, which has both advantages and disadvantages.
Separating the adapter scheme from the adapter path makes the association of adapter and path less obvious, particularly if there is more than one adapter.
While the outer-to-inner order of the nested paths makes text completion of the paths feasible, the URL syntax is not readily compatible with completion of the nested schemes.
GDAL Virtual File Systems
https://gdal.org/user/virtual_file_systems.html
This uses a path syntax rather than a URL syntax. It supports chaining but makes assumptions about paths (e.g. that a zip file always ends with .zip).
For example:
/vsizip//vsicurl/ftp://user:password@example.com/foldername/file.zip/example.shp, which under this proposal would be ftp://user:password@example.com/foldername/file.zip|zip:example.shp.
Backward Compatibility
If Zarr implementations wish to add support for this proposed URL syntax to an existing generic “open” interface that already supports other syntax, such as a plain non-absolute file path or the fsspec URL syntax, there are potential ambiguities:
- A relative file path such as file:/abc can also be interpreted as a URL. Presumably implementations would disambiguate this as a URL, which may (in rare cases) change the behavior of existing code.
- A nested fsspec URL is unlikely to be a valid URL under this proposal, but a non-nested fsspec URL may well be a valid URL under this proposal. In many cases the interpretation will also be the same, but in some cases it may be subtly different.
Discussion
None yet.
Copyright
This document has been placed in the public domain.