neilotoole/sq v0.47.0 on GitHub

This is a significant release, focused on improving I/O, responsiveness, and performance. The headline features are caching of ingested data for document sources such as CSV or Excel, and download caching for remote document sources. There are a lot of under-the-hood changes, so please open an issue if you encounter any weirdness.

Added

Long-running operations (such as data ingestion or file download) now display a progress bar. The progress bar is controlled by the new config options progress and progress.delay. You can also use the --no-progress flag to disable it.
- 👉 The progress bar is rendered on stderr and is always zapped from the terminal when command output begins. It won't corrupt the output.
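To illustrate the behavior described above, here is a minimal Python sketch of a progress indicator that stays hidden until a delay has elapsed, writes only to stderr, and erases itself before real output begins. It is not sq's implementation; the class name DelayedProgress and its parameters are hypothetical, chosen for illustration.

```python
import sys
import time


class DelayedProgress:
    """Progress indicator that stays hidden until `delay` seconds have passed.

    Hypothetical illustration only; not sq's actual implementation.
    """

    def __init__(self, delay, stream=None, clock=time.monotonic):
        self.delay = delay
        # Default to stderr so that progress never mixes with stdout output.
        self.stream = stream if stream is not None else sys.stderr
        self.clock = clock
        self.start = clock()
        self.visible = False

    def update(self, message):
        # Render nothing for short operations, so quick commands stay quiet.
        if self.clock() - self.start < self.delay:
            return
        # Carriage return plus the ANSI "erase line" code redraws in place.
        self.stream.write("\r\x1b[2K" + message)
        self.stream.flush()
        self.visible = True

    def clear(self):
        # Erase the indicator before the command's real output begins.
        if self.visible:
            self.stream.write("\r\x1b[2K")
            self.stream.flush()
            self.visible = False
```

Because the indicator goes to stderr, redirecting or piping stdout captures only the command's output.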
#307: Ingested document sources (such as CSV or Excel) now make use of an ingest cache DB. Previously, ingestion of document source data occurred on each sq command. It is now a one-time cost: subsequent use of the document source utilizes the cache DB. When the source document changes, the ingest cache DB is invalidated and the source is ingested again. This is a significantly improved experience for large document sources.
There are several new commands to interact with the cache (although you shouldn't need to):
- sq cache enable and sq cache disable control cache usage. Alternatively, use the new ingest.cache config option.
- sq cache clear clears the cache.
- sq cache location prints the cache location on disk.
- sq cache stat shows stats about the cache.
- sq cache tree shows a tree view of the cache.
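The ingest-once, invalidate-on-change behavior can be sketched as follows. This is a conceptual Python illustration, not sq's implementation: it assumes the source document is fingerprinted with a SHA-256 content hash, and it uses an in-memory dict in place of a cache DB. The names fingerprint and IngestCache are hypothetical.

```python
import hashlib


def fingerprint(path):
    """Return a SHA-256 hex digest of the file's contents.

    Assumption for illustration: how sq detects source changes is not
    stated in these notes.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


class IngestCache:
    """Hypothetical in-memory stand-in for a per-source ingest cache DB."""

    def __init__(self, ingest):
        self.ingest = ingest      # callable: path -> ingested data
        self.entries = {}         # path -> (fingerprint, ingested data)
        self.ingest_count = 0     # how many times ingestion actually ran

    def get(self, path):
        current = fingerprint(path)
        cached = self.entries.get(path)
        if cached is not None and cached[0] == current:
            return cached[1]      # cache hit: no re-ingestion
        # Missing or stale entry: ingest again and replace the cached copy.
        self.ingest_count += 1
        data = self.ingest(path)
        self.entries[path] = (current, data)
        return data
```

Repeated calls to get for an unchanged file run the ingest callable only once; editing the file changes its fingerprint and triggers a fresh ingest on the next call.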
#24: The download mechanism for remote document sources (e.g. a CSV file at https://sq.io/testdata/actor.csv) has been completely overhauled. Previously, sq would re-download the remote file on every command. Now, the remote file is downloaded and cached locally. Subsequent sq invocations check for staleness of the cached download, and re-download if necessary.
As part of the download revamp, new config options have been introduced:
- http.request.timeout is the timeout for the initial response from the server, and http.response.timeout is the timeout for reading the entire response body. We separate these two timeouts because it's possible that the server responds quickly, but then for a large file, the download takes too long.
- https.insecure-skip-verify controls whether HTTPS connections verify the server's certificate. This is useful for remote files served with a self-signed certificate.
- download.cache controls whether remote files are cached locally.
- download.refresh.ok-on-err controls whether sq should continue with a stale cached download if an error occurred while trying to refresh the download. This is a sort of "Airplane Mode" for remote document sources: sq continues with the cached download when the network is unavailable.
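The interaction between the download cache, the staleness check, and the ok-on-err fallback can be sketched in Python as below. This is a simplified illustration under stated assumptions, not sq's implementation: the cache is a plain dict, and fetch and is_stale are hypothetical callables supplied by the caller (in practice a staleness check may itself involve the network).

```python
class DownloadError(Exception):
    """Raised when a remote file cannot be fetched or refreshed."""


def get_download(url, cache, fetch, is_stale, ok_on_err=True):
    """Return the bytes for `url`, preferring a fresh cached copy.

    All parameters are hypothetical stand-ins for illustration:
      cache:     dict mapping url -> bytes (in place of an on-disk cache)
      fetch:     callable(url) -> bytes; raises DownloadError on failure
      is_stale:  callable(url) -> bool; True if the cached copy needs a refresh
      ok_on_err: if True, fall back to a stale cached copy when refresh fails
    """
    cached = cache.get(url)
    if cached is not None and not is_stale(url):
        return cached              # fresh cache hit: no download needed
    try:
        body = fetch(url)
    except DownloadError:
        if cached is not None and ok_on_err:
            return cached          # "Airplane Mode": continue with stale copy
        raise                      # nothing cached, or fallback disabled
    cache[url] = body              # store the refreshed download
    return body
```

With ok_on_err disabled, a failed refresh surfaces as an error even when a stale copy exists, which is the stricter choice when freshness matters more than availability.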
Two more config options were introduced as part of the above work:
- cache.lock.timeout controls the time that sq will wait for a lock on the cache DB. The cache lock prevents multiple concurrently running sq commands from stepping on each other.
- Similarly, config.lock.timeout controls the timeout for acquiring the (newly-introduced) lock on sq's config file. This helps prevent issues with multiple sq processes mutating the config concurrently.
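As a rough picture of what a lock timeout means, the following Python sketch acquires a lock by exclusively creating a lock file, retrying until a timeout elapses. It is an assumption-laden illustration, not sq's locking mechanism: the function names, the lock-file approach, and the polling interval are all hypothetical, and stale locks left by crashed processes are not handled.

```python
import os
import time


def acquire_lock(path, timeout, poll_interval=0.05):
    """Create `path` exclusively, retrying until `timeout` seconds elapse.

    Hypothetical illustration; raises TimeoutError if the lock stays held.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"could not acquire lock {path!r} within {timeout}s"
                )
            time.sleep(poll_interval)


def release_lock(path):
    """Release the lock by removing the lock file."""
    os.remove(path)
```

A longer timeout tolerates slower concurrent commands at the cost of waiting longer before reporting contention.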
sq's own logs were previously output in JSON format. Now there's a new log.format config option that permits setting the log format to json or text. The text format is more human-friendly, and is now the default.
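The difference between the two formats can be pictured with a small Python sketch that selects a formatter by name. The field names and text layout here are assumptions for illustration only and do not reflect sq's actual log output.

```python
import json
import logging


class JSONFormatter(logging.Formatter):
    """Render each record as one JSON object (field names are hypothetical)."""

    def format(self, record):
        return json.dumps({"level": record.levelname, "msg": record.getMessage()})


def make_formatter(log_format):
    """Return a formatter for "json" or "text"; reject anything else."""
    if log_format == "json":
        return JSONFormatter()
    if log_format == "text":
        # Plain "LEVEL message" layout: easier to scan by eye than JSON.
        return logging.Formatter("%(levelname)s %(message)s")
    raise ValueError(f"unknown log format: {log_format!r}")
```

JSON remains the better fit when logs are consumed by other tools; text is easier to read directly in a terminal.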

Changed

Ingestion performance for json and jsonl sources has been significantly improved.

Fixed

Opening a DB connection now correctly honors conn.open-timeout.