Compression Benchmarks

RaQuet achieves significant compression compared to source formats, especially for ASCII-based rasters.

| Dataset | Source Format | Source Size | RaQuet Size | Reduction |
|---|---|---|---|---|
| World Elevation | AAIGrid | 3.2 GB | 805 MB | 75% |
| World Solar PVOUT | AAIGrid | 2.8 GB | 255 MB | 91% |
| CFSR SST (time series) | NetCDF | 854 MB | 75 MB | 91% |
| TCI (Sentinel-2 RGB) | GeoTIFF | 224 MB | 256 MB | Similar* |
| Spain Solar GHI | GeoTIFF | 15 MB | | |

*TCI includes full pyramid (zoom 0-12), adding overhead vs. single-resolution source.

Why the Compression?

  1. Parquet columnar storage — Efficient encoding for repetitive data patterns
  2. Gzip block compression — Each tile’s pixel data is gzip-compressed
  3. QUADBIN spatial indexing — Compact tile addressing (single int64 per tile)
  4. No redundant headers — Single metadata row vs. per-tile overhead
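A minimal sketch of point 2: per-tile gzip on repetitive pixel data. The tile size and int16 layout here are illustrative assumptions, not the RaQuet wire format:

```python
import gzip
import struct

# Hypothetical 256x256 single-band tile of int16 values. Real rasters are
# highly repetitive (flat terrain, nodata fills), which is what makes
# per-tile gzip compression so effective.
pixels = [100] * (128 * 256) + [101] * (128 * 256)   # two long runs
raw = struct.pack(f"<{len(pixels)}h", *pixels)        # 131,072 bytes raw
compressed = gzip.compress(raw)
print(f"{len(raw)} -> {len(compressed)} bytes")
```

On uniform data like this, gzip reduces the block to a tiny fraction of its raw size; real-world tiles compress less, but the mechanism is the same.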

DuckDB vs BigQuery (Small Dataset)

We benchmarked both RaQuet implementations using TCI.parquet (Sentinel-2 True Color imagery, 261 MB, 3,225 tiles across zoom levels 7-14).

Query Performance

| Query Type | DuckDB | BigQuery Native | BigQuery GCS |
|---|---|---|---|
| Point Query (pixel at coordinate) | 4.0s | 3.2s | 4.7s |
| Single Tile Statistics | 1.3s | 2.4s | 2.5s |
| Region Statistics (545 tiles) | 2.7s | 6.0s | ~7s |
| Resolution Distribution | 0.8s | 2.0s | 2.4s |
| Full Table Aggregation | 0.7s | 2.0s | 2.6s |

DuckDB tested on Apple M3 Max, querying directly from GCS via HTTPS. BigQuery tested with native tables and GCS external tables.

Key Findings

DuckDB is 2-3x faster for interactive queries due to its native C++ implementation versus BigQuery’s JavaScript UDFs. The performance gap is most noticeable on compute-heavy operations like region statistics, where DuckDB processes 545 tiles in 2.7s compared to BigQuery’s 6s.

BigQuery GCS external tables add only 20-30% overhead compared to native tables. This makes them a practical option for exploring RaQuet files without ETL — you can query parquet files directly in GCS and upgrade to native tables later if needed.

All three approaches query the same parquet files. A single RaQuet file in cloud storage can be accessed by DuckDB (via HTTPS), BigQuery external tables, or loaded into BigQuery native tables.

Architecture Comparison

| Aspect | DuckDB | BigQuery |
|---|---|---|
| Implementation | C++ Extension | JavaScript UDFs |
| QUADBIN Functions | Native C++ | CARTO Analytics Toolbox |
| Raster Decompression | zlib (C) | pako (JavaScript) |
| Data Access | Local, HTTPS, S3, GCS | Native tables, external tables |
| Cold Start | <1 second | 2-5 seconds |
| Scaling | Single machine | Distributed |

Multi-Resolution Pyramid Queries

RaQuet files contain tiles at multiple zoom levels (overviews), so a region query can read tiles at the zoom level that matches its extent instead of always touching full-resolution data.


Large-Scale Benchmark: 15GB Slope Raster (563K tiles)

We benchmarked a real-world use case: data center site suitability analysis on a 15GB slope raster covering the DC metro area and Maryland (1-meter resolution LiDAR-derived slope). The dataset was partitioned into 156 files using raquet-io partition and loaded into both DuckDB (local SSD) and Snowflake (Small warehouse).

Dataset

| Asset | Size | Details |
|---|---|---|
| Source DEM tiles | ~15 GB | USGS 3DEP 1m resolution |
| Slope as RaQuet | 14.1 GB | 563,517 tiles, zoom 17, 156 partition files |

Query A: Score a Candidate Site

Given a polygon, compute slope statistics for all tiles that intersect it.

| Site Size | DuckDB (local SSD) | Snowflake (Small WH) | Tiles |
|---|---|---|---|
| ~0.5 km² | 1.9s | 10.8s | 16 |
| ~25 km² | 1.7s | 13.3s | 1,776 |
| ~3,000 km² | 4.3s | 31.2s | 60,390 |

Query B: Full Area Suitability Scan

Scan all tiles, compute per-tile statistics, filter by slope threshold.

| Query | DuckDB (local SSD) | Tiles |
|---|---|---|
| Cells with mean slope < 3° | 16.9s | 475,068 data tiles |
| Cells with mean slope < 5° | 16.6s | 475,068 data tiles |
| Top 20 flattest cells | 28.3s | Full scan + sort |

Snowflake Query B (full table scan of 563K tiles through JavaScript UDFs) was impractically slow on a Small warehouse.

Key Findings

DuckDB is 6-8x faster for spatial queries due to the native C++ extension versus Snowflake’s JavaScript UDFs. The gap is mostly warehouse startup + UDF serialization overhead — notice how Snowflake’s small query (16 tiles, 10.8s) isn’t much faster than its medium query (1,776 tiles, 13.3s).

Full-table scans are DuckDB’s sweet spot. Processing 475K tiles in 17 seconds on local SSD is fast. Snowflake’s per-row JavaScript UDF overhead makes full scans impractical on small warehouses — a larger warehouse size would help but won’t close the gap.

Spatial indexing is critical. On the 15GB dataset, a small-polygon query reads only 16 tiles out of 563K — that’s 99.997% data skipped via QUADBIN row group pruning.
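The skip figure follows directly from the tile counts:

```python
# The data-skipping figure above is just the ratio of tiles read to total:
total_tiles = 563_517
tiles_read = 16
print(f"{1 - tiles_read / total_tiles:.3%} skipped")   # 99.997% skipped
```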

Snowflake’s value is integration, not speed. The same slope data in Snowflake can be joined with land cost, power grid, fiber connectivity, and zoning datasets — all in SQL. That workflow is hard to replicate with DuckDB alone.

Polyfill Modes

The __RAQUET_REGION_BLOCKS function supports multiple spatial matching modes:

| Mode | Behavior | Use Case |
|---|---|---|
| intersects (default) | All tiles touching the polygon | Complete coverage, matches DuckDB |
| center | Tiles whose center is inside | Faster, may miss edge tiles |
| contains | Tiles fully inside the polygon | Conservative, fewer tiles |
```sql
-- Snowflake: intersects mode (matches DuckDB read_raquet behavior)
SELECT f.VALUE::NUMBER AS block
FROM TABLE(FLATTEN(__RAQUET_REGION_BLOCKS(
    ST_GEOGRAPHYFROMWKT('POLYGON((...))'), 17, 17, 'intersects'
))) f;

-- BigQuery: same API
SELECT block FROM `project.dataset.__RAQUET_REGION_BLOCKS`(
    ST_GEOGFROMTEXT('POLYGON((...))'), 17, 17, 'intersects'
);
```
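To make the three modes concrete, here is a simplified Python sketch that treats the region and tiles as axis-aligned boxes. The real functions operate on geographies and QUADBIN cells; the names and types below are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Box:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, o: "Box") -> bool:
        return not (self.xmax < o.xmin or o.xmax < self.xmin or
                    self.ymax < o.ymin or o.ymax < self.ymin)

    def contains_box(self, o: "Box") -> bool:
        return (self.xmin <= o.xmin and o.xmax <= self.xmax and
                self.ymin <= o.ymin and o.ymax <= self.ymax)

    def contains_point(self, x: float, y: float) -> bool:
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

def match_tiles(region: Box, tiles: list[Box], mode: str = "intersects") -> list[Box]:
    if mode == "intersects":   # every tile touching the region
        return [t for t in tiles if region.intersects(t)]
    if mode == "center":       # tiles whose center falls inside the region
        return [t for t in tiles
                if region.contains_point((t.xmin + t.xmax) / 2,
                                         (t.ymin + t.ymax) / 2)]
    if mode == "contains":     # tiles fully inside the region
        return [t for t in tiles if region.contains_box(t)]
    raise ValueError(f"unknown mode: {mode}")

region = Box(0, 0, 10, 10)
tiles = [Box(1, 1, 3, 3),      # fully inside
         Box(-1, -1, 3, 3),    # straddles the edge, center inside
         Box(9, 9, 12, 12),    # straddles the edge, center outside
         Box(20, 20, 22, 22)]  # disjoint
for mode in ("intersects", "center", "contains"):
    print(mode, len(match_tiles(region, tiles, mode)))
# prints: intersects 3 / center 2 / contains 1
```

The counts illustrate the ordering in the table: intersects selects the most tiles, center drops edge-straddling tiles whose center falls outside, and contains keeps only fully interior tiles.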

How Queries Work

  1. Parquet footer read — Single range request to get file metadata
  2. Row group pruning — QUADBIN index enables skipping irrelevant row groups
  3. Column projection — Only requested bands are read
  4. Block decompression — Single gzip decompress for the matching tile

Spatial Indexing with QUADBIN

RaQuet uses QUADBIN — a Discrete Global Grid System (DGGS) based on the Web Mercator projection.

How QUADBIN Enables Fast Queries

```
QUADBIN cell ID: 5270498377487261695
                 ↓
Encodes: zoom level + tile X + tile Y in a single int64
```

When you query a point:

  1. Point coordinates → QUADBIN cell ID (O(1) computation)
  2. Parquet min/max statistics on block column → row group pruning
  3. Only matching row groups are read from storage
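Step 1 is plain Web Mercator tile math; the sketch below uses the standard slippy-map formula. The actual QUADBIN bit packing of zoom/x/y into an int64 is CARTO-specific and omitted here:

```python
import math

def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Web Mercator (slippy-map) tile containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

print(lonlat_to_tile(0.0, 0.0, 7))   # (64, 64): the center tile at zoom 7
```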

Row Group Pruning Example

For an 805 MB file with 7,424 tiles across 200 row groups:

Result: 99%+ data skipped for point queries
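A sketch of what that pruning looks like, with synthetic min/max statistics. Real pruning uses the per-row-group statistics on the block column stored in the Parquet footer:

```python
def prune_row_groups(stats, cell):
    """Indices of row groups whose [min, max] block range may contain cell."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= cell <= hi]

# 200 row groups over sorted, disjoint QUADBIN ranges (synthetic values)
stats = [(g * 1000, g * 1000 + 999) for g in range(200)]
hits = prune_row_groups(stats, cell=42_500)
print(f"read {len(hits)}/{len(stats)} row groups, "
      f"{1 - len(hits) / len(stats):.1%} skipped")   # 99.5% skipped
```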


Pyramid Structure

RaQuet stores multiple zoom levels in a single file, similar to COG overviews.

```
Zoom 0: 1 tile (global overview)
Zoom 1: 4 tiles
Zoom 2: 16 tiles
Zoom 3: 64 tiles
...
Zoom 7: 16,384 tiles (full resolution)
```
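The counts above follow a geometric series (4^z tiles per zoom level for full coverage), so a complete set of overviews adds about a third on top of the full-resolution level:

```python
counts = {z: 4 ** z for z in range(8)}        # 4^z tiles per zoom level
overviews = sum(counts[z] for z in range(7))  # zooms 0-6
print(f"full res: {counts[7]:,} tiles, overviews: {overviews:,} "
      f"({overviews / counts[7]:.1%} overhead)")
```

This geometric-series overhead is the same reason the TCI pyramid in the compression table comes out slightly larger than its single-resolution source.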

Benefits

Low-zoom queries (overview maps, previews) read only the small coarse-level tiles instead of full-resolution data — the same access pattern COG overviews provide.


Remote Query Optimization

```sh
# Smaller row groups = better pruning for range requests
raquet-io convert raster input.tif output.parquet --row-group-size 100

# Split by zoom for very large files
raquet-io split-zoom large.parquet ./by_zoom/
```

Row Group Size Trade-offs

| Row Group Size (rows) | File Size | Remote Query Speed | Local Query Speed |
|---|---|---|---|
| 50 | Larger | Fastest | Slower |
| 200 (default) | Medium | Fast | Fast |
| 1000 | Smaller | Slower | Fastest |

Rule of thumb: Use 100-200 for files that will be queried remotely.


Split by Zoom

For very large datasets, splitting by zoom level optimizes remote access:

```sh
raquet-io split-zoom world_elevation.parquet ./split/
```

Output:

```
split/
├── zoom_0.parquet (1 tile, ~1 KB)
├── zoom_1.parquet (4 tiles, ~4 KB)
├── zoom_2.parquet (16 tiles, ~16 KB)
├── zoom_3.parquet (64 tiles, ~64 KB)
├── zoom_4.parquet (256 tiles, ~256 KB)
├── zoom_5.parquet (1,024 tiles, ~1 MB)
├── zoom_6.parquet (4,096 tiles, ~4 MB)
└── zoom_7.parquet (16,384 tiles, ~800 MB)
```

Benefit: A web viewer at zoom 5 only downloads the 1 MB file, not the full 805 MB.


Comparison with COG

| Aspect | RaQuet | COG |
|---|---|---|
| Point query | ~100ms (SQL) | ~50ms (GDAL range read) |
| Window read | Not optimized | Optimized (native) |
| SQL joins | Native | Requires export |
| Zonal stats | SQL aggregation | GDAL/rasterio |
| Data warehouse | Native Parquet | Requires ETL |

Choose RaQuet when: SQL access, data warehouse integration, and governance matter more than raw window-read performance.

Choose COG when: GIS pipelines, visualization, and GDAL ecosystem are primary use cases.


Memory Efficiency

DuckDB Streaming

DuckDB processes RaQuet files with streaming execution:

```sql
-- This doesn't load the entire file into memory
SELECT AVG(ST_RasterSummaryStat(block, band_1, 'mean', metadata))
FROM read_raquet('https://storage.googleapis.com/raquet_demo_data/world_elevation.parquet');
```

Block-Level Processing

Each tile is independently compressed and can be processed without loading its neighbors.
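A sketch of that block-level processing — decompressing one tile with no access to any other tile. The pixel layout (little-endian float32) is an illustrative assumption, not the RaQuet wire format:

```python
import gzip
import struct

def decode_tile(block: bytes, width: int = 256, height: int = 256) -> list[float]:
    """Decompress one tile's gzip block into a flat pixel list."""
    raw = gzip.decompress(block)
    return list(struct.unpack(f"<{width * height}f", raw))

# Round-trip a synthetic tile: each block stands alone
pixels = [float(i % 7) for i in range(256 * 256)]
block = gzip.compress(struct.pack("<65536f", *pixels))
assert decode_tile(block) == pixels
```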


Best Practices

For Conversion

  1. Use appropriate resampling — bilinear or cubic for continuous data (elevation, temperature); near for categorical data (land cover)

  2. Match block size to use case — 256 (default) works well for most cases; 512 for very dense data

  3. Include statistics — Always enabled by default; enables query optimization

For Querying

  1. Always use spatial filters — ST_RasterIntersects or equivalent enables massive data skipping

  2. Project only needed bands — SELECT band_1, not SELECT *, for multi-band files

  3. Use read_raquet() in DuckDB — Automatically propagates metadata for raster functions

For Remote Access

  1. Use HTTPS URLs — https://storage.googleapis.com/... works directly in DuckDB

  2. Enable CORS — Required for browser-based access (viewer)

  3. Consider CDN — CloudFlare or similar for high-traffic files