DuckDB Network Architecture
kreview aggregates tens of millions of rows of heavily nested fragmentomics data on a consumer MacBook, using DuckDB as the backing query engine.
Processing large cohorts stored on remote network-mounted directories requires careful tuning to avoid failures caused by exhausting file handles with too many concurrent opens.
The Unix maxfiles Limit
When loading a large cohort (e.g., 4,600 samples), a naive glob-scanning query causes DuckDB to fan out across its I/O threads and attempt to open all 4,600 file handles simultaneously. This trips the macOS per-process limit on open file descriptors (typically 256 or 1024 by default), producing "Too many open files" errors.
I/O Thread Throttling
To fix this from within Python, we override DuckDB's default configuration, which:
- limits DuckDB to 4 concurrent I/O threads instead of os.cpu_count()
- prevents DuckDB from consuming all available memory during large joins
Feature Batching (chunk_size)
Inside load_feature_cohort, we abandon read_parquet(*/*) glob expansion entirely.
Instead, we build an explicit Python list of file paths by discovering files with Path.iterdir(), then load that list in chunks:
```python
chunk_size = 500
df_list = []
for i in range(0, len(file_paths), chunk_size):
    chunk = file_paths[i:i + chunk_size]
    df_chunk = conn.execute(
        "SELECT * FROM read_parquet(?, union_by_name=true)",
        [chunk],  # bound as a DuckDB LIST of file paths
    ).df()
    df_list.append(df_chunk)
```
By default, chunk_size is 500, sitting comfortably under the default 1024 file-descriptor limit on macOS. You can lower it further with the --chunk-size CLI flag.
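As a sanity check on the arithmetic (a standalone sketch, not kreview code): a 4,600-sample cohort at chunk_size = 500 loads in 10 batches, so no single batch opens more than 500 file handles.

```python
n_files = 4_600
chunk_size = 500

# (start, end) index ranges produced by the chunking loop above.
chunks = [
    (i, min(i + chunk_size, n_files))
    for i in range(0, n_files, chunk_size)
]
print(len(chunks))   # 10 batches
print(chunks[-1])    # (4500, 4600): the final, partial batch
```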
Exponential Backoff Retry
Even with chunking, transient I/O failures can occur when reading parquet files from network-mounted directories. To handle this gracefully, kreview implements automatic retry logic with exponential backoff:
```python
max_retries = 3
for attempt in range(max_retries):
    try:
        df_chunk = conn.execute(query, [chunk]).df()
        df_list.append(df_chunk)
        break
    except Exception as e:
        if attempt < max_retries - 1:
            log.warning("duckdb_retry", attempt=attempt + 1, error=str(e))
            time.sleep(2 ** attempt)  # 1s, then 2s, before retrying
        else:
            log.error("chunk_load_failed", error=str(e))
            return pd.DataFrame()  # permanent failure
```
When to Use --chunk-size
If you see PermissionError or IO Error during large cohort loading:
- First retry: the backoff mechanism handles transient failures automatically
- Persistent failures: reduce --chunk-size to 100 or 50
- Very congested networks: also try --workers 1 to serialize I/O completely
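To pick a safe --chunk-size, you can check the process's actual open-file limit with the standard library (a standalone sketch; the halving heuristic is an assumption, not kreview's rule):

```python
import resource

# Per-process open-file limits (RLIMIT_NOFILE): soft is what applies now.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Leave headroom for DuckDB's own handles, sockets, and stdio.
safe_chunk_size = max(50, soft // 2)
print(f"suggested --chunk-size: {safe_chunk_size}")
```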