Import Optimisation Plan

Goals

  • Prevent large imports from exhausting memory or hitting IO limits while reading export archives.
  • Maintain correctness and ordering guarantees for all imported entities.
  • Preserve observability and operability (clear errors and actionable logs).

Current Status

  • Replaced File.read + JSON.parse with streaming via Oj::Parser(:saj).load, so data.json is consumed in 16 KB chunks instead of being loaded whole (first sketch below).
  • Users::ImportData now dispatches streamed payloads section by section, buffering places in in-memory batches and spilling visits/points to NDJSON for replay once their dependencies are ready (second sketch below).
  • Points, places, and visits importers support incremental ingestion with a fixed batch size of 1,000 records and detailed progress logs.
  • Added targeted specs for the SAJ handler and streaming flow; addressed IO retry messaging.
  • ⚙️ Pending: archive-size guardrails, broader telemetry, and production rollout validation.
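
The streaming step can be pictured roughly as follows. This is a minimal sketch, assuming Oj's long-standing Oj.saj_parse entry point (a close relative of the Oj::Parser(:saj) API mentioned above) and a top-level object of section arrays in data.json; the class name and wiring are illustrative rather than the codebase's actual identifiers, and nested structures inside records are ignored for brevity.

```ruby
require "oj"

# Illustrative SAJ handler: emits one record per element of each top-level
# section array, so the file never has to be materialised in memory.
class SectionDispatcher < Oj::Saj
  def initialize(&emit) # emit.call(section_name, record_hash)
    @emit = emit
    @section = nil
    @record = nil
    @depth = 0
  end

  def hash_start(_key)
    @depth += 1
    @record = {} if @section && @depth == 3 # an element of a section array
  end

  def hash_end(_key)
    @emit.call(@section, @record) if @record && @depth == 3
    @record = nil if @depth == 3
    @depth -= 1
  end

  def array_start(key)
    @section = key if @depth == 1 # e.g. "places", "visits", "points"
    @depth += 1
  end

  def array_end(_key)
    @depth -= 1
    @section = nil if @depth == 1
  end

  def add_value(value, key)
    @record[key] = value if @record # scalars outside a record are skipped
  end
end

handler = SectionDispatcher.new { |section, record| puts "#{section}: #{record}" }
File.open("data.json") { |io| Oj.saj_parse(handler, io) } # Oj reads the IO in chunks
```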

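The spill-and-replay path for visits/points can be sketched like this, assuming a plain Tempfile for NDJSON storage and ActiveRecord's insert_all; streamed_visits, the Visit model call, and the log line are stand-ins for whatever the importers actually do.

```ruby
require "json"
require "tempfile"

BATCH_SIZE = 1_000 # the fixed batch size described above

streamed_visits = [{ "name" => "Cafe", "place_id" => 1 }] # stand-in for streamed records

# Spill: one JSON document per line (NDJSON) keeps memory flat while streaming.
spill = Tempfile.new(["visits", ".ndjson"])
streamed_visits.each { |record| spill.puts(JSON.generate(record)) }
spill.flush

# Replay: once places exist, re-read the file line by line and insert in batches.
spill.rewind
spill.each_line.each_slice(BATCH_SIZE) do |lines|
  batch = lines.map { |line| JSON.parse(line) }
  Visit.insert_all(batch) # hypothetical ActiveRecord model call
  puts "imported visits batch size=#{batch.size}"
end
spill.close!
```
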
Remaining Pain Points

  • No preflight check yet for extreme data.json sizes or malformed streams.
  • Monitoring relies on logs alone; there are no metrics or dashboards for batch throughput and failures (see the sketch below).
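
If metrics land later, one low-friction option in a Rails app is ActiveSupport::Notifications: batch inserts emit an event, and a subscriber forwards it to logs or a metrics backend. The event name and payload keys below are invented for illustration.

```ruby
require "active_support/notifications"

# Subscriber: turns each batch event into a structured log line (or a metric).
ActiveSupport::Notifications.subscribe("import.batch") do |_name, start, finish, _id, payload|
  duration_ms = ((finish - start) * 1000).round(1)
  puts "import batch section=#{payload[:section]} size=#{payload[:size]} duration_ms=#{duration_ms}"
end

# Publisher: wrap each batch insert so timing is captured automatically.
ActiveSupport::Notifications.instrument("import.batch", section: "points", size: 1_000) do
  # insert the batch here
end
```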

Next Steps

  1. Rollout & Hardening
    • Add size/structure validation before streaming to fail fast with an actionable error (a sketch follows this list).
    • Extend log coverage (import durations, batch counts) and document an operator playbook.
    • Capture memory/runtime snapshots during large staged imports.
  2. Post-Rollout Validation
    • Re-run the previously failing Sidekiq job (import 105) under the new pipeline.
    • Monitor Sidekiq memory and throughput; tune batch size if needed.
    • Gather feedback and decide on export format split or further streaming tweaks.
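
The preflight guard from step 1 could be as small as the sketch below: stat the file and peek at its first byte before handing the IO to the streaming parser. MAX_DATA_JSON_BYTES and ImportPreflightError are hypothetical names, and the threshold itself is still an open question (see below).

```ruby
MAX_DATA_JSON_BYTES = 2 * 1024**3 # hypothetical 2 GiB ceiling; real threshold TBD

class ImportPreflightError < StandardError; end

def preflight!(path)
  size = File.size(path)
  if size > MAX_DATA_JSON_BYTES
    raise ImportPreflightError,
          "data.json is #{size} bytes (limit #{MAX_DATA_JSON_BYTES}); split the export"
  end

  # Cheap structure check (ignoring possible leading whitespace for brevity):
  # a well-formed export begins with a JSON object.
  first_byte = File.open(path) { |io| io.readpartial(1) }
  raise ImportPreflightError, "data.json does not start with '{'" unless first_byte == "{"
end
```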

Validation Strategy

  • Automated: streaming parser specs, importer batch tests, service integration spec (already in place; expand as new safeguards land).
  • Manual: stage large imports, inspect Sidekiq logs (and metrics, once added), and confirm that notifications, stats, and files are restored.

Open Questions

  • What thresholds should trigger preflight failures or warnings (file size, record counts)?
  • Do we need structured metrics beyond logs for long-running imports?
  • Should we pursue export format splitting or incremental resume once streaming rollout is stable?