Tracks Feature Performance Optimization Options

Current State Analysis

Performance Characteristics

  • Time Complexity: O(n log n) where n = number of GPS points
  • Memory Usage: Loads the entire dataset into memory (~200-400 bytes per point, i.e. roughly 200-400 MB for 1M points)
  • Processing Mode: Single-threaded, sequential segmentation
  • Database Load: Multiple PostGIS distance calculations per point pair

Performance Estimates (Bulk Mode)

| Points | Processing Time | Memory Usage | Database Load |
|--------|-----------------|--------------|---------------|
| 10K    | 30-60 seconds   | ~50 MB       | Low           |
| 100K   | 5-15 minutes    | ~200 MB      | Medium        |
| 1M+    | 30-90 minutes   | 400+ MB      | High          |

Current Bottlenecks

  1. Memory constraints - Loading all points at once
  2. PostGIS distance calculations - Sequential, not optimized
  3. Single-threaded processing - No parallelization
  4. No progress indication - Users can't track long-running operations

Optimization Options

Option 1: Enhanced Time-Based Batching

Complexity: Low | Impact: High | Risk: Low

Implementation

  • Extend existing :daily mode with configurable batch sizes
  • Add 1-point overlap between batches to maintain segmentation accuracy
  • Implement batch-aware progress reporting

Benefits

  • Memory reduction: 90%+ (from ~400 MB to ~40 MB for 1M points)
  • Better UX: Progress indication and cancellation support
  • Incremental processing: Can resume interrupted operations
  • Lower DB pressure: Smaller query result sets

Changes Required

# Enhanced generator with configurable batching (proposed API)
Tracks::Generator.new(
  user,
  mode: :batched,          # new mode built on the existing :daily mode
  batch_size: 24.hours,    # time window loaded per batch
  enable_overlap: true     # carry the last point across batch boundaries
).call

Edge Cases to Handle

  • Tracks spanning batch boundaries (solved with the 1-point overlap; see the sketch below)
  • Midnight-crossing tracks in daily mode
  • Deduplication of overlapping segments
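
To make the overlap concrete, here is a minimal sketch of the batch loop. The each_time_range, fetch_points, split_into_segments, and persist_or_merge helpers are hypothetical stand-ins for generator internals, not existing Dawarich APIs:

# Hypothetical sketch: each batch starts with the last point of the
# previous batch, so the gap at the boundary is still evaluated.
def generate_in_batches(user, batch_size: 24.hours)
  overlap_point = nil

  each_time_range(user, batch_size) do |range|       # assumed iterator
    batch = fetch_points(user, range)                # assumed loader
    batch.unshift(overlap_point) if overlap_point    # 1-point overlap

    segments = split_into_segments(batch)            # existing segmentation logic
    # Segments that share the overlap point continue an earlier track;
    # merge rather than re-create them (the deduplication edge case above).
    persist_or_merge(segments)

    overlap_point = batch.last
  end
end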

Option 2: Spatial Indexing Optimization

Complexity: Medium | Impact: Medium | Risk: Low

Implementation

  • Replace individual PostGIS calls with batch distance calculations
  • Implement spatial clustering for nearby points before segmentation
  • Use PostGIS window functions for distance calculations

Benefits

  • Faster distance calculations: Batch operations vs individual queries
  • Reduced DB round-trips: Single query for multiple distance calculations
  • Better index utilization: Leverage existing spatial indexes

Changes Required

-- Batch distance calculation approach
WITH point_distances AS (
  SELECT 
    id,
    timestamp,
    ST_Distance(
      lonlat::geography,
      LAG(lonlat::geography) OVER (ORDER BY timestamp)
    ) as distance_to_previous
  FROM points 
  WHERE user_id = ? 
  ORDER BY timestamp
)
SELECT * FROM point_distances WHERE distance_to_previous > ?
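
From Ruby, the whole calculation fits in a single round-trip; a minimal sketch, assuming a Point model over the points table and a 500 m threshold chosen purely for illustration:

# Minimal sketch: one window-function query replaces N individual
# PostGIS distance calls.
sql = <<~SQL
  WITH point_distances AS (
    SELECT id, timestamp,
           ST_Distance(
             lonlat::geography,
             LAG(lonlat::geography) OVER (ORDER BY timestamp)
           ) AS distance_to_previous
    FROM points
    WHERE user_id = ?
    ORDER BY timestamp
  )
  SELECT * FROM point_distances WHERE distance_to_previous > ?
SQL

segment_breaks = Point.find_by_sql([sql, user.id, 500])  # points that start a new segment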

Option 3: Parallel Processing with Worker Pools

Complexity: High | Impact: High | Risk: Medium

Implementation

  • Split large datasets into non-overlapping time ranges
  • Process multiple batches in parallel using Sidekiq workers
  • Implement coordination mechanism for dependent segments

Benefits

  • Faster processing: Utilize multiple CPU cores
  • Scalable: Performance scales with worker capacity
  • Background processing: Non-blocking for users

Challenges

  • Complex coordination: Managing dependencies between batches
  • Resource competition: Multiple workers accessing the same user's data
  • Error handling: Partial failure scenarios

Architecture

# Parallel processing coordinator: fans a user's points out into
# non-overlapping time ranges, one background job per range
class Tracks::ParallelGenerator
  def initialize(user)
    @user = user
  end

  def call
    time_ranges = split_into_parallel_ranges  # non-overlapping by design

    time_ranges.map do |range|
      Tracks::BatchProcessorJob.perform_later(@user.id, range)
    end
  end
end
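
The coordination challenge can be addressed with a shared completion counter, so that the last batch to finish triggers cross-boundary merging. A hypothetical sketch, assuming a cache store with atomic decrement and a Tracks::MergeBoundarySegmentsJob that would still need to be written:

# Hypothetical sketch: the coordinator is assumed to have written the
# batch count to the cache before enqueuing the jobs.
class Tracks::BatchProcessorJob < ApplicationJob
  queue_as :tracks

  def perform(user_id, range)
    user = User.find(user_id)
    Tracks::Generator.new(user, mode: :bulk, time_range: range).call  # illustrative arguments

    # The last finishing worker stitches segments that touch a range boundary.
    remaining = Rails.cache.decrement("tracks:batches_remaining:#{user_id}")
    Tracks::MergeBoundarySegmentsJob.perform_later(user_id) if remaining&.zero?
  end
end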

Option 4: Incremental Algorithm Enhancement

Complexity: Medium | Impact: Medium | Risk: Medium

Implementation

  • Enhance existing :incremental mode with smarter buffering
  • Implement a sliding-window approach for active track detection (see the sketch at the end of this option)
  • Add automatic track finalization based on time gaps

Benefits

  • Real-time processing: Process points as they arrive
  • Lower memory footprint: Only active segments in memory
  • Better for live tracking: Immediate track updates

Current Limitations

  • Existing incremental mode processes untracked points only
  • No automatic track finalization
  • Limited to single active track per user
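
To make the sliding-window idea concrete, here is a minimal sketch in which only the active segment stays in memory and a time gap finalizes it automatically. The buffer class, the Tracks::CreateTrack service, and the 30-minute threshold are illustrative assumptions, not existing code:

# Hypothetical sketch: buffer only the active track; a large enough
# time gap between points finalizes it automatically.
class Tracks::SlidingWindowBuffer
  GAP_THRESHOLD = 30.minutes

  def initialize(user)
    @user = user
    @buffer = []
  end

  def add(point)
    finalize! if gap_exceeded?(point)
    @buffer << point
  end

  def finalize!
    Tracks::CreateTrack.call(@user, @buffer) if @buffer.size > 1  # assumed service
    @buffer = []
  end

  private

  def gap_exceeded?(point)
    @buffer.any? && point.timestamp - @buffer.last.timestamp > GAP_THRESHOLD
  end
end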

Option 5: Database-Level Optimization

Complexity: Low-Medium | Impact: Medium | Risk: Low

Implementation

  • Add composite indexes for common query patterns
  • Implement materialized views for expensive calculations
  • Use database-level segmentation logic

Benefits

  • Faster queries: Better index utilization
  • Reduced Ruby processing: Move logic to database
  • Consistent performance: Database optimizations benefit all modes

Proposed Indexes

-- Optimized for bulk processing
CREATE INDEX CONCURRENTLY idx_points_user_timestamp_track 
ON points(user_id, timestamp) WHERE track_id IS NULL;

-- Optimized for incremental processing
CREATE INDEX CONCURRENTLY idx_points_untracked_timestamp 
ON points(timestamp) WHERE track_id IS NULL;
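
In a Rails migration these statements need disable_ddl_transaction!, because CREATE INDEX CONCURRENTLY cannot run inside the implicit migration transaction. A sketch of the equivalent migration (class name and Rails version are illustrative):

class AddTrackGenerationIndexesToPoints < ActiveRecord::Migration[7.1]
  # CONCURRENTLY is incompatible with the migration's wrapping transaction.
  disable_ddl_transaction!

  def change
    add_index :points, [:user_id, :timestamp],
              name: 'idx_points_user_timestamp_track',
              where: 'track_id IS NULL',
              algorithm: :concurrently

    add_index :points, :timestamp,
              name: 'idx_points_untracked_timestamp',
              where: 'track_id IS NULL',
              algorithm: :concurrently
  end
end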

Implementation Roadmap

Phase 1: Quick Wins (Week 1-2)

  1. Implement Enhanced Time-Based Batching (Option 1)
    • Extend existing daily mode with overlap
    • Add progress reporting
    • Configurable batch sizes

Phase 2: Database Optimization (Week 3)

  1. Add Database-Level Optimizations (Option 5)
    • Create optimized indexes
    • Implement batch distance calculations

Phase 3: Advanced Features (Week 4-6)

  1. Spatial Indexing Optimization (Option 2)
    • Replace individual distance calculations
    • Implement spatial clustering

Phase 4: Future Enhancements

  1. Parallel Processing (Option 3) - Consider for v2
  2. Incremental Enhancement (Option 4) - For real-time features

Risk Assessment

Low Risk

  • Time-based batching: Builds on existing daily mode
  • Database indexes: Standard optimization technique
  • Progress reporting: UI enhancement only

Medium Risk

  • Spatial optimization: Requires careful testing of distance calculations
  • Incremental enhancement: Changes to existing algorithm logic

High Risk

  • Parallel processing: Complex coordination, potential race conditions
  • Major algorithm changes: Could introduce segmentation bugs

Success Metrics

Performance Targets

  • Memory usage: < 100 MB for datasets up to 1M points
  • Processing time: < 10 minutes for 1M points
  • User experience: Progress indication and cancellation

Monitoring Points

  • Database query performance
  • Memory consumption during processing
  • User-reported processing times
  • Track generation accuracy (no regression)
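
The first two monitoring points can be captured with ActiveSupport::Notifications around a generation run; a minimal sketch, where the tracks.generate event name and the generator invocation are illustrative:

# Subscriber: logs duration and object allocations per generation run.
ActiveSupport::Notifications.subscribe('tracks.generate') do |event|
  Rails.logger.info(
    "tracks.generate user=#{event.payload[:user_id]} " \
    "duration_ms=#{event.duration.round(1)} allocations=#{event.allocations}"
  )
end

# Instrumented run.
ActiveSupport::Notifications.instrument('tracks.generate', user_id: user.id) do
  Tracks::Generator.new(user, mode: :bulk).call
end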

Next Steps

  1. Choose initial approach based on urgency and resources
  2. Create feature branch for selected optimization
  3. Implement comprehensive testing including edge cases
  4. Monitor performance in staging environment
  5. Gradual rollout with feature flags