# Tracks Feature Performance Optimization Options

## Current State Analysis

### Performance Characteristics
- Time Complexity: O(n log n) where n = number of GPS points
- Memory Usage: Loads entire dataset into memory (~200-400 bytes per point)
- Processing Mode: Single-threaded, sequential segmentation
- Database Load: Multiple PostGIS distance calculations per point pair
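
In rough pseudocode, the current bulk pass has the following shape (`distance_between` and the two thresholds are illustrative placeholders, not the actual implementation):

```ruby
# Illustrative sketch of the current bulk flow, not the real code:
# load everything, then compare consecutive pairs one at a time.
points = user.points.order(:timestamp).to_a # entire dataset in memory

segments = points.slice_when do |a, b|
  # One PostGIS-backed distance calculation per consecutive pair
  distance_between(a, b) > distance_threshold_meters ||
    (b.timestamp - a.timestamp) > time_threshold_seconds
end.to_a
```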
### Performance Estimates (Bulk Mode)
| Points | Processing Time | Memory Usage | Database Load |
|---|---|---|---|
| 10K | 30-60 seconds | ~50 MB | Low |
| 100K | 5-15 minutes | ~200 MB | Medium |
| 1M+ | 30-90 minutes | 400+ MB | High |
### Current Bottlenecks

- Memory constraints: Loading all points at once
- PostGIS distance calculations: Sequential, not optimized
- Single-threaded processing: No parallelization
- No progress indication: Users can't track long-running operations
## Optimization Options

### Option 1: Enhanced Time-Based Batching

Complexity: Low | Impact: High | Risk: Low

#### Implementation
- Extend the existing `:daily` mode with configurable batch sizes
- Add a 1-point overlap between batches to maintain segmentation accuracy
- Implement batch-aware progress reporting
#### Benefits

- Memory reduction: 90%+ (from ~400 MB to ~40 MB for 1M points)
- Better UX: Progress indication and cancellation support
- Incremental processing: Can resume interrupted operations
- Lower DB pressure: Smaller query result sets
#### Changes Required

```ruby
# Enhanced generator with configurable batching
Tracks::Generator.new(
  user,
  mode: :batched,
  batch_size: 24.hours,
  enable_overlap: true
).call
```
#### Edge Cases to Handle
- Tracks spanning batch boundaries (solved with overlap)
- Midnight-crossing tracks in daily mode
- Deduplication of overlapping segments
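
A minimal sketch of how the overlap and deduplication could fit together, assuming hypothetical `each_batch_range`, `segment`, `dedupe_boundary!`, and `persist` helpers (none of these exist in the current generator):

```ruby
# Sketch: process time-based batches with a 1-point overlap.
previous_last_point = nil

each_batch_range(user, size: 24.hours) do |range|
  points = user.points.where(timestamp: range).order(:timestamp).to_a
  # Prepend the previous batch's last point so a gap crossing the
  # boundary is measured correctly.
  points.unshift(previous_last_point) if previous_last_point

  segments = segment(points)
  # A segment starting at the overlap point may duplicate (or extend)
  # the previous batch's final segment; reconcile before persisting.
  dedupe_boundary!(segments, previous_last_point)

  persist(segments)
  previous_last_point = points.last
end
```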
### Option 2: Spatial Indexing Optimization

Complexity: Medium | Impact: Medium | Risk: Low

#### Implementation
- Replace individual PostGIS calls with batch distance calculations
- Implement spatial clustering for nearby points before segmentation
- Use PostGIS window functions for distance calculations
#### Benefits
- Faster distance calculations: Batch operations vs individual queries
- Reduced DB round-trips: Single query for multiple distance calculations
- Better index utilization: Leverage existing spatial indexes
#### Changes Required

```sql
-- Batch distance calculation approach
WITH point_distances AS (
  SELECT
    id,
    timestamp,
    ST_Distance(
      lonlat::geography,
      LAG(lonlat::geography) OVER (ORDER BY timestamp)
    ) AS distance_to_previous
  FROM points
  WHERE user_id = ?
  ORDER BY timestamp
)
SELECT * FROM point_distances WHERE distance_to_previous > ?
```
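
Invoked from Ruby, the whole pass becomes a single round-trip instead of one query per pair. A sketch, assuming the query above is stored in a `BATCH_DISTANCE_SQL` constant and `Point` is the points model:

```ruby
# One round-trip: the window function computes every consecutive gap,
# and the outer WHERE keeps only the segment-break points.
# 500 m is an illustrative distance threshold.
rows = Point.find_by_sql([BATCH_DISTANCE_SQL, user.id, 500])
segment_break_ids = rows.map(&:id) # points that open a new segment
```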
### Option 3: Parallel Processing with Worker Pools

Complexity: High | Impact: High | Risk: Medium

#### Implementation
- Split large datasets into non-overlapping time ranges
- Process multiple batches in parallel using Sidekiq workers
- Implement coordination mechanism for dependent segments
#### Benefits
- Faster processing: Utilize multiple CPU cores
- Scalable: Performance scales with worker capacity
- Background processing: Non-blocking for users
#### Challenges
- Complex coordination: Managing dependencies between batches
- Resource competition: Multiple workers accessing same user's data
- Error handling: Partial failure scenarios
#### Architecture

```ruby
# Parallel processing coordinator (sketch)
class Tracks::ParallelGenerator
  def initialize(user)
    @user = user
  end

  def call
    # Split the dataset into non-overlapping time ranges and fan out
    # one background job per range.
    split_into_parallel_ranges.map do |range|
      Tracks::BatchProcessorJob.perform_later(@user.id, range)
    end
  end
end
```
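
The per-range worker can then stay unaware of its siblings. A hedged sketch, where the `Tracks::Generator` keyword arguments and the `Tracks::ParallelRun` fan-in helper are assumptions rather than existing APIs:

```ruby
# Sketch of the per-range worker (ActiveJob, backed by Sidekiq).
class Tracks::BatchProcessorJob < ApplicationJob
  queue_as :tracks

  def perform(user_id, range)
    user = User.find(user_id)
    # The range is assumed to arrive as a serializable pair of timestamps.
    Tracks::Generator.new(user, mode: :bulk, time_range: range).call

    # Fan-in: the last worker to finish reconciles segments touching a
    # range boundary and marks the whole run complete.
    Tracks::ParallelRun.finalize_if_complete(user_id)
  end
end
```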
### Option 4: Incremental Algorithm Enhancement

Complexity: Medium | Impact: Medium | Risk: Medium

#### Implementation

- Enhance the existing `:incremental` mode with smarter buffering
- Implement a sliding-window approach for active track detection
- Add automatic track finalization based on time gaps (sketched below)
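
A sketch of the gap-based finalization, assuming a hypothetical per-user buffer object and a `Tracks::CreateFromBuffer` service (neither exists today):

```ruby
# Sketch: buffer incoming points; a large enough time gap closes the track.
class ActiveTrackBuffer
  FINALIZE_AFTER = 30.minutes # illustrative gap threshold

  def initialize(user)
    @user = user
    @points = []
  end

  def add(point)
    # A gap longer than the threshold means the previous track is over.
    if @points.any? && point.timestamp - @points.last.timestamp > FINALIZE_AFTER
      finalize!
    end
    @points << point
  end

  def finalize!
    Tracks::CreateFromBuffer.call(@user, @points) if @points.size >= 2
    @points = []
  end
end
```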
#### Benefits
- Real-time processing: Process points as they arrive
- Lower memory footprint: Only active segments in memory
- Better for live tracking: Immediate track updates
#### Current Limitations
- Existing incremental mode processes untracked points only
- No automatic track finalization
- Limited to single active track per user
### Option 5: Database-Level Optimization

Complexity: Low-Medium | Impact: Medium | Risk: Low

#### Implementation
- Add composite indexes for common query patterns
- Implement materialized views for expensive calculations
- Use database-level segmentation logic
#### Benefits
- Faster queries: Better index utilization
- Reduced Ruby processing: Move logic to database
- Consistent performance: Database optimizations benefit all modes
#### Proposed Indexes

```sql
-- Optimized for bulk processing
CREATE INDEX CONCURRENTLY idx_points_user_timestamp_track
ON points(user_id, timestamp) WHERE track_id IS NULL;

-- Optimized for incremental processing
CREATE INDEX CONCURRENTLY idx_points_untracked_timestamp
ON points(timestamp) WHERE track_id IS NULL;
```
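
For the materialized-view idea, a sketch of what a migration might look like; the view definition, names, and Rails version are illustrative:

```ruby
# Sketch: materialize the expensive per-point gap calculation.
class CreatePointGapsView < ActiveRecord::Migration[7.1]
  def up
    execute <<~SQL
      CREATE MATERIALIZED VIEW point_gaps AS
      SELECT id, user_id, timestamp,
             ST_Distance(
               lonlat::geography,
               LAG(lonlat::geography) OVER (
                 PARTITION BY user_id ORDER BY timestamp
               )
             ) AS distance_to_previous
      FROM points;
    SQL
  end

  def down
    execute 'DROP MATERIALIZED VIEW point_gaps;'
  end
end
```

The trade-off is freshness: the view would need a `REFRESH MATERIALIZED VIEW` after imports, which has to be weighed against the per-query savings.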
## Recommended Implementation Strategy

### Phase 1: Quick Wins (Week 1-2)

- Implement Enhanced Time-Based Batching (Option 1)
  - Extend the existing daily mode with overlap
  - Add progress reporting
  - Make batch sizes configurable
### Phase 2: Database Optimization (Week 3)

- Add Database-Level Optimizations (Option 5)
  - Create optimized indexes
  - Implement batch distance calculations
### Phase 3: Advanced Features (Week 4-6)

- Spatial Indexing Optimization (Option 2)
  - Replace individual distance calculations
  - Implement spatial clustering
### Phase 4: Future Enhancements

- Parallel Processing (Option 3): Consider for v2
- Incremental Enhancement (Option 4): For real-time features
## Risk Assessment

### Low Risk
- Time-based batching: Builds on existing daily mode
- Database indexes: Standard optimization technique
- Progress reporting: UI enhancement only
### Medium Risk
- Spatial optimization: Requires careful testing of distance calculations
- Incremental enhancement: Changes to existing algorithm logic
### High Risk
- Parallel processing: Complex coordination, potential race conditions
- Major algorithm changes: Could introduce segmentation bugs
## Success Metrics

### Performance Targets
- Memory usage: < 100 MB for datasets up to 1M points
- Processing time: < 10 minutes for 1M points
- User experience: Progress indication and cancellation
### Monitoring Points
- Database query performance
- Memory consumption during processing
- User-reported processing times
- Track generation accuracy (no regression)
## Next Steps

1. Choose the initial approach based on urgency and resources
2. Create a feature branch for the selected optimization
3. Implement comprehensive testing, including edge cases
4. Monitor performance in staging
5. Roll out gradually behind feature flags