Points raw_data S3 Archival System Implementation Plan

Version: 1.2 Date: 2025-11-25 Last Updated: 2025-12-06 Status: Planning Phase

🔄 Version 1.2 Updates (2025-12-06)

Key Changes:

  • Optional archival system - Controlled by ARCHIVE_RAW_DATA environment variable (default: disabled)
  • User model integration - Added has_many :raw_data_archives with cascade deletion
  • GDPR compliance - Complete data removal on user deletion (DB + S3)
  • Configuration section - New section documenting ENV vars and opt-in behavior
  • Appendix C added - User Model Integration with cascade deletion details

Why These Changes:

  • Safe deployment: Feature disabled by default, can be enabled gradually
  • Data privacy: Full cascade deletion ensures GDPR compliance
  • Operational flexibility: Can disable archival instantly if issues arise
  • User deletion: Archives and S3 files automatically cleaned up

🔄 Version 1.1 Updates (2025-11-25)

Key Changes:

  • No column accessor override - Uses raw_data_with_archive method instead of overriding raw_data column
  • Archivable concern - Extracted all archival logic to reusable Archivable concern
  • Migration guide - Added comprehensive guide for updating existing code
  • JSONL appendix - Added detailed explanation of JSONL format and why we use it
  • Updated file checklist - Added concern files and additional modifications needed

Why These Changes:

  • Avoids ActiveRecord column accessor conflicts
  • Cleaner separation of concerns (Point model stays focused)
  • Explicit method names (raw_data_with_archive vs raw_data)
  • Reusable pattern for other models if needed

Executive Summary

Implement a system to archive the points.raw_data JSONB column using ActiveStorage, reducing database size from 50GB+ to ~15-20GB while maintaining the ability to restore the data for migrations and fixes.

Key Benefits:

  • 60-70% database size reduction (~30-35GB saved)
  • 10-20% query performance improvement
  • ~$50/month savings on backup/storage costs
  • Zero data loss with append-only architecture
  • Full restore capabilities via rake tasks

Table of Contents

  1. Architecture Overview
  2. Configuration NEW
  3. Why ActiveStorage?
  4. Implementation Phases
  5. Migration Guide for Existing Code NEW
  6. Rake Tasks for Restoration
  7. Deployment Strategy
  8. Monitoring & Success Metrics
  9. Rollback Plan
  10. Timeline
  11. Cost Analysis
  12. FAQ
  13. Appendix A: What is JSONL? NEW
  14. Appendix B: File Checklist
  15. Appendix C: User Model Integration NEW

Architecture Overview

Current State

┌─────────────────────────────┐
│   Points Table (50GB)       │
│                             │
│  - id                       │
│  - user_id                  │
│  - timestamp                │
│  - lonlat (PostGIS)         │
│  - raw_data (JSONB) ← 30GB  │  ← Problem: 60% of table size
│  - ...other columns         │
└─────────────────────────────┘

Target State

┌─────────────────────────────┐     ┌──────────────────────────────────┐
│   Points Table (15-20GB)    │     │  Points::RawDataArchive Model    │
│                             │     │                                  │
│  - id                       │     │  - id                            │
│  - user_id                  │     │  - user_id                       │
│  - timestamp                │     │  - year, month, chunk_number     │
│  - lonlat (PostGIS)         │     │  - point_count, checksum         │
│  - raw_data (NULL) ← freed  │     │  - metadata (JSONB)              │
│  - raw_data_archived (bool) │────▶│  - has_one_attached :file        │
│  - raw_data_archive_id      │     │                                  │
│  - timestamp_year (gen)     │     └──────────────┬───────────────────┘
│  - timestamp_month (gen)    │                    │
└─────────────────────────────┘                    │ ActiveStorage
                                                   │
                                    ┌──────────────▼───────────────────┐
                                    │  S3 / Local Storage              │
                                    │                                  │
                                    │  raw_data_000001.jsonl.gz (1MB)  │
                                    │  raw_data_000002.jsonl.gz (50KB) │
                                    │  ...append-only chunks           │
                                    └──────────────────────────────────┘

Data Flow

Archival (Monthly Cron):

1. Find months 2+ months old with unarchived points
2. Create Points::RawDataArchive record
3. Compress points to JSONL.gz
4. Attach via ActiveStorage (handles S3 upload)
5. Atomically: Mark points archived, NULL raw_data

Access (Lazy Loading):

1. point.raw_data_with_archive called
2. Check if archived → yes
3. Check cache (1 day TTL) → miss
4. Fetch from archive.file.blob.download
5. Cache result, return data
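
From a Rails console, that access path looks roughly like this (a sketch using the methods introduced later in this plan; the id is illustrative):

point = Point.find(123)
point.raw_data                  # => nil (column was NULLed at archival time)
point.raw_data_archived?        # => true
point.raw_data_with_archive     # => {...} downloaded from the archive chunk,
                                #    then cached for 1 day via Rails.cache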

Restoration (Rake Task):

1. Find all archives for user/month
2. Download via ActiveStorage
3. Decompress, parse JSONL
4. Update points: restore raw_data, unmark archived

Configuration

Environment Variables

The archival system is opt-in and controlled by a single environment variable:

# .env or environment
ARCHIVE_RAW_DATA=true    # Enable/disable archival (default: false)
STORAGE_BACKEND=s3       # Already exists: s3, local, etc.

Important: If ARCHIVE_RAW_DATA is not set to "true", the entire archival system is disabled:

  • Archive jobs won't run
  • Archive service returns early
  • No S3 costs incurred
  • Existing code works unchanged

The archival lag period (2 months) is a constant in the code and can be modified if needed.

This allows gradual rollout and easy disabling if issues arise.
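
Both knobs can be inspected from a console; a quick sketch (SAFE_ARCHIVE_LAG is the constant defined in Points::RawData::Archiver below):

ENV['ARCHIVE_RAW_DATA'] == 'true'
# => false unless explicitly enabled

Date.current.beginning_of_month - Points::RawData::Archiver::SAFE_ARCHIVE_LAG
# => the cutoff date; only months strictly before this are eligible for archival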


Why ActiveStorage?

Consistency with Existing Code

Dawarich already uses ActiveStorage for Import and Export models:

Existing Pattern (Import model):

class Import < ApplicationRecord
  has_one_attached :file
end

# Usage in services
import.file.attach(io: File.open(path), filename: name)
content = import.file.blob.download

Our Pattern (Archive model):

class Points::RawDataArchive < ApplicationRecord
  has_one_attached :file
end

# Usage in archiver
archive.file.attach(io: StringIO.new(compressed), filename: "...")
content = archive.file.blob.download

Benefits Over Direct S3

| Feature | Direct S3 | ActiveStorage |
|---|---|---|
| Backend flexibility | Manual config | Automatic via STORAGE_BACKEND env |
| Code consistency | New S3 client code | Same as Import/Export |
| Checksums | Manual implementation | Built-in via Blob |
| Local dev | Need MinIO/localstack | Works with local disk |
| Testing | Mock S3 clients | Use ActiveStorage test helpers |
| Cleanup | Manual delete calls | file.purge_later |
| Streaming | Custom chunking | Built-in: blob.download accepts a block for chunked reads |

No Additional Configuration Needed

# config/storage.yml - Already configured!
s3:
  service: S3
  access_key_id: <%= ENV.fetch("AWS_ACCESS_KEY_ID") %>
  secret_access_key: <%= ENV.fetch("AWS_SECRET_ACCESS_KEY") %>
  region: <%= ENV.fetch("AWS_REGION") %>
  bucket: <%= ENV.fetch("AWS_BUCKET") %>

# config/environments/production.rb - Already configured!
config.active_storage.service = ENV.fetch('STORAGE_BACKEND', :local)

Result: Zero new infrastructure setup!


Implementation Phases

Phase 1: Database Schema (3 Migrations)

Migration 1: Create Points::RawDataArchive Table

# db/migrate/YYYYMMDDHHMMSS_create_points_raw_data_archives.rb
class CreatePointsRawDataArchives < ActiveRecord::Migration[8.0]
  def change
    create_table :points_raw_data_archives do |t|
      t.bigint :user_id, null: false
      t.integer :year, null: false
      t.integer :month, null: false
      t.integer :chunk_number, null: false, default: 1
      t.integer :point_count, null: false
      t.string :point_ids_checksum, null: false
      t.jsonb :metadata, default: {}, null: false
      t.datetime :archived_at, null: false

      t.timestamps
    end

    add_index :points_raw_data_archives, :user_id
    add_index :points_raw_data_archives, [:user_id, :year, :month]
    add_index :points_raw_data_archives, :archived_at
    add_foreign_key :points_raw_data_archives, :users
  end
end

Migration 2: Add Archival Columns to Points

# db/migrate/YYYYMMDDHHMMSS_add_archival_columns_to_points.rb
class AddArchivalColumnsToPoints < ActiveRecord::Migration[8.0]
  def change
    add_column :points, :raw_data_archived, :boolean, default: false, null: false
    add_column :points, :raw_data_archive_id, :bigint, null: true

    add_index :points, :raw_data_archived,
      where: 'raw_data_archived = true',
      name: 'index_points_on_archived_true'
    add_index :points, :raw_data_archive_id

    add_foreign_key :points, :points_raw_data_archives,
      column: :raw_data_archive_id,
      on_delete: :nullify  # Don't delete points if archive deleted
  end
end

Migration 3: Add Generated Timestamp Columns

# db/migrate/YYYYMMDDHHMMSS_add_generated_timestamp_columns_to_points.rb
class AddGeneratedTimestampColumnsToPoints < ActiveRecord::Migration[8.0]
  def change
    # Use PostgreSQL stored generated columns for automatic year/month extraction.
    # The expression is evaluated in UTC so it is immutable, which PostgreSQL
    # requires for stored generated columns.
    add_column :points, :timestamp_year, :virtual,
      type: :integer,
      as: "(EXTRACT(YEAR FROM to_timestamp(timestamp) AT TIME ZONE 'UTC'))::int",
      stored: true

    add_column :points, :timestamp_month, :virtual,
      type: :integer,
      as: "(EXTRACT(MONTH FROM to_timestamp(timestamp) AT TIME ZONE 'UTC'))::int",
      stored: true

    # Composite index for efficient archival queries
    add_index :points, [:user_id, :timestamp_year, :timestamp_month, :raw_data_archived],
      name: 'index_points_on_user_time_archived'
  end
end

Why generated columns?

  • No application code needed to maintain
  • Automatically updated on insert/update
  • Can be indexed for fast queries
  • PostgreSQL calculates on write (not on read)
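
After the migration runs, a quick console sanity check can confirm that PostgreSQL maintains the values (a sketch; assumes the UTC-based expression above):

point = Point.order(:id).last
Time.at(point.timestamp).utc.year    # e.g. 2025
point.timestamp_year                 # same value, written by PostgreSQL on insert/update
point.timestamp_month                # 1..12, also maintained by PostgreSQL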

Phase 2: Models

Points::RawDataArchive Model

# app/models/points/raw_data_archive.rb
class Points::RawDataArchive < ApplicationRecord
  belongs_to :user
  has_many :points, foreign_key: :raw_data_archive_id, dependent: :nullify

  has_one_attached :file

  validates :year, :month, :chunk_number, :point_count, presence: true
  validates :year, numericality: { greater_than: 1970, less_than: 2100 }
  validates :month, numericality: { greater_than_or_equal_to: 1, less_than_or_equal_to: 12 }
  validates :chunk_number, numericality: { greater_than: 0 }
  validates :point_ids_checksum, presence: true

  validate :file_must_be_attached, on: :update

  scope :for_month, ->(user_id, year, month) {
    where(user_id: user_id, year: year, month: month)
      .order(:chunk_number)
  }

  scope :recent, -> { where('archived_at > ?', 30.days.ago) }
  scope :old, -> { where('archived_at < ?', 1.year.ago) }

  def month_display
    Date.new(year, month, 1).strftime('%B %Y')
  end

  def filename
    "raw_data_#{user_id}_#{year}_#{format('%02d', month)}_chunk#{format('%03d', chunk_number)}.jsonl.gz"
  end

  def size_mb
    return 0 unless file.attached?
    (file.blob.byte_size / 1024.0 / 1024.0).round(2)
  end

  private

  def file_must_be_attached
    errors.add(:file, 'must be attached') unless file.attached?
  end
end
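
Console sketch of the model API above (user, chunk count and sizes are illustrative):

archives = Points::RawDataArchive.for_month(user.id, 2024, 6)
archives.map { |a| [a.chunk_number, a.filename, "#{a.size_mb} MB"] }
# => [[1, "raw_data_1_2024_06_chunk001.jsonl.gz", "1.2 MB"], ...]

archives.first.month_display
# => "June 2024"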

Archivable Concern

⚠️ Important: We use a concern instead of overriding the raw_data column accessor to avoid ActiveRecord conflicts.

# app/models/concerns/archivable.rb
module Archivable
  extend ActiveSupport::Concern

  included do
    # Associations
    belongs_to :raw_data_archive,
      class_name: 'Points::RawDataArchive',
      foreign_key: :raw_data_archive_id,
      optional: true

    # Scopes
    scope :archived, -> { where(raw_data_archived: true) }
    scope :not_archived, -> { where(raw_data_archived: false) }
    scope :with_archived_raw_data, -> {
      includes(raw_data_archive: { file_attachment: :blob })
    }
  end

  # Main method: Get raw_data with fallback to archive
  # Use this instead of point.raw_data when you need archived data
  def raw_data_with_archive
    # If raw_data is present in DB, use it
    return raw_data if raw_data.present? || !raw_data_archived?

    # Otherwise fetch from archive
    fetch_archived_raw_data
  end

  # Alias for convenience (optional)
  alias_method :archived_raw_data, :raw_data_with_archive

  # Restore archived data back to database column
  def restore_raw_data!(value)
    update!(
      raw_data: value,
      raw_data_archived: false,
      raw_data_archive_id: nil
    )
  end

  # Cache key for long-term archive caching
  def archive_cache_key
    "raw_data:archive:#{self.class.name.underscore}:#{id}"
  end

  private

  def fetch_archived_raw_data
    # Check temporary restore cache first (for migrations)
    cached = check_temporary_restore_cache
    return cached if cached

    # Check long-term cache (1 day TTL)
    Rails.cache.fetch(archive_cache_key, expires_in: 1.day) do
      fetch_from_archive_file
    end
  rescue StandardError => e
    handle_archive_fetch_error(e)
  end

  def check_temporary_restore_cache
    return nil unless respond_to?(:timestamp_year) && timestamp_year && timestamp_month

    cache_key = "raw_data:temp:#{user_id}:#{timestamp_year}:#{timestamp_month}:#{id}"
    Rails.cache.read(cache_key)
  end

  def fetch_from_archive_file
    return {} unless raw_data_archive&.file&.attached?

    # Download and search through JSONL
    compressed_content = raw_data_archive.file.blob.download
    io = StringIO.new(compressed_content)
    gz = Zlib::GzipReader.new(io)

    result = nil
    gz.each_line do |line|
      data = JSON.parse(line)
      if data['id'] == id
        result = data['raw_data']
        break
      end
    end

    gz.close
    result || {}
  end

  def handle_archive_fetch_error(error)
    Rails.logger.error(
      "Failed to fetch archived raw_data for #{self.class.name} #{id}: #{error.message}"
    )
    Sentry.capture_exception(error) if defined?(Sentry)

    {} # Graceful degradation
  end
end

Point Model (Clean!)

# app/models/point.rb
class Point < ApplicationRecord
  include Nearable
  include Distanceable
  include Archivable  # ← All archival logic in concern

  belongs_to :import, optional: true, counter_cache: true
  belongs_to :visit, optional: true
  belongs_to :user, counter_cache: true
  belongs_to :country, optional: true
  belongs_to :track, optional: true

  validates :timestamp, :lonlat, presence: true
  # ... rest of existing code ...

  # Keep existing scope for query optimization
  def self.without_raw_data
    select(column_names - ['raw_data'])
  end

  # ... rest of existing methods ...
end

Usage:

# In services that need raw_data:
point.raw_data_with_archive  # Gets from DB or archive

# Regular column access (doesn't check archive):
point.raw_data  # May be NULL if archived

# For restoration:
point.restore_raw_data!(data)

Phase 3: Archive Services

Service: Points::RawData::Archiver

# app/services/points/raw_data/archiver.rb
class Points::RawData::Archiver
  SAFE_ARCHIVE_LAG = 2.months

  def initialize
    @stats = { processed: 0, archived: 0, failed: 0 }
  end

  def call
    unless archival_enabled?
      Rails.logger.info('Raw data archival disabled (ARCHIVE_RAW_DATA != "true")')
      return @stats
    end

    Rails.logger.info('Starting points raw_data archival...')

    # Grouped relation: iterate with #each (find_each cannot batch a GROUP BY query)
    archivable_months.each do |month_data|
      process_month(month_data)
    end

    Rails.logger.info("Archival complete: #{@stats}")
    @stats
  end

  def archive_specific_month(user_id, year, month)
    month_data = {
      'user_id' => user_id,
      'year' => year,
      'month' => month
    }

    process_month(month_data)
  end

  private

  def archival_enabled?
    ENV['ARCHIVE_RAW_DATA'] == 'true'
  end

  def archivable_months
    # Only months 2+ months old with unarchived points
    safe_cutoff = Date.current.beginning_of_month - SAFE_ARCHIVE_LAG

    Point.select(
      'user_id',
      'timestamp_year as year',
      'timestamp_month as month',
      'COUNT(*) as unarchived_count'
    ).where(raw_data_archived: false)
     .where('to_timestamp(timestamp) < ?', safe_cutoff)
     .group('user_id, timestamp_year, timestamp_month')
  end

  def process_month(month_data)
    user_id = month_data['user_id']
    year = month_data['year']
    month = month_data['month']

    lock_key = "archive_points:#{user_id}:#{year}:#{month}"

    # Advisory lock prevents duplicate processing; with timeout_seconds: 0,
    # with_advisory_lock returns false immediately if another process holds the lock
    acquired = ActiveRecord::Base.with_advisory_lock(lock_key, timeout_seconds: 0) do
      archive_month(user_id, year, month)
      @stats[:processed] += 1
    end

    Rails.logger.info("Skipping #{lock_key} - already locked") unless acquired
  rescue StandardError => e
    Rails.logger.error("Archive failed for #{user_id}/#{year}/#{month}: #{e.message}")
    Sentry.capture_exception(e) if defined?(Sentry)
    @stats[:failed] += 1
  end

  def archive_month(user_id, year, month)
    # Find unarchived points for this month
    points = Point.where(
      user_id: user_id,
      timestamp_year: year,
      timestamp_month: month,
      raw_data_archived: false
    ).where.not(raw_data: nil)  # Skip already-NULLed points

    return if points.empty?

    point_ids = points.pluck(:id)

    Rails.logger.info("Archiving #{point_ids.count} points for user #{user_id}, #{year}-#{sprintf('%02d', month)}")

    # Create archive chunk
    archive = create_archive_chunk(user_id, year, month, points, point_ids)

    # Atomically mark points and NULL raw_data
    Point.transaction do
      Point.where(id: point_ids).update_all(
        raw_data_archived: true,
        raw_data_archive_id: archive.id,
        raw_data: nil  # Reclaim space!
      )
    end

    @stats[:archived] += point_ids.count

    Rails.logger.info("✓ Archived chunk #{archive.chunk_number} (#{archive.size_mb} MB)")
  end

  def create_archive_chunk(user_id, year, month, points, point_ids)
    # Determine chunk number (append-only)
    chunk_number = Points::RawDataArchive
      .where(user_id: user_id, year: year, month: month)
      .maximum(:chunk_number).to_i + 1

    # Compress points data
    compressed_data = Points::RawData::ChunkCompressor.new(points).compress

    # Create archive record
    archive = Points::RawDataArchive.create!(
      user_id: user_id,
      year: year,
      month: month,
      chunk_number: chunk_number,
      point_count: point_ids.count,
      point_ids_checksum: calculate_checksum(point_ids),
      archived_at: Time.current,
      metadata: {
        format_version: 1,
        compression: 'gzip',
        archived_by: 'Points::RawData::Archiver'
      }
    )

    # Attach compressed file via ActiveStorage
    filename = "raw_data_#{user_id}_#{year}_#{sprintf('%02d', month)}_chunk#{sprintf('%03d', chunk_number)}.jsonl.gz"

    archive.file.attach(
      io: StringIO.new(compressed_data),
      filename: filename,
      content_type: 'application/gzip'
    )

    archive
  end

  def calculate_checksum(point_ids)
    Digest::SHA256.hexdigest(point_ids.sort.join(','))
  end
end
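
The point_ids_checksum stored on each archive is not exercised anywhere else in this plan; a sketch of how a verification step could recompute it from a chunk (same JSONL layout and SHA256 scheme as above):

def verify_checksum(archive)
  ids = []

  io = StringIO.new(archive.file.blob.download)
  Zlib::GzipReader.new(io).each_line do |line|
    ids << JSON.parse(line)['id']
  end

  # Must match Archiver#calculate_checksum: SHA256 over sorted ids joined by commas
  Digest::SHA256.hexdigest(ids.sort.join(',')) == archive.point_ids_checksum
end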

Helper: Points::RawData::ChunkCompressor

# app/services/points/raw_data/chunk_compressor.rb
class Points::RawData::ChunkCompressor
  def initialize(points_relation)
    @points = points_relation
  end

  def compress
    io = StringIO.new
    gz = Zlib::GzipWriter.new(io)

    # Stream points to avoid memory issues with large months
    @points.select(:id, :raw_data).find_each(batch_size: 1000) do |point|
      # Write as JSONL (one JSON object per line)
      gz.puts({ id: point.id, raw_data: point.raw_data }.to_json)
    end

    gz.close
    io.string  # Returns compressed bytes
  end
end

Service: Points::RawData::Restorer

# app/services/points/raw_data/restorer.rb
class Points::RawData::Restorer
  def restore_to_database(user_id, year, month)
    archives = Points::RawDataArchive.for_month(user_id, year, month)

    raise "No archives found for user #{user_id}, #{year}-#{month}" if archives.empty?

    Rails.logger.info("Restoring #{archives.count} archives to database...")

    Point.transaction do
      archives.each do |archive|
        restore_archive_to_db(archive)
      end
    end

    Rails.logger.info("✓ Restored #{archives.sum(:point_count)} points")
  end

  def restore_to_memory(user_id, year, month)
    archives = Points::RawDataArchive.for_month(user_id, year, month)

    raise "No archives found for user #{user_id}, #{year}-#{month}" if archives.empty?

    Rails.logger.info("Loading #{archives.count} archives into cache...")

    cache_key_prefix = "raw_data:temp:#{user_id}:#{year}:#{month}"
    count = 0

    archives.each do |archive|
      count += restore_archive_to_cache(archive, cache_key_prefix)
    end

    Rails.logger.info("✓ Loaded #{count} points into cache (expires in 1 hour)")
  end

  def restore_all_for_user(user_id)
    archives = Points::RawDataArchive.where(user_id: user_id)
                                   .select(:year, :month)
                                   .distinct
                                   .order(:year, :month)

    Rails.logger.info("Restoring #{archives.count} months for user #{user_id}...")

    archives.each do |archive|
      restore_to_database(user_id, archive.year, archive.month)
    end

    Rails.logger.info("✓ Complete user restore finished")
  end

  private

  def restore_archive_to_db(archive)
    decompressed = download_and_decompress(archive)

    decompressed.each_line do |line|
      data = JSON.parse(line)

      Point.where(id: data['id']).update_all(
        raw_data: data['raw_data'],
        raw_data_archived: false,
        raw_data_archive_id: nil
      )
    end
  end

  def restore_archive_to_cache(archive, cache_key_prefix)
    decompressed = download_and_decompress(archive)
    count = 0

    decompressed.each_line do |line|
      data = JSON.parse(line)

      Rails.cache.write(
        "#{cache_key_prefix}:#{data['id']}",
        data['raw_data'],
        expires_in: 1.hour
      )

      count += 1
    end

    count
  end

  def download_and_decompress(archive)
    # Download via ActiveStorage
    compressed_content = archive.file.blob.download

    # Decompress
    io = StringIO.new(compressed_content)
    gz = Zlib::GzipReader.new(io)
    content = gz.read
    gz.close

    content
  rescue StandardError => e
    Rails.logger.error("Failed to download/decompress archive #{archive.id}: #{e.message}")
    raise
  end
end

Phase 4: Export Optimization

# app/services/users/export_data/points.rb (modify existing)

class Users::ExportData::Points
  def call
    # ... existing query code ...

    result.filter_map do |row|
      # ... existing code ...

      # Handle archived raw_data
      raw_data = if row['raw_data_archived']
        fetch_raw_data_from_archive(row['raw_data_archive_id'], row['id'])
      else
        row['raw_data']
      end

      point_hash = {
        # ... existing fields ...
        'raw_data' => raw_data
      }

      # ... existing code ...
      point_hash
    end
  end

  private

  # Cache downloaded archives to avoid re-downloading per point
  def fetch_raw_data_from_archive(archive_id, point_id)
    return {} if archive_id.nil?

    @archive_cache ||= {}

    unless @archive_cache[archive_id]
      archive = Points::RawDataArchive.find(archive_id)
      @archive_cache[archive_id] = parse_archive(archive)
    end

    @archive_cache[archive_id][point_id] || {}
  end

  def parse_archive(archive)
    # Download once, parse all points
    compressed = archive.file.blob.download
    io = StringIO.new(compressed)
    gz = Zlib::GzipReader.new(io)

    result = {}
    gz.each_line do |line|
      data = JSON.parse(line)
      result[data['id']] = data['raw_data']
    end

    gz.close
    result
  rescue StandardError => e
    Rails.logger.error("Failed to parse archive #{archive.id}: #{e.message}")
    {}
  end
end

Phase 5: Background Jobs

# app/jobs/points/raw_data/archive_job.rb
class Points::RawData::ArchiveJob < ApplicationJob
  queue_as :default

  def perform
    stats = Points::RawData::Archiver.new.call

    Rails.logger.info("Archive job complete: #{stats}")
  rescue StandardError => e
    Rails.logger.error("Archive job failed: #{e.message}")
    Sentry.capture_exception(e) if defined?(Sentry)
    raise
  end
end
# app/jobs/points/raw_data/re_archive_month_job.rb
class Points::RawData::ReArchiveMonthJob < ApplicationJob
  queue_as :default

  def perform(user_id, year, month)
    Rails.logger.info("Re-archiving #{user_id}/#{year}/#{month} (retrospective import)")

    Points::RawData::Archiver.new.archive_specific_month(user_id, year, month)
  rescue StandardError => e
    Rails.logger.error("Re-archive failed: #{e.message}")
    Sentry.capture_exception(e) if defined?(Sentry)
    raise
  end
end

Trigger re-archival after imports:

# app/services/imports/create.rb (or wherever import completes)
class Imports::Create
  def call
    # ... existing import logic ...

    # After successful import, check for retrospective points
    check_for_archived_months_needing_update
  end

  private

  def check_for_archived_months_needing_update
    # Newly imported points are never archived themselves, so look at the months
    # this import touched and check whether an archive already exists for them
    affected_months = import.points
      .select('DISTINCT timestamp_year, timestamp_month')

    affected_months.each do |month|
      next unless Points::RawDataArchive.exists?(
        user_id: import.user_id,
        year: month.timestamp_year,
        month: month.timestamp_month
      )

      # Queue job to create an append-only chunk for the already-archived month
      Points::RawData::ReArchiveMonthJob.perform_later(
        import.user_id,
        month.timestamp_year,
        month.timestamp_month
      )
    end
  end
end

Migration Guide for Existing Code

After implementing the archival system, you'll need to update existing code that accesses point.raw_data.

Step 1: Find All Usages

# Find all places where raw_data is accessed
grep -r "\.raw_data" app/services app/models --include="*.rb" > raw_data_usages.txt

# Common locations:
# - app/services/points/raw_data_lonlat_extractor.rb
# - app/services/google_maps/*_importer.rb
# - app/services/users/export_data/points.rb
# - app/serializers/*_serializer.rb

Step 2: Decision Tree

For each usage, ask:

Question 1: Is this code creating/importing new points?

  • Yes → Keep point.raw_data (data is in DB during import)
  • No → Go to Question 2

Question 2: Does this code need to access potentially archived data?

  • Yes → Change to point.raw_data_with_archive
  • No → Keep point.raw_data (but add comment why)

Step 3: Update Common Services

Example 1: RawDataLonlatExtractor

Before:

# app/services/points/raw_data_lonlat_extractor.rb
class Points::RawDataLonlatExtractor
  def extract_lonlat(point)
    if point.raw_data.dig('activitySegment', 'waypointPath', 'waypoints', 0)
      # ... extract coordinates ...
    elsif point.raw_data['longitudeE7'] && point.raw_data['latitudeE7']
      # ... extract coordinates ...
    end
  end
end

After:

# app/services/points/raw_data_lonlat_extractor.rb
class Points::RawDataLonlatExtractor
  def extract_lonlat(point)
    # Use raw_data_with_archive to support archived points
    raw = point.raw_data_with_archive

    if raw.dig('activitySegment', 'waypointPath', 'waypoints', 0)
      # ... extract coordinates ...
    elsif raw['longitudeE7'] && raw['latitudeE7']
      # ... extract coordinates ...
    end
  end
end

Why: This service is called for coordinate fixes/migrations, which may need archived data.

Example 2: Importer Services

Keep as-is:

# app/services/google_maps/semantic_history_importer.rb
class GoogleMaps::SemanticHistoryImporter
  def build_point_from_location(longitude:, latitude:, timestamp:, raw_data:, accuracy: nil)
    {
      longitude: longitude,
      latitude: latitude,
      timestamp: timestamp,
      raw_data: raw_data  # ← Keep as-is, we're CREATING points
      # ...
    }
  end
end

Why: Import services create new points, so raw_data will be in the database.

Example 3: Export Service

Before:

# app/services/users/export_data/points.rb
class Users::ExportData::Points
  def call
    points_sql = <<-SQL
      SELECT p.id, p.raw_data, ...
      FROM points p
      WHERE p.user_id = $1
    SQL

    result = ActiveRecord::Base.connection.exec_query(points_sql, 'Points Export', [user.id])

    result.map do |row|
      {
        'raw_data' => row['raw_data'],  # ← Problem: may be NULL if archived
        # ...
      }
    end
  end
end

After (Option A - Use concern method):

class Users::ExportData::Points
  def call
    # Fetch points with archive association eager-loaded
    points = user.points.with_archived_raw_data.order(:id)

    points.map do |point|
      {
        'raw_data' => point.raw_data_with_archive,  # ← Handles archived data
        # ...
      }
    end
  end
end

After (Option B - Batch fetch archives, see Phase 4 in plan):

# Already implemented in plan - caches downloaded archives

Example 4: Serializers

Before:

# app/serializers/export_serializer.rb
class ExportSerializer
  def serialize(point)
    {
      id: point.id,
      raw_data: point.raw_data,  # ← May be NULL
      # ...
    }
  end
end

After:

class ExportSerializer
  def serialize(point)
    {
      id: point.id,
      raw_data: point.raw_data_with_archive,  # ← Fetches from archive if needed
      # ...
    }
  end
end

Step 4: Testing Your Changes

# spec/services/points/raw_data_lonlat_extractor_spec.rb
RSpec.describe Points::RawDataLonlatExtractor do
  context 'with archived raw_data' do
    let(:archive) { create(:points_raw_data_archive, user: user, year: 2024, month: 6) }
    let(:point) { create(:point, user: user, raw_data: nil, raw_data_archived: true, raw_data_archive_id: archive.id) }

    before do
      # Mock archive content
      allow(archive.file.blob).to receive(:download).and_return(
        gzip_compress({ id: point.id, raw_data: { 'lon' => 13.4, 'lat' => 52.5 } }.to_json)
      )
    end

    it 'extracts coordinates from archived raw_data' do
      service = described_class.new(point)
      service.call

      expect(point.reload.longitude).to be_within(0.001).of(13.4)
      expect(point.reload.latitude).to be_within(0.001).of(52.5)
    end
  end
end
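
The gzip_compress helper used in the before block is not defined in this plan; a minimal version could live in spec/support (hypothetical file name):

# spec/support/gzip_helpers.rb
module GzipHelpers
  def gzip_compress(string)
    io = StringIO.new
    gz = Zlib::GzipWriter.new(io)
    gz.write(string)
    gz.close
    io.string
  end
end

RSpec.configure { |config| config.include GzipHelpers }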

Step 5: Gradual Rollout Strategy

  1. Week 1: Update services, add tests
  2. Week 2: Deploy changes (before archival starts)
  3. Week 3: Start archival (code already handles it)

This ensures services work with both:

  • Points with raw_data in DB (current state)
  • Points with raw_data archived (future state)

Common Patterns Summary

| Code Location | Change? | Reason |
|---|---|---|
| Importers (creating points) | No | raw_data is in DB during import |
| RawDataLonlatExtractor | Yes | Used for fixes/migrations |
| Export services | Yes | Users export all their data |
| Serializers for API | Yes | May serialize archived points |
| Display views | Yes | May show archived points |
| Background jobs (processing new imports) | No | Processing fresh data |
| Data migrations | Yes | May process old data |

Rake Tasks for Restoration

# lib/tasks/points_raw_data.rake
namespace :points do
  namespace :raw_data do
    desc 'Restore raw_data from archive to database for a specific month'
    task :restore, [:user_id, :year, :month] => :environment do |t, args|
      validate_args!(args)

      user_id = args[:user_id].to_i
      year = args[:year].to_i
      month = args[:month].to_i

      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts "  Restoring raw_data to DATABASE"
      puts "  User: #{user_id} | Month: #{year}-#{sprintf('%02d', month)}"
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts ""

      restorer = Points::RawData::Restorer.new
      restorer.restore_to_database(user_id, year, month)

      puts ""
      puts "✓ Restoration complete!"
      puts ""
      puts "Points in #{year}-#{month} now have raw_data in database."
      puts "Run VACUUM ANALYZE points; to update statistics."
    end

    desc 'Restore raw_data to memory/cache temporarily (for data migrations)'
    task :restore_temporary, [:user_id, :year, :month] => :environment do |t, args|
      validate_args!(args)

      user_id = args[:user_id].to_i
      year = args[:year].to_i
      month = args[:month].to_i

      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts "  Loading raw_data into CACHE (temporary)"
      puts "  User: #{user_id} | Month: #{year}-#{sprintf('%02d', month)}"
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts ""
      puts "Data will be available for 1 hour via point.raw_data_with_archive (temporary restore cache)"
      puts ""

      restorer = Points::RawData::Restorer.new
      restorer.restore_to_memory(user_id, year, month)

      puts ""
      puts "✓ Cache loaded successfully!"
      puts ""
      puts "You can now run your data migration."
      puts "Example:"
      puts "  rails runner \"Point.where(user_id: #{user_id}, timestamp_year: #{year}, timestamp_month: #{month}).find_each { |p| p.fix_coordinates_from_raw_data }\""
      puts ""
      puts "Cache will expire in 1 hour automatically."
    end

    desc 'Restore all archived raw_data for a user'
    task :restore_all, [:user_id] => :environment do |t, args|
      raise 'Usage: rake points:raw_data:restore_all[user_id]' unless args[:user_id]

      user_id = args[:user_id].to_i
      user = User.find(user_id)

      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts "  Restoring ALL archives for user"
      puts "  #{user.email} (ID: #{user_id})"
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts ""

      archives = Points::RawDataArchive.where(user_id: user_id)
                                     .select(:year, :month)
                                     .distinct
                                     .order(:year, :month)

      puts "Found #{archives.count} months to restore"
      puts ""

      archives.each_with_index do |archive, idx|
        puts "[#{idx + 1}/#{archives.count}] Restoring #{archive.year}-#{sprintf('%02d', archive.month)}..."

        restorer = Points::RawData::Restorer.new
        restorer.restore_to_database(user_id, archive.year, archive.month)
      end

      puts ""
      puts "✓ All archives restored for user #{user_id}!"
    end

    desc 'Show archive statistics'
    task status: :environment do
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts "  Points raw_data Archive Statistics"
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts ""

      total_archives = Points::RawDataArchive.count
      total_points = Point.count
      archived_points = Point.where(raw_data_archived: true).count
      percentage = total_points > 0 ? (archived_points.to_f / total_points * 100).round(2) : 0

      puts "Archives: #{total_archives}"
      puts "Points archived: #{archived_points} / #{total_points} (#{percentage}%)"
      puts ""

      # Storage size via ActiveStorage
      total_blob_size = ActiveStorage::Blob
        .joins("INNER JOIN active_storage_attachments ON active_storage_attachments.blob_id = active_storage_blobs.id")
        .where("active_storage_attachments.record_type = 'Points::RawDataArchive'")
        .sum(:byte_size)

      puts "Storage used: #{ActiveSupport::NumberHelper.number_to_human_size(total_blob_size)}"
      puts ""

      # Recent activity
      recent = Points::RawDataArchive.where('archived_at > ?', 7.days.ago).count
      puts "Archives created last 7 days: #{recent}"
      puts ""

      # Top users
      puts "Top 10 users by archive count:"
      puts "─────────────────────────────────────────────────"

      Points::RawDataArchive.group(:user_id)
                          .select('user_id, COUNT(*) as archive_count, SUM(point_count) as total_points')
                          .order('archive_count DESC')
                          .limit(10)
                          .each_with_index do |stat, idx|
        user = User.find(stat.user_id)
        puts "#{idx + 1}. #{user.email.ljust(30)} #{stat.archive_count.to_s.rjust(3)} archives, #{stat.total_points.to_s.rjust(8)} points"
      end

      puts ""
    end

    desc 'Verify archive integrity for a month'
    task :verify, [:user_id, :year, :month] => :environment do |t, args|
      validate_args!(args)

      user_id = args[:user_id].to_i
      year = args[:year].to_i
      month = args[:month].to_i

      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts "  Verifying Archives"
      puts "  User: #{user_id} | Month: #{year}-#{sprintf('%02d', month)}"
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts ""

      archives = Points::RawDataArchive.for_month(user_id, year, month)

      if archives.empty?
        puts "No archives found."
        exit
      end

      all_ok = true

      archives.each do |archive|
        print "Chunk #{archive.chunk_number}: "

        # Check file attached
        unless archive.file.attached?
          puts "✗ ERROR - File not attached!"
          all_ok = false
          next
        end

        # Download and count
        begin
          compressed = archive.file.blob.download
          io = StringIO.new(compressed)
          gz = Zlib::GzipReader.new(io)

          actual_count = 0
          gz.each_line { actual_count += 1 }
          gz.close

          if actual_count == archive.point_count
            puts "✓ OK (#{actual_count} points, #{archive.size_mb} MB)"
          else
            puts "✗ MISMATCH - Expected #{archive.point_count}, found #{actual_count}"
            all_ok = false
          end
        rescue => e
          puts "✗ ERROR - #{e.message}"
          all_ok = false
        end
      end

      puts ""
      if all_ok
        puts "✓ All archives verified successfully!"
      else
        puts "✗ Some archives have issues. Please investigate."
      end
    end

    desc 'Run initial archival for old data (safe to re-run)'
    task initial_archive: :environment do
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts "  Initial Archival (2+ months old data)"
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts ""
      puts "This will archive points.raw_data for months 2+ months old."
      puts "This is safe to run multiple times (idempotent)."
      puts ""
      print "Continue? (y/N): "

      response = $stdin.gets.chomp.downcase
      unless response == 'y'
        puts "Cancelled."
        exit
      end

      puts ""
      stats = Points::RawData::Archiver.new.call

      puts ""
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts "  Archival Complete"
      puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
      puts ""
      puts "Months processed: #{stats[:processed]}"
      puts "Points archived: #{stats[:archived]}"
      puts "Failures: #{stats[:failed]}"
      puts ""

      if stats[:archived] > 0
        puts "Next steps:"
        puts "1. Verify a sample: rake points:raw_data:verify[user_id,year,month]"
        puts "2. Check stats: rake points:raw_data:status"
        puts "3. (Optional) Reclaim space: VACUUM FULL points; (during maintenance)"
      end
    end
  end
end

def validate_args!(args)
  unless args[:user_id] && args[:year] && args[:month]
    raise 'Usage: rake points:raw_data:TASK[user_id,year,month]'
  end
end

Deployment Strategy

Phase 1: Deploy Non-Breaking Changes (Week 1)

Goal: Add infrastructure without using it

# 1. Deploy migrations
rails db:migrate

# 2. Verify schema
rails db:schema:dump
git diff db/schema.rb

# 3. Check indexes created
psql -c "\d points" | grep -i index
psql -c "\d points_raw_data_archives" | grep -i index

# 4. Deploy code (models, services inactive)
git push production main

# 5. Monitor for 24 hours
# - Check database performance
# - Check for any errors related to new columns
# - Verify generated columns populate correctly

Rollback: Simple rails db:rollback STEP=3

Phase 2: Test on Small Dataset (Week 2)

Goal: Validate archival works end-to-end

# Pick a test user with old data
# Example: User 123 with data from 2022

# 1. Manual archive
rails runner "Points::RawData::Archiver.new.archive_specific_month(123, 2022, 1)"

# 2. Verify archive created
rails runner "puts Points::RawDataArchive.where(user_id: 123, year: 2022, month: 1).inspect"

# 3. Verify ActiveStorage blob exists
rails runner "archive = Points::RawDataArchive.find_by(user_id: 123, year: 2022, month: 1); puts archive.file.attached?"

# 4. Verify points marked archived
rails runner "puts Point.where(user_id: 123, timestamp_year: 2022, timestamp_month: 1, raw_data_archived: true).count"

# 5. Test lazy loading
rails runner "point = Point.where(user_id: 123, timestamp_year: 2022, timestamp_month: 1).first; puts point.raw_data_with_archive.inspect"

# 6. Verify integrity
rake points:raw_data:verify[123,2022,1]

# 7. Test restore
rake points:raw_data:restore_temporary[123,2022,1]

# 8. Clean up test
# (Leave archived for continued testing)

Expected Results:

  • Archive file created in S3/local storage
  • Points have raw_data = NULL in DB
  • point.raw_data_with_archive still returns the data (from cache/S3); point.raw_data itself is NULL
  • No errors in logs

Phase 3: Gradual Rollout (Weeks 3-5)

Goal: Archive progressively older data, monitoring each step

Week 3: Archive 3+ Years Old

# Safety: Very old data, rarely accessed
# Temporarily set SAFE_ARCHIVE_LAG to 3.years for this first pass (see Weeks 4-5)
rake points:raw_data:initial_archive

# Monitor:
# - Database size (should decrease)
# - S3 storage (should increase)
# - Query performance (should improve)
# - Error logs (should be empty)

# Check stats
rake points:raw_data:status

Week 4: Archive 2-3 Years Old

# Adjust threshold temporarily
# In Points::RawData::Archiver, change SAFE_ARCHIVE_LAG to 2.years
rake points:raw_data:initial_archive

# Monitor same metrics

Week 5: Archive 1-2 Years Old

# Adjust threshold to 1.year
rake points:raw_data:initial_archive

# Monitor same metrics

Week 6: Enable Monthly Cron

# config/schedule.yml (or crontab)
0 2 1 * * cd /app && bin/rails runner "Points::RawData::ArchiveJob.perform_later"
# 2 AM on 1st of each month

Phase 4: Reclaim Space (Week 7)

Goal: Actually reclaim disk space from NULLed raw_data

-- During maintenance window (low traffic period)

-- 1. Check current table size
SELECT
  pg_size_pretty(pg_total_relation_size('points')) as total_size,
  pg_size_pretty(pg_relation_size('points')) as table_size,
  pg_size_pretty(pg_indexes_size('points')) as indexes_size;

-- 2. Vacuum full (can take hours, locks table!)
-- IMPORTANT: This locks the table. Do during maintenance!
VACUUM FULL points;

-- 3. Reindex
REINDEX TABLE points;

-- 4. Update statistics
ANALYZE points;

-- 5. Check new size
SELECT
  pg_size_pretty(pg_total_relation_size('points')) as total_size,
  pg_size_pretty(pg_relation_size('points')) as table_size,
  pg_size_pretty(pg_indexes_size('points')) as indexes_size;

Alternative (No Downtime):

-- Use the pg_repack extension if available (run from the shell, not psql):
--   pg_repack -d dawarich_production -t points

-- Or create new table, copy data, swap
CREATE TABLE points_new (LIKE points INCLUDING ALL);
INSERT INTO points_new SELECT * FROM points;
-- ... swap tables atomically

Monitoring & Success Metrics

Database Metrics

-- Query to monitor archival progress
SELECT
  COUNT(*) FILTER (WHERE raw_data_archived = false) as not_archived,
  COUNT(*) FILTER (WHERE raw_data_archived = true) as archived,
  COUNT(*) as total,
  ROUND(100.0 * COUNT(*) FILTER (WHERE raw_data_archived = true) / COUNT(*), 2) as archived_percentage
FROM points;

-- Table size over time
SELECT
  pg_size_pretty(pg_total_relation_size('points')) as total_size,
  pg_size_pretty(pg_relation_size('points')) as table_size;

-- Average row size (sampled over 10k rows)
SELECT
  pg_size_pretty(AVG(pg_column_size(p.*))::bigint) as avg_row_size
FROM (SELECT * FROM points LIMIT 10000) p;

Application Metrics

# config/initializers/prometheus.rb (if using Prometheus)

# Archive operations
archive_operations = Prometheus::Client::Counter.new(
  :points_raw_data_archives_total,
  docstring: 'Total number of archive operations',
  labels: [:status]  # success, failure
)

# Archived points
archived_points_total = Prometheus::Client::Gauge.new(
  :points_raw_data_archived_count,
  docstring: 'Number of points with archived raw_data'
)

# Storage size
archive_storage_bytes = Prometheus::Client::Gauge.new(
  :points_raw_data_archive_storage_bytes,
  docstring: 'Total storage used by archives'
)

# Cache hit rate
raw_data_cache_hits = Prometheus::Client::Counter.new(
  :points_raw_data_cache_hits_total,
  docstring: 'Cache hits for raw_data access',
  labels: [:cache_type]  # temporary, long_term, miss
)
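
How these gauges get populated is not specified above; one option is to refresh them periodically, for example at the end of the archive job. A sketch using the metrics registered above:

# e.g. appended to Points::RawData::ArchiveJob#perform (sketch)
archived_points_total.set(Point.where(raw_data_archived: true).count)

archive_storage_bytes.set(
  ActiveStorage::Blob
    .joins(:attachments)
    .where(active_storage_attachments: { record_type: 'Points::RawDataArchive' })
    .sum(:byte_size)
)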

Success Criteria

| Metric | Baseline | Target | Alert If |
|---|---|---|---|
| Database size | 50GB | 15-20GB | > 25GB |
| Query perf (p95) | 200ms | 160-180ms | > 220ms |
| Archive success rate | N/A | > 99% | < 95% |
| Cache hit rate | N/A | > 80% | < 60% |
| Archive storage size | 0 GB | ~0.5-1 GB | > 5 GB |
| Export time (with archived) | 30s | < 36s | > 45s |

Rollback Plan

Level 1: Stop Archival (No Data Loss)

# 1. Disable cron job
# Comment out in config/schedule.yml or crontab

# 2. Stop any running archive jobs
# Via Sidekiq dashboard or:
rails runner "Sidekiq::Queue.new('default').each { |job| job.delete if job.display_class == 'Points::RawData::ArchiveJob' }"

# 3. Monitor - system still works, no new archival

Level 2: Restore Data (Reversible)

# Restore all archived data back to database
rake points:raw_data:restore_all[user_id]  # Per user, or:

# Restore all users (can take hours)
rails runner "
  User.find_each do |user|
    puts \"Restoring user \#{user.id}...\"
    Points::RawData::Restorer.new.restore_all_for_user(user.id)
  end
"

# Verify restoration
rails runner "puts Point.where(raw_data_archived: true).count"  # Should be 0

Level 3: Remove System (Nuclear)

# 1. Ensure all data restored (Level 2)

# 2. Remove foreign keys and indexes
rails dbconsole
# DROP INDEX IF EXISTS index_points_on_archived_true;
# DROP INDEX IF EXISTS index_points_on_user_time_archived;
# ALTER TABLE points DROP CONSTRAINT IF EXISTS fk_rails_points_raw_data_archives;

# 3. Rollback migrations
rails db:migrate:down VERSION=YYYYMMDDHHMMSS  # timestamp_columns
rails db:migrate:down VERSION=YYYYMMDDHHMMSS  # archival_columns
rails db:migrate:down VERSION=YYYYMMDDHHMMSS  # create_archives_table

# 4. Delete ActiveStorage blobs
rails runner "
  Points::RawDataArchive.find_each do |archive|
    archive.file.purge
    archive.destroy
  end
"

# 5. Remove code
git revert <commit-sha>
git push production main

# 6. VACUUM to reclaim space
psql -d dawarich_production -c "VACUUM FULL points;"

Timeline

Week 1: Foundation

  • Mon-Tue: Create migrations, deploy
  • Wed-Thu: Implement models, basic services
  • Fri: Code review, tests

Week 2: Core Services

  • Mon-Tue: Complete Archiver service
  • Wed: Complete Restorer service
  • Thu: Export optimization
  • Fri: Background jobs

Week 3: Tools & Testing

  • Mon-Tue: Rake tasks
  • Wed: Comprehensive test suite
  • Thu-Fri: Integration testing on staging

Week 4: Production Deploy

  • Mon: Deploy to production (code only, inactive)
  • Tue: Test on single user
  • Wed-Fri: Monitor, validate

Week 5: Initial Archive

  • Mon: Archive 3+ year old data
  • Tue-Fri: Monitor metrics, validate

Week 6: Expand Archive

  • Mon: Archive 2+ year old data
  • Tue-Fri: Monitor, optimize

Week 7: Production Ready

  • Mon: Enable monthly cron
  • Tue: Final validation
  • Wed: Documentation update
  • Thu-Fri: Reclaim space (VACUUM FULL)

Total: 7 weeks


Cost Analysis

Database Savings

Before:

  • Points table: 50GB
  • Daily backup cost: ~$0.05/GB/day = $2.50/day = $75/month

After:

  • Points table: 15GB (-70%)
  • Daily backup cost: ~$0.75/day = $22.50/month
  • Savings: $52.50/month on backups

S3 Costs (20M points)

Storage:

  • Compressed size: ~0.5GB (average 25 bytes per raw_data compressed)
  • S3 Standard: $0.023/GB/month
  • Cost: 500MB × $0.023 = $0.012/month (~negligible)

Requests:

  • Monthly archival: ~50 PUT requests (50 users × 1 new month)
  • User exports: ~100 GET requests/month
  • PUT: $0.005/1000 = $0.0003/month
  • GET: $0.0004/1000 = $0.00004/month

Data Transfer:

  • Export downloads: ~10GB/month (100 exports × 100MB avg)
  • First 10GB free, then $0.09/GB
  • Cost: $0/month (under free tier)

Total S3 Cost: ~$0.02/month (essentially free!)

Net Savings

Total Monthly Savings: $52.50 - $0.02 = $52.48/month = $629.76/year

Plus:

  • Faster queries → better UX
  • Faster backups → reduced downtime risk
  • Room for growth → can add 20M more points before hitting old size

FAQ

Q: How do I enable/disable archival?

A: Control via environment variable:

# Enable archival
ARCHIVE_RAW_DATA=true

# Disable archival (default)
ARCHIVE_RAW_DATA=false
# or simply don't set the variable

When disabled:

  • Archive jobs return immediately without processing
  • No S3 operations occur
  • No costs incurred
  • Existing archived data remains accessible
  • Can be re-enabled anytime by setting to "true"

Deployment recommendation:

  1. Deploy code with ARCHIVE_RAW_DATA=false
  2. Test on staging
  3. Enable on production: ARCHIVE_RAW_DATA=true
  4. Monitor for 1 week
  5. If issues arise, set back to false immediately

Q: What happens if S3 is down?

A: The app continues working with graceful degradation:

  • New imports work (raw_data stored in DB)
  • Existing non-archived points work normally
  • Archived points return {} from point.raw_data_with_archive (error logged to Sentry)
  • Exports may be incomplete (raw_data missing for archived points)

Q: Can I switch storage backends later?

A: Yes! ActiveStorage handles this:

# 1. Configure new backend in config/storage.yml
# 2. Set STORAGE_BACKEND=new_backend
# 3. Copy existing blobs to the new service (Rails ships no one-shot migration
#    task; use ActiveStorage's mirror service or a one-off script to copy blobs)

Q: How do I restore data for a specific migration?

A:

# 1. Temporarily restore to cache (1 hour)
rake points:raw_data:restore_temporary[123,2024,6]

# 2. Run your migration immediately
rails runner "
  Point.where(user_id: 123, timestamp_year: 2024, timestamp_month: 6).find_each do |point|
    # point.raw_data_with_archive now returns the archived data from cache
    point.fix_coordinates_from_raw_data
    point.save!
  end
"

# 3. Cache expires automatically in 1 hour

Q: What if archive job fails?

A: Designed for safety:

  • Advisory locks prevent duplicate processing
  • Transactions ensure atomic DB updates
  • Failed uploads don't mark points as archived
  • Job retries automatically (Sidekiq)
  • Sentry captures exceptions

Q: Can I archive specific users only?

A: Yes, modify the archiver:

# Archive only specific users
Points::RawData::Archiver.new.call(user_ids: [1, 2, 3])

# Or exclude users
Points::RawData::Archiver.new.call(exclude_user_ids: [123])
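
Note that #call takes no arguments in the Archiver as specified earlier; a sketch of the required modification:

# Points::RawData::Archiver: optional user filtering (sketch)
def call(user_ids: nil, exclude_user_ids: nil)
  return @stats unless archival_enabled?

  scope = archivable_months
  scope = scope.where(user_id: user_ids) if user_ids
  scope = scope.where.not(user_id: exclude_user_ids) if exclude_user_ids

  scope.each { |month_data| process_month(month_data) }
  @stats
end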

Q: How do I monitor cache hit rates?

A:

# Add debug logging inside Archivable#fetch_archived_raw_data
def fetch_archived_raw_data
  cached = check_temporary_restore_cache
  if cached
    Rails.logger.debug("Cache hit: temporary restore for point #{id}")
    return cached
  end

  Rails.cache.fetch(archive_cache_key, expires_in: 1.day) do
    Rails.logger.debug("Cache miss: downloading archive for point #{id}")
    fetch_from_archive_file
  end
rescue StandardError => e
  handle_archive_fetch_error(e)
end

Appendix A: What is JSONL?

JSONL stands for JSON Lines (also called newline-delimited JSON or ndjson).

Definition

JSONL is a text format where each line is a separate, complete, valid JSON object. Unlike regular JSON which wraps everything in an array or object, JSONL stores multiple JSON objects separated by newlines.

Format Comparison

Regular JSON (Array):

[
  {"id": 1, "name": "Alice", "age": 30},
  {"id": 2, "name": "Bob", "age": 25},
  {"id": 3, "name": "Charlie", "age": 35}
]

JSONL (JSON Lines):

{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 25}
{"id": 3, "name": "Charlie", "age": 35}

No commas, no brackets—just one JSON object per line.

Why We Use JSONL for Archives

1. Memory-Efficient Streaming

# Regular JSON: Must load entire array into memory
data = JSON.parse(File.read('huge_file.json'))  # ❌ Could be gigabytes!
data.each { |item| process(item) }

# JSONL: Process line-by-line
File.foreach('huge_file.jsonl') do |line|  # ✅ Only one line in memory
  item = JSON.parse(line)
  process(item)
end

For a month with 100,000 points:

  • JSON: Must hold all 100k objects in memory (~200MB+)
  • JSONL: Process one at a time (~2KB per point in memory)

2. Fast Searching Without Full Parse

# Find one specific point without parsing everything
def find_point_raw_data(archive_file, point_id)
  Zlib::GzipReader.new(archive_file).each_line do |line|
    data = JSON.parse(line)
    return data['raw_data'] if data['id'] == point_id  # Found it! Stop reading.
  end
end

With regular JSON, you'd have to:

  1. Download entire file
  2. Parse entire JSON array
  3. Search through array
  4. Return result

With JSONL, you:

  1. Stream file line by line
  2. Parse only lines until found
  3. Stop immediately (could be after 10 lines instead of 100k!)

3. Perfect for Append-Only Architecture

# June 1st: Create initial archive
echo '{"id":1,"raw_data":{...}}' >> raw_data.jsonl
echo '{"id":2,"raw_data":{...}}' >> raw_data.jsonl
# ... 1000 lines

# July 1st: User imports 50 retrospective points
# Just append new lines!
echo '{"id":1001,"raw_data":{...}}' >> raw_data.jsonl
echo '{"id":1002,"raw_data":{...}}' >> raw_data.jsonl
# ... 50 more lines

# Done! No need to download, parse, merge, and re-upload.

With regular JSON, you'd need to:

  1. Download entire array
  2. Parse JSON
  3. Add new objects to array
  4. Re-serialize entire array
  5. Re-upload entire file

4. Excellent Compression

# JSONL compresses very well with gzip
raw_data.jsonl          # 10 MB (uncompressed)
raw_data.jsonl.gz       # 1 MB (compressed)  # 90% reduction!

Each line has similar structure, so gzip finds repeated patterns:

  • Same keys: "id", "raw_data", "lon", "lat", etc.
  • Same formats: numbers, nested objects
  • Repetitive whitespace

Our Implementation Examples

Writing Archive (Archiver Service)

gz = Zlib::GzipWriter.new(io)

points.find_each(batch_size: 1000) do |point|
  # Each point becomes one JSONL line
  gz.puts({ id: point.id, raw_data: point.raw_data }.to_json)
end

# Result file (compressed):
# Line 1: {"id":123,"raw_data":{"lon":13.4,"lat":52.5,"accuracy":10}}
# Line 2: {"id":124,"raw_data":{"lon":13.5,"lat":52.6,"accuracy":12}}
# Line 3: {"id":125,"raw_data":{"lon":13.6,"lat":52.7,"accuracy":8}}
# ...

Reading Archive (Point Model)

gz = Zlib::GzipReader.new(compressed_file)

# Stream search - only parse lines until we find our point
gz.each_line do |line|
  data = JSON.parse(line)
  return data['raw_data'] if data['id'] == target_id  # Found! Done.
end

Restoring Archive (Restorer Service)

# Process entire archive line-by-line (memory efficient)
decompressed_content.each_line do |line|
  data = JSON.parse(line)

  Point.where(id: data['id']).update_all(
    raw_data: data['raw_data']
  )
end

Performance Comparison

| Operation | Regular JSON | JSONL | Winner |
|---|---|---|---|
| Archive 100k points | Load all 100k in memory | Process 1k batches | JSONL |
| Find 1 point | Parse entire 100k array | Stop after finding (avg 50k lines) | JSONL |
| Add 50 new points | Download, merge, re-upload | Append 50 lines | JSONL |
| Memory usage (100k points) | ~200 MB | ~2 MB | JSONL |
| Compression ratio | 60-70% | 85-95% | JSONL |
| Processing speed | 5-10 sec | 0.5-2 sec | JSONL |

Common Use Cases for JSONL

  • Log aggregation - Kibana, Logstash, Splunk
  • Big data - Apache Spark, Hadoop, BigQuery native support
  • Machine learning datasets - TensorFlow, PyTorch data pipelines
  • API streaming - Twitter API, Slack RTM API
  • Database exports - MongoDB export, PostgreSQL COPY
  • Our use case - Point data archives!

File Extensions

All are valid and recognized:

  • .jsonl - Official extension (JSONL)
  • .ndjson - Alternative (Newline-Delimited JSON)
  • .jsonl.gz - Compressed (what we use)
  • .ndjson.gz - Compressed alternative

Key Takeaway

JSONL = One JSON object per line

Perfect for our archive system because it enables:

  1. Stream processing - Low memory usage
  2. Fast searching - Stop when found
  3. Append-only - No merge needed
  4. Great compression - 90%+ size reduction
  5. Simple format - Easy to read/write/debug

It's essentially the difference between:

  • Phone book (JSON) - One big book, must open entire thing
  • Index cards (JSONL) - One card per entry, process individually

Appendix B: File Checklist

Files to Create (17)

Migrations (3):

  • db/migrate/YYYYMMDDHHMMSS_create_points_raw_data_archives.rb
  • db/migrate/YYYYMMDDHHMMSS_add_archival_columns_to_points.rb
  • db/migrate/YYYYMMDDHHMMSS_add_generated_timestamp_columns_to_points.rb

Models & Concerns (2):

  • app/models/points/raw_data_archive.rb (namespaced path required by Zeitwerk for Points::RawDataArchive)
  • app/models/concerns/archivable.rb NEW

Services (3):

  • app/services/points/raw_data/archiver.rb
  • app/services/points/raw_data/restorer.rb
  • app/services/points/raw_data/chunk_compressor.rb

Jobs (2):

  • app/jobs/points/raw_data/archive_job.rb
  • app/jobs/points/raw_data/re_archive_month_job.rb

Rake Tasks (1):

  • lib/tasks/points_raw_data.rake

Specs (6):

  • spec/models/points_raw_data_archive_spec.rb
  • spec/models/concerns/archivable_spec.rb NEW
  • spec/services/points/raw_data/archiver_spec.rb
  • spec/services/points/raw_data/restorer_spec.rb
  • spec/jobs/points/raw_data/archive_job_spec.rb
  • spec/lib/tasks/points_raw_data_rake_spec.rb

Files to Modify (6+)

Core Models:

  • app/models/point.rb - Add include Archivable (one line!)
  • app/models/user.rb - Add has_many :raw_data_archives relationship

Services:

  • app/services/users/export_data/points.rb - Batch load archives (see Phase 4)
  • app/services/imports/create.rb - Trigger re-archival after import
  • app/services/points/raw_data_lonlat_extractor.rb - Use raw_data_with_archive

Serializers (if needed):

  • app/serializers/export_serializer.rb - Use raw_data_with_archive
  • Other serializers that access raw_data (find with grep)

Config:

  • .env.template - Add archival configuration variable:
    # Raw Data Archival (Optional)
    ARCHIVE_RAW_DATA=false    # Set to "true" to enable archival (archives data 2+ months old)
    # Note: Requires STORAGE_BACKEND configured (s3, local, etc.)
    

Files NOT Modified (Already Configured!)

  • config/storage.yml - ActiveStorage already configured
  • config/initializers/aws.rb - S3 credentials already set
  • config/environments/production.rb - Storage backend already set
  • Gemfile - aws-sdk-s3 already included

Appendix C: User Model Integration

Archive Deletion Policy & Cascade Behavior

Requirement: When a user is deleted, all their raw_data archives (both database records and S3 files) must be deleted.

Implementation

User Model Addition:

# app/models/user.rb
class User < ApplicationRecord
  # ... existing associations ...
  has_many :raw_data_archives, class_name: 'Points::RawDataArchive', dependent: :destroy
  # ... rest of model ...
end

How Cascade Deletion Works

  1. User deletion triggered → user.destroy runs
  2. dependent: :destroy on association → Rails calls destroy on each Points::RawDataArchive record
  3. Archive model's has_one_attached :file → ActiveStorage's built-in callback triggers
  4. ActiveStorage purges blob → S3 file deleted via file.purge
  5. Archive record deleted → Database row removed

Automatic Cleanup Chain

user.destroy
  ↓
user.raw_data_archives.destroy_all
  ↓
archive.destroy (for each archive)
  ↓
ActiveStorage::Attachment callback fires
  ↓
active_storage_blobs.purge
  ↓
S3 file deleted
  ↓
active_storage_attachments row deleted
  ↓
active_storage_blobs row deleted
  ↓
points_raw_data_archives row deleted

GDPR Compliance

Complete data removal:

  • Database: points table records deleted (via has_many :points, dependent: :destroy)
  • Database: points_raw_data_archives table records deleted
  • Storage: All archive .jsonl.gz files in S3/local storage deleted
  • Database: ActiveStorage metadata (active_storage_blobs, active_storage_attachments) deleted

No manual intervention needed - Standard Rails cascade handles everything

Why No after_destroy Callback Needed

You might think you need this:

# ❌ NOT NEEDED
after_destroy :purge_raw_data_archives

def purge_raw_data_archives
  raw_data_archives.find_each { |a| a.file.purge }
end

But you don't because:

  1. ActiveStorage automatically purges attached files when the parent record is destroyed
  2. This is the same behavior as existing Import and Export models in Dawarich
  3. Adding manual purge would duplicate the work and potentially cause errors

Verification Test

# spec/models/user_spec.rb
RSpec.describe User, type: :model do
  describe 'archive deletion cascade' do
    let(:user) { create(:user) }
    let!(:archive) { create(:points_raw_data_archive, user: user) }

    before do
      # Attach a file to the archive
      archive.file.attach(
        io: StringIO.new('test data'),
        filename: 'test.jsonl.gz',
        content_type: 'application/gzip'
      )
    end

    it 'deletes archives and their S3 files when user is deleted' do
      blob_id = archive.file.blob.id

      expect {
        user.destroy
      }.to change(Points::RawDataArchive, :count).by(-1)
        .and change(ActiveStorage::Blob, :count).by(-1)

      expect(ActiveStorage::Blob.find_by(id: blob_id)).to be_nil
    end
  end
end

Notes

  • This pattern is consistent with existing Dawarich code (Import/Export models)
  • No special configuration needed for S3 deletion - ActiveStorage handles it
  • Works with any storage backend (S3, GCS, Azure, local disk)
  • Deletion is transactional - if user deletion fails, archives remain intact

Conclusion

This implementation plan provides a comprehensive, production-ready approach to archiving points.raw_data using ActiveStorage, with:

  • Consistency: Uses same patterns as existing Import/Export
  • Safety: Append-only, transactional, idempotent
  • Flexibility: Works with any ActiveStorage backend
  • Observability: Comprehensive rake tasks and monitoring
  • Reversibility: Full restore capabilities

Next Steps:

  1. Review plan with team
  2. Approve and prioritize
  3. Create GitHub issue with checklist
  4. Begin Week 1 implementation

Questions? Review the FAQ section or ask for clarification on specific components.