Points raw_data S3 Archival System Implementation Plan
Version: 1.2 Date: 2025-11-25 Last Updated: 2025-12-06 Status: Planning Phase
🔄 Version 1.2 Updates (2025-12-06)
Key Changes:
- ✅ Optional archival system - Controlled by the ARCHIVE_RAW_DATA environment variable (default: disabled)
- ✅ User model integration - Added has_many :raw_data_archives with cascade deletion
- ✅ GDPR compliance - Complete data removal on user deletion (DB + S3)
- ✅ Configuration section - New section documenting ENV vars and opt-in behavior
- ✅ Appendix C added - User Model Integration with cascade deletion details
Why These Changes:
- Safe deployment: Feature disabled by default, can be enabled gradually
- Data privacy: Full cascade deletion ensures GDPR compliance
- Operational flexibility: Can disable archival instantly if issues arise
- User deletion: Archives and S3 files automatically cleaned up
🔄 Version 1.1 Updates (2025-11-25)
Key Changes:
- ✅ No column accessor override - Uses a raw_data_with_archive method instead of overriding the raw_data column
- ✅ Archivable concern - Extracted all archival logic to a reusable Archivable concern
- ✅ Migration guide - Added comprehensive guide for updating existing code
- ✅ JSONL appendix - Added detailed explanation of JSONL format and why we use it
- ✅ Updated file checklist - Added concern files and additional modifications needed
Why These Changes:
- Avoids ActiveRecord column accessor conflicts
- Cleaner separation of concerns (Point model stays focused)
- Explicit method names (raw_data_with_archive vs raw_data)
- Reusable pattern for other models if needed
Executive Summary
Implement a system to archive the points.raw_data JSONB column using ActiveStorage, reducing the database size from 50GB+ to ~15-20GB while retaining the ability to restore data for migrations and fixes.
Key Benefits:
- 60-70% database size reduction (~30-35GB saved)
- 10-20% query performance improvement
- ~$52/month savings on backup/storage costs
- Zero data loss with append-only architecture
- Full restore capabilities via rake tasks
Table of Contents
- Architecture Overview
- Configuration ✨ NEW
- Why ActiveStorage?
- Implementation Phases
- Migration Guide for Existing Code ✨ NEW
- Rake Tasks for Restoration
- Deployment Strategy
- Monitoring & Success Metrics
- Rollback Plan
- Timeline
- Cost Analysis
- FAQ
- Appendix A: What is JSONL? ✨ NEW
- Appendix B: File Checklist
- Appendix C: User Model Integration ✨ NEW
Architecture Overview
Current State
┌─────────────────────────────┐
│ Points Table (50GB) │
│ │
│ - id │
│ - user_id │
│ - timestamp │
│ - lonlat (PostGIS) │
│ - raw_data (JSONB) ← 30GB │ ← Problem: 60% of table size
│ - ...other columns │
└─────────────────────────────┘
Target State
┌─────────────────────────────┐ ┌──────────────────────────────────┐
│ Points Table (15-20GB) │ │ Points::RawDataArchive Model │
│ │ │ │
│ - id │ │ - id │
│ - user_id │ │ - user_id │
│ - timestamp │ │ - year, month, chunk_number │
│ - lonlat (PostGIS) │ │ - point_count, checksum │
│ - raw_data (NULL) ← freed │ │ - metadata (JSONB) │
│ - raw_data_archived (bool) │────▶│ - has_one_attached :file │
│ - raw_data_archive_id │ │ │
│ - timestamp_year (gen) │ └──────────────┬───────────────────┘
│ - timestamp_month (gen) │ │
└─────────────────────────────┘ │ ActiveStorage
│
┌──────────────▼───────────────────┐
│ S3 / Local Storage │
│ │
│ raw_data_000001.jsonl.gz (1MB) │
│ raw_data_000002.jsonl.gz (50KB) │
│ ...append-only chunks │
└──────────────────────────────────┘
Data Flow
Archival (Monthly Cron):
1. Find months 2+ months old with unarchived points
2. Create Points::RawDataArchive record
3. Compress points to JSONL.gz
4. Attach via ActiveStorage (handles S3 upload)
5. Atomically: Mark points archived, NULL raw_data
Access (Lazy Loading):
1. point.raw_data_with_archive called
2. Check if archived → yes
3. Check cache (1 day TTL) → miss
4. Fetch from archive.file.blob.download
5. Cache result, return data
Restoration (Rake Task):
1. Find all archives for user/month
2. Download via ActiveStorage
3. Decompress, parse JSONL
4. Update points: restore raw_data, unmark archived
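The same flows in console form, using the classes and methods defined in Phases 2-3 below (IDs are placeholders):
# Archival (normally run by the monthly job)
Points::RawData::Archiver.new.call
# Lazy access: DB column first, then cache, then the archive file
point = Point.find(point_id)          # point_id is a placeholder
point.raw_data_with_archive
# Restoration (also wrapped by the rake tasks below)
Points::RawData::Restorer.new.restore_to_database(user_id, 2024, 6)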
Configuration
Environment Variables
The archival system is opt-in and controlled by a single environment variable:
# .env or environment
ARCHIVE_RAW_DATA=true # Enable/disable archival (default: false)
STORAGE_BACKEND=s3 # Already exists: s3, local, etc.
Important: If ARCHIVE_RAW_DATA is not set to "true", the entire archival system is disabled:
- Archive jobs won't run
- Archive service returns early
- No S3 costs incurred
- Existing code works unchanged
The archival lag period (2 months) is a constant in the code and can be modified if needed.
This allows gradual rollout and easy disabling if issues arise.
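A quick console check before enabling the cron; when the flag is unset, the Archiver defined in Phase 3 logs a notice and returns its empty stats hash without touching S3:
ENV['ARCHIVE_RAW_DATA']              # => nil (or "false") when disabled
Points::RawData::Archiver.new.call   # => { processed: 0, archived: 0, failed: 0 }, no S3 calls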
Why ActiveStorage?
Consistency with Existing Code
Dawarich already uses ActiveStorage for Import and Export models:
Existing Pattern (Import model):
class Import < ApplicationRecord
has_one_attached :file
end
# Usage in services
import.file.attach(io: File.open(path), filename: name)
content = import.file.blob.download
Our Pattern (Archive model):
class Points::RawDataArchive < ApplicationRecord
has_one_attached :file
end
# Usage in archiver
archive.file.attach(io: StringIO.new(compressed), filename: "...")
content = archive.file.blob.download
Benefits Over Direct S3
| Feature | Direct S3 | ActiveStorage |
|---|---|---|
| Backend flexibility | Manual config | Automatic via STORAGE_BACKEND env |
| Code consistency | New S3 client code | Same as Import/Export |
| Checksums | Manual implementation | Built-in via Blob |
| Local dev | Need MinIO/localstack | Works with local disk |
| Testing | Mock S3 clients | Use ActiveStorage test helpers |
| Cleanup | Manual delete calls | file.purge_later |
| Streaming | Custom chunking | Built-in chunked blob.download with a block |
No Additional Configuration Needed
# config/storage.yml - Already configured!
s3:
service: S3
access_key_id: <%= ENV.fetch("AWS_ACCESS_KEY_ID") %>
secret_access_key: <%= ENV.fetch("AWS_SECRET_ACCESS_KEY") %>
region: <%= ENV.fetch("AWS_REGION") %>
bucket: <%= ENV.fetch("AWS_BUCKET") %>
# config/environments/production.rb - Already configured!
config.active_storage.service = ENV.fetch('STORAGE_BACKEND', :local)
Result: Zero new infrastructure setup! ✅
Implementation Phases
Phase 1: Database Schema (3 Migrations)
Migration 1: Create Points::RawDataArchive Table
# db/migrate/YYYYMMDDHHMMSS_create_points_raw_data_archives.rb
class CreatePointsRawDataArchives < ActiveRecord::Migration[8.0]
def change
create_table :points_raw_data_archives do |t|
t.bigint :user_id, null: false
t.integer :year, null: false
t.integer :month, null: false
t.integer :chunk_number, null: false, default: 1
t.integer :point_count, null: false
t.string :point_ids_checksum, null: false
t.jsonb :metadata, default: {}, null: false
t.datetime :archived_at, null: false
t.timestamps
end
add_index :points_raw_data_archives, :user_id
add_index :points_raw_data_archives, [:user_id, :year, :month]
add_index :points_raw_data_archives, :archived_at
add_foreign_key :points_raw_data_archives, :users
end
end
Migration 2: Add Archival Columns to Points
# db/migrate/YYYYMMDDHHMMSS_add_archival_columns_to_points.rb
class AddArchivalColumnsToPoints < ActiveRecord::Migration[8.0]
def change
add_column :points, :raw_data_archived, :boolean, default: false, null: false
add_column :points, :raw_data_archive_id, :bigint, null: true
add_index :points, :raw_data_archived,
where: 'raw_data_archived = true',
name: 'index_points_on_archived_true'
add_index :points, :raw_data_archive_id
add_foreign_key :points, :points_raw_data_archives,
column: :raw_data_archive_id,
on_delete: :nullify # Don't delete points if archive deleted
end
end
Migration 3: Add Generated Timestamp Columns
# db/migrate/YYYYMMDDHHMMSS_add_generated_timestamp_columns_to_points.rb
class AddGeneratedTimestampColumnsToPoints < ActiveRecord::Migration[8.0]
def change
# Use PostgreSQL generated columns for automatic year/month extraction
# Cast via AT TIME ZONE 'UTC' so the expression is immutable (required for stored generated columns)
add_column :points, :timestamp_year, :virtual,
type: :integer,
as: "EXTRACT(YEAR FROM to_timestamp(timestamp) AT TIME ZONE 'UTC')::int",
stored: true
add_column :points, :timestamp_month, :virtual,
type: :integer,
as: "EXTRACT(MONTH FROM to_timestamp(timestamp) AT TIME ZONE 'UTC')::int",
stored: true
# Composite index for efficient archival queries
add_index :points, [:user_id, :timestamp_year, :timestamp_month, :raw_data_archived],
name: 'index_points_on_user_time_archived'
end
end
Why generated columns?
- No application code needed to maintain
- Automatically updated on insert/update
- Can be indexed for fast queries
- PostgreSQL calculates on write (not on read)
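For example, the archiver's month lookup becomes a plain equality query that can use the composite index, instead of computing EXTRACT(...) per row at query time (user ID and month below are placeholders):
Point.where(
  user_id: 123,
  timestamp_year: 2024,
  timestamp_month: 6,
  raw_data_archived: false
).count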
Phase 2: Models
Points::RawDataArchive Model
# app/models/points/raw_data_archive.rb
class Points::RawDataArchive < ApplicationRecord
belongs_to :user
has_many :points, foreign_key: :raw_data_archive_id, dependent: :nullify
has_one_attached :file
validates :year, :month, :chunk_number, :point_count, presence: true
validates :year, numericality: { greater_than: 1970, less_than: 2100 }
validates :month, numericality: { greater_than_or_equal_to: 1, less_than_or_equal_to: 12 }
validates :chunk_number, numericality: { greater_than: 0 }
validates :point_ids_checksum, presence: true
validate :file_must_be_attached, on: :update
scope :for_month, ->(user_id, year, month) {
where(user_id: user_id, year: year, month: month)
.order(:chunk_number)
}
scope :recent, -> { where('archived_at > ?', 30.days.ago) }
scope :old, -> { where('archived_at < ?', 1.year.ago) }
def month_display
Date.new(year, month, 1).strftime('%B %Y')
end
def filename
"raw_data_#{user_id}_#{year}_#{format('%02d', month)}_chunk#{format('%03d', chunk_number)}.jsonl.gz"
end
def size_mb
return 0 unless file.attached?
(file.blob.byte_size / 1024.0 / 1024.0).round(2)
end
private
def file_must_be_attached
errors.add(:file, 'must be attached') unless file.attached?
end
end
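Example console usage of the model helpers above (returned values are illustrative):
archive = Points::RawDataArchive.for_month(user.id, 2024, 6).first
archive.filename       # => "raw_data_42_2024_06_chunk001.jsonl.gz"
archive.month_display  # => "June 2024"
archive.size_mb        # => 1.2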
Archivable Concern
⚠️ Important: We use a concern instead of overriding the raw_data column accessor to avoid ActiveRecord conflicts.
# app/models/concerns/archivable.rb
module Archivable
extend ActiveSupport::Concern
included do
# Associations
belongs_to :raw_data_archive,
class_name: 'Points::RawDataArchive',
foreign_key: :raw_data_archive_id,
optional: true
# Scopes
scope :archived, -> { where(raw_data_archived: true) }
scope :not_archived, -> { where(raw_data_archived: false) }
scope :with_archived_raw_data, -> {
includes(raw_data_archive: { file_attachment: :blob })
}
end
# Main method: Get raw_data with fallback to archive
# Use this instead of point.raw_data when you need archived data
def raw_data_with_archive
# If raw_data is present in DB, use it
return raw_data if raw_data.present? || !raw_data_archived?
# Otherwise fetch from archive
fetch_archived_raw_data
end
# Alias for convenience (optional)
alias_method :archived_raw_data, :raw_data_with_archive
# Restore archived data back to database column
def restore_raw_data!(value)
update!(
raw_data: value,
raw_data_archived: false,
raw_data_archive_id: nil
)
end
# Cache key for long-term archive caching
def archive_cache_key
"raw_data:archive:#{self.class.name.underscore}:#{id}"
end
private
def fetch_archived_raw_data
# Check temporary restore cache first (for migrations)
cached = check_temporary_restore_cache
return cached if cached
# Check long-term cache (1 day TTL)
Rails.cache.fetch(archive_cache_key, expires_in: 1.day) do
fetch_from_archive_file
end
rescue StandardError => e
handle_archive_fetch_error(e)
end
def check_temporary_restore_cache
return nil unless respond_to?(:timestamp_year) && timestamp_year && timestamp_month
cache_key = "raw_data:temp:#{user_id}:#{timestamp_year}:#{timestamp_month}:#{id}"
Rails.cache.read(cache_key)
end
def fetch_from_archive_file
return {} unless raw_data_archive&.file&.attached?
# Download and search through JSONL
compressed_content = raw_data_archive.file.blob.download
io = StringIO.new(compressed_content)
gz = Zlib::GzipReader.new(io)
result = nil
gz.each_line do |line|
data = JSON.parse(line)
if data['id'] == id
result = data['raw_data']
break
end
end
gz.close
result || {}
end
def handle_archive_fetch_error(error)
Rails.logger.error(
"Failed to fetch archived raw_data for #{self.class.name} #{id}: #{error.message}"
)
Sentry.capture_exception(error) if defined?(Sentry)
{} # Graceful degradation
end
end
Point Model (Clean!)
# app/models/point.rb
class Point < ApplicationRecord
include Nearable
include Distanceable
include Archivable # ← All archival logic in concern
belongs_to :import, optional: true, counter_cache: true
belongs_to :visit, optional: true
belongs_to :user, counter_cache: true
belongs_to :country, optional: true
belongs_to :track, optional: true
validates :timestamp, :lonlat, presence: true
# ... rest of existing code ...
# Keep existing scope for query optimization
def self.without_raw_data
select(column_names - ['raw_data'])
end
# ... rest of existing methods ...
end
Usage:
# In services that need raw_data:
point.raw_data_with_archive # Gets from DB or archive
# Regular column access (doesn't check archive):
point.raw_data # May be NULL if archived
# For restoration:
point.restore_raw_data!(data)
Phase 3: Archive Services
Service: Points::RawData::Archiver
# app/services/points/raw_data/archiver.rb
class Points::RawData::Archiver
SAFE_ARCHIVE_LAG = 2.months
def initialize
@stats = { processed: 0, archived: 0, failed: 0 }
end
def call
unless archival_enabled?
Rails.logger.info('Raw data archival disabled (ARCHIVE_RAW_DATA != "true")')
return @stats
end
Rails.logger.info('Starting points raw_data archival...')
# .each rather than find_each: the relation is grouped, so batching by primary key does not apply
archivable_months.each do |month_data|
process_month(month_data)
end
Rails.logger.info("Archival complete: #{@stats}")
@stats
end
def archive_specific_month(user_id, year, month)
month_data = {
'user_id' => user_id,
'year' => year,
'month' => month
}
process_month(month_data)
end
private
def archival_enabled?
ENV['ARCHIVE_RAW_DATA'] == 'true'
end
def archivable_months
# Only months 2+ months old with unarchived points
safe_cutoff = Date.current.beginning_of_month - SAFE_ARCHIVE_LAG
Point.select(
'user_id',
'timestamp_year as year',
'timestamp_month as month',
'COUNT(*) as unarchived_count'
).where(raw_data_archived: false)
.where('to_timestamp(timestamp) < ?', safe_cutoff)
.group('user_id, timestamp_year, timestamp_month')
end
def process_month(month_data)
user_id = month_data['user_id']
year = month_data['year']
month = month_data['month']
lock_key = "archive_points:#{user_id}:#{year}:#{month}"
# Advisory lock (with_advisory_lock gem) prevents duplicate processing;
# with timeout_seconds: 0 it returns false immediately if the lock is already held
acquired = ActiveRecord::Base.with_advisory_lock(lock_key, timeout_seconds: 0) do
archive_month(user_id, year, month)
@stats[:processed] += 1
end
Rails.logger.info("Skipping #{lock_key} - already locked") unless acquired
rescue StandardError => e
Rails.logger.error("Archive failed for #{user_id}/#{year}/#{month}: #{e.message}")
Sentry.capture_exception(e) if defined?(Sentry)
@stats[:failed] += 1
end
def archive_month(user_id, year, month)
# Find unarchived points for this month
points = Point.where(
user_id: user_id,
timestamp_year: year,
timestamp_month: month,
raw_data_archived: false
).where.not(raw_data: nil) # Skip already-NULLed points
return if points.empty?
point_ids = points.pluck(:id)
Rails.logger.info("Archiving #{point_ids.count} points for user #{user_id}, #{year}-#{sprintf('%02d', month)}")
# Create archive chunk
archive = create_archive_chunk(user_id, year, month, points, point_ids)
# Atomically mark points and NULL raw_data
Point.transaction do
Point.where(id: point_ids).update_all(
raw_data_archived: true,
raw_data_archive_id: archive.id,
raw_data: nil # Reclaim space!
)
end
@stats[:archived] += point_ids.count
Rails.logger.info("✓ Archived chunk #{archive.chunk_number} (#{archive.size_mb} MB)")
end
def create_archive_chunk(user_id, year, month, points, point_ids)
# Determine chunk number (append-only)
chunk_number = Points::RawDataArchive
.where(user_id: user_id, year: year, month: month)
.maximum(:chunk_number).to_i + 1
# Compress points data
compressed_data = Points::RawData::ChunkCompressor.new(points).compress
# Create archive record
archive = Points::RawDataArchive.create!(
user_id: user_id,
year: year,
month: month,
chunk_number: chunk_number,
point_count: point_ids.count,
point_ids_checksum: calculate_checksum(point_ids),
archived_at: Time.current,
metadata: {
format_version: 1,
compression: 'gzip',
archived_by: 'Points::RawData::Archiver'
}
)
# Attach compressed file via ActiveStorage
filename = "raw_data_#{user_id}_#{year}_#{sprintf('%02d', month)}_chunk#{sprintf('%03d', chunk_number)}.jsonl.gz"
archive.file.attach(
io: StringIO.new(compressed_data),
filename: filename,
content_type: 'application/gzip'
)
archive
end
def calculate_checksum(point_ids)
Digest::SHA256.hexdigest(point_ids.sort.join(','))
end
end
Helper: Points::RawData::ChunkCompressor
# app/services/points/raw_data/chunk_compressor.rb
class Points::RawData::ChunkCompressor
def initialize(points_relation)
@points = points_relation
end
def compress
io = StringIO.new
gz = Zlib::GzipWriter.new(io)
# Stream points to avoid memory issues with large months
@points.select(:id, :raw_data).find_each(batch_size: 1000) do |point|
# Write as JSONL (one JSON object per line)
gz.puts({ id: point.id, raw_data: point.raw_data }.to_json)
end
gz.close
io.string # Returns compressed bytes
end
end
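A quick round-trip check in a console: the compressor's output can be read back line by line with Zlib::GzipReader, the same pattern the Restorer and the Archivable concern use (user ID and month are placeholders):
points = Point.where(user_id: 123, timestamp_year: 2024, timestamp_month: 6)
compressed = Points::RawData::ChunkCompressor.new(points).compress
Zlib::GzipReader.new(StringIO.new(compressed)).each_line do |line|
  row = JSON.parse(line)
  puts "#{row['id']}: #{row['raw_data'].inspect}"
end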
Service: Points::RawData::Restorer
# app/services/points/raw_data/restorer.rb
class Points::RawData::Restorer
def restore_to_database(user_id, year, month)
archives = Points::RawDataArchive.for_month(user_id, year, month)
raise "No archives found for user #{user_id}, #{year}-#{month}" if archives.empty?
Rails.logger.info("Restoring #{archives.count} archives to database...")
Point.transaction do
archives.each do |archive|
restore_archive_to_db(archive)
end
end
Rails.logger.info("✓ Restored #{archives.sum(:point_count)} points")
end
def restore_to_memory(user_id, year, month)
archives = Points::RawDataArchive.for_month(user_id, year, month)
raise "No archives found for user #{user_id}, #{year}-#{month}" if archives.empty?
Rails.logger.info("Loading #{archives.count} archives into cache...")
cache_key_prefix = "raw_data:temp:#{user_id}:#{year}:#{month}"
count = 0
archives.each do |archive|
count += restore_archive_to_cache(archive, cache_key_prefix)
end
Rails.logger.info("✓ Loaded #{count} points into cache (expires in 1 hour)")
end
def restore_all_for_user(user_id)
archives = Points::RawDataArchive.where(user_id: user_id)
.select(:year, :month)
.distinct
.order(:year, :month)
Rails.logger.info("Restoring #{archives.count} months for user #{user_id}...")
archives.each do |archive|
restore_to_database(user_id, archive.year, archive.month)
end
Rails.logger.info("✓ Complete user restore finished")
end
private
def restore_archive_to_db(archive)
decompressed = download_and_decompress(archive)
decompressed.each_line do |line|
data = JSON.parse(line)
Point.where(id: data['id']).update_all(
raw_data: data['raw_data'],
raw_data_archived: false,
raw_data_archive_id: nil
)
end
end
def restore_archive_to_cache(archive, cache_key_prefix)
decompressed = download_and_decompress(archive)
count = 0
decompressed.each_line do |line|
data = JSON.parse(line)
Rails.cache.write(
"#{cache_key_prefix}:#{data['id']}",
data['raw_data'],
expires_in: 1.hour
)
count += 1
end
count
end
def download_and_decompress(archive)
# Download via ActiveStorage
compressed_content = archive.file.blob.download
# Decompress
io = StringIO.new(compressed_content)
gz = Zlib::GzipReader.new(io)
content = gz.read
gz.close
content
rescue StandardError => e
Rails.logger.error("Failed to download/decompress archive #{archive.id}: #{e.message}")
raise
end
end
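Console equivalents of the rake tasks defined later in this plan (user ID, year and month are placeholders):
restorer = Points::RawData::Restorer.new
restorer.restore_to_memory(123, 2024, 6)     # temporary: cache entries with a 1-hour TTL
restorer.restore_to_database(123, 2024, 6)   # permanent: writes raw_data back to the points table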
Phase 4: Export Optimization
# app/services/users/export_data/points.rb (modify existing)
class Users::ExportData::Points
def call
# ... existing query code ...
result.filter_map do |row|
# ... existing code ...
# Handle archived raw_data
raw_data = if row['raw_data_archived']
fetch_raw_data_from_archive(row['raw_data_archive_id'], row['id'])
else
row['raw_data']
end
point_hash = {
# ... existing fields ...
'raw_data' => raw_data
}
# ... existing code ...
point_hash
end
end
private
# Cache downloaded archives to avoid re-downloading per point
def fetch_raw_data_from_archive(archive_id, point_id)
return {} if archive_id.nil?
@archive_cache ||= {}
unless @archive_cache[archive_id]
archive = Points::RawDataArchive.find(archive_id)
@archive_cache[archive_id] = parse_archive(archive)
end
@archive_cache[archive_id][point_id] || {}
end
def parse_archive(archive)
# Download once, parse all points
compressed = archive.file.blob.download
io = StringIO.new(compressed)
gz = Zlib::GzipReader.new(io)
result = {}
gz.each_line do |line|
data = JSON.parse(line)
result[data['id']] = data['raw_data']
end
gz.close
result
rescue StandardError => e
Rails.logger.error("Failed to parse archive #{archive.id}: #{e.message}")
{}
end
end
Phase 5: Background Jobs
# app/jobs/points/raw_data/archive_job.rb
class Points::RawData::ArchiveJob < ApplicationJob
queue_as :default
def perform
stats = Points::RawData::Archiver.new.call
Rails.logger.info("Archive job complete: #{stats}")
rescue StandardError => e
Rails.logger.error("Archive job failed: #{e.message}")
Sentry.capture_exception(e) if defined?(Sentry)
raise
end
end
# app/jobs/points/raw_data/re_archive_month_job.rb
class Points::RawData::ReArchiveMonthJob < ApplicationJob
queue_as :default
def perform(user_id, year, month)
Rails.logger.info("Re-archiving #{user_id}/#{year}/#{month} (retrospective import)")
Points::RawData::Archiver.new.archive_specific_month(user_id, year, month)
rescue StandardError => e
Rails.logger.error("Re-archive failed: #{e.message}")
Sentry.capture_exception(e) if defined?(Sentry)
raise
end
end
Trigger re-archival after imports:
# app/services/imports/create.rb (or wherever import completes)
class Imports::Create
def call
# ... existing import logic ...
# After successful import, check for retrospective points
check_for_archived_months_needing_update
end
private
def check_for_archived_months_needing_update
# Find months where we added points to already-archived data
affected_months = import.points
.where(raw_data_archived: true)
.select('DISTINCT timestamp_year, timestamp_month')
affected_months.each do |month|
# Queue job to create append-only chunk
Points::RawData::ReArchiveMonthJob.perform_later(
import.user_id,
month.timestamp_year,
month.timestamp_month
)
end
end
end
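Because chunks are append-only, a retrospective import simply produces an additional chunk for the same month; for_month returns them in chunk order (output below is illustrative):
Points::RawDataArchive.for_month(user_id, 2024, 6).pluck(:chunk_number, :point_count)
# => [[1, 10000], [2, 50]]   # original archive plus the retrospective chunk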
Migration Guide for Existing Code
After implementing the archival system, you'll need to update existing code that accesses point.raw_data.
Step 1: Find All Usages
# Find all places where raw_data is accessed
grep -r "\.raw_data" app/services app/models --include="*.rb" > raw_data_usages.txt
# Common locations:
# - app/services/points/raw_data_lonlat_extractor.rb
# - app/services/google_maps/*_importer.rb
# - app/services/users/export_data/points.rb
# - app/serializers/*_serializer.rb
Step 2: Decision Tree
For each usage, ask:
Question 1: Is this code creating/importing new points?
- ✅ Yes → Keep point.raw_data (data is in DB during import)
- ❌ No → Go to Question 2
Question 2: Does this code need to access potentially archived data?
- ✅ Yes → Change to point.raw_data_with_archive
- ❌ No → Keep point.raw_data (but add a comment explaining why)
Step 3: Update Common Services
Example 1: RawDataLonlatExtractor
Before:
# app/services/points/raw_data_lonlat_extractor.rb
class Points::RawDataLonlatExtractor
def extract_lonlat(point)
if point.raw_data.dig('activitySegment', 'waypointPath', 'waypoints', 0)
# ... extract coordinates ...
elsif point.raw_data['longitudeE7'] && point.raw_data['latitudeE7']
# ... extract coordinates ...
end
end
end
After:
# app/services/points/raw_data_lonlat_extractor.rb
class Points::RawDataLonlatExtractor
def extract_lonlat(point)
# Use raw_data_with_archive to support archived points
raw = point.raw_data_with_archive
if raw.dig('activitySegment', 'waypointPath', 'waypoints', 0)
# ... extract coordinates ...
elsif raw['longitudeE7'] && raw['latitudeE7']
# ... extract coordinates ...
end
end
end
Why: This service is called for coordinate fixes/migrations, which may need archived data.
Example 2: Importer Services
Keep as-is:
# app/services/google_maps/semantic_history_importer.rb
class GoogleMaps::SemanticHistoryImporter
def build_point_from_location(longitude:, latitude:, timestamp:, raw_data:, accuracy: nil)
{
longitude: longitude,
latitude: latitude,
timestamp: timestamp,
raw_data: raw_data # ← Keep as-is, we're CREATING points
# ...
}
end
end
Why: Import services create new points, so raw_data will be in the database.
Example 3: Export Service
Before:
# app/services/users/export_data/points.rb
class Users::ExportData::Points
def call
points_sql = <<-SQL
SELECT p.id, p.raw_data, ...
FROM points p
WHERE p.user_id = $1
SQL
result = ActiveRecord::Base.connection.exec_query(points_sql, 'Points Export', [user.id])
result.map do |row|
{
'raw_data' => row['raw_data'], # ← Problem: may be NULL if archived
# ...
}
end
end
end
After (Option A - Use concern method):
class Users::ExportData::Points
def call
# Fetch points with archive association eager-loaded
points = user.points.with_archived_raw_data.order(:id)
points.map do |point|
{
'raw_data' => point.raw_data_with_archive, # ← Handles archived data
# ...
}
end
end
end
After (Option B - Batch fetch archives, see Phase 4 in plan):
# Already implemented in plan - caches downloaded archives
Example 4: Serializers
Before:
# app/serializers/export_serializer.rb
class ExportSerializer
def serialize(point)
{
id: point.id,
raw_data: point.raw_data, # ← May be NULL
# ...
}
end
end
After:
class ExportSerializer
def serialize(point)
{
id: point.id,
raw_data: point.raw_data_with_archive, # ← Fetches from archive if needed
# ...
}
end
end
Step 4: Testing Your Changes
# spec/services/points/raw_data_lonlat_extractor_spec.rb
RSpec.describe Points::RawDataLonlatExtractor do
context 'with archived raw_data' do
let(:archive) { create(:points_raw_data_archive, user: user, year: 2024, month: 6) }
let(:point) { create(:point, user: user, raw_data: nil, raw_data_archived: true, raw_data_archive_id: archive.id) }
before do
# Mock archive content
allow(archive.file.blob).to receive(:download).and_return(
gzip_compress({ id: point.id, raw_data: { 'lon' => 13.4, 'lat' => 52.5 } }.to_json)
)
end
it 'extracts coordinates from archived raw_data' do
service = described_class.new(point)
service.call
expect(point.reload.longitude).to be_within(0.001).of(13.4)
expect(point.reload.latitude).to be_within(0.001).of(52.5)
end
end
end
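The gzip_compress helper used in the spec above is not defined anywhere in this plan; a minimal spec-support sketch (name and location are assumptions) could look like this:
# spec/support/gzip_helpers.rb (hypothetical)
module GzipHelpers
  def gzip_compress(string)
    io = StringIO.new
    gz = Zlib::GzipWriter.new(io)
    gz.puts(string)  # one newline-terminated JSONL line
    gz.close
    io.string
  end
end
# Include via: RSpec.configure { |config| config.include GzipHelpers }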
Step 5: Gradual Rollout Strategy
- Week 1: Update services, add tests
- Week 2: Deploy changes (before archival starts)
- Week 3: Start archival (code already handles it)
This ensures services work with both:
- Points with raw_data in the DB (current state)
- Points with raw_data archived (future state)
Common Patterns Summary
| Code Location | Change? | Reason |
|---|---|---|
| Importers (creating points) | ❌ No | raw_data is in DB during import |
| RawDataLonlatExtractor | ✅ Yes | Used for fixes/migrations |
| Export services | ✅ Yes | Users export all their data |
| Serializers for API | ✅ Yes | May serialize archived points |
| Display views | ✅ Yes | May show archived points |
| Background jobs (processing new imports) | ❌ No | Processing fresh data |
| Data migrations | ✅ Yes | May process old data |
Rake Tasks for Restoration
# lib/tasks/points_raw_data.rake
namespace :points do
namespace :raw_data do
desc 'Restore raw_data from archive to database for a specific month'
task :restore, [:user_id, :year, :month] => :environment do |t, args|
validate_args!(args)
user_id = args[:user_id].to_i
year = args[:year].to_i
month = args[:month].to_i
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts " Restoring raw_data to DATABASE"
puts " User: #{user_id} | Month: #{year}-#{sprintf('%02d', month)}"
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts ""
restorer = Points::RawData::Restorer.new
restorer.restore_to_database(user_id, year, month)
puts ""
puts "✓ Restoration complete!"
puts ""
puts "Points in #{year}-#{month} now have raw_data in database."
puts "Run VACUUM ANALYZE points; to update statistics."
end
desc 'Restore raw_data to memory/cache temporarily (for data migrations)'
task :restore_temporary, [:user_id, :year, :month] => :environment do |t, args|
validate_args!(args)
user_id = args[:user_id].to_i
year = args[:year].to_i
month = args[:month].to_i
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts " Loading raw_data into CACHE (temporary)"
puts " User: #{user_id} | Month: #{year}-#{sprintf('%02d', month)}"
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts ""
puts "Data will be available for 1 hour via Point.raw_data accessor"
puts ""
restorer = Points::RawData::Restorer.new
restorer.restore_to_memory(user_id, year, month)
puts ""
puts "✓ Cache loaded successfully!"
puts ""
puts "You can now run your data migration."
puts "Example:"
puts " rails runner \"Point.where(user_id: #{user_id}, timestamp_year: #{year}, timestamp_month: #{month}).find_each { |p| p.fix_coordinates_from_raw_data }\""
puts ""
puts "Cache will expire in 1 hour automatically."
end
desc 'Restore all archived raw_data for a user'
task :restore_all, [:user_id] => :environment do |t, args|
raise 'Usage: rake points:raw_data:restore_all[user_id]' unless args[:user_id]
user_id = args[:user_id].to_i
user = User.find(user_id)
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts " Restoring ALL archives for user"
puts " #{user.email} (ID: #{user_id})"
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts ""
archives = Points::RawDataArchive.where(user_id: user_id)
.select(:year, :month)
.distinct
.order(:year, :month)
puts "Found #{archives.count} months to restore"
puts ""
archives.each_with_index do |archive, idx|
puts "[#{idx + 1}/#{archives.count}] Restoring #{archive.year}-#{sprintf('%02d', archive.month)}..."
restorer = Points::RawData::Restorer.new
restorer.restore_to_database(user_id, archive.year, archive.month)
end
puts ""
puts "✓ All archives restored for user #{user_id}!"
end
desc 'Show archive statistics'
task status: :environment do
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts " Points raw_data Archive Statistics"
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts ""
total_archives = Points::RawDataArchive.count
total_points = Point.count
archived_points = Point.where(raw_data_archived: true).count
percentage = total_points > 0 ? (archived_points.to_f / total_points * 100).round(2) : 0
puts "Archives: #{total_archives}"
puts "Points archived: #{archived_points} / #{total_points} (#{percentage}%)"
puts ""
# Storage size via ActiveStorage
total_blob_size = ActiveStorage::Blob
.joins("INNER JOIN active_storage_attachments ON active_storage_attachments.blob_id = active_storage_blobs.id")
.where("active_storage_attachments.record_type = 'Points::RawDataArchive'")
.sum(:byte_size)
puts "Storage used: #{ActiveSupport::NumberHelper.number_to_human_size(total_blob_size)}"
puts ""
# Recent activity
recent = Points::RawDataArchive.where('archived_at > ?', 7.days.ago).count
puts "Archives created last 7 days: #{recent}"
puts ""
# Top users
puts "Top 10 users by archive count:"
puts "─────────────────────────────────────────────────"
Points::RawDataArchive.group(:user_id)
.select('user_id, COUNT(*) as archive_count, SUM(point_count) as total_points')
.order('archive_count DESC')
.limit(10)
.each_with_index do |stat, idx|
user = User.find(stat.user_id)
puts "#{idx + 1}. #{user.email.ljust(30)} #{stat.archive_count.to_s.rjust(3)} archives, #{stat.total_points.to_s.rjust(8)} points"
end
puts ""
end
desc 'Verify archive integrity for a month'
task :verify, [:user_id, :year, :month] => :environment do |t, args|
validate_args!(args)
user_id = args[:user_id].to_i
year = args[:year].to_i
month = args[:month].to_i
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts " Verifying Archives"
puts " User: #{user_id} | Month: #{year}-#{sprintf('%02d', month)}"
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts ""
archives = Points::RawDataArchive.for_month(user_id, year, month)
if archives.empty?
puts "No archives found."
exit
end
all_ok = true
archives.each do |archive|
print "Chunk #{archive.chunk_number}: "
# Check file attached
unless archive.file.attached?
puts "✗ ERROR - File not attached!"
all_ok = false
next
end
# Download and count
begin
compressed = archive.file.blob.download
io = StringIO.new(compressed)
gz = Zlib::GzipReader.new(io)
actual_count = 0
gz.each_line { actual_count += 1 }
gz.close
if actual_count == archive.point_count
puts "✓ OK (#{actual_count} points, #{archive.size_mb} MB)"
else
puts "✗ MISMATCH - Expected #{archive.point_count}, found #{actual_count}"
all_ok = false
end
rescue => e
puts "✗ ERROR - #{e.message}"
all_ok = false
end
end
puts ""
if all_ok
puts "✓ All archives verified successfully!"
else
puts "✗ Some archives have issues. Please investigate."
end
end
desc 'Run initial archival for old data (safe to re-run)'
task initial_archive: :environment do
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts " Initial Archival (2+ months old data)"
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts ""
puts "This will archive points.raw_data for months 2+ months old."
puts "This is safe to run multiple times (idempotent)."
puts ""
print "Continue? (y/N): "
response = $stdin.gets.chomp.downcase
unless response == 'y'
puts "Cancelled."
exit
end
puts ""
stats = Points::RawData::Archiver.new.call
puts ""
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts " Archival Complete"
puts "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
puts ""
puts "Months processed: #{stats[:processed]}"
puts "Points archived: #{stats[:archived]}"
puts "Failures: #{stats[:failed]}"
puts ""
if stats[:archived] > 0
puts "Next steps:"
puts "1. Verify a sample: rake points:raw_data:verify[user_id,year,month]"
puts "2. Check stats: rake points:raw_data:status"
puts "3. (Optional) Reclaim space: VACUUM FULL points; (during maintenance)"
end
end
end
end
def validate_args!(args)
unless args[:user_id] && args[:year] && args[:month]
raise 'Usage: rake points:raw_data:TASK[user_id,year,month]'
end
end
Deployment Strategy
Phase 1: Deploy Non-Breaking Changes (Week 1)
Goal: Add infrastructure without using it
# 1. Deploy migrations
rails db:migrate
# 2. Verify schema
rails db:schema:dump
git diff db/schema.rb
# 3. Check indexes created
psql -c "\d points" | grep -i index
psql -c "\d points_raw_data_archives" | grep -i index
# 4. Deploy code (models, services inactive)
git push production main
# 5. Monitor for 24 hours
# - Check database performance
# - Check for any errors related to new columns
# - Verify generated columns populate correctly
Rollback: simply run rails db:rollback STEP=3
Phase 2: Test on Small Dataset (Week 2)
Goal: Validate archival works end-to-end
# Pick a test user with old data
# Example: User 123 with data from 2022
# 1. Manual archive
rails runner "Points::RawData::Archiver.new.archive_specific_month(123, 2022, 1)"
# 2. Verify archive created
rails runner "puts Points::RawDataArchive.where(user_id: 123, year: 2022, month: 1).inspect"
# 3. Verify ActiveStorage blob exists
rails runner "archive = Points::RawDataArchive.find_by(user_id: 123, year: 2022, month: 1); puts archive.file.attached?"
# 4. Verify points marked archived
rails runner "puts Point.where(user_id: 123, timestamp_year: 2022, timestamp_month: 1, raw_data_archived: true).count"
# 5. Test lazy loading
rails runner "point = Point.where(user_id: 123, timestamp_year: 2022, timestamp_month: 1).first; puts point.raw_data.inspect"
# 6. Verify integrity
rake points:raw_data:verify[123,2022,1]
# 7. Test restore
rake points:raw_data:restore_temporary[123,2022,1]
# 8. Clean up test
# (Leave archived for continued testing)
Expected Results:
- Archive file created in S3/local storage
- Points have raw_data = NULL in the DB
- point.raw_data_with_archive still returns data (from cache/S3)
- No errors in logs
Phase 3: Gradual Rollout (Weeks 3-5)
Goal: Archive progressively older data, monitoring each step
Week 3: Archive 3+ Years Old
# Safety: Very old data, rarely accessed
rake points:raw_data:initial_archive
# Monitor:
# - Database size (should decrease)
# - S3 storage (should increase)
# - Query performance (should improve)
# - Error logs (should be empty)
# Check stats
rake points:raw_data:status
Week 4: Archive 2-3 Years Old
# Adjust threshold temporarily
# In Points::RawData::Archiver, change SAFE_ARCHIVE_LAG to 2.years
rake points:raw_data:initial_archive
# Monitor same metrics
Week 5: Archive 1-2 Years Old
# Adjust threshold to 1.year
rake points:raw_data:initial_archive
# Monitor same metrics
Week 6: Enable Monthly Cron
# crontab entry (or the sidekiq-cron equivalent sketched below)
0 2 1 * * cd /app && bin/rails runner "Points::RawData::ArchiveJob.perform_later"
# 2 AM on the 1st of each month
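If the deployment schedules jobs through sidekiq-cron rather than system cron, the equivalent entry can be registered like this (a sketch; assumes the sidekiq-cron gem is available):
Sidekiq::Cron::Job.create(
  name: 'Monthly raw_data archival',
  cron: '0 2 1 * *',
  class: 'Points::RawData::ArchiveJob'
)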
Phase 4: Reclaim Space (Week 7)
Goal: Actually reclaim disk space from NULLed raw_data
-- During maintenance window (low traffic period)
-- 1. Check current table size
SELECT
pg_size_pretty(pg_total_relation_size('points')) as total_size,
pg_size_pretty(pg_relation_size('points')) as table_size,
pg_size_pretty(pg_indexes_size('points')) as indexes_size;
-- 2. Vacuum full (can take hours, locks table!)
-- IMPORTANT: This locks the table. Do during maintenance!
VACUUM FULL points;
-- 3. Reindex
REINDEX TABLE points;
-- 4. Update statistics
ANALYZE points;
-- 5. Check new size
SELECT
pg_size_pretty(pg_total_relation_size('points')) as total_size,
pg_size_pretty(pg_relation_size('points')) as table_size,
pg_size_pretty(pg_indexes_size('points')) as indexes_size;
Alternative (No Downtime):
-- Use the pg_repack extension if available (run from the shell, not psql):
-- pg_repack -d dawarich_production -t points
-- Or create new table, copy data, swap
CREATE TABLE points_new (LIKE points INCLUDING ALL);
INSERT INTO points_new SELECT * FROM points;
-- ... swap tables atomically
Monitoring & Success Metrics
Database Metrics
-- Query to monitor archival progress
SELECT
COUNT(*) FILTER (WHERE raw_data_archived = false) as not_archived,
COUNT(*) FILTER (WHERE raw_data_archived = true) as archived,
COUNT(*) as total,
ROUND(100.0 * COUNT(*) FILTER (WHERE raw_data_archived = true) / COUNT(*), 2) as archived_percentage
FROM points;
-- Table size over time
SELECT
pg_size_pretty(pg_total_relation_size('points')) as total_size,
pg_size_pretty(pg_relation_size('points')) as table_size;
-- Average row size (sampled from 10,000 rows)
SELECT
pg_size_pretty(AVG(pg_column_size(p.*))::bigint) as avg_row_size
FROM (SELECT * FROM points LIMIT 10000) p;
Application Metrics
# config/initializers/prometheus.rb (if using Prometheus)
# Archive operations
archive_operations = Prometheus::Client::Counter.new(
:points_raw_data_archives_total,
docstring: 'Total number of archive operations',
labels: [:status] # success, failure
)
# Archived points
archived_points_total = Prometheus::Client::Gauge.new(
:points_raw_data_archived_count,
docstring: 'Number of points with archived raw_data'
)
# Storage size
archive_storage_bytes = Prometheus::Client::Gauge.new(
:points_raw_data_archive_storage_bytes,
docstring: 'Total storage used by archives'
)
# Cache hit rate
raw_data_cache_hits = Prometheus::Client::Counter.new(
:points_raw_data_cache_hits_total,
docstring: 'Cache hits for raw_data access',
labels: [:cache_type] # temporary, long_term, miss
)
Success Criteria
| Metric | Baseline | Target | Alert If |
|---|---|---|---|
| Database size | 50GB | 15-20GB | > 25GB |
| Query perf (p95) | 200ms | 160-180ms | > 220ms |
| Archive success rate | N/A | > 99% | < 95% |
| Cache hit rate | N/A | > 80% | < 60% |
| Archive storage size | 0 GB | ~0.5-1 GB | > 5 GB |
| Export time (with archived) | 30s | < 36s | > 45s |
Rollback Plan
Level 1: Stop Archival (No Data Loss)
# 1. Disable cron job
# Comment out in config/schedule.yml or crontab
# 2. Stop any running archive jobs
# Via Sidekiq dashboard or:
rails runner "Sidekiq::Queue.new('default').each(&:delete) if Sidekiq::Queue.new('default').map(&:klass).include?('Points::RawData::ArchiveJob')"
# 3. Monitor - system still works, no new archival
Level 2: Restore Data (Reversible)
# Restore all archived data back to database
rake points:raw_data:restore_all[user_id] # Per user, or:
# Restore all users (can take hours)
rails runner "
User.find_each do |user|
puts \"Restoring user \#{user.id}...\"
Points::RawData::Restorer.new.restore_all_for_user(user.id)
end
"
# Verify restoration
rails runner "puts Point.where(raw_data_archived: true).count" # Should be 0
Level 3: Remove System (Nuclear)
# 1. Ensure all data restored (Level 2)
# 2. Remove foreign keys and indexes
rails dbconsole
# DROP INDEX IF EXISTS index_points_on_archived_true;
# DROP INDEX IF EXISTS index_points_on_user_time_archived;
# ALTER TABLE points DROP CONSTRAINT IF EXISTS fk_rails_points_raw_data_archives;
# 3. Rollback migrations
rails db:migrate:down VERSION=YYYYMMDDHHMMSS # timestamp_columns
rails db:migrate:down VERSION=YYYYMMDDHHMMSS # archival_columns
rails db:migrate:down VERSION=YYYYMMDDHHMMSS # create_archives_table
# 4. Delete ActiveStorage blobs
rails runner "
Points::RawDataArchive.find_each do |archive|
archive.file.purge
archive.destroy
end
"
# 5. Remove code
git revert <commit-sha>
git push production main
# 6. VACUUM to reclaim space
psql -d dawarich_production -c "VACUUM FULL points;"
Timeline
Week 1: Foundation
- Mon-Tue: Create migrations, deploy
- Wed-Thu: Implement models, basic services
- Fri: Code review, tests
Week 2: Core Services
- Mon-Tue: Complete Archiver service
- Wed: Complete Restorer service
- Thu: Export optimization
- Fri: Background jobs
Week 3: Tools & Testing
- Mon-Tue: Rake tasks
- Wed: Comprehensive test suite
- Thu-Fri: Integration testing on staging
Week 4: Production Deploy
- Mon: Deploy to production (code only, inactive)
- Tue: Test on single user
- Wed-Fri: Monitor, validate
Week 5: Initial Archive
- Mon: Archive 3+ year old data
- Tue-Fri: Monitor metrics, validate
Week 6: Expand Archive
- Mon: Archive 2+ year old data
- Tue-Fri: Monitor, optimize
Week 7: Production Ready
- Mon: Enable monthly cron
- Tue: Final validation
- Wed: Documentation update
- Thu-Fri: Reclaim space (VACUUM FULL)
Total: 7 weeks
Cost Analysis
Database Savings
Before:
- Points table: 50GB
- Daily backup cost: ~$0.05/GB/day = $2.50/day = $75/month
After:
- Points table: 15GB (-70%)
- Daily backup cost: ~$0.75/day = $22.50/month
- Savings: $52.50/month on backups
S3 Costs (20M points)
Storage:
- Compressed size: ~0.5GB (average 25 bytes per raw_data compressed)
- S3 Standard: $0.023/GB/month
- Cost: 500MB × $0.023 = $0.012/month (~negligible)
Requests:
- Monthly archival: ~50 PUT requests (50 users × 1 new month)
- User exports: ~100 GET requests/month
- PUT: $0.005/1000 = $0.0003/month
- GET: $0.0004/1000 = $0.00004/month
Data Transfer:
- Export downloads: ~10GB/month (100 exports × 100MB avg)
- First 10GB free, then $0.09/GB
- Cost: $0/month (under free tier)
Total S3 Cost: ~$0.02/month (essentially free!)
Net Savings
Total Monthly Savings: $52.50 - $0.02 = $52.48/month = $629.76/year
Plus:
- Faster queries → better UX
- Faster backups → reduced downtime risk
- Room for growth → can add 20M more points before hitting old size
FAQ
Q: How do I enable/disable archival?
A: Control via environment variable:
# Enable archival
ARCHIVE_RAW_DATA=true
# Disable archival (default)
ARCHIVE_RAW_DATA=false
# or simply don't set the variable
When disabled:
- Archive jobs return immediately without processing
- No S3 operations occur
- No costs incurred
- Existing archived data remains accessible
- Can be re-enabled anytime by setting to "true"
Deployment recommendation:
- Deploy code with ARCHIVE_RAW_DATA=false
- Test on staging
- Enable on production: ARCHIVE_RAW_DATA=true
- Monitor for 1 week
- If issues arise, set back to false immediately
Q: What happens if S3 is down?
A: The app continues working with graceful degradation:
- New imports work (raw_data stored in DB)
- Existing non-archived points work normally
- Archived points return {} from raw_data_with_archive (the error is logged to Sentry)
- Exports may be incomplete (raw_data missing for archived points)
Q: Can I switch storage backends later?
A: Yes! ActiveStorage handles this:
# 1. Configure new backend in config/storage.yml
# 2. Set STORAGE_BACKEND=new_backend
# 3. Migrate existing blobs, e.g. by configuring an ActiveStorage mirror service
#    in config/storage.yml, mirroring the existing files, then switching the primary service
Q: How do I restore data for a specific migration?
A:
# 1. Temporarily restore to cache (1 hour)
rake points:raw_data:restore_temporary[123,2024,6]
# 2. Run your migration immediately
rails runner "
Point.where(user_id: 123, timestamp_year: 2024, timestamp_month: 6).find_each do |point|
# point.raw_data_with_archive now returns archived data from cache
point.fix_coordinates_from_raw_data
point.save!
end
"
# 3. Cache expires automatically in 1 hour
Q: What if archive job fails?
A: Designed for safety:
- Advisory locks prevent duplicate processing
- Transactions ensure atomic DB updates
- Failed uploads don't mark points as archived
- Job retries automatically (Sidekiq)
- Sentry captures exceptions
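If an explicit retry policy is preferred over the adapter defaults, ActiveJob's retry_on can be added to the job; this is a sketch, not part of the job as defined in Phase 5:
class Points::RawData::ArchiveJob < ApplicationJob
  queue_as :default
  retry_on StandardError, wait: :polynomially_longer, attempts: 5

  def perform
    Points::RawData::Archiver.new.call
  end
end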
Q: Can I archive specific users only?
A: Yes, modify the archiver:
# Archive only specific users
Points::RawData::Archiver.new.call(user_ids: [1, 2, 3])
# Or exclude users
Points::RawData::Archiver.new.call(exclude_user_ids: [123])
Q: How do I monitor cache hit rates?
A:
# Add instrumentation inside Archivable#fetch_archived_raw_data (the concern method, since raw_data is not overridden)
def fetch_archived_raw_data
cached = check_temporary_restore_cache
if cached
Rails.logger.debug("Cache hit: temporary restore for point #{id}")
return cached
end
Rails.cache.fetch(archive_cache_key, expires_in: 1.day) do
Rails.logger.debug("Cache miss: fetching from S3 for point #{id}")
fetch_from_archive_file
end
end
Appendix A: What is JSONL?
JSONL stands for JSON Lines (also called newline-delimited JSON or ndjson).
Definition
JSONL is a text format where each line is a separate, complete, valid JSON object. Unlike regular JSON which wraps everything in an array or object, JSONL stores multiple JSON objects separated by newlines.
Format Comparison
Regular JSON (Array):
[
{"id": 1, "name": "Alice", "age": 30},
{"id": 2, "name": "Bob", "age": 25},
{"id": 3, "name": "Charlie", "age": 35}
]
JSONL (JSON Lines):
{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 25}
{"id": 3, "name": "Charlie", "age": 35}
No commas, no brackets—just one JSON object per line.
Why We Use JSONL for Archives
1. Memory-Efficient Streaming
# Regular JSON: Must load entire array into memory
data = JSON.parse(File.read('huge_file.json')) # ❌ Could be gigabytes!
data.each { |item| process(item) }
# JSONL: Process line-by-line
File.foreach('huge_file.jsonl') do |line| # ✅ Only one line in memory
item = JSON.parse(line)
process(item)
end
For a month with 100,000 points:
- JSON: Must hold all 100k objects in memory (~200MB+)
- JSONL: Process one at a time (~2KB per point in memory)
2. Fast Searching Without Full Parse
# Find one specific point without parsing everything
def find_point_raw_data(archive_file, point_id)
Zlib::GzipReader.new(archive_file).each_line do |line|
data = JSON.parse(line)
return data['raw_data'] if data['id'] == point_id # Found it! Stop reading.
end
end
With regular JSON, you'd have to:
- Download entire file
- Parse entire JSON array
- Search through array
- Return result
With JSONL, you:
- Stream file line by line
- Parse only lines until found
- Stop immediately (could be after 10 lines instead of 100k!)
3. Perfect for Append-Only Architecture
# June 1st: Create initial archive
echo '{"id":1,"raw_data":{...}}' >> raw_data.jsonl
echo '{"id":2,"raw_data":{...}}' >> raw_data.jsonl
# ... 1000 lines
# July 1st: User imports 50 retrospective points
# Just append new lines!
echo '{"id":1001,"raw_data":{...}}' >> raw_data.jsonl
echo '{"id":1002,"raw_data":{...}}' >> raw_data.jsonl
# ... 50 more lines
# Done! No need to download, parse, merge, and re-upload.
With regular JSON, you'd need to:
- Download entire array
- Parse JSON
- Add new objects to array
- Re-serialize entire array
- Re-upload entire file
4. Excellent Compression
# JSONL compresses very well with gzip
raw_data.jsonl # 10 MB (uncompressed)
raw_data.jsonl.gz # 1 MB (compressed) # 90% reduction!
Each line has similar structure, so gzip finds repeated patterns:
- Same keys: "id", "raw_data", "lon", "lat", etc.
- Same formats: numbers, nested objects
- Repetitive whitespace
Our Implementation Examples
Writing Archive (Archiver Service)
gz = Zlib::GzipWriter.new(io)
points.find_each(batch_size: 1000) do |point|
# Each point becomes one JSONL line
gz.puts({ id: point.id, raw_data: point.raw_data }.to_json)
end
# Result file (compressed):
# Line 1: {"id":123,"raw_data":{"lon":13.4,"lat":52.5,"accuracy":10}}
# Line 2: {"id":124,"raw_data":{"lon":13.5,"lat":52.6,"accuracy":12}}
# Line 3: {"id":125,"raw_data":{"lon":13.6,"lat":52.7,"accuracy":8}}
# ...
Reading Archive (Point Model)
gz = Zlib::GzipReader.new(compressed_file)
# Stream search - only parse lines until we find our point
gz.each_line do |line|
data = JSON.parse(line)
return data['raw_data'] if data['id'] == target_id # Found! Done.
end
Restoring Archive (Restorer Service)
# Process entire archive line-by-line (memory efficient)
decompressed_content.each_line do |line|
data = JSON.parse(line)
Point.where(id: data['id']).update_all(
raw_data: data['raw_data']
)
end
Performance Comparison
| Operation | Regular JSON | JSONL | Winner |
|---|---|---|---|
| Archive 100k points | Load all 100k in memory | Process 1k batches | JSONL |
| Find 1 point | Parse entire 100k array | Stop after finding (avg 50k) | JSONL |
| Add 50 new points | Download, merge, re-upload | Append 50 lines | JSONL |
| Memory usage (100k points) | ~200 MB | ~2 MB | JSONL |
| Compression ratio | 60-70% | 85-95% | JSONL |
| Processing speed | 5-10 sec | 0.5-2 sec | JSONL |
Common Use Cases for JSONL
- Log aggregation - Kibana, Logstash, Splunk
- Big data - Apache Spark, Hadoop, BigQuery native support
- Machine learning datasets - TensorFlow, PyTorch data pipelines
- API streaming - Twitter API, Slack RTM API
- Database exports - MongoDB export, PostgreSQL COPY
- Our use case - Point data archives!
File Extensions
All are valid and recognized:
- .jsonl - Official extension (JSON Lines)
- .ndjson - Alternative (Newline-Delimited JSON)
- .jsonl.gz - Compressed (what we use)
- .ndjson.gz - Compressed alternative
Key Takeaway
JSONL = One JSON object per line
Perfect for our archive system because it enables:
- ✅ Stream processing - Low memory usage
- ✅ Fast searching - Stop when found
- ✅ Append-only - No merge needed
- ✅ Great compression - 90%+ size reduction
- ✅ Simple format - Easy to read/write/debug
It's essentially the difference between:
- Phone book (JSON) - One big book, must open entire thing
- Index cards (JSONL) - One card per entry, process individually
Appendix B: File Checklist
Files to Create (17)
Migrations (3):
- db/migrate/YYYYMMDDHHMMSS_create_points_raw_data_archives.rb
- db/migrate/YYYYMMDDHHMMSS_add_archival_columns_to_points.rb
- db/migrate/YYYYMMDDHHMMSS_add_generated_timestamp_columns_to_points.rb
Models & Concerns (2):
- app/models/points/raw_data_archive.rb
- app/models/concerns/archivable.rb ✨ NEW
Services (3):
- app/services/points/raw_data/archiver.rb
- app/services/points/raw_data/restorer.rb
- app/services/points/raw_data/chunk_compressor.rb
Jobs (2):
- app/jobs/points/raw_data/archive_job.rb
- app/jobs/points/raw_data/re_archive_month_job.rb
Rake Tasks (1):
- lib/tasks/points_raw_data.rake
Specs (6):
- spec/models/points_raw_data_archive_spec.rb
- spec/models/concerns/archivable_spec.rb ✨ NEW
- spec/services/points/raw_data/archiver_spec.rb
- spec/services/points/raw_data/restorer_spec.rb
- spec/jobs/points/raw_data/archive_job_spec.rb
- spec/lib/tasks/points_raw_data_rake_spec.rb
Files to Modify (6+)
Core Models:
- app/models/point.rb - Add include Archivable (one line!)
- app/models/user.rb - Add has_many :raw_data_archives relationship
Services:
- app/services/users/export_data/points.rb - Batch load archives (see Phase 4)
- app/services/imports/create.rb - Trigger re-archival after import
- app/services/points/raw_data_lonlat_extractor.rb - Use raw_data_with_archive
Serializers (if needed):
- app/serializers/export_serializer.rb - Use raw_data_with_archive
- Other serializers that access raw_data (find with grep)
Config:
- .env.template - Add archival configuration variable:
# Raw Data Archival (Optional)
# Set to "true" to enable archival (archives data 2+ months old)
# Note: Requires STORAGE_BACKEND configured (s3, local, etc.)
ARCHIVE_RAW_DATA=false
Files NOT Modified (Already Configured!)
- ✅ config/storage.yml - ActiveStorage already configured
- ✅ config/initializers/aws.rb - S3 credentials already set
- ✅ config/environments/production.rb - Storage backend already set
- ✅ Gemfile - aws-sdk-s3 already included
Appendix C: User Model Integration
Archive Deletion Policy & Cascade Behavior
Requirement: When a user is deleted, all their raw_data archives (both database records and S3 files) must be deleted.
Implementation
User Model Addition:
# app/models/user.rb
class User < ApplicationRecord
# ... existing associations ...
has_many :raw_data_archives, class_name: 'Points::RawDataArchive', dependent: :destroy
# ... rest of model ...
end
How Cascade Deletion Works
- User deletion triggered → user.destroy
- dependent: :destroy on the association → Rails calls destroy on each Points::RawDataArchive record
- Archive model's has_one_attached :file → ActiveStorage's built-in callback triggers
- ActiveStorage purges the blob → S3 file deleted via file.purge
- Archive record deleted → Database row removed
Automatic Cleanup Chain
user.destroy
↓
user.raw_data_archives.destroy_all
↓
archive.destroy (for each archive)
↓
ActiveStorage::Attachment callback fires
↓
active_storage_blobs.purge
↓
S3 file deleted
↓
active_storage_attachments row deleted
↓
active_storage_blobs row deleted
↓
points_raw_data_archives row deleted
GDPR Compliance
✅ Complete data removal:
- Database: points table records deleted (via has_many :points, dependent: :destroy)
- Database: points_raw_data_archives table records deleted
- Storage: All archive .jsonl.gz files in S3/local storage deleted
- Database: ActiveStorage metadata (active_storage_blobs, active_storage_attachments) deleted
✅ No manual intervention needed - Standard Rails cascade handles everything
Why No after_destroy Callback Needed
You might think you need this:
# ❌ NOT NEEDED
after_destroy :purge_raw_data_archives
def purge_raw_data_archives
raw_data_archives.find_each { |a| a.file.purge }
end
But you don't because:
- ActiveStorage automatically purges attached files when the parent record is destroyed
- This is the same behavior as the existing Import and Export models in Dawarich
- Adding manual purge would duplicate the work and potentially cause errors
Verification Test
# spec/models/user_spec.rb
RSpec.describe User, type: :model do
describe 'archive deletion cascade' do
let(:user) { create(:user) }
let!(:archive) { create(:points_raw_data_archive, user: user) }
before do
# Attach a file to the archive
archive.file.attach(
io: StringIO.new('test data'),
filename: 'test.jsonl.gz',
content_type: 'application/gzip'
)
end
it 'deletes archives and their S3 files when user is deleted' do
blob_id = archive.file.blob.id
expect {
user.destroy
}.to change(Points::RawDataArchive, :count).by(-1)
.and change(ActiveStorage::Blob, :count).by(-1)
expect(ActiveStorage::Blob.find_by(id: blob_id)).to be_nil
end
end
end
Notes
- This pattern is consistent with existing Dawarich code (Import/Export models)
- No special configuration needed for S3 deletion - ActiveStorage handles it
- Works with any storage backend (S3, GCS, Azure, local disk)
- Deletion is transactional - if user deletion fails, archives remain intact
Conclusion
This implementation plan provides a comprehensive, production-ready approach to archiving points.raw_data using ActiveStorage, with:
- ✅ Consistency: Uses same patterns as existing Import/Export
- ✅ Safety: Append-only, transactional, idempotent
- ✅ Flexibility: Works with any ActiveStorage backend
- ✅ Observability: Comprehensive rake tasks and monitoring
- ✅ Reversibility: Full restore capabilities
Next Steps:
- Review plan with team
- Approve and prioritize
- Create GitHub issue with checklist
- Begin Week 1 implementation
Questions? Review the FAQ section or ask for clarification on specific components.