Implement Google Drive Import Tool with Comprehensive Monitoring #4

@bryanchriswhite

Summary

Implement a comprehensive Google Drive import tool that allows users to authenticate with their Google account, select files/folders from their Drive, and import them into PinShare with full monitoring capabilities. The tool should support both manual one-time imports and optional continuous synchronization.

User Story

As a PinShare user, I want to import my files from Google Drive into the decentralized PinShare network, so that I can:

  • Liberate my data from centralized cloud storage
  • Share my files via P2P/IPFS without relying on Google's infrastructure
  • Maintain a decentralized backup of my important files
  • Optionally keep my PinShare instance in sync with my Drive

Current State

PinShare currently supports file uploads via:

  • File system watcher monitoring an ./upload folder
  • Manual file placement by users

Architecture Gaps for Google Drive Import:

  1. ❌ No user authentication system (OAuth or otherwise)
  2. ❌ No Google Drive API integration
  3. ❌ No background job queue for long-running operations
  4. ❌ No per-user file import tracking
  5. ❌ Limited upload status tracking (designed for quick local file processing)
  6. ❌ No continuous sync/monitoring capability

Existing Assets to Leverage:

  • ✅ Robust upload pipeline (validation → hashing → security scanning → IPFS → metadata storage)
  • ✅ Upload status tracking system (UploadStatusManager)
  • ✅ Real-time UI updates via React Query
  • ✅ Security scanning infrastructure (VirusTotal, ClamAV, P2P-Sec)
  • ✅ P2P metadata distribution via PubSub

Proposed Solution

Architecture Components

1. User Authentication System

  • OAuth 2.0 with PKCE for Google Drive API access
  • Per-user token storage (encrypted)
  • Token refresh mechanism
  • Support for multiple users on the same PinShare instance

Tech Stack:

  • google.golang.org/api/drive/v3 for Google Drive API
  • golang.org/x/oauth2 for OAuth flow
  • Secure token storage (encrypted database or keyring)

2. Google Drive Integration

  • List user's Drive folders and files
  • Stream file downloads directly to PinShare
  • Folder hierarchy preservation (optional)
  • Metadata mapping (Drive metadata → PinShare metadata)
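One way to stream a download while computing its hash in the same pass is `io.MultiWriter`. The sketch below is an assumption about how streaming could work, not a description of PinShare's pipeline; `copyAndHash` and `hashOf` are hypothetical names, and a `strings.Reader` stands in for the Drive response body.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// copyAndHash streams src into dst and returns the hex SHA-256 of the
// bytes copied plus the byte count, without buffering the whole file.
func copyAndHash(dst io.Writer, src io.Reader) (string, int64, error) {
	h := sha256.New()
	n, err := io.Copy(io.MultiWriter(dst, h), src)
	if err != nil {
		return "", n, err
	}
	return hex.EncodeToString(h.Sum(nil)), n, nil
}

// hashOf is a demo helper: it streams a string through copyAndHash
// and discards the copied bytes.
func hashOf(s string) string {
	sum, _, err := copyAndHash(io.Discard, strings.NewReader(s))
	if err != nil {
		return ""
	}
	return sum
}

func main() {
	fmt.Println(hashOf("hello"))
}
```

In a real import, `dst` would be the temporary file (or the next pipeline stage) and `src` the HTTP response body returned by the Drive client.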

Features:

  • Folder tree browser UI
  • Multi-select file/folder selection
  • Filter by file type, size, date
  • Preview file list before import

3. Background Job System

  • Job queue for import operations
  • Worker pool for concurrent processing
  • Job persistence (survive restarts)
  • Progress tracking per job
  • Retry mechanism with exponential backoff
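The retry delay can be computed with a small helper. This is a sketch under assumed parameters (a 1 second base and a 60 second cap are illustrative, not values from this proposal), and `backoffDelay` is a hypothetical name. Production code would usually add random jitter so that many failed files do not retry in lockstep.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based):
// base, 2*base, 4*base, ... capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		// Stop doubling once another doubling would reach or pass the cap;
		// this also avoids overflow for large attempt counts.
		if d >= max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	for attempt := 0; attempt <= 7; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, backoffDelay(attempt, time.Second, time.Minute))
	}
}
```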

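File States (per file; the job itself is pending, running, completed, failed, or cancelled):

pending → downloading → hashing → scanning → uploading → completed
downloading | hashing | scanning | uploading → failed (with retry)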
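The stage progression above can be encoded as a transition table so that invalid status updates are rejected early. The sketch assumes two things this proposal does not state: that any non-terminal stage may move to failed, and that a retry restarts a failed file from pending. `FileStatus` and `canTransition` are hypothetical names.

```go
package main

import "fmt"

// FileStatus is the per-file import stage.
type FileStatus string

const (
	StatusPending     FileStatus = "pending"
	StatusDownloading FileStatus = "downloading"
	StatusHashing     FileStatus = "hashing"
	StatusScanning    FileStatus = "scanning"
	StatusUploading   FileStatus = "uploading"
	StatusCompleted   FileStatus = "completed"
	StatusFailed      FileStatus = "failed"
)

// next maps each stage to the single stage that follows it on success.
var next = map[FileStatus]FileStatus{
	StatusPending:     StatusDownloading,
	StatusDownloading: StatusHashing,
	StatusHashing:     StatusScanning,
	StatusScanning:    StatusUploading,
	StatusUploading:   StatusCompleted,
}

// canTransition reports whether moving from one status to another is allowed.
func canTransition(from, to FileStatus) bool {
	switch {
	case to == StatusFailed:
		// Assumption: any non-terminal stage can fail.
		return from != StatusCompleted && from != StatusFailed
	case from == StatusFailed:
		// Assumption: a retry restarts the file from pending.
		return to == StatusPending
	default:
		n, ok := next[from]
		return ok && n == to
	}
}

func main() {
	fmt.Println(canTransition(StatusPending, StatusDownloading))
	fmt.Println(canTransition(StatusHashing, StatusUploading))
	fmt.Println(canTransition(StatusFailed, StatusPending))
}
```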

4. Enhanced Monitoring Dashboard

Real-time metrics:

  • Overall import progress (X of Y files, % complete)
  • Per-file status with detailed stages
  • Error tracking with specific failure reasons
  • Bandwidth metrics (current speed, average speed, ETA)
  • Success/failure statistics

Historical tracking:

  • Import job history
  • Per-file import logs
  • Retry attempts
  • Total data imported

5. Continuous Sync Engine (Phase 3)

  • Watch Google Drive for changes (polling or webhooks)
  • Auto-import new/modified files
  • Configurable sync interval
  • Conflict resolution strategy
  • Sync pause/resume capability

Technical Implementation

Backend API Endpoints

Authentication

POST /api/google-drive/authorize
  → Initiates OAuth flow, returns authorization URL

GET /api/google-drive/callback?code={authCode}&state={state}
  → OAuth redirect target (the browser arrives via GET); validates state, exchanges auth code for tokens, stores them encrypted

GET /api/google-drive/auth-status
  → Returns whether user is authenticated

DELETE /api/google-drive/revoke
  → Revokes access and deletes tokens

File Selection

GET /api/google-drive/folders?folderId={folderId}
  → Lists files/folders in specified folder (defaults to root)
  Response: [{ id, name, mimeType, size, modifiedTime, parents[] }]

POST /api/google-drive/preview-import
  Request: { fileIds: [], folderIds: [], recursive: bool }
  Response: { files: [], totalSize, totalCount }

Import Operations

POST /api/google-drive/import
  Request: { 
    fileIds: [], 
    folderIds: [], 
    recursive: bool,
    options: { preserveHierarchy, skipDuplicates }
  }
  Response: { jobId, status, filesQueued }

GET /api/google-drive/import/{jobId}/status
  Response: { 
    jobId, 
    status: "pending|running|completed|failed|cancelled",
    progress: { 
      totalFiles, 
      completedFiles, 
      failedFiles, 
      currentFile,
      percentComplete,
      bytesTransferred,
      totalBytes,
      transferRate,
      estimatedTimeRemaining
    },
    files: [
      { 
        driveId, 
        fileName, 
        status: "pending|downloading|hashing|scanning|uploading|completed|failed",
        progress: 0-100,
        error: ""
      }
    ]
  }

POST /api/google-drive/import/{jobId}/cancel
  → Cancels running import job

POST /api/google-drive/import/{jobId}/retry-failed
  → Retries all failed files in the job

GET /api/google-drive/import/history
  → Returns list of past import jobs with summary stats
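The derived fields in the status response (percentComplete, transferRate, estimatedTimeRemaining) could be computed from raw counters roughly as follows. This is a sketch with assumed units (bytes per second and seconds) and an assumed definition of percentComplete as processed files over total files, which matches the "45% (45/100 files)" style of display; `Progress` and `newProgress` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Progress mirrors the "progress" object of the status response.
type Progress struct {
	TotalFiles             int     `json:"totalFiles"`
	CompletedFiles         int     `json:"completedFiles"`
	FailedFiles            int     `json:"failedFiles"`
	CurrentFile            string  `json:"currentFile"`
	PercentComplete        float64 `json:"percentComplete"`
	BytesTransferred       int64   `json:"bytesTransferred"`
	TotalBytes             int64   `json:"totalBytes"`
	TransferRate           float64 `json:"transferRate"`           // assumed unit: bytes per second
	EstimatedTimeRemaining float64 `json:"estimatedTimeRemaining"` // assumed unit: seconds
}

// newProgress fills in the derived fields from raw counters.
func newProgress(total, done, failed int, sent, totalBytes int64, elapsedSec float64) Progress {
	p := Progress{
		TotalFiles:       total,
		CompletedFiles:   done,
		FailedFiles:      failed,
		BytesTransferred: sent,
		TotalBytes:       totalBytes,
	}
	if total > 0 {
		// Failed files count as processed, so the bar still reaches 100%.
		p.PercentComplete = 100 * float64(done+failed) / float64(total)
	}
	if elapsedSec > 0 {
		p.TransferRate = float64(sent) / elapsedSec
	}
	if p.TransferRate > 0 {
		p.EstimatedTimeRemaining = float64(totalBytes-sent) / p.TransferRate
	}
	return p
}

func main() {
	p := newProgress(100, 40, 5, 500, 1000, 10)
	p.CurrentFile = "document.pdf"
	out, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```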

Continuous Sync (Phase 3)

POST /api/google-drive/sync/configure
  Request: { enabled, folderId, interval, options }
  
GET /api/google-drive/sync/status
  Response: { enabled, lastSync, nextSync, syncedFiles, errors }

POST /api/google-drive/sync/trigger
  → Manually triggers sync cycle

Database Schema

Users Table

CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  google_id VARCHAR(255) UNIQUE NOT NULL,
  email VARCHAR(255) NOT NULL,
  encrypted_access_token TEXT,
  encrypted_refresh_token TEXT,
  token_expiry TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Import Jobs Table

CREATE TABLE import_jobs (
  id UUID PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  status VARCHAR(50), -- pending, running, completed, failed, cancelled
  total_files INTEGER,
  completed_files INTEGER,
  failed_files INTEGER,
  total_bytes BIGINT,
  transferred_bytes BIGINT,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  options JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Import Files Table

CREATE TABLE import_files (
  id SERIAL PRIMARY KEY,
  job_id UUID REFERENCES import_jobs(id),
  drive_file_id VARCHAR(255),
  file_name VARCHAR(500),
  file_size BIGINT,
  status VARCHAR(50), -- pending, downloading, hashing, scanning, uploading, completed, failed
  progress INTEGER, -- 0-100
  sha256_hash VARCHAR(64),
  ipfs_cid VARCHAR(255),
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Sync Configurations Table (Phase 3)

CREATE TABLE sync_configs (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  drive_folder_id VARCHAR(255),
  enabled BOOLEAN DEFAULT TRUE,
  sync_interval INTEGER, -- minutes
  last_sync_at TIMESTAMP,
  next_sync_at TIMESTAMP,
  options JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Internal Architecture

New Go Packages

internal/gdrive/

  • client.go - Google Drive API client wrapper
  • oauth.go - OAuth flow management
  • downloader.go - File download from Drive
  • mapper.go - Drive metadata → PinShare metadata conversion

internal/jobs/

  • queue.go - Job queue interface
  • worker.go - Worker pool implementation
  • import_job.go - Import job definition and state machine
  • persistence.go - Job state persistence

internal/users/

  • auth.go - User authentication
  • store.go - User data storage
  • tokens.go - Encrypted token management

internal/sync/ (Phase 3)

  • engine.go - Continuous sync orchestration
  • watcher.go - Drive change detection
  • scheduler.go - Sync scheduling

Integration with Existing Systems

Upload Pipeline Integration:

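// In internal/jobs/import_job.go (needs "fmt" and "os" in the import block)
func (j *ImportJob) processFile(driveFile *drive.File) error {
    // 1. Download from Google Drive to a temporary local file
    j.updateFileStatus(driveFile.Id, "downloading", 0)
    localPath, err := j.driveClient.Download(driveFile)
    if err != nil {
        return j.failFile(driveFile.Id, fmt.Sprintf("Download failed: %v", err))
    }
    // Remove the temporary download on every exit path. This assumes
    // AddFileIPFS does not need the local file after it returns.
    defer os.Remove(localPath)

    // 2. Plug into existing upload pipeline
    j.updateFileStatus(driveFile.Id, "hashing", 30)
    fileHash := psfs.ComputeSHA256(localPath) // not named sha256, to avoid shadowing crypto/sha256

    j.updateFileStatus(driveFile.Id, "scanning", 50)
    secResult := psfs.SecurityCheck(localPath, fileHash)
    if !secResult.Safe {
        return j.failFile(driveFile.Id, "Security scan failed")
    }

    j.updateFileStatus(driveFile.Id, "uploading", 70)
    cid, err := psfs.AddFileIPFS(localPath)
    if err != nil {
        return j.failFile(driveFile.Id, fmt.Sprintf("IPFS upload failed: %v", err))
    }

    // 3. Store metadata. This stays under "uploading" because "storing" is
    // not one of the file statuses defined in the API and schema.
    j.updateFileStatus(driveFile.Id, "uploading", 90)
    metadata := store.BaseMetadata{
        FileSHA256: fileHash,
        IPFSCID: cid,
        FileName: driveFile.Name,
        // ... map other Drive metadata
    }
    store.GlobalStore.AddFile(metadata)

    j.updateFileStatus(driveFile.Id, "completed", 100)
    return nil
}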

UI Requirements

1. Google Drive Authorization Page

Location: /import/google-drive/authorize

Components:

  • "Connect to Google Drive" button
  • OAuth consent explanation
  • Permissions required list
  • Privacy policy link

2. Folder/File Selection Interface

Location: /import/google-drive/select

Features:

  • Tree view of Drive folders (collapsible)
  • File list view with checkboxes
  • Multi-select capability
  • File type icons
  • Size/date metadata display
  • Search/filter bar
  • "Select All" / "Deselect All" buttons
  • Preview import summary (X files, Y GB)
  • Import options:
    • Preserve folder hierarchy
    • Skip duplicates (by SHA256)
    • Include shared files
  • "Start Import" button

3. Import Status Dashboard

Location: /import/google-drive/status/{jobId}

Real-time Metrics:

╔════════════════════════════════════════════════════════╗
║  Import Progress                            [Cancel]  ║
╠════════════════════════════════════════════════════════╣
║  ████████████████░░░░░░░░░░  45% (45/100 files)      ║
║  ⬇ Downloading: document.pdf (2.5 MB/s)              ║
║  ⏱ ETA: 5 minutes                                     ║
║  📊 Status: 40 completed, 5 failed, 55 pending       ║
╚════════════════════════════════════════════════════════╝

Files:
┌─────────────────────────────────────────────────────┐
│ ✅ report.pdf         │ Completed  │ 2.3 MB │ 12:30 │
│ ⏳ presentation.pptx  │ Scanning   │ ████░░ │       │
│ ❌ large-video.mp4    │ Failed     │ Error: Too large│
│ ⏸ document.docx      │ Pending    │ 45 KB  │       │
└─────────────────────────────────────────────────────┘

[Retry Failed Files]  [View Details]

Detailed Per-File View:

  • File name with Drive icon
  • Progress bar for current file
  • Current stage (downloading/hashing/scanning/uploading)
  • Transfer speed
  • Success/error indicator
  • Retry button for failed files

4. Import History Page

Location: /import/google-drive/history

Display:

  • List of past import jobs
  • Job ID, start time, duration
  • Success/failure counts
  • Total data imported
  • "View Details" link to status page

5. Sync Configuration Page (Phase 3)

Location: /import/google-drive/sync

Settings:

  • Enable/disable continuous sync
  • Select Drive folder to sync
  • Sync interval (hourly, daily, etc.)
  • Conflict resolution strategy
  • Last sync timestamp
  • Manual "Sync Now" button

Monitoring & Observability

Metrics to Track

Job-Level Metrics:

  • gdrive_import_jobs_total{status="completed|failed|cancelled"}
  • gdrive_import_duration_seconds
  • gdrive_import_files_total{status="completed|failed"}
  • gdrive_import_bytes_total

File-Level Metrics:

  • gdrive_file_download_duration_seconds
  • gdrive_file_size_bytes{stage="downloaded|uploaded"}
  • gdrive_transfer_rate_bytes_per_second

API Metrics:

  • gdrive_api_requests_total{endpoint,status}
  • gdrive_api_errors_total{error_type}
  • gdrive_api_rate_limit_hits_total

Sync Metrics (Phase 3):

  • gdrive_sync_cycles_total{status}
  • gdrive_sync_new_files_detected
  • gdrive_sync_lag_seconds (time since last successful sync)

Logging Strategy

  • Structured JSON logs
  • Log levels: DEBUG, INFO, WARN, ERROR
  • Include job ID, user ID, file ID in all log entries
  • Detailed error logging with stack traces
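The logging strategy above maps fairly directly onto the standard library's `log/slog` package (Go 1.21 or later), whose levels are DEBUG, INFO, WARN, and ERROR. The sketch below is one possible shape, not a decided design; `newJobLogger`, `sampleEntry`, and the field names `job_id`, `user_id`, and `file_id` are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log/slog"
	"os"
)

// newJobLogger returns a JSON logger that stamps every entry with the
// job, user, and file IDs.
func newJobLogger(w io.Writer, jobID string, userID int, fileID string) *slog.Logger {
	h := slog.NewJSONHandler(w, &slog.HandlerOptions{Level: slog.LevelDebug})
	return slog.New(h).With("job_id", jobID, "user_id", userID, "file_id", fileID)
}

// sampleEntry writes one WARN entry to a buffer and decodes it,
// to show which keys end up in the JSON line.
func sampleEntry() (map[string]any, error) {
	var buf bytes.Buffer
	newJobLogger(&buf, "job-123", 42, "drive-abc").Warn("download retry", "attempt", 2)
	var entry map[string]any
	err := json.Unmarshal(buf.Bytes(), &entry)
	return entry, err
}

func main() {
	logger := newJobLogger(os.Stdout, "job-123", 42, "drive-abc")
	logger.Info("download started", "bytes", 1024)
	logger.Error("download failed", "error", "quota exceeded")
}
```

Note that `slog` does not capture stack traces on its own; those would have to be attached as an explicit attribute.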

Security Considerations

OAuth Security

  1. PKCE Flow - Use Proof Key for Code Exchange for additional security
  2. Token Encryption - Encrypt tokens at rest using AES-256
  3. Secure Storage - Store encrypted tokens in database or OS keyring
  4. Token Rotation - Implement automatic refresh token rotation
  5. Scope Minimization - Request only drive.readonly scope
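Token encryption at rest could look like the sketch below. The list above only specifies AES-256; the choice of GCM mode and the nonce-prefixed layout are assumptions, and `encryptToken`, `decryptToken`, and `newGCM` are hypothetical names. Key management (where the 32-byte key lives) is out of scope here.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD from a 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("key must be 32 bytes for AES-256")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptToken seals plaintext; the result is nonce || ciphertext || tag.
func encryptToken(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst prepends it to the sealed output.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptToken reverses encryptToken and fails if the data was modified.
func decryptToken(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := encryptToken(key, []byte("example-refresh-token"))
	if err != nil {
		panic(err)
	}
	plain, err := decryptToken(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("sealed %d bytes, recovered %q\n", len(sealed), plain)
}
```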

API Security

  1. Rate Limiting - Respect Google Drive API quotas (per-user limits)
  2. Authentication Required - All endpoints require valid user session
  3. Input Validation - Validate all Drive file IDs, folder paths
  4. CORS - Restrict to localhost and configured domains

File Security

  1. Leverage Existing Scanning - All imported files go through security checks
  2. Size Limits - Enforce max file size limits
  3. Type Validation - Respect PinShare's allowed file types
  4. Malware Scanning - VirusTotal/ClamAV on all imports

Privacy

  1. User Data Isolation - Each user only sees their own imports
  2. Token Revocation - Support complete data deletion
  3. Audit Logging - Log all import operations

Testing Strategy

Unit Tests

  • Google Drive client mocking
  • OAuth flow state machine
  • Job queue operations
  • Metadata mapping accuracy

Integration Tests

  • End-to-end import flow with test files
  • OAuth callback handling
  • Database persistence
  • Worker pool concurrency

Load Tests

  • Import 1,000 files in a single job (downloads capped by max_concurrent_downloads)
  • Test with 10+ concurrent import jobs
  • Measure memory usage and performance
  • Test rate limit handling

Security Tests

  • OAuth PKCE flow validation
  • Token encryption/decryption
  • Unauthorized access attempts
  • Input validation edge cases

Implementation Phases

Phase 1: OAuth + Basic Import (MVP)

Goal: Import files manually from Google Drive

Deliverables:

  • User authentication system
  • Google Drive OAuth integration
  • Drive folder/file browser UI
  • Basic import job queue
  • Simple progress tracking
  • Integration with existing upload pipeline

Estimated Effort: 2-3 weeks

Phase 2: Enhanced Monitoring

Goal: Comprehensive import monitoring

Deliverables:

  • Per-file status tracking
  • Real-time progress updates (WebSocket)
  • Bandwidth/speed metrics
  • Error tracking and retry mechanism
  • Import history dashboard
  • Prometheus metrics

Estimated Effort: 1-2 weeks

Phase 3: Continuous Sync

Goal: Auto-sync Drive changes

Deliverables:

  • Drive change detection (polling)
  • Sync scheduler
  • Sync configuration UI
  • Conflict resolution
  • Sync pause/resume

Estimated Effort: 2-3 weeks

Phase 4: Performance & Polish

Goal: Production-ready reliability

Deliverables:

  • Performance optimization
  • Webhook support (vs polling)
  • Advanced filtering options
  • Batch operations
  • Comprehensive documentation

Estimated Effort: 1-2 weeks


Dependencies

External Services

  • Google Cloud Project - OAuth credentials, API enablement
  • Google Drive API v3 - File access
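  • Database - PostgreSQL or SQLite for job/user persistence (the schema in this issue uses PostgreSQL syntax such as SERIAL, JSONB, and NOW(), so SQLite would need an adapted schema)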

Go Packages

require (
    google.golang.org/api v0.XXX
    golang.org/x/oauth2 v0.XXX
    github.com/lib/pq v1.XXX // PostgreSQL driver
    github.com/google/uuid v1.XXX // Job IDs
)

Configuration

# config.yaml additions
google_drive:
  oauth:
    client_id: "${GOOGLE_OAUTH_CLIENT_ID}"
    client_secret: "${GOOGLE_OAUTH_CLIENT_SECRET}"
    redirect_url: "http://localhost:9090/api/google-drive/callback"
    scopes:
      - "https://www.googleapis.com/auth/drive.readonly"
  
  import:
    max_concurrent_downloads: 5
    max_file_size_mb: 1024
    temp_download_dir: "./tmp/gdrive"
    
  rate_limiting:
    requests_per_second: 10
    burst: 20
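The `rate_limiting` settings describe a token bucket: tokens refill at `requests_per_second` and accumulate up to `burst`. The sketch below illustrates that behavior with a deterministic clock; it is not thread-safe and the names `bucket`, `allow`, and `simulate` are hypothetical. In practice the golang.org/x/time/rate package provides a ready-made limiter with the same two parameters.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a minimal token bucket: it refills at rate tokens per second,
// up to burst tokens. Callers must add their own locking.
type bucket struct {
	rate, burst, tokens float64
	last                time.Time
}

// newBucket starts with a full bucket at time now.
func newBucket(rate, burst float64, now time.Time) *bucket {
	return &bucket{rate: rate, burst: burst, tokens: burst, last: now}
}

// allow reports whether one request may proceed at time now,
// consuming a token if so.
func (b *bucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulate fires n requests at one instant, waits for pause, fires n more,
// and returns how many requests were allowed in each batch.
func simulate(rate, burst float64, n int, pause time.Duration) (first, second int) {
	t0 := time.Unix(0, 0)
	b := newBucket(rate, burst, t0)
	for i := 0; i < n; i++ {
		if b.allow(t0) {
			first++
		}
	}
	t1 := t0.Add(pause)
	for i := 0; i < n; i++ {
		if b.allow(t1) {
			second++
		}
	}
	return first, second
}

func main() {
	// Values from the config above: 10 requests per second, burst of 20.
	first, second := simulate(10, 20, 30, 500*time.Millisecond)
	fmt.Printf("first batch allowed %d, second batch allowed %d\n", first, second)
}
```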

Success Metrics

User Adoption

  • Number of users connecting Google Drive
  • Total files imported
  • Active sync configurations

Performance

  • Average import speed (files/minute, MB/s)
  • P95 latency for import operations
  • Error rate < 1%

Reliability

  • Job success rate > 99%
  • Retry success rate
  • Sync lag < 5 minutes (Phase 3)

Future Enhancements

Beyond Initial Implementation

  • Dropbox Integration - Apply same pattern to Dropbox
  • OneDrive Support - Microsoft OneDrive import
  • S3 Import - AWS S3 bucket import
  • Selective Export - PinShare → Google Drive
  • Smart Deduplication - Cross-user file deduplication
  • Bandwidth Scheduling - Import during off-peak hours
  • Multi-folder Sync - Sync multiple Drive folders simultaneously

Related Issues


References


Priority: High
Complexity: High
Impact: High - Unlocks PinShare for users with existing cloud storage

Labels: enhancement (New feature or request)
