Admin Principles & Operational Standards

Version 2.0.0 | Status: Production | Context: Extension of PRINCIPLES.md

This document defines the architectural and operational standards for the Design System Swarm (DSS) Administration Layer. While PRINCIPLES.md governs the system as a whole, these principles specifically constrain how privileged operations, monitoring, and governance are implemented.

The 6 Admin Principles extend the core DSS principles with role-based governance, configuration hierarchy, and team-specific operational modes. They enable different teams (Admin, UI, UX, QA) to work within the same system without duplicating configuration or losing governance auditability.

1. Admin Visibility (The Glass Box)

Core Concept

The Admin layer must provide total transparency into the system's state. There are no "hidden" processes. If the system does it, the Admin sees it. This is not just about logs; it is about semantic understanding of system health, user activity, and resource state.

Why It Matters

Trust: You cannot govern what you cannot see. Hidden failures erode trust in the swarm.
Velocity: Debugging time is inversely proportional to visibility. Opaque systems require hours to debug; transparent systems require minutes.
Proactive Maintenance: Visibility allows admins to spot trends (e.g., rising error rates) before they become outages.

How to Apply

1. Semantic Dashboards, Not Just Logs

Don't just stream text logs. Aggregate state into meaningful views.

Bad: A log stream showing "User X updated Token Y".
Good: A "Recent Token Activity" widget showing a heatmap of updates by user and component.

2. Real-Time State Reflection

Admin UIs must subscribe to state changes, not poll for them.

Use WebSockets or Server-Sent Events (SSE) to push updates.
If a deployment status changes, the Admin UI updates immediately.
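As a minimal sketch of the push model, the snippet below serializes a state change as a Server-Sent Events frame and fans it out to subscribers. The names `format_sse` and `StateBroadcaster` and the `deployment_status` event are hypothetical, and a real server would write each frame to an open HTTP response rather than call an in-process callback.

```python
import json


def format_sse(event, payload):
    """Serialize one state change as an SSE frame (event + data + blank line)."""
    data = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    return f"event: {event}\ndata: {data}\n\n"


class StateBroadcaster:
    """Hypothetical in-process fan-out: subscribers are pushed frames, they never poll."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event, payload):
        frame = format_sse(event, payload)
        for callback in self._subscribers:
            callback(frame)
        return frame


received = []
broadcaster = StateBroadcaster()
broadcaster.subscribe(received.append)
broadcaster.publish("deployment_status", {"id": "deploy-1", "status": "failed"})
```

After `publish` returns, every subscribed Admin view has already received the frame, which is the property the principle asks for.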

3. Correlated Telemetry

Every error visible in the Admin UI must link directly to:

The Trace ID.
The User ID involved.
The exact System State version at that moment.
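One way to make these three links impossible to omit is to require them at construction time. The sketch below assumes a hypothetical `AdminErrorEvent` record; the field names are illustrative and not taken from the DSS codebase.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AdminErrorEvent:
    """Hypothetical error record that cannot exist without its correlation fields."""

    message: str
    trace_id: str
    user_id: str
    state_version: str

    def __post_init__(self):
        # Reject events that would show up in the Admin UI without context.
        for name in ("trace_id", "user_id", "state_version"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required for correlated telemetry")

    def to_json(self):
        """Emit the event as a structured JSON log line."""
        return json.dumps(asdict(self), sort_keys=True)


event = AdminErrorEvent("Token sync failed", "trace-123", "user-42", "v17")
log_line = event.to_json()
```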

Implementation Checklist

Unified Logging: All services emit structured JSON logs to a central aggregator accessible by Admin tools.
State Inspection Tools: MCP tools exist to query the raw state of any component (get_component_state, inspect_queue).
Deployment Observability: Admins can see the exact step-by-step progress of any active deployment.
Error Aggregation: Repeated errors are grouped, counted, and ranked by severity on the dashboard.
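The grouping and ranking in the last checklist item could look roughly like this. The `severity` and `fingerprint` keys and the severity ordering are assumptions made for the example; the document does not define an error schema.

```python
from collections import Counter

# Assumed severity ordering: lower rank is shown first on the dashboard.
SEVERITY_RANK = {"critical": 0, "error": 1, "warning": 2}


def aggregate_errors(events):
    """Group repeated errors, count them, and rank by severity, then frequency."""
    counts = Counter((e["severity"], e["fingerprint"]) for e in events)
    groups = [
        {"severity": severity, "fingerprint": fingerprint, "count": count}
        for (severity, fingerprint), count in counts.items()
    ]
    groups.sort(
        key=lambda g: (
            SEVERITY_RANK.get(g["severity"], len(SEVERITY_RANK)),
            -g["count"],
            g["fingerprint"],
        )
    )
    return groups


events = [
    {"severity": "warning", "fingerprint": "slow-query"},
    {"severity": "error", "fingerprint": "token-parse"},
    {"severity": "error", "fingerprint": "sync-failed"},
    {"severity": "critical", "fingerprint": "db-down"},
    {"severity": "error", "fingerprint": "sync-failed"},
]
ranked = aggregate_errors(events)
```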

Red Flags (Anti-Patterns)

SSH Debugging: If an admin has to SSH into a container to check a log file, Visibility has failed.
"It works on my machine": Admin views differ from actual system state due to caching or separate data paths.
Opaque Queues: Background jobs are processing, but the Admin UI shows no indication of queue depth or latency.
Silent Failures: A process fails but reports "Success" to the UI because the API call technically succeeded (even if the job failed).

Enforcement Mechanisms

Linting: CI checks reject code that uses console.log instead of the structured logger.
Contract Testing: Every MCP tool must return a standard telemetry object containing trace IDs.
Chaos Testing: Deliberately break a service in staging; verify the Admin UI accurately reports the specific error within 5 seconds.

Success Metrics

MTTD (Mean Time To Detect): < 1 minute for critical failures.
UI Latency: Admin dashboard reflects system state changes within 500ms.
Coverage: 100% of background processes have visible status indicators in the Admin UI.

2. Admin Authority (Guardrails, Not God Mode)

Core Concept

Admins have high privileges, but they are not above the law. The "law" is defined by the immutable contracts (API_CONTRACTS.md, ARCHITECTURE.md). Admin tools are powerful agents of the system, but they cannot force the system into an invalid state or override versioned constraints without a formal schema change.

Why It Matters

Integrity: If admins can bypass validation logic, data corruption is inevitable.
Consistency: The system behaves predictably only if rules apply to everyone, including admins.
Safety: Preventing "God Mode" prevents accidental catastrophic deletion or misconfiguration.

How to Apply

1. Validated Mutations Only

Admin actions go through the exact same API validation pipes as user actions.

Do not write directly to the database.
Do call the SystemMutation API which enforces schemas.
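A rough sketch of the idea: one validation function sits in front of the only write path, and the caller's role never changes which checks run. The `SystemMutation` class body, `validate_token_update`, and the set of token types are hypothetical stand-ins for the real API and schemas.

```python
class ValidationError(Exception):
    """Raised when a mutation payload violates the schema."""


# Assumed set of token types, for illustration only.
TOKEN_TYPES = {"color", "spacing", "typography"}


def validate_token_update(payload):
    """Schema check shared by every caller, admin or not."""
    name = payload.get("name")
    if not isinstance(name, str) or not name:
        raise ValidationError("name must be a non-empty string")
    if payload.get("type") not in TOKEN_TYPES:
        raise ValidationError(f"type must be one of {sorted(TOKEN_TYPES)}")
    if "value" not in payload:
        raise ValidationError("value is required")
    return {"name": name, "type": payload["type"], "value": payload["value"]}


class SystemMutation:
    """Hypothetical service layer: the only way to change stored tokens."""

    def __init__(self):
        self.tokens = {}

    def update_token(self, actor_role, payload):
        # actor_role is accepted for auditing but never skips validation.
        clean = validate_token_update(payload)
        self.tokens[clean["name"]] = clean
        return clean


api = SystemMutation()
api.update_token("admin", {"name": "primary", "type": "color", "value": "#0055ff"})
```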

2. Immutable Contract Protection

Admin UIs cannot modify Tier 1 documents directly.

To change a contract, an Admin must trigger a Proposal Workflow (which creates a PR/Version Bump).
Admins approve/merge changes; they do not "edit" them in a WYSIWYG editor.

3. Bounded Scope

Admin tokens have scopes. A "User Admin" cannot modify "System Architecture". A "Deployment Admin" cannot "Delete Users".
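A scope check along these lines would enforce the boundary. The scope names `admin:users`, `admin:deploy`, and `admin:full` come from the checklist in this section; the operation names, and the choice to let `admin:full` satisfy any known operation, are assumptions.

```python
# Hypothetical mapping from operation to the scope it requires.
REQUIRED_SCOPE = {
    "delete_user": "admin:users",
    "promote_deployment": "admin:deploy",
}


def is_allowed(token_scopes, operation):
    """Return True only if the token carries the scope the operation needs."""
    required = REQUIRED_SCOPE.get(operation)
    if required is None:
        # Operations with no declared scope are denied by default.
        return False
    return required in token_scopes or "admin:full" in token_scopes


deployment_admin = {"admin:deploy"}
user_admin = {"admin:users"}
```

Denying unknown operations by default keeps a newly added admin tool from being callable before someone decides which scope guards it.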

Implementation Checklist

No Direct DB Access: Admin API endpoints use the core service layer, never raw SQL.
Contract Validation: Every admin input is validated against the Zod schemas defined in API_CONTRACTS.md.
Immutable File Locks: The file system prevents the Admin process from writing to docs/01_core/ except via specific versioning tools.
Role Scopes: JWT tokens for admins contain specific scopes (admin:users, admin:deploy, admin:full).

Red Flags (Anti-Patterns)

"Force" Flags: API parameters like ?force=true that bypass validation logic.
Direct SQL Tools: An admin panel that is just a GUI over a raw SQL query.
Contract Overrides: Admin tools that allow changing a Design Token's type from "color" to "spacing" without a major version bump.

Enforcement Mechanisms

Middleware: API middleware explicitly checks for "Immutable" targets and rejects modifications.
Code Review: Any PR introducing a "bypass validation" function is automatically flagged for architectural review.
Penetration Testing: Attempt to use Admin API to inject invalid data; system must reject it with 400 Bad Request.

Success Metrics

Contract Violations: 0 successful admin actions that violate a defined schema.
Drift: 0 discrepancies between the file-system contracts and the runtime database state.

3. Admin Accountability (The Audit Trail)

Core Concept

Every action taken by an admin is a signed, timestamped, and immutable record. There is no concept of an "anonymous admin" or "system action" without attribution. Accountability here means non-repudiation: an admin cannot credibly deny an action the record attributes to them.

Why It Matters

Security: In the event of a breach, we must know exactly which account was compromised and what they did.
Compliance: Many regulatory standards (SOC2, HIPAA) require strict access logging.
Rollback: To undo a mistake, you must know exactly what the mistake was and the state before it happened.

How to Apply

1. The 5 Ws of Logging

Every Audit Log entry must contain:

Who: User ID + IP Address + Session ID.
What: The specific operation (Tool Name + Arguments).
Where: The resource ID affected.
When: ISO 8601 Timestamp (UTC).
Why: (Optional) A "Reason" field for sensitive actions (e.g., "Deleting user X per GDPR request").
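The five fields map naturally onto a single record type. In the sketch below, `AuditEvent` and `actor_id` follow the names used later under Enforcement Mechanisms; `build_audit_event`, the remaining field names, and the list of sensitive tools are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumed set of tools that require a stated reason.
SENSITIVE_TOOLS = {"delete_user", "ban_user", "force_deploy"}


@dataclass(frozen=True)
class AuditEvent:
    actor_id: str                  # Who
    ip_address: str                # Who
    session_id: str                # Who
    tool_name: str                 # What
    arguments: dict                # What
    resource_id: str               # Where
    timestamp: str                 # When (ISO 8601, UTC)
    reason: Optional[str] = None   # Why


def build_audit_event(actor_id, ip_address, session_id, tool_name,
                      arguments, resource_id, reason=None, now=None):
    """Build an audit entry, refusing unattributed or unexplained sensitive actions."""
    if not actor_id:
        raise ValueError("actor_id is required")
    if tool_name in SENSITIVE_TOOLS and not reason:
        raise ValueError(f"{tool_name} requires a reason")
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return AuditEvent(actor_id, ip_address, session_id, tool_name,
                      arguments, resource_id, moment.isoformat(), reason)


entry = build_audit_event(
    "alice", "10.0.0.5", "sess-1", "delete_user", {"user": "bob"}, "user:bob",
    reason="Deleting user per GDPR request",
    now=datetime(2025, 12, 9, 12, 0, 0, tzinfo=timezone.utc),
)
```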

2. Immutable Audit Store

Audit logs are append-only: entries can be added and read, but never changed or removed.

Admin tools cannot delete or modify audit logs.
Ideally, ship logs immediately to an external secure storage (S3 Object Lock, Datadog, etc.).
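For illustration only, the sketch below shows an in-memory store whose public interface offers append and read but no update or delete. The hash chain is an addition of this example, not something the document prescribes; it is one common way to make tampering detectable. `AppendOnlyAuditLog` is a hypothetical name.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


class AppendOnlyAuditLog:
    """Hypothetical audit store: append and read only, with a tamper-evident chain."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        self._entries.append({"event": body, "prev_hash": prev_hash, "hash": digest})
        return digest

    def entries(self):
        """Return copies of the stored events; callers cannot mutate the log."""
        return [json.loads(e["event"]) for e in self._entries]

    def verify(self):
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = GENESIS_HASH
        for e in self._entries:
            expected = hashlib.sha256((prev_hash + e["event"]).encode("utf-8")).hexdigest()
            if e["prev_hash"] != prev_hash or e["hash"] != expected:
                return False
            prev_hash = e["hash"]
        return True


log = AppendOnlyAuditLog()
log.append({"actor_id": "alice", "tool": "promote"})
log.append({"actor_id": "bob", "tool": "rotate_key"})
```

In production the same guarantee would more likely come from the storage layer, such as the object-lock or external log services mentioned above.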

3. Session Context

If an Admin assumes another user's identity (Impersonation), the log must reflect: Actor: Admin_Alice acting_as User_Bob.

Implementation Checklist

Audit Middleware: Global interceptor on all /admin/* routes that records the request/response.
Reason Prompts: UI modals that force admins to type a reason before performing destructive actions (Delete, Ban, Force Deploy).
Read-Only Audit UI: A dedicated page in the Admin panel to search/filter audit logs (no delete button).
Export Capability: Ability to export logs for external compliance review.

Red Flags (Anti-Patterns)

Shared Credentials: "admin@company.com" used by 5 different people.
Generic "System" Logs: Changes attributed to "System" when they were actually triggered by a human button press.
Missing Context: A log saying "Updated Config" without showing what changed (diff).

Enforcement Mechanisms

Database Triggers: Prevent UPDATE or DELETE operations on the audit_logs table.
Schema Validation: Ensure actor_id is a required field in the AuditEvent schema.
Regular Audits: Automated weekly report sending a summary of Admin actions to the CTO/Lead.

Success Metrics

Attribution Rate: 100% of state-changing operations have a linked human Actor ID.
Audit Lag: Time from Action -> Audit Log availability < 1 second.

4. Admin-Developer Partnership

Core Concept

Clear separation of concerns strengthens the system. Developers propose and implement; Admins approve, monitor, and enforce. Admins do not write code in production; Developers do not restart production services manually.

Why It Matters

Stability: Prevents "hot fixes" that bypass CI/CD and review processes.
Focus: Developers focus on feature velocity; Admins focus on system reliability and governance.
Checks and Balances: Requires two distinct approvals for major changes (Code Review + Deployment Approval).

How to Apply

1. Promotion Workflows

Developers push to staging. Admin tools control the gate from staging -> production.

The Admin UI shows "Pending Changes" (diffs).
The Admin clicks "Promote" to execute the deployment.
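The gate can be sketched as a small state machine: developers submit, admins inspect pending diffs and promote. `PromotionGate` and the role strings are hypothetical, and a real gate would trigger a deployment rather than just record the promotion.

```python
class PromotionGate:
    """Hypothetical staging -> production gate controlled by admins."""

    def __init__(self):
        self._pending = {}
        self.promoted = []

    def submit(self, change_id, diff):
        """Called when a developer's change lands in staging."""
        self._pending[change_id] = diff

    def pending_changes(self):
        """What the Admin UI would list under "Pending Changes"."""
        return dict(self._pending)

    def promote(self, change_id, approver_role):
        if approver_role != "admin":
            raise PermissionError("only admins may promote to production")
        if change_id not in self._pending:
            raise KeyError(change_id)
        diff = self._pending.pop(change_id)
        self.promoted.append(change_id)
        return diff


gate = PromotionGate()
gate.submit("change-1", "+ button radius 4px")
```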

2. Feedback Loops

Admin tools provide feedback to developers.

If a deployment fails, the Admin tool generates a report linked to the Developer's commit.
Developers consume "Admin" metrics (performance, errors) via their own views/tools.

3. Configuration Management

Developers define "Default Config" in code. Admins manage "Environment Config" (secrets, scaling limits) in the platform.
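A minimal sketch of this split, assuming the platform only accepts overrides for an explicit set of admin-managed keys. All keys and values here are invented for the example.

```python
# Defaults that developers ship in code (hypothetical values).
DEFAULT_CONFIG = {"log_level": "info", "max_replicas": 2, "request_timeout_s": 30}

# Keys that admins are allowed to set per environment (an assumption of this sketch).
ADMIN_MANAGED_KEYS = {"max_replicas", "api_secret"}


def resolve_config(defaults, environment):
    """Layer admin-managed environment values over developer defaults."""
    unknown = set(environment) - ADMIN_MANAGED_KEYS
    if unknown:
        raise ValueError(f"not admin-managed: {sorted(unknown)}")
    merged = dict(defaults)  # copy so the defaults are never mutated
    merged.update(environment)
    return merged


production = resolve_config(DEFAULT_CONFIG, {"max_replicas": 6, "api_secret": "[redacted]"})
```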

Implementation Checklist

Deployment Gates: Production deployments require an explicit Admin approval signal (API call or UI click).
Environment Isolation: Developers have full access to Dev/Staging; Read-only access to Production.
Service Catalog: Admin UI lists all services with their "Owner" (Developer Team) clearly contactable.
Incident Routing: Admin tools automatically route alerts to the specific developer/team who owns the failing component.

Red Flags (Anti-Patterns)

Hot-Patching: Admins editing script files directly on the server to fix a bug.
Cowboy Deploys: Developers bypassing the Admin gate to push directly to production.
Vague Ownership: A service crashes and Admins don't know which Developer to page.

Enforcement Mechanisms

CI/CD Pipelines: Pipeline explicitly halts at Staging and waits for an AdminApproval signal.
RBAC: Developer accounts have READ permission on Prod; Admin accounts have DEPLOY permission.

Success Metrics

Change Failure Rate: < 1% of promotions to production result in rollback.
MTTR (Mean Time To Recovery): Reduced by clear ownership routing.

5. Admin Isolation (The Lifeboat Principle)

Core Concept

The Admin Plane is distinct from the Data Plane. If the user-facing application crashes, is under DDoS attack, or the database is locked, the Admin tools must still function. You cannot fix a broken system if the tool you use to fix it is also broken.

Why It Matters

Resilience: The admin panel is the lifeboat. It must float when the ship sinks.
Security: Isolating admin traffic prevents attackers from pivoting from public endpoints to admin tools.
Resource Contention: Heavy user load shouldn't make the Admin dashboard sluggish.

How to Apply

1. Separate Infrastructure

Ideally, run Admin tools on separate containers/pods or even a separate cluster/network.

Separate subdomains (admin.internal vs app.com).
Separate ingress controllers to prevent bandwidth starvation.

2. Dedicated Resources

Admin API limits are distinct from User API limits.

Rate limits applied to users must not throttle Admins.
Reserve compute/memory specifically for Admin operations.
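The point about distinct limits can be shown with two independent counters, one per plane. This is a deliberately simple fixed-window limiter with made-up limits; `FixedWindowLimiter` and `handle_request` are hypothetical names.

```python
class FixedWindowLimiter:
    """Very small fixed-window counter; reset() would be called once per window."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

    def reset(self):
        self.count = 0


# Separate budgets per plane (illustrative numbers): user traffic cannot consume admin capacity.
LIMITERS = {"user": FixedWindowLimiter(3), "admin": FixedWindowLimiter(3)}


def handle_request(plane):
    """Return the HTTP status the request would receive."""
    return 200 if LIMITERS[plane].allow() else 429


user_statuses = [handle_request("user") for _ in range(5)]
admin_status = handle_request("admin")
```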

3. Out-of-Band Access

Maintain a "break-glass" mechanism.

If the API is totally unresponsive, have a CLI tool or direct operational port that bypasses the main load balancer.

Implementation Checklist

Separate Build: Admin App is a separate build artifact from the User App.
Connection Pooling: Admin tools use a dedicated database connection pool (so user load doesn't starve admin access).
External Monitoring: Admin uptime is monitored from an external location (e.g., UptimeRobot) distinct from user monitoring.
Failover Testing: Simulate a 100% CPU load on the User App; verify the Admin App still loads with less than a 10% latency increase (the target under Success Metrics).

Red Flags (Anti-Patterns)

Monolith Integration: Admin routes (/admin) served by the exact same Express/FastAPI instance as user traffic.
Shared Rate Limits: Admin gets 429 Too Many Requests because users are spamming the API.
Single Point of Failure: The Auth service goes down, locking Admins out of the system they need to fix.

Enforcement Mechanisms

Infrastructure as Code: Terraform/Docker Compose defines explicit resource reservations for admin-service.
Load Testing: Load tests specifically target the user plane while simultaneously measuring admin plane latency.

Success Metrics

Admin Availability: > 99.99% (higher than User Availability).
Latency during Incident: Admin latency increases < 10% during high-severity user-facing incidents.

6. Role-Based Configuration & Visibility (The One-Click Pattern)

Core Concept

Admin configures the system once (immutable, versioned). All teams access only their relevant features via role-based access control (RBAC). Operations flow from admin configuration without requiring teams to re-enter information. Components are automatically discovered (not manually imported via URLs). Configuration hierarchy flows: System Settings (Admin) → Project Settings (Admin/Team) → User Secrets (Encrypted per-user).

Why It Matters

Velocity: Teams don't spend time reconfiguring Figma URLs, Storybook paths, or Jira credentials. Admin sets once, everyone uses.
Consistency: System-wide configuration ensures all teams operate with identical understanding of component locations, credentials, and settings.
Governance: Single source of truth for configuration means auditing, compliance, and rollback are straightforward.
Autonomy: Each team sees only the features they need—no cognitive overload, no accidental admin access.

How to Apply

1. Configuration Hierarchy (Immutable Contracts)

Create a three-tier configuration model:

Tier 1: System Configuration (Immutable, Admin-only)

{
  "system_id": "dss-v2",
  "version": "2.1.0",
  "figma_api_base": "https://api.figma.com",
  "atlassian_api_base": "https://api.atlassian.cloud",
  "storybook_base_url": "https://storybook.designsystem.internal",
  "component_registry_update_interval": "5m",
  "audit_retention_days": 365
}

Tier 2: Project Configuration (Versioned, Admin/Team-settable)

{
  "project_id": "tokens-library",
  "figma_file_id": "abc123xyz",
  "figma_team_id": "team456",
  "storybook_project_path": "/projects/tokens",
  "jira_project_key": "TOKENS",
  "slack_channel": "#tokens-alerts",
  "skin_selection": ["material", "ios"],
  "version": "1.0.0"
}

Tier 3: User Secrets (Encrypted, Per-user)

{
  "user_id": "alice@company.com",
  "figma_api_key": "[encrypted]",
  "atlassian_api_token": "[encrypted]",
  "slack_bot_token": "[encrypted]"
}

Each tier is versioned separately. A change to Tier 2 triggers a MINOR version bump (per Principle 2). Every change is recorded in the audit trail (per Principle 3).
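The versioning and audit rule for Tier 2 might be implemented along these lines. `bump_minor` and `update_project_config` are hypothetical helpers, and the audit log is modeled as a plain list to keep the sketch self-contained.

```python
def bump_minor(version):
    """Increment the MINOR part of a MAJOR.MINOR.PATCH version and reset PATCH."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return f"{major}.{minor + 1}.0"


def update_project_config(config, changes, audit_log, actor_id):
    """Apply changes to a Tier 2 config, bump its version, and record the diff."""
    if "version" in changes:
        raise ValueError("version is managed automatically")
    diff = {
        key: {"old": config.get(key), "new": value}
        for key, value in changes.items()
        if config.get(key) != value
    }
    if not diff:
        return config  # nothing changed, so no bump and no audit entry
    updated = {**config, **changes, "version": bump_minor(config["version"])}
    audit_log.append({"actor_id": actor_id, "version": updated["version"], "diff": diff})
    return updated


audit_log = []
project = {"project_id": "tokens-library", "slack_channel": "#tokens-alerts", "version": "1.0.0"}
updated = update_project_config(
    project, {"slack_channel": "#ds-alerts"}, audit_log, "alice@company.com"
)
```

Returning a new dictionary instead of mutating the old one keeps the previous version available for rollback.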

2. Component Registry (Automatic Discovery)

Don't ask admins for component URLs. Instead:

Admin enables "Component Indexing" once
System polls/watches component sources (Figma, Git, Storybook) on configurable interval
All components automatically registered with metadata (name, path, Figma ID, Storybook URL, owner)
Teams query registry by name, project, or tag

Component Registry Entry:
{
  "name": "Button",
  "project": "tokens-library",
  "figma_component_id": "comp:123",
  "figma_url": "https://figma.com/file/...",
  "storybook_url": "https://storybook/...",
  "last_indexed": "2025-12-08T14:23:00Z",
  "owner": "UI Team"
}
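Querying the registry by name, project, or tag could be sketched as follows. The entry is trimmed to a few fields, and the `tags` field is an assumption, since the example entry above does not show one. `ComponentRegistry` and `RegistryEntry` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    name: str
    project: str
    owner: str
    tags: frozenset = frozenset()  # assumed field, not in the example entry


class ComponentRegistry:
    """Hypothetical in-memory registry filled by the indexer, queried by teams."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Re-indexing the same component replaces the previous entry.
        self._entries[(entry.project, entry.name)] = entry

    def find(self, name=None, project=None, tag=None):
        return [
            e for e in self._entries.values()
            if (name is None or e.name == name)
            and (project is None or e.project == project)
            and (tag is None or tag in e.tags)
        ]


registry = ComponentRegistry()
registry.register(RegistryEntry("Button", "tokens-library", "UI Team", frozenset({"form"})))
registry.register(RegistryEntry("Icon", "tokens-library", "UX Team"))
registry.register(RegistryEntry("Button", "tokens-library", "UI Team", frozenset({"form", "action"})))
```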

3. Team-Specific Dashboards (Role-Based Views)

Each team sees a customized portal:

Admin Dashboard:

System settings (configuration, version management)
Project creation & onboarding
User management & key rotation
Audit logs & compliance reports
Team role assignments

UI Team Dashboard:

Figma extract (one-click sync with change detection)
QuickWins analysis (non-breaking improvements)
Regression tool (full style migration to DSS)
Metrics & performance trends

UX Team Dashboard:

Component listing (filterable, linked to Storybook + Figma)
Token listing (with update/export options)
Icon listing (with usage counts)
Figma plugin customization & download
Metrics dashboard

QA Team Dashboard:

Metrics dashboard
Component listing (with comparison links)
ESRE text area (testing)
Component screenshot comparison (create issues)
Jira issues list (assignable, linked)

All Teams:

Metrics frontpage (error rate, deployment health, sync status)
AI chat sidebar (collapsible, for design questions/suggestions)
Jira integration (view & create issues)
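The role-to-feature mapping above can be captured in a lookup table that the dashboard consults before rendering. The feature identifiers below are shortened, hypothetical versions of the listed items, and only a few per role are shown.

```python
# Partial, illustrative mapping of roles to dashboard features.
ROLE_FEATURES = {
    "admin": {"system_settings", "user_management", "audit_logs"},
    "ui": {"figma_extract", "quickwins_analysis", "regression_tool"},
    "ux": {"component_listing", "token_listing", "icon_listing"},
    "qa": {"component_listing", "screenshot_comparison", "jira_issues"},
}

# Features every known team sees.
SHARED_FEATURES = {"metrics_frontpage", "ai_chat_sidebar", "jira_integration"}


def visible_features(role):
    """Return the features a role may see; unknown roles see nothing."""
    if role not in ROLE_FEATURES:
        return set()
    return ROLE_FEATURES[role] | SHARED_FEATURES


qa_view = visible_features("qa")
```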

4. One-Click Operations Pattern

Once admin configures Figma URL, Storybook path, and Jira project:

UI team clicks "Extract from Figma" → system uses pre-configured Figma ID + credentials
UX team clicks "Component Comparison" → system loads Storybook + Figma automatically
QA team clicks "Create Issue" → system pre-fills with component metadata
No re-prompts, no manual URL entry
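As a sketch of the QA "Create Issue" flow, the function below builds the issue payload entirely from the Tier 2 project config and a registry entry, so the user only supplies a summary. `build_issue_draft` and the payload shape are hypothetical and do not represent the Jira API.

```python
def build_issue_draft(project_config, registry_entry, summary):
    """Pre-fill an issue from stored configuration instead of prompting the user."""
    missing = [key for key in ("jira_project_key",) if not project_config.get(key)]
    if missing:
        # Fail loudly so the admin fixes the config; never fall back to a prompt.
        raise LookupError(f"project config is missing: {missing}")
    return {
        "project_key": project_config["jira_project_key"],
        "summary": f"[{registry_entry['name']}] {summary}",
        "links": {
            "figma": registry_entry["figma_url"],
            "storybook": registry_entry["storybook_url"],
        },
        "owner": registry_entry["owner"],
    }


project_config = {"project_id": "tokens-library", "jira_project_key": "TOKENS"}
button = {
    "name": "Button",
    "figma_url": "https://figma.com/file/...",
    "storybook_url": "https://storybook/...",
    "owner": "UI Team",
}
draft = build_issue_draft(project_config, button, "Hover state differs from Figma")
```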

Implementation Checklist

Configuration Schema: Define Tier 1 (system), Tier 2 (project), Tier 3 (user) schemas in TypeScript
Configuration Versioning: Track version history, enable rollback to previous config
Component Registry: Implement indexer that scans Figma, Git, Storybook on interval
RBAC Middleware: Middleware that filters dashboard features based on user role
Dashboard Routing: Separate Next.js routes for /admin, /ui-team, /ux-team, /qa-team
Encrypted Secrets: Use libsodium or similar for Tier 3 user secrets
Config API Endpoints: GET /api/config/system, GET /api/config/project/{id}, POST /api/config/validate
Team Feature Flags: Runtime flags for which teams see which features
One-Click Tests: Verify each team's quick-action buttons work with pre-configured values

Red Flags (Anti-Patterns)

Hardcoded Credentials: Team code contains hardcoded Figma tokens or API keys
Manual Configuration Prompts: Team operations prompt for URLs ("Enter Figma URL...") instead of using config
Config Duplication: Same Figma URL stored in multiple places (settings file, env vars, database)
No Registry: Admins manually track which components exist; teams ask for component URLs
Shared Dashboards: All teams see all features, including admin-only config screens
Stale Configuration: Config doesn't update when projects move or credentials rotate

Enforcement Mechanisms

Pre-commit Hooks: Reject code commits containing Figma API keys or URLs
Secrets Scanning: Automated detection of leaked credentials in logs
Config Validation: Every config change is schema-validated before persisting
RBAC Testing: Jest tests verify each role sees only their features
Integration Tests: "One-click" operations tested end-to-end with real Figma/Storybook calls
Audit Logging: Every config access/change logged with actor, timestamp, diff

Success Metrics

Configuration Centralization: 100% of credentials and URLs stored in config, 0 in code
Component Discovery: 100% of components auto-indexed, admin never manually registers
Team Independence: Each team can complete their tasks without asking admin for URLs
One-Click Success Rate: > 95% of team quick-actions complete without additional input
Configuration Drift: < 2% discrepancy between config and runtime state
Credential Rotation: User can rotate API keys in < 2 minutes with no team disruption

Last Updated: 2025-12-09 | Version: 2.0.0 (Added Principle 6: Role-Based Configuration & Visibility) | Status: PRODUCTION - All 6 principles active and enforced
