cloud-sheriff

IAM Permissions Explorer — Design Document
DESIGN_DOC_V1.0

A design brief for identifying and remediating excessive or unused cloud IAM permissions — bridging the gap between what identities can access and what they actually use.

The Entitlement Problem

Cloud environments grant permissions far faster than they revoke them. The result: identities accumulate massive, mostly-unused access over time — creating an attack surface that grows invisibly.

Industry analysis projected that, by the end of 2023, 75% of security failures would result from inadequate management of identities and privileges. Research from Microsoft's Entra team indicates that the majority of granted cloud permissions go entirely unused.

"Dormant entitlements constitute a vast, unnecessary attack surface that threat actors actively exploit to escalate privileges and facilitate lateral movement."

Yet enforcing least privilege at scale is notoriously difficult — not because engineers lack intent, but because the tooling fails them.

The Visibility Gap

The core engineering challenge is a Gap Analysis: identifying the delta between what IAM policies say an identity can do versus what telemetry logs record it actually doing.

This sounds simple. In practice it is plagued by:

Taxonomical misalignment
IAM service prefixes differ from CloudTrail event sources: ses in IAM → email in logs. Action names diverge too: the IAM action s3:ListAllMyBuckets authorises the ListBuckets API call that the CLI command aws s3 ls issues.
Telemetry blind spots
CloudTrail omits high-volume data events (S3 GetObject, Lambda invocations) by default. Some actions like CloudWatch:PutMetricData are never recorded at all.
API versioning mutations
Legacy SDK calls append version strings: ListDistributions2017_03_25. A robust tool needs regex or ML classifiers to normalise these back to root IAM actions.
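A minimal sketch of the normalisation described above, using a regex rather than an ML classifier. The mapping tables are hypothetical and deliberately tiny: they contain only the pairs named in this document, and a real tool would need a maintained dictionary. The function name to_iam_action is illustrative.

```python
import re

# Hypothetical, hand-maintained tables; only the pairs named in this document.
EVENT_SOURCE_TO_IAM_PREFIX = {
    "email": "ses",
    "monitoring": "cloudwatch",
}
# API operations whose IAM action name differs from the logged event name.
ACTION_OVERRIDES = {
    ("s3", "ListBuckets"): "ListAllMyBuckets",
}

# Trailing API version suffix such as "2017_03_25".
_VERSION_SUFFIX = re.compile(r"_?\d{4}_\d{2}_\d{2}$")


def to_iam_action(event_source: str, event_name: str) -> str:
    """Translate a CloudTrail (eventSource, eventName) pair to an IAM action."""
    service = event_source.split(".", 1)[0]  # "email.amazonaws.com" -> "email"
    prefix = EVENT_SOURCE_TO_IAM_PREFIX.get(service, service)
    operation = _VERSION_SUFFIX.sub("", event_name)  # strip the version string
    operation = ACTION_OVERRIDES.get((prefix, operation), operation)
    return f"{prefix}:{operation}"
```

Events that have no log record at all (the telemetry blind spots) cannot be recovered this way; they would need to be treated as "unknown" rather than "unused".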

What This Tool Must Do

① Visualise

Translate raw log-policy delta into intuitive executive-level metrics (PCI, Risk Margin) and interactive inheritance graphs — not raw JSON arrays.

② Remediate

Automatically generate least-privilege replacement policies. Show a side-by-side diff so engineers can validate changes before deploying. One-click export to IaC.

③ Empower

Give developers self-service policy generation. Break the centralised bottleneck where a single security team reviews every policy request — a known velocity killer.

Primary Operator

Sarah Chen
Senior Cloud Security Engineer · AWS + Azure · 6 yrs exp

Background

Sarah sits at the intersection of security governance and developer velocity. She owns IAM policy reviews, compliance audits, and incident response — while simultaneously being the person developers blame when deployments are slow.

She manages permissions for 3,000+ identities across two cloud environments. She uses the AWS console daily, has scripted her own ad-hoc gap analyses in Python, and is deeply familiar with CloudTrail. She distrusts "magic" automation because she has seen it break production.

Goals

Pass the next SOC2 audit
Needs evidence that orphaned accounts have been cleaned up and least privilege is enforced across all environments.
Reduce her policy review backlog
Currently 40+ open IAM policy requests from developers. Each one requires manual context-gathering that takes 30–90 minutes.
Not be the reason a service went down
Her deepest fear: an automated policy revocation breaks a critical production job that only runs once a quarter.

Three Core Pain Points

01
Fear of breaking production
A role that has only used read permissions for 89 days might need write access on day 90 for a quarterly batch job. Traditional gap analysis tools have no predictive context. When stakes of an outage far outweigh the abstract risk of an over-permissioned account, the default posture degrades to "allow all."

Design implication: Every remediation suggestion must show confidence signals — last-used timestamp, usage frequency heatmap, inferred job schedule — not just "this was unused."
02
Hidden dependencies and inheritance complexity
A developer's effective permissions derive from a labyrinthine chain: Active Directory groups → assumed cross-account IAM roles → nested Resource-Based Policies → Permission Boundaries → Service Control Policies. When the source of a privilege is obfuscated, engineers hesitate to alter anything.

Machine identities make this worse — CI pipelines, Lambda functions, and containers rotate rapidly and have wildly different scopes than humans.

Design implication: The tool must visualise the full inheritance chain for any selected identity — not just the attached policies.
03
Centralisation bottleneck and developer friction
When a central security team must manually review all IAM policies, they become the primary bottleneck. Developers — lacking context on what permissions they actually need — request wildcard policies (s3:*) to guarantee nothing breaks. The security team, lacking application context, cannot meaningfully push back.

This creates an adversarial dynamic characterised by delayed deployments and mutual distrust.

Design implication: Developers must be able to self-serve scoped policy generation using their activity logs, without going through the security team for initial drafts.

Application Developer

Needs to understand why their deployment failed (AccessDenied) and self-generate the minimum policy required. Values speed, not security depth.

CISO / Security Director

Needs a single-number risk posture score to report to the board. Does not want to read policy JSON — wants trend lines and compliance percentages.

3-Screen Design

Three interconnected screens covering discovery → investigation → remediation. Annotations explain key design decisions.

Screen 1 — Risk Dashboard
Screen 2 — Identity Deep Dive
Screen 3 — Remediation Diff
IAM Permissions Explorer · Dashboard
Dashboard
Identities
Policies
Findings
Settings
ENVIRONMENT
AWS · us-east-1
+ Azure · East US
Permissions Overview
Tracking window: 90 days · Last scan: 2 min ago
Export Report
▶ Run Scan
PERMISSIONS CREEP INDEX: 68 · ↑ +4 from last week
ORPHANED ACCOUNTS: 127 · Zero activity, 90 days
LEAST PRIV. ADHERENCE: 41% · Target: >80%
UNUSED PERMISSIONS: 2,841 · Across 3,214 identities
Risk Composition
68
PCI Score
Critical identities (28%)
High risk (27%)
Compliant (45%)
Remediation Tradeoff (interactive)
Slide to see impact of deploying N new scoped policies
Policies: 50
PCI Reduction: −18 · New score: 50
Est. remediation effort: ~4 hrs
Identities by Risk · 127 showing · Sort: Risk ↓
Identity / Role | Type | Risk | Permissions Used | Last Active | Action
arn:aws:iam::prod/DataPipelineRole (Attached: AdministratorAccess) | ROLE | CRITICAL | 8% | 6 days ago | Inspect →
svc-account@project.iam (Azure · Subscription: prod-001) | SVC ACC | HIGH | 22% | 14 days ago | Inspect →
john.doe@company.com (IAM User · Groups: dev-team, data-eng) | USER | MEDIUM | 44% | Yesterday | Inspect →
LambdaExecutionRole-ETL (Attached: S3FullAccess, DynamoDBFull) | ROLE | LOW | 78% | 1 hr ago | Inspect →

Screen 1 Annotations

1
Permissions Creep Index (PCI) — Single top-line metric quantifying the divergence between granted and used permissions. Gives CISOs a boardroom-ready number without needing to interpret raw data.
2
Remediation Tradeoff Slider — Interactive element showing how deploying N new scoped policies reduces PCI. Turns risk reduction into a game with a visible, quantified payoff. Directly addresses Sarah's "is it worth it?" hesitation.
3
Usage bar per identity — Ratio of used/granted permissions shown as a progress bar. Colour-coded red/amber/green by severity. Scannable at a glance — no number reading required for triage.
4
Multi-cloud unified view — AWS and Azure identities shown in one table, distinguished by the identity type badge. Reduces context-switching between console UIs.
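The usage bar in annotation 3 can be sketched as a simple used/granted ratio. The severity thresholds below are assumptions chosen only so that the four sample rows in the wireframe (8%, 22%, 44%, 78%) land in the bands shown; the document does not define them, and a real risk rating would presumably also weigh which policy is attached.

```python
def usage_ratio(used: set, granted: set) -> float:
    """Fraction of granted permissions that were actually exercised."""
    if not granted:
        return 0.0
    return len(used & granted) / len(granted)


def risk_band(ratio: float) -> str:
    """Map a usage ratio to a badge; thresholds are illustrative assumptions."""
    if ratio < 0.10:
        return "CRITICAL"
    if ratio < 0.25:
        return "HIGH"
    if ratio < 0.50:
        return "MEDIUM"
    return "LOW"
```

Keeping the banding in one function makes it easy to tune the thresholds per organisation without touching the table rendering.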
IAM Permissions Explorer · Identity Deep Dive — DataPipelineRole
QUICK FILTERS
▶ Critical only
▶ Unused 90d+
▶ Machine identities
▶ Cross-account roles
← Back to All Identities
DataPipelineRole
arn:aws:iam::123456789:role/DataPipelineRole
CRITICAL RISK
Generate Fix →
Permission Usage (90 days)
s3:GetObject · 92%
s3:PutObject · 67%
dynamodb:GetItem · 34%
s3:DeleteBucket · 0%
iam:PassRole · 0%
ec2:TerminateInstances · 0%
+ 2,834 more unused · never called
⚠ Source: AdministratorAccess · Grants ALL AWS permissions
Access Inheritance Chain
SCP: DenyLeave
Org root
AdministratorAccess
Attached directly
DataPipelineRole
Lambda fn: etl-daily
CI: github-actions
Legend: SCP = Service Control Policy · POL = Managed Policy · ROLE = IAM Role
⚠ Critical Finding: Wildcard Admin Attached
This role has AdministratorAccess attached but has only called 3 distinct AWS services in the last 90 days out of 300+ available. The role is assumed by a Lambda function and a CI pipeline — neither of which requires administrative access. This constitutes a critical attack path: a compromised CI token could pivot to full account takeover.
Generate Least-Privilege Policy →
Detach Admin Policy
JIT Access Instead

Screen 2 Annotations

1
Per-action usage frequency bars — Each permission listed with its 90-day invocation frequency. Engineers see exactly which actions are genuinely used vs never called, with dangerous unused permissions (DeleteBucket, iam:PassRole) surfaced in red to demand attention.
2
Visual inheritance chain — Indented tree shows how permissions flow: SCP → Managed Policy → Role → Assumers. The engineer can see at a glance that two machine identities (Lambda + GitHub Actions) are assuming this overprivileged role, making it a critical lateral movement risk.
3
Context-aware remediation options — Because the role is entirely unused for admin actions, three options are offered: generate a scoped replacement policy, directly detach the admin policy (hard revocation), or convert to JIT access. The choice respects the engineer's risk tolerance.
4
Plain-English finding summary — A narrative explanation of WHY this is a problem, not just WHAT it is. Addresses Sarah's need to justify remediation decisions to developers without writing her own impact analysis.
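The inheritance chain in annotation 2 can be sketched as a walk over a parent map. This is only an illustration under two assumptions: the graph is acyclic, and each edge simply means "sits above in the chain" as drawn in the wireframe. The node labels are taken from Screen 2; the function name is hypothetical.

```python
def inheritance_chains(identity: str, parents: dict) -> list:
    """Return every root-to-identity path through the parent map."""
    ups = parents.get(identity, [])
    if not ups:
        return [[identity]]  # reached a root such as an SCP
    chains = []
    for parent in ups:
        for chain in inheritance_chains(parent, parents):
            chains.append(chain + [identity])
    return chains


# Edges as drawn in the Screen 2 wireframe (child -> list of parents).
PARENTS = {
    "Lambda fn: etl-daily": ["DataPipelineRole"],
    "CI: github-actions": ["DataPipelineRole"],
    "DataPipelineRole": ["AdministratorAccess"],
    "AdministratorAccess": ["SCP: DenyLeave"],
}

for chain in inheritance_chains("CI: github-actions", PARENTS):
    print(" → ".join(chain))
```

An identity with several parents (for example, a user in two groups) yields one chain per path, which is the case where the source of a privilege is hardest to trace by hand.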
IAM Permissions Explorer · Remediation — Policy Diff Review
CHANGE SUMMARY
− 2,841 removed
+ 6 added (scoped)
= 3 unchanged
Risk reduction
−34 PCI pts
← Back to DataPipelineRole
Review Proposed Policy Changes
Auto-generated based on 90-day usage analysis · Verify before deploying
↓ Export JSON
Open PR in GitHub
✓ Deploy Now
Confidence: High. None of the permissions being removed were used in the last 90 days. No scheduled jobs detected. Last invocation: 6 days ago, using only S3 and DynamoDB actions. Review quarterly batch jobs before deploying.
Current Policy (AdministratorAccess)
2,841 excess permissions
  {
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }]
  }
Proposed Policy (Least Privilege)
+6 scoped actions
  {
    "Version": "2012-10-17",
+   "Statement": [{
+     "Effect": "Allow",
+     "Action": [
+       "s3:GetObject",
+       "s3:PutObject",
+       "dynamodb:GetItem",
+       "dynamodb:PutItem",
+       "dynamodb:Query",
+       "logs:CreateLogGroup"
+     ],
+     "Resource": [
+       "arn:aws:s3:::etl-*",
+       "arn:aws:dynamodb:us-east-1:..."
+     ]
+   }]
  }
Test in Sandbox: Deploy to staging, run integration tests, then promote
Open GitHub PR: Add to IaC Terraform/CDK repo for review
Deploy Directly: Apply immediately. Rollback in 1 click if issues arise

Screen 3 Annotations

1
Confidence Signal Banner — Before seeing the diff, the engineer sees a plain-English confidence statement about WHY this recommendation is safe (or what to check). This directly addresses the #1 pain point: fear of breaking production. A quarterly-batch job warning appears proactively.
2
Synchronized side-by-side JSON diff — Left panel: current wildcard policy. Right panel: generated least-privilege policy. Colour-coded: red lines removed, green lines added. Panels scroll in sync. Key design choice: wildcards (*) are highlighted in red/green to make the "wildcard → specific action" replacement visually obvious.
3
Resource scoping visible in the diff — The new policy not only scopes actions but also scopes resources (e.g., arn:aws:s3:::etl-*). Engineers can verify the blast radius is correctly contained before approving.
4
Three deployment paths — Sandbox test, IaC PR, or direct deploy. This respects different org maturity levels. Every path includes rollback affordance. The IaC path is recommended (accent colour) to encourage GitOps practices.

Proposed Features, Prioritisation & Success Metrics

A write-up, equivalent to at most one page, synthesising the design rationale, feature tiers, and the KPI matrix used to measure success.

Feature Prioritisation

Features are ordered by the intersection of user impact (addresses a core pain point) and feasibility (achievable in an MVP timeframe without requiring deep ML infrastructure).

P1
CRITICAL
Unified Permission Gap Dashboard with PCI Score
A single-screen view showing the Permissions Creep Index, Least Privilege Adherence Score, and orphaned account count — all derived from a backend gap analysis. Identities ranked by risk with usage bars. This is the entry point for every user session and must communicate "what's on fire" within 5 seconds of loading.
Addresses Pain #1 · CISO-facing · MVP core · Multi-cloud
P1
CRITICAL
Per-Identity Inheritance Graph + Usage Breakdown
Deep dive view for any identity: visualises the full access chain (SCP → Policy → Role → Principal), lists every granted action with its 90-day invocation frequency, and highlights dangerous unused permissions in red. Eliminates the need for Sarah to manually trace inheritance across 5 different console screens.
Addresses Pain #2 · Security engineer-facing · MVP core
P1
CRITICAL
AI-Generated Policy + Synchronized JSON Diff Viewer
Auto-generates a least-privilege replacement policy using 90-day usage data. Presents changes in a side-by-side diff with colour-coded additions/removals, scoped resource ARNs, and a confidence signal banner (flags potential quarterly jobs). One-click deployment with three paths: sandbox, IaC PR, or direct. This directly neutralises the fear of breakage by making changes transparent and reversible.
Addresses Pain #1 · Addresses Pain #3 · Core differentiator · MVP core
P2
HIGH
Remediation Tradeoff Slider
Interactive slider on the dashboard: "If I fix N identities, my PCI drops by X." Gamifies risk reduction and helps security teams prioritise remediation sprints. Gives the CISO an "effort vs impact" view for resource allocation. Technically straightforward once the gap analysis engine is running.
CISO-facing · Engagement driver · Post-MVP v1
P2
HIGH
Developer Self-Service Policy Generator
A simplified interface where developers can paste their Lambda/EC2 role ARN, select a date window, and receive a least-privilege policy draft without involving the security team. Addresses the centralisation bottleneck. Includes a "request review" button that creates a ticket for the security team — not a full bypass, but a massive acceleration.
Addresses Pain #3 · Developer-facing · Post-MVP v1
P3
MEDIUM
Just-in-Time (JIT) Temporary Elevation Workflow
Replace standing privileged roles with on-demand temporary access: developer requests elevated permissions for N hours, approved by a security engineer, auto-expires. Prevents permanent privilege accumulation at its source. Requires deeper integration with identity providers (Okta, Azure AD) and is architecturally complex — ideal for a v2 milestone.
Zero Trust aligned · v2 roadmap · PAM maturity

Original Design Decisions

Confidence Signal over raw confidence scores: Rather than showing a numeric ML confidence score (opaque, distrusted), the diff view shows a plain-English explanation of the evidence: "None of the removed permissions were used in 90 days. No scheduled jobs detected." This is more actionable and builds trust faster.

Three deployment paths, not one: Most tools offer "deploy" or "export." Our tool offers sandbox → IaC PR → direct deploy as distinct options, meeting teams where they are in GitOps maturity without forcing a single workflow.

Success Metrics (KPIs)

KPI | Definition | Target | Business Impact
Least Privilege Adherence Score | % of identities operating at or below minimum required access (granted vs used comparison) | > 80% | Quantifies Zero Trust enforcement; reduces theoretical blast radius of credential compromise
Orphaned Account Ratio | % of active identities with zero telemetry activity over 90-day tracking period | < 2% | Eliminates dormant backdoors; directly supports SOC2 and GDPR compliance audits
Authorization Failure Rate | % of API requests resulting in AccessDenied (total failed ÷ total requests) | < 0.5% | Confirms policies are accurate; a high rate indicates over-restriction breaking production workflows
Remediation Adoption Rate | % of generated policy recommendations that are deployed (sandbox, IaC, or direct) within 30 days | > 60% | Measures tool trust and UX effectiveness; a low rate indicates engineers fear the suggestions
Lateral Movement Risk Score | Composite: count of highly-privileged accounts × rate of standing privileges × presence of static credentials | Continuous ↓ | Quantifies attacker's escalation potential following an initial breach
Policy Review Cycle Time | Average hours from developer policy request to approved deployment (measures bottleneck reduction) | < 4 hrs | Directly measures developer friction reduction; baseline of 30–90 hrs manual review
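The three percentage-based KPIs above reduce to simple ratios. The sketch below is illustrative: the function and key names are hypothetical, and only the orphaned-account figures (127 of 3,214) come from the Screen 1 wireframe.

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 0.0 if whole == 0 else 100.0 * part / whole


# Targets from the KPI table: (comparison, threshold in percent).
TARGETS = {
    "least_privilege_adherence": (">", 80.0),
    "orphaned_account_ratio": ("<", 2.0),
    "authorization_failure_rate": ("<", 0.5),
}


def meets_target(name: str, value: float) -> bool:
    """Check a KPI value against its target from the table."""
    comparison, threshold = TARGETS[name]
    return value > threshold if comparison == ">" else value < threshold


# Orphaned accounts as shown on the dashboard wireframe.
orphaned = percentage(127, 3214)
print(round(orphaned, 2), meets_target("orphaned_account_ratio", orphaned))
```

With the wireframe's sample data the orphaned ratio sits at roughly double the 2% target, which is consistent with the dashboard presenting it as a finding.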

MVP Blueprint — Dev Handoff

Five concrete engineering tasks scoped for a Minimum Viable Product. Backend tasks are labelled B1–B3 and frontend tasks F1–F2. Each item includes discussion points for the engineering team.

B1
Scalable Log Ingestion Pipeline
BACKEND · SPRINT 1 · Node.js / Java · SQS · Lambda
Build a resilient data pipeline to ingest CloudTrail logs (AWS) and Activity Logs (Azure) at scale. Architecture: stream logs into Amazon SQS → serverless function polls queue, parses nested JSON, handles network timeouts, deserialises events into queryable data objects, filters noise (read-only describe/list events). Must handle CloudTrail's payload truncation limit (102,401–131,072 character requests).
  • Configure CloudTrail S3 delivery + SQS integration with dead-letter queues for parsing failures
  • Handle Azure Activity Log schema inconsistencies — some resource providers omit evidence blocks entirely
  • Normalise API version strings: ListDistributions2017_03_25 → cloudfront:ListDistributions using regex classifiers
  • Discussion: Should data events (S3 GetObject, Lambda invoke) be opt-in or always-on? Cost and noise tradeoff.
  • Discussion: How do we handle sts:GetCallerIdentity (always allowed, skews usage analysis)?
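The parsing and noise-filtering step can be sketched as a pure function that a queue consumer would call once per delivered log file. It is written in Python for consistency with the other sketches here, although this task lists Node.js / Java. It assumes the standard CloudTrail log file layout (a top-level "Records" array) and the record fields eventSource, eventName, eventTime, readOnly, errorCode and userIdentity.arn. The function name and output keys are hypothetical.

```python
import json
from typing import Iterator


def iter_events(log_file_text: str, drop_read_only: bool = True) -> Iterator[dict]:
    """Yield compact event dicts from one CloudTrail log file.

    Malformed JSON raises json.JSONDecodeError; the caller is expected to
    route such messages to the dead-letter queue.
    """
    records = json.loads(log_file_text).get("Records", [])
    for record in records:
        if drop_read_only and record.get("readOnly") is True:
            continue  # treated as noise, per the pipeline description
        yield {
            "principal": record.get("userIdentity", {}).get("arn"),
            "event_source": record["eventSource"],
            "event_name": record["eventName"],
            "event_time": record["eventTime"],
            "denied": record.get("errorCode") == "AccessDenied",
        }
```

The drop_read_only flag is kept switchable on purpose: dropping read-only events reduces volume, but it also hides evidence that read permissions are in use, which matters for the gap analysis downstream.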
B2
Gap Analysis & Heuristic Mapping Engine
BACKEND · SPRINT 1–2 · Python · botocore mapping · GraphDB
Core algorithmic engine that maps deserialized CloudTrail events to IAM privilege names using a maintained translation dictionary (SDK endpoints ↔ IAM prefixes). Executes temporal gap analysis: compare 90-day usage array against policies currently attached to the evaluated role. Output: explicit list of unused actions per identity, plus a permissions graph for inheritance chain traversal.
  • Build and maintain IAM ↔ API mapping dictionary (critical: ses → email, cloudwatch → monitoring, etc.)
  • Handle IAM services mapping to multiple API endpoints (e.g. lex → models.lex + runtime.lex)
  • Store identity inheritance graphs in a graph database (Neptune / Neo4j) for efficient traversal queries
  • Discussion: How do we handle Azure RBAC's authorization.evidence.roleAssignmentId when multiple concurrent role paths exist?
  • Discussion: Configurable tracking window (30d / 90d / 180d) — how do we surface "last used" for infrequent batch jobs?
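At its core the gap analysis is a set difference between granted and used actions. The sketch below assumes a catalogue of known IAM actions is available (hypothetical here, and tiny) so that wildcard grants can be expanded before comparison. It uses fnmatch for pattern matching, which is slightly more permissive than IAM's own wildcard rules, and compares case-insensitively.

```python
from fnmatch import fnmatchcase


def expand_granted(patterns, catalogue):
    """Expand granted action patterns (e.g. "s3:*") against known actions."""
    lowered = [p.lower() for p in patterns]
    return {a for a in catalogue if any(fnmatchcase(a.lower(), p) for p in lowered)}


def unused_actions(granted_patterns, used_actions, catalogue):
    """Return granted actions with no recorded use in the tracking window."""
    granted = expand_granted(granted_patterns, catalogue)
    used = {a.lower() for a in used_actions}
    return sorted(a for a in granted if a.lower() not in used)


# Hypothetical catalogue; a real one would cover every IAM action.
CATALOGUE = [
    "s3:GetObject", "s3:PutObject", "s3:DeleteBucket",
    "dynamodb:GetItem", "iam:PassRole",
]
USED = ["s3:GetObject", "s3:PutObject", "dynamodb:GetItem"]

print(unused_actions(["*"], USED, CATALOGUE))
```

This covers Allow statements on a single policy only; explicit Deny statements, permission boundaries and SCPs would have to be evaluated along the inheritance graph before the result reflects effective permissions.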
B3
Automated Least-Privilege Policy Generation Service
BACKEND · SPRINT 2 · JSON synthesis · IAM API · Terraform
Synthesises gap analysis output into a new least-privilege JSON policy document. Generation logic must safely scope down wildcards (e.g. s3:* → specific used actions), preserve necessary resource constraints, tags, and conditional context keys from the original policy. Must produce valid IAM JSON that passes AWS policy simulator validation before being surfaced to the UI.
  • Wildcard expansion: when a role has dynamodb:* but only calls 3 actions, emit only those 3 actions
  • Resource scoping: where CloudTrail captures the specific resource ARN, include it in the generated policy
  • Preserve Condition blocks from original policy (IP restrictions, MFA requirements, time windows)
  • Generate Terraform / CDK equivalents alongside JSON for teams using IaC
  • Discussion: Should we always require a human approval step, or allow auto-remediation for zero-usage orphaned accounts?
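A rough sketch of the synthesis step, assuming the gap analysis yields (action, resource ARN) pairs for observed usage. It emits one statement per service so that actions are never paired with another service's resources, falls back to "*" where no resource was captured, and copies a single Condition block onto every statement, which is a simplification. The function name and the sample ARNs in the usage example are hypothetical.

```python
from collections import defaultdict


def build_policy(observed, condition=None):
    """Build a least-privilege policy document from observed usage pairs."""
    grouped = defaultdict(lambda: {"actions": set(), "resources": set()})
    for action, resource in observed:
        service = action.split(":", 1)[0]
        grouped[service]["actions"].add(action)
        grouped[service]["resources"].add(resource or "*")

    statements = []
    for service in sorted(grouped):
        resources = grouped[service]["resources"]
        statement = {
            "Effect": "Allow",
            "Action": sorted(grouped[service]["actions"]),
            # If any call lacked a resource ARN, the whole statement stays broad.
            "Resource": ["*"] if "*" in resources else sorted(resources),
        }
        if condition:
            statement["Condition"] = condition  # preserved from the original policy
        statements.append(statement)
    return {"Version": "2012-10-17", "Statement": statements}
```

For example, build_policy([("s3:GetObject", "arn:aws:s3:::etl-raw/a.csv"), ("dynamodb:GetItem", None)]) produces two statements, with the DynamoDB one left at Resource "*" for a human to tighten. The output would still need to pass policy validation before it reaches the UI.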
F1
CIEM Dashboard + Metrics Visualisation
FRONTEND · SPRINT 2 · React / Next.js · D3.js · WebSocket
Build the primary dashboard (Screen 1 of wireframes) rendering the Permissions Creep Index, Least Privilege Adherence Score, orphaned account count, and sortable identity table with usage bars. Must handle real-time scan results via WebSocket or SSE. Graph visualisation library (D3 or Recharts) to render the PCI donut and remediation tradeoff slider. Dark-mode native, responsive down to 1280px.
  • PCI donut chart with animated transition on score change — provides immediate visual feedback on remediation impact
  • Identity table: virtual scrolling for 10,000+ identities; filter by environment (AWS/Azure), type (user/role/svc), risk level
  • Remediation tradeoff slider: debounced API call returns projected PCI reduction for N remediated identities
  • Discussion: Should the scan be on-demand (button) or continuous background polling? Implications for API rate limits and cost.
F2
Side-by-Side JSON Diff Viewer + Deployment UI
FRONTEND · SPRINT 3 · React · Monaco Editor · GitHub API
The deeply nested JSON diff viewer (Screen 3 of wireframes). Dual-pane layout with synchronized scrolling, syntax highlighting, and colour-coded diff indicators (red removals, green additions) that bubble up to parent nodes. JSON normalisation to prevent false positives from key ordering differences. Deployment affordances: sandbox test, GitHub PR creation (requires OAuth), and direct AWS/Azure API deployment with rollback button.
  • Use Monaco Editor (VS Code's engine) or react-diff-viewer as base — do NOT build a diff engine from scratch
  • JSON normalise before diff: sort keys alphabetically, strip whitespace variation — eliminates false positives
  • Collapsible node trees (both panels sync collapse state) — critical for large policies with 50+ statement blocks
  • GitHub PR creation: pre-fills PR title, description, and reviewer assignment from the security team roster
  • Confidence signal banner: surface "last used" timestamp, detected cronjob patterns, quarterly batch job heuristics
  • Discussion: Rollback mechanism — should we snapshot the previous policy version before applying changes?
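The normalise-before-diff step can be illustrated independently of the frontend stack. The sketch below uses Python's standard library only, as a stand-in for whatever the diff component does in the browser: it re-serialises both policies with sorted keys and fixed indentation, then diffs the resulting lines. Function names are illustrative.

```python
import difflib
import json


def normalise(policy_json: str) -> list:
    """Re-serialise JSON with sorted keys and fixed indentation."""
    return json.dumps(json.loads(policy_json), sort_keys=True, indent=2).splitlines()


def policy_diff(current: str, proposed: str) -> list:
    """Return unified-diff lines between two normalised policy documents."""
    return list(
        difflib.unified_diff(
            normalise(current), normalise(proposed), "current", "proposed", lineterm=""
        )
    )


same_a = '{"Effect": "Allow", "Action": "*"}'
same_b = '{ "Action":"*",\n  "Effect":"Allow" }'
print(policy_diff(same_a, same_b))  # key order and whitespace differences vanish
```

Note that sorting keys does not reorder array elements, so two Action lists with the same members in a different order would still show as changed unless the lists are sorted as well.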

Technical Risk Register

High: CloudTrail data event cost
Enabling S3 GetObject logging on large buckets can cost hundreds of dollars/month. Mitigation: offer opt-in per bucket with cost estimator shown before enabling.
High: IAM ↔ API mapping drift
AWS adds ~200 new IAM actions per year. Mapping dictionary needs automated testing against live AWS documentation on every release.
Medium: Azure RBAC opacity
Azure's authorization.evidence field is inconsistently emitted across resource providers. Need provider-specific parsing fallbacks.
Medium: False positive suppression
Actions that run quarterly (batch jobs, compliance reports) will appear as "unused" in a 90-day window. Need job schedule inference to prevent false positive remediation suggestions.
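One possible heuristic for the false-positive risk above, sketched under stated assumptions: given an action's invocation history from a longer lookback than the tracking window, estimate its typical interval as the median gap between calls, and flag the action when that interval is at least as long as the window. The function names, the median choice, and the conservative handling of sparse history are all assumptions, not part of the design.

```python
from datetime import datetime
from statistics import median


def typical_interval_days(invocations):
    """Median gap in days between consecutive invocations, or None if unknown."""
    times = sorted(invocations)
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
    return median(gaps)


def may_be_false_positive(invocations, window_days=90):
    """True when silence inside the window is weak evidence of non-use."""
    interval = typical_interval_days(invocations)
    if interval is None:
        return True  # too little history to rule out an infrequent job
    return interval >= window_days


quarterly = [datetime(2023, m, 1) for m in (1, 4, 7, 10)]
print(may_be_false_positive(quarterly))
```

A flagged action would feed the confidence banner as a "review before deploying" warning rather than block the recommendation outright.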