AWS Governance at Scale — Guardrails Without the Red Tape

Governance is the word that makes engineers groan and makes compliance officers smile. I've been on both sides. I've managed environments with over 100 AWS accounts, and I can tell you: the organizations that get governance right move faster, not slower.

The ones that get it wrong end up in one of two places:

Anarchy — Every team has their own account with no guardrails. Security incidents, runaway costs, and shadow IT everywhere.
Bureaucracy — A centralized cloud team that gates every change. Deployment tickets that take two weeks. Engineers who hate the cloud team.

The sweet spot is automated guardrails: policies that prevent the truly dangerous stuff while giving teams maximum autonomy for everything else.

Here's how I build that.

The Multi-Account Strategy: Foundation of Everything

If you're running more than one production workload on AWS, you need multiple accounts. This is not optional. It's the single most important governance decision you'll make.

Why Multiple Accounts?

Blast radius containment — A compromised account can't reach other accounts unless you have explicitly granted cross-account access
Billing isolation — Clear cost attribution per team/project/environment
Service limit isolation — One team's Lambda concurrency spike doesn't affect another
IAM boundary — Account boundaries are the strongest IAM boundary in AWS
Compliance scope — Reduce the scope of audits (PCI, HIPAA, SOC2) to specific accounts

My Recommended OU Structure

After building landing zones for organizations from 10 to 500+ accounts, this structure works for most:

[Figure: Multi-Account OU Structure]

Key Design Decisions

One account per workload per environment. Team A gets team-a-dev, team-a-staging, team-a-prod. This gives you clean blast radius isolation and cost attribution.
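The naming convention is simple enough to generate mechanically, which helps keep account names consistent once there are dozens of teams. A minimal sketch, assuming a fixed environment list; the function name and the team names are illustrative and not part of any AWS API:

```python
from itertools import product

ENVIRONMENTS = ["dev", "staging", "prod"]  # assumed environment set


def account_names(teams, environments=ENVIRONMENTS):
    """Return one account name per (team, environment) pair."""
    return [f"{team}-{env}" for team, env in product(teams, environments)]


print(account_names(["team-a", "team-b"]))
```

Feeding this list into an account vending pipeline means nobody has to type account names by hand.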

Dedicated security accounts. Never put security tooling in the management account. The management account should be locked down with near-zero human access — it's used for Organizations management and billing only.

Sandbox accounts with guardrails. Give developers sandbox accounts with a monthly budget cap ($50-$200) and SCPs that prevent creating expensive resources (no p5.48xlarge instances!). This enables experimentation without risk.

The Suspended OU. When accounts are decommissioned, move them here with an SCP that denies all actions ("Effect": "Deny", "Action": "*", "Resource": "*"). Don't delete accounts — you may need them for audit trails.

💡 Pro Tip: The Policy Staging OU is your safety net for testing SCPs. Before applying a new SCP to your Production OU, apply it to Policy Staging first and test with a sacrificial account. I've seen a miswritten SCP lock an entire organization out of their production accounts. Test first. Always.
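Before a policy even reaches the staging OU, a cheap local check can catch the most obvious mistakes. The sketch below is an assumption-laden lint, not an AWS tool: `lint_scp` is a hypothetical helper, and the 5,120-character limit is the SCP size quota as I understand it at the time of writing, so verify it for your organization.

```python
import json

MAX_SCP_CHARS = 5120  # assumed SCP size quota; confirm against current AWS quotas


def lint_scp(policy_text):
    """Return a list of problems found in an SCP document (empty list = none found)."""
    problems = []
    if len(policy_text) > MAX_SCP_CHARS:
        problems.append(f"policy is {len(policy_text)} chars, over {MAX_SCP_CHARS}")
    try:
        policy = json.loads(policy_text)
    except json.JSONDecodeError as exc:
        return problems + [f"invalid JSON: {exc}"]
    if not isinstance(policy, dict):
        return problems + ["top level must be a JSON object"]
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for i, stmt in enumerate(statements):
        sid = stmt.get("Sid", f"statement {i}")
        if stmt.get("Effect") not in ("Allow", "Deny"):
            problems.append(f"{sid}: Effect must be Allow or Deny")
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if stmt.get("Effect") == "Deny" and "*" in actions and "Condition" not in stmt:
            problems.append(f"{sid}: unconditional deny-all locks out every principal")
    return problems
```

The deny-all warning is deliberate: that policy is exactly what the Suspended OU wants, so the lint flags it for a human to confirm rather than rejecting it outright.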

Service Control Policies: The Art of Saying No

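SCPs are the most powerful governance tool in AWS. They set the maximum permissions available to every principal in a member account. They grant nothing on their own, so a principal still needs an IAM policy that allows the action. Even a member account's root user can't override an SCP. The exceptions are the management account and service-linked roles, which SCPs don't affect.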
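To make the "maximum permissions" idea concrete, here is a deliberately simplified model of the evaluation. It ignores resources, conditions, permission boundaries, and resource-based policies, and `is_permitted` is a hypothetical helper rather than anything AWS ships:

```python
from fnmatch import fnmatchcase


def _matches(action, patterns):
    # IAM action names are case-insensitive, so compare in lowercase.
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_permitted(action, identity_allows, scp_allows, scp_denies):
    """Simplified model: needs an identity Allow AND an SCP Allow, and no SCP Deny."""
    if _matches(action, scp_denies):
        return False
    return _matches(action, identity_allows) and _matches(action, scp_allows)


admin = ["*"]  # an identity policy that allows everything
print(is_permitted("organizations:LeaveOrganization", admin, ["*"],
                   ["organizations:LeaveOrganization"]))  # False
print(is_permitted("s3:GetObject", admin, ["*"],
                   ["organizations:LeaveOrganization"]))  # True
```

The first call shows an administrator blocked by the SCP deny; the second shows that everything else passes through untouched, which is the deny-list model in miniature.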

SCP Strategy: Deny-List vs. Allow-List

I strongly prefer the deny-list approach for most organizations:

Allow-list: Start by denying everything, then explicitly allow what's needed. Extremely secure, but operationally painful. Every new service adoption requires an SCP update.
Deny-list: Allow everything by default, then deny specific dangerous actions. More practical, easier to maintain, and doesn't block teams from using new AWS services.

For highly regulated environments (government, healthcare), an allow-list may be required. For everyone else, deny-list gives you 95% of the security with 20% of the operational overhead.

My Essential SCP Library

Here are the SCPs I deploy in every organization:

1. Prevent Leaving the Organization

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PreventLeavingOrganization",
      "Effect": "Deny",
      "Action": ["organizations:LeaveOrganization"],
      "Resource": "*"
    }
  ]
}

2. Enforce Region Restrictions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "a4b:*",
        "acm:*",
        "aws-marketplace-management:*",
        "aws-marketplace:*",
        "budgets:*",
        "ce:*",
        "chime:*",
        "cloudfront:*",
        "config:*",
        "cur:*",
        "directconnect:*",
        "ec2:DescribeRegions",
        "ec2:DescribeTransitGateways",
        "ec2:DescribeVpnGateways",
        "fms:*",
        "globalaccelerator:*",
        "health:*",
        "iam:*",
        "importexport:*",
        "kms:*",
        "mobileanalytics:*",
        "networkmanager:*",
        "organizations:*",
        "pricing:*",
        "route53:*",
        "route53domains:*",
        "route53-recovery-cluster:*",
        "route53-recovery-control-config:*",
        "route53-recovery-readiness:*",
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets",
        "shield:*",
        "sts:*",
        "support:*",
        "trustedadvisor:*",
        "waf-regional:*",
        "waf:*",
        "wafv2:*",
        "wellarchitected:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "eu-west-1", "eu-central-1"]
        }
      }
    }
  ]
}

💡 Pro Tip: The NotAction list in the region restriction SCP is critical. Many AWS global services (IAM, CloudFront, Route 53, Organizations) are served from a single Region, typically us-east-1, no matter where your workloads run. If you don't exclude them, any organization that leaves us-east-1 off its approved list will break basic AWS functionality. I maintain an updated list and review it quarterly as AWS adds new global services.
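The interaction between NotAction and the region condition is easy to get backwards, so here is a small model of the decision. It is a sketch under simplifying assumptions: `EXEMPT_ACTIONS` is an abbreviated copy of the NotAction list, and `region_scp_denies` is a hypothetical helper that ignores every other policy in play.

```python
from fnmatch import fnmatchcase

APPROVED_REGIONS = {"us-east-1", "eu-west-1", "eu-central-1"}
# Abbreviated copy of the NotAction list, for illustration only.
EXEMPT_ACTIONS = ["iam:*", "cloudfront:*", "route53:*", "organizations:*",
                  "sts:*", "ec2:DescribeRegions"]


def region_scp_denies(action, requested_region):
    """True if the region SCP would deny this call (simplified model)."""
    exempt = any(fnmatchcase(action.lower(), p.lower()) for p in EXEMPT_ACTIONS)
    return not exempt and requested_region not in APPROVED_REGIONS


print(region_scp_denies("ec2:RunInstances", "ap-southeast-2"))  # True
print(region_scp_denies("iam:CreateRole", "ap-southeast-2"))    # False
```

An action is denied only when it is both outside the exemption list and outside the approved regions; exempt actions are never touched by this statement.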

3. Protect Security Infrastructure

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProtectCloudTrail",
      "Effect": "Deny",
      "Action": [
        "cloudtrail:DeleteTrail",
        "cloudtrail:StopLogging",
        "cloudtrail:UpdateTrail",
        "cloudtrail:PutEventSelectors"
      ],
      "Resource": "arn:aws:cloudtrail:*:*:trail/organization-trail",
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalARN": ["arn:aws:iam::*:role/OrganizationSecurityRole"]
        }
      }
    },
    {
      "Sid": "ProtectConfigRules",
      "Effect": "Deny",
      "Action": [
        "config:DeleteConfigRule",
        "config:DeleteConfigurationRecorder",
        "config:DeleteDeliveryChannel",
        "config:StopConfigurationRecorder"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalARN": ["arn:aws:iam::*:role/OrganizationSecurityRole"]
        }
      }
    },
    {
      "Sid": "ProtectGuardDuty",
      "Effect": "Deny",
      "Action": [
        "guardduty:DeleteDetector",
        "guardduty:DisassociateFromMasterAccount",
        "guardduty:DisassociateFromAdministratorAccount",
        "guardduty:DeleteMembers",
        "guardduty:DisassociateMembers"
      ],
      "Resource": "*"
    }
  ]
}

4. Sandbox Cost Controls

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": [
            "p5.*",
            "p4d.*",
            "p3.*",
            "x2idn.*",
            "x2iedn.*",
            "u-*",
            "*.metal",
            "*.24xlarge",
            "*.48xlarge"
          ]
        }
      }
    },
    {
      "Sid": "DenyExpensiveServices",
      "Effect": "Deny",
      "Action": [
        "redshift:CreateCluster",
        "rds:CreateDBCluster",
        "elasticache:CreateCacheCluster",
        "es:CreateDomain",
        "kafka:CreateCluster"
      ],
      "Resource": "*"
    }
  ]
}
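Wildcard lists like the one in DenyExpensiveInstances are worth testing against real instance type names before they ship. A minimal sketch, assuming that Python's `fnmatchcase` is a close enough stand-in for IAM's StringLike matching for these particular patterns; `blocked_by_sandbox_scp` is a hypothetical helper:

```python
from fnmatch import fnmatchcase

# Same patterns as the DenyExpensiveInstances statement.
DENIED_PATTERNS = ["p5.*", "p4d.*", "p3.*", "x2idn.*", "x2iedn.*",
                   "u-*", "*.metal", "*.24xlarge", "*.48xlarge"]


def blocked_by_sandbox_scp(instance_type):
    """True if the instance type matches any denied pattern."""
    return any(fnmatchcase(instance_type, p) for p in DENIED_PATTERNS)


for candidate in ["t3.micro", "p5.48xlarge", "c5.metal", "x1e.32xlarge", "g5.12xlarge"]:
    print(candidate, blocked_by_sandbox_scp(candidate))
```

Running it shows that `x1e.32xlarge` and `g5.12xlarge` slip through, so the pattern list is a starting point to extend rather than a complete cost fence.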

AWS Control Tower: Worth It or Not?

I get asked this constantly. Here's my honest assessment:

Use Control Tower When:

You're setting up a new AWS organization from scratch
You want AWS-managed guardrails without writing SCPs by hand
You need a landing zone quickly (days, not weeks)
Your team doesn't have deep Organizations/SCP expertise
You want the Account Factory for standardized account provisioning

Skip Control Tower When:

You have an existing complex organization with custom OU structures
You need SCPs beyond what Control Tower's controls catalog offers
You want full control over every aspect of your landing zone
You've already built a custom landing zone with IaC (Terraform/CDK)

Control Tower's controls catalog now includes 750+ managed controls across security, cost, durability, and operations — a massive improvement from its early days. The proactive controls that use CloudFormation Hooks to prevent non-compliant resource deployment are particularly powerful.

💡 Pro Tip: Even if you don't use Control Tower, steal its ideas. The concept of mandatory guardrails (always enforced), strongly recommended guardrails (should enforce), and elective guardrails (optional) is an excellent framework for organizing your SCP strategy.

Account Factory with Terraform (AFT)

If you want Control Tower's account provisioning but prefer Terraform over the native Account Factory:

# account-request.tf
module "account_request" {
  # AFT's account request repository ships this module locally; adjust the path to your repo layout
  source = "./modules/aft-account-request"

  control_tower_parameters = {
    AccountEmail              = "[email protected]"
    AccountName               = "team-alpha-prod"
    ManagedOrganizationalUnit = "Workloads/Production"
    SSOUserEmail              = "[email protected]"
    SSOUserFirstName          = "Cloud"
    SSOUserLastName           = "Admin"
  }

  account_tags = {
    Team               = "alpha"
    Environment        = "production"
    CostCenter         = "CC-1234"
    DataClassification = "confidential"
  }

  account_customizations_name = "production-baseline"
}

IAM Identity Center: Centralized Access Done Right

If you're still managing IAM users in individual accounts, please stop. IAM Identity Center (formerly AWS SSO) gives you:

Single sign-on across all AWS accounts from one portal
Integration with external IdPs (Okta, Microsoft Entra ID (formerly Azure AD), Google Workspace)
Permission sets — reusable role definitions applied across accounts
Temporary credentials — No long-lived access keys

Permission Set Strategy

I design permission sets around job functions, not AWS services:

Permission Sets:
├── AdministratorAccess     → Platform team only, break-glass
├── PowerDeveloper          → Full access minus IAM, Organizations, billing
├── Developer               → Compute, storage, serverless — no VPC/networking
├── DataEngineer            → S3, Glue, Athena, EMR, Redshift, SageMaker
├── SecurityAuditor         → ReadOnly + security services (GuardDuty, Inspector)
├── BillingViewer           → Cost Explorer, Budgets — read-only
├── NetworkAdmin            → VPC, TGW, Route 53, CloudFront
└── ReadOnly                → ViewOnlyAccess — for auditors, stakeholders

[Figure: IAM Identity Center Flow]

💡 Pro Tip: For production accounts, I use a different permission set with stricter boundaries. Developers get PowerDeveloper in dev/staging but only Developer (no VPC/IAM changes) in production. Deployments go through CI/CD pipelines, not human hands.
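The environment-dependent assignment can be captured as a lookup table that your provisioning code reads. This is only a sketch: the job function keys, the table contents beyond the developer rows, and the ReadOnly fallback are my assumptions, not something IAM Identity Center defines.

```python
# Hypothetical (job function, environment) -> permission set assignments.
ASSIGNMENTS = {
    ("developer", "dev"): "PowerDeveloper",
    ("developer", "staging"): "PowerDeveloper",
    ("developer", "prod"): "Developer",
    ("platform", "prod"): "AdministratorAccess",
    ("auditor", "prod"): "SecurityAuditor",
}

FALLBACK = "ReadOnly"  # assumption: unknown combinations get the least access


def permission_set_for(job_function, environment):
    """Look up the permission set for a job function in a given environment."""
    return ASSIGNMENTS.get((job_function, environment), FALLBACK)


print(permission_set_for("developer", "staging"))  # PowerDeveloper
print(permission_set_for("developer", "prod"))     # Developer
```

Keeping the table in version control gives you a reviewable record of who gets what, and defaulting to the narrowest set means a missing row fails safe.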

Security Hub: Your Single Pane of Glass

Security Hub aggregates findings from GuardDuty, Inspector, Config, Firewall Manager, IAM Access Analyzer, and third-party tools into a single dashboard with compliance scoring.

Setting Up Cross-Account Security Hub

[Figure: Security Hub Aggregation]

Automated Remediation

Don't just detect — remediate automatically for well-understood violations:

# Lambda function for auto-remediation of public S3 buckets
import boto3

# Create clients once per execution environment, not on every invocation
s3 = boto3.client('s3')
securityhub = boto3.client('securityhub')


def handler(event, context):
    """Triggered by Security Hub findings via EventBridge"""

    # One event can carry several findings, so handle each of them
    for finding in event['detail']['findings']:

        # ASFF stores finding types as a list under 'Types', not a 'Type' string
        types = finding.get('Types', [])
        if not any(t.startswith('Software and Configuration Checks') for t in types):
            continue

        # Match control S3.2 (public read) exactly; a substring test would also match S3.20
        if not finding.get('GeneratorId', '').endswith('/S3.2'):
            continue

        # The resource Id is the bucket ARN: arn:aws:s3:::bucket-name
        bucket_name = finding['Resources'][0]['Id'].split(':')[-1]

        # Block public access. This call only reaches buckets in the Lambda's
        # own account; buckets in other accounts need an assumed role first.
        s3.put_public_access_block(
            Bucket=bucket_name,
            PublicAccessBlockConfiguration={
                'BlockPublicAcls': True,
                'IgnorePublicAcls': True,
                'BlockPublicPolicy': True,
                'RestrictPublicBuckets': True
            }
        )

        # Update finding status
        securityhub.batch_update_findings(
            FindingIdentifiers=[{
                'Id': finding['Id'],
                'ProductArn': finding['ProductArn']
            }],
            Note={
                'Text': 'Auto-remediated: Public access blocked',
                'UpdatedBy': 'auto-remediation-lambda'
            },
            Workflow={'Status': 'RESOLVED'}
        )

        print(f"Remediated public access on bucket: {bucket_name}")
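The parsing half of a remediation function is worth pulling into pure functions so it can be unit-tested without AWS credentials. A minimal sketch, assuming the finding follows the AWS Security Finding Format; both helper names and the sample finding are hypothetical:

```python
def is_s3_public_read_finding(finding):
    """True for control S3.2, identified by SecurityControlId or GeneratorId."""
    control_id = finding.get("Compliance", {}).get("SecurityControlId", "")
    generator_id = finding.get("GeneratorId", "")
    return control_id == "S3.2" or generator_id.endswith("/S3.2")


def bucket_name_from_finding(finding):
    """Return the bucket name from the first AwsS3Bucket resource, or None."""
    for resource in finding.get("Resources", []):
        if resource.get("Type") == "AwsS3Bucket":
            return resource["Id"].split(":::", 1)[-1]
    return None


# Hand-written sample finding, trimmed to the fields these helpers read.
sample = {
    "GeneratorId": "aws-foundational-security-best-practices/v/1.0.0/S3.2",
    "Resources": [{"Type": "AwsS3Bucket", "Id": "arn:aws:s3:::example-bucket"}],
}
print(is_s3_public_read_finding(sample), bucket_name_from_finding(sample))
```

With the parsing isolated, the handler itself shrinks to the two AWS calls, which are the only parts that need integration testing.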

Standards I Enable by Default

AWS Foundational Security Best Practices (FSBP) — Comprehensive, well-maintained
CIS AWS Foundations Benchmark — Industry standard for compliance
PCI DSS (if applicable) — For payment card data environments

💡 Pro Tip: Don't enable every Security Hub standard on day one. Start with FSBP, fix all critical/high findings, then add CIS. Enabling everything at once creates alert fatigue — I've seen teams with 15,000+ findings who just stop looking at the dashboard entirely.
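One way to keep a large backlog workable is to summarize by severity and surface only the critical and high findings first. A small sketch, assuming findings carry the ASFF `Severity.Label` field; the helper names and sample data are illustrative:

```python
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW", "INFORMATIONAL"]


def severity_counts(findings):
    """Count findings per severity label."""
    return Counter(f["Severity"]["Label"] for f in findings)


def fix_first(findings):
    """Return only CRITICAL and HIGH findings, most severe first."""
    urgent = [f for f in findings if f["Severity"]["Label"] in ("CRITICAL", "HIGH")]
    return sorted(urgent, key=lambda f: SEVERITY_ORDER.index(f["Severity"]["Label"]))


findings = [
    {"Id": "a", "Severity": {"Label": "LOW"}},
    {"Id": "b", "Severity": {"Label": "HIGH"}},
    {"Id": "c", "Severity": {"Label": "CRITICAL"}},
]
print([f["Id"] for f in fix_first(findings)])  # ['c', 'b']
```

A short, ordered worklist like this is far more likely to get acted on than a dashboard with thousands of undifferentiated rows.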

Cost Governance: The CFO's Best Friend

The Cost Governance Stack

[Figure: Cost Governance Stack]

Budget Strategy

I set up budgets at multiple levels:

Organization Level:
  └── Total monthly budget: $500,000 (alert at 80%, 90%, 100%)

OU Level:
  ├── Production OU: $350,000 (alert at 85%, 95%)
  ├── Development OU: $100,000 (alert at 90%)
  └── Sandbox OU: $10,000 (alert at 80%, auto-action at 100%)

Account Level:
  ├── team-alpha-prod: $50,000 (alert at 85%)
  └── sandbox-dev-jane: $200 (auto-deny at $200)

For sandbox accounts, I use Budget Actions to automatically apply an SCP that denies ec2:RunInstances, rds:CreateDBInstance, and other resource-creation actions when the budget threshold is hit.
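The alert thresholds reduce to simple arithmetic, shown here so the numbers in the budget tree are easy to sanity-check. This is an illustrative calculation only; `triggered_alerts` is a hypothetical helper and does not call AWS Budgets.

```python
def triggered_alerts(actual_spend, budget_limit, thresholds_pct):
    """Return the alert thresholds (in percent) that current spend has reached."""
    # Cross-multiply instead of dividing to avoid floating-point rounding.
    return [t for t in sorted(thresholds_pct)
            if actual_spend * 100 >= t * budget_limit]


# Organization budget: $500,000 with alerts at 80%, 90%, and 100%.
print(triggered_alerts(450_000, 500_000, [80, 90, 100]))  # [80, 90]
# Sandbox account: $200 cap with an auto-action at 100%.
print(triggered_alerts(200, 200, [100]))                  # [100]
```

At $450,000 of spend the organization budget has crossed its 80% and 90% alerts but not the 100% one, while the sandbox account has hit the point where the Budget Action fires.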

Tagging Strategy: The Glue That Holds It Together

Tags are the foundation of cost attribution, automation, and compliance. Here's my mandatory tag schema:

Required Tags (enforced via SCP + Config Rules):
  - Environment: [production, staging, development, sandbox]
  - Team: [alpha, beta, platform, data, security]
  - CostCenter: CC-XXXX
  - Application: [app-name]
  - Owner: [email address]
  - DataClassification: [public, internal, confidential, restricted]

Optional but Recommended:
  - Terraform: [true/false]
  - Pipeline: [pipeline-name]
  - ExpirationDate: [YYYY-MM-DD] # For temporary resources
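A schema like this is easy to check in a CI step before anything is deployed. The sketch below validates a tag dictionary against the required tags; the regular expressions for CostCenter, Application, and Owner are my assumptions about the intended formats, and `tag_violations` is a hypothetical helper.

```python
import re

REQUIRED_VALUES = {
    "Environment": {"production", "staging", "development", "sandbox"},
    "Team": {"alpha", "beta", "platform", "data", "security"},
    "DataClassification": {"public", "internal", "confidential", "restricted"},
}
# Free-form required tags are checked by pattern; these regexes are assumptions.
REQUIRED_PATTERNS = {
    "CostCenter": re.compile(r"CC-\d{4}"),
    "Application": re.compile(r".+"),
    "Owner": re.compile(r"[^@\s]+@[^@\s]+"),
}


def tag_violations(tags):
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    for key, allowed in REQUIRED_VALUES.items():
        if key not in tags:
            problems.append(f"missing tag: {key}")
        elif tags[key] not in allowed:
            problems.append(f"invalid value for {key}: {tags[key]}")
    for key, pattern in REQUIRED_PATTERNS.items():
        if key not in tags:
            problems.append(f"missing tag: {key}")
        elif not pattern.fullmatch(tags[key]):
            problems.append(f"invalid value for {key}: {tags[key]}")
    return problems
```

Calling `tag_violations({"Environment": "prod"})` reports the invalid value plus every missing tag, which makes for a useful failure message in a pull request check.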

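Enforcing tags with an SCP. Keys listed under a single condition operator are ANDed, so each required tag needs its own statement; otherwise the deny fires only when all three tags are missing. The EC2 resource is scoped to instance/* because ec2:RunInstances is also authorized against subnets, images, and other resources that never carry request tags. s3:CreateBucket is left to the Config rule that follows, on the assumption that it can't be checked with aws:RequestTag at create time (verify this in the Service Authorization Reference):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "rds:CreateDBCluster", "lambda:CreateFunction", "ecs:CreateService"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:*", "arn:aws:lambda:*:*:*", "arn:aws:ecs:*:*:*"],
      "Condition": {"Null": {"aws:RequestTag/Environment": "true"}}
    },
    {
      "Sid": "RequireTeamTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "rds:CreateDBCluster", "lambda:CreateFunction", "ecs:CreateService"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:*", "arn:aws:lambda:*:*:*", "arn:aws:ecs:*:*:*"],
      "Condition": {"Null": {"aws:RequestTag/Team": "true"}}
    },
    {
      "Sid": "RequireCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "rds:CreateDBCluster", "lambda:CreateFunction", "ecs:CreateService"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:*", "arn:aws:lambda:*:*:*", "arn:aws:ecs:*:*:*"],
      "Condition": {"Null": {"aws:RequestTag/CostCenter": "true"}}
    }
  ]
}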

Pair this with an AWS Config rule that detects resources missing tags after creation (for resources that don't support tag-on-create):

# Config Rule: required-tags
ConfigRuleName: required-tags
Source:
  Owner: AWS
  SourceIdentifier: REQUIRED_TAGS
InputParameters:
  tag1Key: Environment
  tag2Key: Team
  tag3Key: CostCenter
  tag4Key: Owner
Scope:
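  ComplianceResourceTypes:
    - AWS::EC2::Instance
    - AWS::RDS::DBInstance
    - AWS::S3::Bucket
    # required-tags evaluates a fixed list of resource types; confirm in the
    # rule documentation that these two are covered before relying on them
    - AWS::Lambda::Function
    - AWS::ECS::Service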

💡 Pro Tip: Tag enforcement is one of the highest-ROI governance investments you can make. Without consistent tags, your cost reports are useless, your automation breaks, and your compliance team is flying blind. Fight this battle early. It only gets harder as your environment grows.

CloudTrail: Your Audit Trail

Organization-Level CloudTrail

Deploy a single organization trail that covers all accounts:

resource "aws_cloudtrail" "organization" {
  name                          = "organization-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  include_global_service_events = true
  is_multi_region_trail         = true
  is_organization_trail         = true
  enable_log_file_validation    = true

  cloud_watch_logs_group_arn    = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
  cloud_watch_logs_role_arn     = aws_iam_role.cloudtrail_cloudwatch.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3"]
    }

    data_resource {
      type   = "AWS::Lambda::Function"
      values = ["arn:aws:lambda"]
    }
  }

  insight_selector {
    insight_type = "ApiCallRateInsight"
  }

  insight_selector {
    insight_type = "ApiErrorRateInsight"
  }
}

Key CloudTrail best practices:

Send logs to a dedicated Log Archive account — Keeps the audit trail out of reach of the accounts it records
Enable log file validation — Detect if logs have been modified
Enable CloudTrail Insights — Detects unusual API activity (anomaly detection)
Protect with an SCP — Prevent accounts from deleting/modifying the organization trail
Set up Athena for querying — CloudTrail Lake is powerful but expensive at scale; Athena on S3 is more cost-effective for most organizations
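Log file validation rests on comparing each log file's hash with the value recorded in a signed digest file. The sketch below shows only that comparison step, assuming the digest entry exposes `hashAlgorithm` and `hashValue` fields and that the hash covers the uncompressed log content; full validation also verifies the digest signature and the chain between digests, which `aws cloudtrail validate-logs` handles for you.

```python
import hashlib


def log_file_matches_digest(log_bytes, digest_entry):
    """Return True if the SHA-256 of log_bytes equals the digest entry's hashValue."""
    if digest_entry.get("hashAlgorithm") != "SHA-256":
        raise ValueError("this sketch only handles SHA-256 digest entries")
    actual = hashlib.sha256(log_bytes).hexdigest()
    return actual == digest_entry["hashValue"].lower()


# Hypothetical digest entry built locally for demonstration.
original = b'{"Records": []}'
entry = {"hashAlgorithm": "SHA-256",
         "hashValue": hashlib.sha256(original).hexdigest()}
print(log_file_matches_digest(original, entry))         # True
print(log_file_matches_digest(original + b" ", entry))  # False
```

Even a single appended byte changes the hash, which is why a modified log file cannot pass validation without also forging the signed digest.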

Compliance Automation: Config Rules + Conformance Packs

AWS Config continuously monitors resource configurations and evaluates them against rules. At scale, use Conformance Packs — collections of Config Rules deployed as a single unit.

Organization-Wide Conformance Pack

# conformance-pack-security-baseline.yaml
Parameters:
  MaxAccessKeyAge:
    Type: String
    Default: '90'

Resources:
  IAMPasswordPolicy:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: iam-password-policy
      Source:
        Owner: AWS
        SourceIdentifier: IAM_PASSWORD_POLICY
      InputParameters:
        RequireUppercaseCharacters: 'true'
        RequireLowercaseCharacters: 'true'
        RequireSymbols: 'true'
        RequireNumbers: 'true'
        MinimumPasswordLength: '14'
        PasswordReusePrevention: '24'
        MaxPasswordAge: '90'

  AccessKeyRotation:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: access-key-rotated
      Source:
        Owner: AWS
        SourceIdentifier: ACCESS_KEYS_ROTATED
      InputParameters:
        maxAccessKeyAge: !Ref MaxAccessKeyAge

  S3BucketPublicReadProhibited:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: s3-bucket-public-read-prohibited
      Source:
        Owner: AWS
        SourceIdentifier: S3_BUCKET_PUBLIC_READ_PROHIBITED

  EncryptedVolumes:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: encrypted-volumes
      Source:
        Owner: AWS
        SourceIdentifier: ENCRYPTED_VOLUMES

  RDSEncryptionEnabled:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: rds-storage-encrypted
      Source:
        Owner: AWS
        SourceIdentifier: RDS_STORAGE_ENCRYPTED

  RootAccountMFAEnabled:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: root-account-mfa-enabled
      Source:
        Owner: AWS
        SourceIdentifier: ROOT_ACCOUNT_MFA_ENABLED

  VPCFlowLogsEnabled:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: vpc-flow-logs-enabled
      Source:
        Owner: AWS
        SourceIdentifier: VPC_FLOW_LOGS_ENABLED

Deploy it across the organization:

aws configservice put-organization-conformance-pack \
  --organization-conformance-pack-name security-baseline \
  --template-s3-uri s3://config-templates/conformance-pack-security-baseline.yaml \
  --excluded-accounts '["123456789012"]'  # Exclude management account

The Governance Maturity Model

I use this framework to help organizations assess where they are and where they need to go:

Level	Description	Key Capabilities
1 — Reactive	Single account, no guardrails	Manual security reviews, no cost visibility
2 — Foundational	Multi-account, basic SCPs	Organization trail, basic budgets, IAM Identity Center
3 — Managed	OU structure, Config Rules	Automated compliance checks, tag enforcement, Security Hub
4 — Optimized	Full automation, self-service	Auto-remediation, account vending, IaC-only deployments
5 — Continuous	Policy-as-code, continuous compliance	OPA/Cedar policies, drift detection, compliance dashboards

Most organizations I work with are between Level 2 and Level 3. Getting to Level 4 requires significant investment in automation, but in my experience it pays for itself within about 6 months through reduced operational overhead and faster team velocity.

Putting It All Together: A 30-Day Governance Sprint

If I'm setting up governance for a new organization, here's my timeline:

Week 1: Foundation

Set up AWS Organizations with the OU structure above
Configure IAM Identity Center with your IdP
Deploy the organization CloudTrail to a Log Archive account
Apply baseline SCPs (prevent leaving org, region restrictions, protect security infra)

Week 2: Security

Enable Security Hub with FSBP standard across all accounts
Enable GuardDuty across all accounts (delegated admin)
Deploy organization Config Rules (conformance pack)
Set up cross-account Security Hub aggregation

Week 3: Cost

Implement tagging strategy with SCP enforcement
Set up AWS Budgets at org, OU, and account levels
Enable Cost Anomaly Detection
Configure CUR delivery to S3 for detailed analysis

Week 4: Automation

Build account vending pipeline (Control Tower AFT or custom)
Set up auto-remediation for critical findings
Create compliance dashboards (QuickSight or Grafana)
Document runbooks and train the team

💡 Pro Tip: Don't try to boil the ocean. Start with the highest-impact, lowest-effort guardrails (organization trail, region restrictions, Security Hub). Then iterate. Perfect governance that ships in 6 months is worse than good-enough governance that ships in 30 days.

Key Takeaways

Multi-account is mandatory, not optional. Account boundaries are your strongest security and isolation primitive. Use them aggressively.
SCPs are your most powerful tool. They define the ceiling of what's possible in an account. Use a deny-list approach for most organizations, with essential SCPs for region restriction, security protection, and cost control.
Automate account provisioning. Whether through Control Tower AFT or custom pipelines, account creation should be self-service and standardized. No tickets, no delays.
Tags are governance infrastructure. Enforce them from day one. Without consistent tags, cost attribution, automation, and compliance are impossible.
Detect and remediate, don't just alert. Security Hub findings that sit in a dashboard for weeks aren't useful. Auto-remediate the well-understood ones (public S3 buckets, unencrypted volumes), alert on the rest.
Governance enables speed. The goal isn't to slow teams down — it's to give them a safe, well-lit highway to drive fast on. Guardrails, not gates.
Start with the 30-day sprint. Don't wait for perfect. Get the foundation in place, then iterate continuously.

What's Next

This completes my three-part series on AWS architecture fundamentals. If you found these articles useful, I'd recommend diving deeper into:

AWS Compute — Choosing the Right Engine for the Job — Navigate the compute landscape with confidence
Serverless on AWS — Beyond the Hype — Event-driven patterns, cost realities, and honest opinions on serverless
Infrastructure as Code at Scale — How to manage hundreds of accounts with Terraform, CDK, or CloudFormation (coming soon)
AWS Networking for the Real World — Transit Gateway, PrivateLink, VPC design patterns, and hybrid connectivity (coming soon)

Building on AWS at scale is equal parts technical skill and organizational design. The best architectures I've seen aren't just technically sound — they're backed by governance frameworks that let teams move fast without breaking things. That's the sweet spot we should all be aiming for.