The End of an Era: Migrating from Ingress NGINX to Gateway API

A practical guide for platform engineers navigating the future of Kubernetes traffic management

The Wake-Up Call

If you’re reading this, you’ve probably already heard the news that sent ripples through the Kubernetes community: Ingress NGINX, the battle-tested controller that’s been routing traffic for countless production clusters since the early days of Kubernetes, is being retired. After March 2026, there will be no more security patches, no bug fixes, and no updates.

For many of us, Ingress NGINX has been that reliable workhorse we rarely thought about. It just worked. But as with all technology, evolution is inevitable, and sometimes that evolution means saying goodbye to old friends.

The good news? This retirement isn’t just an ending—it’s an opportunity to modernize your infrastructure with Gateway API, a more powerful and flexible approach to traffic management that represents the future of Kubernetes networking.

Why Ingress NGINX Had to Go

Before we dive into the migration path, it’s worth understanding why this beloved project is being sunset. The story is a familiar one in open source: a project becomes incredibly popular, but maintainer resources don’t scale with adoption.

Ingress NGINX was originally created as a reference implementation, a proof of concept to demonstrate how the Ingress API could work. Nobody expected it to become the de facto standard for Kubernetes traffic routing. Its flexibility—the ability to inject arbitrary NGINX configuration through annotations and snippets—made it powerful but also created a maintenance nightmare.

What was once considered a feature became a security liability. The project ran on the dedication of one or two maintainers working nights and weekends. Despite its massive user base, the community never rallied with enough contributor support to make it sustainable. Even the planned replacement, InGate, failed to gain traction.

This is a sobering reminder: if you depend on open source software, contribute back when you can. Maintainer burnout is real, and projects don’t maintain themselves.

Gateway API: More Than Just a Replacement

Gateway API isn’t simply a new version of Ingress—it’s a fundamental rethinking of how we manage traffic in Kubernetes. If Ingress was a Swiss Army knife, Gateway API is a full professional toolkit.

What Makes Gateway API Different?

Role-Oriented Design: Gateway API recognizes that different people manage different aspects of infrastructure. Cluster operators manage Gateways, while application developers manage Routes. This separation of concerns reduces conflicts and improves security.

Expressiveness Without Chaos: Instead of relying on vendor-specific annotations that varied wildly between implementations, Gateway API provides rich, standardized resources. Need header-based routing? Traffic splitting? Timeouts? They’re all first-class citizens in the API.

Portable by Design: With Ingress, switching providers often meant rewriting all your annotations. Gateway API implementations share a common core, making migrations between providers far less painful.

Advanced Capabilities Built-In: Traffic weighting for canary deployments, cross-namespace routing with explicit grants, and request mirroring aren’t afterthoughts—they’re native features.
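
For instance, traffic weighting needs nothing more than two backendRefs on a single HTTPRoute rule. Here is a minimal sketch, assuming a Gateway named prod-gateway in a gateway-system namespace and two hypothetical Services, checkout-stable and checkout-canary:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout-canary
  namespace: production
spec:
  parentRefs:
  - name: prod-gateway            # Gateway managed by the platform team
    namespace: gateway-system
  hostnames:
  - checkout.example.com
  rules:
  - backendRefs:
    - name: checkout-stable       # receives ~90% of requests
      port: 8080
      weight: 90
    - name: checkout-canary       # receives ~10% of requests
      port: 8080
      weight: 10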

Reading the Map: Understanding Gateway API Resources

Before you start migrating, you need to understand the new landscape. Gateway API introduces several key resources:

GatewayClass

Think of this as the infrastructure template. It defines what type of load balancer or proxy you’re using—whether that’s NGINX, Envoy, HAProxy, or a cloud provider’s offering. Cluster administrators typically manage this.
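
In practice, most implementations install their own GatewayClass for you, so you rarely write one by hand. For reference, a minimal sketch looks like this (the controllerName shown is the one NGINX Gateway Fabric registers; other implementations use their own value):

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: nginx
spec:
  controllerName: gateway.nginx.org/nginx-gateway-controller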

Gateway

This is your actual load balancer instance. It defines listeners (ports and protocols), TLS configuration, and which routes can attach to it. One Gateway can handle multiple domains and applications.

HTTPRoute (and friends)

These define how traffic gets routed to your services. HTTPRoute handles HTTP/HTTPS traffic, while TCPRoute, TLSRoute, and UDPRoute handle other protocols. Application teams typically manage these.

ReferenceGrant

Security-conscious and often overlooked, ReferenceGrants explicitly allow cross-namespace references. If your route in namespace A wants to send traffic to a service in namespace B, you need a ReferenceGrant in namespace B permitting it.

Choosing Your Gateway: The Implementation Landscape

One of the first decisions you’ll face is which Gateway API implementation to use. Unlike Ingress NGINX, where there was essentially one dominant choice, the Gateway API ecosystem offers several mature options.

NGINX Gateway Fabric

The spiritual successor to Ingress NGINX, maintained by NGINX (now part of F5). If you want the least disruptive migration and are comfortable with NGINX’s architecture, this is your natural path. It’s designed with migration from Ingress NGINX in mind.

Best for: Teams already invested in NGINX, those wanting familiar architecture, organizations prioritizing migration simplicity.

Envoy Gateway

A CNCF project built on the Envoy proxy, which powers Istio, Contour, and Ambassador. It’s production-ready, actively developed, and benefits from Envoy’s proven performance in demanding environments.

Best for: Teams wanting cutting-edge features, those already using Envoy elsewhere, organizations prioritizing open governance.

Istio

If you’re already running a service mesh or plan to, Istio’s Gateway API implementation leverages the same infrastructure. You get traffic management plus observability, security, and resilience features.

Best for: Organizations needing service mesh capabilities, teams with complex microservices architectures, those prioritizing observability.

Cilium Gateway API

Built on eBPF technology, Cilium offers incredible performance and integrates tightly with Cilium’s network policies. If you’re using Cilium for CNI, this is a natural fit.

Best for: Performance-critical workloads, teams already using Cilium, organizations prioritizing network security.

Tigera Calico Gateway

Leveraging Envoy under the hood, Calico’s Gateway API implementation integrates seamlessly with Calico’s network policies and security features. If you’re already using Calico for network policy enforcement, this provides unified management of both networking and security.

Best for: Organizations using Calico CNI, teams prioritizing zero-trust security models, those wanting tight integration between network policy and ingress, enterprises needing advanced compliance features.

Kong Gateway

Offers both open-source and enterprise options with API management features built in. If you need more than just routing—rate limiting, authentication, analytics—Kong provides a comprehensive platform.

Best for: API-first organizations, teams needing enterprise support, those wanting integrated API management.

The Migration Game Plan

Now for the practical part: actually moving your workloads. Here’s a battle-tested approach that minimizes risk.

Phase 1: Discovery and Planning

Start by understanding what you’re migrating. Create an inventory:

# Find all Ingress resources
kubectl get ingress --all-namespaces -o yaml > ingress-backup.yaml

# Check for NGINX-specific annotations
kubectl get ingress --all-namespaces -o json | \
  jq -r '.items[]
         | select([(.metadata.annotations // {}) | keys[] | contains("nginx")] | any)
         | "\(.metadata.namespace)/\(.metadata.name)"'

# Identify snippet usage (high migration effort)
kubectl get ingress --all-namespaces -o json | \
  jq -r '.items[]
         | select(.metadata.annotations."nginx.ingress.kubernetes.io/configuration-snippet" or
                  .metadata.annotations."nginx.ingress.kubernetes.io/server-snippet")
         | "\(.metadata.namespace)/\(.metadata.name)"'

Document everything:

  • How many Ingress resources do you have?
  • Which annotations are you using?
  • Do you have custom NGINX snippets?
  • What’s your TLS certificate strategy?
  • Are there any mission-critical apps that need special attention?

Phase 2: Set Up Your Test Environment

Install Gateway API CRDs:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

Install your chosen Gateway implementation. For NGINX Gateway Fabric as an example:

kubectl apply -f https://github.com/nginxinc/nginx-gateway-fabric/releases/download/v1.4.0/crds.yaml
kubectl apply -f https://github.com/nginxinc/nginx-gateway-fabric/releases/download/v1.4.0/nginx-gateway-fabric.yaml

Create a test Gateway:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: prod-gateway
  namespace: gateway-system
spec:
  gatewayClassName: nginx
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All
  - name: https
    protocol: HTTPS
    port: 443
    allowedRoutes:
      namespaces:
        from: All
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: wildcard-tls-cert
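
Once the Gateway is applied, confirm that your implementation has accepted and programmed it before attaching any routes (using the names above):

# Check the Gateway's status, assigned address, and listener conditions
kubectl -n gateway-system get gateway prod-gateway
kubectl -n gateway-system describe gateway prod-gateway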

Phase 3: Convert Your First Application

Choose a non-critical application for your first migration. Here’s how a typical conversion looks:

Original Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop-app
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - shop.example.com
    secretName: shop-tls
  rules:
  - host: shop.example.com
    http:
      paths:
      - path: /api(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: shop-api
            port:
              number: 8080
      - path: /
        pathType: Prefix
        backend:
          service:
            name: shop-frontend
            port:
              number: 3000

Gateway API Equivalent:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop-app
  namespace: production
spec:
  parentRefs:
  - name: prod-gateway
    namespace: gateway-system
  hostnames:
  - shop.example.com
  rules:
  # API route with path rewriting
  - matches:
    - path:
        type: PathPrefix
        value: /api
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /
    backendRefs:
    - name: shop-api
      port: 8080
  # Frontend route
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: shop-frontend
      port: 3000

Notice what happened here:

  • The rewrite-target annotation became a URLRewrite filter
  • SSL redirect is no longer an annotation; it is typically expressed as a RequestRedirect filter on the HTTP listener (see the sketch after this list)
  • Rate limiting requires implementation-specific policies (varies by controller)
  • Path matching is more explicit and predictable
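
If you still need the behavior of ssl-redirect: "true", the usual Gateway API pattern is a small HTTPRoute attached only to the HTTP listener that redirects everything to HTTPS. A sketch, reusing the prod-gateway and hostname from above (the route name is arbitrary):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop-https-redirect
  namespace: production
spec:
  parentRefs:
  - name: prod-gateway
    namespace: gateway-system
    sectionName: http             # attach only to the HTTP listener
  hostnames:
  - shop.example.com
  rules:
  - filters:
    - type: RequestRedirect
      requestRedirect:
        scheme: https
        statusCode: 301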

Phase 4: Handle the Tricky Parts

TLS Management

Gateway API handles TLS at the Gateway level, not per route. This is actually cleaner but requires adjustment:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: prod-gateway
spec:
  gatewayClassName: nginx
  listeners:
  - name: https-shop
    protocol: HTTPS
    port: 443
    hostname: shop.example.com
    tls:
      mode: Terminate
      certificateRefs:
      - name: shop-tls
  - name: https-blog
    protocol: HTTPS
    port: 443
    hostname: blog.ramasankarmolleti.com
    tls:
      mode: Terminate
      certificateRefs:
      - name: blog-tls

Cross-Namespace Routing

If you need to route traffic from a Gateway in one namespace to services in another, you need ReferenceGrants:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-gateway-routes
  namespace: production
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: gateway-system
  to:
  - group: ""
    kind: Service
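
The route on the other side then names the target namespace explicitly in its backendRef. A sketch matching the grant above (route and service names are illustrative):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: cross-namespace-example
  namespace: gateway-system
spec:
  parentRefs:
  - name: prod-gateway
  rules:
  - backendRefs:
    - name: shop-api              # Service living in the production namespace
      namespace: production       # allowed by the ReferenceGrant above
      port: 8080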

Advanced Features

Common NGINX annotations and their Gateway API equivalents:

  • Timeouts: Use the HTTPRoute rule-level timeouts field where supported, or implementation-specific policies (see the sketch after this list)
  • Retries: HTTPRoute filters (implementation-specific)
  • CORS: HTTPRoute filters or policies
  • Authentication: Typically handled via implementation-specific policies
  • Rate limiting: Implementation-specific policies
  • Redirects: Built-in HTTPRoute filters
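
Timeouts are a good example of how these map. Recent Gateway API versions add a timeouts block at the rule level; here is a sketch, assuming the shop-api service from earlier (support depends on your implementation and the API version installed):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop-api-timeouts
  namespace: production
spec:
  parentRefs:
  - name: prod-gateway
    namespace: gateway-system
  hostnames:
  - shop.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    timeouts:
      request: 10s                # end-to-end request timeout
      backendRequest: 5s          # per-attempt timeout to the backend
    backendRefs:
    - name: shop-api
      port: 8080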

Phase 5: Parallel Running

Don’t just rip out Ingress NGINX. Run both controllers simultaneously:

  • Keep Ingress NGINX handling production traffic
  • Set up Gateway API for the same hostnames and routes on a separate Gateway with its own address
  • Use external testing to validate the Gateway API routes before changing DNS (see the curl sketch below)
  • Monitor performance and behavior differences

This gives you a safety net and time to ensure feature parity.
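
A simple way to validate the new path without touching DNS is to pin the hostname to the new Gateway's address with curl (the IP below is a placeholder; take the real one from kubectl get gateway):

# Hit the new Gateway directly while production DNS still points at Ingress NGINX
NEW_GATEWAY_IP=203.0.113.10
curl -sk --resolve shop.example.com:443:${NEW_GATEWAY_IP} \
  https://shop.example.com/ -o /dev/null -w "%{http_code}\n"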

Phase 6: Gradual Cutover

Move applications incrementally:

  1. Start with development/staging environments
  2. Move low-risk production apps
  3. Progress to higher-risk workloads
  4. Keep mission-critical apps for last

For each migration:

  • Update DNS or load balancer to point to new ingress
  • Monitor closely for 24-48 hours
  • Keep rollback plan ready
  • Document any issues and solutions

Phase 7: Decommission

Once all applications are migrated and stable:

  • Remove Ingress resources
  • Scale down Ingress NGINX deployment
  • After a safe waiting period, uninstall Ingress NGINX
  • Update documentation and runbooks

Common Migration Challenges and Solutions

Challenge 1: Complex Rewrite Rules

Problem: You have intricate NGINX rewrite rules in snippets.

Solution: Some rewrite patterns map cleanly to URLRewrite filters. For complex cases, consider:

  • Implementing rewrites in your application
  • Using more sophisticated Gateway implementations that support advanced routing
  • Creating custom policies if your implementation supports them

Challenge 2: Custom NGINX Configuration

Problem: You rely heavily on configuration snippets for custom behavior.

Solution:

  • Evaluate if the functionality is actually necessary
  • Check if your Gateway implementation offers equivalent features
  • Consider moving logic to your application layer
  • Use sidecar proxies for truly custom requirements

Challenge 3: Rate Limiting and WAF

Problem: You use NGINX’s rate limiting and ModSecurity.

Solution: Different Gateway implementations offer varying support:

  • NGINX Gateway Fabric: Use NGINX policies
  • Envoy Gateway: Use RateLimitFilter
  • Istio: Use RequestAuthentication and AuthorizationPolicy
  • Consider dedicated WAF solutions like Cloudflare, AWS WAF, or open-source alternatives

Challenge 4: Observability Gaps

Problem: You have custom metrics and dashboards for Ingress NGINX.

Solution:

  • Gateway API implementations expose metrics differently
  • Most support Prometheus-compatible metrics
  • Plan to rebuild dashboards for your new implementation
  • Take this opportunity to improve your observability strategy

Real-World Lessons from Early Adopters

Lesson 1: Don’t Rush Teams that tried to migrate everything at once regretted it. Those who took a methodical, app-by-app approach had smoother experiences.

Lesson 2: Test Everything Twice Gateway API is more explicit than Ingress, which is good, but it means subtle configuration differences can have big impacts. What worked in staging might behave differently in production due to traffic patterns.

Lesson 3: Invest in Automation If you have hundreds of Ingress resources, manually converting them is error-prone. Write scripts or use tools like ingress2gateway to automate conversion, then manually review and test.
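
As a sketch of what that automation can look like, the kubernetes-sigs ingress2gateway tool reads existing Ingress resources and prints Gateway API equivalents (flags and provider names may vary between releases, so check the project's documentation):

# Generate Gateway API manifests from the Ingress resources in the current cluster
ingress2gateway print --providers ingress-nginx > converted-routes.yaml

# Always review the generated manifests before applying them
less converted-routes.yaml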

Lesson 4: Documentation Is Your Friend Gateway API implementations are still maturing. Read the docs thoroughly, especially around implementation-specific features and limitations.

The Silver Lining

While forced migrations are never fun, Gateway API genuinely offers improvements:

Better Security Posture: Role-based resource separation means developers can’t accidentally (or intentionally) inject arbitrary configuration that could compromise cluster security.

Improved Developer Experience: Once set up, HTTPRoutes are more intuitive than Ingress with its maze of annotations. Developers spend less time debugging cryptic NGINX behavior.

Future-Proof Architecture: Gateway API is actively developed by the Kubernetes community and is designed to evolve. New features like session affinity, request mirroring, and advanced traffic management are being added regularly.

Vendor Flexibility: Don’t like your current Gateway implementation? Switching to another is significantly easier than migrating between different Ingress controllers used to be.

Your Action Plan for Tomorrow

If you’re running Ingress NGINX in production, here’s what to do immediately:

  1. Audit your infrastructure – Run the discovery commands to understand your current state
  2. Set a deadline – Aim to complete migration by Q4 2025, well before the March 2026 cutoff
  3. Choose an implementation – Research Gateway implementations and pick one that fits your needs
  4. Allocate resources – This isn’t a “spare time” project. Dedicate engineering time and treat it as a priority
  5. Start learning – Read Gateway API documentation, experiment in a test cluster
  6. Communicate – Inform stakeholders, plan for application team training if needed

Closing Thoughts

The retirement of Ingress NGINX marks the end of an era, but it’s not a crisis—it’s an evolution. Gateway API represents years of lessons learned from Ingress’s limitations, and it provides a more robust foundation for the next decade of Kubernetes networking.

Yes, migrations are work. Yes, there will be challenges. But the Kubernetes ecosystem is healthier when we sunset projects that have become unmaintainable and embrace better alternatives.

Take this opportunity to not just migrate, but to improve. Review your traffic patterns, simplify your routing rules, enhance your security posture, and build a more maintainable infrastructure.

The deadline is March 2026, but don’t wait that long. Start planning now, migrate thoughtfully, and you’ll emerge with a better, more future-proof system.

Your future self—and your infrastructure—will thank you.


Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Book 1:1

AWS re:Invent 2025 – What to Watch?

Ready for AWS re:Invent? Let’s meet at the event. I will be at the AWS Community Builders re:Invent 2025 Mixer.

Here’s the guide for the event happening December 1-5:

Key Keynotes to Attend

The event features five major keynotes where AWS CEO Matt Garman and senior executives will announce new products and services.

Matt Garman (CEO) – Tuesday, Dec 2: Opening keynote covering AWS innovations across foundational building blocks and new experiences.

Swami Sivasubramanian (VP of Agentic AI) – Wednesday, Dec 3: Focus on how agentic AI will transform work, featuring tools and services for building secure, reliable, and scalable agents on AWS.

Peter DeSantis & Dave Brown – Thursday, Dec 4: Deep dive into the engineering powering AWS services, from silicon to services.

Dr. Werner Vogels (CTO) – Thursday, Dec 4: Developer keynote on how tools, patterns, and practices are evolving in an AI-driven world.

Dr. Ruba Borno (VP of Partners) – Wednesday, Dec 3: Partnership keynote with a fireside chat featuring Matt Garman on customer transformations in the agentic era.

Technologies & Services to Focus On

1. Agentic AI – The Dominant Theme

This year features 43 sessions focused on Amazon Connect and agentic AI, representing the most significant theme across the conference:

Amazon Bedrock AgentCore: Seven core services for deploying and operating secure AI agents at enterprise scale.

(https://www.aboutamazon.com/news/aws/aws-summit-agentic-ai-innovations-2025)

Model Context Protocol (MCP): Integration with services like Amazon EKS for context-aware Kubernetes workflows and multi-agent AI systems with secure agent-to-agent communication.

(https://aws.amazon.com/blogs/containers/guide-to-amazon-eks-and-kubernetes-sessions-at-aws-reinvent-2025/)

Real-world ROI: Companies like Zepz are deflecting 30% of contacts while processing $16 billion in transactions, and TUI Group migrated 10,000 agents across 12 European markets while cutting operating costs by 10%.

2. Amazon Connect & Customer Experience

Amazon Connect is transforming customer experiences by seamlessly embedding AI across every customer touchpoint, driving coordination between human and AI agents.

Key sessions include:

BIZ221: Agentic AI advancements in customer experience

Multiple case studies from financial services, healthcare, and tourism sectors

3. Compute & Infrastructure

76 compute-focused sessions cover EC2 and its more than 1,000 instance types, including new processors from Intel (Granite Rapids), AMD (Turin), and AWS Graviton.

Focus areas:

  • AI hardware optimization
  • Amazon EC2 Auto Mode
  • Serverless computing with AWS Lambda

4. Kubernetes & Container Services

48 Amazon EKS and Kubernetes sessions cover simplified cluster management with Amazon EKS Auto Mode, AI/ML workload orchestration, and support for ultra-scale clusters of up to 100K nodes.

https://aws.amazon.com/blogs/containers/guide-to-amazon-eks-and-kubernetes-sessions-at-aws-reinvent-2025

5. Well-Architected Framework Updates

AWS is launching three Well-Architected Lenses: the new Responsible AI Lens and updated Machine Learning Lens and Generative AI Lens, providing comprehensive guidance for AI workloads.

My key focus areas for this year are:

“The Agentic AI Revolution” – This is clearly the headline story with real production deployments showing measurable ROI

Infrastructure for AI at Scale – Cover the compute innovations, Kubernetes capabilities, and how AWS is building for AI workloads

Practical Implementation – Highlight the case studies and workshops that show how companies are actually deploying these technologies

Developer Experience – Focus on tools like Model Context Protocol and how AWS is making AI development more accessible

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Book 1:1

Trending Topic: Deepseek R1

The Rise of DeepSeek R1: A Game-Changer in AI Landscape


The AI world is abuzz, and right at its center is DeepSeek R1. The frenzy isn’t just about the amazing things it can do, but also about how it’s redefining the entire landscape of AI.


Cut to the chase: DeepSeek R1 is now pretty much available everywhere that matters. AWS users can access it via Amazon Bedrock and SageMaker, Microsoft fans via Azure AI Foundry and GitHub, and NVIDIA has even jumped on board, offering it as a NIM microservice preview.

Talk about making an entrance! What’s got everyone talking isn’t just its availability but, quite simply, the punch it packs. We are looking at a beast: a 671-billion-parameter Mixture-of-Experts architecture. What does that mean in plain English? It is not just big; it is smart about how it deploys its size. The model uses chain-of-thought reasoning and reinforcement learning, letting it stand head-to-head with OpenAI in several benchmarks. But here’s the kicker: it does all this while being more cost-effective than its competitors.


It has been interesting to see the industry impact. Several AI companies based in the U.S. have watched their market values take a hit, which sparked some interesting discussions about U.S. dominance in AI technology. It’s like watching a new player crash a party and immediately become the center of attention.


On the practical side, DeepSeek R1 proves to be quite versatile: Microsoft is already bringing it into Copilot+ PCs to enable on-device applications. It seems to be particularly good at tasks that require logical reasoning, mathematical reasoning, and coding. Companies have also begun taking advantage of the model; TuanChe, for instance, wants to use DeepSeek R1 to upgrade its tech infrastructure.


However, all that glitters is not gold. Security researchers have pointed out a few vulnerabilities, most of which revolve around jailbreak techniques and prompt injections. The writing on the wall is clear: immense power demands great care and serious attention to security.


But the remarkable thing is how DeepSeek pulled it off: there are reports that they’ve managed to train this model on a fraction of the compute budget of comparable models, almost like taking a shortcut to the top and leaving the rest of the industry scratching their heads and reaching for their notepads.


If DeepSeek R1 fares well, it will probably trigger a race toward more efficient AI models. This could be a paradigm shift in how the industry approaches the development and deployment of AI models. The focus might finally be shifting from “bigger is better” to “smarter is better.”


Ultimately, DeepSeek R1 is more than just another AI model: it is a wake-up call that there is still room for innovation in AI, and that sometimes the most disruptive advances come from the places you least expect. Whether you’re a technology enthusiast, developer, or business leader, this is surely one to watch.


The question now isn’t whether DeepSeek R1 will make an impact; it already has. The real question is how the rest of the industry will respond. One thing’s for sure: the AI landscape just got a lot more interesting.
What are your thoughts on DeepSeek R1? Have you had a chance to try it out? Let me know in the comments below!

Here’s a consolidated list of key developments and points to review regarding DeepSeek R1:

  1. Model Availability: Amazon Bedrock and SageMaker on AWS, Azure AI Foundry and GitHub on the Microsoft side, and an NVIDIA NIM microservice preview.
  2. Performance and Capabilities: A 671-billion-parameter Mixture-of-Experts model with chain-of-thought reasoning that goes head-to-head with OpenAI on several benchmarks at lower cost.
  3. Industry Impact: Market-value pressure on U.S. AI companies and renewed debate about U.S. dominance in AI.
  4. Deployment and Use Cases: On-device integration in Copilot+ PCs; strengths in logical reasoning, math, and coding; early adopters such as TuanChe.
  5. Security and Ethical Considerations: Reported vulnerabilities around jailbreak techniques and prompt injection.
  6. Development and Training: Reportedly trained on a fraction of the compute budget of comparable models.
  7. Future Implications: A likely shift from “bigger is better” toward more efficient AI models.

This list covers the major points of interest surrounding the DeepSeek R1 model based on recent news and developments.

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Book 1:1

Mastering Karpenter Autoscaling in Rancher RKE2

Have you ever experienced the endless struggle of fitting workloads into your Kubernetes clusters, feeling like you’re stuck in a never-ending game of Tetris? It can be quite a challenge, especially when demand suddenly surges, and you find yourself in a frenzy trying to adjust. But fear not, as we’re about to take a voyage that will turn your cluster management from a chaotic puzzle into a well-coordinated system!

Join us on this detailed journey where we delve into the world of setting up Karpenter autoscaling in Rancher RKE2 using customized EC2 instances. We’ll simplify the process into manageable steps, sprinkle in some real-life scenarios for better understanding, and perhaps share a humorous moment or two along the way. By the time you finish reading this piece, you’ll have all the insights needed to revamp your Kubernetes cluster into an efficient autoscaling powerhouse!

Understanding the Karpenter Advantage

Let’s start by exploring what sets Karpenter apart as a game-changer in the realm of Kubernetes autoscaling. Picture yourself at a buffet where instead of loading up your plate with everything, you have a personal chef who prepares exactly what you need precisely when you need it. That’s essentially the magic of Karpenter for your Kubernetes cluster!

Karpenter acts like an exceptionally intelligent waiter who not only anticipates your desires but also efficiently serves them up. It examines your workloads, takes into account your limitations, and then allocates just the right amount of computing resources to ensure smooth operations. Say goodbye to wasted resources and frantic scaling when traffic surges – Karpenter has got you covered!

But here’s where it gets truly fascinating: Karpenter isn’t limited to a fixed menu. It can collaborate with custom EC2 instances, giving you the freedom to customize your infrastructure according to your unique requirements. Imagine having access to a buffet where you can order bespoke dishes that aren’t even listed on the menu!

Now, you might be pondering, “How is this different from the Cluster Autoscaler I’m accustomed to using?” Excellent question! While the Cluster Autoscaler functions like a vigilant kitchen manager overseeing your existing node groups and adjusting their size as necessary, Karpenter is more akin to a culinary virtuoso capable of spontaneously creating brand-new node groups tailored specifically for your workloads.

Setting the Stage: Preparing Your Rancher RKE2 Environment

Before we dive into the nitty-gritty of implementing Karpenter, we need to make sure our Rancher RKE2 environment is primed and ready. Think of this as prepping your kitchen before a big cook-off – you want all your ingredients within reach and your tools sharpened!

First things first, let’s make sure you have a Rancher RKE2 cluster up and running. If you’re starting from scratch, head over to the Rancher documentation and follow their guide on setting up an RKE2 cluster. It’s as easy as pie – or should I say, as easy as kubectl apply!

Once you’ve got your cluster humming along, it’s time to roll up your sleeves and get your hands dirty with some AWS configuration. You’ll need to set up an IAM role for your EC2 instances, giving them the necessary permissions to interact with AWS services. This is like giving your chef the keys to the pantry – they need access to all the ingredients to whip up their culinary masterpieces!

Here’s a quick checklist to ensure you’re ready to rock:

  1. A functioning Rancher RKE2 cluster
  2. AWS CLI configured with the necessary credentials
  3. IAM roles set up for your EC2 instances
  4. kubectl and helm installed on your local machine

Understood everything? Awesome! You’re all set to begin your Karpenter adventure. Before we jump in, let’s pause for a moment to admire the magnificence of our upcoming task. We’re not just configuring another autoscaler – we’re transforming how our cluster adjusts to evolving needs. It’s akin to granting your Kubernetes cluster an advanced degree in managing resources!

Implementing Karpenter: The Main Course

Hey there, everyone! Are you ready for the big moment? Let’s dive in and kick off the process of setting up Karpenter in our Rancher RKE2 cluster. This is where all the excitement unfolds, so stay tuned!

To begin with, our first step is to get Karpenter installed in our cluster. We’re going to leverage Helm for this task because, honestly, who can resist a well-crafted Helm chart? It’s akin to having a cookbook for your Kubernetes deployments!

helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm upgrade --install karpenter karpenter/karpenter --namespace karpenter --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::YOUR_ACCOUNT_ID:role/KarpenterControllerRole-YOUR_CLUSTER_NAME

Make sure to replace YOUR_ACCOUNT_ID and YOUR_CLUSTER_NAME with your actual AWS account ID and cluster name. It’s like filling in the blanks in a mad lib, but way more fun because it’s INFRASTRUCTURE!

Now that we have successfully installed Karpenter, the next step is to customize it to operate with our unique EC2 instances. This is the truly thrilling part! Our task involves setting up a Provisioner, essentially a detailed guide for Karpenter on generating fresh nodes.

Here’s an example of a Provisioner that uses custom EC2 instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: YOUR_CLUSTER_NAME
  securityGroupSelector:
    karpenter.sh/discovery: YOUR_CLUSTER_NAME
  instanceProfile: KarpenterNodeInstanceProfile-YOUR_CLUSTER_NAME
  tags:
    karpenter.sh/discovery: YOUR_CLUSTER_NAME

This Provisioner is like a blueprint for your nodes. It tells Karpenter what kind of instances to use (spot or on-demand), sets resource limits, and specifies which subnets and security groups to use. It’s like giving Karpenter a shopping list for your infrastructure!

But wait, there’s more! We can get even fancier with our custom EC2 configurations. Let’s say you have some workloads that require GPU instances, and others that need high memory. No problem! We can create multiple Provisioners to handle different types of workloads:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-workload
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["p3.2xlarge", "p3.8xlarge"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  taints:
    - key: gpu
      value: "true"
      effect: NoSchedule
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: high-memory-workload
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["r5.2xlarge", "r5.4xlarge"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]

With these Provisioners in place, Karpenter will automatically create the right type of nodes for your different workloads. It’s like having a personal assistant who always knows exactly what you need!

Fine-Tuning Your Karpenter Setup: The Secret Sauce

Now that we’ve got the basics down, let’s add some FLAVOR to our Karpenter setup. This is where we can really make our autoscaling shine!

One of the coolest features of Karpenter is its ability to use consolidation to optimize your cluster. Consolidation is like playing Tetris with your workloads – Karpenter will try to fit them onto the fewest number of nodes possible, potentially SAVING you money on your cloud bill. Who doesn’t love saving money?

To enable consolidation, we can add a few lines to our Provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true

But wait, there’s more! We can also set up custom termination rules to control how Karpenter decides which nodes to remove when scaling down. This is like giving Karpenter a set of guidelines for spring cleaning your cluster:

spec:
  ttlSecondsUntilExpired: 2592000 # 30 days
  ttlSecondsAfterEmpty: 30

These settings tell Karpenter to remove nodes that have been empty for 30 seconds, and to replace nodes that are older than 30 days. It’s like having a robot maid that knows exactly when to tidy up!

Wrapping Up: The Karpenter Feast

Phew! We’ve covered a lot of ground, haven’t we? From understanding the basics of Karpenter to implementing it in our Rancher RKE2 cluster with custom EC2 instances, we’ve truly embarked on a feast of Kubernetes autoscaling knowledge!

Let’s take a moment to digest what we’ve learned:

  1. Karpenter is like a SUPER smart waiter for your Kubernetes cluster, provisioning just the right resources when you need them.
  2. Setting up Karpenter in Rancher RKE2 involves preparing your AWS environment, installing Karpenter, and configuring Provisioners.
  3. Custom EC2 instances give you the flexibility to tailor your infrastructure to your specific needs.
  4. Fine-tuning features like consolidation and custom termination rules can help optimize your cluster even further.

By implementing Karpenter, you’re not just improving your cluster’s efficiency – you’re revolutionizing the way you manage your Kubernetes infrastructure. It’s like upgrading from a bicycle to a ROCKET SHIP!

So, what are you waiting for? Get out there and start Karpentering! Your future self (and your CFO) will thank you for the optimized infrastructure and reduced cloud bills. Remember, in the world of Kubernetes, efficiency is king, & with Karpenter, you’re wearing the crown!

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Book 1:1

Exploring the pivotal role of advanced cloud architecture in enabling AI-driven digital transformation.

Modern AI Architecture: Building Scalable and Efficient Systems


In today’s rapidly changing technology landscape, organizations face unprecedented challenges in managing and scaling their AI infrastructure. This guide walks through the key elements of modern AI architecture and provides practical insights into implementation.

The Growing AI Landscape: Current Challenges

Exponential Growth in AI Workloads


The AI revolution has brought exponential growth in computational needs. To train sophisticated machine learning models, organizations are processing datasets of unprecedented size that push the limits of traditional infrastructure. This surge in AI workloads brings both great opportunity and great challenge to every enterprise seeking to remain competitive.

The Triple Challenge: Scale, Cost, and Speed

Organizations face three primary challenges:

  • Scalability: Systems must handle varying workload intensities without performance degradation
  • Cost Management: Balancing operational expenses with performance requirements
  • Real-time Processing: Meeting the growing demand for instant data processing and analysis

Core Architecture Components

Foundation Layer: The Backbone

The foundation layer serves as the infrastructure cornerstone, comprising:

  • Multi-cloud infrastructure for optimal resource distribution
  • Kubernetes orchestration for container management
  • Service mesh implementation for reliable microservices communication
  • GitOps pipelines for streamlined development workflows

AI Layer: The Intelligence Center

At the heart of the architecture lies the AI layer:

  • Sophisticated model training infrastructure
  • High-performance inference endpoints
  • Centralized feature store for consistent model training
  • Comprehensive model registry for version control and governance

Data Layer: The Knowledge Foundation

A robust data layer ensures efficient data management through:

  • Scalable data lakes for diverse data storage
  • Optimized vector databases for high-dimensional data
  • Real-time streaming capabilities
  • Strategic cache implementation for reduced latency

Implementation Best Practices

Infrastructure Automation

Modern AI architectures benefit from:

  • Infrastructure as Code (IaC) using tools like Terraform
  • Automated deployment processes with Helm and ArgoCD
  • Continuous integration and deployment pipelines

MLOps Excellence

Establish robust MLOps practices including:

  • Systematic model versioning
  • Automated testing protocols
  • A/B testing frameworks for performance optimization
  • Continuous deployment strategies

Comprehensive Monitoring

Implement a multi-faceted monitoring approach:

  • Metrics collection and analysis with Prometheus
  • Visual data representation through Grafana
  • Distributed tracing with Jaeger
  • Centralized logging using the ELK stack

Real-World Implementation: E-commerce Case Study

Challenge

An e-commerce platform faced the challenge of managing millions of daily users while maintaining high performance and personalization.

Solution Components

The implementation included:

  • Serverless inference for dynamic scaling
  • Real-time feature computation for personalization
  • Intelligent auto-scaling mechanisms
  • Edge computing integration

Results

The solution achieved:

  • Significantly improved response times
  • Optimal resource utilization
  • Enhanced user experience through personalization
  • Reduced operational costs

Security and Performance Optimization

Security Best Practices

  • End-to-end encryption for data protection
  • Role-Based Access Control (RBAC) implementation
  • Regular security audits and updates
  • Zero-trust architecture principles

Performance Enhancement Strategies

  • Aggressive caching mechanisms
  • CDN utilization for edge inference
  • Optimized data processing pipelines
  • Circuit breaker implementation for failure prevention

Key Success Factors

Automation First

Prioritize automation across all layers to:

  • Reduce manual intervention
  • Minimize human error
  • Increase deployment speed
  • Ensure consistency

Cost Optimization

Implement strategic cost management through:

  • Resource usage monitoring
  • Automated scaling policies
  • Regular cost analysis and optimization
  • Strategic technology investments

Performance Monitoring

Maintain system health through:

  • Real-time performance monitoring
  • Proactive issue detection
  • Regular performance audits
  • Continuous optimization

Conclusion

Building a modern AI architecture requires a careful balance of scalability, security, and performance. By following these architectural principles and implementation practices, organizations can create robust, efficient, and cost-effective AI systems that drive business value while maintaining operational excellence.

Remember that architecture is not a one-time effort but an evolving journey that requires continuous refinement and adaptation to meet changing business needs and technological advances.

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Book 1:1

AWS re:Invent 2024: The Game-Changing Announcements You Might Have Missed

While the keynotes and major product launches grabbed the headlines at AWS re:Invent 2024, some of the most transformative announcements flew under the radar. Let’s explore these hidden gems that are set to reshape how we build and deploy cloud solutions.

Amazon Nova: Pushing the Boundaries of AI

The launch of Amazon Nova through Amazon Bedrock marks a pivotal moment in the evolution of cloud-based AI services. What sets Nova apart isn’t just its capabilities – it’s how seamlessly it integrates different types of media and context.

Think about the last time you tried to analyze a lengthy video presentation. Nova’s advanced video processing can now automatically generate detailed summaries and extract key insights, saving hours of manual review. Content creators are already exploring its ability to transform written scripts into storyboards, opening new possibilities for rapid prototyping and production.

But where Nova truly shines is in its context handling. With support for up to 300,000 tokens (and plans to expand to 2 million tokens in early 2025), we’re talking about AI that can process and understand entire books or hours of conversation while maintaining context throughout. This isn’t just about handling more text – it’s about enabling AI to truly grasp the nuances of long-form content.

Iceberg Tables on S3: A Data Lake Revolution

The introduction of native Iceberg Tables support for S3 might sound technical, but its implications for data management are profound. For years, organizations have struggled with the complexity of managing large-scale data lakes. This announcement changes everything.

Real-time metadata querying means you can now track changes to your data as they happen. Imagine being able to instantly understand how your data is being used, who’s accessing it, and how it’s evolving over time. This isn’t just about better data management – it’s about enabling real-time decision making based on your data’s lifecycle.

The architecture simplification is equally impressive. By reducing the need for separate metadata stores and enabling more efficient query planning, Iceberg Tables on S3 cuts through the complexity that has long plagued data lake implementations. This means faster development, lower costs, and fewer headaches for data teams.

CloudFront VPC Origins: Security Meets Performance

The new VPC Origins feature for CloudFront is a masterclass in solving multiple problems at once. By enabling direct connections to resources in private subnets through Elastic Network Interfaces, AWS has eliminated one of the biggest security concerns in content delivery – the need for public IP addresses on origin servers.

But the benefits go beyond security. Organizations are discovering that this architectural change can lead to significant cost savings by eliminating the need for NAT gateways and simplifying network management. It’s a rare example of a security enhancement that actually reduces complexity rather than adding to it.

SageMaker Lakehouse: Where Data Meets AI

The next generation of Amazon SageMaker brings us SageMaker Lakehouse, and it’s clear that AWS has been listening to its customers. The platform unifies data processing, analytics, and machine learning in a way that feels natural rather than forced.

What stands out is how it handles different types of data. Whether you’re working with structured databases, unstructured business documents, or something in between, SageMaker Lakehouse provides a consistent experience. This isn’t just about convenience – it’s about enabling new types of analysis that weren’t practical before.

AWS Glue: The Unsung Hero of Data Integration

The updates to AWS Glue might not make headlines, but they solve real-world problems that data engineers face every day. The 25% improvement in automated schema detection accuracy means less time fixing data parsing errors and more time actually using the data.

The expanded library of pre-built transformations is equally important, as it addresses the reality that most organizations spend more time preparing data than analyzing it. Combined with new connectors for enterprise systems like SAP and Salesforce, Glue is becoming the Swiss Army knife of data integration.

Looking Ahead

These announcements from re:Invent 2024 show AWS’s commitment to solving real-world problems rather than just chasing trends. The focus on practical innovations in AI, data management, and security suggests a maturing cloud ecosystem where integration and usability are taking center stage.

For organizations building on AWS, these changes represent new opportunities to simplify architectures, improve security, and extract more value from their data. The real excitement isn’t about individual features – it’s about how these capabilities can be combined to create solutions that weren’t possible before.

Stay tuned as these technologies mature and new use cases emerge. The cloud journey is far from over, and if these announcements are any indication, the next chapter will be even more exciting than the last.

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Book 1:1

Decoupling Terraform and Ansible: A Deep Dive into Infrastructure Management

Why Decouple Terraform and Ansible?

Decoupling Terraform and Ansible allows for a clear separation of concerns in infrastructure management:

  1. Terraform excels at provisioning and managing cloud resources (infrastructure as code).
  2. Ansible specializes in configuration management and application deployment.

By decoupling these tools, we can:

  • Improve modularity and maintainability of our infrastructure code
  • Enable independent scaling of infrastructure provisioning and configuration management
  • Facilitate easier troubleshooting and rollbacks
  • Allow for more flexible workflows and tool choices

Approach 1: Jenkins CI Integration

This approach uses Jenkins as the orchestrator for both Terraform and Ansible operations.

Step-by-Step Example:

  1. Set up a Jenkins server with necessary plugins (Terraform, Ansible, AWS).
  2. Create a Jenkins pipeline for EKS cluster provisioning:
pipeline {
    agent any
    stages {
        stage('Provision EKS Cluster') {
            steps {
                sh 'terraform init'
                sh 'terraform apply -auto-approve'
            }
        }
        stage('Configure EKS Cluster') {
            steps {
                sh 'ansible-playbook -i inventory eks-config.yml'
            }
        }
    }
}
  3. Create Terraform configuration for EKS:
resource "aws_eks_cluster" "example" {
  name     = "example-cluster"
  role_arn = aws_iam_role.example.arn

  vpc_config {
    subnet_ids = ["subnet-12345678", "subnet-87654321"]
  }
}
  4. Create Ansible playbook for EKS configuration:
- name: Configure EKS Cluster
  hosts: localhost
  tasks:
    - name: Update kubeconfig
      shell: aws eks update-kubeconfig --name example-cluster

5. Run the Jenkins pipeline to provision and configure the EKS cluster.

Approach 2: GitOps Method

The GitOps approach uses Git as the single source of truth for both infrastructure and application deployments.

Step-by-Step Example:

  1. Set up a Git repository for your infrastructure code.
  2. Create a Terraform configuration for EKS in the repository:
resource "aws_eks_cluster" "example" {
  name     = "example-cluster"
  role_arn = aws_iam_role.example.arn

  vpc_config {
    subnet_ids = ["subnet-12345678", "subnet-87654321"]
  }
}
  3. Create Ansible playbooks for cluster configuration in the same repository:
- name: Configure EKS Cluster
  hosts: localhost
  tasks:
    - name: Update kubeconfig
      shell: aws eks update-kubeconfig --name example-cluster
  4. Set up a GitOps operator (e.g., Flux or ArgoCD) in your EKS cluster.
  5. Create a GitOps configuration file (e.g., for Flux):
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: infrastructure-repo
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/your-org/infrastructure-repo
  ref:
    branch: main
---
apiVersion: infra.contrib.fluxcd.io/v1alpha1
kind: Terraform
metadata:
  name: eks-cluster
  namespace: flux-system
spec:
  interval: 1h
  path: ./terraform
  sourceRef:
    kind: GitRepository
    name: infrastructure-repo

6. Apply the GitOps configuration to your cluster.

Advantages of GitOps Approach:

  1. Version Control: All changes are tracked in Git, providing a clear audit trail.
  2. Automated Synchronization: The desired state in Git is automatically reconciled with the cluster state.
  3. Simplified Rollbacks: Reverting to a previous state is as easy as reverting a Git commit.
  4. Improved Collaboration: Teams can use familiar Git workflows for infrastructure changes.
  5. Enhanced Security: Reduced need for direct cluster access, as changes are made through Git.

Conclusion

While both approaches have their merits, the GitOps method aligns more closely with Kubernetes’ declarative nature and offers better scalability and auditability. However, the choice between Jenkins CI and GitOps should be based on your team’s specific needs, existing toolchain, and comfort with Git-centric workflows. As the DevOps landscape continues to evolve, the decoupling of infrastructure provisioning and configuration management tools will remain crucial for building flexible, maintainable, and scalable systems.

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

New Trends in Devops, Cloud Engineering, Platform Engineering

As we move into 2025, the DevOps, Cloud Engineering, and Platform Engineering landscape continues to change rapidly. These interrelated disciplines are undergoing transformative changes that reshape how organizations approach software development, deployment, and infrastructure management. Let’s review some of the most significant trends in these domains.

AI-Driven DevOps and Cloud Operations

AI and ML are becoming integral to DevOps and cloud engineering practices. These technologies are enhancing automation, predictive analytics, and decision-making processes throughout the development lifecycle.

  • AI-powered code assistants are streamlining Infrastructure as Code (IaC) creation, reducing the time and effort required to manage cloud infrastructure.
  • Machine learning algorithms are being employed to predict potential issues in deployments, enabling proactive problem-solving.
  • AI is also being leveraged to optimize cloud resource allocation, ensuring cost-effective and efficient use of cloud services.

The Rise of Platform Engineering

Platform engineering is gaining significant momentum as organizations seek to create internal developer platforms that streamline workflows and improve productivity.

  • By 2026, Gartner predicts that 80% of software engineering organizations will establish platform teams to provide reusable services, components, and tools for application delivery.
  • Platform engineering is addressing the inefficiencies caused by the decentralization of tools and processes in DevOps, especially as organizations scale.
  • This approach is offering a frictionless self-service experience to developers, accelerating the software development process.

Serverless and Edge Computing Integration

The adoption of serverless architectures and edge computing is reshaping how applications are developed and deployed.

  • Serverless computing is simplifying infrastructure management, allowing developers to focus more on code rather than server maintenance.
  • Edge computing is enabling DevOps teams to deploy and manage applications closer to end-users, improving performance and reducing latency.
  • The integration of edge computing with cloud platforms is creating new possibilities for real-time data processing and IoT applications.

DevSecOps Evolution

Security is becoming increasingly integrated into the DevOps lifecycle, giving rise to more sophisticated DevSecOps practices.

  • Organizations are adopting a “shift left” approach, embedding security practices from the earliest stages of development.
  • Zero Trust architectures are being implemented in cloud environments, operating on the principle of “never trust, always verify.”
  • AI-powered security tools are being used to automate threat detection and response in cloud-native applications.

Multi-Cloud and Hybrid Cloud Strategies

Organizations are increasingly adopting multi-cloud and hybrid cloud approaches to leverage the strengths of different cloud providers and maintain flexibility.

  • Platform engineering teams are developing tools and practices to ensure consistent deployment and management across various cloud environments.
  • Cloud-agnostic technologies like Kubernetes are playing a crucial role in enabling portability between different cloud platforms.

Sustainability in Cloud Computing

Green cloud computing is emerging as a significant trend, with organizations focusing on reducing the environmental impact of their cloud operations.

  • Cloud providers are investing in renewable energy sources for their data centers.
  • DevOps and platform engineering teams are developing practices to optimize resource utilization and reduce energy consumption in cloud environments.

Conclusion

The convergence of DevOps, Cloud Engineering, and Platform Engineering is driving innovation and efficiency in software development and deployment. As these fields continue to evolve, organizations that stay ahead of these trends will be well-positioned to leverage the full potential of modern technology practices. The future of these domains lies in intelligent automation, enhanced security, flexible cloud strategies, and sustainable computing practices.

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Python Script to find untagged resources in AWS

Here is a script to find untagged resources in AWS:

import boto3
import datetime
from botocore.exceptions import ClientError
import json

class AWSTagRecommender:
    def __init__(self, region='us-east-1'):
        self.region = region
        self.ec2 = boto3.client('ec2', region_name=region)
        self.rds = boto3.client('rds', region_name=region)
        self.s3 = boto3.client('s3')
        self.lambda_client = boto3.client('lambda', region_name=region)

    def recommended_tags(self, resource_type, resource_details=None):
        """Generate recommended tags based on resource type and details"""
        current_date = datetime.datetime.now().strftime('%Y-%m-%d')
        
        # Base recommended tags
        base_tags = {
            'Environment': ['prod', 'dev', 'stage', 'test'],
            'Owner': 'REQUIRED',
            'CostCenter': 'REQUIRED',
            'Project': 'REQUIRED',
            'CreatedDate': current_date,
            'Backup': ['true', 'false'],
            'SecurityLevel': ['high', 'medium', 'low']
        }
        
        # Resource-specific tag recommendations
        specific_tags = {
            'ec2': {
                'ApplicationRole': ['web', 'app', 'db', 'cache'],
                'PatchGroup': ['group1', 'group2', 'critical'],
                'AutoStop': ['true', 'false']
            },
            'rds': {
                'DatabaseType': ['mysql', 'postgres', 'oracle', 'sqlserver'],
                'BackupRetention': ['7days', '30days', '90days'],
                'DataClassification': ['public', 'private', 'confidential']
            },
            's3': {
                'DataType': ['logs', 'backups', 'user-content', 'static-assets'],
                'AccessPattern': ['frequent', 'infrequent', 'archive'],
                'LifecyclePolicy': ['required', 'not-required']
            },
            'lambda': {
                'FunctionType': ['api', 'scheduled', 'event-driven'],
                'Runtime': resource_details.get('Runtime', 'unknown') if resource_details else 'REQUIRED',
                'APIEndpoint': ['true', 'false']
            }
        }
        
        return {**base_tags, **specific_tags.get(resource_type, {})}

    def find_untagged_ec2(self):
        """Find untagged EC2 resources"""
        try:
            instances = self.ec2.describe_instances()
            untagged = []
            
            for reservation in instances['Reservations']:
                for instance in reservation['Instances']:
                    if not instance.get('Tags'):
                        untagged.append({
                            'ResourceId': instance['InstanceId'],
                            'Type': 'ec2',
                            'Details': {
                                'InstanceType': instance['InstanceType'],
                                'State': instance['State']['Name'],
                                'LaunchTime': instance['LaunchTime'].strftime('%Y-%m-%d')
                            }
                        })
            return untagged
        except ClientError as e:
            print(f"Error finding untagged EC2 instances: {e}")
            return []

    def find_untagged_rds(self):
        """Find untagged RDS resources"""
        try:
            instances = self.rds.describe_db_instances()
            untagged = []
            
            for instance in instances['DBInstances']:
                tags = self.rds.list_tags_for_resource(
                    ResourceName=instance['DBInstanceArn']
                )['TagList']
                
                if not tags:
                    untagged.append({
                        'ResourceId': instance['DBInstanceIdentifier'],
                        'Type': 'rds',
                        'Details': {
                            'Engine': instance['Engine'],
                            'Class': instance['DBInstanceClass'],
                            'Storage': instance['AllocatedStorage']
                        }
                    })
            return untagged
        except ClientError as e:
            print(f"Error finding untagged RDS instances: {e}")
            return []

    def find_untagged_s3(self):
        """Find untagged S3 buckets"""
        try:
            buckets = self.s3.list_buckets()['Buckets']
            untagged = []
            
            for bucket in buckets:
                try:
                    self.s3.get_bucket_tagging(Bucket=bucket['Name'])  # raises NoSuchTagSet when the bucket has no tags
                except ClientError as e:
                    if e.response['Error']['Code'] == 'NoSuchTagSet':
                        untagged.append({
                            'ResourceId': bucket['Name'],
                            'Type': 's3',
                            'Details': {
                                'CreationDate': bucket['CreationDate'].strftime('%Y-%m-%d')
                            }
                        })
            return untagged
        except ClientError as e:
            print(f"Error finding untagged S3 buckets: {e}")
            return []

    def find_untagged_lambda(self):
        """Find untagged Lambda functions"""
        try:
            functions = self.lambda_client.list_functions()['Functions']
            untagged = []
            
            for function in functions:
                tags = self.lambda_client.list_tags(
                    Resource=function['FunctionArn']
                ).get('Tags', {})
                
                if not tags:
                    untagged.append({
                        'ResourceId': function['FunctionName'],
                        'Type': 'lambda',
                        'Details': {
                            'Runtime': function.get('Runtime', 'container-image'),  # image-packaged functions have no Runtime field
                            'LastModified': function['LastModified']
                        }
                    })
            return untagged
        except ClientError as e:
            print(f"Error finding untagged Lambda functions: {e}")
            return []

    def generate_report(self):
        """Generate a comprehensive report of untagged resources and recommendations"""
        all_untagged = []
        all_untagged.extend(self.find_untagged_ec2())
        all_untagged.extend(self.find_untagged_rds())
        all_untagged.extend(self.find_untagged_s3())
        all_untagged.extend(self.find_untagged_lambda())

        report = {
            'generated_date': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            'region': self.region,
            'untagged_resources': []
        }

        for resource in all_untagged:
            report['untagged_resources'].append({
                'resource_id': resource['ResourceId'],
                'resource_type': resource['Type'],
                'details': resource['Details'],
                'recommended_tags': self.recommended_tags(resource['Type'], resource['Details'])
            })

        return report

def main():
    # Initialize the tag recommender
    recommender = AWSTagRecommender()
    
    # Generate the report
    print("Analyzing AWS resources for missing tags...")
    report = recommender.generate_report()
    
    # Save the report to a file
    filename = f"tag_recommendations_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(filename, 'w') as f:
        json.dump(report, f, indent=2, default=str)
    
    # Print summary
    print(f"\nReport generated: {filename}")
    print(f"Found {len(report['untagged_resources'])} untagged resources")
    
    # Print resource type breakdown
    resource_types = {}
    for resource in report['untagged_resources']:
        resource_types[resource['resource_type']] = resource_types.get(resource['resource_type'], 0) + 1
    
    print("\nBreakdown by resource type:")
    for rtype, count in resource_types.items():
        print(f"{rtype}: {count} untagged resources")

if __name__ == "__main__":
    main()
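
A minimal usage sketch: assuming the script above is saved as find_untagged.py and the AWS credentials in use have read-only access (for example ec2:DescribeInstances, rds:DescribeDBInstances, rds:ListTagsForResource, s3:ListAllMyBuckets, s3:GetBucketTagging, lambda:ListFunctions, and lambda:ListTags), the class can also be reused from other tooling instead of being run standalone:

# Hypothetical reuse of the recommender against a different region
import json
from find_untagged import AWSTagRecommender  # module name assumes the script above is saved as find_untagged.py

recommender = AWSTagRecommender(region='eu-west-1')
report = recommender.generate_report()

print(f"{len(report['untagged_resources'])} untagged resources found in {report['region']}")
# Preview the first few findings without writing the full JSON report to disk
print(json.dumps(report['untagged_resources'][:3], indent=2, default=str))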

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn

Service Mesh Evolution in Kubernetes: 2023 State of the Art

Introduction


2023 marked significant advancements in service mesh technologies, with Istio, Linkerd, and Cilium emerging as leading solutions. Let’s explore the latest developments and best practices in Kubernetes service mesh implementations.

Istio’s Ambient Mesh

Overview

In 2023, Istio’s Ambient Mesh matured rapidly toward production readiness, offering a sidecar-less architecture in which per-node ztunnel proxies handle L4 traffic and optional waypoint proxies handle L7 processing:

# Example of an Ambient Mesh installation (minimal IstioOperator)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ambient-install
spec:
  profile: ambient      # installs istiod, istio-cni, and the ztunnel node proxies
  components:
    ztunnel:
      enabled: true     # explicit here, though the ambient profile enables it by default
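
With the control plane installed, workloads are opted into ambient mode per namespace rather than per pod. A minimal sketch, assuming a namespace named demo; the label is what enrolls the namespace, and its pods are captured by ztunnel without sidecar injection or restarts:

# Enroll a namespace into ambient mode (namespace name is illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  labels:
    istio.io/dataplane-mode: ambient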

Key Features

  1. Reduced resource overhead
  2. Simplified operations
  3. Better security model
  4. Improved performance

Cilium Service Mesh

Native eBPF Integration

# Cilium Service Mesh: L7-aware network policy (L3/L4 in eBPF, HTTP rules via the Cilium-managed proxy)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: myapp-l7-policy
spec:
  endpointSelector:
    matchLabels:
      app: myapp
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/.*"

Advancements

  1. Enhanced observability
  2. Native multi-cluster support
  3. Improved security features
  4. Lower latency

Linkerd’s 2023 Updates

Simplified Multi-Cluster

apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: my-service-split
spec:
  service: my-service
  backends:
  - service: my-service-v1
    weight: 90
  - service: my-service-v2
    weight: 10
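
The TrafficSplit above handles weighted routing; for the multi-cluster side, Linkerd's multicluster extension mirrors services across linked clusters only when they are explicitly exported. A minimal sketch, assuming the multicluster extension is installed and a cluster link already exists (service name, namespace, and ports are illustrative):

# Export a service for cross-cluster mirroring (requires an existing cluster link)
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: prod
  labels:
    mirror.linkerd.io/exported: "true"
spec:
  selector:
    app: my-service
  ports:
  - port: 80
    targetPort: 8080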

Performance Comparisons

Latency Metrics

# Example metrics collection (illustrative values, not benchmark results)
def collect_mesh_metrics():
    metrics = {
        'istio': {
            'p99_latency': '2.3ms',
            'memory_overhead': '50MB'
        },
        'linkerd': {
            'p99_latency': '1.8ms',
            'memory_overhead': '30MB'
        },
        'cilium': {
            'p99_latency': '1.5ms',
            'memory_overhead': '25MB'
        }
    }
    return metrics

Security Enhancements

Zero Trust Architecture

# Istio Authorization Policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: httpbin
  namespace: default
spec:
  selector:
    matchLabels:
      app: httpbin
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/default/sa/sleep"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/info*"]
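
Authorization policies are usually paired with strict mutual TLS so that the source principals they match are cryptographically verified. A minimal sketch using Istio's PeerAuthentication, scoped here to the same default namespace as the policy above:

# Require mTLS for all workloads in the namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT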

Observability Improvements

OpenTelemetry Integration

# OpenTelemetry Collector configuration
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
      logging:
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging]
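
The collector only receives data; the mesh still has to be pointed at it. In Istio, for example, the collector can be registered as an extension provider and enabled through the Telemetry API. A sketch, assuming the collector Service is reachable at otel-collector.observability.svc.cluster.local (names are illustrative):

# Register the collector as a tracing provider (meshConfig snippet from an IstioOperator)
meshConfig:
  extensionProviders:
  - name: otel
    opentelemetry:
      service: otel-collector.observability.svc.cluster.local
      port: 4317
---
# Enable tracing mesh-wide via the Telemetry API
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: otel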

Multi-Cluster Management

Federation Support

# Multi-cluster service
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cross-cluster-service
spec:
  hosts:
  - my-service.prod.svc.cluster.global
  location: MESH_INTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
  endpoints:
  - address: prod-cluster.example.com

Best Practices for 2023

  1. Resource Management
apiVersion: v1
kind: Pod
metadata:
  name: meshed-pod
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  2. Traffic Management
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-route
spec:
  hosts:
  - my-service
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: my-service-v2
  - route:
    - destination:
        host: my-service-v1

Implementation Guidelines

  1. Initial Setup
    • Start with pilot deployments
    • Gradual rollout
    • Monitor performance metrics
    • Plan for scale
  2. Migration Strategy
    • Service-by-service approach
    • Comprehensive testing
    • Rollback procedures
    • Team training

Future Trends

  1. WebAssembly Integration (see the WasmPlugin sketch after this list)
    • Custom extensions
    • Dynamic policy enforcement
    • Enhanced security features
  2. AI/ML Integration
    • Automated traffic routing
    • Anomaly detection
    • Performance optimization
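
Wasm extensibility is already usable in some meshes today; Istio, for example, exposes it through the WasmPlugin resource. A sketch, assuming a filter image published to an OCI registry (URL, selector, and pluginConfig are illustrative):

# Load a Wasm filter into the proxies of matching workloads
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: request-filter
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  url: oci://registry.example.com/filters/request-filter:v1
  phase: AUTHN
  pluginConfig:
    mode: audit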

Conclusion

2023’s service mesh landscape showed:

  1. Increased focus on performance
  2. Enhanced security features
  3. Better multi-cluster support
  4. Improved observability
  5. Simplified operations

Organizations should:

  • Evaluate mesh options based on requirements
  • Plan for scalability
  • Implement security best practices
  • Monitor performance metrics
  • Train teams effectively

The service mesh ecosystem is constantly changing, and every solution has unique advantages for specific use cases and requirements.

Hope you enjoyed the post.

Cheers

Ramasankar Molleti

LinkedIn