· Debugging and Troubleshooting:
· Investigate and resolve complex software issues within OpenStack environments (particularly those running on Ubuntu), including networking, compute, and storage.
· Diagnose and troubleshoot problems related to Kubernetes container orchestration, including pod failures, service outages, and networking issues.
· Debug and analyze issues with Docker containers and their interaction with the underlying system.
· Analyze and resolve issues related to Ceph distributed storage, including data replication, performance tuning, and storage availability.
· Troubleshoot L2/L3 networking issues affecting Octavia load balancers to ensure reliable load balancing for cloud-native applications.
· Incident and Problem Management:
· Lead incident resolution efforts for platform outages or performance degradation, coordinating across different teams to ensure swift recovery.
· Perform root cause analysis (RCA) and provide long-term fixes for recurring or critical issues.
· Write incident postmortems that capture lessons learned, to prevent recurrence and improve processes.
· Performance Optimization:
· Analyze performance bottlenecks across the cloud stack, including OpenStack components, Kubernetes, and Ceph, and implement optimizations to improve reliability and efficiency.
· Optimize networking setups, including Octavia load balancers, to enhance cloud service delivery.
· Monitor and improve containerized application performance and scaling across Docker and Kubernetes clusters.
· Cloud Platform Maintenance:
· Assist in upgrading and maintaining cloud infrastructure, ensuring that all components (Ubuntu, OpenStack, Kubernetes, Ceph, etc.) are kept secure and up to date.
· Participate in the deployment of software updates, security patches, and configuration changes in a controlled manner with minimal downtime.
· Automation and Tooling:
· Build and maintain automation scripts for monitoring, troubleshooting, and resolving cloud platform issues, focusing on OpenStack, Kubernetes, Ceph, and Docker environments.
· Implement and optimize Infrastructure as Code (IaC) solutions to improve the deployment and configuration of cloud resources.
· Requirements:
· 3+ years of experience with cloud platforms, specifically Ubuntu, OpenStack, Kubernetes, Ceph, Octavia load balancers, and Docker.
· Strong debugging skills and familiarity with cloud and software debugging tools.
· Experience with networking, compute, and storage components in OpenStack.
· Hands-on experience with containerization (Docker) and orchestration (Kubernetes).
· Familiarity with Ceph distributed storage solutions and troubleshooting storage issues.
· Experience with monitoring and logging tools, such as Prometheus, Grafana, and Elasticsearch.
· Solid understanding of networking principles, including L2/L3 networking, load balancing (Octavia), and software-defined networking (SDN).
· Proficiency in scripting languages such as Python or Bash (or equivalent) for automation.
· Strong communication skills and the ability to work in a collaborative environment.