Meridian Cooperative is looking for a Director of Site Reliability Engineering to join a team of passionate innovators and problem-solvers, empowered to rise above challenges and swarm around solutions. Here, at our Dunwoody office, we are energized by the fact that our work is important. We are driven to make work as easy as possible for our Members, Customers, Partners, and Employees. Help us lead the way in utility software: join a winning company and thrive. The role is hybrid and will be performed out of Dunwoody, GA; in-office presence is required.
Job Summary:
We are seeking an experienced Director of Site Reliability Engineering (SRE) to join our team. As Director of SRE, you will manage the SRE team and ensure the reliability, performance, and scalability of our systems and services. The SRE team collaborates closely with the DevOps and Development teams to improve system stability, automate repetitive tasks, and proactively address issues before they affect customers.
Essential Functions:
- Direct SRE department operations, staffing, management development, and training to promote interdepartmental collaboration, engagement, and achievement of annual business objectives.
- Oversee SRE team members responsible for ensuring the reliability, performance, and scalability of our systems and services.
- Develop comprehensive goals and expectations for maintaining high standards of performance across the SRE team. Make effective and efficient use of resources and set high, achievable aspirations for SRE personnel to align with the organization’s goals and objectives.
- Lead SREs who specialize in systems (operating systems, storage subsystems, networking) and have varied interests in algorithms and distributed systems, and implement best practices for availability, reliability, and scalability.
- Develop and implement monitoring, alerting, and incident management processes to ensure system health and performance.
- Collaborate with development and operations teams to improve the overall availability and performance of our services.
- Identify and resolve performance bottlenecks, system failures, and issues related to scalability.
- Automate repetitive tasks and processes, such as deployments, scaling, and monitoring, to improve efficiency and reduce human error.
- Develop and maintain tools for continuous integration, automated testing, and continuous deployment.
- Conduct root cause analysis of incidents and implement long-term solutions to prevent recurrence.
- Maintain and enhance disaster recovery, backup, and failover strategies to ensure high availability and data integrity.
- Stay up to date with the latest technologies and trends in DevOps, cloud computing, and system reliability.
- Take the initiative in thought leadership, innovation, and creativity.
- Represent the company at conferences and networking events.
- Adhere to all Meridian corporate policies and procedures.
- Travel as required.
- Any additional responsibilities assigned by management.
Requirements:
- Bachelor's degree.
- Seven years of applicable experience and a minimum of three years in a leadership role.
- AWS Certified Cloud Practitioner certification.
- AWS Certified Solutions Architect certification.
Skills:
- Strong knowledge of cloud platforms (e.g., AWS, Google Cloud, Azure) and experience with cloud infrastructure management.
- Proficiency in programming or scripting languages (e.g., Python, Go, Bash) for automation and system management.
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
- Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Experience with CI/CD pipelines and automation tools (e.g., Jenkins, GitLab, CircleCI).
- Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (e.g., Apache Mesos, Kubernetes, YARN).
- Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
- Experience with Infrastructure-as-Code tools such as Terraform and AWS CloudFormation templates.
- Strong problem-solving skills and the ability to troubleshoot complex systems in real time.
- Excellent communication and collaboration skills, with the ability to work cross-functionally with development, operations, and security teams.