SDI (Site Reliability Engineer)


Brief Summary

Purpose of Job

The primary purpose of site reliability engineering at company is to improve and sustain the reliability of company’s most critical IT systems. The role is essential in helping establish and measure service level objectives for critical systems. In addition, SREs will continuously identify engineering and automation opportunities to effectively manage production systems at scale. An SRE will model a blameless culture through effective post mortems and a focus on minimizing the impact of outages.
Site reliability engineers at company hold the job title of Software Developer and Integrator (SDI). SDIs are engaged in all phases of the software development lifecycle, including gathering and analyzing user/business system requirements, responding to outages and creating application system models. SDIs' primary functions are to design, develop, document, test and debug new and existing software systems and/or applications for internal use, and to perform defect corrections (analysis, design, code).

In addition, SDIs participate in design meetings and consult with business clients to refine, test, and debug programs to meet business needs. They also interact with, and sometimes direct, third-party partners in the achievement of business and technology initiatives.

This is a solid, career-level role in which functional and technical proficiency has been obtained. Incumbents display a depth of technical understanding within their respective areas of specialization that allows them to operate independently. Incumbents also display a proficiency that allows them to begin to mentor others (third-party and internal resources) on procedural matters.

Job Requirements
•	Work with application and system SMEs to design highly scalable and resilient distributed systems
•	Create Service Level Objectives to measure and manage core infrastructure and critical services
•	Analyze, troubleshoot and fix core infrastructure or critical systems when they fail or degrade
•	Write custom code or scripts to automate repetitive or manual system support tasks
•	Lead technical post mortems to identify lessons learned and implement improvements
•	Partner with technical teams and product owners to ensure resiliency work is developer-ready
•	Design and execute failure injection tests to verify adequate system capacity and resiliency
•	Champion Site Reliability Engineering practices across the IT organization
•	Independently install, customize and integrate commercial software packages
•	Facilitate root cause analysis of system issues
•	Work with experienced team members to conduct root cause analysis of issues, review new and existing code and/or perform unit testing
•	Learn to create system documentation and playbooks, and attend requirements, design and code reviews
•	Receive work packages from manager and/or delegates
•	Identify ideas to improve system performance and availability
•	Resolve complex technical design issues
•	Create system documentation and playbooks, and participate as a reviewer and contributor in requirements, design and code reviews
•	May serve as the subject matter expert on development techniques
•	Partner with experienced team members to develop accurate work estimates on work packages
•	May serve as a mentor on procedural matters to less experienced internal and third-party team members
•	May assist experienced team members with the delegation of work packages

Minimum Experience
•	Bachelor’s degree; 4 additional years of related experience beyond the minimum required may be substituted in lieu of a degree
•	4+ years of software development experience demonstrating depth of technical understanding within one or more specific IT disciplines or technologies, including relevant business support and/or general information technology support experience
•	Working knowledge of systems administration and/or systems programming
•	Strong interest in monitoring, optimizing, scaling and troubleshooting large distributed systems

Preferred Experience
•	4+ years of experience managing large-scale production environments (1000+ servers), including production support of applications in those environments
•	Experience in one or more of the following: C, C++, Java or Python
•	Demonstrated experience building SLO-based monitoring solutions, including working knowledge of technologies such as Grafana, Kibana, Splunk, and Prometheus
•	Expertise in DevOps-related practices such as CI/CD, canary and blue-green deployments, software-defined infrastructure and related tools
•	Strong experience or working knowledge of end-to-end IT systems (compute, storage, network, security, application runtime, relational databases, REST services, asynchronous messaging, etc.)