Salesforce is seeking an engineering candidate to join the Site Reliability Engineering organization in our San Francisco or Herndon, VA location. Working closely with counterparts in the Infrastructure and R&D organizations, this organization provides a global team of engineers who monitor cloud service availability and are ready to swiftly repair any service-impacting issues. Seven days a week, 24 hours a day, in a follow-the-sun model, the Site Reliability team keeps the Salesforce cloud and our customers protected. As a member of the Site Reliability team, your primary task will be detecting and resolving incidents within minutes, as well as deep-dive troubleshooting of infrastructure and product. This objective is met by monitoring the services, reacting to problems, and proactively addressing issues before they affect performance or availability.
When not fighting fires, the team is responsible for fire prevention through monitoring, automation, self-healing, resiliency initiatives, destructive testing, and game day exercises. The incumbent in this role would demonstrate a strong focus on tactical operations, as well as large-scale production engineering and orchestration.
- Keep the customer-facing services available at top performance by maintaining the constant health of the supporting systems.
- Incident Management - Act in key support roles during major incidents (e.g., Sev0, Sev1) and participate in the technical review of each incident for problem management
- Ability to operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities
- Problem Management - Populate and participate in RCAs with our Global Solutions team to identify and remediate gaps
- Ensuring that work carried out by the Site Reliability team is executed in such a way as to comply with the company’s internal compliance policy and directives
- Identifying work opportunities and preparing or assisting with the preparation of technical proposals as required, working with service owners to improve service resiliency and architecture
- Being available to discuss and resolve technical issues and escalations with other technical staff as required
- Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development and growth
- Work to automate detection and resolution of recurring issues in the production environment
- Participating in the on-call rotation; some weekend work may be required for critical situations requiring expertise
- Minimum of 5-7 years of experience in operations and development, including technical deep dives and troubleshooting of hardware, infrastructure, and client services for software as a service
- 3-5 years of development experience using Python to automate manual day-to-day operations and problem resolution
- Proficient in monitoring implementations, administration and workflow, with the “Golden Signals” of metrics and alerting for service availability
- Past experience in Incident Management, being a leader during critical incidents and leading the bridge to resolution.
- Systems engineering experience in an enterprise-scale internet service engineering or support role
- Expertise in TCP/IP related technologies (networking protocols, network programming, etc.)
- Expertise in CLI enterprise support of Unix variants (Linux/Solaris/BSD) as well as strong Linux/UNIX knowledge with significant exposure to Red Hat Enterprise Linux and CentOS
- Past experience with public cloud platforms, such as AWS, GCP, and Azure, with experience using Kubernetes, Spinnaker, Docker, and orchestration management
- Strong communication skills (Written and Oral)
- Experience in working in a 24/7 team managing large data centers
- Experience with ITIL's Incident and Change Management practices.
- Perl/Go/Bash scripting experience
- Prior Chef/Puppet or automated deployment experience
- Experience working with and engaging in Service Ownership programs regarding SLO/SLI/SLA metrics
- Experience supporting and troubleshooting relational databases and distributed platforms
- Experience in supporting and maintaining Java applications
- Experience managing monitoring systems and alerting with tools such as Grafana, Splunk, and Nagios
- Experience with JVM optimization and Java server technologies like Tomcat or Jetty
- MS in Computer Science or related field, or
- BS in Computer Science plus relevant job-related experience
Accessibility - If you require accessibility assistance applying for open positions, please contact the Salesforce.com Recruiting Department.
Salesforce.com and Salesforce.org are Equal Employment Opportunity and Affirmative Action Employers. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status. Headhunters and recruitment agencies may not submit resumes/CVs through this Web site or directly to managers. Salesforce.com and Salesforce.org do not accept unsolicited headhunter and agency resumes. Salesforce.com and Salesforce.org will not pay fees to any third-party agency or company that does not have a signed agreement with Salesforce.com or Salesforce.org.
Pursuant to the San Francisco Fair Chance Ordinance and the Los Angeles Fair Chance Initiative for Hiring, Salesforce will consider for employment qualified applicants with arrest and conviction records.