Ensure uptime SLAs by maintaining the stability of cloud platforms and hardware
Escalate critical issues to relevant stakeholders per the escalation matrix, and suggest short-term patches and permanent solutions.
Assist development teams in deploying applications to Development, Staging and Production environments.
Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
Partner with development teams to improve services
Participate in system design consulting, platform management, and capacity planning
Assist clients as necessary during migration of existing services
Create sustainable systems and services through automation and uplifts
Balance feature development speed and reliability with well-defined service-level objectives
Ensure regular backups and execute recovery plans as needed
Verify that firmware and patch library versions have been deemed safe after VAPT
Upgrade OS platforms.
Support hardening of OS platforms together with security teams.
Plan and execute rollouts and rollbacks of new applications and systems.
Ensure physical deployment of servers / assets in the data center facility
Be on call for emergencies raised by the NOC.
Work at office
Bachelor’s / Master’s degree from any government-recognized or reputed educational institution.
Educational requirements may be relaxed for exceptional candidates who meet the above requirements.
Red Hat certification (RHCE) preferred; additional certification in storage will be considered an added advantage.
Certification in virtualization will be considered a bonus.
Additional certification in open-source platforms such as Terraform, Kubernetes and others will be considered an added advantage.
Additional certification in cybersecurity such as CEH, CHFI, GIAC or similar will be an added advantage.
At most 5 year(s)
Age at most 45 years
Both males and females are allowed to apply
At least 5 years of proven work experience as a Site Reliability Engineer or in a similar role
Collaborate and communicate asynchronously
Document all the things so you don’t need to learn the same thing twice
Have an enthusiastic, go-for-it attitude
Relevant training and/or certifications as a Site Reliability Engineer
Using logs for monitoring and troubleshooting
Firewall (WAF/Network device/OS) configuration
Experience with monitoring platforms (Zabbix/LibreNMS/Grafana/Prometheus)
Experience with automation tools (Chef/Ansible/Terraform)
Ability to program (structured and OOP) using one or more high-level languages (Python)
Ability to write simple scripts to automate routine tasks (Bash/Lua)
Experience with distributed storage technologies such as NFS, Ceph
IP planning, DHCP and VXLAN.
Error log analysis and resolution.
Server and platform technology resiliency planning
Capturing traces with Wireshark / Arkime and other applications to identify root causes
Proactive approach to identifying problems, performance bottlenecks, and areas for improvement
Jira / Trello for project management and constant feedback
Working knowledge of Confluence
Sound knowledge of Git
Open source platforms
Kubernetes and Docker
GENNEXT TECHNOLOGIES LTD.
Address: Lane#5, House#348, DOHS Baridhara, Dhaka-1206
Application Deadline: 31 Jul 2023