This job is expired
GoldenRule

Site Reliability Engineer

  • R Undisclosed
  • Permanent Intermediate position
  • Johannesburg
  • Posted 27 Feb 2026 by GoldenRule
  • Expires in 34 days
  • Job 2634258 - Ref GDR03758

About the position

Job Purpose


The Site Reliability Engineer is responsible for ensuring the high availability, reliability, and performance of an AWS-centric microservices platform supporting analytics and market-data products delivered to global brokers. This role is deeply technical, requiring strong AWS expertise and Python proficiency to automate operations, debug production services, optimize performance, and support continuous delivery in a 24x7 financial services environment where uptime is mission-critical. Reporting to the IT Operations Manager, this position demands independent technical decision-making and the ability to exercise sound judgment when responding to critical incidents. The SRE operates with significant autonomy in assessing system performance, diagnosing complex issues, and making critical determinations that impact service availability across a high-traffic, globally distributed infrastructure.


 


Responsibilities


AWS Infrastructure Monitoring and Incident Response



  • Monitor and manage AWS services supporting production workloads (ECS/EKS, EC2, Lambda, API Gateway, SQS/SNS, RDS, ElastiCache, CloudFront)

  • Respond to alerts from CloudWatch, Datadog, and custom monitoring scripts with urgency and precision

  • Exercise independent judgment in assessing incident severity and determining appropriate response strategies

  • Diagnose scaling, networking, and performance issues in distributed AWS systems

  • Perform incident response, ensuring rapid recovery and minimal downtime

  • Coordinate with development teams during critical incidents and outages, serving as the technical authority for infrastructure decisions


Python-Driven Troubleshooting and Automation



  • Write Python scripts and tools to automate operational tasks, system checks, and data validation routines

  • Analyze Python microservice behavior by reading logs, debugging issues, and profiling performance

  • Build or enhance internal CLI tools to improve support workflows

  • Use Python to interrogate APIs, AWS resources (via boto3), and production data sources

  • Independently assess automation opportunities and implement solutions to reduce manual workload


Production Systems and Data Flow Stability



  • Maintain stability across charting engines, data ingestion pipelines, market-data feeds, scanning engines, and sentiment analysis services

  • Investigate failures in REST APIs, WebSocket streams, and asynchronous workers

  • Validate deployments and configurations for AWS-based microservices

  • Ensure data completeness and accuracy across instruments, markets, and broker-specific configurations

  • Make real-time decisions on system changes, maintenance windows, and emergency response procedures


Collaboration and Continuous Operations Improvement



  • Work with DevOps engineers to refine CI/CD pipelines, infrastructure-as-code workflows, and AWS deployment patterns

  • Collaborate with backend teams to improve microservice reliability and observability

  • Provide feedback on Python code, error-handling logic, and operational robustness

  • Contribute to post-incident root cause analyses and propose architectural or automation improvements

  • Participate in an on-call rotation to provide round-the-clock infrastructure support


Documentation and Runbook Management


  • Maintain detailed operational documentation, AWS service runbooks, and troubleshooting guides

  • Build automated checks and self-healing routines where feasible

  • Drive SRE best practices across the team

  • Document configurations, standards, and operational procedures that align with industry best practices

Experience Requirements



  • 2+ years in production support, SRE, or DevOps with a strong AWS and Python footprint

  • Demonstrated ability to exercise independent judgment in high-pressure situations and make critical decisions affecting system availability

  • Strong Python scripting and debugging skills (must be able to analyse stack traces, write scripts, automate workflows)

  • Strong analytical mindset and exceptional problem-solving ability

  • Calm, structured communication during incidents

  • Ability to work cross-functionally with DevOps, developers, QA, and product staff

  • Keen attention to detail and strong ownership of production systems

  • Comfortable working in a high-availability, high-traffic environment

  • Off-hours support and coverage as part of on-call rotation


 


Technical Expertise


AWS Services:



  • ECS or EKS (service deployments, scaling behaviour, debugging containers)

  • EC2, Lambda, API Gateway

  • SQS/SNS messaging patterns

  • RDS (PostgreSQL/MySQL), DynamoDB

  • S3 and CloudFront

  • IAM, KMS, networking (VPC, subnets, security groups)


Monitoring and Observability:



  • CloudWatch, Datadog, Grafana, OpenSearch/Kibana


Infrastructure and DevOps:



  • Docker containerization

  • Infrastructure-as-code (CloudFormation, Terraform)

  • CI/CD pipelines (CodePipeline, GitHub Actions, GitLab CI, or similar)


Development and Data:



  • REST APIs, WebSocket protocols, asynchronous workers, distributed system behaviour

  • SQL proficiency and performance investigation for relational databases

  • MongoDB with JavaScript proficiency


 


Preferred Qualifications



  • Bachelor's degree in Computer Science, Information Systems, Engineering, or equivalent practical experience

  • Experience in fintech, trading systems, or market-data streaming

  • Python experience with data processing, concurrency (asyncio), or task queues (Celery/RQ)

  • Exposure to Kinesis, Kafka, or other event-streaming platforms

  • Familiarity with FastAPI-based microservices

  • Experience with cost optimisation and AWS Well-Architected practices

  • Understanding of foreign exchange markets and trading platform requirements


 


The ideal candidate will demonstrate:



  • Operational Excellence: Reduced production incidents and improved uptime through proactive monitoring and rapid incident response

  • Automation Focus: Faster MTTD and MTTR through automation and AWS-driven improvements

  • Technical Impact: Operational tooling and Python automation that significantly reduces manual workload

  • Collaboration: Positive feedback from internal teams and external broker partners

  • Ownership: Strong sense of accountability for production system health and reliability

Desired Skills:

  • Systems Analysis
  • Complex Problem Solving
  • Programming/configuration
  • Critical Thinking
  • Time Management

About the agency

GoldenRule is positioned as a Strategic Resourcing Partner. We are an IT services company with a total solution offering around staffing issues. We currently offer the following services:

  • Contracting of GoldenRule full-time staff to our clients for in-house or external projects
  • Permanent placements
  • Temporary placements to permanent placements
  • Allowing our full-time employees to accept job offers from our clients to add further value
  • All candidates are technically evaluated for each position (GoldenRule makes use of an in-house evaluation methodology and Teckcheck)
  • Hosting learnerships. We hold the risk in developing junior resources. Ideal for AA candidates.
  • We are able to provide local, AA, and international skills
  • With international resources, meticulous attention is given to work permit issues and the relocation of our employees

GoldenRule has an attentive program for relocation and integration of our employees into South Africa. This includes accommodation in furnished guesthouses, transport to and from work, and assistance with various forms of finance and obtaining banking products.
