About the position
Key Responsibilities
Reliability & Operations
- Own uptime, performance, and monitoring for all production applications.
- Manage Heroku pipelines, CI/CD, review apps, and production environments.
- Operate Celery workers and queues, monitor health, and handle missed task check-ins.
- Define and track service level objectives (SLOs) for availability, latency, and task success rate.
- Maintain runbooks and a centralised wiki for incident response, and lead post-mortems.
- Run periodic disaster recovery drills and coordinate penetration tests.
Platform Engineering
- Keep environments current (Heroku stacks, Postgres/Redis versions, DO/AWS base images).
- Manage daily backups, and ensure restore tests and disaster recovery runbooks are in place.
- Standardise infrastructure (Terraform or scripts for DO/AWS; [URL Removed] for Heroku).
- Manage Cloudflare for DNS, edge security, and performance optimisation.
- Tune performance (DB indices, query optimisation, cache usage, Celery queue design).
- Optimise infrastructure costs across Heroku, DigitalOcean, and AWS.
Developer Experience & CI/CD
- Maintain CI pipelines with type checking, linting, and security scanning.
- Enforce test coverage and automate deploy checks (smoke tests, migration health, error budgets).
- Support developers with tooling for local/staging environments and build self-service dashboards (e.g., Celery queue status).
- Collaborate with developers to streamline workflows and educate on secure coding practices.
Security & Compliance
- Own vulnerability management and dependency patching cadence.
- Manage access reviews, secrets, MFA/SSO, and enforce least-privilege IAM policies.
- Implement encryption for data at rest and in transit (e.g., S3 server-side encryption).
- Contribute evidence and responses for security questionnaires and SOC 2 audits.
- Maintain a "security pack" with architecture, sub-processors, and DR/backup processes.
Monitoring & Alerting
- Configure Sentry ownership rules, Cron Monitors, and release health.
- Centralise metrics/logs (Heroku metrics, Papertrail, Sentry, APM, Prometheus/New Relic).
- Set up alerts on golden signals (latency, errors, traffic, saturation) and avoid alert fatigue.
- Conduct capacity planning and track resource usage trends.
Vendor & External Services
- Evaluate and manage vendor relationships (e.g., Mailgun, Twilio) to ensure service level agreements (SLAs) and performance targets are met.
- Assess new tools/services to enhance platform capabilities (e.g., observability, security).
- Track costs, security posture, and integration quality for all third-party services.
Must-Have
- Cloud infrastructure management: 3+ years operating production apps on Heroku, AWS, DigitalOcean, or similar.
- CI/CD pipelines: Hands-on experience with GitHub Actions, Heroku CI, or equivalent; solid Git fundamentals.
- Monitoring & incident response: Experience with Sentry, Papertrail (or similar), logs, and uptime/performance dashboards.
- Security fundamentals: Understanding of IAM, encryption in transit/at rest, MFA/SSO, and secure configuration practices.
- Disaster recovery & backups: Experience implementing and operating automated backups, restore testing, and writing/maintaining incident runbooks.
- Communication & collaboration: Ability to document processes clearly and work closely with developers in a small team.
Strong Plus
- Infrastructure as Code & automation: Experience with Terraform, Docker, or equivalent tooling.
- Asynchronous workloads: Familiarity with Celery, Redis, or other task queues and message brokers.
- Scaling & cost optimisation: Capacity planning, performance tuning, and managing infra spend.
- Compliance frameworks: Exposure to SOC 2, GDPR, or supporting client security questionnaires.
- Incident management: Participation in on-call rotations, leading post-mortems, or serving as incident commander.
Nice-to-Have
- Proficiency in Python; familiarity with Django/Flask.
- Experience with DNS/CDN/edge security (e.g., Cloudflare).
- Observability platforms (Prometheus, Grafana, New Relic).
- Static analysis and code quality tools (mypy, Bandit, SonarQube).
- Prior exposure to multi-tenant SaaS environments.
- Certifications (AWS Certified DevOps Engineer
Desired Skills:
- Heroku
- AWS
- DigitalOcean
- GitHub
- Sentry
- Papertrail
- encryption in transit
- MFA/SSO
- maintaining incident runbooks
- Terraform
- Docker
- Celery
- SOC 2
- GDPR
- Python
- Django/Flask
- SaaS
Desired Work Experience: