API Production Support Engineer

Citi

Job Details

Location

Chennai, Tamil Nadu, India

Experience

Salary

10 LPA

Last Date

31/05/2026

Job Description

At Citi, we build and support highly reliable APIs that power critical financial services and customer transactions. We are looking for an experienced API Support Engineer to ensure operational excellence, platform stability, and continuous improvement across our API ecosystem. In this role, you will monitor and troubleshoot APIs, support production environments, resolve incidents, analyze logs, and work closely with cross-functional teams to deliver reliable and high-performing solutions. You will contribute to improving system availability, enhancing customer experience, and driving proactive operational strategies through automation and predictive monitoring. The ideal candidate should have strong hands-on experience in API support, production troubleshooting, monitoring tools, log analysis, REST APIs, incident management, and debugging complex technical issues in enterprise environments.

Key Responsibilities

Drive stability and reliability initiatives to ensure high availability, resilience, and optimal performance of API applications through proactive monitoring, failover improvements, and system health management.
Handle critical production incidents with strong analytical and troubleshooting skills, ensuring effective incident, problem, and change management in enterprise environments.
Perform end-to-end monitoring and management of production API platforms, maintaining a holistic view of application health, availability, and performance.
Define, analyze, and report SLIs/SLOs for APIs and client integrations to maintain measurable service quality and performance standards.
Develop and enhance operational tools, automation frameworks, and support processes to improve API management and customer experience.
Analyze and optimize API and infrastructure performance, focusing on scalability, reliability, and continuous operational improvements.
Provide hands-on support for large-scale distributed API ecosystems, ensuring timely issue resolution and platform stability.
Monitor API platforms and infrastructure metrics to support performance tuning, capacity planning, and root cause analysis.
Collaborate with development teams to improve service reliability through operational feedback, testing, and release management practices.
Build and maintain automation solutions for operational tasks, monitoring, and production support activities.
Conduct post-incident reviews, identify recurring issues, and implement proactive monitoring and automation strategies to prevent future incidents.
Take ownership of high-priority production support activities, ensuring effective communication, rapid troubleshooting, and issue resolution within SLA timelines.

Required Skills

Java and J2EE, AWS, Oracle DB, Splunk

Eligibility Criteria

Experience supporting Java and J2EE-based applications

Interview Preparation Guide

1. Technical Foundations
OSI model, HTTP/HTTPS, REST, SOAP, gRPC protocols
TCP/IP, DNS, SSL/TLS handshake, load balancing concepts
Microservices architecture vs monolithic
Containerization basics: Docker, Kubernetes
Linux command line proficiency for production troubleshooting

2. API Gateway & CDN Expertise (APIGEE & Akamai)
APIGEE architecture: proxies, targets, policies, environments
API lifecycle management: versioning, deprecation, publishing
OAuth 2.0, API key management, JWT token validation in APIGEE
Traffic management policies: quota, spike arrest, rate limiting
Akamai CDN: edge caching, cache invalidation, origin shield
Akamai Fast Purge, Property Manager, Edge Side Includes (ESI)
WAF rules, DDoS protection via Akamai
Troubleshooting latency between CDN edge and origin API
APIGEE analytics, trace tool for debugging API calls

3. SRE Principles: SLIs, SLOs & Error Budgets
Difference between SLI, SLO, SLA and how to define each
Common SLIs: availability, latency, throughput, error rate
How to set realistic SLO targets for APIs
Error budget calculation and burn rate alerting
Toil reduction and automation mindset
Balancing reliability vs feature velocity using error budgets
Google SRE book concepts: eliminating toil, embracing risk
How to handle SLO breaches and escalation paths

4. Monitoring & Observability Stack (AppDynamics, Splunk, Kibana)
Three pillars of observability: metrics, logs, traces
AppDynamics: business transactions, baselines, health rules, alerts
AppDynamics: flow maps, tier and node-level diagnostics
Splunk: SPL queries, dashboards, alert creation, index management
Splunk: log correlation across distributed services
Kibana: index patterns, KQL queries, Lens visualizations
ELK stack architecture: Elasticsearch, Logstash, Kibana
Creating actionable alerts vs noise: tuning thresholds
Distributed tracing: trace IDs, span correlation across services

5. Cloud & Infrastructure (AWS, ECS, Oracle DB, MongoDB)
AWS core services: EC2, VPC, IAM, Route 53, ELB, CloudWatch
ECS: task definitions, services, clusters, Fargate vs EC2 launch types
ECS service scaling: target tracking, step scaling policies
ECS troubleshooting: stopped tasks, container health checks, logs
Oracle DB: connection pooling, AWR reports, explain plans, slow query analysis
Oracle DB: tablespace monitoring, lock contention, RAC basics
MongoDB: replica sets, sharding, indexing strategies
MongoDB: slow query profiler, oplog monitoring, connection pool tuning
AWS CloudWatch metrics, alarms, log insights for operational visibility

6. Java/J2EE Application Support & Troubleshooting
JVM internals: heap, GC types, thread dumps, heap dumps
GC tuning: G1GC, CMS, understanding GC logs
Thread dump analysis: deadlocks, blocked threads, high CPU diagnosis
Heap dump analysis using tools like Eclipse MAT or VisualVM
Common Java issues: memory leaks, OutOfMemoryError, StackOverflow
J2EE components: Servlets, EJB, JPA, JMS, connection pools
Application server knowledge: WebLogic, JBoss, Tomcat, WebSphere
Log analysis: reading stack traces, identifying root cause
Performance profiling: identifying hot methods, slow transactions
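
Worked example for the "Error budget calculation and burn rate alerting" topic in section 3: the sketch below shows one common way to express the arithmetic for a request-based availability SLO. It is an illustration only, not a description of how Citi measures its APIs. The function names, the 99.9% target, and the request counts are all assumptions chosen for the example.

```python
def allowed_error_rate(slo_target: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the whole window."""
    return allowed_error_rate(slo_target) * total_requests


def budget_remaining(failed: int, total_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    return 1.0 - failed / error_budget(slo_target, total_requests)


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly at the pace that
    would exhaust it at the end of the SLO window; higher values exhaust it sooner.
    """
    if total == 0:
        return 0.0  # no traffic observed, so nothing was burned
    return (failed / total) / allowed_error_rate(slo_target)


if __name__ == "__main__":
    slo = 0.999  # assumed 99.9% availability target
    # Assumed window of 1,000,000 requests with 250 failures so far.
    print("budget (failed requests):", round(error_budget(slo, 1_000_000)))
    print("budget remaining:", round(budget_remaining(250, 1_000_000, slo), 2))
    # Assumed recent sample: 50 failures out of 10,000 requests.
    print("burn rate:", round(burn_rate(50, 10_000, slo), 2))
```

In an interview, it helps to explain that a burn-rate alert compares this ratio against a threshold over one or more lookback windows, so that a short, severe outage and a slow, steady leak can both be caught before the budget runs out.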