API Production Support Engineer
Citi
Job Details
Location
Chennai, Tamil Nadu, India
Experience
2 years
Salary
10 LPA
Last Date
31/05/2026
Job Description
At Citi, we build and support highly reliable APIs that power critical financial services and customer transactions. We are looking for an experienced API Support Engineer to ensure operational excellence, platform stability, and continuous improvement across our API ecosystem.
In this role, you will monitor and troubleshoot APIs, support production environments, resolve incidents, analyze logs, and work closely with cross-functional teams to deliver reliable and high-performing solutions. You will contribute to improving system availability, enhancing customer experience, and driving proactive operational strategies through automation and predictive monitoring.
The ideal candidate should have strong hands-on experience in API support, production troubleshooting, monitoring tools, log analysis, REST APIs, incident management, and debugging complex technical issues in enterprise environments.
Key Responsibilities
Drive stability and reliability initiatives to ensure high availability, resilience, and optimal performance of API applications through proactive monitoring, failover improvements, and system health management.
Handle critical production incidents with strong analytical and troubleshooting skills, ensuring effective incident, problem, and change management in enterprise environments.
Perform end-to-end monitoring and management of production API platforms, maintaining a holistic view of application health, availability, and performance.
Define, analyze, and report SLIs/SLOs for APIs and client integrations to maintain measurable service quality and performance standards.
Develop and enhance operational tools, automation frameworks, and support processes to improve API management and customer experience.
Analyze and optimize API and infrastructure performance, focusing on scalability, reliability, and continuous operational improvements.
Provide hands-on support for large-scale distributed API ecosystems, ensuring timely issue resolution and platform stability.
Monitor API platforms and infrastructure metrics to support performance tuning, capacity planning, and root cause analysis.
Collaborate with development teams to improve service reliability through operational feedback, testing, and release management practices.
Build and maintain automation solutions for operational tasks, monitoring, and production support activities.
Conduct post-incident reviews, identify recurring issues, and implement proactive monitoring and automation strategies to prevent future incidents.
Take ownership of high-priority production support activities, ensuring effective communication, rapid troubleshooting, and issue resolution within SLA timelines.
Required Skills
Java and J2EE
AWS
Oracle DB
Splunk
Eligibility Criteria
Experience supporting Java- and J2EE-based applications
Interview Preparation Guide
1. Technical Foundations
OSI model, HTTP/HTTPS, REST, SOAP, gRPC protocols
TCP/IP, DNS, SSL/TLS handshake, load balancing concepts
Microservices architecture vs monolithic
Containerization basics — Docker, Kubernetes
Linux command line proficiency for production troubleshooting
2. API Gateway & CDN Expertise (APIGEE & Akamai)
APIGEE architecture — proxies, targets, policies, environments
API lifecycle management — versioning, deprecation, publishing
OAuth 2.0, API key management, JWT token validation in APIGEE
Traffic management policies — quota, spike arrest, rate limiting
Akamai CDN — edge caching, cache invalidation, origin shield
Akamai Fast Purge, Property Manager, Edge Side Includes (ESI)
WAF rules, DDoS protection via Akamai
Troubleshooting latency between CDN edge and origin API
APIGEE analytics, trace tool for debugging API calls
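To make the traffic management topic above concrete, here is a minimal sketch of a generic token-bucket limiter in Python. It illustrates the general idea behind rate limiting only; it is not APIGEE's internal algorithm, and the class name `TokenBucket` and its parameters are hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Generic token-bucket limiter (illustrative; not APIGEE's implementation)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = float(rate_per_sec)    # tokens added per second
        self.capacity = float(capacity)    # maximum burst size
        self.tokens = float(capacity)      # start with a full bucket
        self.clock = clock                 # injectable clock, useful for testing
        self.last = clock()

    def allow(self):
        """Return True if one request may pass now, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway would typically answer a rejected call with HTTP 429. In an interview, it helps to explain how the refill rate (sustained throughput) differs from the capacity (tolerated burst).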
3. SRE Principles — SLIs, SLOs & Error Budgets
Difference between SLI, SLO, SLA and how to define each
Common SLIs — availability, latency, throughput, error rate
How to set realistic SLO targets for APIs
Error budget calculation and burn rate alerting
Toil reduction and automation mindset
Balancing reliability vs feature velocity using error budgets
Google SRE book concepts — eliminating toil, embracing risk
How to handle SLO breaches and escalation paths
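The error budget and burn rate items above come down to simple arithmetic, sketched below in Python. The function names are hypothetical, and the sketch assumes a request-based availability SLI (good requests divided by total requests) measured over a fixed SLO window.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_requests


def burn_rate(slo_target, failed, total):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; higher values mean it will run out before the window ends.
    """
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


def days_to_exhaustion(window_days, rate):
    """Days until a full budget is gone if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_days / rate
```

For example, with a 99.9% SLO and 1,000,000 requests in the window, the budget is about 1,000 failed requests. If 50 of the last 10,000 requests failed, the burn rate is about 5, so a 30-day budget would last roughly 6 days at that pace.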
4. Monitoring & Observability Stack (AppDynamics, Splunk, Kibana)
Three pillars of observability — metrics, logs, traces
AppDynamics — business transactions, baselines, health rules, alerts
AppDynamics — flow maps, tier and node-level diagnostics
Splunk — SPL queries, dashboards, alert creation, index management
Splunk — log correlation across distributed services
Kibana — index patterns, KQL queries, Lens visualizations
ELK stack architecture — Elasticsearch, Logstash, Kibana
Creating actionable alerts vs noise — tuning thresholds
Distributed tracing — trace IDs, span correlation across services
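As a small illustration of trace ID correlation, the sketch below groups log lines from several services by a shared trace ID. The `trace_id=<value>` log format and the function name `group_by_trace` are assumptions made for this example; real services may carry the ID in a different field or header.

```python
import re
from collections import defaultdict

# Assumed log convention: each line carries a token such as trace_id=abc123.
TRACE_RE = re.compile(r"trace_id=(\w+)")


def group_by_trace(lines):
    """Map each trace ID to the log lines that mention it, in input order."""
    groups = defaultdict(list)
    for line in lines:
        match = TRACE_RE.search(line)
        if match:
            groups[match.group(1)].append(line)
    return dict(groups)
```

Log platforms such as Splunk or Kibana do this grouping with a search on the trace ID field. The point of the sketch is that one ID ties together the gateway, service, and database entries for a single request.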
5. Cloud & Infrastructure (AWS, ECS, Oracle DB, MongoDB)
AWS core services — EC2, VPC, IAM, Route 53, ELB, CloudWatch
ECS — task definitions, services, clusters, Fargate vs EC2 launch types
ECS service scaling — target tracking, step scaling policies
ECS troubleshooting — stopped tasks, container health checks, logs
Oracle DB — connection pooling, AWR reports, explain plans, slow query analysis
Oracle DB — tablespace monitoring, lock contention, RAC basics
MongoDB — replica sets, sharding, indexing strategies
MongoDB — slow query profiler, oplog monitoring, connection pool tuning
AWS CloudWatch metrics, alarms, log insights for operational visibility
6. Java/J2EE Application Support & Troubleshooting
JVM internals — heap, GC types, thread dumps, heap dumps
GC tuning — G1GC, CMS, understanding GC logs
Thread dump analysis — deadlocks, blocked threads, high CPU diagnosis
Heap dump analysis using tools like Eclipse MAT or VisualVM
Common Java issues — memory leaks, OutOfMemoryError, StackOverflowError
J2EE components — Servlets, EJB, JPA, JMS, connection pools
Application server knowledge — WebLogic, JBoss, Tomcat, WebSphere
Log analysis — reading stack traces, identifying root cause
Performance profiling — identifying hot methods, slow transactions
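For the thread dump analysis topic above, here is a minimal Python sketch that counts thread states in a jstack-style dump and lists the BLOCKED threads. It assumes the usual HotSpot text layout, where a quoted thread name line is followed by a `java.lang.Thread.State:` line. The sample dump, class names, and addresses in it are made up for illustration.

```python
import re
from collections import Counter

HEADER_RE = re.compile(r'^"(?P<name>[^"]+)"')
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (?P<state>[A-Z_]+)")

# Invented sample in the usual jstack layout (names and addresses are made up).
SAMPLE = '''"http-nio-8080-exec-1" #31 daemon prio=5 os_prio=0 tid=0x00007f1 nid=0x1a2b waiting for monitor entry [0x00007f2]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.example.OrderService.place(OrderService.java:42)

"http-nio-8080-exec-2" #32 daemon prio=5 os_prio=0 tid=0x00007f3 nid=0x1a2c runnable [0x00007f4]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
'''


def summarize_thread_dump(text):
    """Return (state counts, names of BLOCKED threads) for a thread dump."""
    counts = Counter()
    blocked = []
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group("name")   # start of a new thread entry
            continue
        state = STATE_RE.search(line)
        if state and current is not None:
            counts[state.group("state")] += 1
            if state.group("state") == "BLOCKED":
                blocked.append(current)
            current = None                    # one state line per thread
    return counts, blocked
```

Calling `summarize_thread_dump(SAMPLE)` reports one BLOCKED and one RUNNABLE thread. A growing BLOCKED count across several dumps taken a few seconds apart usually points to lock contention worth investigating.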
Interview Process
1st and 2nd round: Technical Interview
3rd round: HR round