AI/ML Solution Architect
Roles and Responsibilities
About Us:
Headquartered in Sunnyvale, with offices in Dallas & Hyderabad, Fission Labs is a leading software development company, specializing in crafting flexible, agile, and scalable solutions that propel businesses forward. With a comprehensive range of services, including product development, cloud engineering, big data analytics, QA, DevOps consulting, and AI/ML solutions, we empower clients to achieve sustainable digital transformation that aligns seamlessly with their business goals.
Key Responsibilities:
Architecture & Infrastructure
- Design, implement, and optimize end-to-end ML training workflows including infrastructure setup, orchestration, fine-tuning, deployment, and monitoring.
- Evaluate and integrate multi-cloud and single-cloud training options across AWS and other major platforms.
- Lead cluster configuration, orchestration design, environment customization, and scaling strategies.
- Compare and recommend hardware options (GPUs, TPUs, accelerators) based on performance, cost, and availability.
Performance & Optimization
- Conduct performance benchmarking, hardware comparisons, and cost-performance trade-off analysis.
- Implement real-time monitoring and control systems with metrics collection, observability, and custom performance tracking.
- Optimize cost models, budget predictability, and resource utilization.
Data & Training Pipelines
- Architect and validate data pipelines with storage, persistence, and throughput optimization.
- Oversee data quality validation, pre-processing, and long-term experiment tracking.
- Support framework flexibility for diverse training techniques (supervised, unsupervised, fine-tuning, reinforcement learning).
Integration & Deployment
- Ensure seamless deployment across multi-cloud environments with security, compliance, and regional availability considerations.
- Collaborate with DevOps and MLOps teams for automation, fault tolerance, job scheduling, and orchestration testing.
- Provide technical guidance on integration with existing enterprise systems.
Analysis & Recommendations
- Lead result analysis, insight generation, and actionable recommendations for training performance and user experience improvements.
- Present performance claims, benchmarking reports, and speculative decoding insights to stakeholders.
Qualifications Required
B.E./B.Tech/M.E./M.Tech/M.S./MCA graduate (preferably from a reputed college or university)
Skills and Experience Required
Technical Expertise
- 10+ years in architecture roles with at least 5 years in AI/ML infrastructure and large-scale training environments.
- Expert in AWS cloud services (EC2, S3, EKS, SageMaker, Batch, FSx, etc.) and familiar with Azure, GCP, and hybrid/multi-cloud setups.
- Strong knowledge of AI/ML training frameworks (PyTorch, TensorFlow, Hugging Face, DeepSpeed, Megatron, Ray, etc.).
- Proven experience with cluster orchestration tools (Kubernetes, Slurm, Ray, SageMaker, Kubeflow).
- Deep understanding of hardware architectures for AI workloads (NVIDIA, AMD, Intel Habana, TPU).
Performance & Cost Management
- Demonstrated expertise in performance benchmarking, reliability testing, and training speed optimization.
- Skilled in cost modeling, budget forecasting, and cost-performance balancing.
Monitoring & Observability
- Experience with real-time monitoring tools (Prometheus, Grafana, CloudWatch) and custom metric instrumentation.
- Familiarity with network performance testing, regional load testing, and multi-region deployment strategies.
Soft Skills
- Strong problem-solving skills with an analytical mindset.
- Excellent communication skills to present technical trade-offs and strategic recommendations to executives and engineering teams.
- Ability to lead cross-functional teams and drive innovation in AI infrastructure.
Why you'll love working with us:
We Offer:
- Opportunity to work on technical challenges with global impact.
- Vast opportunities for self-development, including online university access and sponsored certifications.
- Sponsored Tech Talks & Hackathons to foster innovation and learning.
- Generous benefits package including health insurance, retirement benefits, flexible work hours, and more.
- Supportive work environment with forums to explore passions beyond work.
This role presents an exciting opportunity for a motivated individual to contribute to the development of cutting-edge solutions while advancing their career in a dynamic and collaborative environment.