AI COST IN PRODUCTION

Have you ever pictured what AI in production really looks like? An AI system is not just ML code; it includes many other components. The decision to productionize a system, or not, depends on many factors beyond model accuracy.

The cost to build and maintain the entire system in production is going to be an important factor in deciding whether to productionize AI for an industry.

AI, understood correctly, is an additional cost meant to optimize or improve the business and generate profit, so the cost of production plus maintenance versus the resulting profit is key to determining whether AI goes to production. Let’s take a use case.

Use case: Object detection and counting system with YOLOv5

Requirement:

  • Estimate the cost of building an ML system that processes 30-second videos recorded at 120 FPS and reports the number and types of objects in each video within 20 minutes.
  • Concurrent users: 1000.
  • Maximum videos submitted for analysis at once: 100.
  • Consider the costs of hardware, storage, training the ML model, implementing real-time object detection, and the report generation system.

Step 1: Analyse requirements and break down components

Understand the high-level system flow, the components needed, and the compute (GPU/CPU) requirements for your system. Break the system into modules to be developed.

System flow:

User uploads video -> system converts video to images -> images passed to ML model -> backend system uses results to create report and store data in database.
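The flow above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `extract_frames` and `detect_objects` are hypothetical stubs standing in for a real video decoder and a YOLOv5 model, and the labels they return are made up so the example runs without a GPU.

```python
from collections import Counter
from typing import Dict, Iterable, List


def extract_frames(duration_s: int, fps: int) -> Iterable[int]:
    """Hypothetical stub: yield one frame index per decoded frame."""
    return range(duration_s * fps)


def detect_objects(frame_index: int) -> List[str]:
    """Hypothetical stub for the ML model: return the labels found in a frame."""
    # Made-up rule, used only so the example is runnable.
    return ["car"] if frame_index % 2 == 0 else ["car", "person"]


def build_report(duration_s: int, fps: int) -> Dict[str, int]:
    """Run the flow: video -> frames -> detections -> per-label counts."""
    counts: Counter = Counter()
    for frame_index in extract_frames(duration_s, fps):
        counts.update(detect_objects(frame_index))
    return dict(counts)


report = build_report(duration_s=1, fps=4)
print(report)  # {'car': 4, 'person': 2}
```

Note that this sketch counts detections per frame. A real counting system would likely need object tracking across frames so that the same object is not counted repeatedly.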

Components/modules required:

  • Component 1: UI and backend to upload the video and convert it to images.
  • Component 2: GPU to process the images and produce the analysis.
  • Component 3: UI and backend for report generation and DB storage.

Image calculations:

Number of images/frames per second = 120
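Combining the frame rate with the other requirements gives the image volume the system has to handle. A minimal sketch of that arithmetic, using only the numbers stated in the requirements:

```python
VIDEO_SECONDS = 30         # video length, from the requirements
FPS = 120                  # frame rate, from the requirements
MAX_VIDEOS_AT_ONCE = 100   # peak simultaneous submissions, from the requirements

frames_per_video = VIDEO_SECONDS * FPS
frames_per_batch = frames_per_video * MAX_VIDEOS_AT_ONCE

print(f"Frames per video: {frames_per_video}")       # 3600
print(f"Frames per peak batch: {frames_per_batch}")  # 360000
```

So a single video yields 3,600 images, and a peak submission of 100 videos yields 360,000 images to run through the model.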

Step 2: High-level diagram and cost per unit

We will use open source tools and GCP components for our production deployment. We are using GCP because it’s easier to estimate the cost with its pricing calculator.

Let’s assume we will deploy all these components on a VM in GCP. The components required for deployment would be:

  • GPU for ML model
  • VM for running UI and backend
  • Persistent disk
  • External IP
  • GCS storage
  • Load balancer for handling requests

Image Calculations:

Let’s assume we have already benchmarked our model per image. Using that benchmark, we calculate the inference time for our system.
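As a rough sketch of how a per-image benchmark turns into an inference-time estimate: the latency value below is a hypothetical placeholder, not a measured YOLOv5 figure, and the function names are illustrative. Replace the latency with your own benchmark result.

```python
import math

FRAMES_PER_BATCH = 30 * 120 * 100   # 30 s x 120 FPS x 100 videos
DEADLINE_S = 20 * 60                # report is due within 20 minutes

# Hypothetical benchmark result in milliseconds; NOT a measured value.
ASSUMED_LATENCY_MS_PER_IMAGE = 5.0


def batch_inference_seconds(frames: int, latency_ms: float, workers: int = 1) -> float:
    """Time to process all frames with `workers` model instances in parallel."""
    return frames * latency_ms / 1000.0 / workers


def workers_needed(frames: int, latency_ms: float, deadline_s: float) -> int:
    """Smallest number of parallel model instances that meets the deadline."""
    return math.ceil(frames * latency_ms / 1000.0 / deadline_s)


required_fps = FRAMES_PER_BATCH / DEADLINE_S
single_worker_s = batch_inference_seconds(FRAMES_PER_BATCH, ASSUMED_LATENCY_MS_PER_IMAGE)
workers = workers_needed(FRAMES_PER_BATCH, ASSUMED_LATENCY_MS_PER_IMAGE, DEADLINE_S)

print(f"Required throughput: {required_fps:.0f} images/s")
print(f"One worker needs {single_worker_s:.0f} s; workers needed: {workers}")
```

The required throughput of 300 images per second follows directly from the requirements. The worker count depends entirely on the assumed latency, and the estimate ignores upload, video decoding and report generation, which also consume part of the 20 minutes.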

The A100 supports GPU sharing (see the references at the end), and since our workload is heavy, we will use an A100.

GPU cost

Step 3: Final Cost Estimation for solution

Infra Requirements from above:

  • GPU for ML model: 1 A100
  • VM for running UI and backend: 1 VM
  • Persistent disk: 100 GB
  • External IP: 1
  • GCS storage: 50 GB monthly
  • Load balancer for handling requests: 1
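The infra list above can be turned into a monthly figure by multiplying each quantity by its unit price. In the sketch below the quantities come from the list, but every unit price is a made-up placeholder and not a real GCP rate; take actual prices from the GCP pricing calculator. The resource keys and the function name are illustrative.

```python
from typing import Dict

# Quantities taken from the infra requirements list.
QUANTITIES: Dict[str, float] = {
    "a100_gpu": 1,
    "vm": 1,
    "persistent_disk_gb": 100,
    "external_ip": 1,
    "gcs_storage_gb": 50,
    "load_balancer": 1,
}

# PLACEHOLDER monthly unit prices in USD. These are NOT real GCP rates.
PLACEHOLDER_UNIT_PRICES: Dict[str, float] = {
    "a100_gpu": 1000.0,
    "vm": 100.0,
    "persistent_disk_gb": 0.1,
    "external_ip": 5.0,
    "gcs_storage_gb": 0.02,
    "load_balancer": 20.0,
}


def monthly_infra_cost(quantities: Dict[str, float], unit_prices: Dict[str, float]) -> float:
    """Sum quantity x unit price, failing loudly if a price is missing."""
    missing = set(quantities) - set(unit_prices)
    if missing:
        raise KeyError(f"No unit price for: {sorted(missing)}")
    return sum(qty * unit_prices[name] for name, qty in quantities.items())


infra_cost = monthly_infra_cost(QUANTITIES, PLACEHOLDER_UNIT_PRICES)
print(f"Monthly infra cost with placeholder prices: {infra_cost:,.2f} USD")
```

Keeping quantities and prices in separate tables makes it easy to re-run the estimate when the sizing changes or prices are updated.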

Resource requirements:

Development:

  • 1 Full stack Engineer — 30 days.
  • 1 cloud/devops Engineer — 15 days

Maintenance:

Depending on the solution, you need either a 24x7 team or L2 support. Let’s assume one engineer for this.

So summing all of this up:

For calculation purposes, let’s assume we pay each engineer 120k USD a year.
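The development labor cost can be sketched from the salary and the effort estimates. The 260 working days per year is an assumption (52 weeks of 5 days); the source does not say how days were converted to cost.

```python
ANNUAL_SALARY_USD = 120_000           # from the article
ASSUMED_WORKING_DAYS_PER_YEAR = 260   # assumption: 52 weeks x 5 days

# Development effort in days, from the resource requirements.
DEV_EFFORT_DAYS = {
    "full_stack_engineer": 30,
    "cloud_devops_engineer": 15,
}


def labor_cost(days: float,
               annual_salary: float = ANNUAL_SALARY_USD,
               working_days: int = ASSUMED_WORKING_DAYS_PER_YEAR) -> float:
    """Cost of `days` of one engineer's time at a flat daily rate."""
    return days * annual_salary / working_days


development_cost = sum(labor_cost(days) for days in DEV_EFFORT_DAYS.values())
print(f"Development labor: {development_cost:,.0f} USD")  # 20,769 USD
```

Under this assumption the result lands close to the 21,000 USD development figure quoted in the verdict, although the exact derivation of that figure is not given.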

Note: There are many ways to further reduce the cost, such as GPU sharing on a GKE cluster, optimising the model, using a different runtime, quantisation, etc. But this post aims mainly to show the end-to-end process rather than focus on optimisation.

Final Verdict:

For our system to go into production, it would cost us 21,000 USD to develop and approximately 4,000 USD to maintain, given our current model and business requirements. This excludes the cost of model development and other portions.

So what we need to gauge is whether the profits from having the system in production will exceed the cost to develop and maintain it.
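One simple way to frame that comparison is a break-even calculation. The sketch below treats the 4,000 USD maintenance figure as a monthly cost, which is an assumption since no period is stated, and the monthly benefit passed in is a purely hypothetical number.

```python
import math
from typing import Optional

DEVELOPMENT_COST_USD = 21_000   # from the article
MAINTENANCE_COST_USD = 4_000    # from the article; treating it as monthly is an assumption


def months_to_break_even(monthly_benefit: float,
                         development_cost: float = DEVELOPMENT_COST_USD,
                         monthly_maintenance: float = MAINTENANCE_COST_USD) -> Optional[int]:
    """Months until cumulative net benefit covers development cost, or None if never."""
    net_per_month = monthly_benefit - monthly_maintenance
    if net_per_month <= 0:
        return None  # the system never pays for itself
    return math.ceil(development_cost / net_per_month)


# Hypothetical benefit of 7,500 USD per month, for illustration only.
print(months_to_break_even(7_500))  # 6
print(months_to_break_even(4_000))  # None
```

If the expected benefit does not clear the maintenance cost, the function returns `None`, which is the case where the system should not go to production on cost grounds alone.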

Doing this exercise early in solution development helps the different stakeholders, such as data scientists, business teams, product managers, ML engineers and others, align on the solution needed. It also helps define many aspects of the system and the business SLA.

References:


Originally published at https://www.linkedin.com.
