
Model Serving made Efficient in the Cloud.


Mosec is a high-performance and flexible model serving framework for building ML model-enabled backends and microservices. It bridges the gap between any machine learning model you have just trained and an efficient online service API.

  • Highly performant: web layer and task coordination built with Rust, which offers blazing speed in addition to efficient CPU utilization powered by async I/O
  • Ease of use: user interface purely in Python, by which users can serve their models in an ML framework-agnostic manner using the same code as they do for offline testing
  • Dynamic batching: aggregate requests from different users for batched inference and distribute results back
  • Pipelined stages: spawn multiple processes for pipelined stages to handle CPU/GPU/IO mixed workloads
  • Cloud friendly: designed to run in the cloud, with model warmup, graceful shutdown, and Prometheus monitoring metrics, easily managed by Kubernetes or any container orchestration system
  • Do one thing well: focus on the online serving part, so that users can pay attention to the model performance and business logic
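
To make the dynamic batching idea above more concrete, here is a minimal pure-Python sketch of the aggregate-then-distribute pattern. It is not Mosec's implementation (Mosec coordinates batches in its Rust layer and across processes). The names `run_batched`, `exp_batch`, and `max_batch_size` are hypothetical and chosen for illustration, and the sketch assumes all requests are already waiting rather than arriving over time.

```python
import math
from typing import Callable, List


def run_batched(
    requests: List[dict],
    batch_forward: Callable[[List[dict]], List[dict]],
    max_batch_size: int,
) -> List[dict]:
    """Split pending requests into batches and return one result per request, in order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    results: List[dict] = []
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start : start + max_batch_size]
        outputs = batch_forward(batch)  # one model call for the whole batch
        if len(outputs) != len(batch):
            raise ValueError("batch_forward must return one output per input")
        results.extend(outputs)  # results stay aligned with the original requests
    return results


def exp_batch(batch: List[dict]) -> List[dict]:
    """Hypothetical batched model: computes e ^ x for every request in the batch."""
    return [{"y": math.exp(float(req["x"]))} for req in batch]


pending = [{"x": 0}, {"x": 1}, {"x": 2}]  # e.g. requests from three different users
answers = run_batched(pending, exp_batch, max_batch_size=2)
```

Each caller gets back the result at the same position as its request, which is the property that lets a server batch requests from different users without them noticing.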


Mosec requires Python 3.6 or above. Install the latest PyPI package with:

pip install -U mosec


Write the server

Import the libraries and set up a basic logger to better observe what happens.

import logging

from mosec import Server, Worker
from mosec.errors import ValidationError

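logger = logging.getLogger()
logger.setLevel(logging.DEBUG)  # show the debug message emitted in forward
formatter = logging.Formatter(
    "%(asctime)s - %(process)d - %(levelname)s - %(filename)s:%(lineno)s - %(message)s"
)
sh = logging.StreamHandler()
sh.setFormatter(formatter)  # apply the format to the stream handler
logger.addHandler(sh)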

Then, we build an API to calculate the exponential with base e for a given number. To achieve that, we simply inherit the Worker class and override the forward method. Note that the input req is by default a JSON-decoded object, e.g., a dictionary here (ideally it receives data like {"x": 1}). We also enclose the input parsing part in a try...except block to reject invalid input (e.g., there is no key named "x", or the value of "x" cannot be converted to float).


import math

class CalculateExp(Worker):
    def forward(self, req: dict) -> dict:
        try:
            x = float(req["x"])
        except KeyError:
            raise ValidationError("cannot find key 'x'")
        except ValueError:
            raise ValidationError("cannot convert 'x' value to float")
        y = math.exp(x)  # f(x) = e ^ x
        logger.debug(f"e ^ {x} = {y}")
        return {"y": y}
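
Because the parsing and computation inside forward are plain Python, they can be checked offline before the server is started. The sketch below copies that logic into a standalone function so that it runs without Mosec installed. The function name `calculate_exp` and the local `ValidationError` class are stand-ins introduced here, not part of Mosec.

```python
import math


class ValidationError(Exception):
    """Local stand-in for mosec.errors.ValidationError (assumption for offline use)."""


def calculate_exp(req: dict) -> dict:
    """Same parsing and computation as CalculateExp.forward, outside the server."""
    try:
        x = float(req["x"])
    except KeyError:
        raise ValidationError("cannot find key 'x'")
    except ValueError:
        raise ValidationError("cannot convert 'x' value to float")
    return {"y": math.exp(x)}


ok = calculate_exp({"x": 0})  # valid input
try:
    calculate_exp({"z": 1})  # missing key "x"
    rejected = False
except ValidationError:
    rejected = True
```

Note that only KeyError and ValueError are translated. A JSON body such as {"x": null} decodes to None, and float(None) raises TypeError, so that case is not covered by the handlers shown here.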

Finally, we append the worker to the server to construct a single-stage workflow, and we specify the number of processes we want it to run in parallel. Then we run the server.

if __name__ == "__main__":
    server = Server()
    server.append_worker(
        CalculateExp, num=2
    )  # we spawn two processes for parallel computing
    server.run()

Run the server

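After merging the snippets above into a single Python file, we can first have a look at the command line arguments (replace <your_file>.py with the name you saved it under):

python <your_file>.py --help

Then let’s start the server by running the same file without arguments:

python <your_file>.py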


and in another terminal, test it:

curl -X POST -d '{"x": 2}'

or check the metrics:


That’s it! You have just hosted your exponential-computing model as a server!
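
The test request can also be sent from Python instead of curl. The sketch below uses only the standard library. The address http://127.0.0.1:8000 and the /inference path are assumptions, since the URL is not shown above, so adjust them to match your running server. `build_inference_request` and `query_exp` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed address; change it to match how your server was started.
BASE_URL = "http://127.0.0.1:8000"


def build_inference_request(x, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request carrying the JSON body {"x": x}."""
    body = json.dumps({"x": x}).encode("utf-8")
    # "/inference" is an assumed endpoint path, not taken from the text above.
    return urllib.request.Request(base_url + "/inference", data=body, method="POST")


def query_exp(x, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_inference_request(x, base_url), timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_inference_request(2)  # same payload as the curl example
# Calling query_exp(2) sends the request; it only works while the server is running.
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.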


More ready-to-use examples can be found in the Example section. They include:

  • Multi-stage workflow
  • Batch processing worker
  • PyTorch deep learning models:
    • sentiment analysis
    • image recognition


We welcome any kind of contribution. Please give us feedback, or contribute your code directly through a pull request!