
Monitoring Tools

Grafana

What is Grafana?

AI Platform's Grafana monitoring setup provides observability into various aspects of its Kubernetes-based infrastructure and testing environment. By leveraging Grafana’s visualization and analysis capabilities, we can monitor service health, workload performance, K6 load tests, and container logs. This helps us ensure system stability, improve performance, and troubleshoot issues effectively.

Each Grafana dashboard serves as a specific entry point for monitoring and troubleshooting within the Kubernetes ecosystem. Users can select time ranges and drill down into metrics to gain deeper insights into system performance. Regular checks on these dashboards are recommended to ensure that all services and workloads are running optimally.

Accessing Grafana

For each AKS cluster, a separate Grafana instance is deployed. It is accessible through the /grafana endpoint specific to each cluster, formatted as https://<CLUSTER_ADDRESS>/grafana. For example, https://balder.dev.equinor.com/grafana.

As an AI Platform user, you can log in to Grafana's dashboards using single sign-on (SSO) with your Equinor Azure AD account.

Grafana log in with SSO

info
  • By default, you are granted read-only access.
  • For editing permissions and access to the Explore section, contact the AI Platform team.

Available Dashboards

All dashboards are available in the Aurora folder, which is present in every Grafana instance.

folder List of available Aurora dashboards in Grafana

Istio Service Dashboard

Individual Services View. This section provides metrics about requests and responses for each individual service within the mesh (HTTP/gRPC and TCP). This also provides metrics about client and service workloads for this service.

istio-service Individual services view

Istio Workload Dashboard

This section provides metrics about requests and responses for each individual workload within the mesh (HTTP/gRPC and TCP). This also provides metrics about inbound workloads and outbound services for this workload.

istio-workload Istio workload

K6 Tests

Monitors the results of K6 load tests, including request rates, response times, and error percentages. It helps assess the performance and scalability of applications under various load conditions, supporting testing and optimization of critical endpoints and services.

k6 K6 tests

note

This dashboard should be used only on our shared Grafana, which has access to Victoria Metrics.

Kubernetes Container Logs

Aggregates and displays logs from Kubernetes containers. This dashboard facilitates debugging by providing easy access to container-level logs, allowing us to track application behavior, diagnose issues, and ensure smooth operation within our clusters.

k8s-logs Kubernetes container logs

Kubernetes Pods

Provides an overview of pod-level metrics in the Kubernetes cluster, such as CPU, memory usage, and pod status. It’s essential for monitoring pod health, optimizing resource utilization, and maintaining stability across the cluster.

k8s-pods Kubernetes pods


Evidently AI

What is Evidently AI?

Evidently AI is an open-source Python library for data scientists and ML engineers. It helps evaluate, test, and monitor the performance of ML models from validation to production. It works with tabular data, text data, and embeddings.

Their key offering is model observability, which can be achieved through:

  • Tests: perform structured data and ML model quality checks. They verify a condition and return an explicit pass or fail result.

  • Reports: these can be viewed in your Jupyter notebooks, but can also be integrated with Grafana. Reports calculate various data and ML metrics and render rich visualizations. You can create a custom Report or run a preset to evaluate a specific aspect of model or data performance, for example a Data Quality or Classification Performance report. A minimal sketch of running a preset follows this list.

  • Real-time ML monitoring: Evidently has monitors that collect data and model metrics from a deployed ML service. You can use it to build live monitoring dashboards (Grafana).
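
To get a feel for what this looks like in code, here is a minimal sketch of running a Report preset. It assumes Evidently's Python API (the 0.4.x series) and uses scikit-learn's iris dataset as stand-in data; adapt the preset and data to your own use case.

    from sklearn import datasets

    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    # Use the iris dataset as stand-in "reference" (training) and "current" (production) data
    iris = datasets.load_iris(as_frame=True).frame
    reference, current = iris.iloc[:75], iris.iloc[75:]

    # Run the Data Drift preset and render the result
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html("data_drift_report.html")  # or report.show() inside a Jupyter notebook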

Further resources

For information, examples, and guides provided by Evidently AI, go to:

For the original integration guide and examples provided by Equinor's AI Platform team, go to:

Using Evidently AI (tutorial)

Prerequisites

The different tutorials on this page require a range of tools that you need to have installed on your machine or have access to, such as:

  • Azure Portal
  • Kubeflow Central Dashboard
  • Jupyter Notebook
  • MLflow
  • Docker
  • Grafana
  • Make
  • Python
  • IDE
  • GitHub

Running locally

  1. Clone the ai-platform-advanced-usecases/evidentlyAI repository on your local machine.
  2. With your preferred method, run the basic_example.ipynb notebook.
    This notebook example covers how to create reports based on Evidently metrics.
  3. Print the different reports to understand what these look like.

Deploying metrics to MLflow

You can integrate your metrics with MLflow. The two Jupyter notebooks in the mlflow_integration/ folder demonstrate how to do this; a condensed sketch of the pattern follows the setup steps below.

If your goal is to run the notebooks (or Python script) on AI Platform, you first need an MLflow instance running in your namespace that your scripts can point to.

How to set up an MLflow instance

To set up an MLflow instance, follow these steps:

  1. If you haven't done so already, from the Azure Portal, create an Azure storage account.

  2. In the Azure Portal, go to Storage account > Security + networking > Access keys and collect the following values:

    1. Azure storage account
    2. Azure storage key
    3. Container name

    azure access key Azure storage access key

  3. Go to the Kubeflow Central Dashboard of your corresponding cluster and, in the navigation menu, click Applications.

    Applications section in Kubeflow Central Dashboard

  4. Click Create MLFlow.

  5. Fill in the dialog box with Azure storage account data.

  6. Point your scripts to the MLflow server running in Aurora by editing the line that sets the tracking URI in your notebook / script. This was added to both notebooks and will need editing before you can run your example:

    import mlflow
    from mlflow.tracking import MlflowClient

    mlflow.set_tracking_uri("http://aurora-mlflow.<namespace>.svc.cluster.local:5000") # change this to your MLflow instance's internal URI
    mlflow.set_experiment('Data Drift Evaluation with Evidently') # set the experiment
    client = MlflowClient()
  7. Open the MLflow UI using the external URI and you will be able to visualize the metrics. These are the graphs displayed in the historical_drift_visualization.ipynb notebook:

    • Feature drift

      evidentlyMlflowPlot_1

    • p-value:

      evidentlyMlflowPlot_2

    • Dataset Drift:

      evidentlyMlflowPlot_3
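
Putting the pieces together, the pattern the mlflow_integration/ notebooks follow can be condensed roughly as below. The file names are hypothetical and the exact layout of the result dictionary depends on your Evidently version, so treat this as a sketch rather than a drop-in script.

    import mlflow
    import pandas as pd

    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    mlflow.set_tracking_uri("http://aurora-mlflow.<namespace>.svc.cluster.local:5000")  # internal URI from step 6
    mlflow.set_experiment("Data Drift Evaluation with Evidently")

    reference = pd.read_csv("reference.csv")  # hypothetical reference data
    current = pd.read_csv("current.csv")      # hypothetical current (production) data

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()

    # Log a summary value to MLflow; the key below assumes the DataDriftPreset result structure
    drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]
    with mlflow.start_run():
        mlflow.log_metric("share_of_drifted_columns", drift_share)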

Running on AI Platform with Grafana

The example in the prometheus_integration/ folder gives you a use case for deploying a monitoring service with Evidently AI on AI Platform. This is a more advanced use of the monitoring capability offered by Evidently AI: it spins up a basic Flask app with a GET endpoint (/metrics) that exposes the metrics, and a POST endpoint (/iterate) that receives new data to monitor.

The prometheus_integration/metrics_app folder contains the source code for the monitoring app. The config.yaml file contains the information needed to parse the data correctly, including the type of monitoring we'd like to do for each data source (bike_random_forest, bike_gradient_boosting, kdd_k_neighbors_classifier), the column mapping including categorical and numerical features for each dataset, and other relevant parameters. This can be adapted to your use case.
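
For orientation, the overall shape of such a metrics app can be sketched as below. This is not the actual metrics_app source, just an illustration of the /metrics and /iterate pattern, assuming Flask and the prometheus_client library; the metric name and payload format are placeholders.

    import prometheus_client
    from flask import Flask, jsonify, request
    from prometheus_client import Gauge

    app = Flask(__name__)

    # One gauge per monitored value; labels distinguish the configured data sources
    drift_share = Gauge("evidently_drift_share", "Share of drifted features", ["dataset"])

    @app.route("/iterate", methods=["POST"])
    def iterate():
        # Receive a batch of new data, run the Evidently monitors on it, and update the gauges
        payload = request.get_json()
        dataset = payload.get("dataset", "unknown") if isinstance(payload, dict) else "unknown"
        # ... run Evidently on the batch here and set the gauge from its result ...
        drift_share.labels(dataset=dataset).set(0.0)  # placeholder value
        return jsonify({"status": "success"})

    @app.route("/metrics", methods=["GET"])
    def metrics():
        # Expose the current metric values in the Prometheus text format
        return prometheus_client.generate_latest(), 200, {"Content-Type": prometheus_client.CONTENT_TYPE_LATEST}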

Steps to run the Prometheus integration in AI Platform

To run this example on AI Platform, follow these steps:

  1. From the repository, open a terminal and go to the prometheus_integration folder:

    cd prometheus_integration
  2. From the terminal, run the scripts/download_datasets.py script to download the datasets needed for the example.

    python3 scripts/download_datasets.py
  3. Run make to build the Docker image and push it to the auroradevacr container registry.

    make image
    note

    Feel free to change the image name to whatever you want and reuse this for other use cases. You can pass the GIT_TAG (image tag) as a variable.

  4. Edit the deployment.yaml file. It covers the infrastructure needed for this example. This includes:

    • a Deployment, which spins up a pod running your Flask app
    • a Service, which exposes your pod's ports
    • a VirtualService, which creates a URL path through which you can access the exposed ports.
    note

    Within the YAML file, review the following:

    • Ensure that the port numbers of the Deployment, Service, and VirtualService match.
    • The string in the prefix field of the VirtualService resource is the URL path that you will type in your browser to reach the metrics. Change this to something meaningful.
  5. Deploy the resources.

    kubectl apply -f deployment.yaml -n <namespace>
    note

    After your pod is running, you will be able to visit the app's metrics. The URL follows the <cluster_url>/<prefix>/metrics structure, for example: https://balder.dev.aurora.equinor.com/evidentlyai/metrics

  6. To send requests to your app:

    1. Open with your code editor the scripts/example_run_request.py script.

    2. Ensure that the address in line 45 is correct and matches the resource you have created (keep the /iterate suffix).

    3. Run the scripts/example_run_request.py script.

      python3 scripts/example_run_request.py

    You should see several "success" messages appear. If you reload the <cluster_url>/<prefix>/metrics page, you will start seeing your Evidently metrics come through. A sketch of what such a request loop looks like is shown after these steps.

    note

    The commented line 46 in scripts/example_run_request.py shows how to send requests to a local service, which you can spin up by running python3 -m flask run --host=0.0.0.0 --port=8000 in a second terminal window (this needs to be run from within the metrics_app folder). You can then uncomment that line, comment out the line with the remote URL, and send requests the same way.
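
For reference, a request loop like the one in scripts/example_run_request.py can be sketched as follows. The dataset path, payload shape, and URL are placeholders; check the script itself for the exact format it sends.

    import pandas as pd
    import requests

    data = pd.read_csv("datasets/production.csv")  # hypothetical path

    for _, row in data.iterrows():
        response = requests.post(
            "https://balder.dev.aurora.equinor.com/evidentlyai/iterate",  # <cluster_url>/<prefix>/iterate
            json=[row.to_dict()],
        )
        print(response.status_code)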

Visualizing metrics

There are five reusable dashboards in Grafana that you are welcome to copy and adapt to your use case:

You can explore the metrics you have just created with these dashboards.

Visualizing reports and workspaces

Set up a workspace to organize your data and projects

A workspace is a remote or local directory where you store snapshots. A snapshot is a JSON version of a report or test suite that contains the measurements and test results for a specific period. You can log snapshots over time and run an Evidently monitoring dashboard on top of them for continuous monitoring. The monitoring UI reads the data from this source. We have designed a solution where your snapshots are stored in your Azure blob storage, making them easily available and shareable.

How to set up Evidently AI dashboard through the Kubeflow Central Dashboard

To set up your own Evidently AI server, you will need to carry out the following steps:

  1. Create or use an existing storage account in the Azure Portal.

  2. Collect the following values from the storage account:

    1. Azure Storage Account
    2. Container Name
    3. Azure Storage key

    azure access keys Storage account details

  3. Open the Kubeflow Central Dashboard through the URL of your corresponding cluster. For example, Balder's cluster: https://balder.dev.aurora.equinor.com/.

    note

    If you don't have a personal or team profile, ask the AI Platform team to create this on your behalf.

  4. In the navigation menu, click Applications.

  5. Click Create Evidently Dashboard.

  6. Enter the values corresponding to the Azure storage account and click Create.
    This will spin up the relevant resources in the background for you.

  7. Click Refresh to reload the page and get information relevant to your Evidently AI server.
    This includes the internal URL which will be used when setting up your WORKSPACE, as well as the external URL, which directs you to the UI.

    EvidentlyAI Create Evidently AI dashboard dialog box

Connect to your server

After generating the snapshots, you will send them to the remote server. You must run the monitoring UI on the same remote server, so that it directly interfaces with the filesystem where the snapshots are stored.

To create a remote workspace, point it at the address where the UI is running:

from evidently.ui.remote import RemoteWorkspace  # import path in the Evidently 0.4.x series

workspace = RemoteWorkspace("<the internal URI returned by the Kubeflow app creation>")

For example:

workspace = RemoteWorkspace("http://aurora-evidently.<namespace>.svc.cluster.local:8080")

Building reports

Create a Jupyter notebook server or VSCode server and run the workspace_ui.ipynb notebook. Make sure the WORKSPACE variable is set to the internal URL that was output when you created the Evidently AI dashboard (Kubeflow Central Dashboard > Applications).


Custom metrics with Prometheus

To create and monitor your own metrics calculated from Python scripts on AI Platform, refer to the methodology shown in the following documentation:
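
While that guide covers the AI Platform specifics, the general shape of exposing a custom metric from a Python script with the prometheus_client library looks roughly like this (a generic sketch, not taken from that guide; the metric name and update logic are placeholders):

    import random
    import time

    from prometheus_client import Gauge, start_http_server

    my_metric = Gauge("my_custom_metric", "A custom metric computed in Python")

    if __name__ == "__main__":
        start_http_server(8000)  # serves /metrics on port 8000 for Prometheus to scrape
        while True:
            my_metric.set(random.random())  # replace with your own calculation
            time.sleep(15)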