Evaluation metrics


In Edge AI, where models are deployed on resource-constrained devices like microcontrollers, evaluation metrics are critical. They ensure that your model performs well in terms of accuracy and runs efficiently on the target hardware. By understanding these metrics, you can fine-tune your models to achieve the best balance between performance and resource usage.

These metrics serve several important purposes:

  • Model Comparison: Metrics allow you to compare different models and see which one performs better.

  • Model Tuning: They help you adjust and improve your model by showing where it might be going wrong.

  • Model Validation: Metrics ensure that your model generalizes well to new data, rather than just memorizing the training data (a problem known as overfitting).

When to Use Different Metrics

Choosing the right metric depends on your specific task and the application's requirements:

  • Precision: Important when false positives must be avoided, such as in medical diagnosis.

  • Recall: Vital when missing detections is costly, like in security applications.

  • Lower IoU Thresholds: Suitable for tasks where rough localization suffices.

  • Higher IoU Thresholds: Necessary for tasks requiring precise localization.

Understanding these metrics in context ensures that your models are not only accurate but also suitable for their intended applications.

Types of Evaluation Metrics

Classification Metrics

Used for problems where the output is a category, such as detecting whether a sound is a cough or not:

  • Accuracy: Measures the percentage of correct predictions out of all predictions. For instance, in a model that classifies sounds on a wearable device, accuracy tells you how often the model gets it right.

    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

    • TP: True Positives

    • TN: True Negatives

    • FP: False Positives

    • FN: False Negatives

  • Precision: The percentage of true positive predictions out of all positive predictions made by the model. This is crucial in cases where false positives can have significant consequences, such as in health monitoring devices.

    $$\text{Precision} = \frac{TP}{TP + FP}$$
  • Recall: The percentage of actual positive instances that the model correctly identified. For example, in a fall detection system, recall is vital because missing a fall could lead to serious consequences.

    $$\text{Recall} = \frac{TP}{TP + FN}$$
  • F1 Score: The harmonic mean of precision and recall, useful when you need to balance the trade-offs between false positives and false negatives.

    $$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
  • Confusion Matrix: A table that shows the number of correct and incorrect predictions made by the model. It helps visualize the model's performance across different classes.

A confusion matrix, such as one for activity recognition on the simulated UCI HAR dataset, helps evaluate the performance of the model by showing where it is performing well (high values along the diagonal) and where it is making mistakes (off-diagonal values).

Here's how to interpret it:

  • Labels: The "True label" on the Y-axis represents the actual class labels of the activities. The "Predicted label" on the X-axis represents the class labels predicted by the model.

  • Classes: The example dataset has three classes, labeled 0, 1, and 2, each corresponding to a different human activity.

  • Matrix Cells: The cells in the matrix contain the number of samples classified in each combination of actual versus predicted class.

    • For instance: The top-left cell (44) indicates that the model correctly predicted class 0 for 44 instances where the true label was also 0.

    • The off-diagonal cells represent misclassifications. For example, the cell at row 0, column 1 (29) shows that 29 samples were true class 0 but were incorrectly predicted as class 1.

  • Color Scale: The color scale on the right represents the intensity of the values in the cells, with lighter colors indicating higher values and darker colors indicating lower values.
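As a minimal sketch, assuming a handful of hypothetical true and predicted labels for the three-class activity example above, these classification metrics and the confusion matrix can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical true and predicted labels for a 3-class activity problem
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 0, 1, 1])

print("Accuracy :", accuracy_score(y_true, y_pred))
# For multi-class problems, precision/recall/F1 are averaged across classes;
# 'macro' weights every class equally regardless of its frequency.
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))

# Rows = true labels, columns = predicted labels; the diagonal holds the
# correctly classified samples for each class.
print(confusion_matrix(y_true, y_pred))
```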

  • ROC-AUC: The area under the Receiver Operating Characteristic (ROC) curve, which shows the trade-off between the true positive rate and the false positive rate. The ROC curve is a commonly used tool for evaluating binary classification models: it plots the True Positive Rate (TPR, or Recall) against the False Positive Rate (FPR) for different threshold values, where:

    $$\text{FPR} = \frac{FP}{FP + TN}$$

    For example, in an ROC curve for walking vs. rest (simulated UCI HAR dataset):

    • True Positive Rate (Y-axis): The proportion of actual positives (walking instances) that the model correctly identifies (recall).

    • False Positive Rate (X-axis): The proportion of actual negatives (rest instances) that the model incorrectly identifies as positives (false positives).

  • Precision-Recall Curve: Useful in evaluating binary classification models, especially when dealing with imbalanced datasets, like in the context of walking vs resting activities. The Precision-Recall curve shows the trade-off between precision and recall for various threshold settings of the classifier.

    • Precision (Y-axis): Precision measures the proportion of true positive predictions among all positive predictions made by the model. High precision means that when the model predicts "Walking," it is correct most of the time.

    • Recall (X-axis): Recall (or True Positive Rate) measures the proportion of actual positives (walking instances) that the model correctly identifies. High recall indicates that the model successfully identifies most instances of walking.
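A similar sketch, assuming hypothetical binary labels and predicted probabilities for walking vs. rest, computes the ROC curve, ROC-AUC, and precision-recall curve with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

# Hypothetical binary labels (1 = walking, 0 = rest) and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.35, 0.55, 0.3])

# ROC curve: FPR and TPR at every decision threshold, plus the AUC summary
fpr, tpr, roc_thresholds = roc_curve(y_true, y_prob)
print("ROC-AUC:", roc_auc_score(y_true, y_prob))

# Precision-recall curve: precision and recall at every decision threshold
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_prob)
print("Precision at each threshold:", precision)
print("Recall at each threshold:   ", recall)
```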

  • Log Loss: The negative log-likelihood of the true labels given the model's predicted probabilities, so lower values are better.

    $$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

    • y_i: Actual label

    • p_i: Predicted probability

    • N: Number of samples
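Log loss can be computed the same way; the labels and probabilities below are again hypothetical:

```python
from sklearn.metrics import log_loss

# Hypothetical binary labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.6, 0.3]

# log_loss penalizes confident but wrong predictions heavily
print("Log loss:", log_loss(y_true, y_prob))
```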

Regression Metrics

Used for problems where the output is a continuous value, like predicting the temperature from sensor data:

  • Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values. In an edge device that predicts temperature, MSE penalizes larger errors more heavily, making it crucial for ensuring accurate predictions.

    $$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

    • y_i: Actual value

    • ŷ_i: Predicted value

    • N: Number of samples

  • Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values, providing a straightforward measure of prediction accuracy. This is useful in energy monitoring systems where predictions need to be as close as possible to the actual values.

    $$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

  • R-Squared (R²): Measures how well your model explains the variability in the data. A higher R² indicates a better model fit, which is useful when predicting variables like energy consumption in smart homes.

    $$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

    • ȳ: Mean of the actual values
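A short sketch of these regression metrics with scikit-learn, assuming hypothetical actual and predicted temperature readings:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual vs. predicted temperatures (degrees C)
y_true = [21.0, 22.5, 19.8, 23.1, 20.4]
y_pred = [20.6, 22.9, 20.5, 22.7, 20.0]

print("MSE:", mean_squared_error(y_true, y_pred))   # penalizes large errors more
print("MAE:", mean_absolute_error(y_true, y_pred))  # average absolute deviation
print("R² :", r2_score(y_true, y_pred))             # variance explained by the model
```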

Object Detection Metrics

Used for problems where the goal is to identify and locate objects in an image, such as detecting pedestrians in a self-driving car system.

Focusing on the COCO mAP Score:

The COCO mAP (Mean Average Precision) score is a key metric used to evaluate the performance of an object detection model. It measures the model's ability to correctly identify and locate objects within images.

For example, a reported mAP of 0.3 may seem low, but it can accurately reflect the model's performance: the mAP is averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95, capturing the model's ability to localize objects with varying degrees of precision.

How It Works

  • Detection and Localization: The model attempts to detect objects in an image and draws a bounding box around each one.

  • Intersection over Union (IoU): IoU calculates the overlap between the predicted bounding box and the actual (true) bounding box. An IoU of 1 indicates perfect overlap, while 0 means no overlap.

  • Precision Across Different IoU Thresholds: The mAP score averages the precision (the proportion of correctly detected objects) across different IoU thresholds (e.g., 0.5, 0.75). This demonstrates the model's performance under both lenient (low IoU) and strict (high IoU) conditions.

  • Final Score: The final mAP score is the average of these precision values. A higher mAP score indicates that the model is better at correctly detecting and accurately placing bounding boxes around objects in various scenarios.
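To make the IoU step concrete, here is a minimal sketch of IoU for two axis-aligned boxes given as (x_min, y_min, x_max, y_max); the coordinate convention and the example boxes are illustrative assumptions, not a specific Edge Impulse format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: predicted vs. ground-truth box
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```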

IoU Thresholds

  • mAP@IoU=0.5 (AP50): A less strict metric, useful for broader applications where rough localization is acceptable.

  • mAP@IoU=0.75 (AP75): A stricter metric requiring higher overlap between predicted and true bounding boxes, ideal for tasks needing precise localization.

  • mAP@[IoU=0.5:0.95]: The average of AP values computed at IoU thresholds ranging from 0.5 to 0.95. This primary COCO challenge metric provides a balanced view of the model's performance.
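The following deliberately simplified sketch shows only the idea of sweeping IoU thresholds from 0.5 to 0.95 and averaging a per-threshold score; real COCO mAP also integrates precision over recall and matches predictions to ground truth. The matched IoU values below are hypothetical:

```python
import numpy as np

# Hypothetical IoU values for each predicted box matched to a ground-truth box
matched_ious = np.array([0.92, 0.81, 0.64, 0.55, 0.40, 0.73])

thresholds = np.arange(0.5, 1.0, 0.05)  # 0.50, 0.55, ..., 0.95 (COCO-style range)

# Simplified per-threshold score: fraction of predictions whose IoU clears the
# threshold (real AP additionally accounts for recall and confidence ranking)
per_threshold = [(matched_ious >= t).mean() for t in thresholds]

print("AP50-like score:", per_threshold[0])
print("AP75-like score:", per_threshold[5])
print("Averaged over IoU=0.5:0.95:", float(np.mean(per_threshold)))
```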

Area-Based Evaluation

mAP can also be broken down by object size—small, medium, and large—to assess performance across different object scales:

  • Small Objects: Typically smaller than 32x32 pixels.

  • Medium Objects: Between 32x32 and 96x96 pixels.

  • Large Objects: Larger than 96x96 pixels.

Models generally perform better on larger objects, but understanding performance across all sizes is crucial for applications like aerial imaging or medical diagnostics.
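A small sketch of binning detections by the COCO size ranges quoted above (32×32 and 96×96 pixels); the example boxes are hypothetical:

```python
def size_bucket(box):
    """Classify an (x_min, y_min, x_max, y_max) box as small / medium / large."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 * 32:
        return "small"
    elif area < 96 * 96:
        return "medium"
    return "large"

for box in [(0, 0, 20, 20), (0, 0, 60, 60), (0, 0, 120, 120)]:
    print(box, "->", size_bucket(box))  # small, medium, large
```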

Recall Metrics

Recall in object detection measures the ability of a model to find all relevant objects in an image:

  • Recall@[max_detections=1, 10, 100]: These metrics measure recall when considering only the top 1, 10, or 100 detections per image, providing insight into the model's performance under different detection strictness levels.

  • Recall by Area: Similar to mAP, recall can also be evaluated based on object size, helping to understand how well the model recalls objects of different scales.
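As a final, heavily simplified sketch, recall@max_detections can be approximated by keeping only the top-k predictions by confidence and counting how many ground-truth boxes are matched at an IoU of at least 0.5. This reuses the iou() helper from the earlier sketch; the full COCO matching procedure is more involved, and all boxes here are hypothetical:

```python
def recall_at_k(predictions, ground_truths, k, iou_thr=0.5):
    """predictions: list of (confidence, box); ground_truths: list of boxes."""
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    matched = 0
    for gt in ground_truths:
        # A ground-truth box counts as found if any kept prediction overlaps enough
        if any(iou(box, gt) >= iou_thr for _, box in top_k):
            matched += 1
    return matched / len(ground_truths) if ground_truths else 0.0

preds = [(0.9, (10, 10, 50, 50)), (0.6, (55, 55, 90, 90)), (0.3, (0, 0, 15, 15))]
gts = [(12, 12, 52, 52), (60, 60, 95, 95)]
print(recall_at_k(preds, gts, k=1))   # only the top prediction is considered
print(recall_at_k(preds, gts, k=10))  # all predictions are considered
```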

Importance of Evaluation Metrics

Evaluation metrics serve multiple purposes in the impulse lifecycle:

  • Model Selection: They enable you to compare different models and choose the one that best suits your needs.

  • Model Tuning: Metrics guide you in fine-tuning models by providing feedback on their performance.

  • Model Interpretation: Metrics help understand how well a model performs and where it might need improvement.

  • Model Deployment: Before deploying a model in real-world applications, metrics are used to ensure it meets the required standards.

  • Model Monitoring: After deployment, metrics continue to monitor the model's performance over time.

How to Choose the Right Metric

Choosing the right metric depends on the specific task and application requirements:

  • For classification: In an Edge AI application like sound detection on a wearable device, precision might be more important if you want to avoid false alarms, while recall might be critical in safety applications where missing a critical event could be dangerous.

  • For regression: If you're predicting energy usage in a smart home, MSE might be preferred because it penalizes large errors more, ensuring your model's predictions are as accurate as possible.

  • For object detection: If you're working on an edge-based animal detection camera, mAP with a higher IoU threshold might be crucial for ensuring the camera accurately identifies and locates potential animals.

Conclusion

Evaluation metrics like mAP and recall provide useful insights into the performance of machine learning models, particularly in object detection tasks. By understanding and appropriately focusing on the correct metrics, you can ensure that your models are robust, accurate, and effective for real-world deployment.
