Evaluation metrics


In Edge AI, where models are deployed on resource-constrained devices like microcontrollers, evaluation metrics are critical. They ensure that your model performs well in terms of accuracy and runs efficiently on the target hardware. By understanding these metrics, you can fine-tune your models to achieve the best balance between performance and resource usage.

These metrics serve several important purposes:

  • Model Comparison: Metrics allow you to compare different models and see which one performs better.

  • Model Tuning: They help you adjust and improve your model by showing where it might be going wrong.

  • Model Validation: Metrics ensure that your model generalizes well to new data, rather than just memorizing the training data (a problem known as overfitting).

When to Use Different Metrics

Choosing the right metric depends on your specific task and the application's requirements:

  • Precision: Important when false positives must be avoided, such as in medical diagnosis.

  • Recall: Vital when missing detections is costly, like in security applications.

  • Lower IoU Thresholds: Suitable for tasks where rough localization suffices.

  • Higher IoU Thresholds: Necessary for tasks requiring precise localization.

Understanding these metrics in context ensures that your models are not only accurate but also suitable for their intended applications.

Types of Evaluation Metrics

Classification Metrics

Used for problems where the output is a category, such as detecting whether a sound is a cough or not:

  • Accuracy: Measures the percentage of correct predictions out of all predictions. For instance, in a model that classifies sounds on a wearable device, accuracy tells you how often the model gets it right.

    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

    • TP: True Positives

    • TN: True Negatives

    • FP: False Positives

    • FN: False Negatives

  • Precision: The percentage of true positive predictions out of all positive predictions made by the model. This is crucial in cases where false positives can have significant consequences, such as in health monitoring devices.

    $$\text{Precision} = \frac{TP}{TP + FP}$$
  • Recall: The percentage of actual positive instances that the model correctly identified. For example, in a fall detection system, recall is vital because missing a fall could lead to serious consequences.

    $$\text{Recall} = \frac{TP}{TP + FN}$$
  • F1 Score: The harmonic mean of precision and recall, useful when you need to balance the trade-offs between false positives and false negatives.

    $$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
  • Confusion Matrix: A table that shows the number of correct and incorrect predictions made by the model. It helps visualize the model's performance across different classes.

A confusion matrix, such as one for activity recognition on the simulated UCI HAR dataset, helps evaluate the performance of the model by showing where it is performing well (high values along the diagonal) and where it is making mistakes (off-diagonal values).

Here's how to interpret it:

  • Labels: The "True label" on the Y-axis represents the actual class labels of the activities. The "Predicted label" on the X-axis represents the class labels predicted by the model.

  • Classes: The example dataset has three classes, labeled 0, 1, and 2, each corresponding to a different human activity.

  • Matrix Cells: The cells in the matrix contain the number of samples classified in each combination of actual versus predicted class.

    • For instance: The top-left cell (44) indicates that the model correctly predicted class 0 for 44 instances where the true label was also 0.

    • The off-diagonal cells represent misclassifications. For example, the cell at row 0, column 1 (29) shows that 29 samples were true class 0 but were incorrectly predicted as class 1.

  • Color Scale: The color scale on the right represents the intensity of the values in the cells, with lighter colors indicating higher values and darker colors indicating lower values.
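As a minimal sketch, assuming a handful of hypothetical true and predicted labels for the three-class activity example above, these classification metrics and the confusion matrix can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical true and predicted labels for a 3-class activity problem
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 0, 1, 1])

print("Accuracy :", accuracy_score(y_true, y_pred))
# For multi-class problems, precision/recall/F1 are averaged across classes;
# 'macro' weights every class equally regardless of its frequency.
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))

# Rows = true labels, columns = predicted labels; the diagonal holds the
# correctly classified samples for each class.
print(confusion_matrix(y_true, y_pred))
```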

  • ROC-AUC: The area under the Receiver Operating Characteristic (ROC) curve, which shows the trade-off between the true positive rate and the false positive rate. The ROC curve is a commonly used tool for evaluating binary classification models: it plots the True Positive Rate (TPR, or Recall) against the False Positive Rate (FPR) for different threshold values, where:

    $$\text{FPR} = \frac{FP}{FP + TN}$$

    For example, in an ROC curve for walking vs. rest (simulated UCI HAR dataset):

    • True Positive Rate (Y-axis): The proportion of actual positives (walking instances) that the model correctly identifies (recall).

    • False Positive Rate (X-axis): The proportion of actual negatives (rest instances) that the model incorrectly identifies as positives (false positives).

  • Precision-Recall Curve: Useful in evaluating binary classification models, especially when dealing with imbalanced datasets, like in the context of walking vs resting activities. The Precision-Recall curve shows the trade-off between precision and recall for various threshold settings of the classifier.

    • Precision (Y-axis): Precision measures the proportion of true positive predictions among all positive predictions made by the model. High precision means that when the model predicts "Walking," it is correct most of the time.

    • Recall (X-axis): Recall (or True Positive Rate) measures the proportion of actual positives (walking instances) that the model correctly identifies. High recall indicates that the model successfully identifies most instances of walking.
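A similar sketch, assuming hypothetical binary labels and predicted probabilities for walking vs. rest, computes the ROC curve, ROC-AUC, and precision-recall curve with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

# Hypothetical binary labels (1 = walking, 0 = rest) and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.35, 0.55, 0.3])

# ROC curve: FPR and TPR at every decision threshold, plus the AUC summary
fpr, tpr, roc_thresholds = roc_curve(y_true, y_prob)
print("ROC-AUC:", roc_auc_score(y_true, y_prob))

# Precision-recall curve: precision and recall at every decision threshold
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_prob)
print("Precision at each threshold:", precision)
print("Recall at each threshold:   ", recall)
```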

  • Log Loss: The negative log-likelihood of the true labels given the model's predicted probabilities, so lower values are better.

    $$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

    • y_i: Actual label

    • p_i: Predicted probability

    • N: Number of samples
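Log loss can be computed the same way; the labels and probabilities below are again hypothetical:

```python
from sklearn.metrics import log_loss

# Hypothetical binary labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.6, 0.3]

# log_loss penalizes confident but wrong predictions heavily
print("Log loss:", log_loss(y_true, y_prob))
```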

Regression Metrics

Used for problems where the output is a continuous value, like predicting the temperature from sensor data:

  • Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values. In an edge device that predicts temperature, MSE penalizes larger errors more heavily, making it crucial for ensuring accurate predictions.

    $$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

    • y_i: Actual value

    • ŷ_i: Predicted value

    • N: Number of samples

  • Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values, providing a straightforward measure of prediction accuracy. This is useful in energy monitoring systems where predictions need to be as close as possible to the actual values.

    $$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

  • R-Squared (R²): Measures how well your model explains the variability in the data. A higher R² indicates a better model fit, which is useful when predicting variables like energy consumption in smart homes.

    $$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

    • ȳ: Mean of the actual values
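A short sketch of these regression metrics with scikit-learn, assuming hypothetical actual and predicted temperature readings:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual vs. predicted temperatures (degrees C)
y_true = [21.0, 22.5, 19.8, 23.1, 20.4]
y_pred = [20.6, 22.9, 20.5, 22.7, 20.0]

print("MSE:", mean_squared_error(y_true, y_pred))   # penalizes large errors more
print("MAE:", mean_absolute_error(y_true, y_pred))  # average absolute deviation
print("R² :", r2_score(y_true, y_pred))             # variance explained by the model
```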

Object Detection Metrics

Used for problems where the goal is to identify and locate objects in an image, such as detecting pedestrians in a self-driving car system.

Focusing on the COCO mAP Score:

The COCO mAP (Mean Average Precision) score is a key metric used to evaluate the performance of an object detection model. It measures the model's ability to correctly identify and locate objects within images.

For example, a reported mAP of 0.3 may seem low, but it can accurately reflect the model's performance: the mAP is averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95, capturing the model's ability to localize objects with varying degrees of precision.

How It Works

  • Detection and Localization: The model attempts to detect objects in an image and draws a bounding box around each one.

  • Intersection over Union (IoU): IoU calculates the overlap between the predicted bounding box and the actual (true) bounding box. An IoU of 1 indicates perfect overlap, while 0 means no overlap.

  • Precision Across Different IoU Thresholds: The mAP score averages the precision (the proportion of correctly detected objects) across different IoU thresholds (e.g., 0.5, 0.75). This demonstrates the model's performance under both lenient (low IoU) and strict (high IoU) conditions.

  • Final Score: The final mAP score is the average of these precision values. A higher mAP score indicates that the model is better at correctly detecting and accurately placing bounding boxes around objects in various scenarios.
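To make the IoU step concrete, here is a minimal sketch of IoU for two axis-aligned boxes given as (x_min, y_min, x_max, y_max); the coordinate convention and the example boxes are illustrative assumptions, not a specific Edge Impulse format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: predicted vs. ground-truth box
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```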

IoU Thresholds

  • mAP@IoU=0.5 (AP50): A less strict metric, useful for broader applications where rough localization is acceptable.

  • mAP@IoU=0.75 (AP75): A stricter metric requiring higher overlap between predicted and true bounding boxes, ideal for tasks needing precise localization.

  • mAP@[IoU=0.5:0.95]: The average of AP values computed at IoU thresholds ranging from 0.5 to 0.95. This primary COCO challenge metric provides a balanced view of the model's performance.
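The following deliberately simplified sketch shows only the idea of sweeping IoU thresholds from 0.5 to 0.95 and averaging a per-threshold score; real COCO mAP also integrates precision over recall and matches predictions to ground truth. The matched IoU values below are hypothetical:

```python
import numpy as np

# Hypothetical IoU values for each predicted box matched to a ground-truth box
matched_ious = np.array([0.92, 0.81, 0.64, 0.55, 0.40, 0.73])

thresholds = np.arange(0.5, 1.0, 0.05)  # 0.50, 0.55, ..., 0.95 (COCO-style range)

# Simplified per-threshold score: fraction of predictions whose IoU clears the
# threshold (real AP additionally accounts for recall and confidence ranking)
per_threshold = [(matched_ious >= t).mean() for t in thresholds]

print("AP50-like score:", per_threshold[0])
print("AP75-like score:", per_threshold[5])
print("Averaged over IoU=0.5:0.95:", float(np.mean(per_threshold)))
```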

Area-Based Evaluation

mAP can also be broken down by object size—small, medium, and large—to assess performance across different object scales:

  • Small Objects: Typically smaller than 32x32 pixels.

  • Medium Objects: Between 32x32 and 96x96 pixels.

  • Large Objects: Larger than 96x96 pixels.

Models generally perform better on larger objects, but understanding performance across all sizes is crucial for applications like aerial imaging or medical diagnostics.
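A small sketch of binning detections by the COCO size ranges quoted above (32×32 and 96×96 pixels); the example boxes are hypothetical:

```python
def size_bucket(box):
    """Classify an (x_min, y_min, x_max, y_max) box as small / medium / large."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 * 32:
        return "small"
    elif area < 96 * 96:
        return "medium"
    return "large"

for box in [(0, 0, 20, 20), (0, 0, 60, 60), (0, 0, 120, 120)]:
    print(box, "->", size_bucket(box))  # small, medium, large
```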

Recall Metrics

Recall in object detection measures the ability of a model to find all relevant objects in an image:

  • Recall@[max_detections=1, 10, 100]: These metrics measure recall when considering only the top 1, 10, or 100 detections per image, providing insight into the model's performance under different detection strictness levels.

  • Recall by Area: Similar to mAP, recall can also be evaluated based on object size, helping to understand how well the model recalls objects of different scales.
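As a final, heavily simplified sketch, recall@max_detections can be approximated by keeping only the top-k predictions by confidence and counting how many ground-truth boxes are matched at an IoU of at least 0.5. This reuses the iou() helper from the earlier sketch; the full COCO matching procedure is more involved, and all boxes here are hypothetical:

```python
def recall_at_k(predictions, ground_truths, k, iou_thr=0.5):
    """predictions: list of (confidence, box); ground_truths: list of boxes."""
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    matched = 0
    for gt in ground_truths:
        # A ground-truth box counts as found if any kept prediction overlaps enough
        if any(iou(box, gt) >= iou_thr for _, box in top_k):
            matched += 1
    return matched / len(ground_truths) if ground_truths else 0.0

preds = [(0.9, (10, 10, 50, 50)), (0.6, (55, 55, 90, 90)), (0.3, (0, 0, 15, 15))]
gts = [(12, 12, 52, 52), (60, 60, 95, 95)]
print(recall_at_k(preds, gts, k=1))   # only the top prediction is considered
print(recall_at_k(preds, gts, k=10))  # all predictions are considered
```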

Importance of Evaluation Metrics

Evaluation metrics serve multiple purposes in the impulse lifecycle:

  • Model Selection: They enable you to compare different models and choose the one that best suits your needs.

  • Model Tuning: Metrics guide you in fine-tuning models by providing feedback on their performance.

  • Model Interpretation: Metrics help understand how well a model performs and where it might need improvement.

  • Model Deployment: Before deploying a model in real-world applications, metrics are used to ensure it meets the required standards.

  • Model Monitoring: After deployment, metrics continue to monitor the model's performance over time.

How to Choose the Right Metric

Choosing the right metric depends on the specific task and application requirements:

  • For classification: In an Edge AI application like sound detection on a wearable device, precision might be more important if you want to avoid false alarms, while recall might be critical in safety applications where missing a critical event could be dangerous.

  • For regression: If you're predicting energy usage in a smart home, MSE might be preferred because it penalizes large errors more, ensuring your model's predictions are as accurate as possible.

  • For object detection: If you're working on an edge-based animal detection camera, mAP with a higher IoU threshold might be crucial for ensuring the camera accurately identifies and locates potential animals.

Conclusion

Evaluation metrics like mAP and recall provide useful insights into the performance of machine learning models, particularly in object detection tasks. By understanding and appropriately focusing on the correct metrics, you can ensure that your models are robust, accurate, and effective for real-world deployment.
