An Introduction to YOLO

YOLO – You Only Look Once


Introduction: Why YOLO is Important

Imagine walking down a busy street where dozens of people and vehicles are moving in all directions. Suddenly, an autonomous car approaches and must quickly decide how to navigate among all the objects around it, such as pedestrians, cyclists, and other vehicles. The ability to recognize and classify these objects in real time is essential to avoid accidents and ensure safety. This is where YOLO (You Only Look Once), one of the most powerful computer vision algorithms for object recognition, comes in. YOLO allows a system to recognize multiple objects in a single image with extreme speed, a critical task in dynamic and complex scenarios.

Why Choose YOLO

YOLO is one of the most powerful and efficient choices for object recognition, for several reasons. Unlike other algorithms that break the task into multiple steps, YOLO looks at the entire image in a single pass, simultaneously predicting both the bounding box coordinates and the object classes. This approach makes it much faster than other methods.

  • Speed and Accuracy: YOLO is extremely fast and can process hundreds of images per second, making it ideal for real-time applications, such as obstacle detection for self-driving cars or surveillance in crowded environments.
  • Single Pass: While other algorithms may require multiple stages (such as proposing regions and then classifying them), YOLO does everything in a single pass, significantly reducing processing time.
  • Scalability: YOLO is highly scalable and adaptable to a wide range of objects and contexts, from classifying objects in a store to recognizing people in public spaces. It can be easily trained on new datasets, making it a versatile option for many practical applications.

Compared to competing solutions like SSD (Single Shot Multibox Detector) or Faster R-CNN, YOLO is faster but still maintains good accuracy, making it a preferred choice in many real-time applications where balancing speed and accuracy is critical.


How Does YOLO Work?

The YOLO algorithm divides the image into a grid. Each grid cell is responsible for detecting the objects whose centers fall within it.
Each cell predicts:

  • The probability that the cell contains an object.
  • The coordinates of the object’s location (in terms of bounding boxes).
  • The probability of each object class for that cell.

The model then uses these predictions to identify objects in the image. The process can be broken down into the following steps:

  1. Dividing the Image: The input image is divided into a predefined grid (for example, 7×7, 13×13, etc.).
  2. Prediction per Cell: Each grid cell predicts a fixed number of bounding boxes. Each box has four coordinates (x, y, w, h) and a confidence score, and each cell also predicts class probabilities (decoded in the sketch after this list).
  3. Filtering Predictions: After the model has made predictions for all grid cells, predictions with low confidence are filtered out, leaving only those with high probabilities.
  4. Non-Maximum Suppression (NMS): Once the bounding boxes are selected, a process called Non-Maximum Suppression (NMS) is applied to remove redundant boxes, keeping only the one with the highest probability for each object.
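
To make these steps concrete, here is a minimal NumPy sketch of steps 2 and 3: decoding a YOLOv1-style output tensor of shape S×S×(B·5 + C) and discarding low-confidence boxes. The tensor layout, threshold, and random values are illustrative assumptions rather than a specific implementation.

```python
import numpy as np

S, B, C = 7, 2, 20       # grid size, boxes per cell, classes (YOLOv1-style)
CONF_THRESHOLD = 0.25    # assumed confidence cutoff

# Stand-in for the network output: one (B*5 + C)-vector per grid cell.
preds = np.random.rand(S, S, B * 5 + C)

detections = []
for row in range(S):
    for col in range(S):
        cell = preds[row, col]
        class_probs = cell[B * 5:]              # class distribution shared by the cell
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            score = conf * class_probs.max()    # combine objectness and best class
            if score >= CONF_THRESHOLD:         # step 3: filter low-confidence boxes
                detections.append((row, col, x, y, w, h, score,
                                   int(class_probs.argmax())))

print(f"{len(detections)} candidate boxes survive thresholding")
```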


Details on Bounding Boxes and the Deformable Parts Model

Bounding Boxes in YOLO

In the context of YOLO, bounding boxes are used to represent the position of an object within an image. Each box is described by the following values:

  1. Box Center: The coordinates (x, y) indicate the center of the bounding box.
  2. Width and Height: The size of the box, given by w (width) and h (height).
  3. Confidence: The probability that the box contains an object; its training target is the product of the objectness probability and the IoU (Intersection over Union) between the predicted box and the ground truth.
  4. Object Class: The probability that an object belonging to a specific class (e.g., person, car, etc.) is inside the box.

Each grid cell in YOLO predicts these values. Bounding box predictions are filtered using a confidence threshold: when a grid cell detects an object, it predicts the box coordinates and the probability that the object is actually present, along with the probability for each class. A small worked example follows.
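
As a worked example of the confidence definition above, here is a minimal IoU helper and the resulting confidence score. The coordinates and probability values are invented for illustration, and the corner format (x1, y1, x2, y2) is an assumption for simplicity.

```python
def iou(box_a, box_b):
    """IoU for boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Confidence target: Pr(object) * IoU(predicted box, ground-truth box).
p_object = 0.9                                 # assumed objectness probability
predicted, truth = (10, 10, 50, 50), (12, 8, 48, 52)
confidence = p_object * iou(predicted, truth)
print(round(confidence, 3))                    # ~0.743 for these toy boxes
```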


Deformable Parts Model (DPM)

The Deformable Parts Model (DPM) is a classical detection approach that represents an object as a root template plus a set of part templates that can shift relative to their expected positions. The main idea behind deformable parts is to make detections more adaptable to changes in object shape, rather than relying on a single rigid template. This flexibility helps with objects that may be deformed or appear in unusual poses.

In practice, DPM scores a candidate detection as the root template's response plus, for each part, the best part response minus a deformation penalty, allowing the parts to align dynamically during detection and improving the handling of deformable objects or variable shapes.

This approach can be particularly useful in contexts such as:

  • Detection of Deformable Objects: Such as fabrics, curtains, or other objects that change shape.
  • Complex Scenes: Where objects might overlap or have unconventional orientations.

DPM allows bounding boxes to align better with real-world objects, reducing errors in localization predictions.
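
For intuition only, here is a toy one-dimensional sketch of that scoring rule: a root (whole-object) response plus, for each part, the best part response minus a penalty proportional to how far the part drifts from its expected anchor. Every number below is invented for illustration.

```python
# Toy DPM-style score: root response plus, for each part, the best
# (part response - deformation cost) over a few candidate displacements.
root_score = 2.0
parts = [
    # (responses at displacements -1, 0, +1 from the part's anchor, deformation weight)
    ([0.5, 1.2, 0.9], 0.3),
    ([1.0, 0.4, 1.1], 0.3),
]

total = root_score
for responses, weight in parts:
    # A part may shift off its anchor if the response gain outweighs the penalty.
    total += max(r - weight * abs(d) for d, r in zip((-1, 0, 1), responses))

print(total)  # 4.0: higher scores mean a better root + deformed-part alignment
```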


YOLO Architecture

YOLO uses a convolutional neural network (CNN) to make predictions. The network architecture typically consists of three main parts:

  • Feature Extraction: A series of convolutional layers that extract relevant features from the input image.
  • Fully Connected Layers: Fully connected layers that map the extracted features to the final predictions (bounding boxes and class probabilities).
  • Output Layer: The final layer that provides the predictions for boxes, object confidences, and class probabilities (a sketch of this layout follows).
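
Here is a minimal sketch of that three-part layout. PyTorch is an assumption (the article does not name a framework), and the toy backbone is far shallower than a real YOLO network; only the output shape S×S×(B·5 + C) is the point.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1-style)

class TinyYOLO(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extraction: convolutional layers (toy-sized here).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d((S, S)),
        )
        # Fully connected layers mapping features to the final predictions.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * S * S, 512), nn.LeakyReLU(0.1),
            # Output layer: box coordinates, confidences, class probabilities.
            nn.Linear(512, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        return self.head(self.features(x)).view(-1, S, S, B * 5 + C)

out = TinyYOLO()(torch.randn(1, 3, 416, 416))
print(out.shape)  # torch.Size([1, 7, 7, 30])
```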

YOLO Steps: From Input to Prediction

  1. Image Input:
    The image is received by the model and preprocessed to fit the specified dimensions, such as 416×416 or 608×608. The model works with fixed-size images to standardize the training and inference process.
  2. Grid Division S×S:
    The preprocessed image is divided into a grid of size S×S. Each grid cell is responsible for predicting a fixed number of bounding boxes. For example, with a 13×13 grid and 5 boxes per cell, the model proposes 13 × 13 × 5 = 845 candidate boxes.
  3. Bounding Box Predictions per Cell:
    Each grid cell predicts:

    • 4 values for each box: (x, y) for the box center, and (w, h) for width and height.
    • 1 confidence value: The probability that the cell contains an object.
    • C class probabilities: The probability for each object class (e.g., person, car, etc.).
  4. Object Detection:
    • Each cell predicts multiple bounding boxes (e.g., 5) for objects that might be inside it.
    • For each box, a confidence score is calculated, representing the probability that an object is present in the box.
    • The confidence is combined with the class probabilities to determine which object is most likely to be present in that box.
  5. Filtering Predictions
    • Thresholding: Cells with a low probability of containing an object (below a predefined threshold) are ignored.
    • Each box that exceeds the threshold is then assigned the object class with the highest probability.
  6. Non-Maximum Suppression (NMS):
    • The final step is to apply Non-Maximum Suppression to eliminate redundant predictions. If multiple boxes predict the same object, only the box with the highest confidence is retained (see the sketch after this list).
    • This step is essential for reducing duplicate detections and refining the localization of objects.
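
Here is a compact sketch of the greedy NMS step just described, in plain NumPy. The IoU threshold of 0.5 is a common but assumed choice, and boxes use the corner format (x1, y1, x2, y2).

```python
import numpy as np

def box_areas(b):
    """Areas of boxes given as rows of (x1, y1, x2, y2)."""
    return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]      # highest-confidence boxes first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU of the kept box against every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        union = box_areas(boxes[[best]])[0] + box_areas(boxes[rest]) - inter
        # Drop boxes that overlap the kept box too much (same object).
        order = rest[inter / union < iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]
```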

Grid S×S and Possibilities

The S×S grid determines the number of cells into which the image is divided. The value of S directly affects the number of predictions the model can make. For example:

  • A 7×7 grid will produce 49 cells, each predicting a fixed number of bounding boxes.
  • Increasing the value of S results in a finer grid and thus more cells and potentially more predictions for small objects. However, it can also increase false positives if the model is not well-trained.

The grid size is balanced against the image resolution and the need to detect objects of different sizes. A grid that is too coarse may not capture fine details, while one that is too fine adds computation and might not be efficient for larger objects.
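
A quick calculation shows how S drives the raw prediction count, assuming the YOLOv1-style layout of B·5 + C values per cell:

```python
B, C = 2, 20  # boxes per cell and number of classes, YOLOv1-style
for S in (7, 13, 19):
    cells = S * S
    values = cells * (B * 5 + C)
    print(f"S={S:2d}: {cells:3d} cells, {cells * B:3d} boxes, {values:5d} raw output values")
```

For S = 7 this reproduces the 49 cells mentioned above and a 7×7×30 = 1470-value output.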


Conclusion

YOLO is a powerful object detection tool that balances speed and accuracy. Its ability to operate in real time makes it ideal for applications where quick responses are crucial. As the model evolves, YOLO continues to be one of the most widely used detection methods in the field of computer vision.