Sunday, May 7, 2023

Speed Trap – Sparrow Computing


This post showcases the development of a vehicle speed detector using Sparrow Computing’s open-source libraries and PyTorch Lightning.

The exciting news here is that we could make this speed detector for any traffic feed without prior knowledge about the site (no calibration required) or specialized imaging equipment (no depth sensors required). Better yet, we only needed ~300 annotated images to reach a decent speed estimate. To estimate speed, we’ll detect all vehicles coming toward the camera using an object detection model. We’ll also predict the locations of the back tire and the front tire of every vehicle using a keypoint model. Finally, we’ll perform object tracking to follow the same tire as it progresses frame-to-frame so that we can estimate vehicle speed.

Without further ado, let’s dive in…

What we’ll need

  • A video stream of vehicle traffic
  • About 300 randomly sampled video frames from the video stream
  • Bounding box annotations around all the vehicles moving toward the camera
  • Keypoint annotations for the points where the front tire and the back tire touch the road (on the side where the vehicle is facing the camera)
An example of an annotated frame: to reduce the annotation labor, we defined a smaller region of the video.

End-to-end Pipeline

The computer vision system we’re building takes in a traffic video as an input, makes inferences through multiple deep learning models, and generates 1) an annotated video and 2) a traffic log that prints vehicle counts and speeds. The figure below is a high-level overview of how the whole system is glued together (numbered in chronological order). We’ll talk about each of the 9 process units as well as the I/O highlighted in green.

End-to-end inference pipeline of our computer vision system

Docker Container Setup

First, we need a template that comes with the dependencies and configuration for Sparrow’s development environment. This template can be generated with a Python package called sparrow-patterns. After creating the project source directory, the project will run in a Docker container so that our development will be platform-independent.

To set this up, all you need to do is:

pip install sparrow-patterns
sparrow-patterns project speed-trapv3

Before you run the template in a container, add an empty file called .env inside the folder named .devcontainer as shown below.

For this project, we used an annotation platform called V7 Darwin, but the same approach would work with other annotation platforms.

The Darwin annotation file (shown below) is a JSON object for each image that we tag. Note that “id” is a hash value that uniquely identifies an object on the frame and is only relevant to the Darwin platform. The data we need is contained in the “bounding_box”, “keypoint”, and “name” objects. Once you have annotations, you need to convert them to a form that PyTorch’s Dataset class is looking for:

  "annotations": [
    {
      "bounding_box": {
        "h": 91.61,
        "w": 151.53,
        "x": 503.02,
        "y": 344.16
      },
      "id": "ce9d224b-0963-4a16-ba1c-34a2fc062b1b",
      "name": "vehicle"
    },
    {
      "id": "8120d748-5670-4ece-bc01-6226b380e293",
      "keypoint": {
        "x": 508.47,
        "y": 427.1
      },
      "name": "back_tire"
    },
    {
      "id": "3f71b63d-d293-4862-be7b-fa35b76b802f",
      "keypoint": {
        "x": 537.3,
        "y": 434.54
      },
      "name": "front_tire"
    }
  ]

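As a rough illustration of that conversion step, here is a minimal sketch that splits a Darwin-style annotation into boxes and keypoints. The function name parse_darwin_annotations and the output layout are hypothetical, and the sketch assumes that the “x” and “y” of a bounding box are its top-left corner; it is not the conversion code used in the project.

```python
def parse_darwin_annotations(annotation: dict) -> dict:
    """Split Darwin-style annotations into boxes, labels, and keypoints.

    Hypothetical helper: assumes bounding-box "x"/"y" are the top-left corner.
    """
    boxes, labels, keypoints = [], [], {}
    for item in annotation["annotations"]:
        if "bounding_box" in item:
            box = item["bounding_box"]
            # Convert (x, y, w, h) into corner format (x1, y1, x2, y2).
            boxes.append(
                [box["x"], box["y"], box["x"] + box["w"], box["y"] + box["h"]]
            )
            labels.append(item["name"])
        elif "keypoint" in item:
            point = item["keypoint"]
            keypoints[item["name"]] = (point["x"], point["y"])
    return {"boxes": boxes, "labels": labels, "keypoints": keypoints}


example = {
    "annotations": [
        {
            "bounding_box": {"h": 91.61, "w": 151.53, "x": 503.02, "y": 344.16},
            "name": "vehicle",
        },
        {"keypoint": {"x": 508.47, "y": 427.1}, "name": "back_tire"},
        {"keypoint": {"x": 537.3, "y": 434.54}, "name": "front_tire"},
    ]
}
parsed = parse_darwin_annotations(example)
```

The corner format is what the detection model expects for its “boxes” tensor, so a step like this would typically run before the samples reach the Dataset class.
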
Object Detection Model

Define the dataset

In the Dataset class, we create a dictionary with keys such as the frame of the video, the bounding boxes of the frame, and the labels for each of the bounding boxes. The values of that dictionary are then stored as PyTorch tensors. The following is the generic setup for the Dataset class:

from typing import Optional

import torch
import torchvision.transforms as T
from PIL import Image

# Holdout and get_sample_dicts are project helpers that are not shown here.

class RetinaNetDataset(torch.utils.data.Dataset):  # type: ignore
    """Dataset class for RetinaNet model."""

    def __init__(self, holdout: Optional[Holdout] = None) -> None:
        self.samples = get_sample_dicts(holdout)
        self.transform = T.ToTensor()

    def __len__(self) -> int:
        """Return number of samples."""
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get the tensor dict for a sample."""
        sample = self.samples[idx]
        # The image-loading call was lost from the original listing;
        # PIL's Image.open is assumed here.
        img = Image.open(sample["image_path"])
        x = self.transform(img)
        boxes = sample["boxes"].astype("float32")
        return {
            "image": x,
            "boxes": torch.from_numpy(boxes),
            "labels": torch.from_numpy(sample["labels"]),
        }

Define the model

We use a pre-trained RetinaNet model from TorchVision as our detection model, defined in the Module class:

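import torch
from torchvision import models

class RetinaNet(torch.nn.Module):
    """RetinaNet model based on Torchvision."""

    def __init__(
        self,
        n_classes: int = Config.n_classes,
        min_size: int = Config.min_size,
        trainable_backbone_layers: int = Config.trainable_backbone_layers,
        pretrained: bool = False,
        pretrained_backbone: bool = True,
    ) -> None:
        super().__init__()
        self.n_classes = n_classes
        # The arguments of this call were cut off in the original listing.
        # They are reconstructed from the constructor parameters (an
        # assumption) and follow the older torchvision keyword names.
        self.model = models.detection.retinanet_resnet50_fpn(
            pretrained=pretrained,
            num_classes=n_classes,
            min_size=min_size,
            trainable_backbone_layers=trainable_backbone_layers,
            pretrained_backbone=pretrained_backbone,
        )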

Notice that the forward method (below) of the Module class returns the bounding boxes, the confidence scores for the boxes, and the assigned labels for each box in the form of tensors stored in a dictionary.

    def forward(
        self,
        x: Union[npt.NDArray[np.float64], list[torch.Tensor]],
        y: Optional[list[dict[str, torch.Tensor]]] = None,
    ) -> Union[
        dict[str, torch.Tensor], list[dict[str, torch.Tensor]], FrameAugmentedBoxes
    ]:
        """
        Forward pass for training and inference.

        Parameters
        ----------
        x
            A list of image tensors with shape (3, n_rows, n_cols) with
            unnormalized values in [0, 1].
        y
            An optional list of targets with an x1, y1, x2, y2 "boxes" tensor
            and a class index "labels" tensor.

        Returns
        -------
        result(s)
            If inference, this will be a list of dicts with predicted tensors
            for "boxes", "scores" and "labels" in each. If training, this
            will be a dict with loss tensors for "classification" and
            "bbox_regression".
        """
        if isinstance(x, np.ndarray):
            return self.forward_numpy(x)
        elif self.training:
            return self.model.forward(x, y)
        results = self.model.forward(x, y)
        # Pad every result to a fixed number of boxes; padded rows hold -1.
        padded_results = []
        for result in results:
            padded_result: dict[str, torch.Tensor] = {}
            padded_result["boxes"] = torch.nn.functional.pad(
                result["boxes"], (0, 0, 0, Config.max_boxes), value=-1.0
            )[: Config.max_boxes]
            padded_result["scores"] = torch.nn.functional.pad(
                result["scores"], (0, Config.max_boxes), value=-1.0
            )[: Config.max_boxes]
            padded_result["labels"] = torch.nn.functional.pad(
                result["labels"], (0, Config.max_boxes), value=-1
            )[: Config.max_boxes].float()
            padded_results.append(padded_result)
        return padded_results

With that, we should be able to train and save the detector as a .pth file, which stores trained PyTorch weights.

One important detail to mention here is that the dictionary created during inference is converted into a Sparrow data structure called FrameAugmentedBoxes from the sparrow-datums package. In the following code snippet, the result_to_boxes() function converts the inference result into a FrameAugmentedBoxes object. We convert the inference results to Sparrow format so they can be used directly with other Sparrow packages such as sparrow-tracky, which is used to perform object tracking later on.

def result_to_boxes(
    result: dict[str, torch.Tensor],
    image_width: Optional[int] = None,
    image_height: Optional[int] = None,
) -> FrameAugmentedBoxes:
    """Convert RetinaNet result dict to FrameAugmentedBoxes."""
    box_data = to_numpy(result["boxes"]).astype("float64")
    labels = to_numpy(result["labels"]).astype("float64")
    if "scores" in result:
        scores = to_numpy(result["scores"]).astype("float64")
    else:
        # Ground-truth dicts carry no scores, so every box gets a score of 1.
        scores = np.ones(len(labels))
    # Padded rows have a score of -1; keep only the real boxes.
    mask = scores >= 0
    box_data = box_data[mask]
    labels = labels[mask]
    scores = scores[mask]
    return FrameAugmentedBoxes(
        np.concatenate([box_data, scores[:, None], labels[:, None]], -1),
        # The remaining arguments were cut off in the original listing and
        # are reconstructed from the function parameters (an assumption).
        ptype=PType.absolute_tlbr,
        image_width=image_width,
        image_height=image_height,
    )

Now, we can use the saved detection model to make inferences and obtain the predictions as FrameAugmentedBoxes.

Object Detection Evaluation

To quantify the performance of the detection model, we use MOTA (Multiple Object Tracking Accuracy) as the primary metric, which has a range of (-inf, 1.0], where perfect object tracking is indicated by 1.0. Since we have not performed tracking yet, we’ll estimate MOTA without the identity switches. Just for the sake of clarity, we’ll call it MODA (Multiple Object Detection Accuracy). To calculate MODA, we use a function called compute_moda() from the sparrow-tracky package, which uses the pairwise IoU between bounding boxes to measure similarity.

from sparrow_tracky import compute_moda

moda = 0
count = 0
for batch in iter(test_dataloader):
    x = batch['image']
    x = x.cuda()
    sample = {'boxes': batch['boxes'][0], 'labels': batch['labels'][0]}
    result = detection_model(x)[0]
    predicted_boxes = result_to_boxes(result)
    predicted_boxes = predicted_boxes[predicted_boxes.scores > DetConfig.score_threshold]
    ground_truth_boxes = result_to_boxes(sample)
    frame_moda = compute_moda(predicted_boxes, ground_truth_boxes)
    moda += frame_moda.value
    count += 1
print(f"Based on {count} test examples, the Multiple Object Detection Accuracy is {moda/count}.")

The MODA for our detection model turned out to be 0.98 based on 43 test examples. Because the test frames were sampled from the same video as the training frames, this score most likely reflects high variance, so the model would not be as effective if we used a different video. To reduce the variance, we’ll need more training examples.
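To make the range of the metric concrete, here is a minimal sketch of the standard CLEAR MOT formula, assuming we already have the error counts. The function name and the counts are hypothetical, and compute_moda() in sparrow-tracky may differ in its details.

```python
def tracking_accuracy(
    false_negatives: int,
    false_positives: int,
    ground_truth: int,
    id_switches: int = 0,
) -> float:
    """Return 1 - (FN + FP + IDSW) / GT.

    With id_switches left at 0 this is MODA; with identity switches it is MOTA.
    """
    if ground_truth <= 0:
        raise ValueError("ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth


# Hypothetical counts: 2 errors over 100 ground-truth boxes.
good_score = tracking_accuracy(false_negatives=1, false_positives=1, ground_truth=100)
# More errors than ground-truth boxes pushes the score below zero.
bad_score = tracking_accuracy(false_negatives=10, false_positives=10, ground_truth=10)
```

Because the number of errors is not bounded by the number of ground-truth boxes, the score has no lower bound, which is why negative values are possible.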

Object Tracking

Now that we have the trained detection model saved as a .pth file, we run inference with the detection model and perform object tracking at the same time. The source video is split into video frames, and the detection predictions for each frame are converted into a FrameAugmentedBoxes object as explained before. Each frame’s boxes are then fed into a Tracker object that is instantiated from Sparrow’s sparrow-tracky package. While looping through all the frames, the Tracker object tracks the vehicles using an algorithm similar to SORT (you can read more about our approach here). Finally, the data stored in the Tracker object is written to a file using a Sparrow data structure. The file will have an extension of .json.gz when it is saved.

from sparrow_tracky import Tracker

def track_objects(
    video_path: Union[str, Path],
    model_path: Union[str, Path] = Config.pth_model_path,
) -> None:
    """
    Track the vehicles in a video.

    Parameters
    ----------
    video_path
        The path to the source video
    model_path
        The path to the saved model weights (.pth)
    """
    video_path = Path(video_path)
    slug = video_path.name.removesuffix(".mp4")
    vehicle_tracker = Tracker(Config.vehicle_iou_threshold)
    detection_model = RetinaNet().eval().cuda()
    # Weight loading was not visible in the original listing; loading a
    # state dict from model_path is an assumption.
    detection_model.load_state_dict(torch.load(model_path))
    fps, n_frames = get_video_properties(video_path)
    reader = imageio.get_reader(video_path)
    for i in tqdm(range(n_frames)):
        data = reader.get_data(i)
        # Draw the region of interest on the frame.
        data = cv2.rectangle(
            data, (450, 200), (1280, 720), thickness=5, color=(0, 255, 0)
        )
        # input_height, input_width = data.shape[:2]
        aug_boxes = detection_model(data)
        aug_boxes = aug_boxes[aug_boxes.scores > TrackConfig.vehicle_score_threshold]
        boxes = aug_boxes.array[:, :4]
        vehicle_boxes = FrameBoxes(
            boxes,
            PType.absolute_tlbr,  # (x1, y1, x2, y2) in absolute pixel coordinates [With respect to the original image size]
        )
        # Reconstructed step (cut off in the original listing, so an
        # assumption): hand the frame's boxes to the tracker.
        vehicle_tracker.track(vehicle_boxes)

    make_path(str(Config.prediction_directory / slug))
    vehicle_chunk = vehicle_tracker.make_chunk(fps, Config.vehicle_tracklet_length)
    # The method name was cut off in the original listing; to_file() is assumed.
    vehicle_chunk.to_file(
        Config.prediction_directory / slug / f"{slug}_vehicle.json.gz"
    )

Tracking Evaluation

We quantify our tracking performance using MOTA (Multiple Object Tracking Accuracy) from the sparrow-tracky package. Similar to MODA, MOTA has a range of (-inf, 1.0], where 1.0 indicates perfect tracking performance. Note that we need a sample video with tracking ground truth to perform the evaluation. Further, both the predictions and the ground truth should be transformed into an AugmentedBoxTracking object to be compatible with the compute_mota() function from the sparrow-tracky package as shown below.

from sparrow_datums import AugmentedBoxTracking, BoxTracking
from sparrow_tracky import compute_mota

test_mota = compute_mota(pred_aug_box_tracking, gt_aug_box_tracking)
print("MOTA for the test video is ", test_mota.value)

The MOTA value for our tracking algorithm turns out to be -0.94 when we evaluate it against a small sample of “ground-truth” video clips, indicating that there is plenty of room for improvement.

Keypoint Model

In order to locate each tire of a vehicle, we crop the vehicles detected by the detection model, resize them, and feed them into the keypoint model to predict the tire locations.

During cropping and resizing, the x and y coordinates of the keypoints need to be handled properly. When x and y coordinates are divided by the image width and the image height, the range of the keypoints becomes [0, 1], and we use the term “relative coordinates”. Relative coordinates tell us the location of a pixel with respect to the dimensions of the image it is located in. Similarly, when keypoints are not divided by the dimensions of the frame, we use the term “absolute coordinates”, which rely only on the frame’s Cartesian plane to establish pixel location. Assuming the keypoints are in relative coordinates when we read them from the annotation file, we have to multiply by the original frame dimensions to get the keypoints in absolute pixel space. Since the keypoint model takes in cropped regions, we have to subtract the top-left coordinate of the crop from the keypoints, which moves the origin from the frame’s (0, 0) to the crop’s top-left corner (x1, y1). Since we resize the cropped region, we divide the shifted keypoints by the dimensions of the cropped region and then multiply by the keypoint model input size. You can visualize this process using the flow chart below.

Pre-processing steps for the keypoints
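The pre-processing steps can be sketched in a few lines. This is a minimal illustration under the assumptions stated in the text (keypoints start in relative coordinates, boxes are in absolute x1, y1, x2, y2 format); the function name preprocess_keypoint is hypothetical and is not the project’s process_keypoints().

```python
def preprocess_keypoint(x_rel, y_rel, frame_size, box, model_size):
    """Map a relative keypoint into the resized crop's pixel space.

    frame_size: (width, height) of the original frame
    box:        (x1, y1, x2, y2) of the vehicle in absolute frame pixels
    model_size: (width, height) of the keypoint model input
    """
    frame_w, frame_h = frame_size
    x1, y1, x2, y2 = box
    model_w, model_h = model_size
    # Step 1: relative -> absolute coordinates in the original frame.
    x_abs, y_abs = x_rel * frame_w, y_rel * frame_h
    # Step 2: shift the origin to the top-left corner of the crop.
    x_crop, y_crop = x_abs - x1, y_abs - y1
    # Step 3: rescale from the crop size to the model input size.
    return x_crop / (x2 - x1) * model_w, y_crop / (y2 - y1) * model_h


x_model, y_model = preprocess_keypoint(
    0.5, 0.5, frame_size=(1280, 720), box=(600, 300, 700, 400), model_size=(256, 256)
)
```

In this example the keypoint sits at (640, 360) in the frame, 40 pixels right of and 60 pixels below the crop corner, so it lands at about (102.4, 153.6) in the 256 × 256 model input.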

A known fact about neural networks is that they are better at learning surfaces than at learning a single point. Because of that, we create two 2D Gaussian surfaces where the highest energy is at the location of each keypoint. These two images are called heatmaps, and they are stacked on top of each other before being fed into the model. To visualize the heatmaps, we can combine both heatmaps into a single heatmap and superimpose it on the corresponding vehicle as shown below.

The heatmap (annotation) representing both keypoints (blue: back tire, red: front tire) is overlaid on the corresponding vehicle (training example)
def keypoints_to_heatmap(
    x0: int, y0: int, w: int, h: int, covariance: float = Config.covariance_2d
) -> np.ndarray:
    """Create a 2D heatmap from an x, y pixel location."""
    if x0 < 0 and y0 < 0:
        x0 = 0
        y0 = 0
    xx, yy = np.meshgrid(np.arange(w), np.arange(h))
    # 2D Gaussian centered on (x0, y0).
    zz = (
        1
        / (2 * np.pi * covariance**2)
        * np.exp(
            -(
                (xx - x0) ** 2 / (2 * covariance**2)
                + (yy - y0) ** 2 / (2 * covariance**2)
            )
        )
    )
    # Normalize zz to be in [0, 1]
    zz_min = zz.min()
    zz_max = zz.max()
    zz_range = zz_max - zz_min
    if zz_range == 0:
        zz_range += 1e-8
    return (zz - zz_min) / zz_range

The important fact to notice here is that if the keypoint coordinates are negative, (0, 0) is assigned. When a tire is not visible (e.g., because of occlusion), we assign (-1, -1) for the missing tire in the Dataset class, since the PyTorch collate_fn() requires fixed input shapes. In the keypoints_to_heatmap() function, the negative values are zeroed out, which places the missing tire at the top-left corner of the vehicle’s bounding box as shown below.

The heatmap behaves normally for the present tire (blue: back tire) but assigns (0, 0) for the absent tire (red: front tire)

In real life, this is impossible, since tires are located in the bottom half of the bounding box. The model learns this pattern during training and continues to predict the missing tire in the top-left corner, which makes it easier for us to filter.

The Dataset class for the keypoint model could look like this:

class SegmentationDataset(torch.utils.data.Dataset):
    """Dataset class for Segmentations model."""

    def __init__(self, holdout: Optional[Holdout] = None) -> None:
        self.samples = get_sample_dicts(holdout)

    def __len__(self):
        """Length of the sample."""
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get the tensor dict for a sample."""
        sample = self.samples[idx]
        crop_width, crop_height = Config.image_crop_size
        keypoints = process_keypoints(sample["keypoints"], sample["bounding_box"])
        # One heatmap per keypoint, stacked along the first axis.
        heatmaps = []
        for x, y in keypoints:
            heatmaps.append(keypoints_to_heatmap(x, y, crop_width, crop_height))
        heatmaps = np.stack(heatmaps, 0)
        # The image-loading call was lost from the original listing;
        # PIL's Image.open is assumed here.
        img = Image.open(sample["image_path"])
        img = crop_and_resize(sample["bounding_box"], img, crop_width, crop_height)
        x = image_transform(img)

        return {
            "holdout": sample["holdout"],
            "image_path": sample["image_path"],
            "annotation_path": sample["annotation_path"],
            "heatmaps": heatmaps.astype("float32"),
            "keypoints": keypoints.astype("float32"),
            "labels": sample["labels"],
            "image": x,
        }

Note that the Dataset class creates a dictionary with the following keys:

  • holdout: Whether the example belongs to the train, dev (validation), or test set
  • image_path: Stored location of the video frame
  • annotation_path: Stored location of the annotation file corresponding to the frame
  • heatmaps: Transformed keypoints in the form of surfaces
  • keypoints: Transformed keypoint coordinates in the cropped and resized image
  • labels: Labels of the keypoints
  • image: Numerical values of the frame

For the Module class of the keypoint model, we use a pre-trained ResNet50 architecture with a sigmoid classification top.

A high-level implementation of the Module class would be:

from torchvision.models.segmentation import fcn_resnet50

class SegmentationModel(torch.nn.Module):
    """Model for predicting tire keypoints."""

    def __init__(self) -> None:
        super().__init__()
        self.fcn = fcn_resnet50(
            num_classes=Config.num_classes, pretrained_backbone=True, aux_loss=False
        )

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        """Run a forward pass with the model."""
        heatmaps = torch.sigmoid(self.fcn(x)["out"])
        ncols = heatmaps.shape[-1]
        # The peak of each heatmap gives the predicted keypoint location.
        flattened_keypoint_indices = torch.flatten(heatmaps, 2).argmax(-1)
        xs = (flattened_keypoint_indices % ncols).float()
        ys = torch.floor(flattened_keypoint_indices / ncols)
        keypoints = torch.stack([xs, ys], -1)
        return {"heatmaps": heatmaps, "keypoints": keypoints}

    def load(self, model_path: Union[Path, str]) -> str:
        """Load model weights."""
        weights = torch.load(model_path)
        result: str = self.load_state_dict(weights)
        return result

Now, we have everything we need to train and save the model in .pth format.

Keypoints post-processing

Recall that since we transformed the coordinates of the keypoints before feeding them into the model, the keypoint predictions are going to be in absolute space with respect to the dimensions of the resized cropped region. To project them back to the original frame dimensions, we have to follow the few steps shown in the flowchart below.

Post-processing steps for the keypoints

First, we have to divide the keypoints by the model input size, which takes the keypoints into relative space. Then, we have to multiply them by the dimensions of the cropped region to take them back to absolute space with respect to the cropped region of interest. Even though the keypoints are back in absolute space, the origin of their coordinate system is still at (x1, y1). So, we have to add (x1, y1) to bring the origin back to (0, 0) of the original frame’s coordinate system.
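These post-processing steps can be sketched as follows. The function name postprocess_keypoint is hypothetical, and the sketch assumes the bounding box is given as absolute (x1, y1, x2, y2) pixels in the original frame.

```python
def postprocess_keypoint(x_model, y_model, box, model_size):
    """Project a keypoint from model-input pixels back to frame pixels.

    box:        (x1, y1, x2, y2) of the vehicle in absolute frame pixels
    model_size: (width, height) of the keypoint model input
    """
    x1, y1, x2, y2 = box
    model_w, model_h = model_size
    # Step 1: divide by the model input size to get relative coordinates.
    x_rel, y_rel = x_model / model_w, y_model / model_h
    # Step 2: multiply by the crop size to get absolute crop coordinates.
    x_crop, y_crop = x_rel * (x2 - x1), y_rel * (y2 - y1)
    # Step 3: add the crop's top-left corner to return to the frame origin.
    return x_crop + x1, y_crop + y1


x_frame, y_frame = postprocess_keypoint(
    128, 64, box=(600, 300, 700, 400), model_size=(256, 256)
)
```

Here a prediction at (128, 64) in a 256 × 256 model input maps to (650, 325) in the original frame, halfway across and a quarter of the way down the 100 × 100 crop.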

Keypoint model evaluation

We quantify the model performance using the relative error metric, which is the ratio of the magnitude of the difference between ground truth and prediction to the magnitude of the ground truth.

overall_rel_error = 0
count = 0
for batch in iter(test_dataloader):
    x = batch['image']
    x = x.cuda()
    result = model(x)
    relative_error = torch.norm(
        batch["keypoints"].cuda() - result["keypoints"].cuda()
    ) / torch.norm(batch["keypoints"].cuda())
    overall_rel_error += relative_error.item()
    count += 1
# Average over the batches before reporting a percentage.
print(f"The relative error of the test set is {overall_rel_error / count * 100}%.")

The relative error of our keypoint model turns out to be 20%, which means that for every ground truth with a magnitude of 100 units, there is an error of 20 units in the corresponding prediction. This model is also over-fitting to some extent, so it would probably perform poorly on a different video. However, this could be mitigated by adding more training examples.

Aggregating Multi-Model Predictions

Recall that we saved the tracking results in a .json.gz file. Now, we open that file using the sparrow-datums package, merge the keypoint predictions, and write the result into two JSON files called objectwise_aggregation.json and framewise_aggregation.json. The motivation behind having these two files is that they allow us to access all the predictions in one place in constant time (O(1)). More specifically, the objectwise_aggregation.json dictionary uses the order in which the objects appeared in the video as the key and all the predictions regarding that object as the value.

Here’s a list of things objectwise_aggregation.json keeps for every object:

  • Frame range in which the object appeared
  • The bounding box of each appearance
  • Object tracklet ID (unique ID assigned to each object by the SORT algorithm)

In contrast, the framewise_aggregation.json dictionary uses the frame number as the key. All the predictions related to that frame are used as the value.

The following is the list of things we can grab from each video frame:

  • All the objects that appeared in that frame
  • The bounding box of the objects
  • Keypoints (transformed x, y coordinates of the tires)
  • Sigmoid scores of the keypoints (we used a sigmoid function to classify between the front and back tire)
  • Object ID (the order in which the object appeared in the input video)
  • Object tracklet ID (unique ID assigned to each object by the SORT algorithm)
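To illustrate the idea of the two lookups, here is a small sketch that groups flat prediction records by frame and by object. Every field name (“frame”, “object_id”, “tracklet_id”, “box”) and the function name are hypothetical; the actual layout of the two JSON files may differ.

```python
def aggregate_predictions(records):
    """Group flat prediction records by frame and by object.

    All field names used here are hypothetical placeholders.
    """
    framewise = {}
    objectwise = {}
    for record in records:
        # Frame-wise view: frame number -> every prediction in that frame.
        framewise.setdefault(record["frame"], []).append(record)
        # Object-wise view: object ID -> everything known about that object.
        entry = objectwise.setdefault(
            record["object_id"],
            {"tracklet_id": record["tracklet_id"], "frames": [], "boxes": []},
        )
        entry["frames"].append(record["frame"])
        entry["boxes"].append(record["box"])
    for entry in objectwise.values():
        entry["frame_range"] = (min(entry["frames"]), max(entry["frames"]))
    return framewise, objectwise


records = [
    {"frame": 0, "object_id": 0, "tracklet_id": 7, "box": [500, 340, 650, 435]},
    {"frame": 1, "object_id": 0, "tracklet_id": 7, "box": [510, 342, 662, 438]},
    {"frame": 1, "object_id": 1, "tracklet_id": 9, "box": [700, 300, 820, 380]},
]
framewise, objectwise = aggregate_predictions(records)
```

With both dictionaries in memory, looking up everything in frame 1 or everything about object 0 is a single dictionary access.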

Speed Algorithm

Once we have all the primitives detected from frames and videos, we’ll use both frame-wise and object-wise data to estimate the speed of the vehicles based on the model predictions. The simplest form of the speed algorithm would be to measure the distance between the front tire and the back tire (known as the wheelbase) at frame n, and then determine how many frames it took for the back tire to travel the wheelbase distance from frame n. For simplicity, we assume that every vehicle has the same wheelbase, which is 2.43 meters. Further, since we do not know any information about the site or the equipment, we assume that the pixel-wise wheelbase of a vehicle stays the same. Therefore, our approach works best when the camera is placed alongside the road and pointed orthogonally to the vehicles’ direction of travel (which is not the case in the demo video).
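The arithmetic behind this simplest form is short enough to sketch. The function name and the example numbers (6 frames at 30 frames per second) are hypothetical; only the 2.43 meter wheelbase comes from the text.

```python
WHEELBASE_METERS = 2.43  # assumed to be the same for every vehicle


def speed_from_frames(frames_elapsed, fps, wheelbase_m=WHEELBASE_METERS):
    """Speed in meters per second, given the frames needed to travel one wheelbase."""
    if frames_elapsed <= 0 or fps <= 0:
        raise ValueError("frames_elapsed and fps must be positive")
    elapsed_seconds = frames_elapsed / fps
    return wheelbase_m / elapsed_seconds


# Hypothetical example: the back tire covers one wheelbase in 6 frames at 30 fps.
speed_mps = speed_from_frames(frames_elapsed=6, fps=30)
speed_kmh = speed_mps * 3.6
```

In this example the elapsed time is 0.2 seconds, so the estimate is about 12.15 m/s, or roughly 43.7 km/h. Note that the frame rate limits the resolution: at 30 fps, one frame more or less changes the estimate noticeably for fast vehicles.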

Noise Filtering

Since our keypoint model has a 20% error rate, the keypoint predictions are not perfect, so we have to filter out some of the keypoints. Here are some of the observations we made to identify common scenarios where the model under-performed.

Scenario 1

Model predictions around the green boundary do not perform well since only a portion of the vehicle is visible to the camera. So, it is better to wait for these vehicles to come to a better detection area. Therefore, we filter out any vehicle predictions whose bounding box has a top-left x coordinate (x1) less than some threshold.

Scenario 2

For the missing tires, we taught the model to make predictions at the origin of the bounding box. Additionally, there are instances when the model mis-predicts the keypoint and it ends up in the upper half of the bounding box of the vehicle. Both of these cases can be resolved by removing any keypoints that are located in the upper half of the bounding box.

Scenario 3

For missing tires, the model tends to predict both tires at the same spot, so we have to remove one of them. In this case, we could draw a circle centered on the back tire, and if the other tire is inside that circle, we can get rid of the tire that had the lower probability in the sigmoid classification top.

Summary of the rules formed

So, we form three rules to filter out keypoints that are not relevant.

  1. Filter out any keypoints whose top-left bounding box coordinate satisfies x1 < some threshold.
  2. Ignore any keypoints that are predicted in the upper half of the bounding box.
  3. Neglect the keypoint with the lower sigmoid score when tires overlap with each other.
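The three rules can be sketched as a single filter. This is a minimal illustration, not the project’s filter_bad_tire_pairs(): the function name, the parameters min_x1 and overlap_radius, and the convention of returning None for a dropped keypoint are all assumptions. It also assumes image coordinates, where y grows downward, so the upper half of a box has the smaller y values.

```python
import math


def filter_keypoints(
    box, back_tire, front_tire, back_score, front_score, min_x1, overlap_radius
):
    """Apply the three noise rules; a dropped keypoint is returned as None."""
    x1, y1, x2, y2 = box
    # Rule 1: the vehicle is still too close to the region boundary.
    if x1 < min_x1:
        return None, None
    # Rule 2: drop keypoints predicted in the upper half of the box.
    middle_y = (y1 + y2) / 2
    if back_tire is not None and back_tire[1] < middle_y:
        back_tire = None
    if front_tire is not None and front_tire[1] < middle_y:
        front_tire = None
    # Rule 3: when both tires overlap, keep the one with the higher score.
    if back_tire is not None and front_tire is not None:
        if math.dist(back_tire, front_tire) < overlap_radius:
            if back_score >= front_score:
                front_tire = None
            else:
                back_tire = None
    return back_tire, front_tire


kept = filter_keypoints(
    box=(500, 300, 700, 400),
    back_tire=(520, 390),
    front_tire=(650, 395),
    back_score=0.9,
    front_score=0.8,
    min_x1=450,
    overlap_radius=10,
)
```

A keypoint dropped here is a candidate for the gap-filling step, since the other tire of the pair may still be usable.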

When we plot all the keypoints predicted throughout the input video, notice that most of the tires overlap and the general trend is a straight line.

The x vs. y coordinates of all the vehicles in the video, plotted on the same graph before applying the noise rules.
The x vs. y coordinates of individual vehicles, plotted on the same graph before applying the noise rules.

After the rules have been applied, notice that most of the outliers get filtered out.

The x vs. y coordinates of all the vehicles in the video, plotted on the same graph after applying the noise rules.

Also, note that some vehicles can be completely ignored, leaving only 4 vehicles.

The x vs. y coordinates of individual vehicles after applying the noise rules.

Filling the Missing Values

When we perform filtering, there are instances where only a single tire gets filtered out and the other remains. To fix that, we fit the keypoint data of each vehicle to a straight line using linear regression, where the x and y coordinates of the back tire and the x coordinate of the front tire are the independent variables and the y coordinate of the front tire is the dependent variable. Using the coefficients of the fitted line, we can now predict and fill in the missing values.
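As a rough sketch of this regression step, the example below fits the front tire’s y coordinate with ordinary least squares using NumPy. The function names are hypothetical stand-ins and the data is synthetic; the project’s straight_line_fit() and fill_missing_keypoints() may be implemented differently.

```python
import numpy as np


def fit_front_tire_y(back_x, back_y, front_x, front_y):
    """Least-squares fit of front_y from (back_x, back_y, front_x).

    Returns (coef, bias), where coef holds one weight per independent variable.
    """
    features = np.column_stack([back_x, back_y, front_x, np.ones(len(back_x))])
    target = np.asarray(front_y, dtype=float)
    solution, *_ = np.linalg.lstsq(features, target, rcond=None)
    return solution[:3], solution[3]


def predict_front_tire_y(coef, bias, back_x, back_y, front_x):
    """Fill in a missing front-tire y coordinate from the fitted line."""
    return float(np.dot(coef, [back_x, back_y, front_x]) + bias)


# Synthetic data generated from front_y = 2*back_x - back_y + 0.5*front_x + 3.
back_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
back_y = np.array([0.0, 1.0, 0.0, 2.0, 1.0])
front_x = np.array([5.0, 5.0, 8.0, 7.0, 10.0])
front_y = 2 * back_x - back_y + 0.5 * front_x + 3

coef, bias = fit_front_tire_y(back_x, back_y, front_x, front_y)
filled_y = predict_front_tire_y(coef, bias, back_x=5.0, back_y=2.0, front_x=11.0)
```

On real tracks the three independent variables are strongly correlated because the tires move along nearly the same line; np.linalg.lstsq still returns a usable minimum-norm solution in that case.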

For example, here’s what the straight-line trend looks like with missing values.

Straight-line trend for a particular vehicle before filling in the missing values.

After predicting the missing values with linear regression, we gain 50 more points to estimate the speed more precisely.

Straight-line trend for a particular vehicle after filling in the missing values.

Speed Estimation

Now that we have complete pairs of keypoints, it’s time to jump into the geometry of the problem we are solving…

If we draw a circle around the back tire with a radius representing the pixel-wise wheelbase (d), we form the gray area outlined on the diagram. Our goal is to find out which back tire in a future frame has reached a distance of d from the current back tire. Since the keypoints are still not perfect, we could land on a future back tire that is located anywhere on the circle. To remedy that, we can find the angle theta by finding alpha and beta with simple trigonometry. Then, we require that theta does not exceed a threshold. Ideally, theta should be zero since the vehicle is traveling on a linear path. Although the future back tire and the current front tire should ideally be on the same circular boundary, there can be some errors. So, we define an upper bound (green dotted line) and a lower bound (red dotted line) on the diagram. Let’s put this together to form an algorithm to estimate the speed.
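The geometric check can be sketched as follows. The helper names get_distance and get_angle mirror the ones used in the project code, but their bodies here are assumptions, as is the interpretation of the angles: alpha is taken as the direction from the current back tire to the future back tire, beta as the direction from the current back tire to the current front tire, and theta as the difference between them.

```python
import math


def get_distance(x1, y1, x2, y2):
    """Euclidean distance between two points, in pixels."""
    return math.hypot(x2 - x1, y2 - y1)


def get_angle(x1, y1, x2, y2):
    """Direction of the segment (x1, y1) -> (x2, y2) in degrees."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def is_eligible(back, front, future_back, distance_tolerance, angle_threshold):
    """Check the distance band and the angle threshold for one candidate."""
    wheelbase_px = get_distance(*back, *front)
    travelled_px = get_distance(*back, *future_back)
    # The future back tire must lie within the band around the circle of radius d.
    if abs(wheelbase_px - travelled_px) > distance_tolerance:
        return False
    alpha = get_angle(*back, *future_back)
    beta = get_angle(*back, *front)
    theta = abs(alpha - beta)
    theta = min(theta, 360 - theta)  # handle wrap-around at +/-180 degrees
    return theta < angle_threshold


on_path = is_eligible((0, 0), (100, 0), (98, 3), distance_tolerance=5, angle_threshold=5)
off_path = is_eligible((0, 0), (100, 0), (0, 100), distance_tolerance=5, angle_threshold=5)
```

The second candidate is at the right distance but at 90 degrees to the direction of travel, so the angle check rejects it even though it sits on the circle.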

Speed Algorithm

  • Pick a pair of tires (red and yellow).
  • Find the distance (d) between the selected pair.
  • From the back tire, draw a circle with a radius d.
  • From that back tire, iterate through the future back tires until the distance between the current back tire and the future back tire matches d.
  • If d has been exceeded by more than m pixels (green boundary), ignore the current tire pair and move on to the next pair.
  • If d has fallen short by more than m pixels (red boundary), ignore the current tire pair and move on to the next pair.
  • If not, find theta using alpha and beta; if theta is less than the threshold value of 𝛄, and if the number of frames from the current back tire to the future back tire is greater than 1, consider that point as eligible for speed estimation.
  • Otherwise, ignore the current pair and move on to the next pair.
  • Stop when all the tire pairs have been explored.
  • Find out how many frames there are between the current back tire and the future back tire to find the elapsed time.
  • Wheelbase (d) / elapsed time gives the instantaneous vehicle speed.
def estimate_speed(video_path_in):
    """Estimate the velocity of the autos within the video.

    video_path_in : str
        Supply video path
    video_path = video_path_in
    slug = Path(video_path).title.removesuffix(".mp4")
    objects_to_predictions_map = open_objects_to_predictions_map(slug)
    object_names, vehicle_indices, objectwise_keypoints = filter_bad_tire_pairs(
    speed_collection = {}
    for vehicle_index in vehicle_indices:  # Looping by all objects within the video
        approximate_speed = -1
        object_name = object_names[vehicle_index]
        coef, bias, information = straight_line_fit(objectwise_keypoints, object_name)
        ) = fill_missing_keypoints(coef, bias, information)

        vehicle_speed = []
        skipped = 0

        back_tire_keypoints = [back_tire_x_list, back_tire_y_list]
        back_tire_keypoints = [list(x) for x in zip(*back_tire_keypoints[::-1])]
        front_tire_keypoints = [front_tire_x_list, front_tire_y_list]
        front_tire_keypoints = [list(x) for x in zip(*front_tire_keypoints[::-1])]

        back_tire_x_list = []
        back_tire_y_list = []
        front_tire_x_list = []
        front_tire_y_list = []
        # Speed calculation algorithm starts here...
        vehicle_speed = {}
        total_num_points = len(objectwise_keypoints[object_name])
        object_start = objects_to_predictions_map[vehicle_index]["segments"][0][0]
        for i in range(total_num_points):
            back_tire = back_tire_keypoints[i]
            front_tire = front_tire_keypoints[i]
            if back_tire[0] < 0 or front_tire[0] < 0:
                # Missing keypoint: record the placeholder speed and skip this frame.
                vehicle_speed[i + object_start] = approximate_speed
                skipped += 1
                continue
            for j in range(i, total_num_points):
                future_back_tire = back_tire_keypoints[j]
                if future_back_tire[0] < 0:
                    # Body elided in the source; skipping this frame is assumed.
                    continue
                back_tire_x = back_tire[0]
                back_tire_y = back_tire[1]
                front_tire_x = front_tire[0]
                front_tire_y = front_tire[1]
                future_back_tire_x = future_back_tire[0]
                future_back_tire_y = future_back_tire[1]
                # d: distance between the current back and front tires.
                current_keypoints_distance = get_distance(
                    back_tire_x, back_tire_y, front_tire_x, front_tire_y
                )
                # Distance the back tire has travelled by frame j.
                future_keypoints_distance = get_distance(
                    back_tire_x, back_tire_y, future_back_tire_x, future_back_tire_y
                )
                if (
                    current_keypoints_distance - future_keypoints_distance
                ) >= -SpeedConfig.distance_error_threshold and (
                    current_keypoints_distance - future_keypoints_distance
                ) <= SpeedConfig.distance_error_threshold:
                    alpha = get_angle(
                        back_tire_x, back_tire_y, future_back_tire_x, future_back_tire_y
                    )
                    beta = get_angle(
                        back_tire_x, back_tire_y, front_tire_x, front_tire_y
                    )
                    if (
                        SpeedConfig.in_between_angle >= alpha + beta
                        and (j - i) > 1
                    ):
                        approximate_speed = round(
                            ...  # leading factor elided in the source
                            * SpeedConfig.WHEEL_BASE
                            / frames_to_seconds(30, j - i)
                        )
                        vehicle_speed[i + object_start] = approximate_speed
        speed_collection[vehicle_index] = vehicle_speed
    # Use a context manager so the log file is flushed and closed.
    with open(SpeedConfig.json_directory / slug / "speed_log.json", "w") as f:
        json.dump(speed_collection, f)
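
The final expression in the listing divides a real-world wheelbase by the elapsed time. Here is a worked example of that arithmetic, assuming a hypothetical wheelbase of 2.7 m, output in miles per hour, and a `frames_to_seconds(fps, n_frames)` helper that simply divides the frame count by the frame rate. The project's own helper and its `SpeedConfig.WHEEL_BASE` value are not shown in this post, so treat these as stand-ins.

```python
MPS_TO_MPH = 2.23694  # metres per second to miles per hour


def frames_to_seconds(fps, n_frames):
    """Elapsed time in seconds for n_frames at the given frame rate (assumed form)."""
    return n_frames / fps


def speed_mph(wheelbase_m, fps, frame_gap):
    """Speed of a vehicle that covers one wheelbase in frame_gap frames."""
    return MPS_TO_MPH * wheelbase_m / frames_to_seconds(fps, frame_gap)


# Compare a few frame gaps at 30 fps for the hypothetical 2.7 m wheelbase.
for gap in (2, 3, 4):
    print(gap, round(speed_mph(2.7, 30, gap)))
```

With these numbers, gaps of 2, 3, and 4 frames come out at roughly 91, 60, and 45 mph. Because the frame gap is an integer, the estimate is quantized, and the steps get coarser as vehicles get faster.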

The instantaneous speed calculated by the algorithm is saved to a JSON file called speed_log.json, which keeps track of the instantaneous speed of each detected vehicle along with the corresponding frames. The detected speed is also printed on each video frame, and all the video frames are put back together to produce the annotated video.

After iterating by all of the frames, we are able to use the velocity log JSON file to calculate normal statics resembling the utmost velocity, quickest car, and common velocity of each car to create a report of the site visitors within the video feed.


Let’s summarize what we did today. We built a computer vision system that can estimate the speed of vehicles in a given video. To estimate the speed of a vehicle, we needed to know the locations of its tires in every frame in which it appeared. For that, we needed to perform three main tasks.

  1. Detecting vehicles
  2. Tracking the vehicles and assigning unique identities to them
  3. Predicting the keypoints for every vehicle in every frame

After acquiring keypoint locations for every vehicle, we developed a geometric rule-based algorithm to estimate the number of frames it takes for the back tire of a vehicle to reach the position of its corresponding front tire. With that information in hand, we can approximate the speed of that vehicle.

Before we wrap up our project, you can check out the complete implementation of the code here. This code would have been more complex if it weren’t for the Sparrow packages, so make sure to check them out as well. You can find me via LinkedIn if you have any questions.

