
Let's Build a Network Video Recorder in Python!

I have been rather unhappy with all the existing NVR software out there. It generally needs some crazy text-file-based config; it almost always, for reasons unknown, must be run in Docker; many packages require manual admin work (thanks to completely unnecessary use of real databases); and they are typically limited to *just* CCTV, not taking advantage of the fact that the problem domain is similar to VJ video walls, QR readers, and the like.

Many don't even have low latency streaming!

And worst of all, they often use more CPU than one might like, because they decode and re-encode the video. I wanted something way more hands-off that makes use of on-camera encoding.

TL;DR

You can take a look at the project here! https://github.com/EternityForest/KaithemAutomation

Just go to the web UI, make an NVRChannel device, set permissions on it, fill in your RTSP URL, and add Beholder from the modules library. Beholder finds all your NVRChannels and gives you a nice easy UI with a lot of what you see in the usual NVR apps.

Let's start!

My first step in starting a new project is always to see how I can avoid starting a new project.

I looked into Frigate, Shinobi, BlueCherry, AgentDVR (excellent, but NOT FOSS!!!), ZoneMinder, Moonfire, OS-NVR, etc.

None of these were what I wanted. Unfortunately for my sanity, I had a new personal project idea.

First Steps

I knew I was going to make this a plugin for my existing Kaithem Automation system, for maximum reuse,
but I had zero clue how to do the streaming.

I spent a lot of time looking into WebRTC, but it turned out to just be too much of a nightmare to work with. I briefly tried HLS, but the latency was too high. The project stalled entirely until I found something interesting.

Video over WebSockets! But how was I supposed to do that? What was I supposed to stream? Video files have packets and framing; you can't just start anywhere.

GStreamer

GStreamer is a framework for media processing. It's node-based, so you never touch the content yourself; you just set up processing pipelines. Almost anything you want to do has a node (called an 'element').

It's basically the only media framework of its kind. I use it all the time.

Unfortunately, pipelines can refuse to start, and it is not always obvious which element is constipating the whole line, or why it would do so, and some elements need obscure settings and routing. It can get rather complicated, with elements having dynamic inputs and outputs that appear at runtime.

But it succeeds at turning the deeply mathematical, low-level challenge of dealing with media into just your basic everyday “coding” task. Normally you don't even need to worry about syncing audio and video or any of that. It's mostly pretty good at what it does.
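If you've never used it from Python, here's a rough idea of what driving GStreamer looks like. This is a minimal sketch with an illustrative test pipeline, not the one the project actually uses:

```python
# Minimal GStreamer-from-Python sketch. The pipeline string is just an
# illustration: a test pattern, H.264 encoded, muxed to MPEG-TS, written to a file.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# parse_launch turns a gst-launch-style description into a pipeline of elements.
pipeline = Gst.parse_launch(
    "videotestsrc is-live=true ! x264enc tune=zerolatency "
    "! mpegtsmux ! filesink location=test.ts"
)
pipeline.set_state(Gst.State.PLAYING)

# Block until an error or end-of-stream shows up on the bus, then shut down.
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.ERROR | Gst.MessageType.EOS)
pipeline.set_state(Gst.State.NULL)
```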

MPEG-TS Is Great

Turns out there's a really simple solution to framing streams of video: MPEG-2 transport streams. Every packet is 188 bytes long, so as long as you do things in 188-byte chunks you can start anywhere. It's perfect!

Even better, mpegts.js supports it using Media Source Extensions (MSE)! The web player is done!

Furthermore, HLS uses .ts files linked by a playlist, and a TS file is just a bunch of those 188-byte blocks (it does have to start with a special packet type, but GStreamer handles that).

This means both live and prerecorded playback are essentially solved.

For recording I use the hlssink element in GStreamer, and for the live stream I use filesink with a named pipe (I run all this in a background process, so the appsink element that is actually made for this seemed less than ideal). My server reads the pipe (in 188-byte chunks, of course) and sends the data out over my WebSockets.
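The server side of the WebSocket part is conceptually simple. Here's a rough sketch of the idea, not the project's actual code; the FIFO path, port, and batch size are made up for illustration, and it assumes the `websockets` library:

```python
# Sketch: read MPEG-TS from a named pipe fed by GStreamer's filesink, in whole
# 188-byte packets, and fan it out to connected WebSocket clients.
# The FIFO path, port, and batch size are hypothetical.
import asyncio
import websockets

TS_PACKET = 188
FIFO_PATH = "/dev/shm/camera1.fifo"

clients = set()

async def handler(ws, path=None):
    clients.add(ws)
    try:
        await ws.wait_closed()
    finally:
        clients.discard(ws)

def read_chunks(path, batch=64):
    """Yield batches of whole 188-byte TS packets from the pipe."""
    with open(path, "rb") as f:
        while True:
            data = f.read(TS_PACKET * batch)
            if not data:
                break
            yield data

async def pump():
    loop = asyncio.get_running_loop()
    gen = read_chunks(FIFO_PATH)
    while True:
        # Do the blocking pipe read in a thread so the event loop stays responsive.
        chunk = await loop.run_in_executor(None, next, gen, None)
        if chunk is None:
            break
        for ws in list(clients):
            try:
                await ws.send(chunk)
            except websockets.ConnectionClosed:
                clients.discard(ws)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await pump()

asyncio.run(main())
```

On the browser side, mpegts.js gets pointed at the WebSocket URL and feeds those same chunks into MSE.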

Issues with that

Apparently, iPhones don't do MSE, and can't play h264 via WebSocket. I solved this the same way so many other devs do: by pretending iProducts don't exist. Not ideal, but… it's FOSS. If someone wants an iDevice-friendly streaming mode, they can figure it out themselves, or pay someone to do so.

Also, h264 and MP4 have multiple profiles, and not all of them are supported by MSE. You will get incredibly unhelpful error messages if you do anything wrong here.

Moving Madness

One big problem is motion detection. Since I want this to run multiple HD cameras on a Raspberry Pi,
I can't decode every frame.

To solve this I use GStreamer's identity element to drop delta frames. Most cameras allow configuring the keyframe interval, and to use this system you have to set it to something reasonable.

I don't touch the video stream at all, except for the keyframes, which can be decoded independently and which should be set to happen every 0.5-2 seconds.
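In pipeline terms, the keyframe-only branch looks something like this. It's an illustrative fragment rather than the exact production pipeline, and it assumes a GStreamer new enough to have identity's drop-buffer-flags property:

```python
# Illustrative analysis branch hanging off a tee: drop everything except
# keyframes, decode only those, and hand raw RGB frames to the app.
ANALYSIS_BRANCH = (
    "queue leaky=downstream "
    "! identity drop-buffer-flags=delta-unit "  # delta-unit marks non-keyframes
    "! avdec_h264 "                             # only keyframes ever reach the decoder
    "! videoconvert ! video/x-raw,format=RGB "
    "! appsink name=analysis max-buffers=1 drop=true"
)
```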

I examine just these for motion. But this creates a real response time issue.

To solve this, I record constantly into a RAM disk, in the form of TS segments. When a recording starts, I already have the few seconds preceding the motion event. Response time is less of an issue when you can capture events that happen *before* the motion.

Still, it does decrease efficiency to be unable to use larger keyframe intervals without missing short events. I'll probably look into other solutions eventually.
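For what it's worth, the rolling pre-record buffer is conceptually just hlssink pointed at a tmpfs path with a short segment length and a small rolling window. Something like this fragment, where the paths and numbers are illustrative rather than the real defaults:

```python
# Illustrative pre-record branch: short MPEG-TS segments written to a RAM disk,
# with only a few kept around, so a new recording can grab the seconds before motion.
PRERECORD_BRANCH = (
    "queue "
    "! hlssink "
    "location=/dev/shm/nvr/segment%05d.ts "
    "playlist-location=/dev/shm/nvr/playlist.m3u8 "
    "target-duration=5 "  # seconds per segment
    "max-files=6"         # keep roughly 30 seconds of history
)
```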

While I was at it, I also added QR code reading that can be optionally enabled.

motioncells

GStreamer's motion detection (the motioncells element) wasn't working for me. It seems to be designed for full-rate video and performs very poorly on 0.5fps video.

To solve this I used Pillow and an algorithm with a one-frame memory.

First I take the absolute difference between frames, and erode it using a MinFilter to get rid of tiny noise pixels.

Next I take the average value of this difference and go a little higher; that is the threshold value. In theory this should reject minor lighting changes that are uniform across the whole frame, as well as widely spread out noise. A smarter threshold may be needed to really reject fast-changing lighting.

Next I take the RMS value of the whole frame after applying the threshold. This algorithm prioritizes large changes and closely grouped changed pixels. It reliably detects people even in poor lighting.
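In Pillow terms, the whole thing boils down to a few lines. This is a simplified sketch of the idea rather than the exact production code, and the threshold offset and cutoff are arbitrary illustrative numbers:

```python
# Simplified sketch of the keyframe motion metric described above.
from PIL import Image, ImageChops, ImageFilter, ImageStat

def motion_score(prev: Image.Image, cur: Image.Image) -> float:
    """Return an RMS 'motion' value between two consecutive keyframes."""
    # 1. Absolute per-pixel difference between the two frames (grayscale).
    diff = ImageChops.difference(prev.convert("L"), cur.convert("L"))

    # 2. Erode with a MinFilter so single-pixel noise disappears.
    diff = diff.filter(ImageFilter.MinFilter(3))

    # 3. Threshold a little above the frame's mean difference, so uniform
    #    lighting changes and widely scattered noise mostly cancel out.
    threshold = ImageStat.Stat(diff).mean[0] + 8  # +8 is an arbitrary offset
    diff = diff.point(lambda p: p if p > threshold else 0)

    # 4. RMS of what's left favors large, tightly grouped changes.
    return ImageStat.Stat(diff).rms[0]

# Usage: trigger a recording when the score crosses some tuned cutoff,
# e.g. if motion_score(last_keyframe, keyframe) > 12: start_recording()
```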

Object Detection

I quickly learned that passing cars tripped this. All the time. I really did need object detection.

I knew nothing of machine learning before this, but I knew pretrained models exist, and that people seem to like TensorFlow, kinda.

After the usual round of trying stuff that doesn't work, I settled on EfficientDet-Lite (is MobileDet better?).

I exported it from the AutoML repos and eventually got it working. Turns out integer tflite can be a bit slow on x86, and I wanted this to work on both a RasPi and a desktop, so I went with a floating point model.

These models basically all seem to fall into two categories: people/faces, and COCO-trained. The COCO dataset has 80 classes, including people, cars, handbags, phones, and many other common objects. Good enough! I don't think I have any hardware that could train a new model anyway.

I can do the deep learning inference in about 0.3s, but there's no reason to burn more CPU than needed, and false positives are still an issue. So I only run detection every 10-30 seconds, unless I detect motion.
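The inference itself is the usual TFLite interpreter dance. A sketch of roughly what that looks like follows; the model filename, input preprocessing, and output-tensor ordering are assumptions here, since exported detection models vary:

```python
# Sketch of running a floating-point EfficientDet-Lite style detector with TFLite.
# Model path, preprocessing, and output ordering are assumptions; exports vary.
import numpy as np
from PIL import Image
from tflite_runtime.interpreter import Interpreter  # or tf.lite.Interpreter

interpreter = Interpreter(model_path="efficientdet-lite.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
outs = interpreter.get_output_details()

def detect(frame: Image.Image, min_score=0.4):
    # Resize to the model's expected input and add a batch dimension.
    h, w = inp["shape"][1], inp["shape"][2]
    img = np.asarray(frame.convert("RGB").resize((w, h)))
    interpreter.set_tensor(inp["index"], img[np.newaxis, ...].astype(inp["dtype"]))
    interpreter.invoke()

    # Typical detection postprocess outputs are boxes, classes, scores, count,
    # but the exact order depends on how the model was exported.
    boxes, classes, scores, _ = (interpreter.get_tensor(o["index"])[0] for o in outs[:4])
    return [
        (int(c), float(s), b.tolist())
        for b, c, s in zip(boxes, classes, scores)
        if s >= min_score
    ]
```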

Sadly, I can't detect people across the street with the model I'm using. Where I live, everyone seems to be rather spy-friendly (which I am very happy about) on account of the number of porch pirates, and local groups are always asking “Anyone have cameras on XXX street?”.

Special Effects

Finally, I wanted this to be usable for art installations, so applying effects to live video was important. This was easy to solve: I just used Pixi.JS! All the effects are done in-browser on the display side.

Closing Thoughts

It's all still beta, but it *works*! I'm testing it now, fixing bugs as they come up, and it's already
pretty usable.

The disadvantage? I didn't really build all that much of it. This uses about a dozen dependencies. Aside from the UI and the motion detection algorithm… there's not much original here. And I'll be honest, I don't have a clue how most of it works. I just pieced it together from existing open code and slapped a UI on.

It's fairly performant, but there's nothing lightweight or elegant about it. I have no idea if it would run on non-Debian systems, and it definitely wouldn't work on Windows.

In the future, almost all of it *should* be usable as a standalone library outside of Kaithem, but getting there required adding experimental features to libraries that ought to be separate projects, and all of that needs to be documented and finalized.

A lot of code cleanup is needed, and I'm honestly a little scared of the community's reaction to my dependency list. I still need to add PTZ camera control.

But, as it turns out, getting to a usable point only takes about three weeks of coding, once you find all the pieces. I was expecting these apps to be a lot harder, but it's pretty reasonable… as long as you don't do anything yourself!