This project develops an AI-based hand gesture recognition system for controlling a drone in real time. By leveraging MediaPipe for hand tracking and a custom CNN (Convolutional Neural Network) for classification, the system translates hand gestures into drone commands (e.g., takeoff, land, movement) without the need for a traditional controller.
The system features a hybrid pipeline that combines precise hand landmark detection with specialized binary image classification to ensure robust performance across various lighting conditions and backgrounds.
- Dual-Model Pipeline: Uses MediaPipe for hand localization and a custom TFLite model for gesture classification.
- Binary Processing: Converts hand crops to black-and-white (binary) to focus on morphology rather than skin tone.
- Apple Silicon Optimized: Performance-tuned for Apple M5 chips using XNNPACK delegates.
The model was trained for 20 epochs using a CNN architecture:
- Training Accuracy: 98.41%
- Validation Accuracy: 99.93%
- Input Format: 96x96 Grayscale Binary Image
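The exact layer stack is not listed here, but a compact CNN for 96x96x1 binary inputs could be sketched as follows (layer sizes and the `build_model` name are illustrative assumptions, not the trained architecture):

```python
import tensorflow as tf

# Illustrative only: a compact CNN for 96x96x1 binary hand images.
# The actual trained architecture may differ; layer sizes are assumptions.
def build_model(num_classes: int = 20) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(96, 96, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```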
The model is trained on a specialized version of the Hand Gesture Recognition Dataset.
- Source: Hand Gesture Recognition Dataset (Kaggle)
- Total Images: 24,000 (18,000 Train / 6,000 Test)
- Classes: 20 distinct gesture categories (0-19)
- Input Specs: 96x96 pixels, Grayscale (Single Channel)
The following workflow describes the real-time inference process:
```
Camera Input (RGB)
        ↓
MediaPipe Hand Landmarker (Hand Localization)
        ↓
Crop & Preprocess (Grayscale + Otsu's Thresholding)
        ↓
Custom CNN Model (96x96x1 TFLite)
        ↓
Gesture Classification (20 Classes)
        ↓
Drone Command Mapping
```
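The crop-and-preprocess stage can be sketched with OpenCV as below; `preprocess_hand_crop` is a hypothetical helper name, but the grayscale conversion plus Otsu's thresholding follows the pipeline above:

```python
import cv2
import numpy as np

def preprocess_hand_crop(crop_bgr: np.ndarray) -> np.ndarray:
    """Sketch of the grayscale + Otsu step; the function name and
    normalization scheme are assumptions, not the project's exact API."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    # Otsu's method picks the threshold automatically, which helps the
    # binary image stay stable across varied lighting conditions.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    resized = cv2.resize(binary, (96, 96))
    # Scale to [0, 1] and add batch/channel dims: shape (1, 96, 96, 1).
    return resized.astype(np.float32)[None, ..., None] / 255.0
```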
| Gesture (ID) | Command | Description |
|---|---|---|
| OK (0) | Takeoff | Start the motors and hover. |
| Fist (11) | Land | Secure landing at current position. |
| Point (10) | Forward | Move the drone forward. |
| Rock (17) | Flip | Perform a 360 degree stunt flip. |
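In code, this mapping can be as simple as a dictionary keyed by predicted class ID (a sketch; the actual mapping in `gesture_detection.py` may use different names):

```python
# Illustrative mapping from predicted class ID to drone command.
# IDs follow the table above; the real script may differ.
GESTURE_COMMANDS = {
    0: "takeoff",   # OK
    11: "land",     # Fist
    10: "forward",  # Point
    17: "flip",     # Rock
}

def command_for(class_id: int) -> str | None:
    """Return the command for a class ID, or None for unmapped gestures."""
    return GESTURE_COMMANDS.get(class_id)
```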
- Python 3.11: Core development language.
- MediaPipe: For high-fidelity hand landmark detection.
- TensorFlow / Keras: Used for training the CNN classifier.
- TensorFlow Lite: For lightweight, real-time edge inference.
- OpenCV: For advanced image preprocessing and binary thresholding.
- XNNPACK: Optimized CPU inference for Apple M5 (a minimal model-loading sketch follows this list).
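Loading the TFLite classifier on CPU might look like the sketch below; recent TensorFlow Lite builds apply the XNNPACK delegate automatically, and the model filename is an assumption:

```python
import numpy as np
import tensorflow as tf

# The model filename is an assumption for illustration; recent TFLite
# builds enable the XNNPACK delegate on CPU by default.
interpreter = tf.lite.Interpreter(model_path="gesture_model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def classify(binary_image: np.ndarray) -> int:
    """Expects a (1, 96, 96, 1) float32 array; returns the class ID."""
    interpreter.set_tensor(input_details[0]["index"], binary_image)
    interpreter.invoke()
    probs = interpreter.get_tensor(output_details[0]["index"])[0]
    return int(np.argmax(probs))
```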
Create and activate the virtual environment, then install the dependencies:

```bash
# Create the virtual environment (only needed once)
python3 -m venv venv_detect

# Activate the virtual environment
source venv_detect/bin/activate

# Upgrade pip and install the required packages
pip install --upgrade pip
pip install mediapipe tensorflow opencv-python numpy
```

Before running the detection, you must download the official MediaPipe model:
- Hand Landmarker Bundle: Download the `hand_landmarker.task` file from the MediaPipe Official Models page.
- Placement: Ensure `hand_landmarker.task` is placed in the project root directory.
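As a sanity check that the bundle loads, initialization with the MediaPipe Tasks Python API looks roughly like this (a sketch, not necessarily the project's exact setup):

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

# Load the downloaded bundle from the project root.
options = vision.HandLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="hand_landmarker.task"),
    num_hands=1,
)
landmarker = vision.HandLandmarker.create_from_options(options)

# Usage on a single RGB frame (frame_rgb: HxWx3 uint8 array):
# mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
# result = landmarker.detect(mp_image)
```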
To retrain the classifier using the grayscale binary approach:

```bash
python train_model.py
```

Execute the main detection script:

```bash
python gesture_detection.py
```

- Drone SDK Integration: Connecting the command outputs to DJI Tello or ArduPilot.
- 3D Gesture Tracking: Utilizing Z-axis data from MediaPipe for altitude control.
- Robustness: Adding more background-noise augmentation to the binary training set.
The trained TFLite model is designed to be integrated into drone systems using two primary architectural approaches:
This is the most accessible method for drones like the DJI Tello.
- Workflow: The drone streams live video over Wi-Fi to a laptop (Ground Station). The laptop runs `gesture_detection.py` on its CPU/GPU (optimized for Apple Silicon).
- Command Transmission: Recognized gestures are translated into SDK commands (e.g., `tello.takeoff()`, `tello.land()`) and sent back to the drone over the same Wi-Fi network.
- Tools: The `djitellopy` library for Python-based drone control (a minimal loop is sketched after this list).
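A minimal ground-station loop with `djitellopy` might look like the following; `gesture_to_command` is a hypothetical stand-in for the MediaPipe + CNN pipeline described above:

```python
from djitellopy import Tello  # pip install djitellopy

def gesture_to_command(frame):
    """Placeholder for the MediaPipe + CNN pipeline described above;
    should return a command string such as "takeoff" or "land"."""
    return None

tello = Tello()
tello.connect()
tello.streamon()
try:
    while True:
        frame = tello.get_frame_read().frame  # latest BGR frame from the drone
        command = gesture_to_command(frame)
        if command == "takeoff":
            tello.takeoff()
        elif command == "land":
            tello.land()
            break
finally:
    tello.streamoff()
    tello.end()
```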
For autonomous or semi-autonomous drones (e.g., custom builds with Pixhawk or Betaflight).
- Hardware: Mounting a lightweight companion computer such as a Raspberry Pi 4 or NVIDIA Jetson Nano on the drone.
- Efficiency: Since the model is in .tflite format, it is highly optimized for these edge devices.
- Communication: The companion computer processes the camera feed locally and sends MAVLink commands to the Flight Controller (FC) via a serial connection.
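Sending a command from the companion computer could look like this `pymavlink` sketch; the serial device, baud rate, and flight-mode prerequisites are assumptions for a typical Raspberry Pi UART link:

```python
from pymavlink import mavutil  # pip install pymavlink

# Serial device and baud rate are assumptions for a Pi-to-FC UART link.
master = mavutil.mavlink_connection("/dev/serial0", baud=57600)
master.wait_heartbeat()  # confirm the flight controller is responding

# Example: request a takeoff to 1 m. This assumes the FC is armed and in
# a mode that accepts MAV_CMD_NAV_TAKEOFF (e.g., GUIDED on ArduPilot).
master.mav.command_long_send(
    master.target_system, master.target_component,
    mavutil.mavlink.MAV_CMD_NAV_TAKEOFF,
    0,            # confirmation
    0, 0, 0, 0,   # params 1-4 (unused here)
    0, 0, 1.0,    # latitude, longitude, altitude (m)
)
```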
| Gesture (Folder ID) | Drone Command | Action Description |
|---|---|---|
| Folder_5 | Takeoff | Start the motors and hover at 1m altitude. |
| Folder_11 | Land | Perform a controlled vertical landing. |
| Folder_0 | Stop / Hover | Stop all movement and hover in place. |
| Folder_1 | Move Up | Increase drone altitude (Fly Upward). |
| Folder_12 | Move Down | Decrease drone altitude (Fly Downward). |
- Latency Control: Optimizing the Wi-Fi video stream to minimize command delay.
- Safety Interlocks: Implementing a "Command Confirmation" logic (e.g., a gesture must be held for 0.5 s) to prevent accidental maneuvers; a sketch follows this list.
- Dynamic Lighting: Enhancing binary thresholding stability for outdoor environments.
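A minimal version of the hold-to-confirm interlock mentioned above could look like this (the 0.5 s threshold follows the bullet; the class itself is a sketch, not the project's code):

```python
import time

# Sketch of the "hold to confirm" interlock: a gesture only fires its
# command after being detected continuously for HOLD_SECONDS.
HOLD_SECONDS = 0.5

class GestureConfirmer:
    def __init__(self) -> None:
        self._current: int | None = None
        self._since: float = 0.0

    def update(self, class_id: int | None) -> int | None:
        """Feed the latest prediction each frame; returns the class ID
        once it has been held for HOLD_SECONDS, else None."""
        now = time.monotonic()
        if class_id != self._current:
            # The prediction changed: restart the hold timer.
            self._current, self._since = class_id, now
            return None
        if class_id is not None and now - self._since >= HOLD_SECONDS:
            self._since = now  # re-arm so the command does not repeat every frame
            return class_id
        return None
```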