Gemini API (gemini-2.5-flash-preview-05-20) for Co-Speech Gesture Annotation
Intro
Gemini 2.5 Flash is one of Google's newer multimodal models and can be accessed for free through Google AI Studio. This blog post details how I built a simple Python pipeline that takes videos from the Distributed Little Red Hen Lab's 30_Videos dataset of the Ellen DeGeneres Show and returns a description of the actor's movement along with a co-speech gesture category. I started from the script I wrote for my gemini-2.0-flash-001 trials and adapted it to work with Gemini 2.5. Unlike Gemini 2.0, Gemini 2.5 can use file names in its analysis and can generate gesture-onset timestamps; these differences required some adaptations from the 2.0 trials, detailed in Methods. The version tested is a preview released on 5/20/2025, the most current version of the model publicly available through the Gemini API with a large enough call quota to make testing possible.
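The full script isn't reproduced here, but the core of the pipeline is a single upload-and-generate call per clip. The sketch below assumes the google-genai Python SDK; the API key placeholder, helper name, and wait loop are illustrative rather than the exact code used in the trials.

```python
import time
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

def annotate_clip(path: str, prompt: str) -> str | None:
    """Upload one clip and ask the model for a gesture annotation."""
    video = client.files.upload(file=path)
    # Uploaded videos need a moment to process before they can be referenced.
    while video.state.name == "PROCESSING":
        time.sleep(2)
        video = client.files.get(name=video.name)
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=[video, prompt],
    )
    return response.text
```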
Video Annotation
The videos used for this testing were 0.7–5.6 s clips from the Ellen DeGeneres Show, adapted from the 30_Videos dataset. Videos were cropped down from 3–7 minutes to shorter clips containing one actor performing one gesture (or sequence of gestures). The dataset was originally intended for testing the capabilities of the visual transformers ViViT and TubeViT, but it works well as a test set for evaluating Gemini's capabilities on the same task. Each new clip was renamed to reflect the original video name and the sequence from which it was taken, then colloquially labeled with the gesture it contains, loosely in line with the ELAN annotations done by Dr. Peter Uhrig on the original 30_Videos dataset. Unlike Gemini 2.0, Gemini 2.5 is capable of reading and incorporating file names into its analysis, so all content labels had to be removed for these trials.
Naming Conventions Example:
Original video: 11-11.mp4
Cropped video: 11-11-1_shrug.mp4
Adaptation for Gemini 2.5: 11-11-1.mp4
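Stripping the labels amounts to one rename per file. A small helper along these lines (my reconstruction, with an assumed directory layout) does the job:

```python
import re
from pathlib import Path

def strip_gesture_labels(directory: str) -> None:
    """Rename e.g. 11-11-1_shrug.mp4 -> 11-11-1.mp4 so the gesture label is hidden."""
    for clip in Path(directory).glob("*.mp4"):
        cleaned = re.sub(r"_[^.]+(?=\.mp4$)", "", clip.name)
        if cleaned != clip.name:
            clip.rename(clip.with_name(cleaned))
```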
Methods
Because the model is still a preview, only 500 calls to Gemini 2.5 per day are available without incurring fees. Due to this limit, and to avoid wasting compute, coherency trials of 5 videos were conducted to confirm valid output before the full trial of 99 videos was run. The prompt was kept the same as in the 2.0-flash-001 trials so that the resulting data would be comparable.
Prompt: Please analyze the co-speech gesture in this video in two sections:
1) The action being performed (visual analysis)
2) The co-speech gesture category the gesture belongs to (beat, deictic, iconic, etc.), with timestamps in MM:SS.MS for gesture onset. If a gesture is iconic or metaphoric, please provide a description or subcategory.
Context: PDF of McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought, Chapter 3 (pp. 75–104). University of Chicago Press.
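For reference, here is the same prompt as a Python constant together with one assumed way of attaching the McNeill chapter PDF as context alongside each clip. The client setup mirrors the intro sketch, and the PDF file name is a placeholder, not the exact path used.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

PROMPT = (
    "Please analyze the co-speech gesture in this video in two sections:\n"
    "1) The action being performed (visual analysis)\n"
    "2) The co-speech gesture category the gesture belongs to (beat, deictic, "
    "iconic, etc.), with timestamps in MM:SS.MS for gesture onset. If a gesture "
    "is iconic or metaphoric, please provide a description or subcategory."
)

# The chapter PDF only needs to be uploaded once and can be reused across clips.
mcneill_pdf = client.files.upload(file="McNeill_1992_ch3.pdf")  # placeholder name

def annotate_with_context(video_path: str) -> str | None:
    video = client.files.upload(file=video_path)
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=[mcneill_pdf, video, PROMPT],  # context PDF + clip + prompt
    )
    return response.text
```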
The first coherency trial failed because gemini-2.5 was able to access, read, and base its analysis on the file names. After this, all gesture labels were deleted from the file names, leaving only each file's numerical designation. The next trial failed because 3 of the 5 test files did not return a coherent output. Across repeated 5-video trials, gemini-2.5 returned no coherent response far more often than gemini-2.0: not every model call produced any output at all. Some videos were more likely to receive a coherent output than others, but on average each file received an annotation only about 70% of the time, and only about 60% of those annotations were complete, since the model also tended to drop off mid-sentence. Because the model began spontaneously adding timestamps for gesture onset, the timestamp request was added to the prompt for consistency; incomplete annotations usually left the timestamps off.
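This made the presence of an MM:SS.MS timestamp a convenient proxy for a complete response in the reruns described below. The check here is my reconstruction of that criterion, not the exact code from the script:

```python
import re

# MM:SS.MS, e.g. "00:01.48"; the number of millisecond digits is an assumption.
TIMESTAMP_RE = re.compile(r"\b\d{2}:\d{2}\.\d{1,3}\b")

def is_coherent(annotation: str | None) -> bool:
    """Treat a response as coherent only if it is non-empty and timestamped."""
    return bool(annotation) and TIMESTAMP_RE.search(annotation) is not None
```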
Due to this inconsistency, the script was rewritten with a few provisions for the main trial that were not necessary for the gemini-2.0 runs (a sketch of both follows the list):
1) After the initial attempt, the script recorded the files that either returned no output or returned an output with no timestamp in the MM:SS.MS format, retried those files, and recorded the new processing time and the number of retries required to obtain a coherent output.
2) The script tracked the number of model calls made and, once the count reached 450 (50 under the hard limit), automatically shut down, printing all coherent outputs along with a list of the files that had not yet received one.
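Put together, the two provisions amount to a retry queue with a call budget. The following is a sketch under the assumptions above, reusing the illustrative annotate_with_context and is_coherent helpers from the earlier snippets; the 450-call cap and the retry/time bookkeeping follow the description, but the structure is mine, not the original script.

```python
import time

MAX_CALLS = 450  # stop 50 calls under the 500/day limit

def run_trial(video_paths: list[str]) -> None:
    results: dict[str, dict] = {
        p: {"output": None, "retries": 0, "seconds": 0.0} for p in video_paths
    }
    queue = list(video_paths)
    calls = 0
    while queue and calls < MAX_CALLS:
        path = queue.pop(0)
        start = time.time()
        annotation = annotate_with_context(path)  # from the sketch above
        calls += 1
        results[path]["seconds"] += time.time() - start
        if is_coherent(annotation):
            results[path]["output"] = annotation
        else:
            results[path]["retries"] += 1
            queue.append(path)  # retry later in the queue
    # Print everything collected so far, then list the files still missing output.
    for path, record in results.items():
        if record["output"]:
            print(path, record["retries"], f'{record["seconds"]:.1f}s')
            print(record["output"])
    missing = [p for p, r in results.items() if not r["output"]]
    print("No coherent output:", missing)
```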
After these provisions were added, a full 99-video test was run, and 98 coherent responses were achieved. As of today (5/29/2025) only one file lacks valid output; it will be re-run tomorrow once the model quota has reset. This 2.3 s file has accumulated 157 retries and roughly 40 minutes of total processing time without producing a valid output. Confusingly, the gesture is a beat and (from a human annotator's perspective) fairly average in duration and content. Perhaps the model does not like Bill Clinton, as many of the videos containing him cause issues for gemini-2.5.
Results
… will be added after I finish validation