Unveiling the Deception: Google's Most Impressive Gemini Demo Was Faked
Google's latest Gemini AI model made its highly anticipated debut yesterday to a mixed reception. Now the company's technology, and its integrity, face fresh scrutiny: it has come to light that the most impressive demo of Gemini was largely fabricated.
A captivating video titled "Hands-on with Gemini: Interacting with multimodal AI" racked up a million views within a day, and its popularity is understandable. The demo showcased a range of interactions with Gemini, highlighting the model's ability to combine language and visual understanding in a flexible, responsive way.
The video opens with Gemini narrating the evolution of a simple squiggle into a detailed drawing of a duck, albeit in an unrealistic color, then expressing surprise ("What the quack!") upon encountering a blue toy duck. The demo proceeds with Gemini responding to various voice queries about the toy, and goes on to show other impressive feats: tracking a ball in a cup-switching game, recognizing gestures in shadow puppetry, rearranging sketches of planets, and more.
Gemini's responsiveness is particularly striking, although the video does acknowledge that some segments were altered for brevity and to reduce latency, so occasional hesitations and overly long answers have been edited out. All told, it was a genuinely awe-inspiring display of prowess in multimodal understanding. My own skepticism about Google's ability to deliver a competitive AI model wavered after watching this hands-on presentation.
There is just one problem: while Gemini does appear to have generated the responses shown in the video, the speed, accuracy, and basic mode of interaction with the model are misrepresented, leaving viewers misled.
For example, at 2:45 in the video, a hand silently performs a series of gestures, and Gemini promptly responds, "I know what you're doing! You're playing Rock, Paper, Scissors!" But the accompanying documentation makes clear that the model does not reason from the individual gestures. In reality, all three gestures must be shown at once, together with the prompt, "What do you think I'm doing? Hint: it's a game." Only then does Gemini respond, "You're playing rock, paper, scissors."
These interactions do not feel equivalent; they seem fundamentally different. One is an intuitive evaluation that captures an abstract idea effortlessly; the other is a contrived, heavily guided interaction that reveals as many limitations as capabilities. Gemini demonstrated the latter, not the former. The "interaction" depicted in the video did not actually occur.
Likewise, when three sticky notes doodled with the Sun, Saturn, and Earth are placed on a surface, Gemini is asked in the video, "Is this the correct order?" It promptly gives the correct answer. But the genuine written prompt asks, "Is this the right order? Consider the distance from the sun and explain your reasoning." Did Gemini genuinely get it right, or did it need coaching to produce an answer suitable for the video? Did it even recognize the planets, or did it need help with that as well?
Similarly, in the video a ball of paper is swapped under a cup, and Gemini seemingly detects and tracks the movement instantly. In the accompanying post, however, not only does the activity have to be explained, but the model also has to be taught (albeit quickly, and in natural language) to perform it. The examples go on.
These instances may seem trivial at first glance. After all, recognizing hand gestures as a game with such speed is genuinely impressive for a multimodal model, as is making a judgment call on whether an incomplete picture depicts a duck. But since the blog post offers no explanation of the duck sequence at all, doubts now extend to the authenticity of that interaction as well.
Had the video explicitly stated, "This is a stylized representation of interactions our researchers tested," few would have questioned it; we often expect such videos to blend fact and aspiration. But the video is titled "Hands-on with Gemini," and when it claims to present "our favorite interactions," it implies that the interactions depicted are authentic. They were not. Some were embellished, some were entirely different, and some appear not to have happened at all. The video also never specifies which model it represents: the currently available Gemini Pro, or the Ultra version scheduled for release next year.
Should we have assumed that Google was merely offering a glimpse of what to expect, given how they described it? Perhaps we should now assume that all capabilities shown in Google AI demos are exaggerated for dramatic effect. In the headline, I assert that this video was "faked." Initially I questioned whether such strong language was warranted, and Google's spokesperson requested a change. But despite containing genuine elements, the video simply does not reflect reality. It is, indeed, fake.
Google asserts that the video "shows real outputs from Gemini," which is technically true. But the claim that only a few edits were made to the demo, and disclosed transparently, is misleading. This was not a genuine demo, and the interactions shown in the video were substantially different from the real ones created to inform it.