


GPT-4’s Giant Leap Ahead in Image and Voice Recognition

The recent enhancements to ChatGPT's GPT-4 model, including its image recognition and voice capabilities, showcase the platform’s evolution towards sensory-like functions. These advances echo the capabilities we once imagined in fictional AIs, like Tony Stark’s J.A.R.V.I.S. in Iron Man, or HAL 9000 in 2001: A Space Odyssey. What was once the purview of science fiction is now edging closer to reality.

By definition, multimodal AI refers to artificial intelligence systems that process and integrate various data sources, such as text, audio and images, to provide a comprehensive and accurate response to a given request. According to the OpenAI website, GPT-4, the most recent version of ChatGPT and a paid platform, can recognize images, understand voice commands and engage in conversations with users. OpenAI describes this technology as exhibiting “human-level performance on various professional and academic benchmarks, while being less capable than humans in many real-world scenarios.” Having experienced it firsthand, I can attest that it comes impressively close to mimicking human-like features.

Yes, ChatGPT Can Now See, Hear and Speak

After reading The New York Times’ analysis and demonstration of ChatGPT Plus by technology columnist and author Kevin Roose—prior to its broad public release—I was once again intrigued by the next wave of sensory-like capabilities that allow the platform to interpret visual cues, decipher auditory signals and engage in dialogue. Without reservation, I subscribed to the premium version of GPT-4 to explore its capacities myself.

My inaugural test was to gauge its ability to describe an evocative photograph I captured during a Rock and Roll retrospective at the Met. This exhibit showcased Keith Moon’s historic drum kit, which was destined for a dramatic demise. I prompted GPT-4 with the query, “What can you tell me about the graphics used on this item?” While the system didn’t identify the object as part of a drum kit (likely owing to the specificity of my question), it distinctly recognized and described the Union Jack flag design, the inscribed text stating “Keith Moon Patent Exploding Drummer,” and offered a concise commentary alluding to Moon’s renowned on- and off-stage unpredictability.


I then wanted to determine the system’s ability to read by interpreting and adapting a page from a cookbook. My inquiry had two objectives: “Can you simplify the steps in this recipe and convert the measures from metric to U.S. measures?” In a matter of seconds, GPT-4 was able to break down part of the recipe with greater simplicity and accurately convert the metric quantities into their U.S. counterparts.
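For readers curious about the arithmetic behind the second half of that request, here is a minimal Python sketch of a metric-to-U.S. conversion. It is only an illustration: the function names and sample quantities are hypothetical (they are not taken from the cookbook page), the factors are standard rounded conversion values, and nothing here is meant to describe how GPT-4 works internally.

```python
# Standard conversion factors, rounded; used here for illustration only.
GRAMS_PER_OUNCE = 28.3495
ML_PER_US_CUP = 236.588


def grams_to_ounces(grams: float) -> float:
    """Convert a weight in grams to avoirdupois ounces."""
    return grams / GRAMS_PER_OUNCE


def ml_to_cups(ml: float) -> float:
    """Convert a volume in millilitres to U.S. customary cups."""
    return ml / ML_PER_US_CUP


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert an oven temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


def convert(quantity: float, unit: str) -> tuple[float, str]:
    """Convert one metric quantity to its U.S. counterpart, rounded to 1 decimal."""
    if unit == "g":
        return round(grams_to_ounces(quantity), 1), "oz"
    if unit == "ml":
        return round(ml_to_cups(quantity), 1), "cups"
    if unit == "C":
        return round(celsius_to_fahrenheit(quantity), 1), "F"
    raise ValueError(f"unsupported unit: {unit}")


if __name__ == "__main__":
    # Hypothetical recipe lines, invented for this example.
    recipe = [("flour", 250, "g"), ("milk", 500, "ml"), ("oven", 180, "C")]
    for name, quantity, unit in recipe:
        value, us_unit = convert(quantity, unit)
        print(f"{name}: {quantity} {unit} is about {value} {us_unit}")
```

The point of the sketch is simply that each conversion is a single multiplication or division; the harder part, which GPT-4 handled from a photo, is reading the quantities and units off the page in the first place.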

When Kevin Roose tasked GPT-4 with breaking down assembly instructions from IKEA, it did so with great clarity, making those sometimes-confusing instructions more digestible.

While the platform conscientiously abstains from facial recognition capabilities to avoid potential misuse, the prospect of soliciting guidance or formulating plans based on imagery or text-based photos is undeniably compelling.

Teaching Siri and Alexa to Speak Like This

This particular test proved both fun and surprisingly authentic. In contrast to the concise replies typical of Siri and Alexa, GPT-4’s conversational voice feature, which is accessible only via the app, sounds strikingly realistic. Drawing inspiration from the same New York Times article and using the “Three Little Pigs” narrative as a foundation, I posed a challenge to GPT-4: could it reinterpret the story in a rap format?

I invite you to click the audio link below to listen to what it came up with. The conversation felt very realistic and I can only imagine it getting even better as new versions of the platform are released. In my estimation, this functionality represents yet another potential paradigm shift. For example, envision harnessing it to synthesize a dictation seamlessly into a polished speech or presentation. Beyond that, consider its application as an AI companion for the elderly, poised to notify family members or emergency responders in the event of unresponsiveness or a critical decline in vital signs.

Furthermore, GPT-4’s adeptness at deciphering images could revolutionize sectors such as healthcare: imagine a physician receiving instant feedback on medical imagery. In the realm of art and culture, museums and galleries could deploy the platform to offer rich, contextual descriptions of exhibits, enhancing visitor experiences. On the educational front, teachers could use its capabilities to instantly generate content summaries or quizzes based on textbook images.

As for voice capabilities, the potential extends into realms such as language learning, where students could engage in realistic dialogues, or in customer service, where businesses could provide an enhanced, human-like interaction without the constraints of human limitations. The intersection of voice and visual capabilities might even pave the way for immersive storytelling, where narratives adapt based on visual cues provided by users.

In the evolving landscape of artificial intelligence, GPT-4 stands out with its enhanced sensory-like functions, bridging the realms of sight, sound and conversation. From interpreting imagery with precision, to converting text into rich narratives, to serving as a potential companion for the elderly, its capabilities are pushing the boundaries of what we once believed possible in the AI sphere. Drawing inspiration from the sci-fi films of yesterday, today’s GPT-4 reminds us that AI’s full potential is still unfolding.

Questions? Please email me here. As always, thank you for reading!