AI Sees, Hears, and Creates — The Multimodal Present
The era of text-only AI is over. AI now understands images, listens to audio, and creates video. Here's how multimodal AI is being used in real work.
The era of AI only reading and writing is over. Now AI sees, hears, and creates.
The Text Wall Has Fallen
Just two years ago, AI was trapped in the world of text. Reading text, writing text, answering in text. Need an image? Use a separate image AI. Need audio? A separate audio AI. Different tools, different methods, different costs.
In 2025, this wall fell.
A single AI reads text while simultaneously viewing images, listening to audio, and understanding video. And beyond answering in text, it creates images, generates speech, and edits video.
This is called multimodal AI. AI that handles multiple senses (modalities) simultaneously.
What This Means for Real Work
More important than the technical explanation is how this changes the way we work.
A single photo becomes a report. Take a photo of a facility on-site, send it to AI, and it analyzes the image, assesses condition, and drafts an inspection report. "Corrosion traces observed at pipe junction #3. Replacement recommended within 6 months." The time spent translating photos into text disappears.
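Under the hood, "send a photo to AI" usually means attaching the image to an API request. Here is a minimal sketch of that packaging step, assuming an OpenAI-style multimodal chat payload (the message shape is the common pattern; check your provider's docs, and note an actual call would still need a client and API key):

```python
import base64


def image_message(image_bytes: bytes, question: str) -> dict:
    """Package an image plus a question in the content format used by
    OpenAI-style multimodal chat APIs (an assumed shape for illustration)."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }


# Fake bytes standing in for a real JPEG of the on-site photo
msg = image_message(
    b"\xff\xd8\xff",
    "Assess the condition of this pipe junction and draft an inspection note.",
)
print(msg["content"][0]["text"])
```

The point is that the image and the instruction travel in one request, which is what lets a single model both look at the photo and write the report.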
Meetings end with everything organized. Hand the meeting audio to AI and it converts speech to text, summarizes the key points, extracts action items, and pulls together related materials. Post-meeting follow-up shrinks from hours to 5 minutes.
Describe it and a design appears. Say "create a banner image with warm autumn tones, our product centered" and multiple drafts appear in seconds. Pick the one you like, say "add the logo top-left," and the change appears instantly. People who aren't designers can now produce visual content.
AI understands video content. Show a product demo video to AI and it identifies what's happening in each scene, generates subtitles, and timestamps key moments. Ask it to "extract only the scenes explaining core features" from a 20-minute video, and it finds those segments.
Real Applications by Industry
Manufacturing: AI analyzes production-line CCTV in real time. When signs of a defect appear, it sends an immediate alert, replacing the person who used to watch the monitors.
Healthcare: AI analyzes medical imaging and flags anomalies. Doctors focus their review on AI-flagged areas. Diagnostic accuracy rises, reading time drops.
Real Estate: Upload property photos and AI analyzes the space, estimates area, evaluates condition, and auto-generates listing descriptions. Agent listing time drops dramatically.
Education: Feed lecture video to AI and it divides by topic chapters, summarizes key concepts, and even auto-generates quiz questions. Educational content production costs drop significantly.
Marketing: From a single product photo, mass-generate marketing images across different backgrounds, angles, and seasons. Create seasonal campaign visuals without photo shoots.
Quality Today — How Far Can You Go?
Honestly, multimodal AI quality varies significantly by domain.
Already production-ready:
- Image understanding and analysis
- Speech recognition and transcription
- Text-to-image generation (for marketing, social media)
- Visual document interpretation (reading graphs, tables, charts)
Usable but needs review:
- Fine detail accuracy in generated images
- Speaker identification in long audio
- Complex video content summarization
Still at assistant level:
- Video generation (short clips possible but quality varies)
- Real-time video analysis accuracy
- Nuanced emotion in voice generation
The key point: even short of perfection, the value of "a draft that saves human time" is already there. Set up a workflow where AI drafts and humans refine, and you can see results immediately.
What Does It Cost?
Multimodal AI costs are dropping rapidly.
Image analysis: Included in ChatGPT Plus and Claude Pro subscriptions at no additional cost. (Image generation is built into ChatGPT; Claude analyzes images but does not generate them.)
Speech recognition: Services processing dozens of hours for tens of dollars monthly. Free tools also exist.
Video analysis: Still relatively expensive, but short video (under 5 minutes) analysis possible within subscriptions.
Image/video generation: Basic generation included in subscriptions. High volume or quality incurs additional costs.
For SMBs, you can often use multimodal features within your existing AI subscription ($20-30/month). Meaning you can start without significant additional investment.
How to Start
Here is the fastest way to apply multimodal AI to your own work.
Step 1: Start with AI you already use. Upload images or use voice mode in ChatGPT or Claude. No new tools needed — just activate multimodal features in your existing tools.
Step 2: Find "conversion" tasks. Turning photos into text, audio into text, text into images — if your work includes these "conversion" tasks, that's your first multimodal application point.
Step 3: Verify quality, then expand scope. Start with internal use only while checking AI output quality. Once satisfied, expand to external use (customers, partners).
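The "conversion task" framing in Step 2 can be made concrete with a small triage helper: given a file, decide which multimodal conversion applies. The task names and routing below are illustrative only, not any tool's real API:

```python
import mimetypes

# Illustrative mapping from broad media type to the "conversion" tasks
# described above (photo -> text, audio -> text, video -> summary, brief -> image).
TASKS = {
    "image": "image-to-text (describe / draft a report)",
    "audio": "speech-to-text (transcribe / summarize)",
    "video": "video-to-summary (chapters / key moments)",
    "text": "text-to-image (generate visuals from a brief)",
}


def conversion_task(filename: str) -> str:
    """Guess the media type from the filename and pick a conversion task."""
    mime, _ = mimetypes.guess_type(filename)
    family = (mime or "").split("/")[0]
    return TASKS.get(family, "no obvious conversion task")


for f in ["site_photo.jpg", "standup.mp3", "demo.mp4", "brief.txt"]:
    print(f, "->", conversion_task(f))
```

A quick pass like this over the files your team touches in a week is a cheap way to spot your first multimodal application points.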
End of the Text Era, Start of the Sensory Era
When AI only handled text, AI applications were limited to "text-based work." Multimodal AI breaks this limit.
Taking photos on-site, having conversations in meetings, creating designs, editing video — AI can now participate in all these domains.
In the next article, we'll discuss the final piece — AI writing code and building apps, opening an era where non-developers can create their own tools.
AI no longer just reads and writes. It sees, hears, and creates.