When AI Sees and Speaks — The Rise of Vision-Language Models
A high-level view of how modern vision-language models connect pixels and prose, from CLIP and BLIP to Flamingo, MiniGPT-4, Kosmos, and Gemini.
All the articles with the tag "computer-vision".
A clear introduction to diffusion and guided diffusion — how a simple physical process became a foundation for modern generative AI, from Stable Diffusion to robotics and protein design.
The evolution of image diffusion models from U-Net architectures to Diffusion Transformers (DiT). Covers latent diffusion, the DiT revolution, and the complete image generation pipeline.
Deep dive into state-of-the-art video generation models: Sora, Veo 3, and Open-Sora. Plus motion modeling techniques using optical flow, geometry, and diffusion fields.
How to accelerate diffusion sampling and control output quality. Covers DDIM, DPM-Solver, Classifier-Free Guidance (CFG), negative prompting, and inference optimization techniques.
Why video is harder than images, the DiT revolution for video, and how diffusion models learn temporal consistency. Covers V-DiT, AsymmDiT, and the mathematical foundations of video generation.