AI Glossary
Beginner? Get familiar with the essential terms and tools behind AI image, video, and audio generation. This glossary is continuously updated to reflect the latest advancements in generative technology.
Image Generation Terms
Short definitions that will help you get better results from tools like GetIMG, Midjourney, Stable Diffusion, or Flux AI.
Prompt: The text instruction that guides image generation. Strong prompts include subject, style, lighting, and emotion.
Negative Prompt: A list of things you don’t want the AI to include (e.g., blurry, extra limbs, deformed face, multiple fingers).
Seed: A numerical value that controls randomness. Same prompt + same seed = same image.
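The idea behind seeds can be sketched with Python’s standard random module; real tools apply the same principle to the initial noise. The fake_generate function below is purely illustrative, not any tool’s actual API:

```python
import random

def fake_generate(seed: int, size: int = 4) -> list[float]:
    # The seed fixes the pseudo-random "initial noise",
    # so the same seed always yields the same values.
    rng = random.Random(seed)
    return [round(rng.random(), 4) for _ in range(size)]

print(fake_generate(42) == fake_generate(42))  # True: same seed, same result
print(fake_generate(42) == fake_generate(43))  # False: a new seed changes everything
```

This is why sharing a prompt together with its seed lets someone else reproduce your exact image in the same tool and model version.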
CFG Scale (Classifier-Free Guidance): Controls how strictly the AI sticks to your prompt. Higher CFG = closer to your exact instructions.
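Under the hood, classifier-free guidance combines two noise predictions at each step: one made with the prompt and one without. A minimal sketch of that combination formula, with illustrative values standing in for the model’s predictions:

```python
def cfg_combine(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    # Classifier-free guidance: start from the unconditional prediction
    # and push it toward the prompt-conditioned one, scaled by CFG.
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 0.25]  # prediction ignoring the prompt
cond = [1.0, 0.75]    # prediction following the prompt
print(cfg_combine(uncond, cond, 1.0))  # [1.0, 0.75]: scale 1 just follows the prompt
print(cfg_combine(uncond, cond, 7.5))  # [7.5, 4.0]: higher scale exaggerates the pull
```

Very high scales exaggerate the prompt’s influence, which is why extreme CFG values often produce oversaturated or distorted images.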
Steps: The number of refinement cycles during generation. More steps = more detail (but slower generation).
Diffusion Model: A type of AI model that generates images by gradually denoising a random input. Examples include Stable Diffusion and DALL·E.
Sampler: The algorithm used to generate the image from noise. Examples include Euler, DPM++ 2M, and DDIM. Each produces slightly different image characteristics.
Latent Space: A compressed mathematical representation of image features that the model manipulates during generation.
Checkpoint: A saved version of a trained model. Different checkpoints produce different image styles or capabilities.
LoRA (Low-Rank Adaptation): A method of fine-tuning AI models with minimal data and few trainable parameters. Popular for customizing Stable Diffusion models for specific styles or characters.
VAE (Variational Autoencoder): A component in many diffusion models that compresses and reconstructs image data. It impacts output sharpness and color fidelity.
Inpainting: The process of regenerating or replacing specific areas of an image based on a new prompt. Useful for editing or erasing.
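Conceptually, inpainting composites newly generated pixels into the original image wherever a mask is set; a toy sketch with flat lists standing in for pixels (real tools also re-noise and blend the masked region rather than copying it in directly):

```python
def inpaint_composite(original, generated, mask):
    # Keep the original pixel where mask is 0,
    # take the newly generated pixel where mask is 1.
    return [g if m else o for o, g, m in zip(original, generated, mask)]

original = [10, 20, 30, 40]
generated = [99, 98, 97, 96]
mask = [0, 1, 1, 0]  # only the middle two pixels are regenerated
print(inpaint_composite(original, generated, mask))  # [10, 98, 97, 40]
```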
Outpainting: Extending an existing image beyond its borders using AI, while preserving its original style and context.
ControlNet: A powerful extension for Stable Diffusion that allows precise control over generation using guides like depth maps, poses, or edge detection.
Tiling: Generating seamless, repeating images that can be used as textures or patterns.
Batch Size: The number of images generated per prompt submission. Larger batch sizes increase generation time but offer more variation.
GAN (Generative Adversarial Network): A now less commonly used architecture for image generation, consisting of a generator and a discriminator. For a detailed explanation of how a GAN pipeline works, see the GAN Pipeline entry in the Video Generation Terms section below.
CLIP Guidance: Uses a CLIP model to evaluate how well the generated image matches the text prompt. Improves coherence between prompt and output.
Upscaler: An AI model that increases the resolution of an image while preserving detail. Often used post-generation for production-quality results.
Aspect Ratio: The width-to-height ratio of the generated image. Customizable to fit social media, posters, wallpapers, etc.
Prompt Weighting: The technique of emphasizing parts of a prompt using syntax like ((word)) or (word:strength) to influence the image outcome.
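In the common Stable Diffusion web UIs, each pair of surrounding parentheses multiplies a token’s attention weight by roughly 1.1. The helper below is a hypothetical illustration of that convention, not code from any particular tool:

```python
def emphasis_weight(token: str, base: float = 1.1) -> float:
    # Strip balanced surrounding parentheses and multiply the
    # attention weight by `base` once per pair.
    depth = 0
    while token.startswith("(") and token.endswith(")"):
        token = token[1:-1]
        depth += 1
    return round(base ** depth, 4)

print(emphasis_weight("sunset"))      # 1.0: no emphasis
print(emphasis_weight("((sunset))"))  # 1.21: two pairs, 1.1 * 1.1
```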
Video Generation Terms
Short definitions to help you generate high-quality AI videos with tools like Runway, Pika Labs, Sora, KlingAI, Veo, and so forth.
Keyframe: A frame that defines the start or end point of a transition or movement. Keyframes anchor how motion evolves in AI-generated videos.
Frame Interpolation: The process of generating intermediate frames between two keyframes to create smooth transitions or movement.
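The simplest form of interpolation is a linear blend between two keyframes; modern interpolators such as RIFE or FILM estimate motion instead of cross-fading, but the anchoring idea is the same. Frames are flattened to plain lists here for illustration:

```python
def lerp_frames(frame_a, frame_b, t):
    # Linear interpolation: t=0 returns frame_a, t=1 returns frame_b,
    # values in between produce a cross-fade.
    return [(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)]

start = [0.0, 100.0]
end = [100.0, 200.0]
print(lerp_frames(start, end, 0.5))  # [50.0, 150.0] -> the halfway frame
```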
Frame Rate (FPS): The number of frames shown per second. Common values are 24, 30, and 60 FPS. Higher frame rates result in smoother video playback.
Motion Consistency: A model’s ability to maintain stable and coherent motion across frames, especially for character limbs, camera movement, or object paths.
Temporal Consistency: The alignment of visual elements over time. Crucial for preventing flickering, identity loss, or scene jumps between frames.
Text-to-Video: A model that generates video from a text prompt. Leading models include Pika, Runway, and Sora.
Image-to-Video: Video generation based on a static image input. The model animates the image with motion, transitions, or effects.
Video-to-Video: Transforms one video into another by applying a prompt or style transfer. Useful for converting webcam footage to stylized animation.
Latent Video Diffusion: A method where videos are generated by denoising in a compressed latent space over time. Enables faster and more coherent video synthesis.
Storyboarding: Using a sequence of images (or sketches) with associated prompts to define narrative structure or shot composition in video generation.
Masking (Video Inpainting): The process of selecting and replacing or regenerating specific regions across video frames. Used for editing or object removal.
Style Transfer: Applying the visual style of an image (e.g., Van Gogh, Pixar) to an entire video while preserving motion and scene structure.
Depth-Aware Generation: Uses depth maps or stereo inputs to add perspective and realistic 3D effects to AI-generated video sequences.
Camera Control: Customizing how the virtual camera moves through the scene (e.g., pan, zoom, orbit). Essential for cinematic quality.
Video Stitching: Merging multiple generated clips into a seamless narrative. Used in AI filmmaking or long-form content generation.
GAN Pipeline: A structured process used in Generative Adversarial Networks to synthesize realistic videos. Involves a generator and a discriminator in a competitive training loop. Still relevant in foundational video synthesis models like MoCoGAN.
Audio Generation Terms
Short definitions to help you create high-quality AI audio using tools like ElevenLabs, AudioCraft, Bark, and Voicemod.
Text-to-Speech (TTS): A technology that converts written text into spoken voice using AI. Advanced TTS systems like ElevenLabs or Microsoft Azure Neural TTS generate highly realistic, human-like speech.
Voice Cloning: The process of replicating a person’s voice using a small sample of their audio data. Popular in dubbing, personalization, and synthetic media creation.
Speech Synthesis: A broader term for generating speech, which includes TTS, voice cloning, and prosody modeling. Focuses on natural rhythm, tone, and inflection.
Text-to-Audio: Expands beyond speech to generate non-verbal audio (e.g., soundscapes, music, ambient noise) from a text description. Tools like Google’s AudioLM or Stability AI’s Harmonai are examples.
Phoneme: The smallest unit of sound in speech. AI models use phoneme-level control to ensure accurate pronunciation and emotional expression in synthesized speech.
Spectrogram: A visual representation of audio used by many AI models to analyze or generate sound. Models often generate spectrograms first, then convert them to audio using a vocoder.
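A magnitude spectrogram can be computed with nothing but NumPy: slice the waveform into overlapping windowed frames and take each frame’s FFT magnitude. A minimal sketch (the frame and hop sizes here are arbitrary choices, not a standard):

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    # Overlapping windowed frames -> FFT magnitude per frame,
    # giving a time x frequency "image" of the sound.
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz sine
spec = magnitude_spectrogram(tone)
print(spec.shape)                   # (time frames, frequency bins)
```

Generative audio models typically produce such a time-frequency representation first, then hand it to a vocoder to reconstruct the waveform.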
Prosody: The patterns of rhythm, stress, and intonation in speech. Advanced TTS systems manipulate prosody to make voices sound expressive and lifelike.
Vocoder: A neural component that converts spectrograms into actual audio waveforms. Examples include HiFi-GAN and WaveGlow. Critical for producing natural-sounding voices.
Zero-Shot Voice Synthesis: Generates speech in a completely new voice using only a few seconds of reference audio, without needing model fine-tuning.
Audio Inpainting: Filling in missing or corrupted parts of an audio clip using AI. Useful for restoring old recordings or smoothing transitions in generated speech or music.