Generative AI systems such as ChatGPT, DALL-E, and Codex can generate digital content, including images, text, and code. Recent progress in large-scale AI models has improved generative AI’s ability to understand intent and produce more realistic content. This text summarizes the history of generative models and their components, recent advances in AI-generated content for text, images, and multiple modalities, and the challenges that remain.
In recent years, Artificial Intelligence Generated Content (AIGC) has gained attention well beyond the computer science community, and society at large has become interested in the content generation products built by large tech companies. Technically, AIGC refers to using generative AI algorithms to produce content that satisfies human instructions, where the instructions help teach and guide the model to complete the task. This generation process usually comprises two steps: extracting intent information from the human instructions, and generating content according to the extracted intent.
Generative models have a long history in AI, dating back to the 1950s. Early models such as Hidden Markov Models and Gaussian Mixture Models could generate only simple data. Generative models improved substantially with the advent of deep learning. In natural language processing (NLP), traditional sentence generation relied on N-gram language models, which struggled with long sentences. Recurrent neural networks, and gated variants such as Gated Recurrent Units, made it possible to model longer dependencies of roughly 200 tokens. In computer vision (CV), image generation before deep learning relied on hand-designed features, which limited the complexity and diversity of the generated images. Generative Adversarial Networks and Variational Autoencoders later enabled impressive image generation. Advances in the two fields followed different paths but converged on the transformer architecture, introduced for NLP in 2017. Transformers now dominate generative models across many domains. In NLP, large language models such as BERT and GPT are built on transformers. In CV, Vision Transformers and Swin Transformers combine the transformer architecture with visual components to process images.
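To make the N-gram limitation mentioned above concrete, here is a minimal Python sketch of a bigram (N = 2) text generator. It is an illustration under stated assumptions, not a reconstruction of any particular historical system: the toy corpus and the helper names `train_ngram` and `generate` are hypothetical. The point it shows is that each new token is chosen from only the previous N-1 tokens, so nothing earlier in the sentence can influence the choice.

```python
import random
from collections import defaultdict


def train_ngram(tokens, n=2):
    """Map each (n-1)-token context to the list of tokens observed after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model


def generate(model, seed, length, n=2, rng=None):
    """Sample up to `length` new tokens after `seed`.

    Assumes `seed` holds at least n-1 tokens. Each choice looks only at
    the last n-1 tokens, which is why long-range structure is lost.
    """
    rng = rng or random.Random(0)
    out = list(seed)
    for _ in range(length):
        context = tuple(out[len(out) - (n - 1):])
        candidates = model.get(context)
        if not candidates:
            # Context never seen with a successor in training: stop early.
            break
        out.append(rng.choice(candidates))
    return out


# Toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
bigram = train_ngram(corpus, n=2)
sample = generate(bigram, seed=["the"], length=8)
print(" ".join(sample))
```

Every adjacent word pair in the output appears in the corpus, yet the sentence as a whole can wander, because the model has no memory beyond one token. Raising `n` widens the context, but the number of possible contexts grows quickly and most are never observed in training, which is one reason recurrent and later transformer models replaced this approach.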