A Survey on Data Augmentation in Large Model Era

Yue Zhou, Chenlu Guo, Xu Wang, Yi Chang, Senior Member, IEEE, Yuan Wu, Member, IEEE

Abstract—Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence, garnering signiﬁcant interest from both academic and industrial spheres. However, the training of these large models necessitates vast quantities of high-quality data, and with continuous updates to these models, the existing reservoir of high-quality data may soon be depleted. This challenge has catalyzed a surge in research focused on data augmentation methods. Leveraging large models, these data augmentation techniques have outperformed traditional approaches. This paper offers an exhaustive review of large model-driven data augmentation methods, adopting a comprehensive perspective. We begin by establishing a classiﬁcation of relevant studies into three main categories: image augmentation, text augmentation, and paired data augmentation. Following this, we delve into various data post-processing techniques pertinent to large model-based data augmentation. Our discussion then expands to encompass the array of applications for these data augmentation methods within natural language processing, computer vision, and audio signal processing. We proceed to evaluate the successes and limitations of large model-based data augmentation across different scenarios. Concluding our review, we highlight prospective challenges and avenues for future exploration in the ﬁeld of data augmentation. Our objective is to furnish researchers with critical insights, ultimately contributing to the advancement of more sophisticated large models. We consistently maintain the related open-source materials at: https://github.com/MLGroup-JLU/LLM-data-aug-survey.

Index Terms—Large Language Models, Diffusion Models, Data Augmentation

1 INTRODUCTION

Data augmentation, a key technique in machine learning, addresses the challenge of training models with limited labeled data for diverse tasks. It involves enhancing the sufficiency and diversity of training examples without explicitly collecting new data, thus playing a crucial role in improving model generalization (Feng et al., 2021; Shorten and Khoshgoftaar, 2019). The essence of data augmentation lies in generating new data by altering existing data points through various transformations. This prevents models from memorizing irrelevant data patterns, with the augmented data closely mirroring the distribution of real data (Cubuk et al., 2019; Wei and Zou, 2019). Such techniques are directly applicable in supervised learning (Liu et al., 2021c) and can be employed in semi-supervised learning for unlabeled data through consistency regularization (Zhang et al., 2021a). Originally developed for computer vision (CV), data augmentation methods create artificial images through operations like cropping, rotating, and color adjustment (Kanwal et al., 2022; Krell and Kim, 2017; Takahashi et al., 2019). In natural language processing (NLP), similar approaches involve random character insertion, word deletion, and synonym replacement (Liu et al., 2020; Shorten and Khoshgoftaar, 2019).

The significance of data augmentation has attracted substantial attention in both academic and industrial fields. As a vibrant research area, it addresses the growing need for large volumes of high-quality labeled data in machine learning, a demand often unmet in real-world scenarios. Despite significant advancements in data augmentation over the past decades, especially with deep learning techniques, these methods still struggle with capturing the complexities of real-world data (Feng et al., 2021), generating scalable data (Yang et al., 2022), and defending against adversarial examples (Qiu et al., 2020).

In response to these limitations, current research is exploring innovative techniques to enhance the efficacy and diversity of data augmentation methods. Among these, large models, including large language models (Zhao et al., 2023) and diffusion models (Yang et al., 2023), show considerable promise. Large language models (LLMs), such as GPT-4 (OpenAI, 2023a) and Llama2 (Touvron et al., 2023b), have revolutionized NLP. Characterized by transformer architectures (Vaswani et al., 2017) and trained on extensive corpora, LLMs excel in understanding and generating human-like text, marking a significant advancement in machine learning capabilities (Zhao et al., 2023). These models, with billions of parameters, can undertake diverse and complex tasks, including code generation (Zhang et al., 2023b) and data augmentation (Dai et al., 2023), paving the way toward artificial general intelligence (AGI).

• Y. Zhou, C. Guo, X. Wang, Y. Wu and Y. Chang are with the School of Artiﬁcial Intelligence, Jilin University, Changchun, China. The ﬁrst two authors contributed equally.

• Correspondence to: Yuan Wu (yuanwu@jlu.edu.cn).

Diffusion models (Ho et al., 2020; Song et al., 2020), a new family of state-of-the-art generative models, have overtaken generative adversarial networks (GANs) (Goodfellow et al., 2014), which long dominated image synthesis within computer vision (Dhariwal and Nichol, 2021; Ho et al., 2020). Unlike prior models such as variational auto-encoders (VAEs) (Kingma and Welling, 2013) and GANs, diffusion models iteratively add and reverse noise to generate high-quality synthetic images and have enabled text-to-image generation (Saharia et al., 2022), expanding the scope of data augmentation.

With the impressive capabilities of LLMs and diffusion models, there is a growing interest in using these models for data augmentation in both NLP and CV. This has led to the creation of more diverse and comprehensive datasets (Dai et al., 2023; Sahu et al., 2022; Samuel et al., 2023b; Trabucco et al., 2023). Over the past year, research on large model-based data augmentation has expanded significantly, posing a challenge for new researchers to keep pace with recent developments and discern major trends. A systematic summary of this rapidly evolving field is therefore crucial to provide a comprehensive understanding and inspire future research.

In this paper, we present an extensive survey of large model-based data augmentation approaches. As outlined in Fig. 1, we structure our survey across three dimensions: approach, data post-processing, and application. The “approach” dimension includes image, text, and paired data augmentation methods; “data post-processing” covers methodologies like Top-K selection, model-based, score-based, and cluster-based approaches; and “application” offers insights into the use of large model-based data augmentation in NLP, CV, and audio signal processing. This survey provides a systematic overview, establishes comprehensive taxonomies, identifies challenges, and discusses potential future directions in the field of large model-based data augmentation.

The contributions of this paper are as follows:

1) We present a comprehensive overview of large model-based data augmentation methods, spanning three dimensions: approach, data post-processing, and application. Our analysis covers data augmentation techniques applicable to natural language processing (NLP), computer vision (CV), and audio signal processing.
2) In terms of approach, we extensively summarize existing data augmentation methods that leverage advanced large language models (LLMs) and diffusion models. This provides an insightful perspective into the future trajectory of data augmentation research.
3) We also delve into future challenges associated with the development of data augmentation methods that incorporate sophisticated large models.

The structure of this paper is as follows: Section 2 introduces the foundational concepts of LLMs, diffusion models, and data augmentation methods. Section 3 reviews current large model-based data augmentation methods, focusing on (1) image data augmentation, (2) text data augmentation, and (3) paired data augmentation techniques. Section 4 examines existing data post-processing approaches that utilize large models. In Section 5, we outline the application scenarios of the surveyed papers. Section 6 synthesizes the key insights from this survey, while Section 7 explores significant future challenges in the field. The paper concludes with Section 8, summarizing our findings and contributions.

2 BACKGROUND

2.1 Large Language Models

The evolution of large language models (LLMs) can be traced to the early era of statistical language models (Gao and Lin, 2004; Liu and Croft, 2005; Rosenfeld, 2000), which estimated the probability distribution of linguistic elements like words, sentences, and documents, based on the Markov assumption. N-gram models (Brown et al., 1992; Marino et al., 2006; O’Boyle et al., 1994), the most prevalent among statistical language models, predict the likelihood of a word based on its preceding n − 1 words. Despite their widespread use, these models struggled to capture long-range dependencies and intricate linguistic structures.
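To make the n-gram idea concrete, the following is a minimal sketch of a maximum-likelihood n-gram estimator. The function name `train_ngram`, the toy corpus, and the choice to return zero for unseen contexts are illustrative assumptions, not part of any surveyed method; practical systems add smoothing.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n=2):
    """Count n-grams and return a function giving maximum-likelihood probabilities."""
    context_counts = Counter()
    ngram_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # the preceding n - 1 words
        word = tokens[i + n - 1]               # the word being predicted
        context_counts[context] += 1
        ngram_counts[context][word] += 1

    def prob(word, context):
        context = tuple(context)
        if context_counts[context] == 0:
            return 0.0  # unseen context; real systems apply smoothing here
        return ngram_counts[context][word] / context_counts[context]

    return prob


# Toy corpus (hypothetical): "the" is followed by "cat" twice and "mat" once.
corpus = "the cat sat on the mat the cat ran".split()
bigram_prob = train_ngram(corpus, n=2)
print(bigram_prob("cat", ["the"]))
```

Because the estimate only looks at the preceding n − 1 words, any dependency that spans a longer window is invisible to the model, which is the limitation noted above.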

The advent of deep learning catalyzed the emergence of neural language models, which use neural networks (Bengio et al., 2000) to predict word sequence probabilities. Among these, Recurrent Neural Networks (RNNs) (Mikolov et al., 2010) emerged as a significant improvement over n-gram models. However, RNNs had limitations in handling long sequences, as they mainly captured information from nearby positions and often lost memory of earlier inputs. This led to the development of Long Short-Term Memory networks (LSTMs) (Sundermeyer et al., 2012), which addressed this limitation through a gating mechanism and cell states, enabling the capture of long-term dependencies. These models have shown significant improvements over traditional statistical models, but they still suffer from limitations in their ability to capture context and long-range dependencies.

The introduction of the Transformer architecture with self-attention mechanisms marked a significant milestone (Vaswani et al., 2017). Initially designed for machine translation, Transformers could capture both long-range dependencies and contextual information. This breakthrough spurred the development of pre-trained language models (PLMs) like BERT, GPT-2, and BART (Devlin et al., 2018; Lewis et al., 2019; Radford et al., 2019), trained on extensive corpora to acquire universal language representations and fine-tuned for specific downstream tasks, such as text classification or question answering.
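As an illustration of why self-attention can relate distant positions directly, here is a minimal single-head scaled dot-product attention sketch in plain Python. The helper names (`matmul`, `softmax`, `self_attention`) and the identity projection matrices are assumptions made for readability; real implementations use tensor libraries, multiple heads, and learned projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # Every query is compared with every key, regardless of distance in the sequence.
    scores = [[sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Three toy token embeddings and identity projections (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(x, identity, identity, identity)
print(attn[2])
```

Each output row is a weighted average of all value vectors, so the first and last tokens can influence each other in a single step rather than through a long recurrent chain.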

The success of PLMs prompted extensive research, particularly in scaling PLMs. Large-scale PLMs such as GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023a), PaLM (Chowdhery et al., 2022), and Llama (Chowdhery et al., 2022) demonstrated remarkable capabilities in complex tasks, leading to the term “large language model (LLM).” Notable applications of LLMs include ChatGPT (OpenAI, 2023b) for dialogue interaction and Med-PaLM 2 (Singhal et al., 2023) for medical question answering.

A distinct characteristic of LLMs is their “knowledge emergence” ability (Wei et al., 2022a), absent in smaller models but present in larger ones (Brown et al., 2020; Wei et al., 2021, 2022b). This ability manifests in scenarios like few-shot prompting (Brown et al., 2020), where LLMs generate desired outputs with minimal demonstrations and natural language instructions, without further training. Instruction tuning (Wei et al., 2021) enhances LLMs’ task adaptability by fine-tuning on a mix of tasks presented as instructions. Chain-of-thought (CoT) prompting (Wei et al., 2022b), an advanced strategy, involves incorporating intermediate reasoning steps into prompts to improve LLM performance in complex reasoning tasks.
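The difference between plain few-shot prompting and chain-of-thought prompting can be sketched as a difference in how the prompt text is assembled. The function `build_few_shot_prompt`, the `Q:`/`Reasoning:`/`A:` layout, and the arithmetic demonstration below are hypothetical choices for illustration; the surveyed works use their own templates.

```python
def build_few_shot_prompt(instruction, demonstrations, query, with_reasoning=False):
    """Assemble a few-shot prompt; optionally include reasoning steps (CoT style)."""
    lines = [instruction, ""]
    for demo in demonstrations:
        lines.append(f"Q: {demo['question']}")
        if with_reasoning:
            # Chain-of-thought: show the intermediate steps before the answer.
            lines.append(f"Reasoning: {demo['reasoning']}")
        lines.append(f"A: {demo['answer']}")
        lines.append("")
    lines.append(f"Q: {query}")
    # Leave the final slot open for the model to complete.
    lines.append("Reasoning:" if with_reasoning else "A:")
    return "\n".join(lines)


demos = [
    {
        "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "reasoning": "He starts with 3 and adds 2, so 3 + 2 = 5.",
        "answer": "5",
    },
]
prompt = build_few_shot_prompt(
    "Answer the arithmetic question.",
    demos,
    "A shelf holds 4 books and 6 more are added. How many books are there?",
    with_reasoning=True,
)
print(prompt)
```

The resulting string would be sent to an LLM as-is; no model weights change, which is what distinguishes prompting from instruction tuning.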

Overall, the development of LLMs has revolutionized the field of NLP and holds significant potential for other disciplines. As a result, LLMs are highly likely to remain a central focus of research and development in the foreseeable future. TABLE 1 offers a concise comparison of traditional statistical models, neural language models, PLMs, and LLMs, while Fig. 2 depicts the recent trend in LLM-related publications.

Fig. 1. Structure of this paper.

2.2 Diffusion Models


TABLE 1 Comparison of traditional statistical Models, neural language models, pre-trained language models and large language models.

| Comparison | Traditional Statistical Models | Neural Language Models | Pre-trained Language Models | Large Language Models |
| --- | --- | --- | --- | --- |
| Model Size | Limited | Large | Large | Very large |
| Training Data Size | Large | Large | Large | Very large |
| Emergent Abilities | No | No | No | Yes |
| Feature Extraction | Manual | Automatic | Automatic | Automatic |
| Interactiveness | Poor | Poor | Poor | Good |
| Interpretability | Good | Poor | Poor | Poorest |
| Performance | Common | Higher | Higher | Highest |
| Evaluation | Automatic | Automatic | Automatic | Automatic, Human |
| Resources Requirements | Low | High | High | Highest |
| Ability to Capture Long-range Dependencies | Poor | Better | Better | Best |
| Representative Models | N-gram | RNN,LSTM | BERT,GPT-1,T5 | PaLM,GPT-3,Llama,GPT-4 |

Fig. 2. Trend of the number of publications on large language models (x-axis: year, 2020 to 2023; y-axis: number of papers).

Fig. 3. Trend of the number of publications on diffusion models (x-axis: year, 2020 to 2023; y-axis: number of papers).

Diffusion models, a class of probabilistic generative models in machine learning and image processing, have garnered attention for their approach of evolving data over time through controlled, incremental diffusion steps (Ho et al., 2020; Sohl-Dickstein et al., 2015; Song and Ermon, 2019; Song et al., 2020). These models start with a real image and introduce Gaussian noise at each step, transforming the image into a progressively noisier version. Training teaches the model to reverse this noise addition, effectively restoring the image to its original state.

Given an original image $x_0$, the forward process of a diffusion model incrementally adds Gaussian noise over $T$ steps. Because the state at each step $t$ depends solely on the preceding step $t-1$, the process is a Markov chain that gradually generates the latent variables $x_1, x_2, \ldots, x_T$. Denoting the probability distribution of the forward process by $q(\cdot)$, it is generated as follows:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{1}$$

Each step is defined as a Gaussian transformation $\mathcal{N}(\cdot)$ with variances $\{\beta_t \in (0, 1)\}_{t=1}^{T}$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \tag{2}$$

As $t$ progresses, $x_t$ gradually approaches pure noise.

Conversely, the reverse process performs denoising inference: a neural network is trained to sequentially remove noise from an entirely noisy image until the real image is recovered.

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \tag{3}$$

The reverse transition $p_\theta(\cdot)$ is designed to progressively reduce variance. Consequently, the final sample $x_0$ represents a sample drawn from the true distribution. This transformation is typically parameterized by a fixed covariance $\Sigma_t = \beta_t I$ and a learned mean $\mu_\theta(x_t, t)$, defined as:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \tag{4}$$

where $\epsilon_\theta(x_t, t)$ denotes a trained neural network that takes a noisy sample $x_t$ and predicts the noise that was added. Given a real sample $x_0$ and noise $\epsilon \sim \mathcal{N}(0, I)$, $x_t$ can be computed at any time step as:

$$x_t(x_0, \epsilon) = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon \tag{5}$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
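The closed-form noising step in Eq. (5) and the reverse-step mean in Eq. (4) can be sketched in a few lines. This is a toy illustration under stated assumptions: a linear variance schedule from 1e-4 to 0.02 over 1000 steps, a four-value list standing in for an image, and the true noise standing in for the network prediction $\epsilon_\theta$. The function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 using Eq. (5); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, noise)]
    return xt, noise


def reverse_mean(xt, t, predicted_noise, betas, alpha_bars):
    """Mean of the reverse step as in Eq. (4), given a noise prediction."""
    beta, a_bar = betas[t - 1], alpha_bars[t - 1]
    alpha = 1.0 - beta
    coef = beta / math.sqrt(1.0 - a_bar)
    return [(x - coef * e) / math.sqrt(alpha) for x, e in zip(xt, predicted_noise)]


rng = random.Random(0)
T = 1000
betas = linear_beta_schedule(T)
alpha_bars = cumulative_alphas(betas)
x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image" with four pixels
xt, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
mean = reverse_mean(xt, 500, eps, betas, alpha_bars)  # true noise stands in for the network
print(alpha_bars[0], alpha_bars[-1])
```

Under this schedule the cumulative product shrinks toward zero as t grows, so the signal term in Eq. (5) vanishes and $x_T$ is close to pure Gaussian noise, consistent with the forward process described above.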

In multimodal scenarios, combining image and language offers comprehensive insights. Allowing concurrent guidance from both image and language introduces extra flexibility, providing users with enhanced control options (Liu et al., 2023a). Using diffusion models for multimodal data augmentation involves configuring additional modalities as conditions, particularly through a denoising network $\epsilon_\theta(x_t, y, t)$ modulated by an auxiliary input $y$. This allows for sampling from a data distribution conditioned on $y$. When combined with pre-training, diffusion models excel in applications like image editing (Brooks et al., 2023; Couairon et al., 2022; Kawar et al., 2023; Ruiz et al., 2023) and image inpainting (Nichol et al., 2021; Xie et al., 2023) based on text prompts, showcasing their versatility in enhancing multimodal data. Fig. 3 displays the recent growth in publications on diffusion models.
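One widely used way to steer sampling with a conditional denoiser is classifier-free guidance, which this survey later discusses in connection with GLIDE (Nichol et al., 2021). The sketch below shows only the combination rule; the function name, the list-based noise vectors, and the scale value 3.0 are illustrative assumptions, and the two noise predictions would in practice come from the same network run with and without the condition $y$.

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=3.0):
    """Extrapolate from the unconditional noise prediction toward the conditioned one."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions for a two-pixel sample (hypothetical values).
eps_text = [1.0, 2.0]     # prediction given the text prompt y
eps_empty = [0.0, 1.0]    # prediction given an empty prompt
guided = classifier_free_guidance(eps_text, eps_empty, guidance_scale=3.0)
print(guided)
```

A scale of 1 recovers the purely conditional prediction, while larger scales push samples further toward the condition, typically trading diversity for fidelity to the prompt.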

2.3 Data Augmentation

Data augmentation, a cornerstone technique in deep learning, becomes vital when data collection is difficult. By employing diverse augmentation strategies, it enriches datasets, expanding their scope, enhancing the trained model’s robustness, and refining its generalization capabilities.

A primary function of data augmentation is to counter overfitting, often encountered in training deep neural networks (Shorten et al., 2021). Without augmentation or regularization, these networks risk adopting spurious correlations and memorizing intricate, high-frequency patterns that may elude human detection.

In fields like computer vision (CV) and natural language processing (NLP), data augmentation plays a critical role in improving model robustness and generalization. In CV, traditional methods include random rotation, flipping, scaling, and cropping, introducing variations in orientation and scale. Additional techniques like color jittering and noise addition further diversify the dataset. Image-mixing methods, exemplified by Mixup (Zhang et al., 2017) and CutMix (Yun et al., 2019), blend images or their sub-regions, enhancing data diversity beyond basic image processing techniques. In NLP, Easy Data Augmentation (EDA) (Wei and Zou, 2019) techniques like Synonym Replacement, Random Insertion, Random Swap, and Random Deletion are prevalent. However, these methods face limitations, such as the need for the available data’s distribution to closely mirror the actual data distribution, risks of information loss or distortion, and challenges in maintaining labeling consistency.
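Two of the traditional techniques named above can be sketched with the standard library alone: Mixup, which blends two examples and their labels, and the EDA operations Random Swap and Random Deletion. Flat lists stand in for image tensors, and the parameter values (`alpha=0.2`, `p=0.3`) and function names are assumptions chosen for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def random_swap(words, n_swaps=1, rng=random):
    """EDA-style random swap: exchange two randomly chosen positions n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p=0.1, rng=random):
    """EDA-style random deletion: drop each word with probability p, keep at least one."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([0.0, 1.0], [1, 0], [1.0, 0.0], [0, 1], rng=rng)
sentence = "data augmentation improves model generalization".split()
swapped = random_swap(sentence, rng=rng)
shortened = random_deletion(sentence, p=0.3, rng=rng)
print(mixed_y, swapped, shortened)
```

The sketch also makes the stated limitations visible: deletion can remove the very word that determines the label, and neither operation introduces content that was absent from the original data.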

Adversarial training ([…], 2014; Madry et al., 2017; Miyato et al., 2016; Shafahi et al., 2019) is a technique where models are exposed to adversarial examples during training. These examples are crafted to deceive the model, forcing it to become more robust against potential adversarial attacks. This approach enhances the model’s resilience by teaching it to handle unexpected variations and disturbances in the input data.
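As a concrete illustration of how an adversarial example can be crafted, the sketch below applies the fast gradient sign method to a toy logistic-regression model. The choice of this particular attack, the model parameters, and the step size `epsilon` are assumptions made for the example; the text above does not prescribe a specific method.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(g):
    """Return -1, 0, or 1 according to the sign of g."""
    return (g > 0) - (g < 0)


def bce_loss(x, y, weights, bias):
    """Binary cross-entropy of a logistic model on one example (label y in {0, 1})."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def fgsm_example(x, y, weights, bias, epsilon=0.1):
    """Shift each input feature by epsilon in the direction that increases the loss."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    # For binary cross-entropy, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


weights, bias = [2.0, -1.0], 0.0  # toy model parameters (hypothetical)
x, y = [1.0, 0.5], 1
x_adv = fgsm_example(x, y, weights, bias, epsilon=0.1)
print(bce_loss(x, y, weights, bias), bce_loss(x_adv, y, weights, bias))
```

In adversarial training, perturbed inputs such as `x_adv` would be added to the training batches with their original labels so that the model learns to classify them correctly.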

Generative modeling, another promising avenue for data enhancement, creates artificial instances that retain features similar to the original dataset. Generative Adversarial Networks (GANs) (Goodfellow et al., 2020), particularly influential in CV, have evolved into various forms, becoming a robust tool for dataset enhancement (Antoniou et al., 2017; Brock et al., 2018; Xia et al., 2018).

Recent advancements in large models like GPT-3, T5 (Raffel et al., 2020), Stable Diffusion (Rombach et al., 2022), and CLIP (Radford et al., 2021) have shown exceptional performance in data augmentation. Pre-trained on extensive datasets and fine-tuned for specific tasks, large models offer rich representation patterns, serving as a robust foundation for various downstream tasks. Although initially designed for NLP, LLMs have proven versatile enough to address image-related challenges, introducing context-aware transformations and enhancing semantic features beyond the capabilities of traditional methods (Nichol et al., 2021; Radford et al., 2021; Zhang et al., 2023a). TABLE 2 delineates the differences between large model-driven and traditional data augmentation methods across various aspects.

3 APPROACHES

The advent of large models has revolutionized data augmentation, offering novel and effective means to generate training data with greater diversity compared to traditional methods. This section categorizes existing methodologies into three distinct classes based on the target data type: image augmentation, text augmentation, and paired data augmentation. Image augmentation pertains to expanding image data, text augmentation to expanding text data, and paired data augmentation to both. These methods reflect the latest trends in data augmentation, highlighting the significant role of large models.

3.1 Image Augmentation

Image augmentation synthesizes realistic images, guided by additional information. We divide these techniques into prompt-driven and subject-driven approaches: text, visual, and multimodal approaches in the prompt-driven category; and subject-specific strategies in the subject-driven category. Text prompt-driven approaches generate images from textual descriptions, visual prompt-driven approaches use visual cues, and multimodal prompt-driven approaches combine both textual descriptions and visual guidance. Subject-driven approaches tailor augmentation for specific subjects. These approaches enhance deep learning task performance, contributing to more robust training experiences. Existing approaches are summarized in TABLE 3.

TABLE 2 Comparison of large model-driven and traditional methods for data augmentation.

| Aspect | Large Model-Driven Methods | Traditional Methods |
| --- | --- | --- |
| Semantic Consistency | Contributes to higher semantic consistency by leveraging language understanding. | Lacks explicit mechanisms for maintaining semantic consistency. |
| Flexibility and Creativity | Demonstrates high flexibility and creativity in generating diverse content. | May be constrained by predefined transformation patterns, limiting creative variations. |
| Data Diversity | Learns diverse contexts from extensive data, introducing varied augmentation. | May rely on limited data sources, potentially resulting in less diverse augmentation. |
| Computational Efficiency | Exhibits lower computational efficiency due to a large number of parameters. | Tends to be more computationally efficient, especially in basic geometric transformations. |
| Domain Specificity | Domain-agnostic, applicable to various text and image data types. | Traditional methods may excel in specific domains, optimized for domain-specific features. |

3.1.1 Prompt-driven approaches

Text prompt-driven approaches

CamDiff (Luo et al., 2023) employs the latent diffusion model (LDM) (Rombach et al., 2022) to synthesize salient objects in camouflage scenes, leveraging CLIP’s zero-shot image classification capabilities to prevent synthesis failures and maintain consistency with input prompts. Couairon et al. (2022) proposed DiffEdit, an algorithm that automatically finds which regions of an input image should be edited given a text query. GLIDE (Nichol et al., 2021) investigates diffusion models for text-driven image synthesis and compares two guidance approaches: CLIP guidance and classifier-free guidance. Human evaluators preferred classifier-free guidance for both photorealism and caption similarity, and it yields highly realistic output. The authors also demonstrated the potential for powerful text-driven image editing by fine-tuning their models for image inpainting. SeedSelect (Samuel et al., 2023a) is a method for refining the generation of unconventional and ill-formed concepts in diffusion models. This technique optimizes the generation process by identifying appropriate generation seeds using a small reference set of images, ensuring that rare concepts are generated accurately both semantically and visually. Tumanyan et al. (2023) presented a framework for image-to-image translation, employing a pre-trained text-to-image diffusion model to generate images based on a guidance image and text prompts. The model offers precise control over structure through spatial feature manipulation and requires no additional training. Hertz et al. (2022) presented a prompt-to-prompt editing framework that is controlled by text. It analyzes a text-conditioned model and highlights the importance of cross-attention layers in managing the relationship between image spatial layout and prompt words. Patashnik et al. (2023) presented a technique for generating diverse images showcasing variations in the shape of a specific object, facilitating object-level exploration. To control the object’s shape while preserving semantics, the authors address the challenge of localizing shape manipulation accurately by introducing a prompt-mixing technique during denoising. Additionally, two techniques using self-attention and cross-attention layers are proposed to localize image-space operations. SINE (Zhang et al., 2023c) is a single-image editing method that relies on a pre-trained text-to-image diffusion model. By fine-tuning this model with a single image and a brief textual description, it enables versatile image editing at any resolution, maintaining both fidelity and generalization. Text2LIVE (Bar-Tal et al., 2022) is a model for semantically meaningful object appearance editing and visual effects. It generates an RGBA edit layer composited over the input based on text prompts, enabling precise content control through text-driven objectives. DiffusionCLIP (Kim et al., 2022) is a text-guided image manipulation method that utilizes pre-trained diffusion models and CLIP loss. It excels in in-domain and out-of-domain manipulation after fine-tuning. StyleGAN-NADA (Gal et al., 2022b) combines StyleGAN and CLIP to transfer generative models to new domains in a text-driven manner without collecting any images. Dunlap et al. (2023) presented ALIA (automated language-guided image augmentation), an approach that utilizes large vision and language models to automatically generate natural language descriptions of a dataset’s domains and augment the training data via language-guided image editing. As a result, the augmented dataset maintains visual consistency with the original training data while offering considerably increased diversity. Trabucco et al. (2023) introduced DA-Fusion (data augmentation by fusion), a flexible data augmentation strategy that employs text-to-image diffusion models to generate diverse variations of real images. This method facilitates semantic editing of images through an off-the-shelf diffusion model, allowing for the modification of image semantics. Furthermore, DA-Fusion demonstrates the ability to generalize to new visual concepts based on a limited number of labeled examples.

TABLE 3 Summary of image augmentation including prompt-driven and subject-driven approaches.

| Reference | Prompt-driven (text) | Prompt-driven (visual) | Prompt-driven (multimodal) | Subject-driven |
| --- | --- | --- | --- | --- |
| Avrahami et al. (2022) | v |  |  |  |
| Bar-Tal etal. (2022 | v |  |  |  |
| Brooks et al. (2023) | v |  |  |  |
| Chen et al. (2023a |  |  |  | vo |
| Quairon et al. |  |  |  |  |
| Doubinsky etal. (2023 | v |  |  |  |
| Du et al. (2023) |  |  | vo |  |
| Dunlap et al. (2023 < |
| Gal et al. (2022a vo |
| Gal et al. (2022b |
| Ge et al. (2023 | A |  |  |  |
| Hertz et al. (2022) |  |  |  |  |
| Huang etal. (2023 v |
| Kawar et al. (2023 |  |  |  |  |
| Kim etal. (2022 | SS |  |  |  |
| Koohpayegani et al. (2023) |  |  |  |  |
| Kumari et al. (2023 vo |
| etal. |
| Li et al. (2023c | v |  |  |  |
| Liu etal. (2023a v |
| uo et al. |
| Ma et al. (2023 |  |  |  | vo |
| Nguyen et al. (2023 v |  |
| ichol et al. |  |  |  |  |
| Patashnik et al. (2023 v |
| Ruiz et al. (2023 v |
| Samuel et al. (2023a | v |  |  |  |
| Schnell et al. (2023) v |
| Shi et al. (2023 |  |  |  | v |
| Sun etal. (2023 |  | v |  |  |
| rabucco et al. |
| Tumanyan et al. (2023 v |
| Wang et al. (2023b v |
| ‘ang et al. |
| Wei et al. (2023 |  |  |  | v |
| Wu et al. (2023d |  | v |  |  |
| iao et al. |  |  |  |  |
| Xie et al. (2023 |  |  | v |  |
| Yin et al. (2023 v |
| ‘wet al. |  | v |  |  |

This addresses the limitation of the standard data augmentation approach, which often relies on simple transformations such as rotations and ﬂips, resulting in limited semantic diversity when generating new images from existing ones. Kawar et al. (2023) presented Imagic, a text-conditioned image editing method that utilizes a pre-trained text-to-image diffusion model. Imagic enables image editing by simply providing a single input image and a target text, eliminating the need for additional inputs such as image masks or additional views of the object. Yin et al. (2023) introduced text-to-text-to-image data aug- mentation (TTIDA), an innovative method that combines the capabilities of large-scale pre-trained text-to-text models (GPT-2) with text-to-image models (GLIDE) to perform data augmentation. TTIDA leverages both models to generate diverse and realistic textual descriptions and corresponding images, enhancing the dataset for training. Avrahami et al. (2022) proposed the ﬁrst solution for general-purpose local

image editing. This approach utilizes a natural language description and an ROI mask as guidance. Combining a pre-trained language-image model (CLIP) with a denoising diffusion probabilistic model (DDPM), the proposed method generates results that exhibit a natural and realistic appearance. Wang et al. (2023a) introduced dynamic prompt learning (DPL) as a technique to address the issue of background and distractor object leakage in image editing with text-to-image diffusion models. They accomplished this by dynamically updating the tokens associated with each noun word in the prompt. This adjustment minimizes attention leakage in the cross-attention maps, leading to better results in text-guided image editing. Brooks et al. (2023) introduced InstructPix2Pix, a method for image editing guided by human instructions that combines knowledge from pre-trained language and text-to-image models to create a large training dataset. Its design enables fast image editing by generalizing efficiently to real images and user-written instructions during inference, without fine-tuning or inversion. Ge et al. (2023) addressed the limitations of
plain text in text-to-image synthesis by introducing a rich-text editor that supports attributes such as font style, size, color, and footnotes. This allows for precise customization and fine-grained control. A region-based diffusion process is employed to ensure the fidelity of the generated images, permitting detailed prompts and region-specific guidance. Recognizing the effectiveness and efficiency of augmented samples near a classifier’s ideal decision boundary, GeNIe (Koohpayegani et al., 2023) is introduced. GeNIe utilizes a diffusion model conditioned on a text prompt to merge divergent data points (an image from the source category and a text prompt from the target category), generating challenging samples for the target category. Inspired by contemporary image editing methods, the model regulates both the number of diffusion iterations and the noise level. This regulation guarantees the preservation of low-level and contextual features from the source image in the generated image, potentially conflicting with the target category. Doubinsky et al. (2023) investigated the enhancement of few-shot class-agnostic counting using synthetic data. A dual conditioning approach, utilizing Stable Diffusion with both a prompt and a density map, is proposed to augment a small training dataset for few-shot counting. To overcome limited dataset diversity, a strategy is introduced involving caption swapping between images and creating novel object configurations and spatial layouts.

Visual prompt-driven approaches. ImageBrush (Sun et al., 2023) is a model designed for precise image editing guided by visual instructions. It utilizes transformation images as instructions, extracted from visual demonstrations, and employs a diffusion-based inpainting approach to uncover human intent, enhancing the model’s capacity for accurate editing. Yu et al. (2023) introduced a diffusion-based method for augmenting nuclei segmentation datasets.
It employs a two-step process: first, it generates synthetic nuclei structures, and then, it uses these structures to synthesize histopathology images. The resulting synthetic images closely mimic real samples, align well with the nuclei structures, and exhibit diverse styles, making them valuable for segmentation model training. Wu et al. (2023d) enhanced weakly-supervised semantic segmentation (WSSS) with their image augmentation with controlled diffusion (IACD) method. This technique enriches labeled datasets by utilizing available images, image-level labels, and detection maps as control inputs. Furthermore, they implemented a robust image selection strategy to mitigate noise in the diffusion model. Their experimental results demonstrate that the IACD approach surpasses existing methods in performance, particularly in scenarios with limited data availability, thereby underscoring its effectiveness. Semantic diffusion guidance (SDG) (Liu et al., 2023a) employs a fine-tuned image encoder for image-guided image synthesis. Similar to language guidance, it extracts embeddings capturing high-level semantics. The use of image encoders allows control over retaining structural information from the reference image, despite embeddings lacking spatial dimensions. By leveraging spatial feature maps and enforcing alignment, SDG guides generated images to share similar structures with the reference image. Through image-guided diffusion, it achieves diverse image synthesis aligned with the semantics of the guidance image.

Multimodal prompt-driven approaches. Nguyen et al. (2023) introduced a framework for image editing via visual prompt inversion. With just one example pair illustrating the “before” and “after” states of an image editing task, this approach achieves competitive results compared to state-of-the-art text-conditioned image editing models. Prompt Diffusion (Wang et al., 2023b) is a framework for enabling in-context learning within diffusion-based generative models. Given pairs of task-specific images and text guidance, this model automatically comprehends and replicates tasks on new images. It introduces a versatile vision-language prompt to model various tasks and is jointly trained on six tasks, demonstrating high-quality in-context generation and effective task generalization, including text-guided image editing. SmartBrush (Xie et al., 2023) introduces precise content completion in missing regions by combining text and visual guidance from masks. The model incorporates innovative training and sampling techniques, including object-mask prediction, to enhance background preservation. Additionally, a multi-task training approach that jointly trains inpainting and text-to-image generation is applied to maximize the use of extensive training data. ReVersion (Huang et al., 2023) addresses the relation inversion task, focusing on learning specific relations using relation prompts from exemplar images. The relation prompt is learned from Stable Diffusion and applied to generate relation-specific images with new elements. The approach emphasizes the “preposition prior,” where real-world relation prompts are sparsely activated based on basis prepositional words. It is achieved by introducing a novel relation-steering contrastive learning scheme to capture object interactions while disentangling object appearances and using relation-focal importance sampling to highlight high-level interactions over low-level details.
GLIGEN (Li et al., 2023c) enhances LDM by integrating new layers into its existing structure. This advancement enables more effective grounded language-to-image generation. The model adeptly generates images based on inputs such as captions and bounding boxes, exhibiting strong adaptability to new spatial configurations and concepts. The method is both simple and efficient, offering the flexibility to extend to other conditions like keypoints, reference images, and various spatially-aligned conditions, including edge and depth maps. ControlNet (Zhang et al., 2023a), a neural network architecture, adds spatial conditioning controls to large, pre-trained text-to-image diffusion models. It harnesses the established diffusion models’ deep encoding layers, trained on billions of images, as a base for learning varied conditional controls. The architecture features “zero convolutions” (convolution layers initialized at zero), whose parameters grow progressively from zero, ensuring that fine-tuning is unaffected by harmful noise. Drawing from ControlNet, Du et al. (2023) employed lesion-specific visual and textual prompts to generate dermatoscopic images. The framework integrates a controllable lesion function, enabling manipulation of lesion type, textual attributes, and shapes with corresponding locations in mask images during both training and inference. The learned correlation between visual and textual prompts prioritizes rare cases during inference. Additionally, an automated module is introduced for generating lesion shapes and masks. Schnell et al. (2023) proposed a method that
utilizes a ControlNet diffusion model, conditioned on semantic scribbles, for generating high-quality training data. To ensure class consistency, the model employs classifier-free diffusion guidance. Encode ratios are introduced to balance data diversity and realism. The proposed augmentation schemes, influenced by guidance scale and encode ratio, yield a spectrum of high-quality training images.
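To make the role of the guidance scale concrete, the following is a minimal sketch of the standard classifier-free guidance combination, in which the conditional and unconditional noise predictions are extrapolated by a scale w. It is not the implementation of Schnell et al. (2023): the function name `cfg_combine` and the flat-list representation of noise predictions are assumptions made for illustration, and the encode ratio is not modeled here.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                guidance_scale: float) -> List[float]:
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond).

    eps_uncond / eps_cond are the denoiser's noise predictions without and
    with the condition (flattened to plain lists for this sketch).
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With a scale of 0 the condition is ignored, a scale of 1 recovers the plain conditional prediction, and larger scales push samples further toward the condition, which typically trades diversity for consistency with the conditioning signal.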

3.1.2 Subject-driven approaches

Subject-driven approaches aim at synthesizing diverse and personalized images based on user-provided images capturing a specific subject. In contrast to general image generation techniques, subject-driven generation focuses on allowing users to create novel renditions of a subject in various contexts while maintaining its distinctive features. This approach caters to the user’s desire for customized and imaginative outputs, addressing a challenging problem in the field of image generation. Especially noteworthy is its ability to generate a variety of images with diverse backgrounds from a small number of subject images. Gal et al. (2022a) introduced a method, utilizing LDM, for generating novel “words” within a text-to-image model’s embedding space with the aid of 3-5 conceptual images, facilitating intuitive, personalized image creation guided by language. Ruiz et al. (2023) proposed a method called DreamBooth, which can generate a variety of photorealistic images of a subject in different contexts with a few reference images and a text prompt. This method has the capacity to create innovative versions of the subject in diverse situations while preserving its unique characteristics. InstantBooth (Shi et al., 2023) allows instant text-guided image personalization without test-time fine-tuning. This is achieved by converting images into textual tokens for concept learning and incorporating adapter layers to preserve fine details and identity during image generation, without the use of paired images of the same concept. UMM-Diffusion (Ma et al., 2023) is a method for generating customized images by encoding text and images jointly into a unified multimodal latent space. This method combines text and image information to guide image generation and eliminates irrelevant image parts through a novel sampling technique.
Custom Diffusion (Kumari et al., 2023) is a fast and efficient fine-tuning technique for text-to-image diffusion models that updates key and value mapping weights in cross-attention layers for new concepts, uses real images with similar captions to prevent forgetting, and introduces augmentation for faster convergence. It also supports training of multiple concepts together or separately and merging. BLIP-Diffusion (Li et al., 2023a) is a subject-driven image generation model with multimodal control using subject images and text prompts. It utilizes a pre-trained multimodal encoder, aligned visual representation with text following BLIP-2, and a subject representation learning task to generate new subject renditions. FastComposer (Xiao et al., 2023) presents a tuning-free technique for generating personalized, multi-subject text-to-image content. By utilizing a pre-trained vision encoder, this approach effectively tackles identity blending concerns by training supervised cross-attention maps with segmentation masks. ELITE (Wei et al., 2023) is a learning-based encoder specifically crafted for rapid and precise customized text-to-image generation. It introduces two key
components: a global mapping network that translates image features into “new” text, and a local mapping network dedicated to preserving details and concept editability. SuTI (Chen et al., 2023a) is a Subject-driven Text-to-Image generator that generates diverse images of a subject in different scenes without the need for subject-specific fine-tuning. It employs apprenticeship learning to achieve this by learning from a vast number of subject-specific expert models.

3.2 Text Augmentation

Text augmentation focuses on harnessing the advanced capabilities of large models to augment text datasets and includes two strategies: label-based and generated content-based. In the label-based approach, models are employed to annotate text data, effectively enriching the text dataset with a larger volume of labeled instances. Generated content-based strategies guide models to synthesize new text data, thereby expanding the dataset with freshly generated textual material. The existing methods are shown in TABLE 4.

3.2.1 Label-based approaches

Thakur et al. (2020) introduced an effective data augmentation approach called Augmented SBERT. This method utilizes the cross-encoder BERT to annotate a larger collection of input pairs, thereby enhancing the training data for the bi-encoder SBERT model. Kumar et al. (2020) investigated the utilization of pre-trained models like GPT-2, BERT, and BART (Lewis et al., 2019) for data augmentation with the aim of enhancing text classification accuracy. Sahu et al. (2022) presented a prompting-based strategy for producing labeled training data for intent classification using general-purpose language models (LM) like GPT-3. This method offers the advantage of not needing task-specific LM fine-tuning for data generation, eliminating the need for hyperparameter tuning. Additionally, this approach remains applicable even when there is limited training data available. GPT3Mix (Yoo et al., 2021) is a technique that leverages a large-scale language model, such as GPT-3, to generate highly realistic synthetic text samples from a mixture of real samples and utilizes soft-labels predicted by the language models, effectively distilling knowledge from the large-scale language models. Sharma et al. (2023) used Llama2 to generate unpaired text for existing and new domains, and then generated pseudo-labels for the generated utterances using a pre-trained RoBERTa model (Liu et al., 2019). The results for spoken semantic parsing improve further by using different generation strategies. Samuel et al. (2023b) assessed how well GPT-4 performed as a substitute for human annotators in low-resource reading comprehension tasks, comparing its effectiveness after fine-tuning and examining the cost involved in the annotation process. Latif et al.
(2023) used a 64-dimensional discrete audio representation generated by a vector-quantized variational autoencoder (VQ-VAE) as ChatGPT’s audio context for data annotation, showcasing the potential of LLMs in speech emotion data annotation through experimentation. Chowdhury and Chadha (2023) introduced a framework designed to improve the distributional robustness of reading comprehension models by employing generative models for dataset augmentation.


TABLE 4 Summary of text augmentation including label-based and generated content-based approaches.

GPT-3.5 is utilized to generate context based on questions, and T5 is employed to generate question-answering pairs. The research includes a comprehensive quantitative evaluation, which demonstrates the ability of the LLM to generate high-quality synthetic data for question-answering tasks. Chen et al. (2023b) introduced MINPROMPT, a data augmentation framework designed for open-domain question answering (QA). The framework incorporates an approximate graph algorithm and unsupervised question generation techniques, which it uses to generate additional training data and thereby improve the robustness and accuracy of QA models. Kaddour and Liu (2023) fine-tuned smaller models (student models) on training data generated or annotated by an LLM (teacher model, e.g., GPT-NeoX (Black et al., 2022)) to improve their downstream performance. To improve the performance of seq2seq automated audio captioning (AAC) models, Wu et al. (2023c) proposed a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions), together with the corresponding audio mixtures, which increase not only the amount but also the complexity and diversity of training data.
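The pattern shared by the label-based methods above (a stronger model annotates unlabeled text, and the annotations become training data for another model) can be sketched as follows. This is an illustrative outline only, not the procedure of any cited paper: `pseudo_label`, the confidence threshold, and `toy_teacher` (a keyword rule standing in for a real annotator such as a cross-encoder or an LLM) are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A teacher maps a text to (label, confidence in [0, 1]).
Teacher = Callable[[str], Tuple[str, float]]


def pseudo_label(texts: Iterable[str],
                 teacher: Teacher,
                 min_confidence: float = 0.0) -> List[Tuple[str, str]]:
    """Annotate unlabeled texts with the teacher and keep confident ones."""
    labeled = []
    for text in texts:
        label, confidence = teacher(text)
        if confidence >= min_confidence:
            labeled.append((text, label))
    return labeled


def toy_teacher(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real annotator: a simple keyword rule."""
    positive_words = {"good", "great"}
    hits = sum(word in positive_words for word in text.lower().split())
    return ("positive", 0.9) if hits else ("negative", 0.6)
```

The resulting (text, label) pairs would then be merged with the original labeled data before training the smaller model; the confidence cut-off is one simple way to limit label noise.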

3.2.2 Generated content-based approaches

Zheng et al. (2023) used GPT-J (Wang and Komatsuzaki, 2021) to augment emotional support conversations (ESC) via a dialogue completion task. By prompting a fine-tuned language model with available dialogue posts from varied topics, the researchers generated full dialogues that are postprocessed using heuristics. Inspired by the recent success of LLMs, especially the development of ChatGPT, which demonstrates improved language comprehension abilities, Dai et al. (2023) proposed a text data augmentation approach based on ChatGPT (named AugGPT). AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples, which can ensure both the correct labeling and sufficient diversity of the generated data. CoCa (Yu et al., 2022) represents a sophisticated representation learning approach that adeptly integrates natural language supervision. This is accomplished through pre-training on a diverse array of image-text data sourced from multiple origins. The approach effectively amalgamates contrastive and captioning losses within an encoder-decoder framework, showcasing its innovative methodology in representation learning. Jo et al. (2022) introduced a data augmentation technique called DAG, which utilizes a generation model. The DAG method employs T5-base (Raffel et al., 2020) as the generation model to summarize a group of sentences from the original data to construct a longer sequence. This process generates augmented data with a representation distribution that is similar to the original data. Oh et al.
(2023) undertook prompt-based data augmentation experiments utilizing ChatGPT, examining the diversity of data generated from three distinct prompts. The primary aim of this study is to refine the language model for an online call center’s automatic speech recognition (ASR) system, specifically tailored for Hungarian, a language characterized by its complex morphology. The authors implemented a pre-training strategy, leveraging parliamentary text along with a GPT-2 transformer language model. This was followed by fine-tuning the model to align with the targeted domain and generating training text for a bidirectional neural language model (BNLM). The findings reveal that data augmentation via Transformer-based methods proves effective for the morphologically intricate Hungarian language, contingent upon a sufficiently extensive vocabulary and a robust BNLM. Tarján et al. (2020) pre-trained a GPT-2 Transformer language model on a general text corpus and fine-tuned it on a Hungarian conversational call center ASR task. Subsequently, this model is utilized to generate training text for a BNLM. Zhou et al. (2021) proposed a novel data augmentation method, FlipDA, that jointly uses a generative model and a classifier to generate label-flipped data. Central to the idea of FlipDA is the discovery that generating label-flipped data is more crucial to performance than generating label-preserved data. Based on this observation, FlipDA first generates data using word substitution based on a pre-trained T5 and uses a classifier to select label-flipped data. Guo et al. (2022a) presented GENIUS, a text generation model that operates based on conditional input in the form of sketches. GENIUS is designed to fill in the missing contexts in a given sketch. Additionally, the study demonstrates that GENIUS can serve as a powerful and readily applicable tool for data augmentation in various NLP tasks.
InPars (Bonifacio et al., 2022), a method for generating synthetic training data for information retrieval tasks, utilizes LLMs in a few-shot manner. It generates one question per document by employing GPT-3’s Curie model, while using the “vanilla” and “guided by bad questions” (GBQ) prompt templates. Khatri et al. (2022) illustrated that the capabilities of a substantial pre-trained transformer-based language model, such as GPT-2, can be effectively utilized to enrich limited datasets created by humans. This enhancement process is designed to preserve the original intent of the expanded utterances while also capturing various alternative expressions for the same intent. Consequently, this methodology leads to a notable improvement in the performance of chatbots driven by machine learning, enabling them to respond more accurately and diversely in conversational contexts. Quteineh et al. (2020) introduced a new method of data augmentation that leverages the guided outputs of a language generation model such as GPT-2 to enhance the performance of text classifiers through an active learning process, which aims to generate synthetic data as unlabeled data that is required by an active learning algorithm. Liu et al. (2022) introduced a fresh method for creating datasets by combining the efforts of human workers and AI. They began with an established dataset called MultiNLI, which deals with natural language inference. Their approach involves employing dataset cartography to automatically identify instances that exhibit complex reasoning patterns. These patterns serve as guidelines for instructing
GPT-3 to generate new examples with similar characteristics. The generated examples are then filtered automatically and subsequently reviewed and labeled by human crowd workers. Lu and Lam (2023) proposed a novel method called easy prompt augmentation (EPA), which aims at improving the performance of LLMs via in-context learning with the paraphrasing ability of ChatGPT. This approach enables the automatic augmentation of task demonstrations by generating multiple paraphrased versions from various sources and targets. FewGen (Meng et al., 2023) utilizes few-shot samples to fine-tune a generator, generating data that enhances classification model generalization. It emphasizes label-discriminative information during tuning through a weighted maximum likelihood objective with automatically learned token weights. Inspired by the impressive text generation capabilities of modern pre-trained language models, Meng et al. (2022) presented supervision generation (SuperGen). In this approach, training data is generated by a unidirectional PLM, commonly referred to as the generator. Subsequently, a bidirectional PLM, known as the classifier, is fine-tuned on the generated texts to effectively carry out the associated task. Saakyan and Muresan (2023) proposed a framework that utilizes model distillation from ChatGPT to enhance a formality style transfer dataset and provide explanations. Furthermore, a fresh method called in-context learning from expert feedback (ICLEF) is used to integrate scarce expert human feedback, prompting ChatGPT to evaluate its own outputs and refine them accordingly. Ko et al. (2023) introduced a score-based paraphrasing method that combines score evaluation with linear interpolation. This approach assigns Likert scale scores to sentences, reflecting their linguistic variation.
The model then rewrites sentences using these scores, ensuring syntactic modification while preserving the original meaning, resulting in a seamless yet diverse linguistic transition. Schlegel et al. (2023) provided LLMs with a medical note or a snippet and then prompted them to generate hypothetical conversations simulating interactions between a doctor and a patient. These generated conversations are used as the input to train the model in summarizing patient-doctor dialogues into clinical records. Gao et al. (2022) introduced a noise-robust re-weighting framework called SUNGEN. This framework aims to automatically generate high-quality data for zero-shot classification tasks. Specifically, the synthesized data produced by the pre-trained language model (GPT-2 XL) serves as a vessel of knowledge, which is utilized to train a task-specific model with significantly fewer parameters compared to the PLM. Cai et al. (2023) employed LLaMA (Touvron et al., 2023a) as a data generator to create high-quality scientific text data, aiming to address the challenge of imbalanced data and improve the performance of the model in classifying scientific texts. Tang et al. (2023) leveraged state-of-the-art LLMs, such as ChatGPT, GPT-4, Dolly-v2 (Conover et al., 2023), and StableVicuna (Stability AI, 2023), for the generation of synthetic data in the context of security patch detection. By prompting these LLMs, explanations for the patches are generated, with explicit instructions provided for binary-detection tasks. Dolly-v2-12b and StableVicuna-13B are strategically employed to strike a balance between open-source and proprietary model contributions. The resultant dataset is
organized in the format of <patch, explanation, description, instruction>, encapsulating a comprehensive set of information for the purpose of security patch detection. Li et al. (2023b) introduced a method called data augmentation for in-context learning (DAIL). DAIL capitalizes on the idea that LLMs have a better understanding of the content they produce. Initially, it utilizes the language model to create paraphrases of the test sample and then uses majority voting to establish the ultimate outcome, considering individual predictions. Ubani et al. (2023) introduced an innovative technique employing zero-shot ChatGPT prompts for data augmentation in machine learning. This method notably surpassed most baseline models in three distinct tasks and matched the performance of few-shot ChatGPT in another, showcasing its considerable potential in low-resource settings.
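The paraphrase-then-vote idea behind DAIL, as described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `dail_predict` is a hypothetical name, `paraphrase` and `classify` stand in for calls to the same language model, and including the original test sample among the voters is an assumption of this sketch.

```python
from collections import Counter
from typing import Callable, List


def dail_predict(sample: str,
                 paraphrase: Callable[[str, int], str],
                 classify: Callable[[str], str],
                 n_paraphrases: int = 3) -> str:
    """Predict a label by majority vote over a sample and its paraphrases."""
    # paraphrase(sample, i) is assumed to return the i-th paraphrase.
    candidates: List[str] = [sample]
    candidates += [paraphrase(sample, i) for i in range(n_paraphrases)]
    votes = Counter(classify(text) for text in candidates)
    # most_common(1) returns the label with the highest vote count.
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for LLM calls, used only to exercise the sketch.
_TOY_PARAPHRASES = ["the film was great", "what a great film", "film: meh"]


def toy_paraphrase(sample: str, index: int) -> str:
    return _TOY_PARAPHRASES[index]


def toy_classify(text: str) -> str:
    return "positive" if "great" in text else "negative"
```

Here three of the four candidates for "a great film" would be classified as positive, so the vote returns "positive" even though one paraphrase is misclassified.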

3.3 Paired Data Augmentation

MixGen (Hao et al., 2023), a data augmentation method for vision-language representation learning, generates image-text pairs with preserved semantic relationships through image interpolation and text concatenation. Bakhtiarnia et al. (2023) proposed a method called PromptMix that extracts text descriptions from existing datasets, uses the extracted text as input to latent diffusion models to generate images similar to those in existing datasets, annotates the generated images using high-performing heavy-weight networks, and mixes this synthetic dataset with real data to improve the training of light-weight deep neural networks. To address the problem of reporting bias in visual language datasets, in particular the potentially detrimental effect of object attribute associations on trained models, Wu et al. (2023b) proposed a bimodal augmentation method called BiAug. This method utilizes object attribute decoupling to synthesize different visual language examples and create cross-modal hard negatives. The integration of an LLM and a grounded object detector facilitates the extraction of target objects, where the LLM provides detailed attribute descriptions for each object. These descriptions, along with the corresponding hard negatives, are then used to generate images via an inpainting model. This explicit process introduces missing objects and attributes for learning, where the hard negatives guide the model to distinguish object attributes.
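The two MixGen operations named above, image interpolation and text concatenation, can be illustrated with a short sketch. It is a simplified illustration rather than the authors' code: images are represented as nested lists of pixel values, and the function name `mixgen` and the default mixing weight of 0.5 are assumptions of this example.

```python
from typing import List, Tuple

Image = List[List[float]]  # rows of pixel intensities (single channel)


def mixgen(image_a: Image, image_b: Image,
           text_a: str, text_b: str,
           lam: float = 0.5) -> Tuple[Image, str]:
    """Create one new image-text pair from two existing pairs.

    The image is a pixel-wise linear interpolation; the text is the two
    captions concatenated, so both descriptions remain valid for the mix.
    """
    if len(image_a) != len(image_b) or any(
            len(ra) != len(rb) for ra, rb in zip(image_a, image_b)):
        raise ValueError("images must have the same shape")
    mixed_image = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_text = f"{text_a} {text_b}"
    return mixed_image, mixed_text
```

Because the caption keeps both source descriptions, the mixed image still matches its text, which is how the image-text semantic relationship is preserved.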

4 DATA POST PROCESSING

Post-processing of augmented data is crucial for filtering out unsuitable instances. As shown in TABLE 5, the post-processing techniques involve the application of various methodologies, including Top-K selection, model-based approaches, score-based approaches, and cluster-based approaches. These post-processing techniques collectively contribute to refining the augmented dataset for optimal performance in subsequent tasks.

4.1 Top-K Selection

Top-K selection involves retaining the top-K relevant and significant instances based on pre-defined criteria. In the final stage of constructing their training dataset,
Bonifacio et al. (2022) selected the top K pairs ranked by log probability. Specifically, fine-tuning was limited to the top K = 10,000 data pairs. Notably, fine-tuning across all 100,000 synthetic examples led to a 4% decrease in MRR@10 for the MS MARCO dataset, compared to the approach of filtering the top K pairs. This finding underscores the importance of strategic data selection in model fine-tuning. To construct a training set, Meng et al. (2022) generated more samples than needed and selected training data based on a scoring formula. For all tasks except CoLA, the top K samples of each class are selected. In contrast, for CoLA, the top K samples are used as linguistically acceptable training samples, and the last K samples are used as linguistically unacceptable sequences. Additionally, to achieve better fine-tuning stability and generalization, they applied two regularization techniques: label smoothing and temporal ensembling. Vu et al. (2021) designed methods including overgeneration and filtering to improve the quantity and quality of training data for synthetic NLI. Specifically, a top-K sampling technique (with K = 40) is applied to generate 100 output samples per input, with duplicates removed. Subsequently, a BERT model, fine-tuned on the original-format MultiNLI (MNLI) dataset (Williams et al., 2017), acts as an NLI classifier to filter synthetic training examples. A synthetic example is retained only if the NLI classifier produces the same label as the one supplied to the NLI data generator and does so with high confidence.
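The score-ranked selection described above can be sketched as follows. This is a minimal illustration, not code from the cited works: the `(example, score)` tuple layout and the function names are assumptions, with the score standing in for whatever criterion a method uses (for instance a log probability). The second helper mirrors the CoLA-style variant in which both ends of the ranking are kept.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (example, score); higher score = better


def top_k(examples: List[Scored], k: int) -> List[Scored]:
    """Keep the k highest-scoring examples, best first."""
    if k <= 0:
        return []
    return sorted(examples, key=lambda e: e[1], reverse=True)[:k]


def top_and_bottom_k(examples: List[Scored],
                     k: int) -> Tuple[List[Scored], List[Scored]]:
    """Return (top k, bottom k) of the ranking, e.g. to use the best
    samples as positives and the worst as negatives."""
    if k <= 0:
        return [], []
    if 2 * k > len(examples):
        raise ValueError("need at least 2*k examples to avoid overlap")
    ranked = sorted(examples, key=lambda e: e[1], reverse=True)
    return ranked[:k], ranked[-k:]
```

In practice K is a tuning knob: a small K keeps only the most reliable synthetic examples, while a large K admits more data at the cost of more noise.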

4.2 Model-based Approaches

Model-based approaches leverage the knowledge and characteristics of the underlying models to refine and select augmented data. Despite the considerable generative capacity of LLMs, they tend to produce open-ended questions that cannot be resolved solely based on the input context. With this in mind, Sachdeva et al. (2023) used the Flan-UL2 model to check generated QA pairs against their context, aiming to determine the consistency between the context and the answer. The model makes a binary decision, labeling outputs as either “true” or “false.” Consequently, instances where the context lacks sufficient information are eliminated. To salvage questions mistakenly discarded due to contextual relevance filtering, Sachdeva et al. (2023) adopted a round-trip consistency approach (Alberti et al., 2019; Fang et al., 2020). This method employs existing QA models to answer questions generated by the LLM, ensuring the predicted answers align with the target answers generated by the LLM. In the noise filtering process, an ensemble of three LLMs, initialized with different random seeds during inference, is utilized. This technique maintains instances where at least two of the generated context-filtered questions (CFs) concur, successfully retaining between 90% and 95% of the data that would otherwise be discarded due to contextual relevance filtering as per the DuoQAG method. Samuel et al. (2023b) used a round-trip filtration technique to improve the quality of synthetic question-answer pairs generated by GPT-4. It involves providing the question back to the model without the answer, allowing the model to attempt to answer the question again based on the context. If the model’s newly generated answer matches the original synthetic answer, the
QA pair is retained as it indicates a high-quality question with a consistent answer. If the answers do not match, the synthetic QA pair is discarded under the assumption that the question is flawed in some way. This helps to improve the quality of synthetic data, which in turn can improve the performance of downstream tasks. Sharma et al. (2023) conducted a validation of the generated sequence logic parses, scrutinizing them for incorrect bracket placements and the occurrence of out-of-vocabulary (OOV) intents and slots. To rectify OOV intents, the model is re-prompted to replace them with the appropriate intents, ensuring that any intents beyond the first are substituted accordingly. In situations involving OOV slots, these are excluded from the sequence, while the corresponding slot words are retained. To solve the problem that LLMs often generate utterances that belong to a closely-related intent rather than the desired one, Sahu et al. (2022) presented a prompting-based GPT classifier that acts as a filter to remove unfaithful examples and improves the quality of the training set. Specifically, the approach involves rejecting generated examples if they are identified by the GPT-3 classifier as not belonging to the seed intent. The fidelity of the generated data is greatly enhanced by applying this filtering method to both the HWU64 dataset and the Banking77 dataset. Zhou et al. (2021) introduced a data selection methodology that filters samples generated by a pre-trained T5 model using a classifier trained without data augmentation. This proposed method encompasses two steps. In the first step, generated candidate samples whose labels, as predicted by the classifier, differ from the original ones are eliminated. The second step involves categorizing the remaining candidates by their labels and selecting those with the highest probability within each group. Wu et al.
(2023c) used the FENSE disﬂuency detector (Zhou et al., 2022) to remove the mixture of two audio titles with poor quality, which is produced by ChatGPT.
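The round-trip filtering described above can be illustrated with a minimal sketch. The names `QAPair`, `round_trip_filter`, `normalize`, and `toy_answer_fn` are hypothetical and do not come from the cited works; the exact-match comparison after light normalization is an assumption, since the surveyed papers may use a softer matching criterion, and the toy answering function merely stands in for a real QA model or LLM.

```python
from typing import Callable, Iterable, List, NamedTuple


class QAPair(NamedTuple):
    context: str
    question: str
    answer: str  # the synthetic (target) answer produced by the generator


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def round_trip_filter(
    pairs: Iterable[QAPair],
    answer_fn: Callable[[str, str], str],
) -> List[QAPair]:
    """Keep only pairs whose re-predicted answer matches the synthetic answer."""
    kept = []
    for pair in pairs:
        # The answering model sees the context and question, never the answer.
        predicted = answer_fn(pair.context, pair.question)
        if normalize(predicted) == normalize(pair.answer):
            kept.append(pair)
    return kept


def toy_answer_fn(context: str, question: str) -> str:
    # Stand-in for a real QA model: returns the last word of the context.
    return context.rstrip(".").split()[-1]


pairs = [
    QAPair("The capital of France is Paris.", "What is the capital of France?", "Paris"),
    QAPair("The capital of France is Paris.", "Who founded Paris?", "Caesar"),
]
kept = round_trip_filter(pairs, toy_answer_fn)
```

In this toy run only the first pair survives, because the second question cannot be answered consistently from the context.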

TABLE 5 Summary of data post processing including Top-K Selection, Model-based approaches, Score-based approaches and Cluster-based approaches (ordered by the name of the ﬁrst author).

4.3 Score-based Approaches

Score-based approaches assign scores to instances, allowing for the prioritization of those with higher relevance. DiffuseExpand (Shao et al., 2023) utilizes a neural network to discern and retain high-quality image-mask pairs while filtering out those of lower quality. Specifically, a well-trained neural network evaluates the Dice loss associated with each synthesized image-mask pair. Samples exhibiting a Dice loss below a specified threshold η are preserved, whereas those surpassing this threshold are excluded. This approach effectively eliminates pairs that are misaligned or inadequately synthesized. Wu et al. (2023d) proposed a high-quality synthetic image selection method. In the selection stage, the synthetic image is fed into a classifier, and a global max-pooling (GMP) operation is then applied to generate the image-level prediction score. Subsequently, classes with scores exceeding a specific threshold are considered the actual labels of the generated image. If the actual labels are a subset of the labels associated with the input image, the generated sample is included in the synthetic dataset. Chowdhury and Chadha (2023) screened QA pairs according to round-trip consistency (Alberti et al., 2019), which is ensured by requiring the value of an auxiliary function to exceed a set threshold. In the final post-processing stage, Zheng et al. (2023) eliminated three types of undesirable situations: (1) failures in enhancement, encompassing the generation of non-dialogical content, incomplete dialogues, and cue word leakage; (2) harmful self-reinforcement, which targets the model's inclination to replicate patterns, particularly in instances of imbalanced corpus counts or consecutive speaker statements; and (3) distributional gaps with respect to ESConv (Liu et al., 2021b). Additionally, criteria for the total number and length of utterances were established to minimize significant distributional gaps with ESConv and foster in-depth discussions with an ample number of dialogue rounds. These thresholds were determined based on both the authors' heuristics and ESConv statistics. Liu et al. (2022) introduced an automatic filtering approach with the primary objective of selecting and retaining the most ambiguous examples from a generated dataset. Initially, failure examples produced by GPT-3 are discarded through a straightforward heuristic method. The process then introduces a novel metric, termed "estimated max variability," designed to assess the ambiguity of each remaining unlabeled example without necessitating additional training. This metric computes the maximum potential variability in the predicted labels for a given example. Following this, an equal number of examples exhibiting the highest max variability are retained from each intended label class, ensuring a balanced representation of ambiguity across the dataset. The resulting filtered dataset, called D_filtered, is half the size of the original generated dataset, D_gen. These filtered examples play a crucial role in training a more robust model that can effectively handle a wider array of inputs and is less prone to overfitting specific patterns in the training data.
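As a concrete illustration of score-based filtering, the Dice-loss thresholding described for DiffuseExpand earlier in this subsection can be sketched as follows. This is a simplified sketch under stated assumptions, not the authors' implementation: masks are flattened binary lists, the "predicted" mask is assumed to come from the well-trained segmentation network applied to the synthesized image, and the names `dice_loss`, `filter_by_dice`, `eps`, and the threshold value are hypothetical. Pairs whose loss equals η are dropped here, a boundary case the description leaves open.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask (assumed representation)


def dice_loss(predicted: Mask, synthetic: Mask, eps: float = 1e-6) -> float:
    """Dice loss = 1 - 2|A ∩ B| / (|A| + |B|), smoothed by eps."""
    intersection = sum(p * s for p, s in zip(predicted, synthetic))
    total = sum(predicted) + sum(synthetic)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def filter_by_dice(pairs: List[Tuple[Mask, Mask]], eta: float) -> List[int]:
    """Return indices of (predicted, synthetic) mask pairs with loss below eta."""
    return [i for i, (pred, syn) in enumerate(pairs) if dice_loss(pred, syn) < eta]


pairs = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfectly aligned pair
    ([1, 1, 0, 0], [0, 0, 1, 1]),  # completely misaligned pair
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # partially aligned pair
]
kept = filter_by_dice(pairs, eta=0.5)
```

With the illustrative threshold of 0.5, the aligned and partially aligned pairs are kept and the misaligned pair is discarded.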

4.4 Cluster-based Approaches

Cluster-based approaches aggregate similar instances, thereby aiding in identifying and eliminating redundant or less informative data. To evaluate the effectiveness of the proposed augmentation method for downstream segmentation tasks, Yu et al. (2023) created four subsets from each training dataset. This process entailed cropping images into 256 × 256 pixel patches, extracting features using a pre-trained ResNet-50, classifying these patches into six groups through K-means clustering, and then selecting patches proximal to the cluster centers.
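The cluster-then-select step can be sketched with a small, dependency-free K-means. This is only an illustration under assumptions: the feature vectors below are toy 2-D points rather than ResNet-50 features, the deterministic initialization from the first k points is chosen for reproducibility, and the function names (`kmeans`, `select_near_centers`) are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point: Vector, centers: List[List[float]]) -> int:
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))


def kmeans(points: Sequence[Vector], k: int, iterations: int = 20) -> List[List[float]]:
    # Assumed initialization: the first k points serve as starting centers.
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def select_near_centers(
    points: Sequence[Vector], centers: List[List[float]], per_cluster: int
) -> List[int]:
    """Return indices of the per_cluster points closest to each cluster center."""
    ranked: List[list] = [[] for _ in centers]
    for index, p in enumerate(points):
        c = nearest_center(p, centers)
        ranked[c].append((squared_distance(p, centers[c]), index))
    selected: List[int] = []
    for members in ranked:
        selected.extend(index for _, index in sorted(members)[:per_cluster])
    return sorted(selected)


# Toy stand-ins for patch features; two well-separated groups.
features = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0), (2.0, 0.0), (11.0, 10.0), (12.0, 10.0)]
centers = kmeans(features, k=2)
chosen = select_near_centers(features, centers, per_cluster=1)
```

In practice one would replace the toy features with embeddings from a pre-trained network and use an established clustering library; the selection logic stays the same.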

5 APPLICATIONS

Applying the aforementioned methods for data augmentation has proven to be highly effective in downstream tasks. These tasks encompass natural language processing, computer vision, and audio processing, demonstrating significant performance improvements. TABLE 6 provides a comprehensive summary and presentation of the existing methods.

5.1 Natural Language Processing

Augmented text data plays a crucial role in enhancing the performance of NLP tasks, including text classification, question answering (QA), machine translation (MT), natural language inference (NLI), dialogue summarizing (DS), and others. By expanding and diversifying the dataset, text augmentation contributes to a deeper and more nuanced understanding of language variations and contexts.

5.1.1 Text classiﬁcation

Dai et al. (2023) introduced a text data augmentation approach based on ChatGPT (AugGPT) that rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. This method shows superior performance over 19 other methods, such as CounterFittedEmbedding (Alzantot et al., 2018; Mrkšić et al., 2016) and InsertWordByGoogleNewsEmbedding (Ma, 2019), in enhancing BERT models across the Amazon (+8.2%), Symptoms (+25.3%), and PubMed20K (+4.3%) datasets. Jo et al. (2022) proposed a data augmentation method named DAG, where T5-base is used as a generation model. To evaluate the performance of DAG, text classification experiments are conducted. Compared to the case without data augmentation, applying DAG improves the accuracy of the TRAIN-ALL sampling strategy (Jo et al., 2022) by 0.02% to 0.04% on AGNews (Zhang et al., 2015), 20 Newsgroups (Lang, 1995), TREC (Li and Roth, 2002), and R8 (Debole and Sebastiani, 2005), and improves the accuracy of the proposed TRAIN-HALF sampling strategy by 0.2% to 0.65% on IMDB (Maas et al., 2011), AGNews, 20 Newsgroups, and R8. Kumar et al. (2020) explored three methods, using auto-regressive models (Radford et al., 2019), auto-encoder models (Devlin et al., 2018), and seq2seq models (Lewis et al., 2019), respectively, for conditional data augmentation on text classification datasets including SST-2 (Socher et al., 2013), SNIPS (Coucke et al., 2018), and TREC. The experimental results indicate that the pre-trained seq2seq model BART performs better than the other data augmentation approaches across all datasets. In terms of the semantic fidelity of the generated data, auto-encoder-based methods outperform both auto-regressive models like GPT-2 and seq2seq-based models like BART. Regarding diversity, however, the pre-trained-model-based methods proposed in that work show no advantage over other methods such as EDA, back-translation (Sennrich et al., 2015), and CBERT (Wu et al., 2019). GPT3Mix (Yoo et al., 2021), a method for generating synthetic text samples utilizing large-scale language models such as GPT-3, is evaluated on seven text classification datasets, including SST-2, CR (Hu and Liu, 2004), and TREC6 (Voorhees et al., 1999). Compared to EDA, back-translation (BT) (Fadaee et al., 2017), and TMix (Chen et al., 2020), GPT3Mix is significantly superior on most datasets. Considering the average classification accuracy across all tasks, GPT3Mix notably enhances performance for both DistilBERT-base (Sanh et al., 2019) and BERT-base models, achieving improvements ranging from 3.7% to 10.8% and 3.9% to 8.9%, respectively. In contrast, other methods demonstrate minimal or no improvement in this regard. Gao et al. (2022) proposed a noise-robust framework, SUNGEN, that can automatically construct high-quality synthetic datasets. Evaluated on eight text classification tasks, SUNGEN outperforms ZEROGEN (Ye et al., 2022a) across all the tasks. In particular, compared to DistilBERT (Sanh et al., 2019), the improvement for LSTM is more significant, with an average relative improvement of 9.8% over ZEROGEN. Additionally, SUNGEN-LSTM achieves better performance than ZEROGEN-DistilBERT on Rotten Tomatoes (Pang and Lee, 2005) and Yelp (Zhang et al., 2015), even though the LSTM, unlike DistilBERT, requires no pre-training and has significantly fewer parameters. Cai et al. (2023) applied large-scale language models to data augmentation in imbalanced hierarchical scientific text classification. The results indicate that using 1000 synthesized samples can significantly improve the overall performance of the model, since the MicroF1, MacroF1, Recall, and Precision scores all increase. Nevertheless, the performance enhancement observed with 350 synthesized samples is not as pronounced as that with 1000 synthesized samples, indicating that the quantity of synthesized data plays a crucial role. Guo et al. (2022a) introduced GENIUS, a conditional text generation model designed as a versatile and ready-to-use data augmentation tool for various NLP tasks. Experiments are conducted on six text classification datasets for low-resource text classification tasks. In in-distribution evaluations, two variants based on GENIUS, GeniusAug and GeniusAug-f (further fine-tuned on downstream training sets), demonstrate notable performance enhancements for the base classifier, with average improvements of approximately 2% and 3%, respectively. In out-of-distribution

TABLE 6 Summary of applying data augmentation to downstream tasks: NLP (Natural Language Processing, including TC (Text Classiﬁcation), QA (Question Answering), MT (Machine Translation), NLI (Natural Language Inference), DS (Dialogue Summarising) and Others), CV (Computer Vision, including IC (Image Classiﬁcation), SS (Semantic Segmentation) and OD (Object Detection)), ASP (Audio Signal Processing) (ordered by the name of the ﬁrst author).

Bonifacio et al. (2022) Cai et al. (2023) Chen et al. (2023b) Chowdhury and Chadha (2023) Dai et al. (2023) Du et al. (2023) Dunlap et al. (2023) Gao et al. (2022) Guo et al. (2022a) Jo et al. (2022) Kim et al. (2023) Kumar et al. (2020) Latif et al. (2023) Li et al. (2023b) Liu et al. (2022) Lu and Lam (2023) Lu et al. (2023) Meng et al. (2023) Oh et al. (2023) Sachdeva et al. (2023) Samuel et al. (2023b) Samuel et al. (2023a) Saakyan and Muresan (2023) Schlegel et al. (2023) Schnell et al. (2023) Sharma et al. (2023) Tarj´an et al. (2020) Thakur et al. (2020) Trabucco et al. (2023) Voetman et al. (2023) Wu et al. (2023d) Wu et al. (2023c) Yin et al. (2023) Yoo et al. (2021) Yu et al. (2023) Zang et al. (2023) Zhang et al. (2023d) Zheng et al. (2023)

evaluations, GeniusAug and GeniusAug-f outperform GENIUS under known distributions, showing average performance improvements of around 5% and 7%, respectively, compared to the base classifier. Furthermore, when compared to other data augmentation methods, including EDA, STA (Guo et al., 2022b), BackTrans (Silfverberg et al., 2017), MLM (Kumar et al., 2020), C-MLM (Kumar et al., 2020), and LAMBADA (Anaby-Tavor et al., 2020), the proposed approach proved superior to most of these methods. Meng et al. (2023) proposed a method called FewGen, which uses a generator to synthesize a large number of new training samples in order to enhance the original training set and improve the performance of classification tasks. FewGen consistently outperforms previous state-of-the-art few-shot methods, such as LM-BFF (Gao et al., 2020), P-Tuning (Liu et al., 2023b), and DART (Zhang et al., 2021b), by a margin of more than 5% on average across GLUE tasks (Wang et al., 2018a). It also achieves a performance improvement of more than 3% over GPT3Mix, which relies on a generation model 100 times larger than that of FewGen. Saakyan and Muresan (2023) utilized the PAN 2022 dataset to ascertain whether two texts originate from the same author. The AlpacaIF→F model is employed to extract explanations comprising informality attributes and evidence. Despite the exclusion of evidence in the simplistic approach, the resulting explanations achieve a classification performance of 0.60 AUC, indicating their prospective use as interpretable authorship features in future investigations. For in-context learning (ICL), Li et al. (2023b) proposed a data augmentation technique called DAIL. DAIL augments test samples by generating multiple paraphrases and combines the individual results through ensembling to derive the final prediction. Comparative experiments are conducted between DAIL, standard ICL, and alternative ensemble-based methods, demonstrating the efficacy of DAIL, particularly in low-resource ICL scenarios. Additionally, voting consistency is explored as a method for estimating confidence, revealing a positive correlation between voting consistency and model accuracy.
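The paraphrase-and-vote idea behind DAIL, together with voting consistency as a confidence estimate, can be sketched as below. This is a hedged illustration rather than the method of Li et al. (2023b): the function names are hypothetical, including the original sample among the voters is an assumption, and the keyword classifier and fixed paraphrase list are toy stand-ins for LLM calls.

```python
from collections import Counter
from typing import Callable, List, Tuple


def ensemble_predict(
    sample: str,
    paraphrase_fn: Callable[[str], List[str]],
    classify_fn: Callable[[str], str],
) -> Tuple[str, float]:
    """Majority vote over the sample and its paraphrases.

    Returns the winning label and the voting consistency, i.e. the
    fraction of voters that agree with the winning label.
    """
    candidates = [sample] + paraphrase_fn(sample)  # assumption: original also votes
    votes = Counter(classify_fn(text) for text in candidates)
    label, count = votes.most_common(1)[0]
    return label, count / len(candidates)


def toy_paraphrase_fn(text: str) -> List[str]:
    # Stand-in for LLM-generated paraphrases of the test sample.
    return ["The film was great", "I liked the movie", "A good film"]


def toy_classify_fn(text: str) -> str:
    # Stand-in for an in-context-learning classifier.
    lowered = text.lower()
    return "positive" if ("good" in lowered or "great" in lowered) else "negative"


label, consistency = ensemble_predict("The movie was good", toy_paraphrase_fn, toy_classify_fn)
```

Here three of the four voters agree, so the consistency is 0.75; a low value could be used to flag predictions that deserve less trust.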

5.1.2 Question answering

Chen et al. (2023b) compared the performance of MINPROMPT, a robust data augmentation framework, with four few-shot question-answering methods, including RoBERTa, SpanBERT (Joshi et al., 2020), Splinter (Ram et al., 2021), and FewshotQA (Chada and Natarajan, 2021). MINPROMPT achieves the highest average F1 score in all cases, surpassing the second-best method, FewshotQA, by a margin of 0.2 to 1.3. In addition, as the number of few-shot QA training samples decreases, the improvement of MINPROMPT becomes more pronounced because MINPROMPT incorporates external prior knowledge that is not present in the actual training samples. Chowdhury and Chadha (2023) demonstrated how to use generated data to enhance reading comprehension. The RoBERTa-Base model is trained using the SQuAD-v1.1 dataset (Rajpurkar et al., 2016) for reading comprehension across all the experiments. Training the model on both the real and generated data achieves the highest exact match (EM) and F1 scores on all datasets, showing that a mixture of real and generated data can balance robustness and accuracy. A GPT-4-based data augmentation approach for low-resource machine reading comprehension is introduced by Samuel et al. (2023b). This approach uses a RoBERTa-base model as the extractive reading comprehension model across all experiments. For the CovidQA dataset (Möller et al., 2020), using one-shot or two-shot synthetic data improves both the EM and F1 scores compared to using only the original training data. For the PolicyQA dataset (Ahmad et al., 2020), augmenting the original training data with one-shot synthetic data improves the EM and F1 scores by 1.6 and 1.5, respectively, compared to using just the original examples. Sachdeva et al. (2023) harnessed the capabilities of LLMs to augment the training data of small language models (SLMs) by generating counterfactual (CF) instances. The EM scores of the RoBERTa-base model, enhanced with CF data, showed improvements across six out-of-distribution (OOD) datasets. The proposed method exceeds the performance of baseline models on all OOD datasets, with the exception of the NewsQA dataset (Trischler et al., 2016). This exception might be due to the elevated level of inferential complexity required by NewsQA, posing a challenge for LLMs in generating suitable data. Additionally, instruction-tuned versions of various models like GPT-NeoXT and LLaMA achieved notable success on specific datasets, such as TriviaQA (Joshi et al., 2017) and BioASQ (Tsatsaronis et al., 2015), while Flan-UL2 (Tay et al., 2022) demonstrated exceptional performance on the SQuAD-adversarial (Jia and Liang, 2017), HotpotQA (Yang et al., 2018), and NQ datasets (Kwiatkowski et al., 2019). This variation in performance highlights the benefits of diverse data augmentation methods across different datasets. Furthermore, the study observed that larger-scale models like LLaMA, GPT-NeoXT, and Flan-UL2 displayed superior performance, suggesting that the size and training data of CF generation models are critical factors. Interestingly, even the smaller, instruction-tuned GPT-JT model provided substantial improvements on OOD datasets, increasing the EM score by approximately 3 compared to the baseline model, underscoring the effectiveness of these augmentation strategies in enhancing model performance across varied datasets.
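Since the results in this subsection are reported as exact match (EM) and F1, a minimal sketch of how these two metrics are commonly computed for extractive QA may be helpful. It follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles, token-overlap F1); whether each surveyed paper uses exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower", "eiffel tower.")
f1 = token_f1("Eiffel Tower in Paris", "Eiffel Tower")
```

Dataset-level scores are typically the average of these per-example values, usually taking the maximum over multiple reference answers when they are available.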

5.1.3 Machine translation

Lu and Lam (2023) introduced an innovative method, easy prompt augmentation (EPA), which is designed to enhance the performance of LLMs through the automatic augmentation of demonstrations. In machine translation tasks, EPA consistently achieves notable improvements. Specifically, it attains gains of up to 6x chrF++ (Popović, 2015) points in low-resource languages and up to 3x chrF++ points in high-resource languages. This method addresses the challenge posed by the limited availability of large parallel corpora in Neural Machine Translation (NMT). Complementing this approach, Oh et al. (2023) conducted prompt-based data augmentation experiments using LLMs like ChatGPT (OpenAI, 2023b). They employed three distinct types of prompts, Paraphrase, Multi-Target, and Storytelling, to generate synthetic data. Remarkably, applying the storytelling method led to an improvement in the BLEU score (Papineni et al., 2002) of the mBART-50 model (Tang et al., 2020), outperforming the baselines. The most significant enhancement was observed when the volume of synthetic parallel data was double that of the original parallel data, achieving a BLEU score of 29.17. These findings underscore the potential of creative data augmentation methods in improving NMT, particularly in scenarios with limited language resources.

5.1.4 Natural language inference

EPA can also be used as a demonstration augmentation method in natural language inference (NLI) (Lu and Lam, 2023). Extensive experiments demonstrate that EPA is highly effective in enhancing performance in NLI tasks. Specifically, EPA outperforms GPT in all cases and obtains an improvement of 5.01 accuracy points on MNLI (Williams et al., 2017) with 3-shot in-context learning. Liu et al. (2022) introduced an innovative dataset creation method named Worker and AI NLI (WANLI), which synergizes the generative capabilities of language models with human evaluative skills. In comparisons of models trained separately on MultiNLI and WANLI, the models trained on WANLI consistently exhibited enhanced performance. This improvement is particularly noteworthy considering that WANLI is only a quarter of the size of MultiNLI and is predominantly composed of machine-generated examples. More specifically, models trained on WANLI showed a substantial performance increase: a 4% rise on the Diagnostics test set (Wang et al., 2018a), an 11% improvement on the HANS test set (McCoy et al., 2019), and a 9% increase on the Adversarial NLI (ANLI) test set (Nie et al., 2019). These significant gains highlight the effectiveness of combining language model generation with human assessment in creating high-quality, efficient datasets for NLI tasks.

5.1.5 Dialogue summarising

When EPA (Lu and Lam, 2023) is employed as a method for demonstration augmentation in dialogue summarization, it achieves improvements of up to 0.79 in F1 scores on ROUGE-L (Lin, 2004). Schlegel et al. (2023) introduced PULSAR for the ImageClef 2023 MediQA-Sum task, which involves summarizing patient-doctor dialogues into clinical records. In this context, large models were utilized to synthesize data for augmentation. With the provided training data for Task B (1,201 training examples, 100 validation examples, and 200 test examples), data augmentation does not yield notable score improvements for larger models. However, for smaller models, data augmentation led to minor improvements. In contrast, under conditions of data scarcity, such as Task C (67 training examples, 20 validation examples, and 40 test examples), data augmentation significantly boosts performance across all metrics. Specifically, ROUGE-1 increased from 27.64 to 29.41, ROUGE-2 from 9.79 to 11.60, ROUGE-L from 16.24 to 19.18, and ROUGE-LSum from 23.63 to 26.08 (Lin, 2004). This highlights the efficacy of data augmentation in enhancing model performance, particularly in low-resource scenarios.

5.1.6 Others

Zheng et al. (2023) developed an innovative approach for dialogue augmentation and created an augmentation dataset named AUGESC, specifically for emotional support conversation (ESC). Through interactive human evaluations using two BlenderBot models (Roller et al., 2020), one fine-tuned on ESConv (Liu et al., 2021b) and the other further fine-tuned on AUGESC, it is demonstrated that AUGESC substantially enhances ESC performance in terms of Fluency, Identification, Comforting, and Suggestion. Thakur et al. (2020) introduced Augmented SBERT (AugSBERT), a data augmentation strategy, and evaluated it in both in-domain and domain adaptation tasks. Across all in-domain datasets, including Spanish-STS (Agirre et al., 2014), BWS (cross-topic) (Stab et al., 2018), BWS (in-topic) (Stab et al., 2018), Quora-QP (Wang et al., 2017), and MRPC (Dolan et al., 2004), AugSBERT outperformed the bi-encoder SBERT (Reimers and Gurevych, 2019) by 1 to 6 points (Spearman's rank correlation ρ × 100 and F1 score of the positive class). It also surpassed the synonym replacement data augmentation technique in all tasks, which even had adverse effects on BWS and Quora-QP. Compared to the off-the-shelf USE model, AugSBERT showed improvements in performance metrics by 3 to 12 points for all tasks except Spanish-STS. In domain adaptation tasks, AugSBERT exceeded SBERT in nearly all source-target combination schemes using the Quora-Sprint combination, with improvements reaching up to 37 points (AUC(0.05) scores). AugSBERT is observed to perform better when the source domain is more general and the target domain more specialized. Bonifacio et al. (2022) proposed InPars, a method leveraging pre-trained models to generate synthetic data for enhancing information retrieval performance. It is observed that unsupervised models fine-tuned with InPars outperform models of the same size in OpenAI's Search API. For instance, T5 with InPars, having 3 billion parameters, surpassed the larger Curie and Davinci models (Neelakantan et al., 2022) by a considerable margin. InPars also outperformed Contriever (Izacard et al., 2021) and cpt-text (Neelakantan et al., 2022). The efficacy of InPars and OpenAI's Search, which includes re-ranking documents using BM25 (Robertson et al., 1995), exceeds the performance reported by Neelakantan et al. (2022). In paraphrasing tasks, EPA (Lu and Lam, 2023) demonstrates significant improvements in BLEU and ROUGE-L scores compared to ChatGPT (OpenAI, 2023b).

5.2 Computer Vision

Augmented images improve performance in computer vision tasks like image classification, semantic segmentation, and object detection. The diversified dataset enhances model accuracy and understanding of diverse visual contexts, enabling more robust and versatile applications.

5.2.1 Image classiﬁcation

For few-shot classification, applying SeedSelect (Samuel et al., 2023a) when fine-tuning the CLIP classifier outperforms multiple benchmark methods such as zero-shot CLIP, CoOp, Tip-Adapter, CT & SD (classifier tuning with images generated using SD), and Textual Inversion. Notably, even with just one training image, SeedSelect consistently generates valuable, diverse, and superior augmentations compared to previous methods. Dunlap et al. (2023) proposed an augmentation method, automated language-guided image augmentation (ALIA), for fine-grained classification tasks. In domain generalization experiments on iWildCam (Koh et al., 2021), an animal classification dataset, ALIA not only surpasses all baseline methods ('+CutMix', '+RandAug' (Cubuk et al., 2020), '+Txt2Img', '+Real' (Dunlap et al., 2023)), exhibiting a remarkable 17% performance boost compared to training solely on the original data, but also outperforms the addition of an equivalent amount of real data. For the fine-grained classification task on CUB, a fine-grained bird classification dataset (Wah et al., 2011), ALIA demonstrates superior performance compared to all baselines except when real data is added. These findings highlight that ALIA offers greater performance enhancements than existing data augmentation methods, even in scenarios without domain shifts. On Waterbirds (Sagawa et al., 2019), a constructed dataset with contextual bias, ALIA improves the class-balanced accuracy by 7% over other augmentation baselines while achieving in-domain accuracy similar to theirs, and it outperforms all methods except the real data baseline in terms of overall accuracy. Trabucco et al. (2023) introduced a flexible data augmentation strategy, data augmentation by fusion (DA-Fusion), and investigated two approaches, model-centric leakage prevention and data-centric leakage prevention, for avoiding the leakage of Stable Diffusion's training data. The researchers conducted experiments on few-shot image classification tasks using three classification datasets: Leafy Spurge (Trabucco et al., 2023), PascalVOC (Everingham et al., 2010), and COCO (Lin et al., 2014). With model-centric leakage prevention, DA-Fusion outperforms the baseline, which uses a standard data augmentation strategy including random rotations and flips, by about 1.8 points on PascalVOC, 5 points on COCO, and 1 point on Leafy Spurge. Additionally, DA-Fusion outperforms Real Guidance (He et al., 2022) in terms of overall performance. With data-centric leakage prevention, DA-Fusion outperforms the baseline by about 3 points on PascalVOC, 5.5 points on COCO, and 1.6 points on Leafy Spurge, and also outperforms Real Guidance in all domains. Yin et al. (2023) proposed text-to-text-to-image data augmentation (TTIDA). For in-domain classification, the experimental results on the CIFAR-100 dataset show that TTIDA outperforms all methods that add synthetic images generated by different approaches (traditional image transformations, DCGAN (Radford et al., 2015), CycleGAN (Zhu et al., 2017), StyleGAN (Karras et al., 2019)) at each synthetic ratio ("+20%", "+50%", "+100%", "+200%", "+300%", "+400%", "+500%"), and can achieve a maximum accuracy improvement of 3%. For cross-domain classification, TTIDA performs better in all situations (different source domains or target domains) compared to not using synthetic data. DiffAug (Zang et al., 2023) excels in classification, outperforming its competitors by a margin of 2.1% to 11.1% across eight evaluations on four datasets. Its data augmentation surpasses other methods, effectively addressing the lack of robust augmentation techniques in these approaches and thereby improving overall performance. The DiffAug-processed data exhibits reduced overlap between groups, leading to enhanced classification and clustering results by establishing clearer boundaries between different data categories. Lastly, the versatility of the DiffAug approach opens opportunities to enhance various unsupervised learning methods, particularly in the complex realm of effective data augmentation techniques for biological data. Koohpayegani et al. (2023) evaluated the impact of GeNIe on few-shot classification, long-tailed classification, and fine-grained classification. The classification experiments, covering both few-shot and long-tail distribution scenarios, highlight the effectiveness of GeNIe, particularly in categories with limited examples.

5.2.2 Semantic segmentation

EMIT-Diff leverages ControlNet (Zhang et al., 2023a) to generate synthetic medical images that retain vital characteristics while using edge information to guide the synthesis. The method demonstrates substantial improvements in medical image segmentation across diverse datasets, including Ultrasound breast (+13.87%), CT spleen (+0.38%), and MRI prostate (+7.78%). Du et al. (2023) enhanced ControlNet with lesion-specific visual and textual prompts for dermatoscopic image generation and showed the superiority of their framework in improving skin lesion segmentation performance, surpassing Pix2PixHD (Wang et al., 2018b) by over 5%. The effect of different amounts (e.g., 1K, 3K, 5K) of synthetic data on performance has also been explored, demonstrating that increasing the amount of data has the potential to improve the effectiveness of segmentation training. Yu et al. (2023) presented a diffusion-based method for augmenting nuclei segmentation datasets by generating synthetic nuclei structures and histopathology images, which are then integrated with synthetic instance maps into the real dataset for segmentation model training. They evaluated the effectiveness of this augmentation method by comparing the segmentation performance (Graham et al., 2019) using both the original and augmented subsets for training two nuclei segmentation models, Hover-Net (Graham et al., 2019) and PFF-Net (Liu et al., 2021a). Remarkably, the experimental results demonstrate that augmenting as little as 10% of the labeled real dataset with synthetic samples leads to segmentation performance on par with a fully-supervised baseline. In the training phase of weakly-supervised semantic segmentation (WSSS), Wu et al. (2023d) merged the synthetic dataset with the original training dataset to form the final training dataset. To evaluate the approach, it is incorporated into the existing state-of-the-art ViT-PCM (Rossetti et al., 2022) as an upstream data augmentation technique, with consistency maintained in the downstream WSSS process. Subsequently, a comparison is made with segmentation results from various state-of-the-art methods. Notably, the proposed method demonstrates superior performance to the baseline ViT-PCM, even when only 50% of the training data is utilized. Schnell et al. (2023) examined two weakly-supervised semantic segmentation methods: simple regularized losses (RLoss) (Tang et al., 2018) and adaptive Gaussian mixture models (AGMM) (Wu et al., 2023a). Both methods undergo joint training on the original and augmented training sets. The proposed framework effectively narrows the performance gap between scribble-supervised segmentation and fully-supervised segmentation. Furthermore, a notable improvement in segmentation performance on small datasets is demonstrated, surpassing the performance of even fully-supervised segmentation.

5.2.3 Object detection

Voetman et al. (2023) presented Genfusion, a framework that integrates recent developments to generate synthetic datasets. Synthetic training datasets are generated from a few real-world images using DreamBooth (Ruiz et al., 2023) fine-tuning, and these images are manually annotated. YOLOv5 (Jocher, 2020) and YOLOv8 (Jocher et al., 2023) are trained as object detectors. Evaluation is performed on the MinneApple benchmark (Häni et al., 2020). Additionally, a preliminary investigation explores the potential for fine-tuning the diffusion model to automate annotation creation. The performance of these models, evaluated on a real-world test set of 331 images, closely matches that of a baseline model trained on genuine data. Specifically, when applied to apple detection in orchards, the average precision deviation from the baseline ranges from 0.09 to 0.12. These findings underscore the viability of synthetic data generation as a pragmatic alternative for training deep models, reducing the need for extensive real-world data acquisition. WoVoGen (Lu et al., 2023), an innovative framework, addresses the difficulties linked to the creation of multi-camera driving scene videos, offering potential applications in augmenting datasets for autonomous driving. Through the utilization of a distinct 4D world volume, WoVoGen establishes consistency within the world and across sensors, overcoming challenges associated with diversity and adapting to changing lighting conditions. WoVoGen achieves superior experimental results on the nuScenes dataset (Caesar et al., 2020). Specifically, BEVDet (Huang et al., 2021) is utilized as the benchmark model, and training and evaluation are conducted for 3D object detection tasks on both the original nuScenes dataset and the nuScenes dataset generated using WoVoGen. The results indicate that the data generated by WoVoGen improved 3D object detection mAP from 34.9 to 36.2.

5.3 Audio signal processing

Through synthesizing audio signals or corresponding text, data augmentation enhances the model’s performance in audio signal processing tasks. Wu et al. (2023c) proposed a data augmentation method employing ChatGPT to ‘mix-up’ pairs of captions in the Clotho dataset (Drossos et al., 2020) and generate more complex and diverse in-domain training data for automated audio captioning. An ablation experiment shows that the ChatGPT mix-up method improves the SPIDEr-FL score by 1.9 or 2.4 points, depending on the configuration of the other model components. Compared to other top methods such as Xu et al. (2022) and Ye et al. (2022b), the proposed approach is state-of-the-art regarding the SPIDEr-FL metric and demonstrates strong performance in other metrics such as METEOR, CIDEr, SPICE, and SPIDEr. For spoken semantic parsing (SSP), when unpaired text is lacking in current textual datasets, Sharma et al. (2023) suggested prompting Llama2 to generate transcript-semantic parse data (unpaired text) for both existing and new domains. Employing the data generated by Llama2 with JAT (Just Add Text) and TTS (Text-to-Speech) can enhance the performance of SSP by 1.4% and 2.6% absolute EM for existing and new domains, respectively. Latif et al. (2023) evaluated the effectiveness of ChatGPT in annotating speech data for speech emotion recognition (SER). The results show that augmenting the data with samples carrying ChatGPT labels clearly improves the unweighted average recall (UAR) compared to using only the actual IEMOCAP labels, in both within-corpus and cross-corpus settings. Additionally, compared to previous studies, the UAR of the proposed method is about 2 to 3 points higher than the second-best result. Tarján et al. (2020) proposed subword-based neural text augmentation, an approach in which GPT-2 is first applied to generate augmentation data for an ASR language model.
This approach achieved a significant improvement in the word error rate (WER) of the online automatic speech recognition (ASR) system on Hungarian call center conversations, reducing the WER from 21.9% to 19.6%. In terms of WER and its ability to recognize out-of-vocabulary (OOV) words, subword-based neural text augmentation outperforms the original word-based data augmentation technique (Wang et al., 2019). At the same time, it maintains a relatively small vocabulary size and low memory requirements for the system. Kim et al. (2023) introduced a technique to produce high-fidelity respiratory sound samples, leveraging an audio diffusion model as a conditional neural vocoder. By employing a proposed adversarial fine-tuning approach, the generated samples are seamlessly integrated with authentic data to alleviate distribution inconsistencies between synthetic and real samples, which improves the performance of respiratory sound classification. On the ICBHI dataset (Rocha et al., 2018), the proposed method, adversarial fine-tuning with synthetic samples, achieved a 2.24% improvement in the ICBHI score and up to 26.58% improvement in accuracy for minority classes over the baseline, AST (Gong et al., 2021) fine-tuning without synthetic samples.
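To make the WER figures above concrete, the following is a minimal Python sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative assumptions and are not taken from Tarján et al. (2020), whose evaluation pipeline may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement of an error rate, e.g. WER before and after augmentation."""
    return (before - after) / before
```

Under this definition, the reported drop from 21.9% to 19.6% WER corresponds to `relative_reduction(21.9, 19.6)`, which is roughly a 10.5% relative reduction.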

6 SUMMARY

In this section, we present a consolidated overview of the key findings from our review in Sections 3, 4, and 5.

Large model-based data augmentation remains a field filled with opportunities and challenges. This survey aims to comprehensively review large model-based data augmentation approaches, the accompanying data post-processing techniques, and applications in downstream tasks. It also meticulously categorizes existing large model-based data augmentation methods. By summarizing and analyzing current works, we identify successes and failures of current methods and discern new trends in large model-based data augmentation. Furthermore, we summarize the existing methods used for evaluating large model-based data augmentation. Most importantly, these summaries can help in proposing new challenges and opportunities for future research.

6.1 The Success and Failure of Large Model-Based Data Augmentation

Large models in data augmentation exhibit a mix of successes and failures. The following summary, drawn from evaluations of current models on target datasets, provides a concise overview of these outcomes.

6.1.1 Achievements

• Large models exhibit a high level of competence in natural language understanding (NLU), as evidenced by their ability to excel in tasks such as text classification (Dai et al., 2023), question answering (Chen et al., 2023b), and natural language inference (Lu and Lam, 2023), through the comprehension and interpretation of textual data with accuracy and precision.

• Large models display a remarkable aptitude for natural language generation (NLG), as evident from their capacity to excel in tasks like machine translation (Oh et al., 2023) and dialogue summarization (Schlegel et al., 2023), where they adeptly generate coherent and contextually relevant textual outputs.

• Large models demonstrate the capacity to customize content generation while maintaining the given subject (Gal et al., 2022a; Kumari et al., 2023; Ruiz et al., 2023), showcasing their adaptability to a wide range of prompts.

• Large models exhibit remarkable proficiency in generating images through prompts (Brooks et al., 2023; Nguyen et al., 2023; Sun et al., 2023), extending their applicability to downstream tasks such as image classification (Samuel et al., 2023a), semantic segmentation (Du et al., 2023), and object detection (Lu et al., 2023). The generated data proves to be highly effective in enhancing performance in these subsequent tasks.

• Large models are proficient in handling complex audio data, demonstrating expertise in downstream applications such as automatic speech recognition (ASR) (Tarján et al., 2020), speech emotion recognition (SER) (Latif et al., 2023), spoken semantic parsing (SSP) (Sharma et al., 2023), and automated audio captioning (AAC) (Wu et al., 2023c).

6.1.2 Limitations

• Large models are limited by the weaknesses of their underlying models, such as potential misunderstanding of text prompts and complex relationships (Li et al., 2023a), sensitivity to typographic attacks (Avrahami et al., 2022), difficulty in processing images containing text characters (Wei et al., 2023), and challenges in handling multiple subject combinations (Kumari et al., 2023; Ma et al., 2023).

• Large models often encounter challenges in generating results that deviate significantly from the norm, particularly in scenarios where there is a substantial disparity between prompts and images (Sun et al., 2023; Tumanyan et al., 2023), and in cases involving the synthesis of rare or highly fictional subjects (Ma et al., 2023).

• Large models require precise prompts from users, but it is difﬁcult to provide speciﬁc instructions for complex objectives (Hertz et al., 2022), and the effectiveness of the generated output is limited by the ambiguity inherent in natural language prompts (Gal et al., 2022b).

• Large models may exhibit biases in generation (Brooks et al., 2023; Nguyen et al., 2023) and have strong priors towards certain subjects (Chen et al., 2023a).

• The text data generated by large models may potentially contain toxic or biased content, which cannot be fully assessed through either automatic or human evaluation (Zheng et al., 2023).

• Fine-tuning a large model within a specific domain can enhance the performance of data augmentation, but it requires a significant amount of resources (Kaddour and Liu, 2023).

• There is no text data augmentation method based on large models, akin to rotation or translation in image augmentation, that can be widely applied across diverse downstream tasks (Kumar et al., 2023).

• A general-domain LLM like ChatGPT may produce inaccurate augmentation results owing to its deficiency in domain-specific knowledge (Dai et al., 2023).

6.2 Protocols and Benchmarks for Evaluation

The evaluation methods for large model-based data augmentation can be divided into two categories. The first assesses the efficacy of a data augmentation method through changes in performance metrics on the datasets of the corresponding downstream tasks. The second evaluates the method by calculating quality metrics on the data generated through large model-based data augmentation. The second type, however, is relatively less explored and not widely applied.

Currently, the effectiveness of large model-based data augmentation is typically assessed using performance metrics of downstream tasks. For instance, in the fields of NLP, CV, and speech signal processing, the impact of data augmentation is measured by the improvement in model performance on corresponding datasets for these downstream tasks.

In NLP, classification accuracy is commonly used to evaluate the performance of text classification models and thereby assess the effectiveness of data augmentation (Dai et al., 2023; Jo et al., 2022; Yoo et al., 2021). In question answering tasks, exact match (EM) and F1 are two prevalent metrics for evaluating data augmentation performance (Chowdhury and Chadha, 2023; Sachdeva et al., 2023; Samuel et al., 2023b). For machine translation tasks, BLEU is used to assess precision, while chrF++ comprehensively evaluates the quality of text generation. Both metrics can measure the enhancement of machine translation performance due to data augmentation (Lu and Lam, 2023; Oh et al., 2023). In dialogue summarization, common evaluation metrics include ROUGE and its various optimized versions, which judge the quality of summaries by statistically analyzing the overlap of n-grams between the machine-generated candidate summaries and standard summaries (Lu and Lam, 2023; Schlegel et al., 2023). Additionally, in other NLP tasks like semantic textual similarity (STS), Spearman’s rank correlation is utilized to evaluate the impact of data augmentation on improving model performance (Thakur et al., 2020).

In CV, classification accuracy is commonly used to evaluate model performance in image classification tasks (Dunlap et al., 2023; Trabucco et al., 2023; Yin et al., 2023). In object detection tasks, model performance is typically assessed by calculating the average precision (AP) or mean average precision (mAP) across various categories (Lu et al., 2023; Voetman et al., 2023). In image segmentation, Dice or mean intersection over union (mIoU) are frequently used to measure the overlap between predicted and actual segmentations (Du et al., 2023; Wu et al., 2023d; Yu et al., 2023; Zhang et al., 2023d). Both of these metrics can be applied to evaluate the effectiveness of data augmentation methods.

In audio signal processing, the EM score is commonly used as an evaluation metric in spoken semantic parsing (SSP) tasks (Sharma et al., 2023). For automatic speech recognition (ASR) tasks, model performance is measured using the word error rate (WER) of online automatic speech recognition systems, as well as the ability to recognize out-of-vocabulary (OOV) words (Tarján et al., 2020). In automated audio captioning (AAC) tasks, SPIDEr-FL is employed as a performance metric (Wu et al., 2023c).
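Several of the downstream metrics listed above are simple enough to state directly in code. The sketch below is a simplified illustration in plain Python, not the official scoring script of any benchmark. It assumes whitespace tokenization and lowercasing for EM and F1 (published QA scorers typically also strip punctuation and articles), and a flat sequence of 0/1 labels for a single segmentation class. All function names are illustrative.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the two answers are identical after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def iou_and_dice(pred_mask, gold_mask):
    """IoU and Dice for one class, given equal-length flat sequences of 0/1 labels."""
    if len(pred_mask) != len(gold_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, g in zip(pred_mask, gold_mask) if p and g)
    pred_area = sum(1 for p in pred_mask if p)
    gold_area = sum(1 for g in gold_mask if g)
    union = pred_area + gold_area - intersection
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0, 1.0
    return intersection / union, 2 * intersection / (pred_area + gold_area)
```

Averaging the per-class IoU values over all classes gives mIoU. In the evaluation protocol described above, each metric would be computed once for a model trained on the original data and once for a model trained on the augmented data, and the difference taken as the effect of augmentation.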

Additionally, the quality of data generated by large model-based data augmentation methods can also be directly assessed using certain metrics. In the field of text augmentation, various methods are employed to evaluate the quality and relevance of generated data. For example, Dai et al. (2023) evaluated the augmented datasets generated by the proposed method, AugGPT, and other baseline methods using two metrics: cosine similarity and TransRate. The cosine similarity measures the similarity between the generated data and the test dataset, while the TransRate metric measures the learnability of the data. To assess the similarity between data generated by ChatGPT and the training and test data, in addition to calculating the cosine similarity, Ubani et al. (2023) employed Word Overlap, which determines the percentage of unique overlapping words between example pairs after removing stop words and punctuation.

In the field of image augmentation, three prominent metrics for assessing the quality of generated images are the Fréchet inception distance (FID) score (Heusel et al., 2017), the CLIP score (Radford et al., 2021), and the DINO score (Caron et al., 2021). The FID score measures the distance between the distributions of generated and real images in a feature space from the Inception network. A lower FID score indicates a higher similarity to real images, suggesting better quality. This metric is extensively used in evaluating generative models, particularly in image synthesis (Kumari et al., 2023; Zhang et al., 2023a). The CLIP score, on the other hand, leverages the capabilities of CLIP to assess how well the generated images align with specific textual descriptions. This makes the CLIP score a valuable tool in assessing the performance of generative models, especially in tasks that require precise alignment between text and image content (Brooks et al., 2023; Ge et al., 2023; Shi et al., 2023; Tumanyan et al., 2023). Lastly, the DINO score assesses the preservation of structural and contextual elements in generated images, using the DINO-ViT self-similarity distance. Lower DINO scores indicate better structural integrity, making this metric essential for maintaining the authenticity of image features during augmentation processes. The DINO score is extensively used due to its ability to evaluate the structural preservation in augmented images (Ruiz et al., 2023; Tumanyan et al., 2023; Wei et al., 2023).
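The direct quality metrics above can be illustrated with a short Python sketch, under several loudly stated assumptions. First, `cosine_similarity` here operates on bag-of-words counts as a stand-in for the sentence embeddings a real evaluation would use. Second, the source does not specify the denominator of Word Overlap, so `word_overlap` assumes shared unique words divided by the union of unique words. Third, `frechet_distance_diagonal` assumes diagonal covariances, a special case of the Fréchet distance between two Gaussians that underlies FID, and expects feature statistics that have already been extracted (for FID proper, from the Inception network). The stop-word list and all names are hypothetical.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in any cited work).
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"})


def content_words(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def word_overlap(text_a, text_b):
    """Percentage of unique content words shared by two texts (shared / union)."""
    words_a, words_b = set(content_words(text_a)), set(content_words(text_b))
    if not words_a or not words_b:
        return 0.0
    return 100.0 * len(words_a & words_b) / len(words_a | words_b)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    vec_a, vec_b = Counter(content_words(text_a)), Counter(content_words(text_b))
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def frechet_distance_diagonal(mean_real, var_real, mean_gen, var_gen):
    """Frechet distance between two Gaussians with diagonal covariances.

    With diagonal covariances the general formula
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    reduces to a sum of squared differences of means and of standard deviations.
    """
    mean_term = sum((mr - mg) ** 2 for mr, mg in zip(mean_real, mean_gen))
    cov_term = sum(
        (math.sqrt(vr) - math.sqrt(vg)) ** 2 for vr, vg in zip(var_real, var_gen)
    )
    return mean_term + cov_term
```

A distance of zero from `frechet_distance_diagonal` means the two sets of feature statistics coincide, which matches the reading above that lower FID indicates generated images closer to real ones.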

7 GRAND CHALLENGES

Although previous research on large model-based data augmentation has achieved numerous notable successes, this field remains in its nascent stages, with several critical challenges yet to be addressed. This section underscores these challenges and explores potential future research directions.

7.1 Theoretical Understanding

The field of data augmentation currently lacks substantial theoretical research, often being perceived merely as a supplementary tool for enhancing model performance. Specific data augmentation approaches may increase accuracy, but these improvements generally hinge on the assumption that augmented data are label-preserving and do not alter the data distribution. However, these assumptions frequently do not hold in practical scenarios, potentially leading to noisy labels, shifts in data distribution, and subsequently, diminished performance or generalization. Moreover, large models are typically treated as black boxes. Gaining a deeper understanding of the characteristics that empower these models is crucial, especially in determining their reliability for processing sensitive data. A comprehensive and rigorous interpretation of these models is essential, not only to elucidate why certain augmentation techniques effectively improve model performance, but also to guide the selection or design of the most appropriate and effective methods for dataset expansion. Consequently, a critical future direction lies in developing theoretical support for data augmentation. This would involve establishing frameworks and principles to underpin the practical application of augmentation techniques, ensuring their effectiveness and suitability for diverse datasets and modeling challenges.

7.2 The Number of Augmented Data

An intriguing aspect of data augmentation is that the enhancement in training data quantity does not invariably correlate with a linear improvement in performance. Firstly, beyond a certain data threshold, further augmentation might actually impair performance. This phenomenon could be attributed to the fact that, although the quantity of data increases, its diversity may not. Secondly, there is a lack of theoretical guidance regarding the optimal size of training datasets. The decision on dataset size, suitable for specific tasks and models, is often based on empirical judgment and extensive experimentation. Researchers typically tailor dataset sizes to align with the specific models, training goals, and challenges in data collection. Thirdly, class imbalance can significantly distort data distribution, with the learning process frequently biased towards the majority class, leading to inadequate modeling of minority classes. Therefore, oversampling minority classes becomes crucial in data augmentation. However, oversampling essentially entails repeated sampling from the existing distribution, which might result in overfitting. Consequently, determining the appropriate amount of data generation for different classes is crucial to enhance model performance without compromising data diversity. This necessitates a strategic balance to ensure that the augmented data contributes effectively to the model’s learning, without losing the variety essential for robust generalization.
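One simple way to picture the balance described above is a per-class generation budget: top each minority class up towards the majority count, but cap the synthetic share so that no class is dominated by generated samples. The sketch below is a heuristic illustration only. The function `generation_budget`, the `max_synthetic_ratio` parameter, and the example class names are hypothetical and are not drawn from any surveyed method, and a sensible cap would have to be chosen empirically.

```python
def generation_budget(class_counts, max_synthetic_ratio=1.0):
    """Number of synthetic samples to generate per class.

    Each class is topped up towards the size of the largest class, but never
    receives more than max_synthetic_ratio synthetic samples per real sample.
    """
    if not class_counts:
        return {}
    target = max(class_counts.values())
    budget = {}
    for label, real_count in class_counts.items():
        shortfall = target - real_count                # samples needed to match the majority
        cap = int(max_synthetic_ratio * real_count)    # limit on the synthetic share
        budget[label] = min(shortfall, cap)
    return budget


if __name__ == "__main__":
    counts = {"normal": 1000, "wheeze": 300, "crackle": 50}
    print(generation_budget(counts, max_synthetic_ratio=2.0))
```

With a cap of two synthetic samples per real sample, the rarest class in this toy example is only partially rebalanced, which reflects the trade-off in the text: fully equalizing class sizes would require generating many samples from very few real ones, at the risk of low diversity and overfitting.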

7.3 Multimodal Data Augmentation

While several studies have explored paired data augmentation (Bakhtiarnia et al., 2023; Hao et al., 2023; Wu et al., 2023c), developing effective large model-based methods for multimodal data generation remains a challenge. Most existing works concentrate on augmenting a single modality, yet there lies significant potential in simultaneous multimodal data augmentation for various tasks, such as image captioning and speech recognition. Additionally, while paired data augmentation is predominantly inspired by large models and has the capability to enrich data patterns, introduce more diversity, and ensure fidelity in the generated data, the exploration of multimodal data augmentation techniques represents a significant and promising challenge for future research in data augmentation.

7.4 Language and Vision Foundation Models

The rise of artificial intelligence-generated content (AIGC), from Stable Diffusion to ChatGPT, has captured significant attention in both academic and industrial circles. The GPT family, particularly GPT-4 (OpenAI, 2023a), has demonstrated remarkable content generation capabilities and unexpected emergent abilities. However, to date, there is no equivalent ‘vision foundation model’ in computer vision that demonstrates comparable generalization across thousands of tasks. An intriguing approach is to use text generated by these models as prompts to create augmented images, leveraging the full diversity of text prompts to enhance the quality of the generated images. Furthermore, utilizing the knowledge emergence ability of these models as a bridge to develop similar emergent capabilities in vision foundation models presents an exciting challenge for future research.

7.5 Automatic Data Augmentation

Despite their effectiveness, current large model-based data augmentation approaches predominantly rely on manual design. The development of methods for automatically selecting suitable types of large model-based data augmentation remains relatively unexplored. While some approaches have proven effective in specific tasks or scenarios, their generalizability across different tasks is often limited. The exploration of techniques to automatically learn data augmentation strategies, or to search for an optimal augmentation policy tailored to specific tasks, could significantly improve the generalizability of augmented data.

7.6 Robust and Consistent Data Augmentation

Despite the promising outcomes of current large model-based data augmentation methods in practical applications, they are constrained by the potential lack of robustness and consistency in the generated data. For instance, certain data augmentation methods might alter the breed of a cat in an image, leading to erroneous classifications by the classifier, as demonstrated by Trabucco et al. (2023). In the realm of natural language processing, particularly for tasks like augmenting medical texts, LLM-based data augmentation can produce irrelevant sentences owing to ChatGPT’s limited domain-specific knowledge. Consequently, it is crucial to tailor general-domain large models with domain-specific data when addressing particular tasks, ensuring the augmented data’s relevance and accuracy.

7.7 Trustworthy Data Augmentation

In the process of data augmentation, ensuring the trustworthiness of the augmented data is paramount, especially when it is used to train large models. The presence of bias and toxicity in the training data can lead large models to generate content that significantly deviates from human preferences and standards. Consequently, there is a pressing need not only for generating trustworthy data but also for implementing reliable data augmentation approaches. This is particularly relevant for NLP applications, where a major concern is how to rephrase sentences to convey high-level information without incorporating offensive content. Currently, there is a notable gap in research addressing this issue. Future work should focus on developing both trustworthy data augmentation techniques and robust evaluation frameworks for augmented data, ensuring that they adhere to ethical standards and reflect the desired level of quality and reliability.

7.8 The Evaluation of Augmented Data

The quality of data generated by augmentation approaches is critically important. However, there are currently no standardized evaluation metrics specifically for augmented data, making its quality assessment a major challenge. Presently, the quality of augmented data is typically assessed based on task-specific performance, such as the accuracy of text classification or the IoU score of semantic segmentation. Yet, these do not provide direct metrics for the augmented data itself. Ideally, evaluation metrics should measure both the diversity of individual data points and the overall consistency of the dataset, independent of the specific task at hand. Moreover, it appears impractical to expect one or a few general datasets to capture the nuances of all data augmentation methods, particularly those tailored to specific tasks. Nonetheless, a small benchmark capable of evaluating various data augmentation approaches would be highly beneficial. Such a benchmark should assess different aspects of data augmentation methods, including diversity and faithfulness. With the increasing reliance on large models, it may also be prudent to use data that are not part of the training sets for these models, since evaluating on test data that these models have already seen might lead to inaccurate conclusions. Consequently, the development of such metrics and datasets is vital for the progression of data augmentation techniques, providing a clearer understanding of their effectiveness and applicability across different contexts.
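As one hedged illustration of what a task-independent diversity measure for augmented text could look like, the sketch below computes a distinct-n ratio: the fraction of n-grams in a set of generated texts that are unique. Distinct-n is a proxy borrowed from the text generation literature and is swapped in here purely as an example. It is not a metric proposed in this survey, and it says nothing about faithfulness, which would need a separate measure. Whitespace tokenization and the function name are assumptions.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique (higher means more diverse)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of length n over the tokens of this text only,
        # so that n-grams never span two different samples.
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

A set of augmented samples that merely repeats the same phrasing scores low on this ratio even when it is large, which is the kind of property a downstream accuracy number alone does not expose.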

7.9 Beyond Augmentation: Training Large Models Using Augmented Data

Data augmentation, while pivotal, serves merely as a starting point rather than the ultimate goal in the realm of machine learning. Training data, algorithmic innovation, and computational power are the triad underpinning the performance of large models. With these models rapidly advancing in capability, the scarcity of high-quality data is emerging as a primary bottleneck in scaling large models. This scenario underscores the importance of leveraging data generated by large models for training purposes. To optimize the use of such data, it is imperative to develop effective metrics that assess the diversity and faithfulness of the augmented data, thereby preventing model overfitting. A comprehensive data augmentation system should encompass not just metrics evaluating specific attributes of augmented data, such as diversity, but also a robust theoretical framework that elucidates the usefulness of this data. In conclusion, data augmentation holds the potential to address the challenge of data scarcity in training large models. There is significant scope for future advancements in this area, with the aim of enhancing the efficacy and understanding of data augmentation techniques.

8 CONCLUSION

Data augmentation holds profound significance, emerging as a crucial component in the advancement of artificial intelligence models, particularly in the context of large models. This survey offers an exhaustive examination of data augmentation methods driven by large models. We dissect and review these studies across three dimensions: approach, data post-processing, and application. For each dimension, we construct a detailed taxonomy to interlink existing research, summarizing key techniques and clarifying their strengths and limitations. Beyond reviewing past work, this survey also identifies several challenges within the field, poised to steer prospective future research directions.
