Digital Object Identifier 10.1109/ACCESS.2024.3358206

Socially Aware Synthetic Data
Generation for Suicidal Ideation
Detection Using Large Language Models
HAMIDEH GHANADIAN1 , ISAR NEJADGHOLI2 , HUSSEIN AL OSMAN3

1

arXiv:2402.01712v1 [cs.CL] 25 Jan 2024

2
3

University of Ottawa, Ottawa, Canada (e-mail: Hghan053@uottawa.ca)
National Research Council Canada, Ottawa, Canada (e-mail: Isar.nejadgholi@nrc-cnrc.gc.ca)
University of Ottawa, Ottawa, Canada (e-mail: Hussein.alosman@uottawa.ca)

Corresponding author: Hamideh Ghanadian (e-mail: Hghan053@ uOttawa.ca)

ABSTRACT Suicidal ideation detection is a vital research area that holds great potential for improving
mental health support systems. However, the sensitivity surrounding suicide-related data poses challenges
in accessing large-scale, annotated datasets necessary for training effective machine learning models. To
address this limitation, we introduce an innovative strategy that leverages the capabilities of generative AI
models, such as ChatGPT, Flan-T5, and Llama, to create synthetic data for suicidal ideation detection. Our
data generation approach is grounded in social factors extracted from psychology literature and aims to
ensure coverage of essential information related to suicidal ideation. In our study, we benchmarked against
state-of-the-art NLP classification models, specifically, those centered around the BERT family structures.
When trained on the real-world dataset, UMD, these conventional models tend to yield F1-scores ranging
from 0.75 to 0.87. Our synthetic data-driven method, informed by social factors, offers consistent F1scores of 0.82 for both models, suggesting that the richness of topics in synthetic data can bridge the
performance gap across different model complexities. Most impressively, when we combined a mere 30%
of the UMD dataset with our synthetic data, we witnessed a substantial increase in performance, achieving
an F1-score of 0.88 on the UMD test set. Such results underscore the cost-effectiveness and potential of our
approach in confronting major challenges in the field, such as data scarcity and the quest for diversity in
data representation.
INDEX TERMS Artificial Intelligence, Deep Learning, Large Language Models, Suicide Detection,
Synthetic Data Generation, Transformer Based Models

I. INTRODUCTION

CCORDING to the World Health Organization1 more
than 700,000 people die due to suicide every year.
Suicide remains a global health crisis, accounting for a significant proportion of mortality rates across various age groups.
Suicidal ideation, often a precursor to actual suicide attempts,
involves the presence of persistent thoughts, contemplation,
or planning related to self-harm or death. Early identification
of suicide ideation and intervention to protect individuals at
risk of suicide are crucial steps in reducing suicide rates and
providing appropriate mental health support. Early detection
of suicidal ideation is a complex task, as it requires the integration of various factors, including psychological, social,
and environmental variables [1].

A

1 The World Health Organization (WHO)

VOLUME 4, 2016

In recent years, the proliferation of digital platforms and
social media has provided an unprecedented opportunity to
capture and analyze large-scale data related to mental health
[2] [3]. Machine learning and Natural Language Processing
(NLP) techniques have shown promise in detecting linguistic
patterns and indicators of suicidal ideation in diverse textbased data sources, such as social media posts, online forums,
and electronic health records [4], [5], [6], [7].
However, the use of machine learning technologies requires high volumes of data. Data collection and annotation
processes are time-consuming and impose significant financial costs [8]. Specifically, obtaining a substantial amount
of labeled data related to suicide can be challenging and
limited due to several factors inherent to the nature of suicide
research. The sensitive and stigmatized nature of suicide
often presents barriers to data collection. Individuals and
1

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

organizations may be rightfully reluctant to share personal or
confidential information related to suicide, fearing potential
negative consequences.
Synthetic data generation offers a viable solution to mitigate the data availability limitation by creating artificially
generated data that closely resembles real-world data. Synthetic data generation can be instrumental in machine learning applications as it addresses many challenges of real data
collection and annotation. Here we review a list of common
challenges in data collection that can be managed through
synthetic data generation.
Data Scarcity: In many NLP tasks, such as mental healthrelated applications, there may be limited availability of
relevant data due to privacy concerns or the complexity and
cost associated with manual annotation. Synthetic data generation allows researchers and practitioners to overcome data
scarcity issues and augment the limited amount of publically
available data [9].
Data Diversity: NLP models trained on limited data may
suffer from poor generalization and performance when exposed to diverse and previously unseen examples. Moreover,
in real data, certain topics can be undermined or overlooked
due to being less discussed. This can happen for several
reasons. For example, certain topics may be stigmatized and
considered too sensitive or taboo, making people hesitant
to openly discuss them. This could include subjects related
to mental health, addiction, discrimination, or social issues
that carry societal stigmas. Additionally, topics relevant to
marginalized or minority communities may receive less discussion due to systemic biases, unequal representation, or
limited platforms for their voices to be heard. Also, some topics may be highly specialized or complex, requiring specific
expertise or background knowledge to engage in meaningful
discussions. Encouraging diverse perspectives and actively
seeking out less-discussed topics can contribute to a more
comprehensive and nuanced understanding of real-world issues. Synthetic data generation can help enrich the training
data by introducing a wider variety of linguistic patterns,
sentence structures, vocabulary and topics. This, in turn,
improves the model’s ability to handle variations in natural
language and increases its robustness [10].
Privacy Preservation: Suicide detection tasks often involve
sensitive information. Generating synthetic data allows researchers to create representative samples that preserve the
privacy of individuals while maintaining the statistical properties and distribution of the original data. [11]
Annotation Cost: Suicide detection is a complex task, and
high-quality annotation can only be performed by experts and
trained annotators, which can be costly [12]. Synthetic data
generation addresses the data annotation issue by targeted
data generation so that each generated example is pre-labeld
with a specific category. Although these labels might be noisy
to some extent, they might be preferable in some settings as
they come at no additional cost.
2

To investigate the feasibility and effectiveness of synthetic
data generation in the task of suicide ideation generation,
we use Generative Large Language Models (GLLMs) for
data synthesis and use the generated data to train/test text
classifiers. To train classifiers, we fine-tune pretrained BERTlike Large Language Models (LLMs) as state-of-the-art text
classifiers.
To enhance the quality of the generated data, we benefit
from domain knowledge from psychology. Previous research
highlights the importance of incorporating social factors in
the design process of NLP systems [13]. Specifically, when
generating data with LLMs, external sources of domain
knowledge can be leveraged to guide the data generation
process [14]. For the task of suicidal ideation detection,
such knowledge can be drawn from a vast body of research
in psychology devoted to gaining an understanding of the
social factors associated with suicidal ideation and behavior.
In this work, we review the psychology literature to extract
the social factors tightly tied to suicidal ideation and use
this knowledge for more effective prompt engineering when
generating data with GLLMs. Guiding the data generation
with these factors enables the creation of diverse and representative examples of suicidal ideation.
The main contributions of this study are as follows:
We extracted the relevant social factors associated with suicidal ideation through a comprehensive review of existing
literature, research papers and clinical studies to identify
key themes related to suicidal ideation. These themes
encompass a wide range of factors, including risk factors,
common triggers and mental health indicators. Leveraging
a socially aware data synthesis approach, we pave the way
for more accurate and reliable suicidal ideation detection
systems.
• Our study examines three GLLMs’ performance in producing synthetic datasets with Zero-Shot and Few-Shot
learning techniques. Utilizing the ChatGPT, Flan-T5, and
Llama 2 models and leveraging the extracted social factors
from the psychology literature, we generated nine datasets
with diverse characteristics.
• We trained classifiers by fine-tuning two pre-trained language models, ALBERT and DistilBERT, using the generated datasets. We tested these models on two test sets, the
University of Maryland Suicidality dataset (UMD) and a
human-annotated synthetic dataset presented in this paper.
Our findings indicate that the GLLMs have significant
potential for generating a suicide-related dataset comparable with real available datasets such as UMD. More
significantly, the integration of social knowledge may significantly enhance the quality of the generated datasets and
lead to more robust classifiers.
• We augmented our best-performing synthetic dataset using
subsets of the UMD dataset to evaluate the efficacy of
data augmentation in suicidal detection applications. Our
results show that models trained with synthetic data augmented with a small set of real-world data can outperform
•

VOLUME 4, 2016

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

models trained by large annotated real-world datasets.
This paper is organized as follows: In Section II, we review
the literature and related background. In Section III, we
explain our methodology for generating and evaluating the
proposed synthetic data generation, specifics of the classifiers and datasets we use in this work. Section IV presents
our results, and Section V discusses these results in detail.
Additionally, the conclusion and possible future works are
discussed in Section VI. We complete the article by including
an ethical statement in Section VII, which delves into the
ethical aspects and considerations associated with our work.
II. BACKGROUND AND RELATED WORK

In this section, we review the related work in suicidal ideation
detection in psychology as well as NLP research that addresses the task of suicide detection. We also review the
previous works that focused on generating synthetic data for
a variety of NLP tasks.
A. SUICIDAL IDEATION AND RELATED SOCIAL
FACTORS

Suicidal ideation has been a subject of extensive research
within the field of psychology. Understanding the underlying
elements and risk factors related to suicidal thoughts and
behaviors is crucial for developing effective prevention and
intervention strategies.
One important area of investigation is the identification
of risk factors associated with suicidal ideation. Numerous
studies have examined the impact of psychological factors
such as depression, anxiety, hopelessness, and feelings of
worthlessness on the development of suicidal thoughts [15],
[16]. These Studies investigate the strong association between suicidal thoughts and conditions like depression [17],
[18], bipolar disorder [19], borderline personality disorder
[20], and substance abuse [21]. By examining the interplay
between these conditions, researchers aim to develop targeted
interventions to address the unique challenges faced by individuals struggling with suicidal ideation [22], [23]. Additionally, environmental factors such as a history of trauma, social
isolation, and access to lethal means have been identified as
potential risk factors [24]–[26].
Psychology offers valuable insights into the diverse processes and factors that contribute to suicide risk. Psychological theories and frameworks such as the interpersonal theory
of suicide [27], the cognitive model of suicidal behavior [28],
and the social-ecological model [29] provide a theoretical
foundation for understanding the complex interplay between
individual vulnerabilities and environmental factors.
The extensive research conducted on suicidal ideation and
associated topics in psychology has significantly contributed
to the understanding of the complex factors involved. By
unraveling the causes, risk factors, and protective factors
associated with suicidal thoughts, researchers aim to develop effective prevention strategies, enhance mental health
interventions, and ultimately reduce the global burden of
suicide. In Section III-A, we enumerate the social factors that
VOLUME 4, 2016

are discussed in the literature as relevant topics to suicidal
ideation.
B. SUICIDAL IDEATION DETECTION USING NLP

In recent years, there has been a growing interest in using NLP techniques for suicide prevention [30], [31]. Researchers have developed suicide detection systems to analyze and interpret social media data, including text data. By
detecting linguistic markers of distress and other risk factors,
these systems can help identify individuals with a risk of
suicidality and provide early interventions to prevent such
incidents [32].
Several studies indicated the impact of social network
reciprocal connectivity on users’ suicidal ideation. Hsiung
et al. [33] analyzed the changes in user behavior following
a suicide case that occurred within a social media group.
Jashinsky et al. [34] highlighted the geographic correlation
between suicide mortality rates and the occurrence of risk
factors in tweets. Colombo et al. [35] focused on analyzing
tweets that contained suicidal ideation, with a particular emphasis on the users’ behavior within social network interactions that resulted in strong and reciprocal connectivity, leading to strengthened bonds between users. NLP techniques,
therefore, offer a promising avenue for suicide prevention
efforts, enabling more proactive and effective interventions
to support those in need.
Generative Language models: Ghanadian et al. [36] utilized ChatGPT for assessing suicidality from social media
posts. They performed Zero-Shot and Few-Shot experiments
and extensive performance comparison between ChatGPT
and two fine-tuned transformer-based models. They also
investigated the impact of different temperature parameters
on ChatGPT’s response generation. The findings of this
paper suggest that ChatGPT achieves notable accuracy in
the suicidal risk assessment task; however, transformer-based
pre-trained models fine-tuned on human-annotated datasets
exhibit superior performance. Furthermore, the analysis provides insights into adjusting ChatGPT’s hyperparameters to
enhance its effectiveness in assisting mental health professionals with this critical task.
Yang et al. [37] conducted a comprehensive evaluation of
ChatGPT’s mental health analysis and emotional reasoning
ability across five tasks. They also investigated the impact of
different emotion-based prompting strategies. Additionally,
they explored the use of generative models to generate explanations for the decisions made by ChatGPT, aiming for
interpretable mental health analysis. The experimental results revealed that ChatGPT performed better than traditional
neural network-based methods such as Convolutional Neural
Network (CNN) and Gated Recurrent Unit (GRU) in mental
health analysis but still lagged behind advanced task-specific
methods.
Available Dataset: Several datasets have been collected from
social media platforms to serve as a resource for creating
suicidal ideation detection systems. These datasets encom3

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

pass a wide range of information collected from various
social media sources, including Twitter, Reddit and other
user-generated content. Sinha et al. [38] created a manually
annotated dataset from Twitter using a lexicon of suicidal
phrases and a lexicon along with the social engagement
data associated with real-time and historical tweets. The
resulting dataset consists of 34,306 tweets with two labels,
Suicidal and Non-Suicidal. Gaur et al. [39] collected and
annotated a 5-label Suicide Risk Severity Assessment dataset
from Reddit, which includes Suicidal Ideation (ID), Suicidal
Behavior (BR), Actual Attempt (AT), Suicide Indicator (IN)
and Supportive (SU) categories. This dataset is extracted
from SucideWatch2 subreddit and has been annotated by four
practicing clinical psychiatrists, ensuring the accuracy and
reliability of the annotations. The dataset comprises a total
of 500 posts, which have been carefully selected to represent
a diverse range of content related to suicidal ideation.
Another widely referenced dataset in the field of suicidal
ideation detection is the University of Maryland Reddit Suicidality Dataset(UMD) [40], [41]. The UMD dataset is a collection of Reddit posts and comments created by individuals
who expressed suicidal thoughts or behaviors. The dataset
contains over 100,000 posts and comments collected from
various subreddits, including those related to mental health
and suicide prevention, such as Depression3 and SucideWatch
subreddits. The data was collected over a period of several
years and includes the content of the posts and comments, as
well as the location and timing of the posts.
C. SYNTHETIC DATA COLLECTION

To overcome the limitations of real-world data availability,
NLP researchers have explored the use of synthetic datasets
for several applications. For example, He et al. [42] utilized
language models to generate synthetic unlabeled text. They
introduced the Generate, Annotate, and Learn (GAL) framework that leverages synthetic text for knowledge distillation,
self-training, and few-shot learning purposes. To generate the
data, they fine-tune pre-trained language models on relevant
datasets with limited examples. The synthetic text is then
annotated with soft pseudo labels using the best available
classifier for knowledge distillation and self-training. This
paper achieves state-of-the-art results for knowledge distillation with 6-layer transformers on the GLUE leaderboard
[43].
Bonifacia et al. [44] presents an effective approach to
leverage LLMs in retrieval tasks, resulting in significant
improvements across various datasets. Instead of directly
utilizing LLMs during the retrieval process, they harness the
LLMs’ capabilities to generate labeled data using a fewshot learning approach. Subsequently, they fine-tune smaller
retrieval models on this synthetic dataset and employ them
to re-rank the search results obtained from a primary retrieval system. They provide a novel method to adapt LLMs
2 SuicideWatch subreddit
3 Depression subreddit

4

FIGURE 1: Workflow of the proposed methodology
for Information Retrieval (IR) tasks that were previously
deemed infeasible due to their demanding computational
requirements. By shifting the computational burden from
the retrieval stage to the generation of synthetic data for
training, they make it feasible to exploit the power of LLMs
without compromising efficiency. In an unsupervised setting,
their approach significantly outperforms recently proposed
methods, highlighting its superiority in terms of retrieval
performance and scalability.
III. METHODOLOGY

In this section, we elaborate on our proposed methodology.
Figure 1 shows the workflow we use to generate synthetic
datasets and our testing process. As shown in this figure, our
method has three steps:
• STEP 1- Domain knowledge Extraction: Extract relevant social factors from the psychology literature for an
informed prompting of GLLMs in data synthesis.
• STEP 2- Synthetic Data Generation: Use three
GLLMs to generate socially aware synthetic data, that
is, data that covers a wide range of suicide-related
topics.
• STEP 3- Evaluate the effectiveness of Synthetic data:
Train state-of-the-art classifiers with real-world, synthetic, and augmented datasets and test those classifiers
on real-world as well as synthetic test sets.
In the following, we explain the three steps described
above. The complete implementation of our project, including Zero-Shot Learning and Few-Shot Learning of GLLMs,
as well as the fine-tuned classifiers, is available on GitHub 4 .
A. SUICIDE RELATED TOPICS IN PSYCHOLOGY

We conducted a comprehensive search across various academic databases, including PsycINFO, PubMed, and Google
Scholar, with keywords and combinations such as “suicide”,
“suicidal ideation” and “psychology” to identify relevant
articles, research papers, and review papers. Thematic analysis was employed to identify the most recurring and significant topics across the included studies. Repeated topics
4 https://github.com/Hamideh-ghanadian/Synthetic_Data_Generation_
using_Generative_LLMs
VOLUME 4, 2016

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

that demonstrate relevance to suicide in psychology were
considered the most related topics.
Based on our analysis of the literature, the following
social and psychological factors were consistently reported
in relation to suicidal ideation in psychology. These topics
are not listed in a specific order of importance but represent
the consistently reported themes in the literature reviewed:
Depression: Depression emerged as a frequently reported
topic, highlighting its strong association with suicidal
ideation. Numerous studies have explored the relationship
between depressive symptoms, including sadness, loss of
interest, feelings of worthlessness, and the increased risk of
suicidal thoughts. [45]–[47]
Anxiety: Anxiety disorders were also commonly associated
with suicidal ideation. Research has emphasized the link
between excessive worry, fear, and agitation and the presence
of suicidal thoughts and behaviors [48].
Unemployment: The experience of unemployment has been
consistently identified as a topic closely related to suicidal ideation. Studies have examined the psychological distress and negative impact on self-esteem and social support
that can arise from unemployment, contributing to suicidal
ideation [49].
Hopelessness: Hopelessness, characterized by a lack of optimism and a perceived absence of future prospects, has been
consistently linked to suicidal ideation. Studies have demonstrated the significant role of hopelessness as a predictor of
suicidal thoughts [45], [46], [49].
Anger: The expression and experience of anger have been
reported as influential factors in suicidal ideation. Unresolved
anger, hostility, and intense emotional distress have been
associated with an increased risk of suicidal thoughts [45].
Perfectionism: Perfectionism, marked by excessively high
standards and self-criticism, has been identified as a psychological factor related to suicidal ideation. Research has
explored the relationship between perfectionistic tendencies
and the development of suicidal thoughts and behaviors [45].
Family Issues: Family-related issues, such as conflict, dysfunctional dynamics, and poor communication, have consistently emerged as topics associated with suicidal ideation.
These factors can contribute to a sense of isolation, distress,
and feelings of being a burden, increasing the risk of suicidal
thoughts [47], [49].
Relationship Problems: Difficulties in intimate relationships, including conflicts, breakups, and marital dissatisfaction, have been reported as significant topics in relation to
suicidal ideation. Relationship problems can contribute to
emotional distress and feelings of hopelessness, leading to
thoughts of suicide [49].
Financial Crisis: Financial difficulties and crises have been
consistently linked to suicidal ideation. Economic stressors,
such as debt, unemployment, and financial insecurity, can
VOLUME 4, 2016

contribute to psychological distress and an increased risk of
suicidal thoughts [48].
Education: Issues related to educational pressures, academic
stress, and performance expectations have been reported as
topics associated with suicidal ideation. Research has highlighted the impact of academic-related stressors on mental
well-being and the risk of suicidal thoughts among students
[50].
Bullying: Bullying, including physical, verbal, or cyberbullying, has consistently emerged as a significant topic related to suicidal ideation. The experience of bullying can lead
to social isolation, low self-esteem, and emotional distress,
contributing to the development of suicidal thoughts [48].
Death of Loved Ones: The loss of close family members or
friends through death has been reported as a topic associated
with suicidal ideation. Grief, feelings of loneliness, and a
sense of being unable to cope with the loss can increase the
risk of suicidal thoughts [51].
Immigration: Issues related to immigration, discrimination,
and racism have been identified as topics linked to suicidal
ideation. Experiences of marginalization, social exclusion,
and acculturative stress can contribute to psychological distress and suicidal thoughts among individuals facing these
challenges [52], [53].
Racism: Studies have consistently highlighted the significant
impact of racial discrimination on suicidal ideation. Experiencing racism and racial prejudice can increase the risk of
suicidal thoughts. [54]
B. SYNTHETIC DATA GENERATION

We utilized three generative language models to generate
a synthetic dataset related to suicidal ideation. GLLM’s
foundation is constructed with transformers. Transformers
are a class of deep learning models, first introduced by
Vaswani et al. [55] in 2017. Researchers build state-ofthe-art NLP models using transformer-based architectures
because they can be quickly trained on large datasets, and
studies have shown that they are better at modeling longterm dependencies in natural language text [56]. GLLMs,
including ChatGPT, FlanT5, and Llama are designed with
the primary purpose of generating coherent and contextually
relevant text. They excel at tasks such as text generation [57],
completion [58], and dialogue generation [59]. These models
are typically based on decoder transformer architectures and
focus on the generative aspect of language which involves the
auto-regressive generation, where the models predict the next
word based on the preceding context. Generative models are
trained on a vast corpus of text, however, their main strength
lies in their ability to generate text that flows naturally and
contextually appropriate.
We aim to build a diverse dataset in order to train a
generalizable and robust model in suicidal ideation detection.
In total, nine different datasets are generated with different
specifications and models.
5

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

1) ChatGPT

The language model utilized by ChatGPT is gpt-3.5-turbo5 ,
which is one of the most advanced language models developed by OpenAI. ChatGPT accept a sequence of messages as an input and produce a message generated by the
model as an output. Although the chat format is primarily
intended for conversations spanning multiple turns, it is also
equally useful for single-turn tasks that do not involve any
conversations. We used the OpenAI Python library6 to access
the ChatCompletion functionality of the gpt-3.5-turbo model
through its API.
In this project, we evaluate the capability of ChatGPT
in Zero-Shot Learning and Few-Shot Learning settings to
generate a diverse suicidality dataset. However, we are primarily focused on Zero-Shot Learning methods as ChatGPT
has exhibited superior performance in this setting compared
to Few-Shot Learning for a suicidal ideation detection task.
Ghanadian et al. [36] conducted an extensive comparison
of the Zero-Shot and Few-Shot approaches using ChatGPT.
According to their findings, fine-tuning a model on Few-Shot
setting might yield poorer performance compared to ZeroShot in various scenarios. In Few-Shot, with few examples
available for fine-tuning, the risk of overfitting increases.
The model might learn specific nuances or noise within the
limited few-shot data, leading to poor generalization on unseen examples. Moreover, Few-shot learning relies on a small
subset of labeled examples, which might not adequately
represent the entire diversity of the dataset. The model might
fail to capture the complexity and variability present in the
broader dataset during fine-tuning.
The temperature hyperparameter in ChatGPT is a crucial
parameter that influences the generated output. A higher
temperature value, such as 1.0, increases the randomness and
produces more varied responses. Conversely, a lower temperature value, such as 0.1, reduces randomness and generates
more focused and deterministic responses. Ghanadian et al.
[36] investigated the effect of the temperature parameter on
the generated output of ChatGPT for suicide risk assessment.
Furthermore, the authors introduced a parameter known as
the "Inconclusiveness Rate," which indicates the proportion
of test cases that do not produce a definitive or conclusive
result. According to their paper, this parameter decreases
as the temperature parameter is increased. As such, for this
project, we have configured the temperature parameter of
ChatGPT to be 1.
We generated five datasets using ChatGPT. Four of these
datasets are informed by 14 main suicide-related topics in
psychology, while one dataset is generated without providing
any specific topics. For incorporating suicide-related topics
in data generation, we utilized prompt engineering techniques. Prompt engineering involves carefully crafting and
designing the prompts provided to the model to elicit desired
responses. By employing prompt engineering strategies, we

aimed to enhance the quality and relevance of the generated
dataset. This methodology allowed us to tailor the dataset
generation process to align with our specific objectives and
requirements. For prompt engineering, we drew inspiration
from a short course on ChatGPT Prompt Engineering7 .
In both Zero-Shot and Few-Shot Learning, we initiated the
prompt construction process with a simple initial prompt and
iteratively refined it through multiple rounds of trial and error.
This iterative approach allowed us to gradually evolve the
prompt, making necessary adjustments based on the observed
outcomes. The following example shows a prompt we used
to generate a binary dataset (Suicide/non-suicide), guided by
the topics introduced in Section III-A, in Zero-shot Learning
setting.
C r i t e r i a = " " " These a r e t h e c r i t e r i a of
different suicide risk level :
R i s k L e v e l =Non S u i c i d a l : I do n o t s e e
evidence t h a t t h i s person i s at r i s k
for suicide
Risk Level= S u i c i d a l : I b e l i e v e t h i s
person i s a t high r i s k of attempting
suicide in the near future . """
prompt = f " " "
Your t a s k i s t o g e n e r a t e a s u i c i d a l t e x t f o r
each of the f o l l o w i n g " t o p i c s " with
d i f f e r e n t Risk l e v e l s .
1− D e p r e s s i o n
2− A n x i e t y
3− H o p e l e s s n e s s
4− Anger
5− P e r f e c t i o n i s m
6− F a m i l y i s s u e s
7− R e l a t i o n s h i p p r o b l e m s
8− Unemployment
9− F i n a n c i a l C r i s i s
10− E d u c a t i o n
11− B e i n g B u l l i e d
12− D e a t h o f c l o s e d one
13− I m m i g r a t i o n
14− Racism
P r o v i d e t h e a n s w e r s i n JSON f o r m a t w i t h t h e
f o l l o w i n g columns : t e x t , t o p i c , r i s k
level .
Risk l e v e l c r i t e r i a : ‘ ‘ ‘{ C r i t e r i a } ‘ ‘ ‘
"""

In Few-Shot Learning, the prompt is structured to include
two examples for each category (8 in total) from the training
set of UMD Dataset, followed by a text generation question.
This approach enables the model to learn from a limited set
of labeled examples before generating a dataset. Moreover,
by combining the Few-Shot Learning methodology with
the inclusion of psychology topics in the prompt, we aim
to enhance the model’s ability to generate meaningful and

5 https://platform.openai.com/docs/models/gpt-3-5
6 https://github.com/openai/openai-python

6

7 ChatGPT Prompt Engineering for Developers
VOLUME 4, 2016

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

contextually relevant responses when dealing with suiciderelated discussions.
2) Flan-T5

FLAN-T5 models are instruction fine-tuned across a diverse
set of tasks, aiming to enhance their zero-shot performance
on various tasks. During instruction tuning, pretrained models undergo fine-tuning using drafts of instructions that guide
them on how to perform a specific task. These instructions
can include real-time feedback to assist the model in learning
from its mistakes and improving at a faster rate. By providing
explicit guidance and incorporating feedback mechanisms,
the instruction-tuning process enables the model to refine its
performance and enhance its ability to accurately execute the
given task. This iterative approach of incorporating instructions and feedback facilitates the model’s learning process,
allowing it to adapt and improve its performance based on
the provided guidance.
In this project, we utilized Flan-T5-XXL8 presented by
Google Research [60] in a Zero-Shot setting. Two datasets
are generated using Flan-T5, one with topics and another
without topics. Moreover, similar to ChatGPT, the temperature value is set to 1, and the same prompt structure is
utilized.
3) Llama 2

LLaMA (Large Language Model Meta AI) is an autoregressive language model constructed based on transformer
architecture. Similar to other generative models, LLaMA operates by taking a sequence of words as its input and making
predictions about the subsequent word, iteratively producing
text in a recursive manner. It is a collection of state-ofthe-art foundational language models, with parameter counts
ranging from 7 billion to 65 billion. The foundation models
were trained on large unlabeled datasets, making them ideal
for fine-tuning on a variety of tasks. The newest version
of this model, Llama 2, expanded its pre-training corpus
size, allowing the model to learn from a more extensive
and diverse set of publicly available data. Additionally, the
context length of Llama 2 has been doubled, enabling the
model to consider a more extensive context when generating
responses, leading to improved output quality and accuracy
[61]. In this paper, we used Llama 2-13B, presented by Meta
in the Zero-Shot setting. In total, we generated two datasets
with Llama2, one with topics and another without topics.
These datasets were created using the temperature of 1 and
maintained the same prompt structure as ChatGPT.
C. EVALUATION OF SYNTHETIC DATASET

To evaluate the utility and effectiveness of synthetic datasets,
we fine-tuned pre-trained transformer-based language models, ALBERT and DistilBERT, to train classifiers with each
set of the generated synthetic data. We compared the trained

classifiers with classifiers with similar structures fine-tuned
with real-world data as the benchmark model.
ALBERT and DistilBERT are two pre-trained language
models from the BERT family of LMs. The BERT model
was initially proposed by Delvin et al. [62] as a bidirectional
language model pretrained on a large corpus comprising the
Toronto Book Corpus and Wikipedia. The model is named
bidirectional because it can simultaneously gather the context
of a word from either direction. Unlike the generative models
such as ChatGPT, FlanT5 or Llama, which include a decoder
structure, the BERT family of language models are encoder
models and can be fine-tuned for specific tasks such as
classification tasks.
The ALBERT model was proposed by Lan et al. [63]
to reduce memory consumption and increase the training
speed compared to BERT. In other words, ALBERT is a
more lightweight version of BERT that maintains its high
level of accuracy, making it a powerful tool for various NLP
applications. The DistilBERT model was proposed by Sanh
et al. [64]. The authors reported it has 40% fewer parameters
than BERT and runs 60% faster while preserving over 95%
of BERT’s performances as measured on the GLUE language understanding benchmark. Both models are designed
as lightweight alternatives to BERT, with ALBERT emphasizing parameter efficiency and DistilBERT focusing on
knowledge transfer through distillation. Overall, ALBERT,
with a smaller number of parameters, shows more efficient
performance compared to DistilBERT.
To fine-tune these models, we utilized the Huggingface
library [65]. The Huggingface is an open-source library and
data science platform that provides tools to build, train and
deploy ML models. We compare our classification results
with baseline ALBERT9 and DistilBERT10 models finetuned on the UMD dataset by Ghanadian et al. [36]. We
used the Trainer11 class from Huggingface transformers12 for
feature-complete training in PyTorch.
The hyperparameters were selected based on the default
values commonly used in similar studies. The final hyperparameters used in our experiments were Learning Rate=
2e−5 , Batch Size = 4, Dropout Rate = 0.1, and Maximum
Sequence Length = 512. By comparing the performance of
these models on synthetic datasets against the baseline, we
can assess the efficiency of using the synthetic datasets and
gauge the improvements achieved through our fine-tuning
process.
To conduct a comprehensive assessment of the finetuned classifiers’ performance, we generated two distinct sets
of testing subsets. Furthermore, we created an augmented
dataset to showcase the application of synthetic data in the
suicidal ideation detection domain.

9 ALBERT
10 DistilBERT
11 Trainer

8 https://huggingface.co/google/flan-t5-xxl
VOLUME 4, 2016

12 Huggingface Transformers

7

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

1) Testing subsets

We selected two test sets for evaluation purposes: The first
test dataset is the test subset of the UMD datasets utilized in
[36], which is annotated as a 4-class dataset. We employed
a 10-20-70 split for validation, test, and training sets, respectively. Out of the entire dataset, 10% was allocated for
validation purposes, ensuring the model’s hyperparameters
and configurations were appropriately set. 20% of the data
was set aside as a test set to evaluate the model’s performance
on unseen data and ensure its generalizability. The remaining
70% formed the training set, where the bulk of the data was
utilized to train the model and learn the underlying patterns.
This distribution was chosen to provide substantial data for
training while reserving enough distinct data for validation
and robust performance testing. The detailed description of
the Multi-class UMD dataset is presented in Table 1.
TABLE 1: The description of the training and testing subset
of UMD Dataset used in [36] for multi-class task
Multi-class Dataset No Risk Low Risk Moderate Risk High Risk
Training Subset 27.45 % 16.39 %
31.90 %
24.24 %
Number of Users
154
92
179
136
Testing Subset
24.41 % 11.62 %
26.74 %
37.20 %
Number of Users
42
20
46
64

Furthermore, to employ binary classification, we binarize
the UMD Dataset. Based on the definition of each class, “No
Risk” and “Low Risk” classes are considered as Non-Suicidal
and “Moderate Risk” and“High Risk” as Suicidal. Table 2
presents the description of binarized UMD dataset.
TABLE 2: The description of the training and testing subset
of UMD Dataset for binary task
Binary Dataset
Training Subset
Number of Users

Non Suicidal
43.84%
246

Suicidal
56.14%
315

Testing Subset
Number of Users

36.3
62

63.94
110

The second testing set is composed of 10% of each synthetic dataset generated in our project. This test set is annotated independently by two human annotators. A notable 89%
of the labels, initially generated by the generative models,
were agreed upon by the human annotators. However, for the
remaining 11% of the data, the labels were altered based on
the decision of the annotators. In cases where both annotators
agreed on a label, that label was retained. Conversely, when
disagreements arose, the annotators engaged in discussions
to ultimately reach a consensus on the appropriate label. The
details of the generated synthetic dataset are presented in
Table 3 as the eleventh dataset.
2) Augmented Dataset

Data augmentation involves enriching a dataset by introducing variations to its existing instances or generating entirely
8

new instances. This process is designed to enhance the diversity and quality of the dataset, which, in turn, can lead
to improved model performance and generalization. Hence,
in this study, we augment the best performing synthetic
dataset generated by LLMs with different subset sizes of
UMD dataset. Starting with 10% of the UMD training subset,
this subset is combined with the selected synthetic dataset.
The augmented dataset, which is a mix of synthetic and real
instances, is used to fine-tune the pretrained models. We
continue this process by increasing the number of real data
instances, such as 20% and 30%, until achieving comparable
results to those obtained from the model trained on the full
UMD dataset.
IV. RESULTS

In this section, first, we present the characteristics of each
synthetic dataset. Second, we report a comprehensive comparison of the models fine-tuned with them. Third, we report
the data augmentation results. For evaluation, we report two
widely-used metrics in this task, accuracy and F-score, to
provide a complete and informative evaluation of the performance of the classification models [66].
A. SYNTHETIC DATA GENERATION

A total of nine datasets are generated. An extensive description of these datasets, as well as a mixed set and a test
subset, is presented in Table 3. As shown in Table 3, we
created binary datasets and four-class datasets, each with the
option of including or not including the topics. The binary
datasets contain two classes, which allows us to evaluate
the model’s ability to distinguish between suicidal ideation
and non-suicidal instances. On the other hand, the four-class
datasets involve multiple categories, enabling us to explore
more nuanced predictions of suicidal ideation levels, including “No Risk”, “Low Risk”, “Moderate Risk” and “High Risk”
classes. Moreover, the option to include or not include the
topics in these datasets allows us to investigate the impact of
information provided by topics on the model’s performance.
By comparing the results from datasets with and without
topics, we can gain insights into how incorporating topicrelated data enhances or influences the model’s effectiveness
in suicidal ideation detection. Furthermore, as explained in
Section III-C1, we created a synthetic testing dataset comprising 10% of each dataset which is annotated by human
experts.
B. FINE-TUNED CLASSIFIERS

Two models, ALBERT and DistilBERT are fine-tuned with
the generated synthetic datasets. Table 4 presents the results
of the performance evaluation of models fine-tuned with
multi-class synthetic datasets generated by ChatGPT in ZeroShot and Few-Shot settings, tested on the multi-class UMD
test set. Considering the poor performance of the multi-class
synthetic dataset, we have chosen to disregard the multi-class
aspect and proceed solely with the binary approach.
VOLUME 4, 2016

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

TABLE 3: Detailed description of generated synthetic datasets
Dataset #
1
2
3
4
5
6
7
8
9
10
11

Model
Chat GPT
Chat GPT
Chat GPT
Chat GPT
Chat GPT
Flan-T5
Flan-T5
Llama 2
Llama 2
Mix Dataset
Synthetic Testing Set

Learning Method
Zero-Shot
Zero-Shot
Few-Shot
Zero-Shot
Few-Shot
Zero-Shot
Zero-Shot
Zero-Shot
Zero-Shot
Zero-Shot
Zero-Shot

TABLE 4: Performance evaluation of ALBERT and DistilBERT models on Multi-class datasets generated by ChatGPT
Non-Synthetic ChatGPT ChatGPT
UMD Dataset Zero-Shot Few-Shot
Accuracy
0.865
0.41
0.36
ALBERT
F1-Score
0.87
0.43
0.27
Accuracy
0.77
0.06
0.06
DistilBERT
F1-Score
0.75
0.1
0.12
Models

Metrics

Moreover, Table 5 provides the performance of the models
trained on the binary synthetic datasets generated by ChatGPT, Flan-T5 and Llama models and tested on the binary
UMD testing subset. We compare the results for the synthetic
datasets with those of the UMD training set. Table 5 shows
that incorporating topic in generating the datasets significantly improves the performance of the models. For instance,
for Llama 2, the topic-oriented dataset increased the F1-score
and accuracy of the ALBERT model by 10% and 14% points,
respectively. We also created a mixed dataset, including
all topic-oriented datasets, to further evaluate the effects of
topics on the performance of the models. With both ALBERT
and DistilBERT, an F1-score of 0.82 is achieved by the mixed
dataset, which is significantly higher than the DistilBERT
model trained on the UMD dataset and comparable with the
performance of the ALBERT model fine-tuned on the UMD
dataset with an F1-score of 0.87.
Table 6 presents the results of models included in Table 5
but tested on synthetic testing datasets. Similar to the results
of Table 5, all of the topic-oriented datasets show significant improvement compared to the datasets without any
topics. ChatGPT-generated training data, with an F1-score of
0.82, exhibits the best performance, while the performances
of Flan-T5 and Llama2-generated datasets are acceptable.
Moreover, the mixed dataset shows a 0.81 F1-score, which
is an 11% improvement compared to the model trained with
the UMD dataset.
C. DATA AUGMENTATION

Based on the results presented in Table 5 and Table 6, the
datasets generated by ChatGPT in the Zero-Shot setting show
the best results compared to the other datasets. As explained
VOLUME 4, 2016

Topic-Oriented
Yes
No
Yes
Yes
Yes
Yes
No
Yes
No
Yes
N/A

# of Class
2
2
2
4
4
2
2
2
2
2
2

# of Instances
549
646
545
492
594
561
502
395
613
1352
318

in section III-C2, the augmented dataset now contains a
mix of synthetic and real data instances. The augmented
dataset is used to fine-tune the pretrained models and then
evaluated on two separate testing sets. In each iteration,
three folds, each comprising 10% of non-overlapping random
samples from the UMD dataset, are added to the synthetic
data. Subsequently, the average13 of the accuracy and F1score are calculated and reported in Table 7. If the model’s
performance with the augmented dataset is less than the
model trained with the UMD dataset, additional real-world
data is gradually incorporated. For instance, the percentage
of real data can be increased to 20% in the next iteration, and
the training and evaluation process is repeated.
Throughout the iterations, the model’s performance is
closely monitored and compared to the baseline model
trained solely on the UMD dataset. The aim is to identify
the point at which the augmented dataset starts producing
results comparable to or even surpassing those of the baseline
model. The process continues until an optimal percentage
of real data is found, where the model achieves similar
results as the baseline. This ratio indicates the ideal balance
between synthetic and real data for achieving high model
performance and generalization. Table 7 shows the results of
each augmentation process until we achieved the F1-score of
0.87 on the UMD testing subset at 30% augmentation rate
and F1-score of 0.85 on the synthetic testing subset at 10%
augmentation rate.
V. DISCUSSIONS

This study focuses on the generation of synthetic datasets
using generative models and subsequently assessing the
performance of models fine-tuned with these datasets. Our
synthetic data generation framework addresses two limitations of real-world data collection and annotation. First, we
address the data scarcity and annotation cost by generating
micropost-like suicidal/non-suicidal text. Second, we address
the lack of diversity in real-world data by forcing the generative models to create a balanced number of examples related
to each of the psychological and social factors impacting sui13 we also calculated the standard deviation of the metrics which were
always <0.02.

9

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

TABLE 5: Performance evaluation of the ALBERT and DistilBERT models fine-tuned with binary datasets and tested on UMD
testing subset
Non-Synthetic
Models
ALBERT
DistilBERT

Metrics

UMD Dataset

Accuracy
F1-Score
Accuracy
F1-Score

0.87
0.87
0.77
0.75

With Topic
Few-Shot
0.67
0.66
0.61
0.59

ChatGPT
Without Topic
Zero-Shot
0.70
0.79
0.63
0.69

Flan-T5
With Topic
Zero-Shot
0.71
0.79
0.64
0.71

Llama 2

Mix Dataset

Without Topic

With Topic

Without Topic

With Topic

With Topic

0.48
0.54
0.59
0.61

0.62
0.64
0.77
0.84

0.33
0.49
0.32
0.15

0.75
0.78
0.75
0.77

0.77
0.82
0.76
0.82

TABLE 6: Performance evaluation of the ALBERT and DistilBERT models fine-tuned with binary datasets and tested on
synthetic testing subset
Non-Synthetic
Models
ALBERT
DistilBERT

Metrics

UMD Dataset

Accuracy
F1-Score
Accuracy
F1-Score

0.67
0.70
0.40
0.61

With Topic
Few-Shot
0.71
0.69
0.65
0.61

ChatGPT
Without Topic
Zero-Shot
0.81
0.78
0.83
0.81

Flan-T5
With Topic
Zero-Shot
0.81
0.82
0.85
0.81

TABLE 7: Performance evaluation of the ALBERT model
fine-tuned with the augmented dataset (synthetic data + a
subset of the UMD train set) and tested on UMD and synthetic testing subsets
10%
20%
30%
UMD
f
Dataset (Avg. of 3 Folds)* (Avg. of 3 Folds)* (Avg. of 3 Folds)*
Accuracy 0.87
0.75
0.81
0.83
UMD Testing Set
F1-Score 0.87
0.79
0.84
0.88
Accuracy 0.67
0.87
0.87
0.90
Synthetic Testing Set
F1-Score 0.70
0.83
0.86
0.88
Test Set

*

Metric

Standard Deviation< 2%

cidality. Integrating insights from psychology into the NLP
pipeline in this context can illuminate previously unexplored
facets of suicide and mental health detection in social media.
We created several datasets, including binary and multiclass, in Zero-Shot and Few-Shot settings, topic-oriented and
non-topic-oriented, with three different generative LLMs.
Early in our experiments (Table 4), we observed that ChatGPT is not able to produce high-quality multi-class datasets
in either the Zero-Shot or the Few-Shot settings. Generating
multi-class datasets using LLMs such as ChatGPT is more
complex and challenging task due to the inherent complexities involved in distinguishing between multiple and finegrained, classes. Even with the availability of a high-quality
dataset, one should anticipate lower accuracies in multi-class
scenarios. This is largely attributed to the ambiguous boundaries that exist between these classes, creating a complex
landscape that proves difficult for any classifier to navigate
successfully. Moreover, the creation of such datasets necessitates not only a detailed prompt but also specific instructions
that outline the multi-class scenarios. This process demands
a nuanced understanding and a level of specificity that often
poses a considerable challenge to ChatGPT. Longer [67].
As a result, we opted to exclusively create binary datasets
and focus our investigation on how topics impact the overall
generalizability of the fine-tuned models.
Our results show the critical role of incorporating domain
knowledge in synthetic data generation. We extracted the
relevant social topics from the Psychology literature and used
that to create more focused prompts for data generation.
10

Llama 2

Mix Dataset

Without Topic

With Topic

Without Topic

With Topic

With Topic

0.34
0.41
0.63
0.69

0.63
0.69
0.86
0.84

0.48
0.24
0.49
0.12

0.70
0.73
0.63
0.69

0.83
0.81
0.78
0.73

Table 8 displays a selection of binary samples generated by
ChatGPT within the synthetic dataset using social topics. The
table provides an illustration of specific examples generated
by this GLLM.
Table 5 presents the results of fine-tuned models on synthetic datasets and tested on the UMD dataset. Comparison between topic-oriented datasets and no topic-oriented
datasets shows the significant effects of including the topics
on the performance of the generated datasets. Informing the
data generation with topics in Flan-T5 and Llama2 increased
the F1-Score of the ALBERT model by 10% and 29% points,
respectively. Fine-tuning models on topic-oriented synthetic
datasets allows them to gain diverse domain-specific knowledge and patterns. Moreover, non-topic-oriented synthetic
datasets might lack specificity, leading to noise and irrelevant
content. In contrast, topic-oriented datasets are curated to focus on a specific domain, reducing the chances of generating
irrelevant or out-of-context text.
We showed that the BERT family fine-tuned with realworld data can achieve an F1 score ranging from 0.75 to 0.87,
depending on the complexity of their structure. Specifically,
DistilBERT, a less efficient model from the BERT Family,
achieves an F1-score of 0.75, while ALBERT, a more optimized model designed for speed and accuracy, attains an
F1-score of 0.87. In contrast, both DistilBERT and ALBERT
achieve a consistent F1-score of 0.82 when trained on purely
synthetic data and tested on real-world data. With this, we
demonstrate that the diversity of synthetic data compensates
for model complexity irrespective of its architecture. This
not only underscores the considerable potential of synthetic
data but also suggests that it can mitigate the limitations
of real-world data in capturing diverse topics. Most notably, our results emphasize an optimal strategy that involves
augmenting synthetic data with real data. This innovative
method achieves performances comparable to the ALBERT
model, even when relying on merely 30% of the manually annotated dataset. This solidifies our proposed method
as a cost-effective alternative, addressing the challenges of
data scarcity and diversity more effectively than the current
VOLUME 4, 2016

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

TABLE 8: Generated synthetic samples using extracted social and psychological topics by ChatGPT
Topics

Depression

Labels
1

0

Bullying

1

0

Unemployment

1

0

Relationship problems

1

0

Sample
Growing up, I always felt like nobody wanted me. No one ever seemed to care about my thoughts, feelings,
or emotions. As I got older, these feelings intensified, and I started to feel like I was a burden on everyone
around me. I never felt like I belonged anywhere and I constantly struggled with my depression. The constant
emptiness and loneliness eventually became too much to bear, and I began to contemplate ending my life.
I’ve been struggling with mental health problems for as long as I can remember. Depression have been my
constant companions. But I’ve never let it stop me from living my life. I’ve learned to manage my symptoms
with the help of therapy and medication. Some days are harder than others, but I keep pushing forward and
focusing on the good in my life. I’ve come to accept that mental illness is a part of me, but it doesn’t define me.
I felt like I was constantly being attacked by my classmates. They called me names, pushed me around,
and even vandalized my belongings. The worst part was feeling so alone and misunderstood. I didn’t
want to burden my parents with my problems so I kept it all inside. The idea of ending it all seemed like
the only way to escape the pain and torment.
When I was in middle school, I was constantly bullied for being overweight. It was tough going to school
every day and constantly being made fun of by my peers. I felt so alone and helpless. But I didn’t let the
bullies control my life. I joined a support group and found friends who lifted me up. I also started exercising
and eating healthier, not to please anyone else, but to feel better about myself. It wasn’t an easy journey, but
it was worth it. I’m happy to say that today I am confident and proud of who I am.
After losing my job, I stayed unemployed for months. My savings ran out and with no source of income,
my bills piled up. The constant fear of not being able to provide for myself and my family drove me to
the brink of despair. I feel worthless and like a burden on everyone around me. The future seems bleak
and hopeless, and I wonder if it’s worth it to keep going.
After graduating from college, I struggled to find a job in my field for a few months. It was frustrating
and disheartening, but I kept applying and networking. Eventually, I landed a job in a related field that
I enjoy. It wasn’t my dream job, but it paid the bills and gave me experience. I’m still looking for my
dream job, but I’m grateful for what I have and optimistic about my future prospects.
I thought I had found the one but it seems like I was wrong because he left me for someone else. I don’t
know how to deal with this pain . I can’t sleep, I can’t eat, and I just want to disappear. Maybe everything
would be easier if I just ended it all.
My relationship with my partner hasn’t been going well lately. We have been arguing over small things,
and it’s affecting our mental health. We decided to go for couples therapy, and it’s been a turning point for
us. We learned to communicate better and understand each other’s perspective. Now we are in a better
place and happier than ever before.

benchmarks. However, Synthetic datasets often exhibit a
distributional shift from real-world data. This shift arises due
to the inherent differences in the data generation processes
between synthetic and real domains. As a result, models
trained solely on synthetic datasets may not be applicable
in real-world situations, leading to a lack of robustness
and adaptability. Therefor, exploring hybrid approaches that
combine synthetic and real-world data for training can offer
a more comprehensive solution. Leveraging both sources
allows models to learn from the strengths of synthetic data
while adapting to the intricacies of real-world environments.
As presented in Table 5 and Table 7, our study’s central
objective was to investigate the potential of synthetic and
augmented data in training models to perform effectively on
real-world data . Given this setup, the chance of overfitting is
inherently reduced since the training (synthetic) and testing
(real-world) datasets are obtained from distinct distributions.
Moreover, to better understand the performance, robustness,
and limitations of the fine-tuned classifiers, we curated an
additional test set by manual annotation of a subset of the
synthetic data. Additional tests ensure that the models do
not overfit a particular dataset and can handle a variety
of data distributions and scenarios. Table 6 presents the
performance results of the fine-tuned models evaluated on
the human-annotated synthetic dataset. Notably, the topicoriented ChatGPT dataset stands out with an F1-score of
0.82, demonstrating its superior performance compared to
VOLUME 4, 2016

the other datasets. Specifically, the model trained with the
UMD dataset falls short in handling the synthetic test set,
presumably because of its less diverse topics.
VI. CONCLUSION AND FUTURE WORKS

The accurate identification of suicidal ideation from textual
data holds paramount importance for early intervention and
prevention efforts. Natural Language Processing (NLP) techniques have shown promise in this domain, but the scarcity
and sensitivity of real suicide-related data pose significant
challenges. Gathering and annotating real suicide-related
data is a resource-intensive and ethically sensitive process.
Synthetic data generation methods, such as text generation
models and data augmentation techniques, offer a more costeffective way to supplement real data. Also, our synthetic
datasets offer a potential solution by providing additional
social and psychological context in training instances for
models to address the limitations of the existing real data.
Our data augmentation results show that incorporating
synthetic data into the training pipeline helps diversify the
dataset and enhance model generalization. Real data is often
limited in size, leading to over-fitting and reduced model
performance. However, by carefully blending synthetic data
with real data, we can bolster the model’s performance while
maintaining a balance between practicality and sensitivity.
Moving forward, exploring the diversity of Language
Models (LLMs) stands as an intriguing avenue for future
11

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

research. Investigating and quantifying the extent of diversity
within LLMs across various domains, languages, and training
methodologies could offer valuable insights. Future works
could delve into developing robust metrics or methodologies
specifically tailored to assess and measure diversity within
these models.
This paper has effectively highlighted the advantages of
using synthetic data generation techniques in detecting suicidal ideation. However, the field still presents numerous
opportunities for further research and refinement. Future
initiatives could focus on the adaptation of models to various
linguistic and cultural environments, acknowledging the diverse ways people express suicidal thoughts across different
languages and cultures. Furthermore, a holistic approach that
integrates multiple data modalities, such as images, audio, or
behavioral data, alongside textual information could enhance
the detection process. It’s also crucial to set up a framework
that allows for the continuous evaluation and optimization of
models, given the ever-changing nature of online communication patterns and user behaviors.
VII. ETHICAL CONSIDERATIONS

For this research, we obtained ethics approval from the
research ethics board at the University of Ottawa. Moreover, the UMD dataset was used with authorization from
its creators, and we adhered to the terms of use and ethical
standards 14 provided by them.
The use of LLMs for suicide-related synthetic datasets
raises several ethical considerations. Firstly, synthetic
datasets should be generated in a way that avoids perpetuating or amplifying biases present in the original data. It is
important to carefully examine the underlying data and the
algorithms used in generating synthetic datasets to ensure
fairness and mitigate potential biases.
Secondly, the process of generating synthetic datasets
should be transparent and well-documented. It is essential to
provide clear information about the methods used, assumptions made, and limitations of the synthetic data. This enables
others to assess and evaluate the validity and appropriateness
of using synthetic datasets.
Thirdly, to use synthetic datasets in sensitive applications
or decision-making processes, accountability and liability
should be considered. Care should be taken to understand
the potential impact and consequences of decisions or actions based on synthetic data and establish mechanisms for
addressing any negative outcomes or biases that may arise.
REFERENCES
[1] Domenico De Berardis, Giovanni Martinotti, and Massimo Di Giannantonio. Understanding the complex phenomenon of suicide: from research to
clinical practice. Frontiers in psychiatry, 9:61, 2018.
[2] E Rajesh Kumar and N Venkatram. Predicting and analyzing suicidal
risk behavior using rule-based approach in twitter data. Soft Computing,
ePub:1–9, 2023.
14 The University of Maryland Reddit Suicidality Dataset

12

[3] Ali Raza, Furqan Rustam, Hafeez Ur Rehman Siddiqui, Isabel de la Torre
Diez, Begoña Garcia-Zapirain, Ernesto Lee, and Imran Ashraf. Predicting
genetic disorder and types of disorder using chain classifier approach.
Genes, 14(1):71, 2022.
[4] Asma Abdulsalam and Areej Alhothali. Suicidal ideation detection on
social media: A review of machine learning methods. arXiv preprint
arXiv:2201.10515, 2022.
[5] Zepeng Li, Jiawei Zhou, Zhengyi An, Wenchuan Cheng, and Bin Hu. Deep
hierarchical ensemble model for suicide detection on imbalanced social
media data. Entropy, 24(4):442, 2022.
[6] Dheeraj Kodati and Ramakrishnudu Tene. Identifying suicidal emotions
on social media through transformer-based deep learning. Applied Intelligence, 53(10):11885–11917, 2023.
[7] Mian Muhammad Sadiq Fareed, Ali Raza, Na Zhao, Aqil Tariq, Faizan
Younas, Gulnaz Ahmed, Saleem Ullah, Syeda Fizzah Jillani, Irfan Abbas,
and Muhammad Aslam. Predicting divorce prospect using ensemble
learning: Support vector machine, linear model, and neural network.
Computational Intelligence and Neuroscience, 2022, 2022.
[8] Qiang Wei, Amy Franklin, Trevor Cohen, and Hua Xu. Clinical text
annotation–what factors are associated with the cost of time? In AMIA
Annual Symposium Proceedings, volume 2018, page 1552. American
Medical Informatics Association, 2018.
[9] Rohit Babbar and Bernhard Schölkopf. Data scarcity, robustness and
extreme multi-label classification. Machine Learning, 108(8-9):1329–
1351, 2019.
[10] Sergey I Nikolenko. Synthetic data for deep learning, volume 174.
Springer, 2021.
[11] Yingzhou Lu, Huazheng Wang, and Wenqi Wei. Machine learning for
synthetic data generation: a review. arXiv preprint arXiv:2302.04062,
2023.
[12] Hung Chau, Saeid Balaneshin, Kai Liu, and Ondrej Linda. Understanding
the tradeoff between cost and quality of expert annotations for keyphrase
extraction. In Proceedings of the 14th Linguistic Annotation Workshop,
pages 74–86, 2020.
[13] Dirk Hovy and Diyi Yang. The importance of modeling social factors of
language: Theory and practice. In Proceedings of the 2021 Conference
of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages 588–602, Online, June
2021. Association for Computational Linguistics.
[14] Ziniu Hu, Yichong Xu, Wenhao Yu, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Kai-Wei Chang, and Yizhou Sun. Empowering language
models with knowledge graph reasoning for open-domain question answering. In Proceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing, pages 9562–9581, Abu Dhabi, United Arab
Emirates, December 2022. Association for Computational Linguistics.
[15] Joseph C Franklin, Jessica D Ribeiro, Kathryn R Fox, Kate H Bentley,
Evan M Kleiman, Xieyining Huang, Katherine M Musacchio, Adam C
Jaroszewski, Bernard P Chang, and Matthew K Nock. Risk factors for
suicidal thoughts and behaviors: A meta-analysis of 50 years of research.
Psychological bulletin, 143(2):187, 2017.
[16] Kate H Bentley, Joseph C Franklin, Jessica D Ribeiro, Evan M Kleiman,
Kathryn R Fox, and Matthew K Nock. Anxiety and its disorders as
risk factors for suicidal thoughts and behaviors: A meta-analytic review.
Clinical psychology review, 43:30–46, 2016.
[17] Laura Orsolini, Roberto Latini, Maurizio Pompili, Gianluca Serafini,
Umberto Volpe, Federica Vellante, Michele Fornaro, Alessandro Valchera,
Carmine Tomasetti, Silvia Fraticelli, et al. Understanding the complex of
suicide in depression: from research to clinics. Psychiatry investigation,
17(3):207, 2020.
[18] Ned H Kalin. Insights into suicide and depression. Am J Psychiatry, pages
877–880, 2020.
[19] Lucas da Silva Costa, Átila Pereira Alencar, Pedro Januário Nascimento
Neto, Maria do Socorro Vieira dos Santos, Cláudio Gleidiston Lima
da Silva, Sally de França Lacerda Pinheiro, Regiane Teixeira Silveira,
Bianca Alves Vieira Bianco, Roberto Flávio Fontenelle Pinheiro Júnior,
Marcos Antonio Pereira de Lima, et al. Risk factors for suicide in bipolar
disorder: a systematic review. Journal of affective disorders, 170:237–254,
2015.
[20] Joel Paris. Suicidality in borderline personality disorder. Medicina,
55(6):223, 2019.
[21] Kyoung Hag Lee, Jung Sim Jun, Yi Jin Kim, Soonhee Roh, Sung Seek
Moon, Ngoyi Bukonda, and Lisa Hines. Mental health, substance abuse,
and suicide among homeless adults. Journal of evidence-informed social
work, 14(4):229–242, 2017.
VOLUME 4, 2016

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

[22] Chukwudi Okolie, Michael Dennis, Emily Simon Thomas, and Ann John.
A systematic review of interventions to prevent suicidal behaviors and
reduce suicidal ideation in older people. International psychogeriatrics,
29(11):1801–1824, 2017.
[23] Evan M Kleiman, Brianna J Turner, Szymon Fedor, Eleanor E Beale, Jeff C
Huffman, and Matthew K Nock. Examination of real-time fluctuations
in suicidal ideation and its risk factors: Results from two ecological momentary assessment studies. Journal of abnormal psychology, 126(6):726,
2017.
[24] Nicholas Leigh-Hunt, David Bagguley, Kristin Bash, Victoria Turner,
Stephen Turnbull, Nicole Valtorta, and Woody Caan. An overview of
systematic reviews on the public health consequences of social isolation
and loneliness. Public health, 152:157–171, 2017.
[25] Julianne Holt-Lunstad, Timothy B Smith, Mark Baker, Tyler Harris, and
David Stephenson. Loneliness and social isolation as risk factors for
mortality: a meta-analytic review. Perspectives on psychological science,
10(2):227–237, 2015.
[26] Adelyn Allchin, Vicka Chaplin, and Joshua Horwitz. Limiting access
to lethal means: applying the social ecological model for firearm suicide
prevention. Injury prevention, 25(Suppl 1):i44–i48, 2019.
[27] Kimberly A Van Orden, Tracy K Witte, Kelly C Cukrowicz, Scott R
Braithwaite, Edward A Selby, and Thomas E Joiner Jr. The interpersonal
theory of suicide. Psychological review, 117(2):575, 2010.
[28] Amy Wenzel and Aaron T Beck. A cognitive model of suicidal behavior:
Theory and treatment. Applied and preventive psychology, 12(4):189–
201, 2008.
[29] Robert J Cramer and Nestor D Kapusta. A social-ecological framework
of theory, assessment, and prevention of suicide. Frontiers in psychology,
8:1756, 2017.
[30] Andrea C Fernandes, Rina Dutta, Sumithra Velupillai, Jyoti Sanyal, Robert
Stewart, and David Chandran. Identifying suicide ideation and suicidal
attempts in a psychiatric clinical research database using natural language
processing. Scientific reports, 8(1):7426, 2018.
[31] Cosmin A Bejan, Michael Ripperger, Drew Wilimitis, Ryan Ahmed,
JooEun Kang, Katelyn Robinson, Theodore J Morley, Douglas M Ruderfer, and Colin G Walsh. Improving ascertainment of suicidal ideation
and suicide attempt with natural language processing. Scientific reports,
12(1):15146, 2022.
[32] M Johnson Vioules, Bilel Moulahi, Jérôme Azé, and Sandra Bringay.
Detection of suicide-related posts in twitter data streams. IBM Journal
of Research and Development, 62(1):7–1, 2018.
[33] Robert C Hsiung. A suicide in an online mental health support group:
reactions of the group members, administrative responses, and recommendations. CyberPsychology & Behavior, 10(4):495–500, 2007.
[34] Jared Jashinsky, Scott H Burton, Carl L Hanson, Josh West, Christophe
Giraud-Carrier, Michael D Barnes, and Trenton Argyle. Tracking suicide
risk factors through twitter in the us. Crisis, 2014.
[35] Gualtiero B Colombo, Pete Burnap, Andrei Hodorog, and Jonathan Scourfield. Analysing the connectivity and communication of suicidal users on
twitter. Computer communications, 73:291–300, 2016.
[36] Hamideh Ghanadian, Isar Nejadgholi, and Hussein Al Osman. Chatgpt for
suicide risk assessment on social media: Quantitative evaluation of model
performance, potentials and limitations. arXiv preprint arXiv:2306.09390,
2023.
[37] Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, and Sophia Ananiadou. On the evaluations of chatgpt and emotion-enhanced prompting
for mental health analysis. arXiv preprint arXiv:2304.03347, 2023.
[38] Pradyumna Prakhar Sinha, Rohan Mishra, Ramit Sawhney, Debanjan
Mahata, Rajiv Ratn Shah, and Huan Liu. # suicidal-a multipronged
approach to identify and explore suicidal ideation in twitter. In Proceedings
of the 28th ACM international conference on information and knowledge
management, pages 941–950, 2019.
[39] Manas Gaur, Amanuel Alambo, Joy Prakash Sain, Ugur Kursuncu, Krishnaprasad Thirunarayan, Ramakanth Kavuluru, Amit Sheth, Randy Welton,
and Jyotishman Pathak. Knowledge-aware assessment of severity of
suicide risk for early intervention. In The world wide web conference,
pages 514–525, 2019.
[40] Ayah Zirikly, Philip Resnik, Özlem Uzuner, and Kristy Hollingshead.
CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit
posts. In Proceedings of the Sixth Workshop on Computational Linguistics
and Clinical Psychology, June 2019.
[41] Han-Chin Shing, Suraj Nair, Ayah Zirikly, Meir Friedenberg, Hal Daumé
III, and Philip Resnik. Expert, crowdsourced, and machine assessment of
suicide risk via online postings. In Proceedings of the Fifth Workshop
VOLUME 4, 2016

on Computational Linguistics and Clinical Psychology: From Keyboard to
Clinic, pages 25–36, 2018.
[42] Xuanli He, Islam Nassar, Jamie Kiros, Gholamreza Haffari, and Mohammad Norouzi. Generate, annotate, and learn: NLP with synthetic text.
Transactions of the Association for Computational Linguistics, 10:826–
842, 2022.
[43] Ahmad Rashid, Vasileios Lioutas, and Mehdi Rezagholizadeh. Mate-kd:
Masked adversarial text, a companion to knowledge distillation. arXiv
preprint arXiv:2105.05912, 2021.
[44] Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira.
Inpars: Unsupervised dataset generation for information retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 2387–2392, 2022.
[45] Julie Boergers, Anthony Spirito, and Deidre Donaldson. Reasons for
adolescent suicide attempts: Associations with psychological functioning.
Journal of the American Academy of Child & Adolescent Psychiatry,
37(12):1287–1293, 1998.
[46] E David Klonsky, Alexis M May, and Boaz Y Saffer. Suicide, suicide
attempts, and suicidal ideation. Annual review of clinical psychology,
12:307–330, 2016.
[47] Rúnar Vilhjálmsson, E Sveinbjarnardottir, and G Kristjansdottir. Factors
associated with suicide ideation in adults. Social psychiatry and psychiatric epidemiology, 33:97–103, 1998.
[48] Cristina Lázaro-Pérez, Pilar Munuera Gómez, José Ángel MartínezLópez, and José Gómez-Galán. Predictive factors of suicidal ideation
in spanish university students: a health, preventive, social, and cultural
approach. Journal of clinical medicine, 12(3):1207, 2023.
[49] Jia-In Lee, Ming-Been Lee, Shih-Cheng Liao, Chia-Ming Chang, SuzChieh Sung, Hung-Chi Chiang, and Chuan-Wan Tai. Prevalence of suicidal
ideation and associated risk factors in the general population. Journal of
the Formosan Medical Association, 109(2):138–147, 2010.
[50] Amy Farabaugh, Stella Bitran, Maren Nyer, Daphne J Holt, Paola Pedrelli,
Irene Shyu, Steven D Hollon, Sidney Zisook, Lee Baer, Wilma Busse, et al.
Depression and suicidal ideation in college students. Psychopathology,
45(4):228–234, 2012.
[51] John R Peteet, Guy Maytal, and Haleh Rokni. Unimaginable loss:
contingent suicidal ideation in family members of oncology patients.
Psychosomatics, 51(2):166–170, 2010.
[52] Katarzyna Anna Ratkowska and Diego De Leo. Suicide in immigrants: An
overview. 2013.
[53] Joseph D Hovey. Acculturative stress, depression, and suicidal ideation in
mexican immigrants. Cultural Diversity and Ethnic Minority Psychology,
6(2):134, 2000.
[54] Brian TaeHyuk Keum, Michele J Wong, and Rangeena Salim-Eissa. Gendered racial microaggressions, internalized racism, and suicidal ideation
among emerging adult asian american women. International journal of
social psychiatry, 69(2):342–350, 2023.
[55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. Advances in neural information processing systems, 30:5–
8, 2017.
[56] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement
Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan
Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural
language processing: system demonstrations, pages 38–45, 2020.
[57] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung
Poon, and Tie-Yan Liu. Biogpt: generative pre-trained transformer for
biomedical text generation and mining. Briefings in Bioinformatics,
23(6):bbac409, 2022.
[58] Xin Xie, Ningyu Zhang, Zhoubo Li, Shumin Deng, Hui Chen, Feiyu
Xiong, Mosha Chen, and Huajun Chen. From discrimination to generation:
Knowledge graph completion with generative transformer. In Companion
Proceedings of the Web Conference 2022, pages 162–165, 2022.
[59] Fei Mi, Yitong Li, Yulong Zeng, Jingyan Zhou, Yasheng Wang, Chuanfei
Xu, Lifeng Shang, Xin Jiang, Shiqi Zhao, and Qun Liu. Pangu-bot: Efficient generative dialogue pre-training from pre-trained language model.
arXiv preprint arXiv:2203.17090, 2022.
[60] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay,
William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha
Brahma, et al. Scaling instruction-finetuned language models. arXiv
preprint arXiv:2210.11416, 2022.
[61] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhar13

Ghanadian et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

gava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat
models. arXiv preprint arXiv:2307.09288, 2023.
[62] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[63] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel,
Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised
learning of language representations. arXiv preprint arXiv:1909.11942,
2019.
[64] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv
preprint arXiv:1910.01108, 2019.
[65] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement
Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan
Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural
language processing. arXiv preprint arXiv:1910.03771, 2019.
[66] Marina Sokolova and Guy Lapalme. A systematic analysis of performance
measures for classification tasks. Information processing & management,
45(4):427–437, 2009.
[67] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su,
Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al.
A multitask, multilingual, multimodal evaluation of chatgpt on reasoning,
hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.

HUSSEIN AL OSMAN is an Associate Professor at the School of Electrical Engineering and
Computer Science at the University of Ottawa. He
completed his Ph.D. in Electrical and Computer
Engineering at the University of Ottawa in 2014.
He leads the Multimedia Processing and Interaction group and is a member of the Multimedia
Computing and the Distributed and Collaborative
Virtual Environments Research laboratories. His
research focuses on the application of artificial
intelligence in affective computing and biomedical engineering. In particular, he is interested in the development of multi-modal affect recognition
methods using deep artificial neural networks to estimate facial expressions
and speech sentiment. He studies remote physiological signal measurement
using video signals and applies this technology to biomedical and HumanComputer Interaction (HCI) applications. He conducts research in HCI,
especially the development of serious games intended for physical rehabilitation and education.

HAMIDEH GHANADIAN is Ph.D. Candidate in
Electrical Engineering and Computer Science at
the University of Ottawa. She also completed her
MASc degree in Electrical Engineering and Computer Science at the University of Ottawa in 2018.
Her research focuses on Natural Language Processing, Applied Machine Learning, Social Media
Processing and explainability of AI systems. Her
work particularly focuses on the application of
natural language processing techniques on suicide
and mental health detection on social media platforms and exploring the
ways in which NLP can be used to better understand human psychology.

ISAR NEJADGHOLI is a senior research scientist at the National Research Council Canada and
an adjunct professor at the University of Ottawa.
She completed her PhD in Artificial Intelligence
at the AmirKabir University of Technology, Iran
and her postdoctoral studies at the University of
Ottawa, Canada, in 2016. Her research interests include machine learning applications, particularly
natural language processing, social media data
analysis and medical text processing. Her work
also focuses on responsible AI, specifically on evaluating and improving the
transparency and fairness of natural language processing systems.
14

VOLUME 4, 2016

