diff --git a/_bibliography.bib b/_bibliography.bib
index de51cef8a3000c93245f6d054e4d67eb4090b7c3..f606052dd5c3762651839278474b1d6e56e111ac 100644
--- a/_bibliography.bib
+++ b/_bibliography.bib
@@ -113,5 +113,11 @@ isbn="978-3-540-36127-5"
 note = {Accessed: 2024-07-19}
 }
+@misc{huggingfacetokenizer,
+  title = {Tokenizer summary},
+  howpublished = {\url{https://huggingface.co/docs/transformers/en/tokenizer_summary}},
+  note = {Accessed: 2024-07-19}
+}
+
diff --git a/methodology.tex b/methodology.tex
index 68061098c13d47f32753b3f5cfcf91709ac8a463..17730bca1580fa6680fb5ff16e7582e893462577 100644
--- a/methodology.tex
+++ b/methodology.tex
@@ -1,87 +1,158 @@
 \chapter{Methodology}
-\section{Related Work}
-Now first of all there already has been a decent amount of approaches for automatic text-summarization.\\
-One of the oldest and most cited papers from 2002 belongs to ``Automatic Text Summarization Using a Machine Learning Approach'' from \cite{10.1007/3-540-36127-8_20}. It describes a summarization procedure based on naive Bayes and C4.5 decision tree with different compression rates. The results where it utilizes the Naive Bayes classifier and a higher compression rate beeing more yielding better precision and recall.
-Creating a Characterization is quite similar to making a Summarization of character related content but could also include deductions made from the behavior of that character.
-A recent Paper from 2021 \cite{brahman-etal-2021-characters-tell} presents a dataset called LiSCU (Literary Summaries with Character Understanding) that aims to facilitate research in character-centric narrative understanding. They used techniques for Character Identification, where the goal is to identify a character's name from an anonymized description, and Character Description Generation, which involves generating a description for a given character based on a literature summary.
-might exceed model limits:
-Length Truncation: Simply truncating the summary at the end.
-Coreference Truncation: Using SpanBERT to identify sentences in the summary that mention the character, focusing on these sentences.
+\section{Tokenization}
+Tokens are the fundamental units of data processing in natural language processing (NLP). A token is the smallest meaningful unit of text, which can be a word, subword, or even a single character or punctuation mark. Tokenization is typically performed at one of three levels: single characters (character-based tokenization), subwords (subword-based tokenization), or whole words (word-based tokenization).
+In most modern NLP models, subword tokenization is predominantly used. This technique breaks words into smaller units, such as prefixes and suffixes. Word-based tokenizers generate a very large vocabulary, lose the connection between very similar words, and produce a large number of out-of-vocabulary tokens; character-based tokenization, in turn, yields tokens with minimal meaning in context and an enormous number of tokens per tokenized text. Subword-based tokenization seeks a middle ground: rare words are decomposed into meaningful subwords, while meaningful or frequently used words are kept as one (or few) tokens.
+Subword tokenizers are employed in almost every widely-used large language model (LLM) such as GPT-2 and Llama 3, and in large pre-trained language models like BERT; a minimal usage sketch is shown below.
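+The following short sketch illustrates subword tokenization in practice using the Hugging Face \texttt{transformers} library; the chosen model name and the exact subword split are only illustrative and depend on the learned vocabulary:
+\begin{verbatim}
+from transformers import AutoTokenizer
+
+# load the WordPiece tokenizer that ships with BERT
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+# frequent words tend to stay whole, rare words are split into pieces
+print(tokenizer.tokenize("An unforgettable characterization"))
+# possible output: ['an', 'un', '##for', '##get', '##table', ...]
+# (the exact pieces depend on the tokenizer's vocabulary)
+\end{verbatim}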
-GPT-2: With a maximum input length of 1024 tokens.
-BART (Bidirectional and Auto-Regressive Transformers): Extended to accept up to 2048 tokens.
+\cite{huggingfacetokenizer}
-Longformer: Leveraged for its efficient encoding mechanism to handle long texts, allowing inputs up to 16,384 tokens when using the full text of books.
+\section{The Transformer}
+The Transformer architecture, introduced in June 2017, marked a significant advancement in natural language processing (NLP), initially focusing on sequence-to-sequence problems like machine translation. However, its capabilities quickly revealed a broader potential, particularly in developing large language models (LLMs). These models are trained on vast amounts of raw text using self-supervised learning, a method where the training objective is derived automatically from the input data. Through this pre-training the model develops a statistical understanding of the language, but it still needs to be adapted further, e.g. via masked language modeling or causal language modeling. The Transformer consists of an encoder and a decoder.
+% https://arxiv.org/abs/1706.03762
-BLEU-4, ROUGE-n (n=1, 2), ROUGE-L F-1 scores, and BERTScore to measure similarity and quality.
-performed better with length truncation
+\begin{figure}[h]
+
+  \centering
+  \includegraphics[width=8cm]{ressources/images/Transformer.png}
+  \caption{Transformer architecture from the original paper}
+  \end{figure}
+\subsection{Encoder}
+The encoder takes an input sequence and breaks it down into individual tokens (words or sub-words).
+For each token an embedding vector is computed, which is a numerical representation of that token, capturing its semantic meaning.
-Errors in coreference resolution impacted the coreference truncation performance.
+A key component of the encoder is the self-attention mechanism. Self-attention enables the model to consider the entire sequence when encoding each token, allowing it to weigh the relevance of other tokens in the input sequence dynamically. For each token, the self-attention mechanism computes attention scores that determine the influence of all other tokens in the sequence. As a result, the embedding vector generated for each token does not only represent the token alone but also its left and right context.
+The encoder consists of multiple identical layers, or encoder blocks. Each encoder block contains two main sub-layers:
-\cite*{schroder-etal-2021-neural}
-coarse-to-fine approach, which first generates coarse coreference clusters and then refines them. This method allows the model to handle the complexity of coreference resolution by breaking it down into more manageable steps.
-Two primary neural network models were developed: the base model and the large model. The large model uses the ELECTRA-large model for contextual embeddings, while the base model uses the ELECTRA-base model.
-Data Preprocessing
+\begin{itemize}
+  \item \textbf{Multi-Head Self-Attention Layer}: This sub-layer allows the model to attend to different parts of the sequence from multiple perspectives or "heads." Each head performs self-attention independently, and their outputs are concatenated and linearly transformed to provide a richer representation.
-% The models were trained on multiple datasets, including SemEval-2010, TüBa-D\\/Z, OntoNotes 5.0, and the DROC dataset. These datasets provide a diverse range of documents, which helps in training robust coreference resolution models.
-Special attention was given to handling singletons, which are mentions that do not corefer with any other mention in the document. A discard functionality was introduced to manage singletons effectively.
-Training and Evaluation:
+  \item \textbf{Feed-Forward Layer}: After the self-attention sub-layer, each token's representation is passed through a feed-forward neural network. This layer is a simple fully connected feed-forward network applied to each position (word) in the sequence independently and identically. It consists of two linear transformations with a ReLU activation in between, allowing the model to apply non-linear transformations and further refine the encoded representation.
+\end{itemize}
-The models were trained using a variety of loss functions and optimization techniques to ensure convergence and high performance.
-The performance was measured using the CoNLL-F1 score, which is a standard metric for coreference resolution tasks.
-Results
-Performance
+Both sub-layers in the encoder block are followed by residual connections and layer normalization, which help in stabilizing the training and improving convergence. A minimal sketch of the attention computation is given below.
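+The following is a minimal NumPy sketch of single-head scaled dot-product self-attention; the projection matrices \texttt{W\_q}, \texttt{W\_k}, \texttt{W\_v} stand in for the learned parameters, and multi-head attention, residual connections, and layer normalization are omitted:
+\begin{verbatim}
+import numpy as np
+
+def softmax(x):
+    e = np.exp(x - x.max(axis=-1, keepdims=True))
+    return e / e.sum(axis=-1, keepdims=True)
+
+def self_attention(X, W_q, W_k, W_v):
+    # X: (seq_len, d_model) token embeddings; W_*: learned projections
+    Q, K, V = X @ W_q, X @ W_k, X @ W_v
+    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise relevance scores
+    weights = softmax(scores)                 # one distribution per token
+    return weights @ V                        # context-aware representations
+\end{verbatim}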
-% The coarse-to-fine models significantly outperformed previous state-of-the-art systems on both the SemEval-2010 and TüBa-D/Z datasets. The improvements were substantial, with the model achieving an increase of +25.85 F1 on SemEval-2010 and +30.25 F1 on TüBa-D\\/Z.
-% Even when compared to systems using gold mentions, which are mentions manually annotated in the dataset, the models still showed a performance increase of more than 10 F1 points.
-% Impact of Model Variations
+\subsection{Decoder}
+The decoder works quite similarly to the encoder and can also be used for the same tasks, albeit with some loss of performance. It likewise consists of multiple decoder blocks, but each block has one additional sub-layer (the cross-attention layer) compared to the encoder block. In the Transformer architecture the decoder's role is to generate the output sequence based on the encoded representation from the encoder (cross-attention). This is done auto-regressively: the computed feature vector, which holds information about the input sequence, is mapped by the language modelling head to the most probable next word, which is then appended to the input text and fed back into the decoder. The most important difference to the encoder is the masked multi-head self-attention.
-% The use of the ELECTRA-large model for contextual embeddings provided a small but notable improvement over the base model, with an increase of +1.58 F1 on TüBa-D\\/Z and +1.92 F1 on SemEval-2010.
-% Different configurations and model variations were tested to analyze their impact on performance. It was found that models including a discard functionality for singletons performed better.
-% Error Analysis
+\begin{itemize}
+  \item \textbf{Masked Multi-Head Self-Attention Layer}:
+  Since the decoder cannot predict future words based on information not yet generated, it attends uni-directionally to the previously generated tokens in the output sequence. Therefore only the left context (for left-to-right text) is used and the right context is masked, as sketched below.
+\end{itemize}
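+As a rough illustration of this masking (a sketch, not the exact implementation), the future positions can be blocked by adding $-\infty$ to the attention scores before the softmax:
+\begin{verbatim}
+import numpy as np
+
+def causal_mask(seq_len):
+    # entries above the diagonal correspond to "future" tokens
+    future = np.triu(np.ones((seq_len, seq_len)), k=1)
+    return np.where(future == 1, -np.inf, 0.0)
+
+# added to the attention scores before the softmax, this drives the
+# attention weights for the right context to zero, so each token can
+# only attend to its left context
+\end{verbatim}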
-The error analysis indicated that the coarse-to-fine model generally produced accurate coreference links both locally and document-wide. However, there were frequent errors related to missed and added mentions. These errors were attributed to inconsistent training signals and the inherent complexity of coreference tasks.
-The analysis also highlighted that the model’s performance decreases as the document length increases, which aligns with previous findings in coreference resolution research.
-Visualizations and Examples:
-The paper includes visualizations and specific examples to demonstrate the model’s predictions on unseen documents. These examples show how the model accurately predicts coreference relationships in complex sentences, validating its effectiveness in practical scenarios.
-Overall, the methods and results presented in the paper highlight the significant advancements made in coreference resolution through the use of coarse-to-fine neural network models. The study provides a comprehensive evaluation of these models, demonstrating their superiority over existing systems .
-\subsection{Project Gutenberg}
-Project Gutenberg, founded in 1971 by Michael S. Hart, is one of the oldest and most extensive digital libraries, aimed at providing free access to a vast collection of over 60,000 eBooks. Hart's initiative began with the digitization of the United States Declaration of Independence, setting the stage for the project's goal of democratizing access to literature and cultural works. Named after Johannes Gutenberg, the inventor of the printing press, Project Gutenberg echoes his mission of making written works widely accessible. The Project Gutenberg Literary Archive Foundation, a non-profit organization, oversees the project's administration, legal issues, and fundraising efforts.
-\section{RAG}
-In contrast to their approach for Character Description Generation which required modeling long-range dependencies, I am using Retrieval-augmented generation (RAG), which is a technique to improve the quality of LLM-generated responses by grounding the model on external sources. LLMs are inconsistent in terms of producing same quality responses for each and every topic, since they knowledge is based on finite amount of information, that isn't equally distributed for every potential topic. But Retrieval-augmented generation doesn't only reduce the need for internal sources (continuous training, lowering computational and financial costs) but also ensures that the model has access to the most current, reliable facts.
-In this thesis I am primarily focusing on getting those important properties and behavior (key features) from the characters described in the literature to achieve better characterizations with grounded models that utilize this external information.
+
+\section{BERT}
+BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model introduced by Devlin et al. in 2019 (ref). It is based on the Transformer architecture from\dots, but in contrast to the original Transformer, which uses both an encoder and a decoder, BERT only utilizes the encoder component. Consequently, unlike other large language models (LLMs), BERT cannot predict new tokens and thus is not suitable for text generation. Nevertheless, it achieves state-of-the-art results in tasks such as text classification, sentiment analysis, and named entity recognition. The attention scores are computed using queries, keys, and values derived from the input embeddings.
+
+\subsection{Embeddings}
+The three embedding matrices in BERT (token embeddings, segment embeddings, and positional embeddings) are learned as part of the model's training process.
+
+For each unique Token ID (i.e.
for each of the 30,522 words and subwords in the BERT Tokenizer’s vocabulary), the BERT model contains an embedding that is trained to represent that specific token. The Embedding Layer within the model is responsible for mapping tokens to their corresponding embeddings. + +Before a string of text is passed to the BERT model, the BERT Tokenizer is used to convert the input from a string into a list of integer Token IDs, where each ID directly maps to a word or part of a word in the original string. In addition to the Token Embeddings described so far, BERT also relies on Position Embeddings. While Token Embeddings are used to represent each possible word or subword that can be provided to the model, Position Embeddings represent the position of each token in the input sequence. + +The final type of embedding used by BERT is the Token Type Embedding, also called the Segment Embedding in the original BERT Paper. One of the tasks that BERT was originally trained to solve was Next Sentence Prediction. That is, given two sentences A and B, BERT was trained to determine whether B logically follows A.\\ + +BERT introduces two pre-training objectives, the masked language model objective (MLM), and the next sentence prediction objective (NSP). + + +\begin{itemize} + \item \textbf{Masked Language Modeling (MLM)}: + 15\% of the words in a sentence are randomly masked, and the model is trained to predict these masked words based on the context provided by the other words in the sentence. This enables BERT to learn bidirectional representations. + + \item \textbf{Next Sentence Prediction (NSP)}: + To understand relationships between sentences, BERT is trained on pairs of sentences. Given two sentences, the model predicts whether the second sentence is the actual next sentence in the original text or a randomly chosen one. This task helps BERT capture the coherence and context between sentences. +\end{itemize} + + +\subsection{Fine-Tuning} +After pre-training on large text corpora, BERT can be fine-tuned on specific downstream tasks with relatively small amounts of data. Fine-tuning involves adjusting the pre-trained model weights slightly to better fit the target task. This approach leverages the robust pre-trained language representations and adapts them to the specific requirements of the task at hand. + + + + +\subsection{BERTScore} +BERTScore is an evaluation metric that utilizes the BERT model to compare texts more semantically than traditional metrics like BLEU. It leverages the contextualized embeddings provided by a pre-trained BERT model to assess the similarity between candidate and reference texts.\\ + +The process begins by inputting both candidate and reference texts into the BERT model, which generates contextualized embeddings for each token in both texts. 
For each token, the similarity between its embedding and every token embedding in the comparison text is calculated using cosine similarity:
+\begin{equation}
+  \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} \mathbf{A}_{i} \mathbf{B}_{i} }{\sqrt{\sum_{i=1}^{n} \mathbf{A}_{i}^{2}} \cdot \sqrt{\sum_{i=1}^{n} \mathbf{B}_{i}^{2}} }
+\end{equation}
+This results in a similarity matrix where each entry represents the cosine similarity between the embeddings of a pair of tokens (one from the candidate sentence and one from the reference sentence).\\
-\section{Query generation}
-There are multiple approaches to consider when selecting the optimal prompts and information for a language model to generate high-quality summaries or characterizations. While rephrasing a task for a language model can influence the prompt's outcome to some degree, this study will primarily focus on selecting and curating the relevant literature data, which will be appended to the task of the prompt.\\
+The metric is computed symmetrically as follows:\\
-One method to obtain the necessary information is to filter the text for sentences containing certain keywords. However, simply finding all sentences that mention a character's name is insufficient for a comprehensive description. Critical information about the character may be present in sentences that do not explicitly mention their name but refer to them indirectly. Consequently, important details can be missed using this technique and also additionally, this approach might include too much unnecessary information, especially for main characters, making it unsuitable for zero-shot character summary generation.\\
+For each token embedding in the candidate sentence, find the maximum similarity score with any token embedding in the reference sentence, and average these scores across all tokens in the candidate sentence to obtain precision.\\
-(https://huggingface.co/intfloat/e5-mistral-7b-instruct) expand...
-To improve upon this, text embeddings can be utilized. For instance, using a model like E5-Mistral-7B-Instruct, which has 32 layers and an embedding size of 4096, we can chunk the literature into sections of roughly equal length and embed each chunk. This allows us to identify chunks that satisfy a certain premise, such as describing a particular character more accurately than others.\\
+Similarly, for each token embedding in the reference sentence, find the maximum similarity score with any token embedding in the candidate sentence, and average these scores across all tokens in the reference sentence to obtain recall.
-Further improvements can be achieved by applying coreference resolution techniques (\cite{schroder-etal-2021-neural}, \cite{dobrovolskii-2021-word}) to identify all tokens that refer to the given entity. This helps in gathering more sentences relevant to the characters context.\\
+\[P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j\in \hat{x}} \max_{x_i \in x} x_i^T \hat{x_j} \]
+\[R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j\in \hat{x}} x_i^T \hat{x_j} \]
-If it is possible to identify self-contained content scopes using coreference resolution and segmenting the content by highly self-referenced text passages, the language model can generate even better character profiles due to the additional relevant information.\\
+Finally, the $F_1$-score (an $F$-measure) is computed as the harmonic mean of precision and recall, providing a balanced measure that weights the model's ability to capture relevant information and the accuracy of the generated text equally.
+
+\[F_{BERT} = 2\frac{P_{BERT}R_{BERT}}{P_{BERT} + R_{BERT}} \]
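+The following is a simplified sketch of this computation; the token embeddings are assumed to come from a BERT-like encoder, and the importance weighting and baseline rescaling of the original BERTScore are omitted:
+\begin{verbatim}
+import numpy as np
+
+def bert_score(cand_emb, ref_emb):
+    # cand_emb: (m, d), ref_emb: (n, d) contextual token embeddings
+    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
+    ref  = ref_emb  / np.linalg.norm(ref_emb,  axis=1, keepdims=True)
+    sim = cand @ ref.T                   # cosine similarity matrix (m x n)
+    precision = sim.max(axis=1).mean()   # best reference match per candidate token
+    recall    = sim.max(axis=0).mean()   # best candidate match per reference token
+    f1 = 2 * precision * recall / (precision + recall)
+    return precision, recall, f1
+\end{verbatim}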
+\section{BLEU-Score}
+
+The BLEU-Score is another metric I use in my thesis for comparing texts. BLEU does not evaluate the semantics of the reference and candidate text; instead it compares the overlap of their vocabulary.
+
+Let $\left\{y^{1}, y^{2}, \dots, y^{N}\right\}$ be the words of the reference text and $\left\{\hat{y}^{1}, \hat{y}^{2}, \dots, \hat{y}^{M}\right\}$ the words of the candidate text.
+
+The first step is to create the n-grams $\text{G}_n(y)$ for both texts. An n-gram is simply a sequence of $n$ consecutive words in a text, and $\text{G}_n(y)$ denotes the collection of all n-grams of $y$:
+
+\[
+  \text{G}_n(y) = \left\{(y^{i}, \dots, y^{i+n-1}) \mid 1 \leq i \leq N-n+1\right\}
+\]
+
+Next we define the function $\text{C}(s,y)$ that counts the appearances of $s$ as a substring in $y$.
+Now we can count the n-grams of the candidate that appear in the reference text. We compute the clipped precision by taking the minimum of the number of appearances of each n-gram in $y$ and $\hat{y}$ and dividing by the number of all occurrences of n-grams in $\hat{y}$. Therefore candidates that repeat the same n-gram over and over again do not get a higher precision score unless that n-gram appears in the reference text equally often.
+
+\[
+  \text{p}_n(\hat{y} , y) = \frac{\sum_{s \in G_n(\hat{y})} \min(\text{C}(s,\hat{y}), \text{C}(s,y))}{\sum_{s \in G_n(\hat{y})} \text{C}(s,\hat{y})}
+\]
+
+So far, short candidate texts are likely to get a good score even when the reference text is much longer. Therefore we add a brevity penalty in order to give higher scores to candidates whose length is close to (or exceeds) the length of the reference text.
+\[
+  \text{BP}(c, r) = \left\{\begin{array}{lr}
+    1, & \text{if } c > r \\
+    e^{(1 - r/c)}, & \text{if } c \leq r \\
+  \end{array}\right.
+\]
+
+Finally, for the BLEU-Score we combine the brevity penalty with the clipped n-gram precisions. We additionally add a weight vector that weighs each $\text{p}_n$ by $w_n$, which makes it possible to give n-grams of different lengths $n$ a different impact on the overall result. In practice, most BLEU implementations simply use a uniform distribution with $N = 4$, so that $w_n$ is always $\frac{1}{4}$.
+
+\[\text{BLEU} = \text{BP}(c, r) \cdot \exp\left(\sum_{n=1}^{N} \text{w}_n \cdot \ln(p_n)\right)\]
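+A compact sketch of this computation for a single reference, with uniform weights and without the smoothing used by common implementations, could look as follows:
+\begin{verbatim}
+import math
+from collections import Counter
+
+def ngrams(tokens, n):
+    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
+
+def bleu(candidate, reference, N=4):
+    log_p = []
+    for n in range(1, N + 1):
+        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
+        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
+        total = sum(cand.values())
+        log_p.append(math.log(clipped / total) if clipped else float("-inf"))
+    c, r = len(candidate), len(reference)
+    bp = 1.0 if c > r else math.exp(1 - r / c)         # brevity penalty
+    return bp * math.exp(sum(lp / N for lp in log_p))  # uniform weights w_n = 1/N
+
+# e.g. bleu("the cat sat".split(), "the cat sat on the mat".split())
+\end{verbatim}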
+\section{RAG}
+%In contrast to their approach for Character Description Generation which required modeling long-range dependencies, I am using
+
-Another approach to consider is fine-tuning a language model like LLAMA to enhance text summarization results.\
+Retrieval-augmented generation (RAG) is a technique used to improve the quality of LLM-generated responses by grounding the model on external sources. LLMs are inconsistent in producing responses of the same quality for each and every topic, since their knowledge is based on a finite amount of information that is not equally distributed across every potential topic. Retrieval-augmented generation not only reduces the reliance on the model's internal knowledge (and thus the need for continuous training, lowering computational and financial costs) but also ensures that the model has access to the most current, reliable facts. It therefore reduces the chances of generating false information and ensures that the generated content is relevant.
+In this thesis I am utilizing RAG to extract the important properties and behavior (key features) of the characters described in the literature, hoping to achieve better characterizations with grounded models that utilize this external information; a minimal sketch of this grounding step is shown below.
-The generated characterizations can be evaluated both qualitatively and quantitatively. To compare human-written characterizations with those generated by the model, we can measure recall and precision using metrics like ROUGE and BLEU, and employ BERTScore as a semantic evaluation metric.\\
-Since language models are typically trained on extensive data, they might already contain information about certain books. To test this, we can compare queries that include key sentences to those that omit them. If the model produces the same output despite the missing key information, it suggests prior training on that data. Additionally, using books released after the model's training period ensures no pre-existing knowledge about the characters.\\
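+As a minimal sketch of how such grounding could look in this setting (the function name and prompt wording are illustrative assumptions, not a fixed implementation), the retrieved passages are simply placed in front of the task:
+\begin{verbatim}
+def build_grounded_prompt(character, retrieved_chunks):
+    # retrieved_chunks: text passages about the character, selected beforehand
+    context = "\n\n".join(retrieved_chunks)
+    return (f"Using only the following excerpts from the novel:\n{context}\n\n"
+            f"Write a characterization of {character}.")
+\end{verbatim}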
+Benefits of RAG:
-Existing human-written characterizations will serve as benchmarks for assessing the model's output in terms of style, content, structure, and level of detail.
\ No newline at end of file
+\begin{itemize}
+  \item \textbf{Accuracy}: By grounding generative outputs in real data, RAG models can significantly reduce the chances of generating false or misleading information.
+  \item \textbf{Contextual Relevance}: The retrieval step ensures that the generated content is relevant to the user's query or context.
+  \item \textbf{Flexibility}: Combines the benefits of retrieval-based and generative approaches, making it adaptable to various applications.
+  \item \textbf{Efficiency}: Can dynamically incorporate new information without the need for retraining the generative model, allowing for real-time updates and adaptability.
+\end{itemize}
\ No newline at end of file
diff --git a/related_work.tex b/related_work.tex
index e3f6f1a9e5d8936526537b568707715e8b31c505..32eaad0639069bae60abd4b3dddea4ee030d3f44 100644
--- a/related_work.tex
+++ b/related_work.tex
@@ -1,142 +1,93 @@
 \chapter{Related Work}
-\section{Tokenization}
-Tokens are the fundamental units of data processing in natural language processing (NLP). A token is the smallest meaningful unit of text, which can be a word, subword, or even a single character or punctuation mark. Tokenization is typically performed at one of three levels: single characters (character-based tokenization), subwords (subword-based tokenization), or whole words (word-based tokenization).
+There has already been a decent number of approaches to automatic text summarization.\\
+One of the oldest and most cited papers, from 2002, is ``Automatic Text Summarization Using a Machine Learning Approach'' by \cite{10.1007/3-540-36127-8_20}. It describes a summarization procedure based on a naive Bayes classifier and a C4.5 decision tree with different compression rates. The results show that using the Naive Bayes classifier with a higher compression rate yields better precision and recall.
-In most modern NLP models, subword tokenization is predominantly used. This technique breaks words into smaller units, such as prefixes and suffixes. Unlike word-based tokenizers, which generate a very large vocabulary and suffer from a loss of meaning across very similar words as well as a large quantity of out-of-vocabulary tokens, or character-based tokenization, where each token has minimal meaning in context and the overall number of tokens on a tokinzed text is enormous, subword-based tokenization seeks to find a middle ground. The idea is to decompose rare words into meaningful subwords while maintaining few to single tokens for every meaningful or frequently used word.
+Creating a characterization is quite similar to summarizing character-related content, but it can also include deductions made from the behavior of that character.
+A recent paper from 2021 \cite{brahman-etal-2021-characters-tell} presents a dataset called LiSCU (Literary Summaries with Character Understanding) that aims to facilitate research in character-centric narrative understanding. They used techniques for Character Identification, where the goal is to identify a character's name from an anonymized description, and Character Description Generation, which involves generating a description for a given character based on a literature summary.
-Subword tokenizers are employed in almost every widely-used large language model (LLM) such as GPT-2, Llama 3, and in large pre-trained language models like BERT.
+Because a summary might exceed model input limits, they compared two truncation strategies:
+\begin{itemize}
+  \item \textbf{Length Truncation}: simply truncating the summary at the end.
+  \item \textbf{Coreference Truncation}: using SpanBERT to identify sentences in the summary that mention the character, focusing on these sentences.
+\end{itemize}
-% https://huggingface.co/docs/transformers/en/tokenizer_summary
-\section{The Transformer}
-The Transformer architecture, introduced in June 2017, marked a significant advancement in natural language processing (NLP), initially focusing on sequence-to-sequence NLP problems like machine translation tasks. However, its capabilities quickly revealed a broader potential, particularly in developing large language models (LLMs). These models are trained on vast amounts of raw text using self-supervised learning, a method where the training objective is derived automatically from the input data. After that the model developed a statistically understanding of the language but still needs to be improved by e.g. masked language-modeling or causal language modeling. The Tranformer consists of a encoder and a decoder.
-% https://arxiv.org/abs/1706.03762
+They experimented with three models:
+\begin{itemize}
+  \item \textbf{GPT-2}: with a maximum input length of 1024 tokens.
+  \item \textbf{BART} (Bidirectional and Auto-Regressive Transformers): extended to accept up to 2048 tokens.
+  \item \textbf{Longformer}: leveraged for its efficient encoding mechanism to handle long texts, allowing inputs up to 16,384 tokens when using the full text of books.
+\end{itemize}
-\begin{figure}[h]
-
-  \centering
-  \includegraphics[width=8cm]{ressources/images/Transformer.png}
-  \caption{transformer architecture from the original paper}
-  \end{figure}
-\subsection{Encoder}
-The encoder takes an input sequence, and breaks it down into individual tokens (words or sub-words).
-For each token an embedding vector is computed, which is a numerical representation of that token, capturing its semantic meaning.
+They used BLEU-4, ROUGE-n (n=1, 2), ROUGE-L F-1 scores, and BERTScore to measure similarity and quality.
-A key component of the encoder is the self-attention mechanism.
Self-attention enables the model to consider the entire sequence when encoding each token, allowing it to weigh the relevance of other tokens in the input sequence dynamically. For each token, the self-attention mechanism computes attention scores that determine the influence of all other tokens in the sequence. So the generated embedded vector for each token does not only represent the token alone but also its left and right contextual influence.
+Their models performed better with length truncation.
-The encoder consists of multiple identical layers, or encoder blocks. Each encoder block contains two main sub-layers:
+Errors in coreference resolution impacted the performance of the coreference truncation approach.
-\begin{itemize}
-  \item \textbf{Multi-Head Self-Attention Layer}: This sub-layer allows the model to attend to different parts of the sequence from multiple perspectives or "heads." Each head performs self-attention independently, and their outputs are concatenated and linearly transformed to provide a richer representation.
-  \item \textbf{Feed-Forward Layer}: After the self-attention sub-layer, each token's representation is passed through a feed-forward neural network. This layer is a simple fully connected feed-forward network applied to each position (word) in the sequence independently and identically. It consists of two linear transformations with a ReLU activation in between, allowing the model to apply non-linear transformations and further refine the encoded representation.
-\end{itemize}
-Both sub-layers in the encoder block are followed by residual connections and layer normalization, which help in stabilizing the training and improving convergence.
+\cite*{schroder-etal-2021-neural} present a coarse-to-fine approach to coreference resolution, which first generates coarse coreference clusters and then refines them. This method allows the model to handle the complexity of coreference resolution by breaking it down into more manageable steps.
+Two primary neural network models were developed: the base model and the large model. The large model uses the ELECTRA-large model for contextual embeddings, while the base model uses the ELECTRA-base model.
-\subsection{Decoder}
-The decoder works quiet similar to the encoder and can be also be used for same tasks but with respect to loss of performance. It also uses multiple decoder blocks, similar to the encoder but has two additional sub-layers per block as compared to the encoder block. In the transformer's architecture the decoder's role is to generate the output sequence based on the encoded representation from the encoder (cross-attention). This is done auto-regressively, which means that the generated computed feature-vector, which holds information about the input sequence will be tranformed by the language modelling head mapping into the next probable following word, which then will be added to the input text and then get feeded back into the decoder. The most important difference to the encoder is the masked multi-head self-attention.
+% The models were trained on multiple datasets, including SemEval-2010, TüBa-D\\/Z, OntoNotes 5.0, and the DROC dataset. These datasets provide a diverse range of documents, which helps in training robust coreference resolution models.
+During data preprocessing, special attention was given to handling singletons, which are mentions that do not corefer with any other mention in the document. A discard functionality was introduced to manage singletons effectively.
-\begin{itemize}
-  \item \textbf{Masked Multi-Head Self-Attention Layer}:
-  Since the decoder cannot predict future words based on information not yet generated, it only attends uni-directional to the previously generated tokens in the output sequence. Therfor only the left context (for "LTR" text) is used and the right context is masked.
-\end{itemize}
+For training and evaluation, the models used a variety of loss functions and optimization techniques to ensure convergence and high performance.
+The performance was measured using the CoNLL-F1 score, which is a standard metric for coreference resolution tasks.
+% Results
+% Performance
+% The coarse-to-fine models significantly outperformed previous state-of-the-art systems on both the SemEval-2010 and TüBa-D/Z datasets. The improvements were substantial, with the model achieving an increase of +25.85 F1 on SemEval-2010 and +30.25 F1 on TüBa-D\\/Z.
+% Even when compared to systems using gold mentions, which are mentions manually annotated in the dataset, the models still showed a performance increase of more than 10 F1 points.
+% Impact of Model Variations
+% The use of the ELECTRA-large model for contextual embeddings provided a small but notable improvement over the base model, with an increase of +1.58 F1 on TüBa-D\\/Z and +1.92 F1 on SemEval-2010.
+% Different configurations and model variations were tested to analyze their impact on performance. It was found that models including a discard functionality for singletons performed better.
+% Error Analysis
+The error analysis indicated that the coarse-to-fine model generally produced accurate coreference links both locally and document-wide. However, there were frequent errors related to missed and added mentions. These errors were attributed to inconsistent training signals and the inherent complexity of coreference tasks.
+The analysis also highlighted that the model’s performance decreases as the document length increases, which aligns with previous findings in coreference resolution research.
+The paper also includes visualizations and specific examples to demonstrate the model’s predictions on unseen documents. These examples show how the model accurately predicts coreference relationships in complex sentences, validating its effectiveness in practical scenarios.
+Overall, the methods and results presented in the paper highlight the significant advancements made in coreference resolution through the use of coarse-to-fine neural network models. The study provides a comprehensive evaluation of these models, demonstrating their superiority over existing systems.
-\section{BERT}
-BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model introduced by Devlin et al. in 2019 (ref). Its based on the Transformer architecture from\dots but instead of using in contrast to using both, an encoder and a decoder as in the original transformer, BERT only utilizes the encoder component. Consequently, unlike other large language models (LLMs), BERT cannot predict new tokens and thus is not suitable for text generation. Instead, it still achieves state-of-the-art results in tasks such as text classification, sentiment analysis, and named entity recognition. The attention scores are computed using queries, keys, and values derived from the input embeddings.
+\subsection{Project Gutenberg}
+Project Gutenberg, founded in 1971 by Michael S.
Hart, is one of the oldest and most extensive digital libraries, aimed at providing free access to a vast collection of over 60,000 eBooks. Hart's initiative began with the digitization of the United States Declaration of Independence, setting the stage for the project's goal of democratizing access to literature and cultural works. Named after Johannes Gutenberg, the inventor of the printing press, Project Gutenberg echoes his mission of making written works widely accessible. The Project Gutenberg Literary Archive Foundation, a non-profit organization, oversees the project's administration, legal issues, and fundraising efforts.
-\subsection{Embeddings}
-The three matrices in BERT—token embeddings, segment embeddings, and positional embeddings are generated as part of the model's training process.
-For each unique Token ID (i.e. for each of the 30,522 words and subwords in the BERT Tokenizer’s vocabulary), the BERT model contains an embedding that is trained to represent that specific token. The Embedding Layer within the model is responsible for mapping tokens to their corresponding embeddings.
-Before a string of text is passed to the BERT model, the BERT Tokenizer is used to convert the input from a string into a list of integer Token IDs, where each ID directly maps to a word or part of a word in the original string. In addition to the Token Embeddings described so far, BERT also relies on Position Embeddings. While Token Embeddings are used to represent each possible word or subword that can be provided to the model, Position Embeddings represent the position of each token in the input sequence.
-The final type of embedding used by BERT is the Token Type Embedding, also called the Segment Embedding in the original BERT Paper. One of the tasks that BERT was originally trained to solve was Next Sentence Prediction. That is, given two sentences A and B, BERT was trained to determine whether B logically follows A.\\
+\section{Query generation}
+There are multiple approaches to consider when selecting the optimal prompts and information for a language model to generate high-quality summaries or characterizations. While rephrasing a task for a language model can influence the prompt's outcome to some degree, this study will primarily focus on selecting and curating the relevant literature data, which will be appended to the task of the prompt.\\
-BERT introduces two pre-training objectives, the masked language model objective (MLM), and the next sentence prediction objective (NSP).
+One method to obtain the necessary information is to filter the text for sentences containing certain keywords. However, simply finding all sentences that mention a character's name is insufficient for a comprehensive description. Critical information about the character may be present in sentences that do not explicitly mention their name but refer to them indirectly. Consequently, important details can be missed using this technique. Additionally, this approach might include too much unnecessary information, especially for main characters, making it unsuitable for zero-shot character summary generation.\\
+(\url{https://huggingface.co/intfloat/e5-mistral-7b-instruct}) expand...
+To improve upon this, text embeddings can be utilized. For instance, using a model like E5-Mistral-7B-Instruct, which has 32 layers and an embedding size of 4096, we can chunk the literature into sections of roughly equal length and embed each chunk. This allows us to identify chunks that satisfy a certain premise, such as describing a particular character more accurately than others. A minimal sketch of this retrieval step is shown below.\\
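+The following sketch outlines this idea; the embedding call is abstracted behind \texttt{embed}, a placeholder for e.g. the E5-Mistral-7B-Instruct encoder, and the chunk size, query wording, and scoring are illustrative choices rather than the final setup:
+\begin{verbatim}
+import numpy as np
+
+def top_k_chunks(book_text, query, embed, chunk_size=200, k=5):
+    # split the literature into chunks of roughly equal length (in words)
+    words = book_text.split()
+    chunks = [" ".join(words[i:i + chunk_size])
+              for i in range(0, len(words), chunk_size)]
+    # embed the query (e.g. "Describe the character X") and every chunk
+    q = embed(query)
+    C = np.stack([embed(c) for c in chunks])
+    scores = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q))
+    best = np.argsort(scores)[::-1][:k]   # highest cosine similarity first
+    return [chunks[i] for i in best]
+\end{verbatim}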
-\begin{itemize}
-  \item \textbf{Masked Language Modeling (MLM)}:
-  15\% of the words in a sentence are randomly masked, and the model is trained to predict these masked words based on the context provided by the other words in the sentence. This enables BERT to learn bidirectional representations.
+Further improvements can be achieved by applying coreference resolution techniques (\cite{schroder-etal-2021-neural}, \cite{dobrovolskii-2021-word}) to identify all tokens that refer to the given entity. This helps in gathering more sentences relevant to the character's context.\\
-  \item \textbf{Next Sentence Prediction (NSP)}:
-  To understand relationships between sentences, BERT is trained on pairs of sentences. Given two sentences, the model predicts whether the second sentence is the actual next sentence in the original text or a randomly chosen one. This task helps BERT capture the coherence and context between sentences.
-\end{itemize}
+If it is possible to identify self-contained content scopes using coreference resolution and segmenting the content by highly self-referenced text passages, the language model can generate even better character profiles due to the additional relevant information.\\
-\subsection{Fine-Tuning}
-After pre-training on large text corpora, BERT can be fine-tuned on specific downstream tasks with relatively small amounts of data. Fine-tuning involves adjusting the pre-trained model weights slightly to better fit the target task. This approach leverages the robust pre-trained language representations and adapts them to the specific requirements of the task at hand.
+Another approach to consider is fine-tuning a language model like LLAMA to enhance text summarization results.\\
+The generated characterizations can be evaluated both qualitatively and quantitatively. To compare human-written characterizations with those generated by the model, we can measure recall and precision using metrics like ROUGE and BLEU, and employ BERTScore as a semantic evaluation metric.\\
+Since language models are typically trained on extensive data, they might already contain information about certain books. To test this, we can compare queries that include key sentences to those that omit them. If the model produces the same output despite the missing key information, it suggests prior training on that data. Additionally, using books released after the model's training period ensures no pre-existing knowledge about the characters.\\
-\subsection{BERTScore}
-BERTScore is an evaluation metric that utilizes the BERT model to compare texts more semantically than traditional metrics like BLEU. It leverages the contextualized embeddings provided by a pre-trained BERT model to assess the similarity between candidate and reference texts.\\
+Existing human-written characterizations will serve as benchmarks for assessing the model's output in terms of style, content, structure, and level of detail.
-The process begins by inputting both candidate and reference texts into the BERT model, which generates contextualized embeddings for each token in both texts.
For each token, the similarity between its embedding and every token embedding in the comparison text is calculated using cosine similarity -\begin{equation} - \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} \mathbf{A}_{i} \mathbf{B}_{i} }{\sqrt{\sum_{i=1}^{n} \mathbf{A}_{i}} \cdot \sqrt{\sum_{i=1}^{n} \mathbf{B}_{i}} } -\end{equation} -This results in a similarity matrix where each entry represents the cosine similarity between the embeddings of a pair of tokens (one from the candidate sentence and one from the reference sentence).\\ - - -The metric is computed symmetrically as follows:\\ - -For each token embedding in the candidate sentence, find the maximum similarity score with any token embedding in the reference sentence, and average these scores across all tokens in the candidate sentence to obtain precision.\\ - -Similarly, for each token embedding in the reference sentence, find the maximum similarity score with any token embedding in the candidate sentence, and average these scores across all tokens in the reference sentence to obtain recall. - -\[P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j\in \hat{x}} \max_{x_i \in x} x_i^T \hat{x_j} \] -\[R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j\in \hat{x}} x_i^T \hat{x_j} \] - - - -Finally the $F_1$-score (an $F$-measure) -is computed as the harmonic mean of precision and recall and is providing a balanced measure that considers both the model's ability to capture relevant information and its accuracy in predicting new text equally. - -\[F_{BERT} = 2\frac{P_{BERT}R_{BERT}}{P_{BERT} + R_{BERT}} \] - -\section{BLEU-Score} - -BLEU-Score is a different metric I use in my thesis for comparing texts. BLEU is not evaluating and comparing the semantic of the reference and candidate text but instead comparing similarity of vocabulary between them. - -Let $\left\{y^{1}, y^{2}, ..., y^{N}\right\}$ be the words of the reference text and $\left\{\hat{y}^{1}, \hat{y}^{2}, ..., \hat{y}^{N}\right\}$ - - -The first step is to create n-grams $\text{G}_n(y)$ for both texts. An n-gram is just a set of consecutive words of length n in a text. - -\[ - \text{G}_n(y) = \left\{y_1, y_2, ..., y_k\right\} -\] - -Next we define the function $\text{C}(s,y)$ that counts the appearances of s as a substring in y. -Now we can count n-grams of the candidate that appear in the reference text. We can compute the clipped precision by taking the minimum of the appearances of the n-gram in $y$ and $\hat{y}$ and then dividing by the amount of all occurences of n-grams in $\hat{y}$. Therefor candidates that have the same n-gram repeating over and over again don't get a higher precision score if the same n-gram does not appear in the reference text the same amount. - -\[ - \text{p}_n(\hat{y} , y) = \frac{\sum_{s \in G_n(\hat{y})} \min(\text{C}(s,\hat{y}), \text{C}(s,y))}{\sum_{s \in G_n(\hat{y})} \text{C}(s,\hat{y})} -\] - - -Right now short candidate texts are more likely to get a good score although the reference text is much longer. Therefor we add a brevity penalty in order to give higher scores to texts that are closer or even longer to the reference texts real size. -\[ - \text{BP}(c, r) = \left\{\begin{array}{lr} - 1, & \text{if } c > r \\ - \ e^{(1 - r/c)}, & \text{if } c \leq r \\ - \end{array}\right\} -\] - -Finally for BLEU-Score we combine the brevity penalty with the clipped precision of n-grams. 
We additionally add a distribution vector to weigh each $ \text{p}_n$ by $w_n$ in order to have the opportunity to give n-grams with different $n$ also a different impact on the overall result. Although in the end most BLEU-Scores just use a uniform distribution with $N = 4$ so that $w_n$ always stays $\frac{1}{4}$
-\[\text{BLEU} = \text{BP}(c, r) \cdot \exp\left(\sum_{n=1}^{N} \text{w}_n \cdot \ln(p_n)\right)\]
+TODO
+Compare the results and find out whether the models already possess information about the books?
+Maybe quickly find books that no model knows and then check whether the models still deliver good results
+Fine-tuning? Other evaluation metrics?
+Compare the results more thoroughly (new graph) and then add a new section on the overall improvement
+Explain the graphs (books that performed worse, or characters that performed worse, qualitatively)