State-of-the-art data extraction for LLMs
- convert documents to text in a simple API call.
- works absolutely everywhere: server, browser, CLI.
- no dependencies, no Dockerfiles, no deployment.
- get started for free.
New users start with 250 free credits.
Created and used by world-class teams
Turn your documents into data your LLM can understand.
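As a rough illustration of what "a simple API call" can look like from a server-side script, here is a minimal Python sketch. The endpoint URL, the `file` form field, and the `DOCPARSE_API_KEY` environment variable are illustrative placeholders, not the documented API; check the product docs for the real names.

```python
import os
import requests

# Hypothetical endpoint and auth scheme -- adjust to the real API documentation.
API_URL = "https://api.example.com/v1/convert"
API_KEY = os.environ["DOCPARSE_API_KEY"]

def convert_pdf(path: str) -> dict:
    """Upload a PDF and return the extracted items as JSON."""
    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": (os.path.basename(path), f, "application/pdf")},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    result = convert_pdf("gpt_radford.pdf")
    print(f"{len(result.get('items', []))} items extracted")
```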

Example output: gpt_radford.pdf (application/pdf) converted into an array of 76 extracted items. Each item is an object with seven fields: its text, a layout label (text, table, formula, caption, footnote, list_item, page_footer), provenance (page number, bounding box with a bottom-left origin, character span), a parent reference such as "#/body", a list of children, the headings it sits under, and the source-file origin (MIME type and filename). A condensed view of the items, showing each item's text with its label, page, and enclosing heading:

- Author block of the paper, reconstructed from the flattened table cells:

  |                 | Karthik Narasimhan  | Tim Salimans   | Ilya Sutskever    |
  |-----------------|---------------------|----------------|-------------------|
  | OpenAI          | OpenAI              | OpenAI         | OpenAI            |
  | alec@openai.com | karthikn@openai.com | tim@openai.com | ilyasu@openai.com |

  (table, page 1, under "Improving Language Understanding by Generative Pre-Training")
- "Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI)."
- "text"
- {} 3 keys▶
- 1
- {} 5 keys▶
- 142.96617126464844
- 523.6973876953125
- 469.20623779296875
- 350.1388854980469
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 1287
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "Abstract"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pretrained word embeddings [10 , 39 , 42] to improve performance on a range of NLP tasks [8 , 11 , 26 , 45]."
- "text"
- {} 3 keys▶
- 1
- {} 5 keys▶
- 107.04915618896484
- 300.989013671875
- 505.1749267578125
- 203.70025634765625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 883
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "1 Introduction"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling [44], machine translation [38], and discourse coherence [22], with each method outperforming the others on different tasks. 1 Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture [43 , 44], using intricate learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing."
- "text"
- {} 3 keys▶
- 1
- {} 5 keys▶
- 107.21430969238281
- 197.66314697265625
- 504.34869384765625
- 99.7039794921875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 890
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "1 Introduction"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "1 https://gluebenchmark.com/leaderboard"
- "footnote"
- {} 3 keys▶
- 1
- {} 5 keys▶
- 120.8305435180664
- 89.83885192871094
- 266.82098388671875
- 80.04328918457031
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 39
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "1 Introduction"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Preprint. Work in progress."
- "page_footer"
- {} 3 keys▶
- 1
- {} 5 keys▶
- 107.6248550415039
- 58.776611328125
- 205.82122802734375
- 49.88275146484375
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 27
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "1 Introduction"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective."
- "text"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 107.0160140991211
- 717.6932983398438
- 504.34759521484375
- 631.1981811523438
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 772
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "1 Introduction"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "For our model architecture, we use the Transformer [62], which has been shown to perform strongly on various tasks such as machine translation [62], document generation [34], and syntactic parsing [29]. This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style approaches [52], which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model."
- "text"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 106.87811279296875
- 624.872314453125
- 505.06256103515625
- 538.3401489257812
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 763
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "1 Introduction"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "We evaluate our approach on four types of language understanding tasks - natural language inference, question answering, semantic similarity, and text classification. Our general task-agnostic model outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40], 5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66] and 5.5% on the recently introduced GLUE multi-task benchmark [64]. We also analyzed zero-shot behaviors of the pre-trained model on four different settings and demonstrate that it acquires useful linguistic knowledge for downstream tasks."
- "text"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 106.88473510742188
- 532.18798828125
- 505.1310119628906
- 434.7469787597656
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 818
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "1 Introduction"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Semi-supervised learning for NLP Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling [24 , 33 , 57] or text classification [41 , 70]. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using word embeddings [11 , 39 , 42], which are trained on unlabeled corpora, to improve performance on a variety of tasks [8 , 11 , 26 , 45]. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics."
- "text"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 106.8495101928711
- 388.781494140625
- 504.69183349609375
- 302.60601806640625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 759
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "2 Related Work"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks [28 , 32, 1, 36, 22, 12, 56, 31]."
- "text"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 107.06312561035156
- 296.31097412109375
- 504.910888671875
- 253.7572021484375
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 327
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "2 Related Work"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Unsupervised pre-training Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification [20 , 49 , 63] and regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification [69], speech recognition [68], entity disambiguation [17] and machine translation [48]."
- "text"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 106.93354034423828
- 237.70379638671875
- 504.1886291503906
- 162.15557861328125
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 670
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "2 Related Work"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longerrange linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection and story completion. Other approaches [43 , 44 , 38] use hidden representations from a"
- "text"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 106.97233581542969
- 156.146484375
- 505.1766357421875
- 69.238525390625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 795
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "2 Related Work"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "2"
- "page_footer"
- {} 3 keys▶
- 2
- {} 5 keys▶
- 302.886962890625
- 49.93804931640625
- 308.2412414550781
- 41.82183837890625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 1
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "2 Related Work"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial amount of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer."
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.56468200683594
- 717.521240234375
- 503.9138488769531
- 685.5468139648438
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 287
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "2 Related Work"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Auxiliary training objectives Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks."
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.20752716064453
- 673.593017578125
- 504.50103759765625
- 597.9339599609375
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 652
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "2 Related Work"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data."
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.37720489501953
- 558.2410278320312
- 503.81585693359375
- 527.5881958007812
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 242
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3 Framework"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Given an unsupervised corpus of tokens U = {u1, . . . , u n }, we use a standard language modeling objective to maximize the following likelihood:"
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.45760345458984
- 493.0939636230469
- 503.8688659667969
- 471.80975341796875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 146
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.1 Unsupervised pre-training"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "L 1 (U) = X i log P(ui|ui - k, . . . , ui - 1 ; Θ) (1)"
- "formula"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 220.6671905517578
- 468.4500732421875
- 503.950439453125
- 446.34423828125
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 54
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.1 Unsupervised pre-training"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "where k is the size of the context window, and the conditional probability P is modeled using a neural network with parameters Θ. These parameters are trained using stochastic gradient descent [51]."
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.33065795898438
- 442.8547668457031
- 503.84930419921875
- 421.6244201660156
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 198
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.1 Unsupervised pre-training"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- "In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:"
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.41326141357422
- 415.41015625
- 504.0208435058594
- 372.8392639160156
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 321
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.1 Unsupervised pre-training"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "h 0 = UWe We + Wp Wp h l = transformer_block(hl - 1 )∀i ∈ [1, n] P(u) = softmax(h n W e T W e ) (2)"
- "formula"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 210.28749084472656
- 371.87255859375
- 504.22918701171875
- 331.2057189941406
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 101
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.1 Unsupervised pre-training"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "where U = (u - k, . . . , u - 1 ) is the context vector of tokens, n is the number of layers, We We is the token embedding matrix, and Wp Wp is the position embedding matrix."
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.15132904052734
- 329.1113586425781
- 504.1338195800781
- 307.3338623046875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 174
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.1 Unsupervised pre-training"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- "After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1, \dots, x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:"
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.25365447998047
- 274.236572265625
- 504.6893615722656
- 220.09532165527344
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 440
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "P(y|x 1 , . . . , x m ) = softmax(h m l Wy Wy ) . (3)"
- "formula"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 227.04054260253906
- 216.83560180664062
- 504.1090087890625
- 204.85162353515625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 53
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "This gives us the following objective to maximize:"
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 106.5847396850586
- 196.8427734375
- 308.4329528808594
- 186.88818359375
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 50
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "L 2 (C) = X (x,y) log P(y|x 1 , . . . , x m ) . (4)"
- "formula"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 233.9741973876953
- 184.01953125
- 504.07891845703125
- 158.7152099609375
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 51
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- "We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50 , 43], who also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight λ):"
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.08598327636719
- 150.09197998046875
- 504.07904052734375
- 106.2120361328125
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 389
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "L 3 (C) = L2(C) + λ ∗ L1(C) (5)"
- "formula"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 248.16424560546875
- 105.101318359375
- 504.110595703125
- 93.23126220703125
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 31
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- "Overall, the only extra parameters we require during fine-tuning are $W_y$, and embeddings for delimiter tokens (described below in Section 3.3)."
- "text"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 107.31987762451172
- 90.14111328125
- 504.24591064453125
- 70.24658203125
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 146
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "3"
- "page_footer"
- {} 3 keys▶
- 3
- {} 5 keys▶
- 302.77484130859375
- 49.4654541015625
- 307.97015380859375
- 41.3525390625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 1
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer."
- "caption"
- {} 3 keys▶
- 4
- {} 5 keys▶
- 107.15206146240234
- 556.10888671875
- 504.034912109375
- 524.68310546875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 282
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks. Previous work proposed learning task specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens (hsi , hei)."
- "text"
- {} 3 keys▶
- 4
- {} 5 keys▶
- 106.88343811035156
- 480.168701171875
- 505.0693359375
- 350.1255798339844
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 1125
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.3 Task-specific input transformations"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Textual entailment For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between."
- "text"
- {} 3 keys▶
- 4
- {} 5 keys▶
- 106.97191619873047
- 335.5577392578125
- 504.1901550292969
- 314.677734375
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 142
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.3 Task-specific input transformations"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Similarity For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations h m l which are added element-wise before being fed into the linear output layer."
- "text"
- {} 3 keys▶
- 4
- {} 5 keys▶
- 106.88548278808594
- 299.5601806640625
- 505.0616455078125
- 256.5992431640625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 372
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.3 Task-specific input transformations"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Question Answering and Commonsense Reasoning For these tasks, we are given a context document z, a question q, and a set of possible answers {ak}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get [z; q; $; ak]. Each of these sequences are processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers."
- "text"
- {} 3 keys▶
- 4
- {} 5 keys▶
- 107.0148696899414
- 242.29168701171875
- 504.3286437988281
- 187.95330810546875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 444
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "3.3 Task-specific input transformations"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- "Unsupervised pre-training We use the BooksCorpus dataset [71] for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information. An alternative dataset, the 1B Word Benchmark, which is used by a similar approach, ELMo [44], is approximately the same size"
- "text"
- {} 3 keys▶
- 4
- {} 5 keys▶
- 107.07925415039062
- 122.84698486328125
- 505.15106201171875
- 69.5682373046875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 477
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.1 Setup"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "4"
- "page_footer"
- {} 3 keys▶
- 4
- {} 5 keys▶
- 302.40863037109375
- 49.57403564453125
- 308.38299560546875
- 41.68377685546875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 1
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.1 Setup"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Table 1: A list of the different tasks and datasets used in our experiments."
- "caption"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 158.66375732421875
- 720.77392578125
- 452.1585693359375
- 710.9884033203125
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 76
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.1 Setup"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "but is shuffled at a sentence level - destroying long-range structure. Our language model achieves a very low token level perplexity of 18.4 on this corpus."
- "text"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 107.23218536376953
- 622.1171264648438
- 504.079345703125
- 600.5718383789062
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 156
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.1 Setup"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Model specifications Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of N(0 , 0 . 02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with w = 0 . 01 on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library 2 to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer. 3"
- "text"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 106.72154998779297
- 588.4483642578125
- 505.4134826660156
- 435.9697265625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 1322
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.1 Setup"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- "Fine-tuning details Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batch size of 32. Our model fine-tunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5."
- "text"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 106.96757507324219
- 422.9845275878906
- 504.1258850097656
- 371.09295654296875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 414
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.1 Setup"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- "We perform experiments on a variety of supervised tasks including natural language inference, question answering, semantic similarity, and text classification. Some of these tasks are available as part of the recently released GLUE multi-task benchmark [64], which we make use of. Figure 1 provides an overview of all the tasks and datasets."
- "text"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 107.04237365722656
- 334.7701416015625
- 504.9407043457031
- 292.2330627441406
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 341
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Natural Language Inference The task of natural language inference (NLI), also known as recognizing textual entailment, involves reading a pair of sentences and judging the relationship between them from one of entailment , contradiction or neutral. Although there has been a lot of recent interest [58 , 35 , 44], the task remains challenging due to the presence of a wide variety of phenomena like lexical entailment, coreference, and lexical and syntactic ambiguity. We evaluate on five datasets with diverse sources, including image captions (SNLI), transcribed speech, popular fiction, and government reports (MNLI), Wikipedia articles (QNLI), science exams (SciTail) or news articles (RTE)."
- "text"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 107.13055419921875
- 278.97161865234375
- 505.1729736328125
- 193.03958129882812
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 695
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "Table 2 details various results on the different NLI tasks for our model and previous state-of-the-art approaches. Our method significantly outperforms the baselines on four of the five datasets, achieving absolute improvements of upto 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI and 0.6% on SNLI over the previous best results. This demonstrates our model's ability to better reason over multiple sentences, and handle aspects of linguistic ambiguity. On RTE, one of the smaller datasets we evaluate on (2490 examples), we achieve an accuracy of 56%, which is below the 61.7% reported by a multi-task biLSTM model. Given the strong performance of our approach on larger NLI datasets, it is likely our model will benefit from multi-task training as well but we have not explored this currently."
- "text"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 106.94844055175781
- 186.24542236328125
- 505.0565490722656
- 99.22906494140625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 792
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "2 https://ftfy.readthedocs.io/en/latest/"
- "footnote"
- {} 3 keys▶
- 5
- {} 5 keys▶
- 119.82808685302734
- 90.70085144042969
- 302.5846252441406
- 80.5389404296875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 40
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "4.2 Supervised fine-tuning"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "10"
- "page_footer"
- {} 3 keys▶
- 10
- {} 5 keys▶
- 301.150146484375
- 49.35443115234375
- 311.2596740722656
- 41.3822021484375
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 2
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "References"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "11"
- "page_footer"
- {} 3 keys▶
- 11
- {} 5 keys▶
- 301.0162658691406
- 49.4591064453125
- 310.46221923828125
- 41.6802978515625
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 2
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "References"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "[68] D. Yu, L. Deng, and G. Dahl. Roles of pre-training and fine-tuning in context-dependent dbn-hmms for real-world speech recognition. In Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010. [69] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, volume 1, page 6, 2017. [70] X. Zhu. Semi-supervised learning literature survey. 2005. [71] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19-27, 2015."
- "list_item"
- {} 3 keys▶
- 12
- {} 5 keys▶
- 107.51638793945312
- 716.7481689453125
- 504.4619140625
- 688.2738647460938
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 217
- {} 1 key▶
- "#/groups/0"
- [] 0 items
- [] 1 item▶
- "References"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"
- {} 7 keys▶
- "12"
- "page_footer"
- {} 3 keys▶
- 12
- {} 5 keys▶
- 301.0538024902344
- 49.50811767578125
- 311.0802307128906
- 41.559326171875
- "BOTTOMLEFT"
- [] 2 items▶
- 0
- 2
- {} 1 key▶
- "#/body"
- [] 0 items
- [] 1 item▶
- "References"
- {} 2 keys▶
- "application/pdf"
- "gpt_radford.pdf"